Leverage Ray from Berkeley for Distributed Training #5

@JonathanChiang

DEMOCRATIZING PRODUCTION-SCALE DISTRIBUTED DEEP LEARNING

https://arxiv.org/pdf/1811.00143.pdf

To address the above challenges, we discuss a system we built at Apple known as Alchemist. Alchemist adopts a cloud-native architecture and is portable among private and public clouds. It supports multiple training frameworks like Tensorflow or PyTorch and multiple distributed training paradigms. The compute cluster is managed by, but not limited to, Kubernetes. We chose a containerized workflow to ensure uniformity and repeatability of the software environment. In the following sections, we refer to engineers, researchers, and data scientists using Alchemist as users.
