
Train Custom GPT & Diffusion Models with MosaicML

The MosaicML platform is designed to tackle the challenges of training large models such as ChatGPT, LaMDA, and Stable Diffusion. This post breaks down the difficulties of training such models and shows how our platform makes large-scale AI training easier.

Large AI models like ChatGPT, LaMDA, and Stable Diffusion have sparked the imagination of millions and offer new opportunities for both startups and established enterprises.

However, training these models has been too complex and expensive for many organizations, requiring specialized expertise and tooling. As a result, only a few companies have had the capability to build these models. We built the MosaicML platform to make large scale model training more accessible. Now, organizations of all sizes can train their own industry-specific models with complete model ownership and data privacy. 

Challenges in Large AI Model Training

Figure 1: Infrastructure challenges such as a lack of GPU availability, software stack complexity, scaling across hundreds of GPUs, and fault tolerance make training large models difficult.

What makes training large-scale models so challenging?

1. GPU Availability

Training large models requires a large number of advanced GPUs. As an example, Meta's 175-billion-parameter OPT-175B was trained for 33 days on 1024 NVIDIA A100 GPUs. Similarly, Google's 137-billion-parameter LaMDA was trained on 1024 TPUs for 57 days.
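To put those figures in perspective, a quick back-of-the-envelope calculation in GPU-days (the dollar figure assumes an illustrative on-demand A100 price of roughly $3 per GPU-hour; actual prices vary by provider):

```python
# Rough scale of the training runs cited above.
opt_gpu_days = 1024 * 33       # OPT-175B: 1024 A100s for 33 days -> 33,792 GPU-days
lamda_device_days = 1024 * 57  # LaMDA: 1024 TPUs for 57 days -> 58,368 device-days

# Hypothetical on-demand price of ~$3/GPU-hour:
opt_cost_usd = opt_gpu_days * 24 * 3  # 2,433,024 -> roughly $2.4M

print(opt_gpu_days, lamda_device_days, opt_cost_usd)
```

Even before counting failed restarts and debugging time, a single full run at this scale is a multi-million-dollar commitment.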

GPUs can be hard to get—good luck getting an on-demand A100 instance these days! Once you commit to a cloud provider and finally get access to the GPUs you need, proprietary tooling such as AWS SageMaker Model Parallel Library makes it difficult to migrate your model training to another cloud provider.

2. Stack Complexity

Once a GPU cluster is in place, you need a training stack to orchestrate your distributed training job across thousands of GPUs. Here’s a simplified view of a typical training stack (layers that are not specific to ML training are omitted for simplicity).

Figure 2: A simplified view of a typical machine learning training stack.

Curating and configuring the components for each layer of this stack is tricky and error-prone. A single mistake, such as misconfiguring an InfiniBand driver, can lead to a 5x slowdown in training speed!

3. Scaling Out

Training large models with billions of parameters requires distributing the process across hundreds of GPUs so that models fit in memory and training completes within a reasonable amount of time.

Distributing training across so many GPUs is a challenging task that involves picking the right distribution strategy (data parallel? model parallel? both?), selecting and integrating a library that implements the chosen strategy, adjusting hyperparameters such as the global batch size, and ensuring training does not crash due to CUDA out-of-memory or other errors. Distributed training is so brittle that a single wrong choice can cause the run to fail or produce a useless model.
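The global batch size is one example of how these knobs interact: it is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps, so changing any one of them silently changes the effective batch the optimizer sees. A minimal sketch (all numbers illustrative):

```python
def global_batch_size(per_device_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Effective batch size seen by the optimizer on each step."""
    return per_device_batch * num_gpus * grad_accum_steps

# Keeping a global batch of 2048 while scaling from 8 to 256 GPUs means
# shrinking the per-device batch and/or the accumulation steps to match:
print(global_batch_size(per_device_batch=16, num_gpus=8, grad_accum_steps=16))  # 2048
print(global_batch_size(per_device_batch=8, num_gpus=256, grad_accum_steps=1))  # 2048
```

Get this bookkeeping wrong when scaling out and the run may still "work" while converging to a noticeably worse model.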

4. Fault Tolerance

Deploying a software stack and configuring distributed training is just the beginning. Running such a complex software stack on massive compute infrastructure involves many operational challenges in dealing with failures at all levels of the stack. 

In Meta AI’s OPT-175B training log, they described over 100 restarts during the training process. Their log sheds light on the operational challenges encountered throughout the training process:

2021-5-12… "It took 50 minutes to resume training from checkpoint_15_45000!"
2021-11-18… "Unable to train continuously for more than 1-2 days ... Many failures require manual detection and remediation, wasting compute resources and researcher time."

The Tsinghua KEG team, which trained the large language model GLM-130B, shared similar experiences:

2022.1… Frequent random hardware failures…
2022.3… Couldn't launch more than 2,000 compute nodes => overcame this and supported 6,000-node training by tuning Linux kernel TCP parameters...

As shown above (and as we’ve learned on our own), training large-scale models is highly prone to errors such as GPU or network failures, bugs and misconfigurations in the software stack, and loss spikes that affect convergence. These errors cause significant delays and consume precious time and money to debug and troubleshoot.

MosaicML Platform: Designed for Large AI Model Training

To solve these challenges, we built the MosaicML platform, which is already used by organizations such as Stanford CRFM and StabilityAI to train LLMs on their own data within their secure environments.

1. A Training Stack that Just Works

We designed the MosaicML platform from the ground up to address problems across all parts of the training stack: from drivers and toolkits all the way up to job orchestration and distributed training.

Figure 3: The MosaicML platform addresses all layers of the training stack.

Our platform provides the infrastructure needed to deploy and orchestrate training jobs across any number of GPUs while also handling failure detection and automatic recovery with fast resumption. It also includes a runtime that provides a performance-tested Docker image configured for the latest GPU drivers, a distributed training framework (Composer), and a streaming data loader (StreamingDataset) that works with a variety of data formats and cloud storage providers.

We continually update and test the MosaicML platform to ensure all parts of the stack are optimized and work seamlessly together.

2. Multi-Cloud Training

The MosaicML platform frees customers from cloud vendor lock-in by making it possible to run training jobs across any major cloud provider with no code changes. We enable customers to train models within their own cloud environments so that we never see their training data. 

To enable seamless usage and deployment across multiple clouds, we designed our platform to have a three-part architecture: the Client Interfaces, the Control Plane, and the Compute Plane.

Figure 4: The MosaicML platform has three parts: the client interfaces, the control plane, and the compute plane.

Client Interfaces:

  • Includes a Python API and a command line interface for launching and managing training jobs
  • Provides a web console to manage access, users, and teams, as well as handle usage accounting, quota management, and billing

Control Plane:

  • Hosts MosaicML multi-node orchestration, failure detection, and resumption logic
  • Orchestrates training across clusters that are either on-prem or in cloud providers such as AWS, OCI, GCP, Azure, and CoreWeave
  • Manages application logic and metadata such as run configurations, logs, etc.

Compute Plane:

  • Powers distributed training
  • Is cloud provider agnostic: can be deployed to any Kubernetes cluster, including on-prem
  • Maintains data privacy: customer training data never leaves their VPC (Virtual Private Cloud)
  • Is stateless: no need to sync datasets and code between different clusters. Datasets are streamed in, while user code, Docker images, and other objects are loaded dynamically at runtime

With the MosaicML platform, training across different cloud providers is as easy as changing a single parameter in your run submission. Here’s an example using MosaicML CLI (MCLI):

Figure 5: Change cloud providers with a single parameter in your run submission.
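Figure 5 illustrates this with an MCLI run configuration. A minimal sketch of such a YAML is shown below; the `gpus` and `cluster` field names mirror the SDK attributes used later in this post, while the other values are placeholders:

```yaml
name: my-1b-training-run
image: mosaicml/pytorch:latest   # placeholder image name
gpus: 32
cluster: aws-us-east-1           # switch clouds by editing this one line
command: |
  composer train.py
```

Resubmitting the same file with, say, a GCP cluster name in the `cluster` field moves the run to a different provider without touching the training code.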

3. Seamless Scaling of Model Sizes and Compute

Scaling training across both multiple nodes and larger model sizes is more than just spawning a bunch of jobs. To make scaling "just work," we’ve solved a range of challenges throughout the entire stack, including orchestration, optimized parallelism configurations, and dataloader determinism. We enable our customers to run experiments on a few GPUs and then scale up to hundreds of GPUs for a full training "hero run" with no code changes required!

Figure 6: Scale from 1B to 30B parameters with our optimized configurations.

We built the MosaicML orchestration stack on Kubernetes because it enables us to be multi-cloud and is well-suited to managing workloads across large amounts of compute. However, Kubernetes lacks features needed for LLM training, such as gang-scheduling (placing all of a job's workers across nodes at once) and scheduling across clusters. To solve this, we implemented our own scheduler to optimize orchestration and placement for large-scale training jobs.
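To see why gang-scheduling matters: a multi-node job should start only when all of its nodes can be placed at once; otherwise a partially placed job wastes the GPUs it holds while waiting for the rest. A toy illustration of the policy (not MosaicML's actual scheduler):

```python
def gang_schedule(jobs, free_nodes):
    """Place each job only if ALL of its requested nodes are available
    at once; otherwise it waits rather than holding a partial allocation."""
    placed, waiting = [], []
    for name, nodes_needed in jobs:
        if nodes_needed <= free_nodes:
            free_nodes -= nodes_needed
            placed.append(name)
        else:
            waiting.append(name)
    return placed, waiting, free_nodes

placed, waiting, free = gang_schedule([("llm-30b", 16), ("llm-7b", 8)], free_nodes=20)
print(placed)   # ['llm-30b']: 16 nodes fit in the 20 available
print(waiting)  # ['llm-7b']: only 4 nodes remain, so it waits for a full gang
```

A naive scheduler would instead start the second job on 4 of its 8 nodes, leaving those GPUs idle while the job blocks on the missing workers.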

After MosaicML orchestration schedules a job, the MosaicML Training Runtime:

  • Handles job distribution across multiple nodes by configuring optimized networking and setting up multiprocessing
  • Automatically figures out how to fit both the model and data batch into GPU memory with automatic gradient accumulation—no more fiddling with batch size to avoid CUDA OOM errors
  • Handles streaming training data at scale with our StreamingDataset library - just point to your cloud storage bucket and start training

Figure 7: Auto-gradient accumulation figures out how much of the data batch can fit in memory at once.
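The idea behind automatic gradient accumulation can be sketched as a retry loop: on a CUDA out-of-memory error, split the batch into more microbatches and try again. This is a simplification of what Composer actually does during training; the `fake_step` function below is a stand-in for a real training step:

```python
def train_batch_with_auto_accum(train_step, batch, max_accum=16):
    """Retry the training step with more gradient-accumulation steps on OOM."""
    accum = 1
    while accum <= max_accum:
        try:
            return train_step(batch, accum)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # only retry on OOM; surface other errors
            accum *= 2  # halve the microbatch size and retry

    raise RuntimeError("batch does not fit even at max accumulation")

# Simulated step: pretend the GPU only fits microbatches of <= 8 samples.
def fake_step(batch, accum):
    if len(batch) / accum > 8:
        raise RuntimeError("CUDA out of memory")
    return accum

print(train_batch_with_auto_accum(fake_step, batch=list(range(32))))  # 4
```

With a 32-sample batch, the loop fails at 1 and 2 accumulation steps and succeeds at 4, i.e. four microbatches of 8 samples each, with no manual batch-size tuning.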

4. Operational Monitoring: Automatic Failure Detection & Fast Recovery

Training large models is error-prone, regularly incurring failures ranging from hardware crashes to NCCL timeouts to loss spikes. We run extensive monitoring across the stack to quickly detect issues and use automatic graceful resumption to restart training from the latest checkpoint, freeing users from having to babysit and manually rerun failed jobs.

Restarting a job involves several time-consuming (and money-wasting!) steps such as re-pulling the Docker image, downloading the latest checkpoint, and replaying the data loader. To minimize GPU idle time and save customers money when recovering from a failure, we employ several techniques to speed up job resumption, such as Docker image prefetching and caching, and data loader fast-forwarding.
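Data loader fast-forwarding, for example, means skipping the samples consumed before the checkpoint instead of replaying them through the model. A simplified sketch of what a resumable loader does:

```python
def resume_dataloader(dataset, batch_size, batches_already_trained):
    """Yield batches starting right after the last checkpointed batch,
    skipping consumed samples by index instead of re-iterating them."""
    start = batches_already_trained * batch_size  # fast-forward position
    for i in range(start, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

data = list(range(100))
# The checkpoint records that 20 batches of size 4 were already trained:
resumed = list(resume_dataloader(data, batch_size=4, batches_already_trained=20))
print(resumed[0])  # [80, 81, 82, 83]: training resumes at sample 80
```

For deterministic shuffled loaders the same idea applies to the shuffled sample order; the key point is that resumption costs an index computation, not a replay of tens of thousands of batches.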

The MosaicML platform automatically takes care of all the operational aspects of LLM training so our customers can focus on training their models.

Figure 8: Automated resumption of training after hardware failures or loss spikes.

Getting Started with MosaicML

With the MosaicML platform, training a large language model is as easy as pointing our stack at an S3 bucket and launching a training run.

1. To get started, install our CLI and SDK.

$ pip install mosaicml-cli
Collecting mosaicml-cli
Successfully installed mosaicml-cli-0.2.36

2. Once the package is installed locally, get your API key from the MosaicML Console, and then use the Python SDK to set it in your local environment.

import mcli.sdk as msdk

# Set your API key locally; the call name follows the MosaicML docs and
# the key value is a placeholder:
msdk.set_api_key('YOUR_API_KEY')

3. Next, you’ll want to configure the integrations needed for training. These include GitHub to pull code into the training cluster; object stores such as AWS S3 for training data, access and checkpoint storing; container registries such as Docker Hub to pull container images; and experiment tracking tools such as CometML or Weights & Biases. Learn more about supported integrations and how to set them up in our documentation.

Figure 9: We integrate with all of your favorite tools.

4. With all your configurations added, it’s time to launch a training job using either MCLI or our Python SDK. The code snippet below uses the Python SDK and this example configuration.

from mcli.api.runs import RunConfig, create_run

# Load your run configuration
# See https://github.com/mosaicml/examples for examples
run_config = RunConfig.from_file('mcli-1b.yaml')

# Configure the number of GPUs and cluster to train on
run_config.gpus = 32
run_config.cluster = 'aws-us-east-1'

# Launch!
created_run = create_run(run_config)

And there you have it: the power to train a large language or diffusion model on your own data is at your fingertips. To learn more about training with MosaicML, check out this demo video and documentation. Ready to get started? Contact us for a demo of the MosaicML platform.
