
Supercharge Your Model Training with MosaicML Composer
(Note: Last week, I had the honor of participating in and presenting at the 2022 PyTorch Developer Conference in New Orleans. In my talk, I gave an overview of how easy it is to train deep learning models with MosaicML Composer. In case you missed it, here’s a short recap. Enjoy!)
Today, thanks to the amazing AI community and fantastic tools like PyTorch, AI is everywhere, powering our day-to-day lives in many ways: from personalized recommendations and voice assistants to novel applications such as autonomous vehicle navigation and AI-powered drug discovery.
As AI supports a growing number of use cases, it needs to become even more intelligent. Much of this expansion in capability is achieved through increases in both model complexity and the amount of data used to train these models. The sizes of state-of-the-art (SOTA) language models published over the past four years have been growing at an exponential rate, with no signs of a slowdown.
Unfortunately, this increase in model size and complexity leads to a similar increase in the time, resources, and cost required to train SOTA models. Just as AI becomes capable of impacting the lives of more and more people, rising costs are putting next-gen AI out of reach for many enterprises and research organizations. The result is that a few well-funded organizations now have a significant advantage in developing next-gen AI.

At MosaicML, our mission is to level the playing field and make ML training efficient for everyone. We want advanced AI to be accessible to a broader set of enterprises and organizations, not just a few tech giants.
With that mission in mind, we built Composer, an open-source library built on top of PyTorch that helps users train neural networks better, faster, and cheaper. With Composer, users get access to improved model accuracy and faster model training, reducing the cost and complexity of working with advanced, large-scale models.
Composer is developed by ML developers, for ML developers. Designed for ease of use and scalability, it is loaded with many useful features. In this blog post, I will focus on three key aspects of Composer: the Trainer API, training optimizations, and streaming data loading.

Trainer API
The Trainer automatically implements a PyTorch-based training loop, reducing work for ML developers. It allows easy customization of the training process using 2-way callbacks (inspired by the fastai library's design), and provides simple ways to manage tasks like device allocation, learning rate scheduling, gradient accumulation, and more.
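To give a flavor of 2-way callbacks, here is a minimal sketch of a custom callback; the class name and printed messages are just for illustration:

```python
from composer import Callback

class EpochPrinter(Callback):
    """A toy callback that runs at epoch boundaries."""

    def epoch_start(self, state, logger):
        # The shared State object carries the current timestamp, model, and more
        print(f'Starting epoch {state.timestamp.epoch}')

    def epoch_end(self, state, logger):
        print(f'Finished epoch {state.timestamp.epoch}')
```

A callback like this can be passed to the Trainer via its callbacks argument, and it will be invoked at the matching points in the training loop.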
To use the Trainer API, first create a model class that implements the ComposerModel interface. You can start by just implementing the methods __init__, forward, and loss – and you are good to go! The code example below wraps around a standard ResNet-18 from torchvision.
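Here is a minimal sketch of such a wrapper; the batch format (input/target tuples) and number of classes are assumptions for illustration:

```python
import torch.nn.functional as F
import torchvision
from composer.models import ComposerModel

class ResNet18(ComposerModel):
    """Wraps torchvision's ResNet-18 in the ComposerModel interface."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.model = torchvision.models.resnet18(num_classes=num_classes)

    def forward(self, batch):
        # Composer hands the full batch to forward; we assume (inputs, targets) tuples
        inputs, _ = batch
        return self.model(inputs)

    def loss(self, outputs, batch):
        # Compute the training loss from the forward outputs and the batch targets
        _, targets = batch
        return F.cross_entropy(outputs, targets)
```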
Next, pass the model to the Trainer with the relevant torch objects, and call the fit method. You can then kick back, relax, and allow Composer and your GPUs to do the heavy lifting.
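A minimal sketch, assuming train_loader and eval_loader are standard PyTorch DataLoaders you have already built:

```python
import torch
from composer import Trainer

model = ResNet18(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    eval_dataloader=eval_loader,
    optimizers=optimizer,
    max_duration='10ep',  # train for 10 epochs
    device='gpu',
)
trainer.fit()
```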
Metrics are logged during training, and you can view them in your console or in your favorite metrics-tracking tool such as Weights & Biases or Comet, both supported through Composer's loggers. Once training is done, you can look up the TorchMetrics in the trainer's state object.
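For example, hooking up Weights & Biases takes just a couple of lines; the project name below is a placeholder:

```python
from composer.loggers import WandBLogger

# 'composer-demo' is a hypothetical project name
wandb_logger = WandBLogger(project='composer-demo')

# Pass the logger to the Trainer alongside the other arguments
trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration='10ep',
    loggers=[wandb_logger],
)
```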
Training Optimizations
Composer includes 25 built-in algorithmic optimizations that speed up training across a variety of tasks and use cases, including NLP and CV. These optimizations, though algorithmically complex, can be easily applied and “composed” together by Composer. The Trainer's 2-way callback mechanism enables customization of the training process, while maintaining state in a shared state object that is carried throughout the training run.

To leverage Composer's optimizations, instantiate the optimizations you wish to apply (hint: all of them reside under composer.algorithms). The code example below leverages progressive resizing, blur pool, and label smoothing. These three methods typically deliver good speed-ups for models of the ResNet family.
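Here is a sketch of what that might look like; the hyperparameters shown are defaults or illustrative values, not tuned recommendations:

```python
from composer.algorithms import BlurPool, LabelSmoothing, ProgressiveResizing

algorithms = [
    ProgressiveResizing(),          # start on smaller images and scale up during training
    BlurPool(),                     # replace downsampling layers with anti-aliased versions
    LabelSmoothing(smoothing=0.1),  # soften one-hot targets to regularize training
]
```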
Next, instantiate the Trainer object and pass in the list of algorithms that you instantiated in the previous step. And…that’s it! Call .fit and let Composer handle all the details of applying the specified optimizations during training.
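Continuing the sketch from above, with the same model, loaders, and optimizer as before:

```python
trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    eval_dataloader=eval_loader,
    optimizers=optimizer,
    max_duration='90ep',
    algorithms=algorithms,  # Composer applies these at the right points in the loop
)
trainer.fit()
```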
If you're curious about these optimizations and want to peek under the hood, Composer’s Algorithms documentation includes a detailed description for each optimization.
Streaming Data Loading
Last but not least: streaming data loading. This capability enables Composer users to stream training and evaluation data from the cloud, while reusing training data across multiple training runs and clusters. This greatly simplifies the management of training data and eliminates the time it takes to download the training dataset to the training nodes before training starts, a significant time saver when training on large datasets.
Start by converting the dataset into a streaming format, which allows the streaming engine to index the data. MDSWriter can be used to convert the source dataset into a streaming dataset.
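A minimal sketch of the conversion step might look like this; the column schema and paths are assumptions, and the exact MDSWriter argument names may vary across releases:

```python
from streaming import MDSWriter

# Hypothetical schema: each sample holds a JPEG image and an integer label
columns = {'image': 'jpeg', 'label': 'int'}

with MDSWriter(out='/tmp/my-streaming-dataset', columns=columns, compression='zstd') as writer:
    for image, label in source_dataset:  # source_dataset is your existing dataset
        writer.write({'image': image, 'label': label})
```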
Next, upload the streaming dataset you just created to your favorite cloud storage. All major cloud storage services are supported, including AWS S3 and GCP Cloud Storage, as well as the SFTP protocol.
Once the streaming dataset has been created, instantiate a standard PyTorch DataLoader with an instance of streaming.Dataset, which conveniently extends PyTorch's IterableDataset. streaming.Dataset encapsulates the complexity of streaming the training data from the remote bucket efficiently and securely.
Lastly, provide the DataLoader instance to Composer’s Trainer, call fit, and kick back while training data is streamed from the cloud keeping your GPUs busy!
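Putting the last two steps together, a minimal sketch might look like this; the bucket path and cache directory are placeholders, and the class appears as StreamingDataset in recent releases of the streaming package:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote='s3://my-bucket/my-streaming-dataset',  # placeholder bucket path
    local='/tmp/streaming-cache',                  # local cache directory
    shuffle=True,
)
train_loader = DataLoader(dataset, batch_size=128)

trainer = Trainer(model=model, train_dataloader=train_loader, max_duration='10ep')
trainer.fit()
```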
Composer Benchmarks
Combining Composer's Trainer API, optimizations, and streaming data loading yields impressive results in quality, efficiency, and speed.
Training ResNet-50 with Composer and a set of optimizations delivers a 4.5x training speedup on 8 A100 GPUs. This result, submitted to MLPerf within the Open Division, was faster than Nvidia’s highly optimized training on the same hardware, which was submitted to the Closed Division. To learn more about this MLPerf benchmark, read the blog post MosaicML Satisfies the Need for Speed with MLPerf Results.

On a canonical NLP task, BERT-Large training, Composer and a curated set of Composer optimizations yielded a 2.7x speedup on 8 A100 GPUs. This result, also submitted to MLPerf within the Open Division, was faster than all other BERT-Large submissions on the same hardware – more than twice as fast as the second-best result! To learn more about this MLPerf benchmark, read the blog post MosaicML Cloud Delivers Leading NLP Performance in MLPerf v2.1.

Give Composer a Try!
If you are curious to learn more about Composer, or even want to give it a try, a great place to start is Composer's quick start tutorial. You can also check out Composer on GitHub, take a look at the docs, and join the Composer community on Slack to let us know what you think - we'd love to hear from you!