MosaicML Cloud Delivers Leading NLP Performance in MLPerf v2.1

MosaicML leads the MLPerf NLP results, delivering a score of 7.9 minutes on 8x NVIDIA A100 GPUs in the Open Division, thanks to algorithmic and systems optimizations delivered through the MosaicML Cloud.
Figure 1: MosaicML demonstrates industry-leading NLP training performance on 8x NVIDIA A100s, the most popular ML training accelerator broadly available today. Our Open Division result uses software optimizations and tuning for improved data efficiency and performance.

A common perception in machine learning is that training NLP models from scratch is too hard and too expensive. Instead, most companies fine-tune existing models that were pre-trained on generic text data, such as all of Wikipedia.

We’re on a mission to change that narrative so that anyone can train NLP models from scratch on domain-specific data that’s relevant to their industry. Our ultimate goal: lower the cost and time of training, and give companies the ability to unlock new capabilities and better differentiate their products.

That’s why we’re excited to announce that in today’s MLPerf results, MosaicML achieved leading NLP performance. We accelerated the training of a Hugging Face BERT model by 2.7x¹ with our software and algorithmic optimizations, which can be automatically enabled in our MosaicML Cloud.


Our submission uses our open source library Composer, built on top of PyTorch, to easily plug in and test different optimizations. We submitted two configurations to the NLP benchmark in the Open division:

- Baseline: We used the popular BERT-Large model from Hugging Face, and set a strong baseline by applying mixed precision and tuning the hyperparameters. Notably, the Adam optimizer converged much faster than the LAMB optimizer mandated in the Closed division.

- Optimized: To speed up training, we added several optimizations to the Baseline: Unpadding, Flash Attention, FusedLayerNorm, FusedDropoutAddLayerNorm, and FusedDenseGeLuDense.
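
Of the optimizations listed above, unpadding is the easiest to illustrate in isolation. The sketch below is a simplified, pure-Python illustration of the general idea, not the implementation used in the submission: padding tokens are dropped, the real tokens are packed into one flat sequence, and cumulative sequence lengths record where each example starts and ends. The names `unpad_batch`, `repad_batch`, and `PAD_ID` are hypothetical, and the sketch assumes the padding id never appears as a real token.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding token id for this illustration


def unpad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Pack the non-padding tokens of a batch into one flat list.

    Returns (tokens, cu_seqlens), where sequence i occupies
    tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    tokens: List[int] = []
    cu_seqlens = [0]
    for row in batch:
        # Keep only real tokens; assumes pad_id never occurs inside a sequence.
        tokens.extend(t for t in row if t != pad_id)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens


def repad_batch(tokens: List[int], cu_seqlens: List[int], max_len: int,
                pad_id: int = PAD_ID) -> List[List[int]]:
    """Inverse of unpad_batch: restore a right-padded rectangular batch."""
    rows = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        seq = tokens[start:end]
        rows.append(seq + [pad_id] * (max_len - len(seq)))
    return rows


# A toy batch of three right-padded sequences (token ids are arbitrary).
batch = [
    [101, 7592, 102, 0, 0, 0],
    [101, 2023, 2003, 1037, 102, 0],
    [101, 102, 0, 0, 0, 0],
]
tokens, cu_seqlens = unpad_batch(batch)
padded_positions = sum(len(row) for row in batch)
print(f"{len(tokens)} of {padded_positions} token positions carry real tokens")
```

In this toy batch, almost half of the positions are padding, so a model that runs only on the packed tokens does correspondingly less work in its per-token layers. The actual saving in practice depends on the length distribution of the training data.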

Our system and algorithmic optimizations achieve a 2.7x speed-up on training the NLP benchmark compared to a Hugging Face baseline¹ (Figure 2). Importantly, both the baseline and the optimized submissions used the same hyperparameters, for a fair comparison.

Figure 2: Our optimizations improved upon the commonly used Hugging Face baseline, reaching a score² that is 2.7x faster for the NLP Benchmark.

Leading NLP Performance

In comparison, NVIDIA’s submission on similar hardware in the Closed division reaches the same accuracy in 16.9 minutes. Our Open division submission is 2.1x faster (Figure 3).

Figure 3: With our software optimizations and hyperparameter tuning, MosaicML’s Open Division result is 2.1x faster than NVIDIA’s Closed Division result on similar hardware. We used our speed-up methods and a different optimizer.

However, there are two important differences between our submission and NVIDIA's:

  • Ease of use over speed. Our codebase sacrifices some raw speed in favor of code that is easier to use and modify: it is built on a generic Trainer instead of a bespoke training loop with custom optimizations. On a samples-per-second basis, our submission is 1.7x slower than NVIDIA's submission.
  • Better data efficiency. We found that with a different optimizer (Decoupled Adam), which is permitted in the Open division and widely used by our customers, our training takes 3.5x fewer samples to reach the same accuracy.
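
These two effects roughly account for the overall result. As a back-of-the-envelope check, the sketch below assumes that time to train is simply samples processed divided by throughput, ignoring evaluation and startup overhead, and compares the implied speed-up with the one computed from the reported times. All figures are the ones quoted in this post; the variable names are illustrative.

```python
# Figures quoted in this post.
mosaicml_minutes = 7.9     # our Open division result
nvidia_minutes = 16.9      # NVIDIA's Closed division result on similar hardware
throughput_penalty = 1.7   # our samples-per-second rate is 1.7x lower
sample_reduction = 3.5     # we need 3.5x fewer samples to reach the target

# Speed-up computed directly from the reported times.
measured_speedup = nvidia_minutes / mosaicml_minutes

# Speed-up implied by the two differences, assuming
# time = samples / throughput (a simplification).
estimated_speedup = sample_reduction / throughput_penalty

print(f"measured:  {measured_speedup:.2f}x")
print(f"estimated: {estimated_speedup:.2f}x")
```

Both values round to 2.1x, which suggests the improved data efficiency more than offsets the lower raw throughput.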

To sum up: our submission achieved leading NLP performance with a customizable codebase that is easy to use with your own data.

MLPerf-Level Speed with MosaicML Cloud

Knowing which optimization methods to use with which models, and when and how to apply them, is difficult, and the right answer can change with your hardware and system configuration. To save our customers valuable time, we built the MosaicML Cloud, an optimized infrastructure and software stack for training deep learning models, and used it for our MLPerf submission.

Enabling MLPerf-level speed on our cloud is easy: our automatic optimization feature profiles your code and applies the right speed-up recipes, all with a single command-line flag ("-o1"):

mcli run -f bert.yaml -o1

Our customers want to focus on modeling, and not spend time dealing with performance optimization. Leave that to our automated software built by our team of efficiency experts and researchers.

Ready to build the best models in the shortest time at the lowest cost? Get started by signing up for a MosaicML Cloud demo.

1 Official MLPerf v2.1 Training results. MosaicML HuggingFace Baseline: 2.1-2095; MosaicML HuggingFace Baseline + Speedup Methods: 2.1-2096.

2 The MLPerf NLP score measures the time to train BERT-Large to 0.72 masked LM accuracy on the Wikipedia dataset.
