5 Best Practices for Efficient Model Training


Train your CNN a lot faster by using Composer (our library for efficient training) and the MosaicML Docker image, and by enabling channels last via --algorithms channels_last.


MosaicML’s goal is to make neural network training more efficient through better algorithms and software. We do empirical research to understand the training process and develop methods that reduce the time or cost of training. We have already achieved speedups and cost reductions on standard benchmarks of 3.4x for vision and 1.6x for NLP. Our goal is to achieve 4x speedups and cost reductions every year.

In the course of our research and product development, we’ve codified a number of best practices for efficient training, and we’d like to share some of them with you here. Most of the best practices we discuss here are specific to training CNNs, but stay tuned for blog posts on more general and NLP-specific best practices. Some of the best practices in this post might be familiar to seasoned ML practitioners, but you’d be surprised at how many well-known papers and popular software libraries don’t make use of them. Whenever possible, we provide details or links on how to implement these best practices. But we’ve also saved you the work of implementing them by bringing them all together in one place: Composer, our library for efficient training.

Without further ado, let's get started!

1. No Mixed Opinions on Mixed Precision Training

Most deep learning frameworks use 32-bit floating point (FP32, or “full precision”) for all values—model weights, activations, gradients, and optimizer state—and arithmetic. It’s possible to speed up training by reducing the precision of all values and operations to 16-bit floating point (FP16, or “half precision”), but this can seriously compromise model quality. However, reducing the precision of only some values and operations to FP16 can speed up training without any reduction in model quality, leading to substantial efficiency gains. This mix of FP32 and FP16 is referred to as Mixed Precision training.

An easy way to perform Mixed Precision training is with NVIDIA’s AMP (Automatic Mixed Precision) library—which is integrated with PyTorch. AMP works by storing a primary copy of the model weights in FP32, then making an FP16 copy of the model and performing the full forward- and back-propagation in FP16. The weight gradients are then converted to FP32 before the primary model weights are updated. This final detail—doing the weight update in FP32—is critical for good results.
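
To see why the FP32 weight update matters, here is a minimal pure-Python sketch. It is not AMP itself: it only emulates FP16 and FP32 rounding with the standard struct module, and the learning rate, gradient, and step count are hypothetical values chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision (IEEE 754 binary16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest single-precision (IEEE 754 binary32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


LR = 1e-3     # hypothetical learning rate
GRAD = 0.1    # hypothetical constant gradient
STEPS = 1000  # hypothetical number of weight updates

grad_fp16 = to_fp16(GRAD)  # gradient as an FP16 backward pass would produce it

# Variant A: the weight is stored and updated entirely in FP16.
w_fp16 = to_fp16(1.0)
for _ in range(STEPS):
    w_fp16 = to_fp16(w_fp16 - to_fp16(LR * grad_fp16))

# Variant B: an FP32 primary copy of the weight; the FP16 gradient is
# converted to FP32 and the updated weight is rounded to FP32.
w_fp32 = to_fp32(1.0)
for _ in range(STEPS):
    w_fp32 = to_fp32(w_fp32 - LR * to_fp32(grad_fp16))

print(f"FP16-only weight after {STEPS} steps: {w_fp16}")
print(f"FP32 primary weight after {STEPS} steps: {w_fp32:.4f}")
```

In this toy setup the FP16-only weight never leaves 1.0, because each update of roughly 1e-4 is smaller than half the gap between adjacent FP16 values just below 1.0 (the gap is about 4.9e-4), so every update rounds away. The FP32 primary copy accumulates the same updates and ends near 0.9.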

Model quality is unaffected by mixed precision training, and it yields substantial benefits: training can be accelerated by 2x-4.5x, and memory requirements can be reduced by nearly 2x relative to FP32 training. The increased training speed comes from the fact that the cost of multiplying two numbers is approximately proportional to the square of the number of bits, so halving the bit width makes each multiplication substantially cheaper. The memory reduction comes from FP16 numbers being half the size of FP32 numbers. These benefits make mixed precision training especially valuable for training larger models and training models faster.

AMP is toggled by a single hyperparameter (precision: amp in a config file or --precision amp on the command line) and enabled by default in Composer.

2. Channels Last for Training CNNs Fast

Vision transformers might be the hot new thing in computer vision research, but convolutional neural networks (CNNs) are still very much alive and kicking, so there’s good reason to care about training them more efficiently.

Convolutions typically operate on four-dimensional tensors, where the dimensions correspond to a batch of N samples (e.g., images), C channels (e.g., red, green, and blue color channels in the input), and H x W feature maps (e.g., the individual pixels of these channels). For example, a batch of 128 RGB images that are 224 pixels high and 256 pixels wide would be represented as a 128 (images) x 3 (RGB) x 224 (pixel height) x 256 (pixel width) tensor. Most deep learning libraries store tensors in memory in NCHW format, that is, with width as the dimension that is stored contiguously in memory. But modern NVIDIA GPUs, which are the most common hardware for training deep learning models, perform convolutions in NHWC format (i.e., with channels last). This means that NCHW tensors need to be transposed to NHWC before each convolution. Storing tensors as NHWC removes the need for this transposition and saves time and compute.
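
The difference between the two layouts can be made concrete by computing memory strides, that is, how many elements apart two neighbors along each dimension sit in memory. The sketch below is plain Python with a hypothetical helper named strides; it illustrates the layouts only and is not how any framework implements them.

```python
def strides(shape: dict, order: str) -> dict:
    """Element strides for a dense tensor whose dimensions are laid out
    in `order`, listed from outermost to innermost (contiguous)."""
    result = {}
    step = 1
    for name in reversed(order):
        result[name] = step   # distance in elements between neighbors along `name`
        step *= shape[name]
    return result


# A batch of 128 RGB images, 224 pixels high and 256 pixels wide.
shape = {"N": 128, "C": 3, "H": 224, "W": 256}

nchw = strides(shape, "NCHW")  # width is contiguous
nhwc = strides(shape, "NHWC")  # channels are contiguous ("channels last")

for name in "NCHW":
    print(f"{name}: NCHW stride = {nchw[name]:>6}, NHWC stride = {nhwc[name]:>6}")
```

In NCHW, the three color values of a single pixel sit 224 x 256 = 57,344 elements apart, whereas in NHWC they are adjacent. In PyTorch, the conversion is typically requested with tensor.to(memory_format=torch.channels_last), applied to both the model and its inputs.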

NVIDIA reports 1.15x to 1.6x performance improvement (TFLOPS) when using channels-last, and this PyTorch tutorial reports a 1.22x throughput (images/second) improvement. Our in-house benchmarks show that channels last can improve throughput by 1.2x-1.3x, depending on the hardware. This makes channels last a free lunch for anyone training a model with convolutional layers on a Volta architecture (V100) or newer NVIDIA GPU.

Channels last is one of the many speedup methods included in Composer, where it can be enabled via a single hyperparameter (--algorithms channels_last).

3. Mind Your Step: Stepwise Learning Rate Schedules

Learning rate (LR) scheduling—adjusting the learning rate over the course of training—is essential for achieving state-of-the-art results with deep networks. There are lots of different LR schedule patterns to follow (see this handy list from Papers With Code), and each of these schedules can be enacted at different time resolutions. Just as wall-clock time can be measured in different units like minutes, hours, and days, training can be subdivided into individual steps (a single weight update) and epochs (enough steps to pass through the entire dataset). It is common to schedule the learning rate at an epoch resolution, which is to say that adjustments to the learning rate happen only once per epoch.

However, we at MosaicML have found that using a stepwise LR schedule resolution (i.e., changing the learning rate according to the schedule after each step) yields superior performance compared to epochwise resolution. Specifically, we found that stepwise LR schedules can yield 0.2-0.4% accuracy improvements for ResNet-50 trained on ImageNet compared to otherwise-identical models trained with epochwise schedules (see our training recipe here). Because training speed and accuracy are often interchangeable, this accuracy improvement translates into a training speed improvement: a ResNet-50 trained with a stepwise LR schedule reaches a target accuracy about 1.06x faster than an equivalent model trained with epochwise resolution.

The learning rate schedulers in most deep learning frameworks (e.g., PyTorch, TensorFlow) assume epochwise resolution by default, putting the burden on the user to convert between steps and epochs. Composer alleviates this headache by making it possible to describe learning rate schedules at one resolution but execute them at another. For example, a user can describe a learning rate schedule in epochs (e.g., warmup for 5 epochs, cosine decay for 95 epochs), which is often the more intuitive and convenient resolution, but under the hood Composer executes the learning rate schedule in steps for superior performance.
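
The idea of describing a schedule in epochs but evaluating it in steps can be sketched in a few lines of plain Python. This is an illustration under assumed names and values (lr_multiplier, STEPS_PER_EPOCH, a linear warmup shape), not Composer's implementation.

```python
import math

WARMUP_EPOCHS = 5.0     # from the example above: warmup for 5 epochs
TOTAL_EPOCHS = 100.0    # 5 warmup epochs + 95 cosine decay epochs
STEPS_PER_EPOCH = 1000  # hypothetical; depends on dataset and batch size


def lr_multiplier(t_epochs: float) -> float:
    """Schedule described in epochs: linear warmup, then cosine decay.
    Returns a factor in [0, 1] that scales the peak learning rate."""
    if t_epochs < WARMUP_EPOCHS:
        return t_epochs / WARMUP_EPOCHS
    progress = (t_epochs - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def epochwise(step: int) -> float:
    """Learning rate factor that only changes once per epoch."""
    return lr_multiplier(step // STEPS_PER_EPOCH)


def stepwise(step: int) -> float:
    """Same schedule, evaluated at fractional epochs after every step."""
    return lr_multiplier(step / STEPS_PER_EPOCH)


for step in (0, 500, 1000, 1500, 5000, 52500):
    print(f"step {step:>6}: epochwise = {epochwise(step):.4f}, stepwise = {stepwise(step):.4f}")
```

The epochwise variant holds the learning rate constant within each epoch and then jumps, while the stepwise variant follows the same curve smoothly. In this sketch the epochwise factor even stays at zero for the whole first epoch, which is one reason real epochwise implementations often need extra care during warmup.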

Stepwise resolution for LR schedules is toggled by a single hyperparameter (--schedulers.{scheduler_name}.interval step) and enabled by default in Composer.

4. Go Easy on the Image Augmentations

Data augmentation is a technique in which new data samples are created by modifying existing data samples, for example by rotating or changing the contrast of an image. It is a commonly-used technique for improving the performance of deep learning models for computer vision. Data augmentation provides two benefits: it increases the effective number of data samples, and it can allow the model to learn to be invariant to the augmentations. Using image classification as an example, a good image classifier should recognize that an image of a dog should still be classified as a dog even after the image has been translated or flipped left-right.

Data augmentation for images (i.e., image augmentation) is an essential part of most training recipes for reaching state-of-the-art performance in many computer vision tasks. Popular image augmentation schemes include AutoAugment, RandAugment, and AugMix, all of which perform many image augmentations in sequence. In standard deep learning frameworks, image augmentations are performed on the CPU by default. But this can have unintended consequences for training speed.

We at MosaicML conducted a series of experiments to examine the efficiency of image augmentation techniques. We found that image augmentation techniques do indeed increase accuracy on a per-step or per-epoch basis—it takes fewer steps to get to the same performance compared to not using augmentation techniques—but these techniques are often so CPU-intensive that they substantially increase the time per step. The end result was that the wall clock time to train to a given accuracy was actually increased by 1.5x-8x compared to training without image augmentations, depending on the hardware configuration and augmentation scheme. This means you're bottlenecked on something other than your expensive GPU, in effect putting it to waste. Stay tuned for a deep-dive blog post about image augmentation soon!

The bottom line when it comes to the impact of image augmentations on training time is:

  • Use augmentation schemes that are less CPU-intensive, such as TrivialAugment (which is like RandAugment but involves fewer augmentations per image).
  • Use hardware configurations with a higher ratio of CPU to GPU compute power (e.g., many CPUs per GPU, or relatively older/underpowered GPUs).
  • Train larger models. This lowers GPU throughput and gives the CPU more time to keep up.
  • Perform image augmentations on the GPU using a library such as Kornia or DALI.
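
The trade-off behind these recommendations can be illustrated with a back-of-the-envelope model. The sketch below assumes that CPU preprocessing overlaps with GPU compute, so the slower of the two stages sets the time per step. All function names and numbers are hypothetical and chosen only to show the effect; they are not measurements.

```python
def step_time(cpu_s_per_batch: float, num_workers: int, gpu_s_per_batch: float) -> float:
    """Seconds per training step when CPU preprocessing overlaps GPU compute."""
    cpu_stage = cpu_s_per_batch / num_workers  # workers prepare batches in parallel
    return max(cpu_stage, gpu_s_per_batch)     # the slower stage sets the pace


def time_to_accuracy(steps: int, cpu_s_per_batch: float, num_workers: int,
                     gpu_s_per_batch: float) -> float:
    """Wall-clock seconds to reach a target accuracy."""
    return steps * step_time(cpu_s_per_batch, num_workers, gpu_s_per_batch)


# Hypothetical scenario: heavy augmentation needs 20% fewer steps,
# but costs 6x more CPU time per batch.
baseline = time_to_accuracy(steps=100_000, cpu_s_per_batch=0.4, num_workers=8, gpu_s_per_batch=0.1)
heavy_aug = time_to_accuracy(steps=80_000, cpu_s_per_batch=2.4, num_workers=8, gpu_s_per_batch=0.1)
more_cpus = time_to_accuracy(steps=80_000, cpu_s_per_batch=2.4, num_workers=32, gpu_s_per_batch=0.1)

print(f"no augmentation, 8 workers:     {baseline:8.0f} s")
print(f"heavy augmentation, 8 workers:  {heavy_aug:8.0f} s")
print(f"heavy augmentation, 32 workers: {more_cpus:8.0f} s")
```

With these made-up inputs, heavy augmentation takes 2.4x longer in wall-clock time at 8 workers despite needing fewer steps, and it only beats the baseline once enough CPU workers are available to keep the GPU busy.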

5. Soften the Blow of Image Processing with Pillow-SIMD

Deep learning frameworks use third-party libraries for CPU image processing. These libraries handle everything from low-level operations like JPEG decoding to high-level operations like data augmentation. PyTorch and Keras use the Pillow library for this by default, but we were able to speed up training of ResNet-50 on ImageNet by 1.2x-2x (depending on the hardware platform and training recipe) by using Pillow-SIMD, a drop-in replacement for Pillow. Pillow-SIMD accelerates image processing by parallelizing certain operations with SIMD CPU instructions. This in turn can speed up training because CPU image processing often imposes a bottleneck on training, as discussed in the previous section. In scenarios in which CPU image processing does not impose a bottleneck, such as training a large ViT or ResNet-152, we might see no throughput change from switching to Pillow-SIMD.

While it’s quite easy to install Pillow-SIMD, our Composer Docker image comes loaded with it already, giving you one less thing to worry about.


We hope these best practices speed up your model training. Stay tuned for more dispatches from MosaicML on other topics in efficient ML! And as always, we’d love to answer your questions, accept your pull requests, and have you join the team!
