
Composer + FFCV: Faster Together
We’ve been working on algorithms to speed up neural network training, composing recipes from methods that range from regularization techniques to systems efficiency optimizations. On some models, we’ve sped up the GPU portion so much that we are now bottlenecked on the dataloader’s CPU-side decoding and augmentations.
Limited by the DataLoader
As a result of this dataloader bottleneck, some methods don’t actually yield a significant speedup. One example is Progressive Image Resizing, which starts training with small images and then increases their height and width as training proceeds. Here’s a baseline run on ImageNet with ResNet-50 trained on a system with 8x A100 GPUs (for more details, see the Experimental Setup section of the Appendix), running at ~16,200 images per second:

We use the schedule below to gradually increase the image size during training, reaching the maximum size of 224 near the end.
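To make the idea of a resizing schedule concrete, here is a minimal sketch of one possible schedule: a linear ramp from half size to full size, followed by a hold at full size. It is an illustration only, not the schedule used in these runs. The function name `image_size_at` and the parameters `initial_scale`, `ramp_fraction`, and `size_increment` (and their default values) are assumptions chosen for the example.

```python
def image_size_at(progress: float,
                  max_size: int = 224,
                  initial_scale: float = 0.5,
                  ramp_fraction: float = 0.8,
                  size_increment: int = 4) -> int:
    """Return the square image side length at a given fraction of training.

    All defaults are hypothetical; they are not taken from the runs above.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    # Ramp the scale linearly from initial_scale to 1.0 over the first
    # ramp_fraction of training, then hold at full size until the end.
    ramp = min(progress / ramp_fraction, 1.0)
    scale = initial_scale + (1.0 - initial_scale) * ramp
    # Snap to a multiple of size_increment to limit the number of distinct shapes.
    size = int(round(scale * max_size / size_increment)) * size_increment
    return min(size, max_size)


if __name__ == "__main__":
    for pct in (0, 25, 50, 80, 100):
        print(f"{pct:3d}% of training -> {image_size_at(pct / 100)} px")
```

With these assumed defaults, training starts at 112 pixels and reaches 224 pixels at 80% of the way through.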

Naively, one would expect the throughput to quadruple to ~65,000 images/sec when the input image size is 112x112 pixels (a quarter of the pixels at the maximum size of 224x224). However, that is not the case, as shown below. We are dataloader bottlenecked: the throughput is pegged at ~17,800 images/sec, even during the initial part of training when the image size is small.
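The naive estimate can be written out as a short calculation. This sketch assumes that GPU time per image scales linearly with pixel count, which is the simplification behind the "quadruple" figure; the helper name `expected_throughput` is ours.

```python
def expected_throughput(baseline_ips: float, full_size: int, size: int) -> float:
    """Ideal images/sec if per-image cost scaled exactly with pixel count."""
    return baseline_ips * (full_size / size) ** 2


# Figures quoted in the text: ~16,200 images/sec at 224x224,
# ~17,800 images/sec observed while training at small image sizes.
ideal = expected_throughput(16_200, full_size=224, size=112)
observed = 17_800

if __name__ == "__main__":
    print(f"ideal at 112x112:    {ideal:,.0f} images/sec")
    print(f"observed:            {observed:,} images/sec")
    print(f"fraction of ideal:   {observed / ideal:.2f}")
```

The gap between the ideal and observed numbers is the throughput left on the table by the dataloader.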

Composer has tens of speedup algorithms that can be composed together, but their benefits cannot be realized once training becomes dataloader bottlenecked.
Enter FFCV
FFCV is a PyTorch-compatible dataloading library that increases throughput for model training. It combines a collection of techniques: compiling the image processing pipeline to native code, intelligent memory allocation, better scheduling of operations in the image processing pipeline, faster JPEG decoding with libturbojpeg, and a new flexible format for storing images that can keep both raw and compressed images together. It packages all of these in an easy-to-use interface. Please refer to the FFCV documentation for more details. Shoutout to the FFCV team for making such a useful, fast, and practical dataloading library.

We integrated FFCV with Composer to alleviate dataloading bottlenecks. Now, using the same setup except with FFCV instead of PyTorch for dataloading, we see a ~1.85x increase to ~30,000 images/sec. Training now completes in 86 minutes instead of 122 minutes.

With FFCV, we suddenly have a much larger exploration space for our speedups, without worrying about dataloader bottlenecks.
Pushing the boundaries
So then, how fast can the FFCV dataloader go? Where is the new images/sec ceiling that we can build methods to push towards? To answer these questions, we benchmarked the FFCV dataloader in isolation (no model!) across different batch sizes. To mimic our 8-GPU setting, we ran 8 processes, each with its own dataloader instance, on a single node.
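A dataloader-only benchmark of this kind can be sketched as a loop that pulls batches and counts images, with no model attached. This is a minimal illustration rather than the harness we used: `time_loader` and `fake_loader` are hypothetical names, the warm-up and batch counts are arbitrary, and the stand-in loader would be replaced by a real FFCV or PyTorch dataloader. It assumes each batch is an (images, labels) pair.

```python
import time


def time_loader(loader, warmup_batches=5, timed_batches=50,
                batch_len=lambda batch: len(batch[0])):
    """Iterate a dataloader with no model attached.

    Returns (images_seen, elapsed_seconds) for the timed portion only.
    """
    batches = iter(loader)
    # Warm-up batches absorb one-time costs such as worker startup.
    for _ in range(warmup_batches):
        if next(batches, None) is None:
            break
    images_seen = 0
    start = time.perf_counter()
    for _ in range(timed_batches):
        batch = next(batches, None)
        if batch is None:  # loader ran out of data early
            break
        images_seen += batch_len(batch)
    elapsed = time.perf_counter() - start
    return images_seen, elapsed


def fake_loader(num_batches, batch_size):
    """Hypothetical stand-in that yields (images, labels) batches."""
    for _ in range(num_batches):
        yield [0] * batch_size, [0] * batch_size


if __name__ == "__main__":
    images, seconds = time_loader(fake_loader(200, 256))
    print(f"{images} images in {seconds:.6f} s")
```

To mimic a multi-GPU node, one copy of this loop would run in each of the 8 processes, and the per-process rates would be summed.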
A few observations:
1. Maximum throughput is ~40,000 images/sec with FFCV. Even with Progressive Image Resizing, we still have a significantly higher ceiling for more speedup methods!
2. For large batch sizes (256 or 512), the FFCV dataloader is ~2x faster than the PyTorch dataloader.
3. For smaller batch sizes, the relative speedup of the FFCV dataloader is lower but still ~1.5x faster.
4. If the CPUs are oversubscribed, i.e., total workers ≥ total cores available, FFCV is hurt more by the oversubscription than the PyTorch dataloader is (see below). With FFCV, be careful to set your worker count to fit your available cores!
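One simple way to avoid oversubscription is to derive the per-process worker count from the core count instead of hard-coding it. The sketch below is an assumption-laden example, not a recommendation from FFCV or Composer: the helper `workers_per_process` is hypothetical, and reserving one core per process for the training process itself is our own choice.

```python
import os
from typing import Optional


def workers_per_process(num_processes: int,
                        total_cores: Optional[int] = None,
                        reserved_per_process: int = 1) -> int:
    """Pick a dataloader worker count that keeps total workers below total cores."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # os.cpu_count() reports logical CPUs and may return None.
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    share = cores // num_processes
    # Leave reserved_per_process cores for the main training process (assumption),
    # but never return fewer than one worker.
    return max(1, share - reserved_per_process)


if __name__ == "__main__":
    print(workers_per_process(num_processes=8, total_cores=64))
```

On a 64-core node with 8 training processes, this gives 7 workers per process, or 56 workers in total, which stays below the core count.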
Full-model Training Runs
Putting it all together, we ran some full training runs. We mimicked the setup of the FFCV results, using techniques such as Blurpool, Label Smoothing, Channels Last, and Progressive Image Resizing.

In the end-to-end setting, training ResNet-50 was ~16% faster using FFCV compared to the vanilla PyTorch dataloader, a significant improvement [1]. As we introduce more speedups that stress the dataloader, we expect the margin to grow even more!
The FFCV dataloader is used in our Mosaic ResNet recipe, which brings together additional speedups to achieve some very impressive results. We are excited to continue exploring speedups, thanks to the great work from the FFCV team.
Try it yourself
We have provided a notebook in the Composer repository on GitHub with a nice walkthrough demonstrating how to use the FFCV dataloader with Composer. We encourage you to try it out, and see how much you can speed up your computer vision ML training! If you have questions, you can file an issue, or join our Slack community to start a conversation with the MosaicML team. If you like Composer, please give it a star on GitHub!
Appendix: Reproduction
To reproduce the results, use a machine with 64 CPU cores and 8x NVIDIA A100-80GB GPUs, and the mosaicml/pytorch_vision:latest Docker image.
Appendix: Experimental Setup
- All results were obtained on a single node with 8x A100-80GB GPUs and 2x 32-core CPUs (AMD EPYC 7513 32-Core Processor) running at 2.6 GHz with SMT disabled.
- Information about other hyperparameters, such as batch size and optimizer, can be found in our open-source configuration file for ResNet-50.
- PyTorch Environment:
[1] A reader familiar with the FFCV results may notice that the training time for Composer + FFCV is longer than for FFCV alone, at slightly lower accuracy. This is because our results do not use test-time augmentations (e.g., a larger image size), and because Composer does progressive resizing on the GPUs, unlike FFCV, which fuses it into the image loading pipeline. We don’t fuse progressive resizing into the image loading pipeline because it would no longer be composable with other speedups. Keeping progressive resizing composable allows us to realize greater overall speed by combining multiple algorithms, as demonstrated in our ResNet blog post.