Composer + FFCV: Faster Together

Composer is pushing the envelope on speed and efficiency in model training. Integrating Composer with FFCV, a fast dataloading library from Aleks Madry’s lab at MIT, unlocks new speedup methods by eliminating the dataloader bottleneck often experienced when using CPU-intensive operations in the training loop. The FFCV dataloader is one of the ingredients of our Mosaic ResNet recipe, which demonstrates how algorithmic efficiency can dramatically speed up model training.

We’ve been working on algorithms to speed up training of neural networks, composing together recipes ranging from regularization techniques to systems efficiency optimizations. On some models, we’ve sped up the GPU portion so much that we are now bottlenecked on the CPU-side decoding and augmentation work in the dataloader.

Limited by the DataLoader

As a result of this dataloader bottleneck, some methods don’t actually yield a significant speedup. One example is Progressive Image Resizing, which starts training with small image sizes and then increases the height and width as training proceeds. Here’s a baseline run on ImageNet with ResNet-50 trained on an 8x A100 GPU system (for more details, see the Experimental Setup section of the Appendix), running at ~16,200 images per second:

Figure 1: ResNet50 throughput using PyTorch dataloader

from composer import Trainer, algorithms
from composer.models import TIMM

trainer = Trainer(
    model=TIMM(model_name='resnet50'),
    train_dataloader=...,  # PyTorch dataloader
    algorithms=[
        algorithms.ProgressiveResizing(),
    ]
)

We use the schedule below to gradually increase the image size during training, reaching the maximum size of 224 near the end.

Figure 2: Change in Image Height/Width during training with progressive resizing
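The shape of such a schedule can be sketched in a few lines of Python. This is a simplified illustration, not Composer's implementation: the function name image_size_at and all default values are hypothetical, and the parameter names simply mirror the progressive_resizing options that appear in the reproduction command in the Appendix.

```python
def image_size_at(progress: float,
                  full_size: int = 224,
                  initial_scale: float = 0.5,      # hypothetical default
                  delay_fraction: float = 0.0,     # hypothetical default
                  finetune_fraction: float = 0.2,  # hypothetical default
                  size_increment: int = 4) -> int:
    """Return the image side length at a given fraction of training (0.0 to 1.0).

    The scale stays at initial_scale during the delay phase, ramps linearly
    to 1.0, and then stays at 1.0 for the final fine-tuning phase.
    """
    ramp_end = 1.0 - finetune_fraction
    if progress <= delay_fraction:
        scale = initial_scale
    elif progress >= ramp_end:
        scale = 1.0
    else:
        ramp = (progress - delay_fraction) / (ramp_end - delay_fraction)
        scale = initial_scale + (1.0 - initial_scale) * ramp
    # Snap to a multiple of size_increment and never exceed the full size.
    size = round(full_size * scale / size_increment) * size_increment
    return min(full_size, size)


for p in (0.0, 0.4, 0.8, 1.0):
    print(f"progress={p:.1f} -> {image_size_at(p)} px")
```

With these illustrative defaults, training starts at 112 pixels and reaches 224 pixels for the last fifth of the run.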

Naively, one would expect the throughput to quadruple to ~65,000 images/sec when the input image size is 112x112 pixels (a quarter of the pixels of the maximum size, 224x224). However, that is not the case, as shown below. We are bottlenecked by the dataloader: the throughput is pegged at ~17,800 images/sec, even during the initial part of training when the images are small.

Figure 3: ResNet50 throughput using PyTorch dataloader with progressive resizing speedup algorithm
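The naive expectation is simple arithmetic, worked through below as a small sketch. It assumes that GPU throughput scales inversely with the number of pixels per image, which is an idealization; the helper name naive_throughput is hypothetical, and the numbers are the ones quoted in this post.

```python
def naive_throughput(baseline_ips: float, full_size: int, current_size: int) -> float:
    """Ideal images/sec if throughput scaled inversely with pixel count."""
    pixel_ratio = (full_size / current_size) ** 2
    return baseline_ips * pixel_ratio


baseline = 16_200   # images/sec at 224x224 with the PyTorch dataloader
observed = 17_800   # images/sec actually measured at small image sizes

expected = naive_throughput(baseline, full_size=224, current_size=112)
print(f"expected at 112x112: {expected:,.0f} images/sec")
print(f"observed / expected: {observed / expected:.2f}")
```

The observed rate is only about a quarter of the idealized figure, which is what being pinned by the dataloader looks like in practice.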

Composer has tens of speedup algorithms that can be composed together, but their benefits cannot be realized if training becomes bottlenecked by the dataloader.

Enter FFCV

FFCV is a PyTorch-compatible dataloading library that increases throughput for model training. It combines several techniques: compiling the image processing pipeline to native code, intelligent memory allocation, better scheduling of operations in the image processing pipeline, faster JPEG decoding with libturbojpeg, and a new flexible storage format that can keep both raw and compressed images together. It packages all of these in an easy-to-use interface. Please refer to the FFCV documentation for more details. Shoutout to the FFCV team for making such a useful, fast, and practical dataloading library.

FFCV release announcement on Twitter. Tweet thread: https://twitter.com/aleks_madry/status/1483523047273512978

We integrated FFCV with Composer to alleviate dataloading bottlenecks. Now, using the same setup except with FFCV instead of PyTorch for dataloading, throughput during the small-image phase rises from ~17,800 to ~30,000 images/sec, a ~1.7x increase. Training now completes in 86 minutes instead of 122 minutes.

Figure 4: ResNet50 throughput using FFCV with the progressive resizing speedup algorithm. With FFCV, images/sec improved from 17.8K to 30K for the initial part of training, which uses smaller images.

With FFCV, we suddenly have a much larger exploration space for our speedups, without worrying about dataloader bottlenecks.

Pushing the boundaries

So then, how fast can the FFCV dataloader go? Where is the new images/sec ceiling that we can build methods to push towards? To answer these questions, we benchmarked the FFCV dataloader in isolation (no model!) across different batch sizes. To mimic our 8-GPU setting, we ran 8 processes, each with its own dataloader instance, on a single node.
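A benchmark of this kind can be sketched as follows. This is a minimal, single-process illustration, not the script we used: measure_throughput and fake_loader are hypothetical names, and the fake loader merely stands in for a real FFCV or PyTorch dataloader.

```python
import time
from typing import Any, Callable, Iterable, Iterator, List


def measure_throughput(loader: Iterable[Any],
                       warmup_batches: int = 2,
                       count_images: Callable[[Any], int] = len) -> float:
    """Iterate over a loader with no model attached and return images/sec.

    The first warmup_batches batches are excluded from the measurement so
    that one-time startup costs do not skew the result.
    """
    images = 0
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        if i < warmup_batches:
            # Restart the clock once each warmup batch has been consumed.
            start = time.perf_counter()
            continue
        images += count_images(batch)
    elapsed = time.perf_counter() - start
    if images == 0:
        raise ValueError("loader yielded no batches after warmup")
    return images / elapsed


def fake_loader(num_batches: int = 20, batch_size: int = 256,
                delay_s: float = 0.001) -> Iterator[List[int]]:
    """Stand-in for a real dataloader: sleeps, then yields a dummy batch."""
    for _ in range(num_batches):
        time.sleep(delay_s)
        yield [0] * batch_size


print(f"{measure_throughput(fake_loader()):,.0f} images/sec")
```

For loaders that yield (images, labels) tuples, pass something like count_images=lambda b: len(b[0]) so that images rather than tuple elements are counted.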

A few observations:

1. Maximum throughput is ~40,000 images/sec with FFCV. Even with Progressive Image Resizing, we still have a significantly higher ceiling for more speedup methods!

2. For large batch sizes (256 or 512), the FFCV dataloader is ~2x faster than the PyTorch dataloader.

3. For smaller batch sizes, the relative speedup is lower, but the FFCV dataloader is still ~1.5x faster.

4. If the CPUs are oversubscribed, i.e., total workers ≥ total cores available, FFCV is hurt more by the oversubscription than PyTorch dataloaders are (see below). With FFCV, be careful to set the number of workers properly for your core count!
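One way to avoid oversubscription is to derive the worker count from the core count instead of hard-coding it. The sketch below is a heuristic under stated assumptions, not a recommendation from the FFCV or Composer documentation: the helper name workers_per_process is hypothetical, and reserving one core per training process is our own illustrative choice.

```python
import os
from typing import Optional


def workers_per_process(num_processes: int,
                        total_cores: Optional[int] = None,
                        reserve_per_process: int = 1) -> int:
    """Pick a dataloader worker count that keeps total workers below total cores.

    reserve_per_process leaves cores free for each training process itself
    (an assumed heuristic, not a documented rule).
    """
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    per_process = cores // num_processes - reserve_per_process
    return max(1, per_process)


# Example: 8 training processes on a 64-core node.
workers = workers_per_process(num_processes=8, total_cores=64)
print(workers, "workers per process,", workers * 8, "workers in total")
```

On a 64-core node with 8 processes this gives 7 workers each, or 56 in total, which stays below the 64 available cores.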

Full-model Training Runs

Putting this all together, we ran some full training runs. We mimicked the setup used for the FFCV results, using techniques such as BlurPool, Label Smoothing, Channels Last, and Progressive Image Resizing.

16% speedup for the full training run. PyTorch and FFCV dataloader experiments are run on the same system, and reach a Top-1 accuracy of 78.12% and 77.94% respectively.

In the end-to-end setting, training ResNet-50 was ~16% faster using FFCV compared to the vanilla PyTorch dataloader, a significant improvement (see footnote 1). As we introduce more speedups that stress the dataloader, we expect the margin to grow even more!

The FFCV dataloader is used in our Mosaic ResNet recipe, which brings together additional speedups to achieve some very impressive results. We are excited to continue exploring speedups, thanks to the great work from the FFCV team.

Try it yourself

We have provided a notebook in the Composer repository on GitHub with a nice walkthrough demonstrating how to use the FFCV dataloader with Composer. We encourage you to try it out, and see how much you can speed up your computer vision ML training! If you have questions, you can file an issue, or join our Slack community to start a conversation with the MosaicML team. If you like Composer, please give it a star on GitHub!

Appendix: Reproduction

To reproduce the results, use a 64-core CPU with 8x NVIDIA A100-80GB GPUs, and the mosaicml/pytorch_vision:latest Docker image.



# clone composer github to access our ffcv scripts
git clone https://github.com/mosaicml/composer.git
cd composer 

# Create datasets in FFCV format. 
# 
# We created helper scripts to convert datasets to FFCV format.
# The dataset can either be on your local disk or in remote 
# formats (webdataset, or S3 buckets). 
# 
# For more information and options, run:
#    >> python scripts/ffcv/create_ffcv_datasets.py --help
#
# The commands below create
# /tmp/imagenet_train.ffcv and /tmp/imagenet_val.ffcv
python scripts/ffcv/create_ffcv_datasets.py --dataset imagenet \
	--split train --datadir 〈ImageNet Dataset〉
python scripts/ffcv/create_ffcv_datasets.py --dataset imagenet \
	--split val --datadir 〈ImageNet Dataset〉

# Train for 88 epochs
composer -n 8 examples/run_composer_trainer.py -f composer/yamls/models/resnet50.yaml \
 --max_duration 88ep --train_dataset.imagenet.use_ffcv true \
 --val_dataset.imagenet.use_ffcv true \
 --algorithms channels_last blurpool label_smoothing progressive_resizing \
 --algorithms.progressive_resizing.size_increment 32 \
 --algorithms.progressive_resizing.initial_scale 0.834 \
 --algorithms.progressive_resizing.delay_fraction 0.727 \
 --algorithms.progressive_resizing.finetune_fraction 0.136


Appendix: Experimental Setup

  • All results were obtained on a single node with 8x A100-80GB GPUs and 2x 32-core CPUs (2x AMD EPYC 7513 32-Core Processor) running at 2.6 GHz with SMT disabled.
  • Information about other hyperparameters such as batch size, optimizers etc. can be found in our open source configuration file for ResNet-50.
  • PyTorch Environment:

PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr 18 2022, 22:40:46)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-39-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-pfn-extras==0.5.8
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.0+cu113
[pip3] torch-optimizer==0.1.0
[pip3] torchmetrics==0.7.3
[pip3] torchvision==0.11.1+cu113
[pip3] vit-pytorch==0.27.0
[conda] Could not collect

Footnote 1: A reader familiar with FFCV results may notice that the training time for Composer + FFCV is longer than for FFCV alone, at slightly lower accuracy. This is because our results do not use test-time augmentations (e.g., a larger image size), and because Composer performs progressive resizing on the GPUs, unlike FFCV, which fuses it with the image loading pipeline. We don’t fuse progressive resizing into the image loading pipeline because it would no longer be composable with other speedups. Keeping progressive resizing composable allows us to realize greater overall speed by combining multiple algorithms, as demonstrated in our ResNet blog post.
