Training Stable Diffusion from Scratch Costs <$160k
February 8th, 2023 Update: We're proud to report that this blog is already out of date. It now costs $125K! Stay tuned for more speedups from @MosaicML, coming soon to a diffusion model near you!
The AI world is buzzing with the power of large generative neural networks such as ChatGPT, Stable Diffusion, and more. These models are capable of impressive performance on a wide range of tasks, but due to their size and complexity, only a handful of organizations have the ability to train them. As a consequence, access to these models can be restricted by the organization that owns them, and users have no control over the data the model has seen during training.
That’s where we can help: at MosaicML, we make it easier to train large models efficiently, enabling more organizations to train their own models on their own data. As shown in a previous blog post, our StreamingDataset library, our training framework Composer, and our MosaicML platform significantly simplify the process of training large language models (LLMs). For this blog post, we used that same process to measure the time and cost to train a Stable Diffusion model from scratch. We estimated an upper bound of 79,000 A100-hours to train Stable Diffusion v2 base in 13 days on our MosaicML platform, corresponding to a total training cost of less than $160,000. This is a 2.5x reduction in the time and cost reported in the model card from Stability AI. In addition to saving time and money, our Streaming, Composer, and MosaicML platform tools make it dead-simple to set up and scale Stable Diffusion training across hundreds of GPUs without any additional effort. The code we used for this experiment is open-source and ready to run; check it out here! And if you’re interested in training diffusion models yourself on the MosaicML platform, contact us for a demo.
Time and Cost Estimates
Table 1 and Figure 1 below illustrate how the Stable Diffusion v2 base training time and cost estimates vary with the number of GPUs used. Our final estimate for 256 A100s is 12.83 days of training time at a cost of roughly $160,000, a 2.5x reduction in the time and cost reported in the Stable Diffusion model card. These estimates were calculated from measured throughput, assuming training on 2.9 billion samples. Throughput was measured while training on 512x512 resolution images and captions with a maximum tokenized length of 77. We scaled from 8 to 128 NVIDIA 40GB A100 GPUs, then extrapolated throughput to 256 A100s based on these measurements.
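As a quick sanity check on those numbers, here is a back-of-the-envelope calculation. The roughly $2 per A100-hour rate is an assumption implied by the reported totals, not an official price quote:

```python
# Back-of-the-envelope check of the headline estimates. The ~$2/A100-hour rate is
# what the reported totals imply, not an official price quote.
a100_hours = 79_000                       # estimated upper bound for Stable Diffusion v2 base
num_gpus = 256
days = a100_hours / num_gpus / 24         # wall-clock training time on 256 A100s
cost = a100_hours * 2.0                   # assumed ~$2 per A100-hour
print(f"~{days:.1f} days, ~${cost:,.0f}")  # ~12.9 days, ~$158,000
```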


Benchmark Setup
How did we get these results? We took advantage of a MosaicML Streaming dataset, our Composer training framework, and the MosaicML platform to measure throughput when training a Stable Diffusion model. You can reproduce our results using the code in this repo. Read on for more details:
Streaming
A major pain point when training Stable Diffusion is working with enormous datasets such as LAION-5B. The MosaicML StreamingDataset library makes it significantly easier to manage and use these massive datasets. It works by converting the target dataset into the Streaming format, then storing the converted dataset in the desired cloud storage (e.g. an AWS S3 bucket). To use the stored dataset, we simply define a StreamingDataset subclass that pulls and transforms samples from cloud storage, as sketched below.
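Here is a minimal sketch of both steps. The paths, column names, and the ImageCaptionDataset class are illustrative placeholders; the exact schema and transforms live in the data script in our repo:

```python
# Minimal Streaming sketch. Paths, column names, and ImageCaptionDataset are
# illustrative placeholders; see the data script in our repo for the real schema.
from streaming import MDSWriter, StreamingDataset

# 1) Convert the source dataset into the Streaming (MDS) format. The resulting
#    shards are then uploaded to cloud storage, e.g. an AWS S3 bucket.
columns = {'image': 'jpeg', 'caption': 'str'}
raw_samples = []  # placeholder: an iterable of (PIL image, caption string) pairs
with MDSWriter(out='/tmp/laion-mds', columns=columns) as writer:
    for image, caption in raw_samples:
        writer.write({'image': image, 'caption': caption})

# 2) Define a StreamingDataset subclass that pulls shards from cloud storage and
#    transforms each sample on the fly.
class ImageCaptionDataset(StreamingDataset):
    def __init__(self, remote, local, transform, tokenizer, **kwargs):
        super().__init__(remote=remote, local=local, **kwargs)
        self.transform = transform    # e.g. resize/crop/normalize to 512x512 tensors
        self.tokenizer = tokenizer    # e.g. a CLIP tokenizer with max length 77

    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        image = self.transform(sample['image'])
        tokens = self.tokenizer(sample['caption'], padding='max_length',
                                max_length=77, truncation=True, return_tensors='pt')
        return {'image': image, 'caption': tokens.input_ids[0]}
```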
For our results, we streamed a subset of the LAION-400M dataset with 256x256 images and their associated captions. Images were resized to 512x512 for the final throughput measurements. We estimated the Streaming data loader's throughput to be at least 30x higher than model throughput in all configurations we tested, so we're unlikely to see data loader-related bottlenecks anytime soon. Check out our data script to see how we defined our streaming dataset, and keep an eye out for our soon-to-be-released Streaming blog post for more details!
Composer
Our open-source Composer library contains many state-of-the-art methods for accelerating neural network training and improving generalization. For this project, we first defined a ComposerModel for Stable Diffusion using models from HuggingFace’s Diffusers library and configs from “stabilityai/stable-diffusion-2-base”. The ComposerModel and Streaming dataset were passed to Composer’s Trainer along with an AdamW optimizer, an EMA algorithm (see Notes), a throughput measurement callback, and a Weights & Biases logger. Finally, we called “fit()” on the Trainer object to start training.
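As a rough illustration of how those pieces fit together, here is a condensed sketch. The StableDiffusion class below is a simplified stand-in for the model definition in our benchmark repo, and the bucket path, batch size, and learning rate are assumptions for illustration:

```python
# Condensed sketch of the Composer setup; not the exact model definition from our repo.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from composer import Trainer
from composer.algorithms import EMA
from composer.callbacks import SpeedMonitor
from composer.loggers import WandBLogger
from composer.models import ComposerModel
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_name = 'stabilityai/stable-diffusion-2-base'

# Wrap the ImageCaptionDataset from the Streaming sketch above in a torch DataLoader.
# The remote/local paths are hypothetical.
train_dataloader = DataLoader(
    ImageCaptionDataset(
        remote='s3://my-bucket/laion-mds', local='/tmp/laion-mds', shuffle=True,
        transform=transforms.Compose([
            transforms.Resize(512), transforms.CenterCrop(512),
            transforms.ToTensor(), transforms.Normalize([0.5], [0.5])]),
        tokenizer=CLIPTokenizer.from_pretrained(model_name, subfolder='tokenizer'),
    ),
    batch_size=32,  # illustrative per-device batch size
)

class StableDiffusion(ComposerModel):
    def __init__(self, name: str = model_name):
        super().__init__()
        self.unet = UNet2DConditionModel.from_pretrained(name, subfolder='unet')
        self.vae = AutoencoderKL.from_pretrained(name, subfolder='vae')
        self.text_encoder = CLIPTextModel.from_pretrained(name, subfolder='text_encoder')
        self.noise_scheduler = DDPMScheduler.from_pretrained(name, subfolder='scheduler')
        # The VAE and text encoder are typically frozen during diffusion training.
        self.vae.requires_grad_(False)
        self.text_encoder.requires_grad_(False)

    def forward(self, batch):
        # Assumes each batch holds 512x512 image tensors and pre-tokenized caption IDs.
        latents = self.vae.encode(batch['image']).latent_dist.sample() * 0.18215
        conditioning = self.text_encoder(batch['caption'])[0]
        # Add noise at a random timestep; the U-Net learns to predict that noise.
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, self.noise_scheduler.config.num_train_timesteps,
                                  (latents.shape[0],), device=latents.device)
        noised_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
        return self.unet(noised_latents, timesteps, conditioning).sample, noise

    def loss(self, outputs, batch):
        prediction, target = outputs
        return F.mse_loss(prediction, target)

model = StableDiffusion()
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=torch.optim.AdamW(model.parameters(), lr=1e-4),
    algorithms=[EMA()],                   # Composer's built-in EMA (see Notes)
    callbacks=[SpeedMonitor(window_size=50)],
    loggers=[WandBLogger()],
    max_duration='2900000000sp',          # ~2.9B training samples, per the estimates above
)
trainer.fit()
```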
MosaicML Platform
MosaicML's platform orchestrates and monitors compute infrastructure for large-scale training jobs. Our job scheduler makes launching and scaling jobs easy. Figure 2 shows the MosaicML training configuration (left) and the CLI used to launch the training run (right). From the configuration, we can easily scale the number of GPUs with the “gpu_num” parameter. The same code we write for one node can automatically leverage tens of nodes and hundreds of GPUs.

Left: Example MosaicML training configuration. Right: Starting a training run with Mosaic CLI (mcli).
What’s Next?
In this blog post, we estimated a 2.5x reduction in the time and cost to train Stable Diffusion when using our Streaming datasets, Composer library, and MosaicML platform. This is a great preliminary result, but we’re not done yet. In a future blog post, we’ll verify that we can train to convergence at this speed. For updates on our latest work, join our Community Slack or follow us on Twitter. If your organization wants to start training diffusion models today, you can schedule a demo online or email us at demo@mosaicml.com.
Notes:
- Per device microbatch size is related to gradient accumulation, but microbatch size is easier to use when increasing the number of GPUs. Per device microbatch size and gradient accumulation are related as follows: grad_accum = global_batch_size / (n_devices * per_device_microbatch_size). For more details, check out the “device_train_microbatch_size” variable in Composer’s Trainer API reference.
- Composer has a built-in EMA algorithm, but we had to improve its memory efficiency for this benchmark, hence the separate EMA implementation in our benchmark repo. We will update Composer's EMA algorithm to be more memory efficient soon.