Algorithmic Methods

Algorithmic methods speed up training, improve accuracy, and make ML training more efficient and cost-effective. The library of algorithmic methods has been curated by the MosaicML team and verified to work on public datasets.

You can try all of these methods in the MosaicML Composer library.

Go to Composer

Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) is an optimization algorithm that minimizes both the loss and the sharpness of the loss. It finds parameters that lie in a neighborhood of low loss. The authors find that this improves model generalization.
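
A minimal PyTorch sketch of the idea (not Composer's implementation); it assumes a standard model, loss function, and base optimizer, and the neighborhood radius rho shown here is illustrative.

```python
import torch

def sam_step(model, loss_fn, optimizer, inputs, targets, rho=0.05):
    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Climb to the approximate worst-case point within an L2 ball of radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append((p, e))

    # Second forward/backward pass: gradients at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Restore the original weights and apply the sharpness-aware update.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```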

Stochastic Weight Averaging

Stochastic Weight Averaging (SWA) maintains a running average of the weights towards the end of training. This leads to better generalization than conventional training.
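
A minimal sketch using PyTorch's built-in torch.optim.swa_utils (not Composer's implementation); the epoch at which averaging starts and the SWA learning rate are illustrative.

```python
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loader, loss_fn, epochs=90, swa_start=75):
    swa_model = AveragedModel(model)               # running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant LR during averaging

    for epoch in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)     # fold current weights into the average
            swa_scheduler.step()

    update_bn(loader, swa_model)                   # recompute BatchNorm statistics
    return swa_model
```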

Progressive Resizing

Progressive Resizing works by initially shrinking the size of the training images, and slowly growing them back to their full size by the end of training. It reduces costs during the early phase of training, when the network may learn coarse-grained features that do not require details lost by reducing image resolution.
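
A minimal sketch, assuming NCHW image batches and a training-progress value in [0, 1]; the initial scale and the fraction of training spent at full resolution are illustrative.

```python
import torch.nn.functional as F

def resize_batch(images, progress, initial_scale=0.5, finetune_fraction=0.2):
    """Downscale a batch of NCHW images based on training progress in [0, 1]."""
    ramp_end = 1.0 - finetune_fraction
    if progress >= ramp_end:
        return images  # train at full resolution for the final stretch
    scale = initial_scale + (1.0 - initial_scale) * (progress / ramp_end)
    h, w = images.shape[-2:]
    return F.interpolate(images, size=(int(h * scale), int(w * scale)),
                         mode="bilinear", align_corners=False)
```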

Ghost BatchNorm

During training, BatchNorm normalizes a batch of inputs to have a mean of 0 and variance of 1. GhostBatchNorm instead splits the batch into multiple "ghost" batches, each containing ghost_batch_size samples, and normalizes each one to have a mean of 0 and variance of 1. This causes training with a large batch size to behave more similarly to training with a small batch size and acts as a regularizer.
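
A simplified sketch, not Composer's implementation: during training, a single BatchNorm layer normalizes each ghost batch using that chunk's own batch statistics.

```python
import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.Module):
    def __init__(self, num_features, ghost_batch_size=32):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm2d(num_features)

    def forward(self, x):
        if not self.training:
            return self.bn(x)
        # Split the batch into ghost batches and normalize each chunk separately.
        chunks = x.split(self.ghost_batch_size, dim=0)
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)
```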

ColOut

ColOut works by dropping a fraction of the rows and columns of an input image. If the fraction of rows/columns dropped isn't too large, the image content is not significantly altered, but the image size is reduced, which speeds up training. The removal of rows and columns also introduces variability that can modestly degrade accuracy.
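
A minimal sketch for a batch of NCHW images; the drop fractions are illustrative.

```python
import torch

def colout(images, p_row=0.15, p_col=0.15):
    """Drop a random fraction of rows and columns from a batch of NCHW images."""
    h, w = images.shape[-2:]
    keep_rows = torch.rand(h) > p_row
    keep_cols = torch.rand(w) > p_col
    return images[..., keep_rows, :][..., :, keep_cols]
```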

ALiBi

ALiBi (Attention with Linear Biases) dispenses with position embeddings for tokens in transformer-based NLP models, instead encoding position information by biasing the query-key attention scores proportionally to each token pair's distance. ALiBi yields excellent extrapolation to unseen sequence lengths compared to other position embedding schemes. We leverage this extrapolation capability by training with shorter sequence lengths, which reduces the memory and computation load.
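
A minimal sketch of the bias term in a simplified, non-causal form with power-of-two head slopes; in a causal model the usual attention mask would still be applied on top of this bias.

```python
import torch

def alibi_bias(num_heads, seq_len):
    """Build the (num_heads, seq_len, seq_len) additive bias used in place of position embeddings."""
    # One geometric slope per head (simplified to powers of two for clarity).
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs()  # |i - j|
    return -slopes[:, None, None] * distances[None, :, :].float()

# Usage: scores = q @ k.transpose(-2, -1) / d ** 0.5 + alibi_bias(num_heads, seq_len)
```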

BlurPool

BlurPool increases accuracy at nearly the same speed by applying a spatial low-pass filter before the pooling operation in max pooling and before strided convolutions.
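
A minimal sketch of an anti-aliased max pool (not Composer's implementation), using a fixed 3x3 binomial filter applied depthwise before the stride-2 downsample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurMaxPool2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        blur = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        # One copy of the filter per channel, applied as a depthwise convolution.
        self.register_buffer("blur", blur.expand(channels, 1, 3, 3).contiguous())

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)  # dense max, no downsampling yet
        # Low-pass filter with stride 2 performs the anti-aliased downsample.
        return F.conv2d(x, self.blur, stride=2, padding=1, groups=x.shape[1])
```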

Squeeze Excite

Squeeze-Excite adds a channel-wise attention operator to CNNs. The attention coefficients are produced by a small, trainable MLP that takes the channels' globally pooled activations as input.
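
A minimal sketch of the block; the reduction ratio is illustrative.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        weights = self.mlp(x.mean(dim=(2, 3)))  # squeeze: global average pool per channel
        return x * weights.view(n, c, 1, 1)     # excite: channel-wise rescaling
```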

Layer Freezing

Layer Freezing gradually makes early modules untrainable, saving the cost of backpropagating through frozen modules.
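
A minimal sketch, assuming freezing ramps linearly with training progress over model.children(); the start and end points of the ramp are illustrative.

```python
def freeze_layers(model, progress, start=0.5, end=0.9):
    """Freeze a growing prefix of model.children() as progress moves from start to end."""
    layers = list(model.children())
    if progress <= start:
        return
    frac = min((progress - start) / (end - start), 1.0)
    num_frozen = int(frac * (len(layers) - 1))  # never freeze the final layer
    for layer in layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad_(False)  # no gradients computed for frozen modules
```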

Scale Schedule

Scale Schedule changes the number of training steps by a dilation factor and dilates the learning-rate schedule accordingly. Doing so varies the training budget, making it possible to explore tradeoffs between cost (measured in time or money) and the quality of the final model.
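
A minimal sketch, assuming a step-based learning-rate schedule defined by milestones; the numbers in the usage example are illustrative.

```python
def scale_schedule(max_steps, milestones, scale=0.5):
    """Return a dilated step budget and LR milestones (scale < 1 trains for less time)."""
    scaled_steps = int(max_steps * scale)
    scaled_milestones = [int(m * scale) for m in milestones]
    return scaled_steps, scaled_milestones

# e.g. scale_schedule(90_000, [30_000, 60_000, 80_000], scale=0.5)
#      -> (45_000, [15_000, 30_000, 40_000])
```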

Label Smoothing

Label smoothing modifies the target distribution for a task by interpolating between the target distribution and another distribution that usually has higher entropy. This typically reduces a model's confidence in its outputs and serves as a form of regularization.
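
A minimal sketch, assuming integer class targets and a uniform smoothing distribution; alpha is the smoothing factor.

```python
import torch.nn.functional as F

def smooth_labels(targets, num_classes, alpha=0.1):
    """Convert integer class targets to smoothed target distributions."""
    one_hot = F.one_hot(targets, num_classes).float()
    return (1.0 - alpha) * one_hot + alpha / num_classes

# In recent PyTorch versions, nn.CrossEntropyLoss(label_smoothing=0.1) has the same effect.
```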

MixUp

MixUp trains the network on convex combinations of examples and targets rather than individual examples and targets. Training in this fashion improves generalization performance.
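
A minimal sketch, assuming one-hot (or smoothed) targets; alpha parameterizes the Beta distribution from which the mixing coefficient is drawn.

```python
import torch

def mixup(inputs, one_hot_targets, alpha=0.2):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[perm]
    mixed_targets = lam * one_hot_targets + (1.0 - lam) * one_hot_targets[perm]
    return mixed_inputs, mixed_targets
```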

Cutout

Cutout is a regularization/data augmentation technique that works by masking out one or more square regions of an input image.
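
A minimal sketch that masks one square region per image in a batch of NCHW images; the mask size is illustrative.

```python
import torch

def cutout(images, mask_size=16):
    """Zero out a random mask_size x mask_size square in each image."""
    n, _, h, w = images.shape
    images = images.clone()
    ys = torch.randint(0, h - mask_size + 1, (n,))
    xs = torch.randint(0, w - mask_size + 1, (n,))
    for i in range(n):
        images[i, :, ys[i]:ys[i] + mask_size, xs[i]:xs[i] + mask_size] = 0.0
    return images
```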

RandAugment

RandAugment applies a series of randomly selected data augmentations (such as shearing, translation, and color adjustments) to each training image.
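
A minimal sketch using torchvision's built-in RandAugment transform (assumes a recent torchvision release); num_ops and magnitude are the usual N and M hyperparameters.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),  # apply 2 random ops at magnitude 9
    transforms.ToTensor(),
])
```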

Channels Last

Channels Last is a hardware optimization that improves the throughput of convolution operations by storing activation and weight tensors in an NHWC (i.e., channels-last) format, rather than PyTorch's default of NCHW.
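
A minimal sketch, assuming a convolutional model; both the model's weights and the input batch are converted to the channels-last memory format.

```python
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3).to(memory_format=torch.channels_last)
inputs = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)
outputs = model(inputs)  # convolutions can now use faster NHWC kernels
```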

As a community

Let's make ML better, one method at a time.