Recently, researchers from institutions including the University of California, Irvine have reduced the cost of training a diffusion model from scratch to $1,890 using strategies such as deferred masking, mixture-of-experts (MoE) layers, and layer-wise scaling. This is an order of magnitude lower than the previous cheapest method, Wuerstchen, at $28,400, while models like Stable Diffusion cost even more to train.
Researchers have tried various approaches to reduce these enormous expenses. For example, the original diffusion models required about 1,000 denoising steps to go from noise to image, which has since been reduced to around 20 steps or even fewer. And as the backbone of diffusion models has gradually shifted from U-Net (CNN-based) to DiT (Transformer-based), optimizations that exploit Transformer characteristics have followed, such as quantization, skipping redundant computation in attention, and pipelining.
This study trains a 1.16 billion parameter diffusion model from scratch for only $1,890, an order-of-magnitude improvement over the prior state of the art that puts pre-training within reach of ordinary practitioners. More importantly, the cost-reduction techniques did not hurt performance: the 1.16 billion parameter model produces very good results. Beyond visual quality, its quantitative metrics are also strong, with FID scores very close to those of Stable Diffusion 1.5 and DALL·E 2.
The cost-saving secrets mainly include:
- Deferred masking strategy: a lightweight patch-mixer preprocesses all patches before masking, embedding information from soon-to-be-discarded patches into the surviving ones, which significantly reduces the performance degradation caused by high masking ratios (see the sketch after this list).
- Fine-tuning: a brief unmasked fine-tuning phase after masked pre-training mitigates the generation artifacts that masking introduces.
- MoE and layer-wise scaling: simplified MoE layers based on expert-choice routing increase the model's parameter count and expressiveness without significantly increasing training cost (a routing sketch also follows below). The authors also considered layer-wise scaling, which linearly increases the width of Transformer blocks across depth.
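To make the deferred masking idea concrete, below is a minimal PyTorch sketch. The module name, dimensions, masking ratio, and mixer depth are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DeferredMasking(nn.Module):
    # Sketch: mix information across ALL patches with a cheap transformer first,
    # then randomly mask; survivors carry information from discarded patches.
    def __init__(self, dim=512, mask_ratio=0.75, mixer_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.patch_mixer = nn.TransformerEncoder(layer, num_layers=mixer_layers)
        self.mask_ratio = mask_ratio

    def forward(self, patches):                       # patches: (B, N, dim)
        B, N, _ = patches.shape
        mixed = self.patch_mixer(patches)             # every patch attends to every other patch
        n_keep = max(1, int(N * (1 - self.mask_ratio)))
        keep = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :n_keep]
        idx = keep.unsqueeze(-1).expand(-1, -1, mixed.size(-1))
        survivors = mixed.gather(1, idx)              # (B, n_keep, dim) goes to the DiT backbone
        return survivors, keep
```

The key point is that the expensive backbone only ever sees the surviving tokens (about 25% at a 0.75 masking ratio), while the cheap mixer has already spread information from the rest into them.

Likewise, here is a simplified sketch of an expert-choice-routing MoE layer, where each expert picks the tokens it processes; the expert count, capacity, and FFN sizes are placeholders rather than the authors' settings.

```python
class ExpertChoiceFFN(nn.Module):
    # Each expert selects the top-k tokens it has the highest affinity for,
    # so adding experts grows capacity without growing per-token compute much.
    def __init__(self, dim=512, num_experts=4, capacity=64):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity = capacity

    def forward(self, x):                             # x: (B, N, dim)
        B, N, D = x.shape
        tokens = x.reshape(B * N, D)
        affinity = self.router(tokens).softmax(dim=-1)    # token-to-expert affinities
        out = torch.zeros_like(tokens)
        k = min(self.capacity, tokens.size(0))
        for e, expert in enumerate(self.experts):
            chosen = affinity[:, e].topk(k).indices       # expert e chooses its own tokens
            out[chosen] = out[chosen] + affinity[chosen, e, None] * expert(tokens[chosen])
        return x + out.reshape(B, N, D)                   # residual combine
```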
For the experimental setup, the authors used two DiT variants, DiT-Tiny/2 and DiT-XL/2, both with patch size 2. All models were trained with the AdamW optimizer using cosine learning rate decay and high weight decay. On the input side, the four-channel variational autoencoder (VAE) from Stable Diffusion XL extracts image features; the authors also tested the latest 16-channel VAE in the large-scale (budget) training run.
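As a rough illustration of this plumbing, the snippet below wires up AdamW with cosine decay and extracts four-channel latents with the SDXL VAE via the diffusers library. The learning rate, weight decay, step count, and dummy model are placeholders, not the paper's exact hyperparameters.

```python
import torch
from diffusers import AutoencoderKL

model = torch.nn.Linear(4, 4)  # stand-in for the DiT backbone
total_steps = 10_000           # placeholder training length

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)  # "high" weight decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# Frozen four-channel SDXL VAE as the image feature extractor.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval().requires_grad_(False)
images = torch.randn(1, 3, 256, 256)  # dummy image batch in [-1, 1]
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
```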
The authors used the EDM framework as the unified training setup for all diffusion models, measuring image generation performance with FID and CLIP score. The most commonly used CLIP model served as the text encoder.
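For the evaluation side, a minimal way to compute a CLIP score with the Hugging Face transformers library is sketched below; the checkpoint choice is an assumption (the summary does not pin a specific CLIP variant), and FID would typically be computed with a separate tool.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-large-patch14"  # assumed checkpoint
clip = CLIPModel.from_pretrained(name).eval()
proc = CLIPProcessor.from_pretrained(name)

@torch.no_grad()
def clip_score(images, prompts):
    # images: list of PIL images; prompts: list of matching caption strings
    inputs = proc(text=prompts, images=images, return_tensors="pt", padding=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).mean()  # mean cosine similarity over pairs
```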
For training data, the authors used three real image datasets (Conceptual Captions, Segment Anything, TextCaps), totaling 22 million image-text pairs. Since Segment Anything (SA1B) does not ship with real captions, synthetic captions generated by the LLaVA model were used instead.
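To illustrate the synthetic-captioning step, here is a hypothetical loop using a community LLaVA checkpoint via transformers; the model id, prompt wording, and generation settings are assumptions, not the authors' actual pipeline.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

def caption(image_path: str) -> str:
    # Generate a short synthetic caption for an uncaptioned image (e.g., from SA1B).
    image = Image.open(image_path)
    prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return processor.decode(out[0], skip_special_tokens=True)
```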