Stable Diffusion founder leads team in a new startup, surpassing competitors with an open-source release

Open-source image generation technology has achieved a breakthrough.

The first is FLUX.1 [pro], a brand-new SOTA text-to-image model with extremely rich image detail, strong prompt adherence, and diverse styles. It is currently available through an API.

API address: https://docs.bfl.ml/
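
For readers who want to call [pro] programmatically, below is a minimal submit-and-poll sketch. The `flux-pro` endpoint path, the `x-key` header, and the response field names are assumptions made for illustration; https://docs.bfl.ml/ remains the authoritative reference.

```python
import os
import time

import requests

# NOTE: endpoint paths, headers, and response fields below are assumptions
# for illustration; consult https://docs.bfl.ml/ for the actual API.
API_BASE = "https://api.bfl.ml"
headers = {"x-key": os.environ["BFL_API_KEY"]}  # hypothetical auth header / env var

# Submit an asynchronous generation request for FLUX.1 [pro]
task = requests.post(
    f"{API_BASE}/v1/flux-pro",
    headers=headers,
    json={
        "prompt": "An emerald Emu riding on top of a white llama.",
        "width": 1024,
        "height": 1024,
    },
).json()

# Poll until the result is ready, then print the image URL
while True:
    result = requests.get(
        f"{API_BASE}/v1/get_result",
        headers=headers,
        params={"id": task["id"]},
    ).json()
    if result.get("status") == "Ready":
        print(result["result"]["sample"])  # assumed to be the generated image URL
        break
    time.sleep(1)
```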

The second is FLUX.1 [dev], an open-weight, non-commercial variant of FLUX.1 [pro], distilled directly from it. The model outperforms other image models such as Midjourney and Stable Diffusion 3. The inference code and weights are available on GitHub. The image below compares it with competing image models.

GitHub address: https://github.com/black-forest-labs/flux

The third is the open-source FLUX.1 [schnell], a highly efficient 4-step model released under the Apache 2.0 license. Its performance is very close to that of [dev] and [pro], and it can be used on Hugging Face.

Hugging Face address: https://huggingface.co/black-forest-labs/FLUX.1-schnell
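
Because the weights are hosted on Hugging Face, [schnell] can be run locally in a few lines. A minimal sketch, assuming a recent diffusers release that ships `FluxPipeline` and a CUDA GPU with enough memory; the 4 steps and zero guidance reflect the distilled few-step design:

```python
import torch
from diffusers import FluxPipeline

# Load the open-source 4-step model (Apache 2.0) from Hugging Face
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "An emerald Emu riding on top of a white llama.",
    num_inference_steps=4,  # [schnell] is distilled for very few steps
    guidance_scale=0.0,     # the timestep-distilled model needs no CFG
).images[0]
image.save("flux_schnell.png")
```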

Meanwhile, Black Forest Labs has also started promoting itself.

The next goal is to launch a SOTA text-to-video model available to everyone, so there is plenty to look forward to!

A Powerful Debut: The "FLUX.1" Series of Text-to-Image Models Arrives

The three models launched by Black Forest Labs this time all adopt a hybrid architecture of multimodal and parallel diffusion Transformer blocks. Unlike vendors that split a model family into "medium", "large", and "extra-large" tiers by parameter count, every member of the FLUX.1 family is uniformly scaled up to 12 billion parameters.

The research team used the Flow Matching framework to upgrade the previous SOTA diffusion models. From the notes in the official blog, it can be inferred that the team continued with the Rectified Flow + Transformer method they proposed in March this year, while still at Stability AI.

Paper link: https://arxiv.org/pdf/2403.03206.pdf
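
For context, the heart of that rectified-flow approach is a plain regression objective: pick a point on the straight line between a data sample and Gaussian noise, and train the network to predict the line's constant velocity. A minimal PyTorch sketch, where the `model(x_t, t)` interface and the shapes are assumptions:

```python
import torch

def rectified_flow_loss(model, x0):
    """One training step of the rectified-flow objective (sketch).

    model(x_t, t) is assumed to predict a velocity field with the
    same shape as x_t; x0 is a batch of data samples, shape (B, ...).
    """
    x1 = torch.randn_like(x0)                        # noise endpoint of the path
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)),
                   device=x0.device)                 # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    target = x1 - x0                                 # constant velocity of that line
    pred = model(xt, t.flatten())
    return torch.mean((pred - target) ** 2)
```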

They also introduced rotary position embeddings and parallel attention layers, which effectively improve image-generation quality and speed up inference on hardware.
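
To illustrate these two ingredients: rotary embeddings encode position by rotating pairs of query/key channels, and a parallel attention block feeds one normalized input to both the attention and MLP branches and sums their outputs instead of chaining them. A minimal PaLM-style sketch; the exact FLUX.1 variants are undisclosed, so every dimension and detail here is illustrative:

```python
import torch
import torch.nn as nn

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (..., seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(seq_len, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class ParallelBlock(nn.Module):
    """Parallel attention: the attention and MLP branches share one normalized
    input and their outputs are summed, rather than running sequentially."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm(x)
        # In a full implementation, apply_rope would rotate the per-head
        # queries/keys inside attention; nn.MultiheadAttention hides them.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)

q = apply_rope(torch.randn(2, 8, 16, 8))           # (batch, heads, tokens, head_dim)
y = ParallelBlock(dim=64, heads=8)(torch.randn(2, 16, 64))
```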

Black Forest Labs has not disclosed the models' technical details this time, but says a more detailed technical report will be published soon.

These three models set new standards in their respective fields. Whether in the aesthetics of the generated images, adherence to text prompts, flexibility across sizes/aspect ratios, or diversity of output formats, FLUX.1 [pro] and FLUX.1 [dev] surpass a series of popular image generation models such as Midjourney v6.0, DALL·E 3 (HD), and their former employer's SD3-Ultra.

FLUX.1 [schnell] is the most advanced few-step model to date, surpassing not only comparable competitors but also powerful non-distilled models such as Midjourney v6.0 and DALL·E 3 (HD).

The models have been specially fine-tuned to preserve the full output diversity of the pre-training phase, and compared with the current state of the art, the FLUX.1 series still leaves ample room for further improvement.

All FLUX.1 series models support a variety of aspect ratios and resolutions, from 0.1 to 2.0 megapixels.
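
To make that range concrete, the snippet below picks a width and height near a target megapixel budget for a given aspect ratio. Rounding to a multiple of 16 is an assumption for latent alignment, not a documented FLUX.1 constraint:

```python
import math

def dims_for(aspect, megapixels, multiple=16):
    """Return (width, height) near the target pixel count at the given aspect ratio."""
    pixels = megapixels * 1_000_000
    height = math.sqrt(pixels / aspect)
    width = aspect * height

    def snap(v):
        # Round to the nearest multiple (assumed alignment requirement)
        return max(multiple, round(v / multiple) * multiple)

    return snap(width), snap(height)

print(dims_for(16 / 9, 2.0))  # -> (1888, 1056), about 1.99 megapixels
print(dims_for(1.0, 0.1))     # -> (320, 320), about 0.10 megapixels
```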

Some quick-moving netizens have already rushed to try it out, and it seems that Black Forest Labs' repeated emphasis on being "the strongest" is not just self-promotion.

A simple prompt can produce an effect like this; if you look closely at the pattern on the llama's blanket, there is no distortion or deformation.

Prompt: An emerald Emu riding on top of a white llama.

If you were not told this is an AI-generated image, it would be quite difficult to tell whether it is a photo taken by a photographer.

Prompt: A horse is playing with two alligators at the river.

Images containing text are also handled with ease, and the depth of field is rendered to match the feel of a real lens.