AI Audio Large Models: Decoding the Technology Behind the Global Trend

Three Key Architectures Support Free Generation of 44.1kHz High-Quality Stereo Short Audio

Stable Audio Open is a text-to-audio model built from three main components:

Autoencoder: Compresses waveform data to manageable sequence lengths
T5-based text embedding: Encodes the text prompt that conditions generation
Transformer-based diffusion model (DiT): Operates in the latent space of the autoencoder
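
To make the autoencoder's role concrete, the following minimal sketch works out how long the raw waveform is for a 47-second clip at 44.1kHz and how short the sequence becomes after compression. The downsampling factor of 2048 is an assumption for illustration only, since this article does not state it. The function names are also hypothetical.

```python
SAMPLE_RATE_HZ = 44_100
MAX_SECONDS = 47

# Assumed for illustration only: the article does not give the
# autoencoder's temporal downsampling factor.
ASSUMED_DOWNSAMPLING = 2048


def waveform_samples(seconds: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples per channel for a clip of the given duration."""
    return seconds * sample_rate


def latent_frames(num_samples: int, downsampling: int = ASSUMED_DOWNSAMPLING) -> int:
    """Latent sequence length after temporal downsampling, rounded up."""
    return -(-num_samples // downsampling)  # ceiling division


samples = waveform_samples(MAX_SECONDS)
frames = latent_frames(samples)
print(f"{samples} samples per channel -> {frames} latent frames")
```

Under that assumed factor, the diffusion transformer operates on a sequence roughly two thousand times shorter than the raw waveform, which is why it runs in the autoencoder's latent space rather than on raw samples.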

As a variant of Stable Audio 2, Stable Audio Open differs in its training data and in parts of its architecture. It is trained on a completely different dataset and uses T5 for text conditioning instead of CLAP (Contrastive Language-Audio Pretraining).

Although it is open-source and free to use, Stable Audio Open cannot generate coherent full-length tracks and is not optimized for full songs, melodies, or vocals.

Stability AI states that Stable Audio Open focuses on creating audio samples and sound effects, and can generate high-quality 44.1kHz stereo audio up to 47 seconds long at no cost. The trained model is well suited to creating drum beats, instrument loops, ambient sounds, foley recordings, and other audio samples for music production and sound design.
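
As a rough sketch of how a request might respect the 47-second limit, the helper below builds a conditioning entry for one prompt and rejects durations outside the supported range. The helper name is hypothetical. The dictionary keys (`prompt`, `seconds_start`, `seconds_total`) are an assumption modeled on commonly shown usage of the model, not something this article specifies.

```python
MAX_SECONDS = 47  # upper bound stated for Stable Audio Open


def build_conditioning(prompt: str, seconds_total: float) -> list[dict]:
    """Build a one-item conditioning list, rejecting out-of-range durations.

    The key names are assumed for illustration; check the model's own
    documentation before relying on them.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 0 < seconds_total <= MAX_SECONDS:
        raise ValueError(f"seconds_total must be in (0, {MAX_SECONDS}]")
    return [{"prompt": prompt, "seconds_start": 0, "seconds_total": seconds_total}]


conditioning = build_conditioning("128 BPM tech house drum loop", 30)
print(conditioning)
```

Validating the duration up front surfaces the limit as a clear error rather than as a truncated or degraded clip.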

A key advantage of this open-source version is that users can fine-tune the model on their own audio data.

Training Process Emphasizes Copyright Protection

Amid the rapid development of generative AI, debates about the use of AI in the music industry are intensifying, especially around copyright.

Stability AI states that, to respect creators' copyrights, Stable Audio Open is trained on datasets from Freesound and Free Music Archive (FMA), and that all of the recordings used are released under Creative Commons (CC) licenses.

To avoid including copyrighted material, Stability AI says it used an audio tagger to identify music samples in Freesound, then sent the identified samples to Audible Magic, a content detection company, so that potentially copyrighted music could be removed from the dataset.
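
The filtering steps described above can be sketched as a simple two-stage check. Everything in this example is hypothetical: the `Recording` fields, the set of allowed licenses, and the sample catalog are illustrative stand-ins, not the actual dataset schema or Stability AI's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    # All fields are hypothetical; they are not the real dataset schema.
    identifier: str
    license: str                # license label attached to the recording
    tagged_as_music: bool       # output of an audio tagger
    flagged_copyrighted: bool   # result of an external content-detection check


# Example set of Creative Commons labels, assumed for illustration.
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-Sampling+"}


def keep(rec: Recording) -> bool:
    """Keep CC-licensed recordings; drop tagged music that was flagged."""
    if rec.license not in ALLOWED_LICENSES:
        return False
    if rec.tagged_as_music and rec.flagged_copyrighted:
        return False
    return True


catalog = [
    Recording("door-slam", "CC0", False, False),
    Recording("guitar-loop", "CC-BY", True, False),
    Recording("pop-chorus", "CC-BY", True, True),
    Recording("radio-rip", "All rights reserved", True, False),
]
kept = [rec.identifier for rec in catalog if keep(rec)]
print(kept)
```

Only samples the tagger marks as music are subject to the content-detection result here, mirroring the order of the steps in the description.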

Conclusion: Open-Source, Free Model Makes Text-to-Audio More Accessible

The launch of Stable Audio Open demonstrates Stability AI's progress in text-to-audio models. While the model is limited in the length and coherence of the audio it generates, its advantages are evident: it can generate high-quality 44.1kHz stereo audio for free and run on consumer-grade GPUs, lowering the barrier to using text-to-audio.

Stable Audio Open also shows how audio generation technology can be opened up while taking copyright protection into account. As the technology advances and ethical norms mature, the model is expected to find use in more application scenarios and to help audio generation technology develop and spread.

The Stable Audio Open model weights are currently available on Hugging Face, a machine learning model platform. Stability AI encourages sound designers, musicians, developers, and anyone interested in audio to explore the model's capabilities and provide feedback.
