01 The Popularity and Controversy of Suno AI
In late May 2024, Boston-based AI music company Suno announced a $125 million Series B funding round at a post-money valuation of $500 million, with its user base growing rapidly to more than 10 million. Tech giants such as Microsoft have even integrated Suno's AI music generation directly into their Copilot products.
Founded in 2022, Suno had only 12 employees before its Series B round. In March 2024, its text-to-music capabilities improved markedly and the product suddenly went viral, in what many called the "ChatGPT moment" for AI music.
However, Suno's success has also sparked controversy. In late June 2024, the Recording Industry Association of America (RIAA), acting on behalf of the three major record labels Sony, Universal, and Warner and their subsidiaries, sued Suno and another AI music application, Udio, for copyright infringement, seeking damages of up to $150,000 per infringed work.
The lawsuit reflects both the disruption AI music poses to the traditional music industry and the ongoing controversy over AI training data. Some industry insiders suspect that Suno trained on copyrighted music, noting that tech giants such as Google and Meta have yet to match Suno's results in AI music.
02 Breakdown of AI Music Models
2.1 First Layer: Compression and Codebooks
Roger Chen, Meta's Head of Music Technology, explains that machine learning has been applied to music for many years. The industry's shared understanding is that if music is defined as vibrations in the air with varying frequencies and amplitudes, then sound can be captured as electrical signals and digitized into discrete samples.
In AI music, various musical dimensions can be expressed as token sequences, including rhythm, tempo, harmony, tonality, sections, melody, lyrics, and vocal timbre. However, raw audio is extremely information-dense: at a standard 44.1 kHz sampling rate, a 3-minute song contains nearly 8 million sample points per channel. If each sample point corresponded to one token, model training would face an enormous sequence-length challenge.
It was not until a few years ago, when Meta and Google made breakthroughs in neural audio compression that can convert raw audio samples into far fewer tokens, with compression ratios of tens to hundreds of times, that the development of AI music began to accelerate.
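As a rough back-of-the-envelope illustration of those two claims, the sketch below estimates the raw sample count of a 3-minute song and the token count after codec-style compression. The 44.1 kHz sampling rate, the 75 frames-per-second token rate, and the 8-codebooks-per-frame figure are assumptions chosen to be in the typical range for neural audio codecs, not numbers taken from the article.

```python
# Back-of-the-envelope: raw audio samples vs. codec tokens for a 3-minute song.
# Assumed figures: 44.1 kHz mono sampling and a neural codec emitting ~75 token
# frames per second with 8 codebooks per frame (roughly the range reported for
# SoundStream/EnCodec-style codecs; treat these as illustrative, not exact).

SECONDS = 3 * 60                  # 3-minute song
SAMPLE_RATE = 44_100              # samples per second, per channel
FRAME_RATE = 75                   # codec token frames per second (assumed)
CODEBOOKS = 8                     # residual codebooks per frame (assumed)

raw_samples = SECONDS * SAMPLE_RATE
codec_tokens = SECONDS * FRAME_RATE * CODEBOOKS

print(f"raw samples:       {raw_samples:,}")                      # ~7.9 million
print(f"codec tokens:      {codec_tokens:,}")                     # ~108,000
print(f"compression ratio: {raw_samples / codec_tokens:.0f}x")    # ~74x
```

Even under these conservative assumptions, the token sequence shrinks from millions of raw samples to roughly a hundred thousand discrete tokens, which is what makes language-model-style training on whole songs tractable.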
Technologies such as Google's SoundStream and Meta's EnCodec can convert audio into tokens and reconstruct it into near-lossless audio. These technologies not only compress audio dramatically; the various musical dimensions (such as beat, tempo, chord progression, emotion, genre, instruments, lyrics, pitch, length, and singer style) can also be converted into tokens.
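To make the "codebook" idea concrete, here is a minimal, self-contained sketch of residual vector quantization (RVQ), the quantization mechanism used inside codecs such as SoundStream and EnCodec. The codebooks here are random and the frame dimension, codebook size, and number of stages are toy assumptions, so this is an illustration of the mechanism rather than the real models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each audio "frame" is a 16-dim latent vector from an encoder.
# Real codecs learn their codebooks; these are random, for illustration only.
FRAME_DIM = 16
CODEBOOK_SIZE = 1024   # each codebook index is one discrete token
NUM_CODEBOOKS = 4      # residual quantization stages

codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, FRAME_DIM))

def rvq_encode(frames):
    """Map continuous frames to token ids, one id per codebook stage."""
    residual = frames.copy()
    tokens = np.empty((NUM_CODEBOOKS, len(frames)), dtype=np.int64)
    for stage, book in enumerate(codebooks):
        # Nearest codebook entry for the current residual of every frame.
        dists = np.linalg.norm(residual[:, None, :] - book[None, :, :], axis=-1)
        ids = dists.argmin(axis=1)
        tokens[stage] = ids
        residual = residual - book[ids]   # the next stage quantizes what remains
    return tokens

def rvq_decode(tokens):
    """Reconstruct frames by summing the selected entry from every codebook."""
    frames = np.zeros((tokens.shape[1], FRAME_DIM))
    for stage, book in enumerate(codebooks):
        frames += book[tokens[stage]]
    return frames

frames = rng.normal(size=(10, FRAME_DIM))   # 10 toy encoder frames
tokens = rvq_encode(frames)                 # shape (4, 10): 4 token ids per frame
recon = rvq_decode(tokens)
print("token grid shape:", tokens.shape)
print("reconstruction error:", np.mean((frames - recon) ** 2))
```

Each frame of audio thus becomes a small grid of integer ids drawn from fixed codebooks, and the decoder recovers an approximation of the original frame by summing the corresponding codebook entries; with learned codebooks and enough stages, that approximation becomes close to lossless.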
Once these different modalities are converted into tokens, a unified large language model framework can be applied: the model learns the correspondence between conditioning modalities and audio tokens, forming the basis of a powerful AI music generation system.
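The sketch below illustrates one way conditioning tokens (genre, lyrics, mood) and audio codec tokens can be flattened into a single shared vocabulary and sequence for an ordinary next-token-prediction language model. The vocabulary layout, offsets, and special tokens are hypothetical choices for this example, not the format used by Suno or any published model.

```python
# Toy layout of a shared vocabulary for a decoder-only music language model.
# All names, sizes, and offsets are hypothetical, purely to show how different
# modalities can be flattened into one token sequence.

TEXT_VOCAB = 50_000            # tokens for lyrics / style prompts (assumed size)
AUDIO_VOCAB = 1024 * 4         # 4 codebooks x 1024 entries from a codec (assumed)
BOS, SEP, EOS = 0, 1, 2        # special tokens
TEXT_OFFSET = 3                # text token ids start after the special tokens
AUDIO_OFFSET = TEXT_OFFSET + TEXT_VOCAB  # audio ids live after the text ids

def build_training_sequence(prompt_ids, audio_token_ids):
    """Flatten a (conditioning, audio) pair into one id sequence.

    prompt_ids:      tokenized text describing genre, lyrics, mood, ...
    audio_token_ids: flattened codec token ids for the target audio clip
    """
    text_part = [TEXT_OFFSET + t for t in prompt_ids]
    audio_part = [AUDIO_OFFSET + t for t in audio_token_ids]
    return [BOS, *text_part, SEP, *audio_part, EOS]

# Example: a short prompt followed by one frame's worth of codec indices.
prompt = [12, 87, 955]          # e.g. "upbeat pop chorus" after text tokenization
audio = [301, 7, 512, 44]       # e.g. the four codebook indices of one frame
seq = build_training_sequence(prompt, audio)

# A decoder-only LM is then trained with ordinary next-token prediction:
inputs, targets = seq[:-1], seq[1:]
print(inputs)
print(targets)
```

Because the conditioning tokens always precede the audio tokens, the model learns to continue a prompt with a plausible stream of codec tokens, which the codec's decoder then turns back into audio.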
[To be continued...]