"Bigger and stronger" is now fiercely competing with "smaller and more refined".
Surpassing GPT-4 is no longer the only KPI. As large language models enter a critical period of market competition, companies cannot win users by showcasing technical prowess alone. They must also prove their models are more cost-effective: smaller models with equivalent performance, or higher performance at lower cost for the same parameter count.
In fact, this trend toward "downsizing" large models began brewing in the second half of last year.
Two companies changed the rules of the game. One is French AI startup Mistral AI, which shocked the industry last September when its 7 billion parameter model outperformed the 13 billion parameter Llama 2, earning it fame in the developer community. The other is Chinese AI startup ModelBest, which launched the even more compact edge model MiniCPM in February this year, surpassing Llama 2 13B with only 2.4 billion parameters.
Both startups are highly regarded in the developer community, with multiple models topping open-source rankings. ModelBest, incubated in Tsinghua University's Natural Language Processing Lab, drew attention this year when its multimodal model was "repackaged" by a top U.S. university team. Its original work has been recognized in academic circles at home and abroad, bringing pride to Chinese open-source AI models.
Apple has also been researching edge models better suited to phones since last year. OpenAI, which has long pursued growth through sheer scale, is a relatively unexpected new entrant. Last week's launch of the lightweight model GPT-4o mini means the large model leader is actively stepping down from its "pedestal" to follow the industry trend, using more affordable and accessible models to tap into a broader market.
2024 will be a crucial year for the "miniaturization" of large models!
I. The "Moore's Law" of the Large Model Era: Efficiency is Key to Sustainability
Large model development is currently stuck in a kind of inertia: the belief that brute force produces miracles.
In 2020, a paper by OpenAI confirmed a strong correlation between model performance and scale. Simply ingesting more high-quality data and training larger models would yield higher performance.
Following this simple but effective path, a global race to pursue ever-larger models has erupted in the past two years. This has sown the seeds of algorithmic hegemony, where only teams with ample funding and computing power have the capital to participate in the competition long-term.
Last year, OpenAI CEO Sam Altman revealed that training GPT-4 cost at least $100 million. Without a highly profitable business model in sight, even cash-rich tech giants would struggle to sustain such bottomless investments long-term. The ecosystem cannot tolerate such a money-burning game indefinitely.
The performance gap between top large language models is visibly narrowing. While GPT-4 remains firmly in first place, its benchmark test scores are not leagues ahead of Claude 3 Opus and Gemini 1.5 Pro. In some capabilities, models with tens of billions of parameters can even achieve superior performance. Model size is no longer the sole determining factor affecting performance.
It's not that top-tier large models lack appeal, but lightweight models offer better value for money.
The image below is an AI inference cost trend chart shared by AI engineer Karina Nguyen on social media in late March. It plots the performance of large language models on the MMLU benchmark against their cost since 2022: over time, language models achieve higher MMLU accuracy while the associated costs drop significantly. New models reach accuracy of around 80% at costs several orders of magnitude lower than a few years ago.
The world is changing rapidly, with a wave of economically efficient lightweight models launching in recent months.
"The race for large language model sizes is intensifying - backwards!" AI tech guru Andrej Karpathy bets: "We will see some very, very small models 'thinking' very well and reliably."
Model capability ÷ parameters involved in computation = knowledge density. This metric indicates how much intelligence a model delivers per parameter. GPT-3, released in June 2020, had 175 billion parameters. In February this year, ModelBest's MiniCPM-2.4B achieved equivalent performance with only 2.4 billion parameters, increasing knowledge density by about 86 times.
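The density comparison above can be sketched in a few lines of Python. This is only an illustration, assuming both models have equal capability so that the density gain reduces to a ratio of parameter counts; the function name `density_gain` is hypothetical and not taken from the source.

```python
def density_gain(baseline_params: float, compact_params: float) -> float:
    """Ratio of knowledge densities for two models ASSUMED to have equal capability.

    density = capability / parameters, so with equal capability the gain
    is simply baseline_params / compact_params.
    """
    if baseline_params <= 0 or compact_params <= 0:
        raise ValueError("parameter counts must be positive")
    return baseline_params / compact_params


# Headline parameter counts quoted in the article.
GPT3_PARAMS = 175e9      # GPT-3, June 2020
MINICPM_PARAMS = 2.4e9   # MiniCPM-2.4B, February 2024

gain = density_gain(GPT3_PARAMS, MINICPM_PARAMS)
print(f"{gain:.1f}x")  # prints 72.9x
```

With the headline parameter counts, the plain ratio comes to roughly 73 times. The figure of about 86 times quoted above may rest on a different parameter count or capability estimate; the source does not say.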
Based on these trends, Liu Zhiyuan, tenured associate professor in Tsinghua University's Department of Computer Science and chief scientist at ModelBest, recently proposed an interesting viewpoint: the era of large models has its own "Moore's Law".
Specifically, as data, computing power, and algorithms advance together, the knowledge density of large models continues to increase, doubling on average every 8 months.
By increasing circuit density on chips, computing devices with equivalent computing power evolved from supercomputers that filled several rooms to smartphones that fit in a pocket. The development of large models will follow a similar pattern. Liu named this guiding law the "ModelBest Law".
Following this trend, the capabilities of a 100 billion parameter model could be achieved by a 50 billion parameter model in 8 months, and by a 25 billion parameter model in another 8 months.
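The halving arithmetic above follows directly from a fixed doubling period. Here is a minimal sketch, assuming knowledge density doubles exactly every 8 months and capability is held constant; the function names `params_needed` and `months_until` are illustrative, not from the source.

```python
import math

DOUBLING_MONTHS = 8.0  # doubling period for knowledge density, as stated in the article


def params_needed(initial_params: float, months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Parameters needed for the same capability after `months`,
    assuming density doubles every `doubling_months`."""
    return initial_params / 2 ** (months / doubling_months)


def months_until(initial_params: float, target_params: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months until `target_params` suffice for what `initial_params` does today."""
    return doubling_months * math.log2(initial_params / target_params)


for months in (0, 8, 16):
    billions = params_needed(100e9, months) / 1e9
    print(f"after {months:2d} months: {billions:.0f}B parameters")
# after  0 months: 100B parameters
# after  8 months: 50B parameters
# after 16 months: 25B parameters
```

The same relation can be inverted with `months_until` to ask how long a given reduction would take under this assumed trend; it is an extrapolation of the claimed law, not a forecast.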
II. Multiple Fronts: Closed-Source Price Wars Heat Up, Open-Source Sees China-US-EU Tripartite Competition
The players now competing on lightweight large models fall into several camps.
OpenAI, Google, and Anthropic have taken the closed-source route. Their flagship models like GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro control the highest performance tier, with parameter scales reaching hundreds of billions or even trillions.
Their lightweight models are streamlined versions of these flagships. Following its launch last week, OpenAI's GPT-4o mini became the most cost-effective option under 10B on the market, outperforming Gemini Flash and Claude Haiku. It replaces GPT-3.5 as the free option for consumers while drastically cutting API prices for businesses, lowering the barrier to adopting large model technology.
Andriy Burkov, author of "Machine Learning Engineering", inferred from the pricing that GPT-4o mini has around 7B parameters. ModelBest CEO Li Dahai speculates that GPT-4o mini is a "wide MoE" model with numerous experts rather than an edge model, positioned as a highly cost-effective cloud model that greatly reduces the cost of deploying large models in industry.
The open-source lightweight model camp is much larger, with representative players from China, the US, and Europe.
In China, Alibaba, ModelBest, SenseTime, and Shanghai AI Laboratory have all open-sourced lightweight models. Among them, Alibaba's Qwen series serves as a benchmark for lightweight models.