When Brute Force No Longer Produces Miracles, Large Models Are Falling into a Technical Curse
Ever-larger parameter counts may not be the only answer for deploying large models. This view is gradually becoming a consensus in the large model industry.
The first bottleneck on the road of parameter scaling is NVIDIA, the biggest winner of this feast.
Recently, a research report from Meta showed that its latest 405B-parameter Llama 3 model suffered 419 unexpected interruptions in 54 days of training on a cluster of 16,384 NVIDIA H100 GPUs, a failure roughly every three hours on average. Worse, any single GPU failure could interrupt the entire training process and force the job to restart.
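The reported rate is easy to check against the article's own numbers:

```python
# Sanity check: 419 interruptions over 54 days of training.
hours = 54 * 24                 # 1,296 hours in total
print(round(hours / 419, 2))    # 3.09 -> roughly one failure every three hours
```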
In simple terms, the parameter size of today's large models is approaching the limit of what hardware can support. Even an endless supply of GPUs can no longer solve the compute problem of large-model training. If the industry keeps rushing down the path of parameter expansion, training will become endless repetition, like Sisyphus pushing his boulder.
Hardware has thus raised the difficulty of scaling large models further. And in specific scenarios, intelligence is no longer proportional to parameter count, which, from a practical standpoint, puts a big question mark over the thrill of brute force.
The scenarios for large models are becoming increasingly complex, specialized, and fragmented. It is almost unthinkable to expect one model to both answer general knowledge questions and solve professional domain problems.
A favorite comparison dimension among Chinese large-model vendors: pitting their models against GPT-4 on poetry appreciation and silly jokes. Almost without exception, regardless of model size, and regardless of whether the model is merely a shell over an open-source one, the domestic models all beat the "world's number one". Yet on the most basic literary common-sense question, such as the relationship between Lu Xun and Zhou Shuren (one and the same person, the former being a pen name), even the best large model is no match for the most traditional search engine.
Returning to practical applications, the impossible triangle of commercialization, balancing intelligence, speed, and cost, has poured a bucket of cold water on the parameter believers.
In real applications, beyond the model's intelligence level, product managers must weigh two more factors: speed and cost. Generally speaking, Q&A responses within one second, 99% accuracy, and a business model that at least breaks even are the necessary conditions for a large model product to survive.
However, using the large parameter approach to increase intelligence often means that the higher the intelligence level, the slower the product's response speed and the higher the cost, and vice versa.
If parameters are allowed to expand without limit, AI will inevitably become a war of capital, but the cost of expansion far exceeds any equivalent stage of business competition in history... For players who have already stepped on the gas, the only way to avoid losing too badly is to raise the stakes to a level that opponents can't match.
Thus, facing a ceiling that is now dimly visible, the industry's focus has begun to shift: if the omnipotent model does not exist and brute force produces no miracles, where should the industry go?
The Model T Moment for Large Models: CoE or MoE?
With a single large model unable to handle general and professional tasks at once, joint division of labor among multiple models has become the main theme of the industry's second act.
In 1913, Ford creatively brought the slaughterhouse's "disassembly line" into the automotive industry, building the world's first moving assembly line. Car production thereby moved from craftsmen's manual assembly to an industrial process, compressing the production time of a car by nearly 60 times and cutting the selling price by more than half. Car manufacturing entered a new era.
The same Model T moment is happening in the large model industry.
Take translation, one of the most typical scenarios. A good translation should reach three levels: faithfulness, expressiveness, and elegance. In the world of large models, however, conventional translation models deliver only faithfulness; expressiveness and elegance depend on writing-oriented models to complete.
However, regarding how to divide labor among multiple models, the industry is divided into clearly defined vertical and horizontal factions.
The vertical faction's technical approach is MoE.
MoE (Mixture-of-Experts) combines multiple domain-specific expert models into one super model. As early as 2021, Google proposed the MoE large model Switch Transformer: with 1,571B parameters, it showed higher sample efficiency in pre-training than the dense T5-XXL (11B) model, achieving better accuracy without a matching increase in compute, since only a fraction of the parameters is active for any given token.
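To make the routing mechanism concrete, here is a minimal sketch of a Switch-style top-1 MoE layer in PyTorch. The layer sizes, the expert count, and the SwitchMoE name are illustrative assumptions rather than the paper's actual code; real implementations add load-balancing losses and expert-capacity limits.

```python
# Minimal sketch of a Switch-style MoE layer: a router picks one expert
# per token, so only a fraction of the parameters is active per forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)      # routing probabilities
        weight, idx = probs.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):    # dispatch tokens to experts
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = SwitchMoE(d_model=64, d_ff=256, num_experts=8)
print(moe(torch.randn(10, 64)).shape)                # torch.Size([10, 64])
```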
Moreover, the well-known American hacker George Hotz and PyTorch co-creator Soumith Chintala have each stated that GPT-4 is itself a combination of eight 220B-parameter expert models, roughly 1,760B parameters in total rather than, strictly speaking, a "single" trillion-parameter model.
However, this eight-in-one approach also means that designing an MoE, and every upgrade iteration after it, consumes enormous resources. It is like mountaineering: the difficulty of summiting the 8,848 m Mount Everest is far more than the sum of climbing the 1,108 m Yandang Mountain eight times. As a result, those able to play are usually AI leaders holding an absolute advantage on all eight fronts.
As MoE gradually becomes a game for oligarchs, a new technical approach comes to the forefront - the horizontal faction's CoE.
CoE (Collaboration-of-Experts) is an expert-collaboration architecture. Put simply, a single entry point connects to multiple models at once; before any model is invoked, the entry point runs an intent-recognition stage and then distributes the task, deciding which model, or which combination of models, should handle it. Compared with MoE, CoE's biggest advantage is that the expert models can collaborate without being bound to one another.
Relative to MoE, CoE offers richer collaboration between expert models, a more precise division of labor, and greater flexibility and specialization. The result is higher efficiency and lower API and token costs.
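A hedged sketch of CoE-style routing follows; the classify_intent helper and the expert names are hypothetical placeholders, and the point is simply that routing happens once, at the entry point, before any expert model runs.

```python
# Minimal CoE sketch: recognize intent first, then dispatch the request to
# one expert model or a pipeline of several. All names are illustrative.
from typing import Callable

EXPERTS: dict[str, Callable[[str], str]] = {
    "translation": lambda q: f"[translation model] {q}",
    "writing":     lambda q: f"[writing model] {q}",
    "general":     lambda q: f"[general model] {q}",
}

def classify_intent(query: str) -> list[str]:
    """Stand-in for the intent-recognition stage: returns an expert pipeline."""
    if "translate" in query.lower():
        # Faithfulness from the translation model, polish from the writing model.
        return ["translation", "writing"]
    return ["general"]

def coe_answer(query: str) -> str:
    result = query
    for name in classify_intent(query):  # experts collaborate, but are not bound
        result = EXPERTS[name](result)
    return result

print(coe_answer("Translate this poem into English"))
```

Note how the earlier translation scenario maps onto the pipeline: the translation expert supplies faithfulness, the writing expert adds expressiveness and elegance.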
So, which approach will have the upper hand, MoE or CoE?
Another Problem-Solving Approach: What Determines the User's Intelligent Experience?
While Zhou Hongyi was reinventing himself as an AI evangelist in his signature red attire, the CoE-versus-MoE debate was staged again and again inside 360 over the past year and more.
Had it taken the MoE route, 360's years of technical accumulation would have been enough to fight that battle.
Choosing CoE would mean sharing the pie with more large model manufacturers.
The saying "Three cobblers with their wits combined equal Zhuge Liang the master mind" inspired Liang Zhihui, Vice President of 360 Group, to bet on CoE:
Even a company that achieves across-the-board excellence, as OpenAI has, will inevitably have weak spots. But if the capabilities of the best large-model companies are combined through CoE, the result is genuine complementarity: true mastery of all eighteen martial arts, as the idiom goes.
Evaluation results show that the beta version of the AI assistant built on 360's CoE capabilities, after drawing on the strengths of 16 of China's strongest large models, 360 Zhinao included, has surpassed GPT-4 on 11 individual capability metrics.
At the same time, even while "outsourcing" the underlying large-model capabilities, 360 can still find a unique position for itself in the CoE wave.
From a product perspective, 360's CoE product, the AI assistant, can be split into two parts. Corpus accumulation and algorithmic capability come mainly from the 16 connected domestic large models, 360 Zhinao among them, which act like special forces with different specialties. 360 itself plays the commander: an intent-recognition model delivers a more accurate reading of user intent, while a task-decomposition and scheduling model intelligently orchestrates a network of expert models (100+ LLMs), hundred-billion-scale knowledge hubs, and 200+ third-party tools, achieving flexibility and efficiency beyond what MoE offers.
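As a rough illustration of that two-stage design (every name below is hypothetical; this is not 360's actual code), the commander's pipeline might look like this:

```python
# Hedged sketch of the described pipeline: an intent model picks the route,
# then a scheduler decomposes the task and assigns each subtask to an
# expert LLM or a third-party tool.
from dataclasses import dataclass

@dataclass
class Subtask:
    kind: str      # "llm" or "tool"
    name: str      # which expert model or tool handles it
    payload: str

def recognize_intent(query: str) -> str:
    """Stage 1: stand-in for the intent-recognition model."""
    return "travel_planning" if "trip" in query.lower() else "qa"

def decompose(intent: str, query: str) -> list[Subtask]:
    """Stage 2: stand-in for the task-decomposition and scheduling model."""
    if intent == "travel_planning":
        return [
            Subtask("tool", "weather_api", query),   # one of the 200+ tools
            Subtask("llm", "itinerary_llm", query),  # one of the 100+ expert LLMs
        ]
    return [Subtask("llm", "general_llm", query)]

def run(query: str) -> list[str]:
    intent = recognize_intent(query)
    return [f"{t.kind}:{t.name} handles {t.payload!r}" for t in decompose(intent, query)]

print(run("Plan a weekend trip to Hangzhou"))
```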