The pervasive smoke of battle actually obscures a fact: unlike the many big companies burning money on subsidies, DeepSeek is profitable.
Behind this is DeepSeek's across-the-board innovation in model architecture. It proposed a brand-new MLA (multi-head latent attention) architecture that cut memory usage to 5%-13% of that of MHA, the most commonly used attention architecture. Its original DeepSeekMoE sparse structure likewise drove computational costs down to a minimum. Together, these are what ultimately reduced its costs.
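To make the comparison concrete, here is a minimal, illustrative sketch of MLA's core idea: where MHA caches full per-head keys and values for every token, MLA caches one small shared latent vector per token and re-expands K and V from it at attention time. All dimensions and weight names below are assumptions chosen for illustration, not DeepSeek's actual configuration; the real MLA in the DeepSeek-V2 paper also handles rotary position embeddings with a decoupled scheme that is omitted here.

```python
# Illustrative sketch of multi-head latent attention's KV-cache compression.
# Sizes are made up for this example, not taken from DeepSeek-V2.
import torch

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

# Down-projection to the shared latent (its output is what gets cached),
# plus up-projections used to rebuild per-head keys and values on the fly.
W_dkv = torch.randn(d_model, d_latent) / d_model ** 0.5
W_uk = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5
W_uv = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5

x = torch.randn(8, d_model)  # hidden states for 8 tokens

# MHA would cache K and V per head: 2 * 16 * 64 = 2048 floats per token.
# MLA caches only c_kv: 128 floats per token, ~6% of MHA's footprint,
# in the same ballpark as the 5%-13% figure above.
c_kv = x @ W_dkv  # the only tensor kept in the KV cache
k = (c_kv @ W_uk).view(8, n_heads, d_head)  # rebuilt at attention time
v = (c_kv @ W_uv).view(8, n_heads, d_head)
print(c_kv.shape, k.shape, v.shape)
```

The memory saving comes entirely from caching the 128-dimensional latent instead of the 2048 per-head key/value floats; the up-projections trade a little extra compute at attention time for that smaller cache.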
In Silicon Valley, DeepSeek is called a "mysterious force from the East". The chief analyst at SemiAnalysis believes that the DeepSeek V2 paper "may be the best one this year". Former OpenAI employee Andrew Carr thinks the paper is "full of amazing wisdom" and applied its training setup to his own models. Jack Clark, former policy director at OpenAI and co-founder of Anthropic, believes that DeepSeek "has hired a bunch of unfathomable geniuses" and thinks that Chinese-made large models "will become a force to be reckoned with, just like drones and electric vehicles."
This is a rare occurrence in an AI wave whose storylines are mostly driven by Silicon Valley. Several industry insiders told us that the strong response stems from innovation at the architectural level, an attempt rarely made by Chinese large-model companies, or even by open-source foundation model teams worldwide. One AI researcher said the Attention architecture has hardly been successfully modified in the years since it was proposed, let alone validated at scale. "It's the kind of idea that gets killed at the decision-making stage, because most people lack the confidence."
On the other hand, Chinese large-model companies rarely ventured into architectural innovation before, partly because few actively tried to break a stereotype: the United States is better at 0-to-1 technological innovation, while China is better at 1-to-10 application innovation. Besides, the behavior looks uneconomical: a new generation of models will appear within a few months anyway, so Chinese companies need only follow along and do applications well. Innovating on model structure means there is no path to follow; it entails many failures and enormous costs in time and money.
DeepSeek is clearly swimming against the tide. Amid the clamor that large-model technology will inevitably converge, and that following is the smarter shortcut, DeepSeek values the accumulation gained through "detours" and believes that Chinese large-model entrepreneurs can join the global torrent of technological innovation, not just application innovation.
Many of DeepSeek's choices differ from everyone else's. As of now, among the seven Chinese large-model startups, it is the only one that has abandoned the "do it all" approach to focus on research and technology, without building consumer-facing applications. It is also the only one that has not fully weighed commercialization, firmly choosing the open-source route without even raising funds. These choices mean it is often left out of the conversation, yet on the other end, it is often spread by spontaneous word of mouth among users in the community.
How exactly did DeepSeek come to be? To find out, we interviewed DeepSeek founder Liang Wenfeng, who rarely makes public appearances.
This post-80s founder, who has been quietly researching technology behind the scenes since the High-Flyer era, continues his low-key style in the DeepSeek era: like all of his researchers, he spends every day "reading papers, writing code, and participating in group discussions".
Unlike many quantitative fund founders, who tend to have overseas hedge-fund experience and backgrounds in physics, mathematics, and the like, Liang Wenfeng's background is entirely domestic: in his early years he studied artificial intelligence in the Department of Electronic Engineering at Zhejiang University.
Several industry insiders and DeepSeek researchers told us that Liang Wenfeng is a very rare figure in China's AI field today: someone who "combines strong infra engineering ability with model research ability, and can also mobilize resources", who "can make precise judgments from a high level while outperforming front-line researchers in the details". He has a "terrifying ability to learn", yet at the same time "doesn't look like a boss at all; he's more like a geek".
This is a particularly rare interview. In it, this technological idealist offered a voice that is especially scarce in China's tech world today: he is one of the few who puts the question of right and wrong before the question of gain and loss, who reminds us to see the inertia of the times and to put "original innovation" on the agenda.
A year ago, when DeepSeek first entered the scene, we interviewed Liang Wenfeng for the first time: "Crazy High-Flyer: A Hidden AI Giant's Path to Large Models". If the resolve to "be crazily ambitious and crazily sincere" was still a beautiful slogan back then, a year on, it has become action.
The following is the conversation:
### How was the first shot of the price war fired?
"Undercurrent": After the release of the DeepSeek V2 model, it quickly triggered a bloody price war for large models. Some say you are a catfish in the industry.
Liang Wenfeng: We didn't intend to become a catfish; we just accidentally became one.
"Undercurrent": Did this result surprise you?
Liang Wenfeng: Very surprising. We didn't expect everyone to be so sensitive about price. We were just doing things at our own pace, then calculated our costs and set the price accordingly. Our principle is neither to sell at a loss nor to take excessive profits. The current price leaves a small margin above cost.
"Undercurrent": Five days later, Zhipu AI followed suit, then ByteDance, Alibaba, Baidu, Tencent and other big companies.
Liang Wenfeng: Zhipu AI lowered the price of an entry-level product, while its models at the same level as ours are still very expensive. ByteDance was the first to really follow: it cut its flagship model's price to match ours, which then triggered the other big companies to cut prices one after another. Because the big companies' model costs are much higher than ours, we never expected anyone to lose money on this, and it eventually turned into the money-burning subsidy logic of the Internet era.
"Undercurrent": From the outside, price cuts look like grabbing users, as price wars in the Internet era usually do.
Liang Wenfeng: Grabbing users is not our main goal. We cut prices partly because, in exploring the structure of the next-generation model, our costs came down first; and partly because we feel that both APIs and AI should be inclusive and affordable for everyone.
"Undercurrent": Before this, most Chinese companies would directly copy this generation's Llama structure to do applications. Why did you cut in from the model structure?
Liang Wenfeng: If the goal is to build applications, then taking the Llama structure and shipping products quickly is a reasonable choice. But our destination is AGI, which means we need to study new model structures to achieve stronger model capability with limited resources. This is part of the foundational research required to scale up to larger models. Beyond model structure, we have done a lot of other research, including how to construct data and how to make models more human-like, all of which is reflected in the models we release. Besides, the Llama structure is probably two generations behind the state of the art abroad in training efficiency and inference cost.
"Undercurrent": Where does this generational gap mainly come from?
Liang Wenfeng: First, there is a gap in training efficiency. We estimate that the best domestic level may be a factor of two behind the best abroad in model structure and training dynamics, meaning we must consume twice the computing power to achieve the same result on this count alone. There may also be a factor-of-two gap in data efficiency, meaning we must consume twice the training data and computing power to achieve the same result. Multiplied together, that is four times the computing power. What we need to do is keep narrowing these gaps.
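As a back-of-the-envelope restatement of that arithmetic (the factors of two are Liang's rough estimates, not measured values), the two gaps compound multiplicatively:

```python
# Rough compounding of the two estimated gaps Liang describes.
structure_gap = 2.0  # model structure and training dynamics: ~2x compute
data_gap = 2.0       # data efficiency: ~2x data, hence ~2x compute again
total_gap = structure_gap * data_gap
print(f"~{total_gap:.0f}x computing power for the same result")  # ~4x
```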
"Undercurrent": Most Chinese companies choose to do both models and applications. Why does DeepSeek currently choose to only do research and exploration?
Liang Wenfeng: Because we feel that the most important thing now is to participate in the global wave of innovation. For many years, Chinese companies were used to others doing the technological innovation while we took it and monetized it through applications, but that should not be taken for granted. In this wave, our starting point is not to make money, but to go to the frontier of technology and push the development of the whole ecosystem.
"Undercurrent": The inertial cognition left to most people in the Internet and mobile Internet era is that the United States is good at technological innovation, while China is better at applications.
Liang Wenfeng: We believe that as the economy develops, China should gradually become a contributor rather than a perpetual free-rider. Over the past 30-plus years of the IT wave, we basically did not participate in real technological innovation. We grew accustomed to Moore's Law falling from the sky: lie at home, and every 18 months better hardware and software appear. The Scaling Law is being treated the same way.
But in fact, this was created by generations of tireless effort in the Western-led technology community. It is only because we were not part of that process that we have ignored its existence.