Llama 3.1 leak: Performance surpasses GPT-4, cost only one-tenth?

Meta's AI model Llama has once again experienced a leak incident, drawing attention from the open-source community. Despite repeated leaks, Llama continues to adhere to its open-source approach, but this strategy faces challenges. The incident highlights the conflict between AI technology openness and security, and also prompts reflection on the management and protection of open-source models.

Llama 3.1's performance is comparable to OpenAI's GPT-4o!

Some AI bloggers hailed the release of Llama 3.1 as another day that would change the fate of the AI world.

The leaked benchmark results show that Llama 3.1 comes in 8B, 70B, and 405B sizes. Even the mid-sized 70B model performs on par with GPT-4o in many respects.

Some netizens pointed out that, based on these benchmarks, Llama 3.1 405B ≈ GPT-4o, while Llama 3.1 70B would become the first lightweight model to beat OpenAI, in effect a "GPT-4o mini".

However, many who downloaded the model to try it out found that the leaked Llama 3.1 405B weighs in at about 820GB in total, nearly three times the memory Llama 2 needs at full precision (about 280GB).

This means that unless they have a mining-rig setup at home and can afford enough GPUs, individual developers will find it difficult to run Llama 3.1 on their own machines. Some netizens speculate that Llama 3.1 is aimed not at individuals but at institutions and enterprises.
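As a rough sanity check on that 820GB figure, here is a minimal back-of-envelope sketch. The 2-bytes-per-parameter assumption (bf16/fp16 weights) is mine, not from the leak, and the estimate ignores activations, KV cache, and framework overhead.

```python
# Rough back-of-envelope: memory needed just to hold the weights.
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights

def weight_memory_gb(num_params_billions: float) -> float:
    """Approximate weight footprint in GB (1 GB = 10^9 bytes)."""
    return num_params_billions * 1e9 * BYTES_PER_PARAM / 1e9

for name, size_b in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB of weights")
# 405B comes out around 810 GB, consistent with the ~820 GB of leaked files,
# and far beyond what a single consumer GPU can hold.
```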

The yet-to-be-announced Llama 3.1 has also had some cold water poured on it. Many netizens complained that Llama 3.1's GPU requirements are too high, making OpenAI's GPT-4o mini more cost-effective by comparison.

According to the leaked model information, Llama 3.1 iterates on Llama 3 (released April 19, 2024) in several areas, including a longer context window, multilingual input and output, and possible integration with developer and third-party tools.

Training data: Llama 3.1 was trained on 15T+ tokens from publicly available sources, with fine-tuning data that includes publicly available instruction-tuning datasets (unlike Llama 3!) and over 25 million synthetically generated examples.

Multilingual conversation: Llama 3.1 supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish and Thai. While Chinese is unfortunately not included, developers can fine-tune the Llama 3.1 model for languages beyond the 8 supported ones.

Context window: The context length of every version has been expanded from 8k to 128k tokens, roughly equivalent to the model being able to remember, understand and process about 96,000 words at a time, almost an entire original Harry Potter book.
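That ~96,000-word figure follows a common rule of thumb of roughly 0.75 English words per token; the ratio in the sketch below is an assumption and varies by tokenizer and language.

```python
# Rule-of-thumb conversion from context length in tokens to English words.
WORDS_PER_TOKEN = 0.75  # assumption; depends on tokenizer and language

context_tokens = 128_000
print(f"~{context_tokens * WORDS_PER_TOKEN:,.0f} words")  # ~96,000 words
```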

Many netizens were eager to pit Llama 3.1 against its "predecessors" and found that not only have the benchmark numbers improved significantly, it also uses compute more efficiently.

Based on netizen testing, Llama 3.1 shows significant improvements in capabilities compared to Llama 3. In particular, human_eval and truthfulqa_mc1 capabilities have improved noticeably, meaning stronger code generation abilities and more truthful question answering.

At the same time, Llama 3's instruct model shows clear improvements over the base model in areas such as prompt learning, in-context learning, and parameter-efficient fine-tuning.

This is reasonable, as base models are typically not fine-tuned for specific tasks, while instruct models are specially trained to follow instructions or complete specific tasks. Usually, instruct models perform better on metrics.

This makes people even more excited for the official release of Llama 3.1. The current leaked Llama 3.1 model tests only target the base model, while the instruct model may perform even better!

Surprisingly, in the benchmark results, the Llama 3.1 70B model matches or beats GPT-4o, while the Llama 3.1 8B model performs close to the Llama 3 70B model. Some netizens speculate this may have used model distillation techniques, where the 8B and 70B models are simplified versions derived from the largest 405B model, making the large model "smaller".

Model distillation can be seen as students learning from teachers. The large and powerful model (teacher model) is the teacher, while the smaller and simpler model (student model) is the student. The student model learns by "imitating" the teacher model, trying to make its output as close as possible to the teacher model's output, thereby learning similar knowledge and capabilities.

After distillation training, the student model can reduce model size and computational resource requirements while maintaining high performance and comparable accuracy.
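The leaked material does not say how Meta actually trained the smaller models, so the following is only a generic sketch of knowledge distillation in PyTorch: the student is pushed to match the teacher's temperature-softened output distribution via KL divergence, alongside the usual cross-entropy on the true labels. The function name, temperature, and weighting here are illustrative, not Meta's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss (illustrative, not Meta's method).

    Combines KL divergence between temperature-softened teacher and student
    distributions with standard cross-entropy on the ground-truth labels.
    """
    # Soft targets: how closely the student imitates the teacher.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: the usual supervised loss on true labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce

# Usage sketch: inside a training loop, a frozen teacher produces
# teacher_logits for the same batch, and only the student is updated.
```

The alpha weight trades off imitating the teacher against fitting the labels directly; raising the temperature spreads the teacher's probability mass so the student can learn from the relative ranking of wrong answers, not just the top prediction.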

It's still unknown whether Llama 3.1 will be open-sourced as hoped. But even if open-sourced, you'll still need deep pockets to afford using Llama 3.1.

The basic ticket of entry for running Llama 3.1 is having enough GPUs.

The leaked files show that training Llama 3.1 405B took 30.84M GPU hours on H100-80GB hardware. In other words, on a single H100-80GB, training Llama 3.1 405B would take 30.84 million hours, around 3,500 years!

For private deployment, a company that wanted to complete that amount of compute within a month would need to stockpile at least 43,000 H100-80GBs. At $40,000 per H100, the entry ticket for Llama 3.1 405B-scale computing power would be as high as $1.7 billion, roughly 12.5 billion RMB.
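Those fleet-size and cost figures are simple arithmetic on the leaked 30.84M GPU-hour number; a minimal sketch, assuming a 30-day window and the article's $40,000-per-H100 price (not official pricing):

```python
# Back-of-envelope on the leaked training-compute figure.
gpu_hours = 30.84e6          # H100-80GB hours reported in the leaked files
hours_per_year = 24 * 365

# On a single H100, the same compute would take millennia.
print(f"Single H100: ~{gpu_hours / hours_per_year:,.0f} years")      # ~3,500 years

# To fit the work into one month, divide by the hours in 30 days.
gpus_for_one_month = gpu_hours / (30 * 24)
print(f"GPUs for one month: ~{gpus_for_one_month:,.0f}")             # ~43,000 H100s

# At an assumed $40,000 per H100, the hardware bill alone:
cost_usd = gpus_for_one_month * 40_000
print(f"Hardware cost: ~${cost_usd / 1e9:.1f} billion")              # ~$1.7 billion
```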

The good news is that Llama 3.1's inference costs may be cheaper.

According to Artificial Analysis predictions, the cost of processing 1 million tokens with Llama 3.1 405B will be cheaper than similar quality frontier models (GPT-4o and Claude 3.5 Sonnet), offering better cost-effectiveness.

In addition, some netizens speculate from the source code that Llama 3.1 405B may become a membership product requiring payment for use. However, the real situation remains to be seen in the official release.