Open Source AI's New Champion: Llama 3.1 Leak Surpasses GPT-4o

Meta's Llama 3.1 model has once again leaked ahead of its official release, sparking discussion in the developer community. The largest model in the leak reaches 405B parameters, and the 8B and 70B versions have also been upgraded. The full set of model files totals roughly 820GB. Preliminary benchmark results are impressive, and download links are already circulating widely online.

Llama 3.1 Family, Launching Tomorrow

According to the leaked model card, Llama 3.1 will be released on the 23rd.

The license is listed as a custom commercial license, the "Llama 3.1 Community License".

Specifically, the Llama 3.1 series of multilingual large language models is a set of pre-trained and instruction-tuned generative models, including 8B, 70B and 405B parameter scales.

The instruction-tuned Llama 3.1 text-only models (8B, 70B, 405B) are optimized for multilingual conversational use cases.

In addition to English, it supports seven other languages: German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

According to the introduction, Llama 3.1's new capabilities include longer context, support for multilingual input and output, and integration with developer and third-party tools.

Benchmark Tests

A benchmark chart posted to GitHub (the page now returns a 404) showed Llama 3.1 performing strongly across a range of evaluations.

Specifically, in benchmark evaluations of pre-trained models, Llama 3.1 405B set new records in general tasks, knowledge reasoning, and reading comprehension.

The improvements were most notable in the MMLU and SQuAD sub-benchmarks.

Meanwhile, the 8B and 70B parameter versions of Llama 3.1 showed slight improvements compared to Llama 3. However, on some metrics, the 70B Llama 3.1 still underperformed its predecessor.

Among the instruction-tuned models, Llama 3.1 405B is clearly stronger than its pre-trained counterpart, and it significantly outperforms the fine-tuned 8B and 70B versions on reasoning, coding, math, tool use, and multilingual benchmarks.

The Llama 3.1 8B and 70B fine-tuned models also show substantial performance improvements across multiple capability tasks.

Some netizens compiled benchmark results for other leading models and found that, in this comparison, Claude 3.5 Sonnet leads on most benchmarks.

The fine-tuned version of Llama 3.1 405B comes out on top only on the MMLU Pro benchmark, where its score of 73.3% beats all the other large models.

Additionally, the 405B model is roughly on par with GPT-4o on GPQA (graduate-level expert knowledge and reasoning), math, DROP (reading comprehension), MGSM (multilingual math), HumanEval (code generation), and BBH (challenging reasoning tasks).

Moreover, 405B significantly outperforms the latest GPT-4o mini model.

Llama 3.1 is an autoregressive language model built on an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align the model with human preferences for helpfulness and safety.
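For readers who want to try the model once weights are officially available, a minimal sketch of autoregressive generation via Hugging Face transformers might look like the following; the repository id is an assumption for illustration, not a confirmed release path:

```python
# Minimal sketch of autoregressive (next-token) generation with an
# instruction-tuned Llama-style checkpoint via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id, for illustration only

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain grouped-query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model predicts one token at a time, conditioning on everything
# generated so far -- this is what "autoregressive" means in practice.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```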

For the Llama 3.1 series models, token counts refer only to pre-training data.

All model versions use grouped-query attention (GQA) to improve inference scalability.
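As a rough illustration of how GQA differs from standard multi-head attention, here is a minimal PyTorch sketch in which several query heads share a single key/value head, which shrinks the KV cache at inference time; the head counts and dimensions are illustrative, not Llama 3.1's actual configuration:

```python
# Minimal sketch of grouped-query attention (GQA): several query heads share
# one key/value head, so the KV cache holds only n_kv_heads entries per layer.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2              # 4 query heads per KV head
group_size = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head so it lines up with its group of query heads.
k = k.repeat_interleave(group_size, dim=1)   # -> (batch, n_q_heads, seq, dim)
v = v.repeat_interleave(group_size, dim=1)

# Standard scaled dot-product attention on the expanded tensors.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```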

15T Token Training Data

Like Llama 3, Llama 3.1 was pre-trained on approximately 15 trillion tokens from publicly available sources.

Fine-tuning data includes publicly available instruction datasets as well as over 25 million synthetically generated samples. The pre-training data has a cutoff of December 2023.

Available for Both Commercial and Research Use

Llama 3.1 supports both commercial and research use in multilingual environments.

The instruction-tuned text-only models are suitable for chat assistants, while pre-trained models can adapt to various natural language generation tasks. The Llama 3.1 model collection also supports using its model outputs to improve other models, including synthetic data generation and model distillation.
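As a rough sketch of the synthetic-data workflow this permits, the snippet below has a larger "teacher" model answer a list of prompts and saves the resulting pairs as a dataset that a smaller "student" model could be fine-tuned on; the model id is a placeholder assumption, and the student's fine-tuning step itself is omitted:

```python
# Sketch of synthetic data generation with a "teacher" model.
import json
from transformers import pipeline

# Assumed repo id, for illustration only.
teacher = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-70B-Instruct")

prompts = [
    "Summarize the idea of grouped-query attention in two sentences.",
    "Write a short Python function that reverses a string.",
]

records = []
for prompt in prompts:
    # "generated_text" contains the prompt followed by the model's continuation.
    result = teacher(prompt, max_new_tokens=128, do_sample=True)[0]["generated_text"]
    records.append({"prompt": prompt, "response": result[len(prompt):].strip()})

# Save (prompt, response) pairs as JSONL for later supervised fine-tuning
# or distillation of a smaller model.
with open("synthetic_sft_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```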

Uses that violate applicable laws or regulations, the acceptable use policy, or the Llama 3.1 Community License, as well as uses in languages beyond those officially supported, are out of scope.

The team emphasizes that Llama 3.1 was trained on a broader set of languages beyond the 8 supported ones. Developers can fine-tune it for use in other languages, provided they comply with policies like the community license and ensure safe and responsible use.

39.3 Million GPU Hours of Training

For pre-training, Meta used custom training libraries, Meta's custom GPU clusters, and production infrastructure. Fine-tuning, annotation, and evaluation were also conducted on production infrastructure.

Training cumulatively used 39.3 million GPU hours of compute time, with H100-80GB (700W TDP) as the hardware type.

Training time is the total GPU time required to train each model, and power consumption is the peak power capacity of each GPU device, adjusted for power usage effectiveness (PUE).
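As a back-of-envelope illustration of what those figures imply, multiplying the quoted GPU hours by the per-GPU peak power gives the raw GPU energy, which is then scaled up by a data-center efficiency factor; the PUE value below is an assumption for illustration, since the text does not state Meta's actual figure:

```python
# Back-of-envelope check of the training energy figures quoted above.
gpu_hours = 39.3e6          # cumulative H100-80GB GPU hours
peak_power_w = 700          # per-GPU peak power (TDP), in watts
pue = 1.1                   # assumed power usage effectiveness, for illustration

gpu_energy_gwh = gpu_hours * peak_power_w / 1e9      # energy at the GPUs, in GWh
facility_energy_gwh = gpu_energy_gwh * pue           # adjusted for PUE

print(f"GPU energy:      {gpu_energy_gwh:.1f} GWh")       # ~27.5 GWh
print(f"Facility energy: {facility_energy_gwh:.1f} GWh")  # ~30.3 GWh
```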