Llama 3.1 has reportedly leaked, including benchmark results for the 8B, 70B, and 405B parameter models. Even the 70B version reportedly outperforms GPT-4o on several of them, which would mark the first time an open-source model has surpassed closed-source models like GPT-4o and Claude 3.5 Sonnet across multiple benchmarks.
Key details from the leaked model card:
- Trained on 15T+ tokens of publicly available data up to December 2023
- Fine-tuning data includes public instruction datasets and 15 million synthetic samples
- Supports English, French, German, Hindi, Italian, Portuguese, Spanish and Thai
The models reportedly have a 128K context length and use grouped-query attention (GQA) for improved inference scalability.
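For context, GQA lets several query heads share a single key/value head, which shrinks the KV cache and speeds up inference at long context lengths. A minimal PyTorch sketch of the mechanism (head counts and dimensions are illustrative, not taken from the Llama 3.1 configuration):

```python
# Minimal grouped-query attention (GQA) sketch; shapes are illustrative only.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Each group of query heads shares one key/value head, shrinking the KV cache."""
    batch, n_q_heads, seq, head_dim = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)
kv = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, kv, kv, n_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```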
Intended uses include multilingual commercial applications and research. The instruction-tuned models are optimized for assistant-like chat, while pre-trained models can be adapted for various natural language generation tasks.
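As an illustration of how an instruction-tuned checkpoint might be used for assistant-style chat via the Hugging Face transformers chat-template API, here is a minimal sketch; the model ID is a placeholder assumption, since no repository name was confirmed at the time of the leak:

```python
# Hedged sketch: chatting with an instruction-tuned checkpoint via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed ID, not from the leaked card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Summarize grouped-query attention in one sentence."}]
# The chat template formats the conversation the way the model was fine-tuned to expect.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```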
Training infrastructure:
- Custom training library and Meta's GPU clusters
- 39.3M GPU hours on H100-80GB hardware
- Estimated 11,390 tons CO2e location-based emissions (0 tons market-based due to renewable energy use); a rough sanity check follows below
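As a back-of-the-envelope check on those figures, one can multiply GPU hours by per-GPU power draw and a grid carbon-intensity factor. The 700 W TDP and ~0.41 kg CO2e/kWh values below are assumptions chosen for illustration, not numbers from the leaked card:

```python
# Rough sanity check of the location-based emissions estimate.
# TDP and grid carbon intensity are illustrative assumptions, not from the leaked card.
gpu_hours = 39.3e6           # H100-80GB GPU hours
tdp_kw = 0.7                 # assumed 700 W draw per GPU
grid_kg_co2e_per_kwh = 0.41  # assumed location-based grid carbon intensity

energy_kwh = gpu_hours * tdp_kw
emissions_tons = energy_kwh * grid_kg_co2e_per_kwh / 1000
print(f"{energy_kwh / 1e6:.1f} GWh, ~{emissions_tons:,.0f} tons CO2e")  # ~27.5 GWh, ~11,279 tons
```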
Benchmark scores are reported for various tasks, with Llama 3.1 models outperforming many open and closed-source chat models.
Safety considerations:
- Multi-pronged data collection approach combining human-generated and synthetic data
- LLM-based classifiers for quality control
- Focus on reducing model refusals and refusal tone
- Adversarial prompts incorporated into safety data
- Intended for deployment as part of a larger AI system with additional safeguards
Developers should implement system-level safety measures when building agent systems, especially when utilizing new features like longer context windows, multilingual capabilities, and third-party tool integrations.
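A minimal sketch of what one such system-level safeguard could look like, screening both the user prompt and the model response before anything reaches the user or downstream tools; the `is_flagged` classifier here is a hypothetical placeholder, not an API from the leaked model card:

```python
# Hypothetical system-level safety wrapper around a chat model.
# `is_flagged` stands in for any content classifier (e.g. an LLM-based filter);
# it is a placeholder, not an interface described in the leaked model card.
from typing import Callable

def safe_generate(prompt: str,
                  generate: Callable[[str], str],
                  is_flagged: Callable[[str], bool],
                  refusal: str = "I can't help with that.") -> str:
    # Screen the user input before it reaches the model.
    if is_flagged(prompt):
        return refusal
    response = generate(prompt)
    # Screen the model output before it reaches the user or any downstream tool.
    if is_flagged(response):
        return refusal
    return response

# Toy usage with stand-in callables.
reply = safe_generate("Hello!",
                      generate=lambda p: f"Echo: {p}",
                      is_flagged=lambda text: "attack" in text.lower())
print(reply)
```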