Llama 3.1 Research and Development Approach
How to Determine Parameter Scale
- Multiple factors had to be weighed, including scaling laws, training time, and GPU hardware constraints
- The decision considered not only Meta's own hardware but also what the broader AI community can run
- Quantization has shifted the cost balance between inference and training/fine-tuning
- Under the available compute and these constraints, 405B parameters emerged as the balance point (see the rough compute estimate after this list)
- The goal was to build an open-source model comparable to GPT-4
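
As a back-of-the-envelope illustration of the compute arithmetic behind such a decision (not Meta's actual planning figures), here is a minimal sketch using the common C ≈ 6·N·D approximation for dense-Transformer training FLOPs, with the publicly reported 405B parameters and roughly 15.6T training tokens; the H100 throughput and utilization numbers are assumptions:

```python
# Rough training-compute estimate via the common approximation
# C ≈ 6 * N * D (FLOPs), N = parameter count, D = training tokens.
# Illustrative numbers only; not Meta's internal planning figures.

N = 405e9     # parameters (Llama 3.1 405B)
D = 15.6e12   # training tokens reported for Llama 3.1

flops = 6 * N * D
print(f"Training compute: {flops:.2e} FLOPs")  # ~3.79e+25

# Translate into GPU time under an assumed sustained throughput.
bf16_peak = 989e12  # H100 dense BF16 peak, FLOP/s (approximate)
mfu = 0.40          # assumed model FLOPs utilization
gpu_hours = flops / (bf16_peak * mfu) / 3600
print(f"~{gpu_hours:.2e} H100 GPU-hours")      # ~2.7e+07
```
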
Revisiting Scaling Law
- Traditional scaling laws focus on two dimensions: model parameter count and the amount of training
- Chinchilla emphasized the importance of the total number of training tokens
- Meta chose to increase the token count and training duration, deliberately "over-training" the model
- This departs from Chinchilla's compute-optimal recipe, but yields better performance at inference time for a model of this size (see the comparison below)
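
A minimal sketch of that trade-off, assuming the common Chinchilla rule of thumb of roughly 20 training tokens per parameter and the publicly reported Llama 3.1 figures:

```python
# Compare the Chinchilla-optimal token budget (D ~ 20 * N) with the
# "over-trained" recipe described above. Illustrative sketch only.

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * n_params

N = 405e9           # Llama 3.1 405B parameters
D_actual = 15.6e12  # tokens reportedly trained on

D_opt = chinchilla_optimal_tokens(N)
print(f"Chinchilla-optimal tokens: {D_opt:.1e}")             # ~8.1e+12
print(f"Actual tokens:             {D_actual:.1e}")          # ~1.6e+13
print(f"Over-training factor:      {D_actual / D_opt:.1f}x") # ~1.9x
```
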
Model Architecture
- Little architectural change compared to Llama 2; the gains come mainly from expanded data scale and improved data quality
- Future improvements may involve larger architectural changes, not limited to the Transformer
- The current Transformer architecture still lacks flexibility
- Mixture-of-Experts (MoE) architectures are being explored (a minimal sketch follows this list)
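
For orientation, a minimal sketch of a top-2 gated MoE feed-forward layer in PyTorch; this illustrates the general routing idea only and is not the architecture of any Llama model (all dimensions, expert counts, and the SiLU activation are assumptions):

```python
# Minimal sketch of a top-2 gated Mixture-of-Experts (MoE) feed-forward
# layer. Hypothetical illustration of the general idea, not Llama's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); route each token to its top-k experts
        # and mix expert outputs with the normalized gate weights.
        scores = self.gate(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # per-token expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Usage: drop in where a dense FFN block would sit.
layer = MoELayer(d_model=512, d_ff=2048)
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

The appeal over a dense FFN is that only top_k of the n_experts run per token, so parameter count grows without a proportional increase in per-token compute.
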
On Synthetic Data
- Large amounts of low-quality text exist on the public internet
- Llama is used as a classifier to filter for high-quality tokens (sketched after this list)
- Llama 3's post-training relies entirely on synthetic data generated with Llama 2
- Optimistic about the prospects of synthetic data
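
A hedged sketch of what LLM-based quality filtering can look like; `llm_complete`, the prompt, and the 0-5 rating scale are all hypothetical stand-ins, not Meta's pipeline:

```python
# Sketch of using an LLM as a data-quality classifier. `llm_complete`
# is a hypothetical stand-in for whatever API serves the model.

def llm_complete(prompt: str) -> str:
    """Hypothetical call into an LLM inference endpoint."""
    raise NotImplementedError("wire up your own model here")

QUALITY_PROMPT = (
    "Rate the educational quality of the following web text "
    "from 0 (spam) to 5 (excellent). Answer with a single digit.\n\n"
    "Text:\n{text}\n\nRating:"
)

def keep_document(text: str, threshold: int = 3) -> bool:
    """Keep a document only if the model rates it at or above threshold."""
    answer = llm_complete(QUALITY_PROMPT.format(text=text[:2000]))
    digits = [c for c in answer if c.isdigit()]
    return bool(digits) and int(digits[0]) >= threshold

corpus = ["...raw web documents..."]
filtered = [doc for doc in corpus if keep_document(doc)]
```
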
LLM Evaluation and Improvement
- Improving benchmark scores through post-training carries a risk of overfitting to the benchmarks
- Language model evaluation is a difficult problem
- Various evaluation methods have been tried, such as reward models and model-as-a-judge (a judging sketch follows this list)
- Multi-round RLHF is a good way to compare models against each other
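
A sketch of pairwise model-as-a-judge comparison under stated assumptions: `llm_complete` and the prompt format are hypothetical, and the double query with swapped positions is one common mitigation for a judge's order bias:

```python
# Pairwise model-as-a-judge evaluation, sketched: a judge model picks
# the better of two candidate answers to the same question.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire up a judge model here")

JUDGE_PROMPT = (
    "Question:\n{q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)

def judge(question: str, ans_a: str, ans_b: str) -> str:
    # Query twice with the answers swapped to reduce position bias.
    first = llm_complete(JUDGE_PROMPT.format(q=question, a=ans_a, b=ans_b)).strip()
    second = llm_complete(JUDGE_PROMPT.format(q=question, a=ans_b, b=ans_a)).strip()
    if first == "A" and second == "B":
        return "model_a"
    if first == "B" and second == "A":
        return "model_b"
    return "tie"  # the judge disagreed with itself
```
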
Llama 4 and Agents
- Meta began training the Llama 4 model in June
- The focus may be on agent technology
- Some work has already been done on agent tooling, such as Toolformer (a Toolformer-style sketch follows this list)
- Strong instruction-following models are the foundation for expanding agent capabilities
- Meta's released GAIA benchmark evaluates the ability to solve real-world problems
- Agent capabilities are closely tied to the underlying model's level of intelligence
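
To make the Toolformer idea concrete: the model emits inline calls such as `[Calculator(6*7)]`, and a wrapper executes them and splices the results back into the text. A toy sketch, with a hypothetical tool registry that is not the paper's implementation:

```python
# Toolformer-style tool use, simplified: find [Tool(args)] markers in
# model output, run the tool, and substitute the result into the text.
import re

def calculator(expr: str) -> str:
    # Restrict eval to arithmetic characters for safety in this toy example.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "ERR"
    return str(eval(expr))

TOOLS = {"Calculator": calculator}  # hypothetical tool registry
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_tool_calls(text: str) -> str:
    """Replace each [Tool(args)] marker with the tool's output."""
    def execute(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        return tool(args) if tool else match.group(0)
    return CALL.sub(execute, text)

print(run_tool_calls("The answer is [Calculator(6*7)]."))  # The answer is 42.
```
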