New Standard Pipeline
Human Preference Data
The initial RLHF pipeline was centered on human data: human-written data for instruction fine-tuning and human preference data for task completion. This kind of data is expensive and closely guarded.
Today, preference data is essentially the only place where human data still enters the pipeline. Meta may have spent $10M-$20M or more on preference data alone.
For the open community, the open questions are how much human involvement such data really requires, and whether it can be replaced by approaches like LLM-as-a-Judge or reward models.
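As a concrete illustration of the LLM-as-a-Judge alternative, here is a minimal sketch of labeling preference pairs with a judge model instead of human annotators. The `policy_generate` and `judge_generate` callables and the prompt template are placeholder assumptions, not any lab's actual setup.

```python
# Minimal sketch of replacing human annotators with LLM-as-a-Judge labels.
# `policy_generate` and `judge_generate` stand in for whatever model-serving
# calls are available; they and the prompt template are assumptions.

JUDGE_PROMPT = """You are comparing two answers to the same prompt.
Prompt: {prompt}
Answer A: {a}
Answer B: {b}
Reply with exactly "A" or "B", naming the better answer."""

def label_preferences(prompts, policy_generate, judge_generate):
    """Build (prompt, chosen, rejected) triples without human annotators."""
    pairs = []
    for prompt in prompts:
        a = policy_generate(prompt)  # two samples from the current policy
        b = policy_generate(prompt)
        verdict = judge_generate(JUDGE_PROMPT.format(prompt=prompt, a=a, b=b)).strip()
        chosen, rejected = (a, b) if verdict.upper().startswith("A") else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The same loop can be run with a trained reward model in place of the judge prompt; the point is that the labels no longer come from human annotators.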
Scaling RLHF
Thomas Scialom, alignment lead for Llama 3, says that RLHF is far more scalable: it costs less, is easier to operate, and generally leads to better performance.
Industry now uses instruction fine-tuning (IFT/SFT) only as a starting point for scaling RLHF. SFT data mainly targets specific areas that earlier models did not cover well, and RLHF is then scaled up on top of that.
RLHF is an iterative process: the model's outputs keep improving round after round. Llama 3.1 went through 6 rounds of preference-data training, Llama 2 through 5, and Nemotron through 4, each with multiple rounds of instruction fine-tuning beforehand.
The use of multiple iterations is probably driven mainly by practical considerations:
- Data is transmitted in batches from annotation companies to labs
- Multiple small-scale training rounds de-risk the delivery of the final model
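The sketch below makes this loop concrete: start from an SFT checkpoint, then repeat "collect preferences on the current model's outputs, optimize, move on to the next data batch" for several rounds. The callables `collect_preferences` and `optimize` are placeholders for a lab's actual tooling; this is an assumed structure, not Meta's or NVIDIA's pipeline.

```python
# Schematic of the multi-round post-training loop described above: start from an
# SFT checkpoint, then repeat preference collection + optimization per round
# (Llama 3.1: 6 rounds, Llama 2: 5, Nemotron: 4). The callables are placeholders
# for real data-collection and training code; this is a sketch, not a lab's recipe.

def iterative_post_training(sft_model, prompt_batches, collect_preferences, optimize):
    """Run one round of preference collection + RLHF per batch of prompts.

    prompt_batches      -- one batch per round, e.g. successive deliveries from
                           an annotation vendor
    collect_preferences -- (model, prompts) -> preference pairs for this round
    optimize            -- (model, all_preferences) -> improved model (DPO/PPO/...)
    """
    model, all_preferences = sft_model, []
    for round_idx, prompts in enumerate(prompt_batches, start=1):
        round_prefs = collect_preferences(model, prompts)  # labels come from the *current* model's outputs
        all_preferences.extend(round_prefs)                # keep earlier rounds; nothing is discarded
        model = optimize(model, all_preferences)
        print(f"finished round {round_idx} with {len(all_preferences)} preference pairs")
    return model
```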
Similar iterative RLHF methods trace back to Anthropic's Constitutional AI, but the open-source community does not seem to have reproduced this recipe at scale.
Academia is currently focusing on "online DPO training," which points in a similar direction but pays less attention to the data collected between rounds. Once the process is automated, online DPO will be the way forward.
Teams' choices of post-training algorithm need not be so rigid. DPO and PPO each have pros and cons: DPO is easier to scale, but PPO-inspired methods (such as online RL) have a higher performance ceiling.
The current choices are driven mainly by simplicity, since these post-training teams are still relatively new and are building out modular systems.
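For concreteness, here is the standard DPO objective (as published by Rafailov et al.) in PyTorch; online DPO differs mainly in that the chosen/rejected pairs are re-sampled from the current policy between rounds rather than fixed up front. The beta value and tensor shapes are illustrative assumptions, not any team's settings.

```python
import torch
import torch.nn.functional as F

# Standard DPO loss. Inputs are per-sequence summed log-probabilities of the
# chosen/rejected responses under the policy being trained and under a frozen
# reference model; beta controls how far the policy may drift from the reference.

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

PPO-style online RL instead samples fresh completions every step and scores them with a reward model, which is harder to operate but leaves more headroom.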
Synthetic Data
A key link in the new RLHF cycle is synthetic instruction data, which now surpasses human-written data on most tasks.
If the model can be made slightly better so that it generates better instructions, the process "starts over" and the checkpoint is updated.
Meta states explicitly in its paper that it "use[s] a 405B model to improve the post-training quality of our smaller models"; Google does this by distilling Gemini Flash from its larger models, and in practice most frontier models likely include similar steps.
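A minimal sketch of that distillation step, assuming a `teacher_generate` call to a large model and a simple JSONL output format (both are illustrative placeholders, not the format any lab actually uses): the teacher answers a pool of prompts, and the resulting pairs become SFT data for the smaller model.

```python
import json

# Sketch of using a large "teacher" model's completions as SFT data for a
# smaller model, in the spirit of "use a 405B model to improve the post-training
# quality of our smaller models". `teacher_generate` is an assumed callable.

def build_synthetic_sft_set(prompts, teacher_generate, out_path="synthetic_sft.jsonl"):
    """Write prompt/response pairs produced by the teacher to a JSONL file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = teacher_generate(prompt)  # e.g. a 405B-class model
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
    return out_path
```

The smaller model is then fine-tuned on this file with an ordinary SFT trainer.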
OpenAI is rumored to be training its next-generation model on 50 trillion tokens of data, most of it synthetic. There were also rumors last year that Anthropic had a "pre-training-scale Constitutional AI corpus," which now seems plausible.
These AI companies likely recognized the importance of synthetic data 12-18 months ago, when they stopped using model outputs for self-iterative training. Meta is different, though, in that it benefits from other, better open models.
Judging from today's post-training practice, the concern that synthetic data causes model collapse has been exaggerated. Model collapse only appears in contrived settings where the original data is discarded and only newly generated data remains.
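A toy way to see the distinction: the collapse results assume a "replace" regime in which each generation trains only on the previous generation's outputs, whereas real pipelines accumulate synthetic data on top of the original corpus. The function below is only an illustration of that bookkeeping.

```python
# Toy illustration of the "accumulate vs. replace" distinction behind the
# model-collapse claim; not drawn from any specific paper's code.

def next_training_set(original_data, synthetic_data, regime="accumulate"):
    if regime == "replace":
        # Setting used in model-collapse studies: only the latest generations survive.
        return list(synthetic_data)
    # Practice: keep the original (and earlier synthetic) data alongside new generations.
    return list(original_data) + list(synthetic_data)
```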