Researchers from Meta, UC Berkeley, and NYU have proposed a meta-reward method for language models aimed at achieving "super-alignment" of AI systems. The approach has a single model play three roles at once, acting as actor, judge, and meta-judge, so that it can improve through self-evaluation and self-training without relying on human-annotated preference data.
Specifically, the meta-reward method includes the following steps:
- Actor generates responses to given prompts
- Judge evaluates and scores the responses
- Meta-judge assesses the quality of the judge's scoring
- Model is optimized with DPO (Direct Preference Optimization) on preference pairs built from the above results (see the sketch after this list)
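The sketch below shows one plausible shape of this loop in Python, under stated assumptions: the callables `generate()`, `judge()`, `meta_judge_prefers()`, and `dpo_train()` are hypothetical stand-ins for the sampling, LLM-as-a-Judge prompting, and DPO training code, and the default counts are illustrative rather than the paper's exact settings. Only the actor → judge → meta-judge → DPO flow mirrors the steps listed above.

```python
import itertools
from typing import Callable, Sequence, Tuple

def meta_reward_iteration(
    model,
    prompts: Sequence[str],
    generate: Callable[[object, str], str],                  # (model, prompt) -> response
    judge: Callable[[object, str, str], Tuple[str, float]],  # -> (judgement text, score)
    meta_judge_prefers: Callable[[object, str, str, str, str], bool],
    dpo_train: Callable,                                     # (model, pairs) -> new model
    n_responses: int = 4,
    n_judgements: int = 3,
):
    actor_pairs, judge_pairs = [], []

    for prompt in prompts:
        # Actor: sample several candidate responses for the prompt.
        responses = [generate(model, prompt) for _ in range(n_responses)]

        # Judge: produce several (judgement, score) pairs per response and
        # average the scores to rank the responses.
        judgements = {r: [judge(model, prompt, r) for _ in range(n_judgements)]
                      for r in responses}
        avg_score = {r: sum(s for _, s in js) / len(js) for r, js in judgements.items()}

        # Actor preference pair: highest- vs. lowest-scoring response.
        chosen = max(responses, key=avg_score.get)
        rejected = min(responses, key=avg_score.get)
        if chosen != rejected:
            actor_pairs.append((prompt, chosen, rejected))

        # Meta-judge: compare pairs of judgements of the same response and keep
        # the preferred one, giving the model a training signal for judging.
        for response, js in judgements.items():
            for (text_a, _), (text_b, _) in itertools.combinations(js, 2):
                if meta_judge_prefers(model, prompt, response, text_a, text_b):
                    judge_pairs.append((prompt, text_a, text_b))  # (chosen, rejected)
                else:
                    judge_pairs.append((prompt, text_b, text_a))

    # Optimize the model with DPO on both response- and judgement-level pairs.
    return dpo_train(model, actor_pairs + judge_pairs)
```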
To keep the model from drifting toward ever-longer answers, the researchers introduced a length-control mechanism when selecting preference pairs. They also designed a detailed procedure for building the judge preference data, including a dedicated meta-judge prompt template and measures to account for position preference, i.e. the meta-judge's tendency to favor whichever judgement is presented first.
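The sketch below illustrates these two details. The rho-threshold rule and the both-orders check are illustrative choices rather than the exact criteria used in the paper; `meta_judge` is a hypothetical callable that returns "A" or "B" for two judgements presented in the given order.

```python
from typing import Callable, Optional, Sequence, Tuple

def length_controlled_pair(responses: Sequence[str],
                           scores: Sequence[float],
                           rho: float = 0.1) -> Tuple[str, str]:
    """Pick (chosen, rejected) so that a response cannot win purely by being
    longer: chosen is the shortest response whose score is within a small
    margin of the best score; rejected is simply the lowest-scoring response."""
    s_max, s_min = max(scores), min(scores)
    margin = rho * (s_max - s_min)
    near_best = [r for r, s in zip(responses, scores) if s >= s_max - margin]
    chosen = min(near_best, key=len)
    rejected = min(zip(responses, scores), key=lambda rs: rs[1])[0]
    return chosen, rejected

def position_balanced_preference(meta_judge: Callable[[str, str], str],
                                 judgement_a: str,
                                 judgement_b: str) -> Optional[str]:
    """Query the meta-judge with both orderings and keep the verdict only when
    it is consistent, so position preference alone cannot decide the pair."""
    first = meta_judge(judgement_a, judgement_b)   # judgement_a shown as "A"
    second = meta_judge(judgement_b, judgement_a)  # judgement_b shown as "A"
    if first == "A" and second == "B":
        return judgement_a
    if first == "B" and second == "A":
        return judgement_b
    return None  # verdict flipped with position -> no reliable preference
```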
In their evaluation, the researchers used Llama-3-8B-Instruct as the seed model, first fine-tuning it on the EFT (Evaluation Fine-Tuning) dataset. Meta-reward training then ran for 4 iterations over a pool of 20,000 prompts generated by Llama-2-70B-Chat, sampling 5,000 prompts per iteration.
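A minimal sketch of that schedule, reusing the hypothetical `meta_reward_iteration()` from the earlier snippet: four DPO iterations, each consuming a fresh slice of 5,000 prompts from the 20,000-prompt pool and starting from the model produced by the previous iteration.

```python
def run_meta_reward_training(seed_model, prompts, generate, judge,
                             meta_judge_prefers, dpo_train,
                             n_iterations: int = 4,
                             prompts_per_iter: int = 5000):
    model = seed_model  # e.g. Llama-3-8B-Instruct after the initial EFT fine-tuning
    for i in range(n_iterations):
        batch = prompts[i * prompts_per_iter:(i + 1) * prompts_per_iter]
        model = meta_reward_iteration(model, batch, generate, judge,
                                      meta_judge_prefers, dpo_train)
    return model
```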
Experimental results show that the meta-reward method significantly improved model performance: the win rate on AlpacaEval 2 rose from 22.9% to 39.4%, surpassing GPT-4, while the Arena-Hard score improved from 20.6% to 29.1%.
This research further demonstrates that language models can improve themselves with little human supervision, and it offers a new direction for achieving "super-alignment" of AI systems.