Meta enables LLMs to self-evaluate and evolve: after 4 rounds of training, Llama-3 8B surpasses GPT-4 on AlpacaEval 2

Researchers from Meta, UC Berkeley, and NYU have proposed Meta-Rewarding language models, a method aimed at achieving "super-alignment" of AI systems. The approach has a single model play three roles at once: actor, judge, and meta-judge, improving through self-evaluation and self-correction without relying on human-annotated preference data.

Specifically, each meta-reward training iteration consists of the following steps (sketched in code after the list):

  1. The actor generates responses to given prompts
  2. The judge evaluates and scores those responses
  3. The meta-judge assesses the quality of the judge's scoring
  4. The model is optimized with DPO (Direct Preference Optimization) on preference pairs built from these judgments
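
Since a single model plays all three roles, one training round can be sketched compactly. The sketch below is illustrative only: `generate`, `judge`, `meta_judge_prefers`, and `dpo_train` are hypothetical stand-ins for role-specific prompting and a DPO trainer, and the pairing logic is a simplification of the paper's procedure.

```python
import random

# All helpers used here (generate, judge, meta_judge_prefers, dpo_train)
# are hypothetical stand-ins, not the authors' code.

def meta_rewarding_iteration(model, prompts, n_responses=4, n_judgments=3):
    """One round of Meta-Rewarding, with the same model in all three roles."""
    actor_pairs, judge_pairs = [], []

    for prompt in prompts:
        # Step 1: actor role -- sample several candidate responses.
        responses = [generate(model, prompt) for _ in range(n_responses)]

        # Step 2: judge role -- score each response several times with an
        # LLM-as-a-judge prompt; each judgment is a dict {"text": ..., "score": ...}.
        scored = []
        for resp in responses:
            judgments = [judge(model, prompt, resp) for _ in range(n_judgments)]
            avg = sum(j["score"] for j in judgments) / n_judgments
            scored.append((resp, judgments, avg))

        # Actor preference pair: highest- vs. lowest-scoring response.
        scored.sort(key=lambda x: x[2], reverse=True)
        actor_pairs.append((prompt, scored[0][0], scored[-1][0]))

        # Step 3: meta-judge role -- compare two judgments of the same
        # response; the preferred judgment becomes the "chosen" side.
        for resp, judgments, _ in scored:
            j_a, j_b = random.sample(judgments, 2)
            if meta_judge_prefers(model, prompt, resp, j_a, j_b):
                judge_pairs.append((prompt, j_a, j_b))
            else:
                judge_pairs.append((prompt, j_b, j_a))

    # Step 4: optimize with DPO on both actor and judge preference pairs.
    return dpo_train(model, actor_pairs + judge_pairs)
```

The key design point is that both the actor pairs and the judge pairs feed the same DPO update, so the model improves at answering and at judging simultaneously.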

To counteract the judge's known bias toward longer responses, the researchers introduced a length-control mechanism. They also designed a detailed procedure for creating judge preference data, including meta-judge prompt templates and corrections for positional bias (the meta-judge tends to favor whichever judgment it sees first). Plausible sketches of both mechanisms follow.
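
Neither mechanism is detailed in this article, so the following sketch rests on two stated assumptions: length control prefers the shortest response among those scoring within a tolerance `rho` of the best score, and positional bias is handled by querying the meta-judge in both orders and discarding inconsistent verdicts. `meta_judge_prefers` is the same hypothetical helper used above.

```python
def select_with_length_control(scored, rho=0.1):
    """Pick chosen/rejected responses with a simple length penalty.

    `scored` is a list of (response_text, score) pairs. Assumption: among
    responses scoring within `rho` of the best, prefer the shortest, so
    that DPO training does not reward verbosity.
    """
    best = max(score for _, score in scored)
    top_tier = [resp for resp, score in scored if score >= best - rho]
    chosen = min(top_tier, key=len)              # shortest of the top tier
    rejected = min(scored, key=lambda x: x[1])[0]  # lowest-scoring response
    return chosen, rejected


def position_balanced_preference(model, prompt, response, j_a, j_b):
    """Query the meta-judge in both orders to cancel positional bias.

    Returns the consistently preferred judgment, or None when the two
    orderings disagree (such pairs would simply be discarded).
    """
    a_first = meta_judge_prefers(model, prompt, response, j_a, j_b)  # j_a shown first
    b_first = meta_judge_prefers(model, prompt, response, j_b, j_a)  # j_b shown first
    if a_first and not b_first:
        return j_a   # j_a preferred regardless of position
    if b_first and not a_first:
        return j_b   # j_b preferred regardless of position
    return None      # verdict flips with position: skip this pair
```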

In evaluation experiments, the researchers used Llama-3-8B-Instruct as the seed model, first fine-tuning it on the EFT (Evaluation Fine-Tuning) dataset. Meta-reward training then drew on a pool of 20,000 prompts generated by Llama-2-70B-Chat, sampling 5,000 per round across 4 iterations.
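
For reference, here is the setup restated as a compact configuration; the values come from the paragraph above, while the field names are illustrative rather than taken from the paper:

```python
meta_rewarding_setup = {
    "seed_model": "Llama-3-8B-Instruct",   # plays actor, judge, and meta-judge
    "initial_sft_data": "EFT",             # evaluation fine-tuning seed data
    "prompt_pool_size": 20_000,            # prompts generated by Llama-2-70B-Chat
    "prompts_per_iteration": 5_000,        # fresh sample each round
    "num_iterations": 4,                   # generate -> judge -> meta-judge -> DPO
}
```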

Experimental results show that meta-rewarding significantly improved model performance. For example, the length-controlled win rate on AlpacaEval 2 rose from 22.9% to 39.4%, surpassing GPT-4; on Arena-Hard, the score improved from 20.6% to 29.1%.

This research further demonstrates that language models can improve through iterative self-evaluation, reducing dependence on human supervision, and it offers a new path toward "super-alignment" of AI systems.

Paper Link 1 Paper Link 2