OpenAI releases "final work" on super alignment research: Large and small models compete to improve output readability

Accuracy has slightly decreased

Model-to-Model Gaming

OpenAI research found that there is a trade-off between accuracy and readability in content generated by large models:

Optimizing for accuracy leads to decreased readability; optimizing for readability results in some loss of accuracy.

This difference is easily noticeable even in simple elementary math tasks, which OpenAI calls the "comprehensibility tax".

However, whether humans can understand model-generated content is a crucial prerequisite for judging it accurately, so comprehensibility has important safety implications.

Even setting this aside, the comprehensibility of content directly determines whether the model can provide useful assistance to humans.

This research was inspired by the "prover-verifier" game proposed by University of Toronto scholars in 2021, which trains models by having them play against each other.

The prover has powerful computational capabilities to perform complex calculations and generate proofs, aiming to convince the verifier that a statement is correct.

The verifier can only perform relatively simple calculations and needs to determine if the prover's proof is credible through interaction.

For example, in a math problem that requires finding a function's expression, the prover claims to know the expression (though not necessarily correctly) and wants the verifier to believe its answer is correct.

The verifier knows the function's values at some inputs and can use this information to test the prover, judging from the accuracy of the prover's responses whether it actually knows the correct function.
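The toy Python sketch below illustrates this game under simplifying assumptions: the verifier knows only a few (input, value) pairs of the true function and spot-checks whatever expression the prover claims. The function names and the example f(x) = 2x + 1 are purely illustrative and not taken from OpenAI's setup.

```python
# Toy sketch of the prover-verifier game (illustrative only, not OpenAI's setup).

def verifier_accepts(claimed_fn, known_points, tolerance=1e-9):
    """The verifier only knows a handful of (x, f(x)) samples and accepts the
    prover's claimed function if it reproduces all of them."""
    return all(abs(claimed_fn(x) - y) <= tolerance for x, y in known_points)

# The verifier's private knowledge: a few values of the true function f(x) = 2x + 1.
known_points = [(0, 1), (2, 5), (10, 21)]

honest_prover = lambda x: 2 * x + 1   # claims the correct expression
sneaky_prover = lambda x: 3 * x + 1   # tries to pass off a wrong expression

print(verifier_accepts(honest_prover, known_points))  # True
print(verifier_accepts(sneaky_prover, known_points))  # False: fails at x=2 and x=10
```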

Through this approach, OpenAI aims to balance accuracy and readability, trading off a relatively small loss in accuracy for higher readability.

In this project, the OpenAI team used a dataset containing elementary math problems, solutions, and correct answers.

They chose GPT-4 series models of different scales, with larger ones as provers and smaller ones as verifiers.

The dataset was randomly divided into two equal subsets for training the prover and verifier respectively.

The prover was assigned two roles: helpful and sneaky, generating correct and incorrect solutions respectively.

However, both aimed to have their solutions receive high scores from the verifier.

These roles are not fixed; the prover alternates between them and is rewarded when its behavior aligns with its assigned role.
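A minimal sketch of this role-conditioned reward idea, in Python: the reward below simply zeroes out the verifier's score when the solution does not match the assigned role. This is an illustrative stand-in, not OpenAI's actual reward formula.

```python
# Illustrative role-conditioned reward (not the paper's exact formulation).

def prover_reward(role, solution_is_correct, verifier_score):
    """The prover is rewarded only when its behavior matches its assigned role:
    a 'helpful' prover should produce correct solutions, a 'sneaky' prover
    incorrect ones. In both cases it still wants a high verifier score."""
    role_matches = (role == "helpful") == solution_is_correct
    return verifier_score if role_matches else 0.0

# Examples (verifier_score is assumed to lie in [0, 1]):
print(prover_reward("helpful", solution_is_correct=True,  verifier_score=0.9))  # 0.9
print(prover_reward("sneaky",  solution_is_correct=False, verifier_score=0.8))  # 0.8
print(prover_reward("sneaky",  solution_is_correct=True,  verifier_score=0.8))  # 0.0
```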

Training was conducted in multiple rounds. In each round, the team sampled correct and incorrect solutions generated by the two provers from the previous round.

These samples served as training data for the current round's verifier, whose judgment accuracy was improved through supervised learning so that correct solutions received higher scores than incorrect ones.

In this process, the sneaky prover continually pushed the verifier to sharpen its judgment, while the verifier pushed the prover to generate content better aligned with its scoring preferences; both sides improved through this mutual game.

An important prerequisite for achieving high scores was that the generated content could be understood by the verifier. Thus, by incorporating human preferences into the verifier, the comprehensibility of the model's output could be enhanced.
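The self-contained toy below sketches this round-based loop under heavy simplifications (it reuses prover_reward from the earlier sketch). TinyVerifier and TinyProver are illustrative stand-ins for the GPT-4-scale models, and the RL update on the prover is omitted; only the round structure mirrors the description above.

```python
# Toy round-based training loop (structural sketch only; reuses prover_reward above).
import random

class TinyVerifier:
    def __init__(self):
        self.sharpness = 0.0  # how strongly it separates correct from incorrect

    def score(self, is_correct):
        # Stand-in scoring: a noisy signal whose gap widens as the verifier "trains".
        noise = random.uniform(-0.15, 0.15)
        base = 0.5 + (self.sharpness if is_correct else -self.sharpness)
        return min(1.0, max(0.0, base + noise))

    def train_supervised(self, correct_samples, incorrect_samples):
        # Toy update standing in for supervised training on the sampled solutions:
        # each round, push correct solutions' scores further above incorrect ones'.
        self.sharpness = min(0.4, self.sharpness + 0.1)

class TinyProver:
    def generate(self, role):
        # The helpful role aims for correct solutions, the sneaky role for incorrect ones.
        target_correct = (role == "helpful")
        is_correct = random.random() < (0.9 if target_correct else 0.1)
        return {"role": role, "is_correct": is_correct}

def run_rounds(n_rounds=5, n_samples=200):
    verifier, prover = TinyVerifier(), TinyProver()
    # Round 0: sample initial solutions from the (untrained) prover.
    samples = [prover.generate(random.choice(["helpful", "sneaky"]))
               for _ in range(n_samples)]
    for _ in range(n_rounds):
        # 1. Split the previous round's samples by correctness.
        correct = [s for s in samples if s["is_correct"]]
        incorrect = [s for s in samples if not s["is_correct"]]
        # 2. Verifier step: supervised training so correct solutions outscore incorrect ones.
        verifier.train_supervised(correct, incorrect)
        # 3. Prover step: generate new solutions; the reward combines the verifier's
        #    score with whether the solution matches the assigned role.
        samples = []
        for _ in range(n_samples):
            role = random.choice(["helpful", "sneaky"])
            s = prover.generate(role)
            s["reward"] = prover_reward(role, s["is_correct"],
                                        verifier.score(s["is_correct"]))
            samples.append(s)
    return verifier, prover, samples

run_rounds()
```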

Balance between Accuracy and Readability

Ultimately, this game-based training method achieved a good balance between high accuracy and high comprehensibility.

Traditional training methods focus only on the model's own accuracy: as training progressed, answer accuracy improved, but humans judging those answers became slower and less accurate.

Since the OpenAI team incorporated human preferences into the verifier, the verifier's judgments were to some extent correlated with human judgments.

The final results confirmed that, with provers trained using this gaming approach, humans made faster and more accurate judgments as the training rounds increased, without significant loss in the models' own accuracy.

Compared to the baseline method, human judgment accuracy exceeded 82% after five rounds of the new method, while the baseline achieved less than 75% after 7,000 training steps.

Time-wise, human judgment for the baseline took nearly 27 seconds, while the new method took just over 26 seconds.

Additionally, for the verifier, its robustness gradually increased as training progressed, enabling more accurate identification of incorrect answers.

Besides evaluating the effectiveness of this game-based training method, the OpenAI team made some other discoveries during their research:

Impact of verifier scale: a verifier that was too small led to unstable training, while one that was too large made it difficult to improve robustness over multiple rounds; a verifier between one-thousandth and one-hundredth of the prover's size worked best.

Errors in the sneaky prover's solutions became increasingly subtle and localized as training progressed, indicating that the verifier pushed the sneaky prover to generate adversarial samples ever closer to genuine solutions.

As training rounds increased, human evaluators' accuracy at catching the sneaky prover's flawed solutions decreased.

One More Thing

Because the Superalignment team responsible for this research has since disbanded, and the work touches on safety, OpenAI's safety strategy has once again drawn widespread attention.

Regarding this issue, OpenAI CTO Mira Murati gave an exclusive interview at Johns Hopkins University earlier this month.

During the interview, Mira stated that OpenAI has not "prioritized product over safety," as alleged by Jan Leike, the former head of the Superalignment team.

She also said that although the Superalignment team has disbanded, the work on superalignment continues.