The new task: identifying errors in given solution steps.
This prevents models from simply memorizing or guessing answers and eliminates concerns about test-set leakage.
Using MR-Ben, Jia's team evaluated many open and closed-source models including GPT4-Turbo, Claude3.5-Sonnet, GLM4, and Qwen2-70B.
All code and data for this dataset have been open-sourced.
Familiar Questions, Brand New Task
Current mainstream evaluation methods for large language models follow the format of standardized human exams: multiple-choice and fill-in-the-blank questions.
This approach has clear standards, intuitive metrics, and naturally generates discussion-worthy quantitative results.
However, the authors argue this method is not "reliable" given that modern large language models generally use chain-of-thought reasoning to generate final answers.
Because models see trillions of tokens during pre-training, it is difficult to determine whether an evaluated model has already encountered the relevant data and is simply "memorizing" the correct answers.
Additionally, because evaluation mainly checks the final answer, it's unclear whether the model chose the correct option based on proper understanding and reasoning.
Although the academic community continually upgrades datasets like GSM8K and MMLU, such as introducing multilingual versions or more difficult questions, they still can't escape the limitations of multiple choice or fill-in-the-blank formats.
Furthermore, these datasets now face serious saturation issues: leading large language models are approaching ceiling scores, and the benchmarks are gradually losing their discriminative power.
To address this, Jia's team collaborated with MIT, Tsinghua, Cambridge and other renowned universities, as well as leading Chinese annotation companies, to create MR-Ben - an evaluation dataset focused on complex problem-solving reasoning processes.
MR-Ben is based on questions from widely used evaluation datasets such as GSM8K, MMLU, LogiQA, and MHPP. It applies a "grading-style" paradigm shift to create a new dataset that is more challenging, more discriminative, and better reflects true reasoning ability.
Rather than finding new questions or modifying existing ones to test model robustness, MR-Ben directly transforms models from "test-takers" into "graders", having them evaluate existing solution processes. In other words, it tests their mastery of the knowledge points by making them act as teachers!
Specifically, Jia's team organized mainstream evaluation datasets like GSM8K, MMLU, LogiQA, and MHPP into categories such as math/physics/chemistry/biology, coding, logic, and medicine, with different difficulty levels.
For each category and collected question, the team carefully gathered corresponding step-by-step solution processes. These were then annotated by trained professionals with master's and doctoral degrees.
During annotation, the annotators meticulously record whether each solution process is correct, where the errors occur, and why they are errors. Comparing the model's grading results against the human experts' grading then reveals how well the model has mastered the relevant knowledge points.
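To make this comparison concrete, here is a minimal sketch of what an annotated record and the model-versus-expert agreement check could look like. The field names, record structure, and metric are illustrative assumptions, not the actual MR-Ben schema or scoring rule.

```python
# Illustrative sketch only: field names and the agreement metric are
# assumptions for exposition, not the actual MR-Ben schema or scoring rule.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    solution_correct: bool            # is the solution correct overall?
    first_error_step: Optional[int]   # index of the first bad step, if any
    error_reason: Optional[str]       # short explanation of the error

@dataclass
class AnnotatedItem:
    question: str
    steps: list[str]        # the step-by-step solution to be graded
    expert: Verdict         # human expert annotation

def agreement(items: list[AnnotatedItem], model: list[Verdict]) -> float:
    """Fraction of items where the model's overall verdict and located
    error step both match the expert annotation."""
    hits = sum(
        m.solution_correct == item.expert.solution_correct
        and m.first_error_step == item.expert.first_error_step
        for item, m in zip(items, model)
    )
    return hits / max(len(items), 1)
```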
From an evaluation perspective, the method proposed by MR-Ben requires models to carefully analyze the premises, assumptions, and logic of each step in the solution process, and to simulate the reasoning process to determine if the current step leads to the correct answer.
This "grading" style of evaluation is far more challenging than simply answering questions, but it effectively avoids inflated scores due to memorization. Students who can only memorize answers would struggle to be competent graders.
GPT4-Turbo Performs Best
Jia's team evaluated several well-known large language models, with multiple versions of some models tested.
Among closed-source models, GPT4-Turbo performed best (although it failed to detect calculation errors when "grading"), leading the other models in most subjects under both the with-demonstration (k=1) and no-demonstration (k=0) settings.
The GLM model from Zhipu AI ranked second on the leaderboard, surpassing Claude's latest 3.5-Sonnet version.
However, there are significant differences between models. Even the strongest performer, GPT4-Turbo, scored less than 50 points on the MR-Ben dataset, indicating its performance is still far from saturated.
Additionally, some high-performing open-source models have caught up with certain commercial models.
The MR-Ben team also discovered some interesting phenomena during their work:
- In low-resource scenarios, small models showed notable strengths. Phi-3-mini stood out among the small models in the MR-Ben evaluation, even outperforming or matching models with hundreds of billions of parameters, demonstrating the importance of fine-tuning data.
- MR-Ben scenarios involve complex logical parsing and step-by-step reasoning. In few-shot mode, overly long contexts actually confused the models and led to decreased performance.
- The team ran numerous generate-reflect-regenerate ablation experiments to examine the differences between prompting strategies (a minimal sketch of this loop follows the list). This strategy had no effect on low-performing models and little effect on high-performing models such as GPT4-Turbo; for mid-tier models it slightly improved performance, since they sometimes corrected errors but also introduced new ones.
- When the MR-Ben evaluation subjects are roughly divided into knowledge-based, logic-based, calculation-based, and algorithm-based categories, different models showed varying strengths and weaknesses across these reasoning types.
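As a rough illustration of the generate-reflect-regenerate strategy from the third bullet, the loop below sketches one common way such an ablation can be wired up. `call_model` is a placeholder for whatever model API is under evaluation, and the prompt wording is an assumption rather than the team's exact protocol.

```python
# Sketch of a generate -> reflect -> regenerate loop.
# `call_model` is a placeholder for the model API being evaluated;
# the prompt wording is an assumption, not MR-Ben's exact ablation setup.
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in the model under evaluation here")

def generate_reflect_regenerate(question: str, rounds: int = 1) -> str:
    # Step 1: generate an initial chain-of-thought solution.
    answer = call_model(f"Solve the following problem step by step:\n{question}")
    for _ in range(rounds):
        # Step 2: ask the model to reflect on its own solution.
        critique = call_model(
            "Review the following solution. Point out any incorrect step "
            f"and explain why.\n\nQuestion:\n{question}\n\nSolution:\n{answer}"
        )
        # Step 3: regenerate the solution using the critique.
        answer = call_model(
            "Rewrite the solution, fixing the issues raised in the review.\n\n"
            f"Question:\n{question}\n\nOriginal solution:\n{answer}\n\n"
            f"Review:\n{critique}"
        )
    return answer
```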
Jia's team has also uploaded a one-click evaluation method on GitHub. A single evaluation run consumes about 12M tokens. Developers can evaluate their own models and submit the results, which the MR-Ben team will promptly add to the leaderboard.