Microsoft's New Technology: Synthetic Data Significantly Improves LLMs' Mathematical Abilities

Microsoft has introduced AgentInstruct technology, which uses synthetic data to enhance AI model performance.

When "Synthetic Data" Meets Intelligent Agents

Over the past year, we have witnessed the rise of intelligent agents. By reflecting on and iterating over their own outputs, these agents can generate high-quality data that surpasses what the underlying foundation models produce on their own.

In this process, agents can review a solution, critique it, and improve it. They can even call tools such as search APIs, calculators, and code interpreters to extend the capabilities of the underlying large models.

Moreover, multi-agent systems bring additional advantages: they can simulate scenarios, generate new prompts together with their responses, and automate data-generation workflows, reducing or eliminating the need for human intervention on certain tasks.
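To make this concrete, below is a minimal sketch of such a generate-critique-revise loop. The `call_llm` helper, the agent prompts, and the loop structure are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a generate-critique-revise loop for synthetic data.
# `call_llm` is a hypothetical helper wrapping any chat-completion API.

def call_llm(system: str, user: str) -> str:
    # Placeholder: swap in a real call to a strong model (e.g., GPT-4).
    return f"[model output for: {user[:40]}...]"

def generate_with_reflection(seed_text: str, rounds: int = 2) -> dict:
    # A generator agent drafts a (prompt, response) pair from a raw seed.
    prompt = call_llm(
        system="Write a challenging question grounded in the given text.",
        user=seed_text,
    )
    response = call_llm(system="Answer the question accurately.", user=prompt)

    # A critic agent reviews the draft, and the generator revises it.
    for _ in range(rounds):
        critique = call_llm(
            system="Critique this answer: list errors and weaknesses.",
            user=f"Question: {prompt}\nAnswer: {response}",
        )
        response = call_llm(
            system="Revise the answer to address the critique.",
            user=f"Question: {prompt}\nAnswer: {response}\nCritique: {critique}",
        )
    return {"prompt": prompt, "response": response}
```

The critic and generator could just as well be separate models; the key idea is that iteration lifts output quality above a single direct generation.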

In the paper, the authors introduce the concept of "generative teaching": using synthetic data for post-training, specifically having a powerful model create data that teaches another model a new skill or behavior.

AgentInstruct is an intelligent agent solution for generative teaching.

In summary, AgentInstruct can create:

  • High-quality data: Using powerful models like GPT-4, combined with tools such as search and code interpreters.
  • Diverse data: AgentInstruct generates prompts and responses simultaneously. It uses multi-agent systems (equipped with powerful LLMs, tools, and reflection processes) and a taxonomy with over 100 subcategories to create diverse and high-quality prompts and responses.
  • Large-scale data: AgentInstruct can run autonomously and can apply validation and data-filtering processes. It requires no seed prompts; raw documents serve as the seeds (a sketch of this taxonomy-driven sampling follows the list).
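As a rough sketch, a skill taxonomy could drive diversity like this. The skill names, subcategories, and length-based filter below are invented for illustration; the actual taxonomy covers 17 skills with over 100 subcategories, and the real validation logic is more elaborate.

```python
import random

# Illustrative fragment of a skill taxonomy; the real taxonomy spans
# 17 skills with more than 100 subcategories.
TAXONOMY = {
    "reading_comprehension": ["literal", "inferential", "critical"],
    "question_answering": ["factual", "multi_hop"],
    "coding": ["debugging", "code_generation"],
}

def sample_task() -> tuple[str, str]:
    """Pick a (skill, subcategory) pair to steer the next generation."""
    skill = random.choice(list(TAXONOMY))
    return skill, random.choice(TAXONOMY[skill])

def keep(example: dict) -> bool:
    """Toy validation filter: drop empty or trivially short pairs."""
    return len(example["prompt"]) > 20 and len(example["response"]) > 20

# Raw documents act as the seeds; no hand-written seed prompts are needed.
raw_docs = ["<any crawled or curated document text>"]
```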

Generative Teaching: AgentInstruct

How do we create massive amounts of data? How do we ensure the generated data is diverse? How do we generate complex or nuanced data?

To address these challenges, the researchers designed a structured approach. Specifically, AgentInstruct defines three distinct automated generation processes:

Content transformation process: Converts raw seeds into an intermediate representation that makes it easier to create instructions for specific objectives.
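Sketched with the same hypothetical `call_llm` helper as above, and with the target representations chosen purely for illustration:

```python
def transform_content(raw_seed: str, target: str = "argument passage") -> str:
    """Turn a raw document into an intermediate representation
    (e.g., an argument passage, meeting transcript, or API list)
    that is easier to write instructions against."""
    return call_llm(
        system=f"Rewrite the following text as a {target}.",
        user=raw_seed,
    )
```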

Seed instruction generation process: Composed of multiple agents, it takes the transformed seeds produced by the content transformation process as input and generates a set of diverse instructions.
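One way to picture this, with the subcategories again invented for illustration: each agent owns one instruction type, and all of them fan out over the same transformed seed.

```python
# Each agent targets one instruction subcategory from the taxonomy.
QUESTION_TYPES = [
    "a literal comprehension question",
    "an inference question requiring multi-step reasoning",
    "a question about the author's assumptions",
]

def generate_seed_instructions(transformed_seed: str) -> list[str]:
    """Fan a transformed seed out to one agent per subcategory."""
    return [
        call_llm(
            system=f"Write {qtype} about the given text.",
            user=transformed_seed,
        )
        for qtype in QUESTION_TYPES
    ]
```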

Instruction improvement process: Takes the instructions from the seed instruction generation process as input and iteratively enhances their complexity and quality.
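A minimal sketch of such an iterative refinement loop, assuming a suggester agent that proposes modifications and an editor agent that applies them (the prompts are illustrative, not the paper's):

```python
def refine_instruction(instruction: str, rounds: int = 3) -> str:
    """Iteratively increase an instruction's complexity and quality."""
    for _ in range(rounds):
        # A suggester agent proposes a way to make the task harder or subtler.
        suggestion = call_llm(
            system="Suggest one modification that makes this task more "
                   "complex or nuanced without making it unsolvable.",
            user=instruction,
        )
        # An editor agent rewrites the task, applying the suggestion.
        instruction = call_llm(
            system="Rewrite the task, applying the suggested modification.",
            user=f"Task: {instruction}\nSuggestion: {suggestion}",
        )
    return instruction
```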

Next, researchers implemented these processes for 17 different skills, each with multiple subcategories. These skills include reading comprehension, question answering, coding, retrieval-augmented generation, creative writing, tool/API usage, and web control.

Experimental Results

As mentioned at the beginning, the researchers fine-tuned the Mistral-7B-v0.1 model on 25.8 million instruction pairs, resulting in Orca-3.
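The paper's exact training setup isn't reproduced here; a minimal supervised fine-tuning sketch with Hugging Face transformers, assuming the instruction pairs are already rendered into a tokenized `train_dataset` and using illustrative hyperparameters, might look like:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumes `train_dataset` holds the instruction pairs already rendered
# into a chat template and tokenized, with labels set on the responses.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

args = TrainingArguments(
    output_dir="orca3-sft",            # hypothetical run name
    per_device_train_batch_size=4,     # illustrative hyperparameters,
    gradient_accumulation_steps=8,     # not the paper's actual settings
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```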

So, how does Orca-3 perform after training with AgentInstruct data?

AgentInstruct aims to synthesize a large, diverse dataset spanning varying difficulty levels. On this dataset, baseline models such as Orca-2.5, Mistral-Instruct-7b, and ChatGPT scored well below 10, showing their disadvantage relative to GPT-4, which was designated as the reference with a score of 10.

Averaged across evaluations of Orca-3 after each training round, introducing AgentInstruct data improved performance by 33.94% over the Orca 2.5 baseline and by 14.92% over Mistral-Instruct-7B.

Setting New State-of-the-Art on Multiple Benchmarks

For instance, Orca-3 improved by 40% on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.