Apple's New AI Benchmark: GPT-4o Simulates Users to Test Large Models' Tool-Calling Capabilities

Large language models still have significant potential for improvement in terms of tool use.

The Apple team has released ToolSandbox, a benchmark for evaluating large language models' tool-calling capabilities. It uses scenario-based evaluation to better reflect how models perform in real-world environments, introducing dimensions that traditional benchmarks have overlooked, such as dialogue interaction and state dependency.

ToolSandbox addresses the lack of scenario-based evaluation in existing benchmarks, narrowing the gap between test conditions and real applications. To make tests interactive, the authors have GPT-4o play the role of the user, conversing with the model under test to simulate real-world dialogue.
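The dual-LLM setup can be pictured as a simple turn-taking loop. The sketch below is purely illustrative, with stub functions standing in for the GPT-4o user simulator and the model under test; all names and behaviors are assumptions, not Apple's actual code.

```python
# Illustrative sketch of a user-simulator / agent dialogue loop.
# In the real benchmark, both roles would be backed by LLM API calls;
# here they are deterministic stubs so the loop itself is clear.

def user_simulator(history):
    # A real system would prompt GPT-4o with a persona and a task goal.
    if not history:
        return "Please text Mom that I'll be late."
    return "DONE"  # signal that the simulated user is satisfied

def agent(history):
    # The model under test decides whether to reply in text or call a tool.
    return "TOOL_CALL: send_message(to='Mom', body=\"I'll be late\")"

def run_dialogue(max_turns=10):
    history = []
    for _ in range(max_turns):
        user_msg = user_simulator(history)
        if user_msg == "DONE":
            break
        history.append(("user", user_msg))
        history.append(("agent", agent(history)))
    return history

print(run_dialogue())
```

The loop ends either when the simulated user signals completion or when a turn budget runs out, mirroring how multi-turn scenarios terminate.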

ToolSandbox includes nearly 2,000 scenarios across seven types: single- and multi-tool calls, single- and multi-turn dialogues, state dependency, normalization, and insufficient information. It measures models along three axes: overall performance, robustness, and efficiency.
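State dependency means a tool only works once the world is in the right state, so the model must chain calls in the correct order. Below is a hypothetical toy example of such a scenario; the class, tool names, and failure behavior are my own illustration of the idea, not taken from the benchmark.

```python
# Hypothetical state-dependent tool environment (names are assumptions).
# Sending a message only succeeds after cellular service has been enabled,
# so a model must call enable_cellular() before send_message().

class PhoneWorld:
    def __init__(self):
        self.cellular_on = False
        self.sent = []

    def enable_cellular(self):
        self.cellular_on = True
        return "cellular enabled"

    def send_message(self, to, body):
        # The dependency: this call fails until the prerequisite state is set.
        if not self.cellular_on:
            raise RuntimeError("no signal: enable cellular first")
        self.sent.append((to, body))
        return "message sent"

world = PhoneWorld()
world.enable_cellular()
print(world.send_message("Mom", "I'll be late"))  # → message sent
```

A model that calls `send_message` immediately would hit the error, which is exactly the kind of implicit ordering these scenarios probe.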

The testing process has three stages: scenario preparation, interactive execution, and evaluation. Evaluation measures performance against predefined "milestones" (key events the model must achieve) and "minefields" (events it must never trigger).
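One way to picture milestone/minefield scoring is as a check over the ordered list of tool calls a model made during a dialogue. The sketch below is a minimal, assumed implementation of that idea (treating milestones as an ordered subsequence and any minefield hit as a zero score), not Apple's actual scoring code.

```python
# Minimal sketch of milestone/minefield evaluation (assumed semantics).
# A trajectory is the ordered sequence of tool-call events a model produced.

def evaluate(trajectory, milestones, minefields):
    """Return the fraction of milestones reached in order, or 0.0 if any
    minefield event appears anywhere in the trajectory."""
    if any(event in minefields for event in trajectory):
        return 0.0
    reached = 0
    remaining = iter(trajectory)  # milestones must be matched in order
    for milestone in milestones:
        for event in remaining:
            if event == milestone:
                reached += 1
                break
    return reached / len(milestones)

# Example: the model must look up a contact, then send a message,
# and must never delete the contact.
score = evaluate(
    trajectory=["search_contact", "send_message"],
    milestones=["search_contact", "send_message"],
    minefields=["delete_contact"],
)
print(score)  # → 1.0
```

Partial credit falls out naturally: a trajectory that only reaches the first of two milestones would score 0.5 under this sketch.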

Test results show that closed-source models generally outperform open-source models at tool calling: GPT-4 scored highest at 73.0, while the best open-source model reached only 31.4.

Further analysis indicates that open-source models struggle to recognize when a tool call is needed. Large models excel at single- and multi-tool calls and single-turn user requests, but their advantage shrinks in multi-turn dialogues and state-dependent tasks. Normalization is a major challenge for all models.

Overall, large models still face many challenges with tool use in complex, interactive real-world scenarios.

The ToolSandbox team members come from various teams at Apple, including machine learning, data science, and foundation models. The first author is Jiarui Lu, a Chinese machine learning engineer who earned a bachelor's degree from Tsinghua University and a master's degree in machine learning from Carnegie Mellon University before joining Apple in 2020.