OpenAI delays its release yet again; publishing only an evaluation set raises questions

The hype surrounding Strawberry has already subsided.

Everyone thought it would be "Strawberry", but it turned out to be "Kale"

Although the whole world was watching the "Strawberry Project", the contrarian OpenAI has once again failed to meet expectations: you ask for "strawberry", and they hand you "kale" instead.

At 2 AM Beijing time on August 14, OpenAI announced on its official website that it is releasing a human-verified subset of SWE-bench that can more reliably evaluate AI models' ability to solve real-world software problems.

SWE-bench Hugging Face address:

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

As part of the preparedness framework (a set of methods established by OpenAI to safely develop and deploy its frontier models), OpenAI has developed a series of metrics to track, evaluate, and predict models' autonomous capabilities.

The ability to autonomously complete software engineering tasks has long been a key component of the medium risk level in the model autonomy risk category for frontier models. Evaluating these capabilities is challenging due to the complexity of software engineering tasks, the difficulty of accurately assessing generated code, and the challenges of simulating real-world development scenarios. Therefore, OpenAI's preparedness approach must also carefully examine the evaluations themselves to minimize the possibility of overestimating or underestimating risk.

One of the most popular software engineering evaluation suites is SWE-bench. It is used to assess whether large language models can actually solve real software problems taken from GitHub, and to what extent. The benchmark works by providing an agent with a code repository and a problem description and requiring it to generate a patch that resolves the described problem.

According to the SWE-bench leaderboard, as of August 5, 2024, coding agents have made remarkable progress, with the top agent scoring 20% on the full SWE-bench and 43% on SWE-bench Lite.

After testing, it was found that some tasks on SWE-bench may be difficult or impossible to solve, causing SWE-bench to systematically underestimate models' autonomous software engineering capabilities. Therefore, OpenAI collaborated with the authors of SWE-bench to address these issues in a new version of the benchmark, which should provide a more accurate assessment.

So, what is the background of SWE-bench?

Each example in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open-source Python repositories. Each example has an associated pull request (PR) that includes both the solution code and the unit tests used to verify its correctness. These unit tests fail before the PR's solution code is added and pass afterward, and are therefore called FAIL_TO_PASS tests. Each example also has associated PASS_TO_PASS tests, which pass both before and after the PR is merged and are used to check that existing, unrelated functionality in the codebase is not broken by the PR.
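For readers who want to inspect the data directly, the samples can be loaded with the Hugging Face datasets library. The following is a minimal sketch; the field names (repo, instance_id, problem_statement, FAIL_TO_PASS, PASS_TO_PASS) follow the published SWE-bench schema, but treat them as assumptions and check the dataset card linked above if your version differs.

```python
# Sketch: loading SWE-bench Verified and inspecting one sample.
# Field names follow the published SWE-bench schema; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 human-verified samples

sample = ds[0]
print(sample["repo"])               # source repository, one of the 12 Python projects
print(sample["instance_id"])        # unique identifier of the issue/PR pair
print(sample["problem_statement"])  # the GitHub issue text given to the agent
print(sample["FAIL_TO_PASS"])       # tests that must go from failing to passing
print(sample["PASS_TO_PASS"])       # tests that must keep passing
```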

For each sample in SWE-bench, agents are given the original text from the GitHub issue (called the problem statement) and granted access to the codebase. With these, agents must edit files in the codebase to solve the problem. Tests are not shown to the agents.

Proposed edits are evaluated by running both sets of tests. If the FAIL_TO_PASS tests pass, the edits solve the original problem; if the PASS_TO_PASS tests pass, the edits have not inadvertently broken unrelated parts of the codebase. Edits must pass both sets of tests to fully resolve the original GitHub issue.
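The resolution rule can be written down as a small pure function. The sketch below assumes a hypothetical test_results mapping standing in for whatever the evaluation harness reports after running the test suite:

```python
# Sketch of the resolution rule: an edit resolves the issue only if every
# FAIL_TO_PASS test now passes (the bug is fixed) and every PASS_TO_PASS test
# still passes (no unrelated functionality broke).
from typing import Dict, List

def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    test_results: Dict[str, bool],  # hypothetical mapping: test id -> passed?
) -> bool:
    bug_fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broken = all(test_results.get(t, False) for t in pass_to_pass)
    return bug_fixed and nothing_broken
```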

Adopting SWE-bench as a Preparedness Assessment

Given SWE-bench's potential relevance to the preparedness framework, researchers looked for ways to improve the benchmark's robustness and reliability, and identified three main areas for improvement:

Unit tests used to evaluate solution correctness are often too specific and, in some cases, even irrelevant to the problem. This can lead to correct solutions being rejected.

Many examples have ambiguous problem descriptions, making it unclear what the problem is and how to solve it.

It can sometimes be difficult for agents to reliably set up the SWE-bench development environment, which may inadvertently cause unit tests to fail regardless of the solution adopted. In such cases, perfectly valid solutions might be rated as incorrect.

Here's an example illustrating the first issue.

In the SWE-bench example scikit-learn__scikit-learn-14520, the agent's task is to solve a problem in the scikit-learn repository. The problem statement reports that a function's copy parameter can be specified by the user but is ignored by the library; the behavior is instead hardcoded within the function.

To solve this problem, the agent must first work out whether the function's behavior is intentional or a bug, and then change the codebase to resolve the issue. Under the SWE-bench setup, any solution the agent proposes must pass a unit test excerpted from the PR that originally fixed the problem.

That test explicitly checks that the solution raises a DeprecationWarning when the copy parameter is used, even though this requirement was never conveyed in the problem statement. Moreover, even if the agent realizes that a DeprecationWarning should be raised, the test requires the deprecation message to match exactly, and that wording was settled only after discussion on the PR, which the agent cannot access.
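To make the problem concrete, here is a hypothetical sketch of the kind of overly specific test described above. It is not the actual test from the scikit-learn PR; the transform function and the warning message are placeholders, but the pattern is the same: the solution is rejected unless it raises a deprecation warning whose wording matches exactly, neither of which is mentioned in the issue text.

```python
# Hypothetical illustration only -- not the real test from the scikit-learn PR.
import warnings
import pytest

def transform(x, copy=True):
    # Stand-in for the library function whose `copy` parameter was ignored;
    # a fix in the spirit of the real PR deprecates the unused parameter.
    warnings.warn("'copy' param is unused and was deprecated", DeprecationWarning)
    return list(x)

def test_copy_param_deprecated():
    # The grader only accepts solutions that emit this warning with this exact
    # wording, although the issue text never mentions a deprecation at all.
    with pytest.warns(DeprecationWarning, match="'copy' param is unused and was deprecated"):
        transform([1, 2, 3], copy=True)
```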

Note that the agent only receives the problem description from the main issue text and cannot see the tests it needs to pass. Under this setup, it is almost impossible for the agent to solve this example in SWE-bench.

SWE-bench Verified

To address these issues, OpenAI initiated a human annotation campaign with professional software developers to screen each sample of the SWE-bench test set for appropriately scoped unit tests and clearly specified problem descriptions.

OpenAI, together with the authors of SWE-bench, has released SWE-bench Verified: a subset of the original SWE-bench test set containing 500 samples verified by human annotators to be problem-free. This version replaces the original SWE-bench and SWE-bench Lite test sets. Additionally, OpenAI has released human annotations for all SWE-bench test samples.

At the same time, OpenAI collaborated with the SWE-bench authors to develop a new evaluation harness for SWE-bench that uses containerized Docker environments to make evaluation easier and more reliable.
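For reference, the containerized harness can be driven from Python against a file of model-generated patches roughly as sketched below. The module path and flag names reflect the SWE-bench repository's documented entry point at the time of writing and should be treated as assumptions; consult the repository for the exact interface.

```python
# Sketch: running the Dockerized SWE-bench evaluation harness on a predictions file.
# Assumes Docker is running locally and the `swebench` package is installed;
# flag names are taken from the harness documentation and may differ across versions.
import subprocess

subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",
        "--predictions_path", "predictions.jsonl",  # hypothetical path to model patches
        "--max_workers", "8",
        "--run_id", "verified-eval",                # arbitrary label for this run
    ],
    check=True,
)
```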

On SWE-bench Verified, GPT-4o with the best-performing open-source scaffold, Agentless, solved 33.2% of the samples, double the 16% it previously scored on the original SWE-bench.

The "Strawberry Project" announcement didn't come, and this test set can at most be considered an appetizer. So, is such a test set worth the hype from OpenAI?

A week ago, OpenAI CEO Sam Altman posted a tweet with a strawberry image, captioned "i love summer in the garden". The four strawberries in the image might hint that a new version of GPT-4 could be specifically designed for reasoning, potentially running alongside GPT-4o, which is designed for creation and interaction. This sparked various speculations about OpenAI releasing a new model called Strawberry.

Over the past two days, the leaker @iruletheworldmo on X has repeatedly posted about the release of Strawberry, claiming that OpenAI would release its new model, the reasoning-focused "Strawberry Project" AI, at 10 AM Pacific Time on August 13. The entire community was full of anticipation.

What is the mysterious "Strawberry Project"?

OpenAI's new "Strawberry Project" could allow ChatGPT to search the web more freely and solve complex problems.

The "Strawberry Project" was first revealed by foreign media on July 12. According to insiders and internal documents reviewed by Reuters, ChatGPT maker OpenAI is researching new approaches to its AI models in a project codenamed "Strawberry".

Details of the project had not been reported before, even as the Microsoft-backed startup races to prove that the kind of models it offers can deliver advanced reasoning capabilities.

According to a copy of an internal OpenAI document seen by Reuters in May, a team inside OpenAI is developing Strawberry. Reuters could not determine the precise date of the document, which detailed OpenAI's plans for how it intends to use Strawberry for research. Sources described the plan to Reuters as a work in progress, and the news agency could not determine how far Strawberry is from public release.

The insider said that even within OpenAI, how Strawberry works is a closely guarded secret.

The document described a project using the Strawberry model aimed at enabling the company's AI not only to generate answers to queries but also to plan ahead and autonomously and reliably browse the internet to perform what OpenAI calls "deep research," the sources said.

According to foreign media interviews with more than a dozen AI researchers, this is something AI models have so far been unable to do.

At the time, when asked about Strawberry and the details reported in this article, an OpenAI spokesperson said in a statement: "We want our AI models to see and understand the world as we do. Ongoing research into new AI capabilities is common practice in the industry, with a shared belief that the reasoning abilities of these systems will improve over time."

The spokesperson did not directly answer questions about Strawberry.

Google throws down the gauntlet

Strawberry has remained coyly half-hidden, and it is hard not to read OpenAI's sudden burst of promotion as a response to Google's almost simultaneous "Made by Google 2024" hardware event.

At this event, Google showcased its latest hardware, including the long-awaited next-generation Pixel phones: the Pixel 9, Pixel 9 Pro, and the new Pixel 9 Pro Fold, as well as new Pixel Watch and Pixel Buds devices. Although it was a hardware launch, AI still permeated the entire event; notably, Google's AI chatbot Gemini is the default assistant on the Pixel 9 phones.