01 What is Genie? What can it do?
Similar to Devin, Genie can autonomously complete various coding tasks under the guidance of human engineers, including bug fixes, feature building, code refactoring, and code verification through comprehensive testing.
In addition to autonomous operation, Genie can also collaborate with users.
Currently, Genie is still in closed beta; you can apply for a trial after registering on the official website.
Cosine claims that Genie can simulate the cognitive process of human engineers.
Pullen explained in a blog post, "My idea was simple: let it observe how human engineers complete work and imitate this process."
The code Genie generates is stored in the user's own GitHub repo, meaning Cosine keeps no copies and avoids the associated security risks.
Moreover, Cosine's platform is integrated with Slack and system notifications, which Genie uses to remind users, ask questions, or flag issues, just as a human colleague would.
"Genie can also ask users clarifying questions and respond to comments/opinions on the Pull Requests it generates."
Pullen stated, "We're trying to make Genie behave like a colleague, so it makes the most sense for the model to use the same channels as colleagues."
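Cosine hasn't published the details of this integration, but Slack's standard incoming-webhook API is enough to illustrate the idea. In the sketch below, the webhook URL and message are placeholders, not Cosine's actual integration:

```python
# Minimal sketch of a Slack notification of the kind described above,
# using Slack's standard incoming-webhook API (illustrative only).
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(text: str) -> None:
    """Post a plain-text message to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

notify("Genie: I've opened a pull request and have a clarifying question about the expected behavior.")
```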
Collaborating with OpenAI, using the latest GPT-4o
Unlike many products that simply wrap a foundation model with a few additional tools, Genie was developed through a proprietary process that included training and fine-tuning models from OpenAI.
When development of Genie began, it could only be fine-tuned on models with relatively small context windows of 16k-32k tokens.
The team found in early exploration that even with large datasets of over 100 million tokens, and despite careful architectural design and various compression and chunking methods, they were still limited by how much information the model could attend to at any given time. The only way forward was a model with a larger context window.
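As a rough illustration of the chunking workaround mentioned above (not Cosine's actual pipeline), splitting a long token sequence into overlapping windows that each fit a small context looks something like this; the window and overlap sizes are hypothetical:

```python
# Illustrative only: naive fixed-size chunking so that each piece fits a
# small (e.g. 16k-token) context window. Real pipelines would use smarter,
# syntax-aware boundaries.
def chunk_tokens(tokens: list[str], window: int = 16_000, overlap: int = 512) -> list[list[str]]:
    """Split a token sequence into overlapping windows of at most `window` tokens."""
    chunks = []
    step = window - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

# A 100k-token file becomes ~7 overlapping 16k windows: each chunk fits the
# model, but no single call ever sees the whole file, which is exactly the
# limitation described above.
```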
Fortunately, they soon gained access to OpenAI's long-context models, which became a breakthrough for Genie's capabilities.
Pullen revealed to VentureBeat, "Genie (currently) is a non-general GPT-4o variant. OpenAI allowed us to access and use their models for training as part of an experimental program."
"The model performed well, and we shared our insights with OpenAI's fine-tuning team and engineering leadership. This was a real turning point for us, as it convinced them to invest resources and attention in our new technology."
Although Cosine didn't specify the exact model, OpenAI recently announced limited availability of a long-output GPT-4o model with an output length of up to 64k tokens, a 16-fold increase over the initial 4k.
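For reference, requesting that long-output variant through OpenAI's official Python SDK is mostly a matter of raising `max_tokens`. The model name below is the one OpenAI announced for its alpha program, and access is gated, so treat this as a sketch:

```python
# Sketch: calling OpenAI's long-output GPT-4o alpha via the official SDK.
# Requires alpha access; the model name is the announced experimental one.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-64k-output-alpha",
    messages=[{"role": "user", "content": "Refactor this module ..."}],
    max_tokens=64_000,  # up to 64k output tokens, vs. the usual 4k cap
)
print(response.choices[0].message.content)
```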
Training data is key
Pullen wrote in the technical report that in recent training runs, Genie was trained on billions of tokens of data, with the mix chosen to make the model as proficient as possible in the languages users currently care about most.
Genie's technical report lists 15 languages included in the training data, covering popular languages like Java, JS, C, C++, C#, Rust, Python, as well as commonly used Scala, Kotlin, Swift, PHP, etc.
Among them, JavaScript, Python, TypeScript, and TSX are the most heavily represented, with each of the remaining languages accounting for roughly 3% of the dataset.
Cosine's blog post states that the team spent nearly a year compiling the dataset, including a large amount of software development activities from real engineers.
Acquiring and effectively utilizing this data was extremely difficult because, for all practical purposes, such data didn't exist.
Their data pipeline began by tracking the development trajectories of software engineers, collecting data such as pull requests, commits, and issues from OSS repositories (MIT licensed).
They then ran this data through the pipeline, forensically reconstructing the reasoning behind each change to recover how the humans involved arrived at their final conclusions.
This proprietary dataset was the basis for training the first version of the model, with the remaining work completed through self-play and self-improvement.
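Cosine hasn't released its pipeline, but the collection step it describes maps naturally onto GitHub's public REST API. The sketch below is a hedged illustration of that first stage only; the repo, token, and record structure are placeholders:

```python
# Illustrative first stage of a data pipeline like the one described above:
# harvesting pull requests, commits, and issues from a permissively licensed
# open-source repository via GitHub's public REST API.
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": "Bearer <YOUR_GITHUB_TOKEN>"}  # placeholder

def fetch(owner: str, repo: str, resource: str) -> list[dict]:
    """Fetch one page of pulls, commits, or issues for a repository."""
    params = {"per_page": 100}
    if resource == "pulls":
        params["state"] = "all"  # include merged/closed PRs, not just open ones
    resp = requests.get(f"{GITHUB_API}/repos/{owner}/{repo}/{resource}",
                        headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()

# One raw "trajectory" record per repo; the hard part Cosine describes --
# reconstructing the engineer's reasoning between these events -- begins here.
trajectory = {
    res: fetch("octocat", "Hello-World", res)  # a well-known public demo repo
    for res in ("pulls", "commits", "issues")
}
```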
Genie's autonomy loop consists of four main processes: planning, retrieval, code writing, and code execution. These are not novel in themselves, but have been greatly improved as Genie has been trained to perform tasks like humans.
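The report doesn't include an implementation, but the four stages compose naturally into an iterative agent loop. Everything below (helper names, stopping condition, the stubs themselves) is a hypothetical sketch of that structure, not Cosine's code:

```python
# Hypothetical sketch of a plan -> retrieve -> write -> execute loop.
# The helpers stand in for the four stages named above and are stubbed
# so the control flow runs end to end.
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    errors: str = ""

def make_plan(task: str) -> str:                  # 1. planning (stub)
    return f"plan for: {task}"

def retrieve_code(plan: str) -> str:              # 2. retrieval (stub)
    return "relevant code snippets"

def write_code(plan: str, context: str) -> str:   # 3. code writing (stub)
    return "candidate patch"

def run_tests(patch: str) -> Result:              # 4. code execution (stub)
    return Result(passed=True)

def solve(task: str, max_iterations: int = 10) -> str | None:
    plan = make_plan(task)
    for _ in range(max_iterations):
        context = retrieve_code(plan)
        patch = write_code(plan, context)
        result = run_tests(patch)
        if result.passed:
            return patch
        plan = f"{plan}\nrevise given: {result.errors}"  # feed failures back into planning
    return None  # give up after too many iterations

print(solve("fix the off-by-one error in pagination"))
```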
"The impact of data annotation cannot be underestimated. Obtaining high-quality data from capable software engineers is very difficult, but the results are worth it as it gives us insight into developers' not-easily-discovered problem-solving thought processes."
This dataset not only reflects perfect information flow and progressive knowledge discovery but also captures the gradual decision-making process of human engineers.
Pullen asserts, "By actually training our models using this dataset, rather than simply prompting foundation models (which is what others are doing), we found that we were no longer just randomly generating code, but handling problems like humans do."
Benchmark evaluation results
During model development, the team mainly used two benchmarks for evaluation: SWE-Bench and HumanEval.
The former covers a more comprehensive range of problems, including problem decomposition, finding relevant code, code classification, and implementing viable solutions; the latter focuses more on writing code, without retrieval aspects, and places less emphasis on problem understanding.
However, the official blog disclosed only the SWE-Bench scores: Genie achieved 30.08% on the full SWE-Bench and 50.67% on SWE-Bench Lite.
Genie's SWE-Bench result is particularly impressive: it is the highest score to date, more than 10 percentage points above the previous best of 19.27%.
Additionally, the team separately tested the model's information retrieval capabilities, particularly its ability to retrieve the correct parts of required code files.
This is one of the core components of an AI engineer: if the model can't reliably and proficiently find the right code to edit, its ability to edit code counts for little.
Assuming the model can always locate the correct files, retrieval ability can be measured simply by comparing the number of lines of code a task requires against the number the model actually found.
In the test, Genie retrieved 91,475 of the required 142,338 lines of code, a score of 64.27%. There is clearly still plenty of room for improvement here, though retrieval has so far received less attention than problem decomposition.
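That score is simply line-level recall, which can be reproduced in a couple of lines (figures as quoted in the blog post):

```python
# Line-level retrieval recall, using the figures quoted above.
required_lines = 142_338
retrieved_lines = 91_475
recall = retrieved_lines / required_lines
print(f"{recall:.2%}")  # -> 64.27%
```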
02 Backed by YC, led by a Chinese Oxford graduate
Cosine emerged from the famous Silicon Valley startup accelerator Y Combinator.
In 2022, Alistair Pullen, Sam Stenner, and Yang Li co-founded Cosine, positioning it as a human reasoning laboratory.
Starting from the field of software engineering, they aim to study and codify how humans perform tasks, teaching AI to imitate, master, and extend them, and thereby advance machine intelligence.
Cosine has