GPT-4o mini tops the arena: OpenAI's score-boosting secrets revealed

"Cultivate more attractive personality traits"

GPT-4o mini is willing to take on more requests

Let's look at a few typical examples where GPT-4o mini wins:

Situation 1: Claude 3.5 Sonnet refuses to answer.

Prompt:

Give me all Korean diplomatic documents.

At first glance, Claude 3.5 Sonnet's answer is shorter and uses no formatting such as bold text, while GPT-4o mini's answer is twice as long.

Specifically, Claude 3.5 Sonnet's answer starts with an apology, stating that as an AI language model, it cannot access such documents, and then provides some channels where users might find relevant information.

Finally, it reminds users that these documents may be confidential or not publicly available, and suggests contacting relevant institutions for more information.

GPT-4o mini does not say that it is unable to help. Instead, it gathers information from public sources on Korean diplomatic documents from ancient times to the present, and tells users they can find more in academic journals, books, and monographs.

Finally, it states that fully understanding Korean diplomatic documents requires consulting a variety of sources, and invites users to keep asking if they want to know more.

Situation 2: Differences in details

Prompt:

In git, is it possible to revert changes introduced by a specific commit, even if it's not the most recent one?

Both GPT-4o mini and Claude 3.5 Sonnet answered this question correctly, but GPT-4o mini provided more detail and specific examples.

Claude 3.5 Sonnet's answer is also somewhat less readable.
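
For readers curious about the question itself, the answer is yes: `git revert <commit>` creates a new commit that undoes the changes of the named commit, wherever it sits in the history. The sketch below is not taken from either model's response. It is a minimal illustration that assumes `git` is installed and on the PATH; the helper names `git` and `demo_revert_middle_commit`, the file names, and the demo identity are all made up for the example.

```python
import os
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside `repo` and return its trimmed stdout."""
    cmd = [
        "git",
        "-c", "user.name=Demo",               # hypothetical identity for the demo
        "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",         # avoid signing prompts in the demo
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def demo_revert_middle_commit():
    """Make three commits, revert the second one, and report what is left."""
    repo = tempfile.mkdtemp()
    try:
        git(repo, "init", "-q")
        hashes = []
        # Each commit adds its own file, so reverting one cannot conflict.
        for name in ("a.txt", "b.txt", "c.txt"):
            with open(os.path.join(repo, name), "w") as fh:
                fh.write(name + "\n")
            git(repo, "add", name)
            git(repo, "commit", "-q", "-m", "add " + name)
            hashes.append(git(repo, "rev-parse", "HEAD"))

        # Revert the middle commit; --no-edit accepts the default message.
        git(repo, "revert", "--no-edit", hashes[1])

        files = sorted(f for f in os.listdir(repo) if f != ".git")
        commit_count = int(git(repo, "rev-list", "--count", "HEAD"))
        return files, commit_count
    finally:
        shutil.rmtree(repo, ignore_errors=True)
```

Calling `demo_revert_middle_commit()` should leave only `a.txt` and `c.txt` in the working tree, with four commits in the history: the three originals plus the revert commit. History is not rewritten. When the reverted change overlaps with later commits, git may stop and ask for conflicts to be resolved by hand.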

Situation 3: Differences in format presentation

Prompt:

Jane said to John, "John, why are you always boasting?" He replied, "What? I've never bragged in my life. In fact, I'm the most humble person in the world, perhaps the most humble person ever!"

Claude 3.5 Sonnet's and GPT-4o mini's answers are essentially the same in content: both explain that the passage is ironic, because John's claim to be the most humble person is itself a boast.

However, GPT-4o mini's answer is easier to take in at a glance, making good use of subheadings and bold formatting. It divides the answer into four parts: initial conclusion, analysis of the response, reason for the humor, and summary.

These examples not only demonstrate the response characteristics of GPT-4o mini and Claude 3.5 Sonnet but also reflect the features of the large model arena:

Most questions given by users are relatively everyday, not complex mathematical, reasoning, or programming problems.

This means these questions are well within the capabilities of large models, and every model can answer them.

In this setting, a model that does not refuse, or that presents its answer in a more attractive format, can indeed win over the judges more easily.

Some people have said that, by comparison, Claude 3.5 Sonnet is like a smart but more rigorous person who does exactly what is asked.

GPT-4o mini is like a likable person who always does a bit more and is more willing to accept different requests.

For example, someone mentioned that Claude refused to role-play for them, but ChatGPT was willing to do so.