Gemini 1.5 Pro (0801) marks the first time Google has taken the top spot in the LMSYS Chatbot Arena (it also ranks first on Chinese tasks).
Moreover, it's a double crown this time: the model ranks first not only on the overall leaderboard (the only score above 1300) but also on the vision leaderboard.
Simon Tokumine, a key figure in the Gemini team, posted to celebrate:
(This new model) is the most powerful and smartest Gemini we've ever made.
A Reddit user also described the model as "very good" and expressed hope that its capabilities won't be reduced.
Even more netizens excitedly declared that OpenAI is finally facing a real challenge and will have to release a new version in response!
The official ChatGPT account also came out hinting at something.
Amid the excitement, the head of product for Google AI Studio announced that the model has entered a free testing phase:
It can be used for free in AI Studio.
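For readers who would rather call it programmatically, here is a minimal sketch using the google-generativeai Python SDK with a (free) AI Studio API key; the prompt is a placeholder, and the model ID "gemini-1.5-pro-exp-0801" is the name the experimental build was listed under at the time.

```python
# Minimal sketch: querying the experimental build via the Gemini API.
# Assumes `pip install google-generativeai` and a free AI Studio API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")
response = model.generate_content("In one sentence, what does the LMSYS Chatbot Arena measure?")
print(response.text)
```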
### Netizens: Google has finally arrived!
Strictly speaking, Gemini 1.5 Pro (0801) isn't really a new model.
This experimental version is built on the Gemini 1.5 Pro that Google released in February; the 1.5 series later expanded the context window to 2 million tokens.
As the model updates, the naming has gotten longer and longer, sparking a wave of mockery.
An OpenAI employee, while offering congratulations, couldn't resist taking a sarcastic jab:
Of course, although the name is hard to remember, Gemini 1.5 Pro (0801) performed impressively in the official arena evaluation this time.
The overall win-rate heat map shows it with a 54% win rate against GPT-4o and 59% against Claude 3.5 Sonnet.
In the multilingual capability tests, it ranked first in Chinese, Japanese, German, and Russian.
However, in the Coding and Hard Prompts arenas, it still can't beat rivals such as Claude 3.5 Sonnet, GPT-4o, and Llama 3.1 405B.
Netizens have criticized exactly this point:
Coding is what matters most, yet that's where it performs poorly.
However, some people have come out to praise Gemini 1.5 Pro (0801)'s image and PDF extraction capabilities.
DAIR.AI co-founder Elvis personally conducted a full set of tests on YouTube and concluded:
Visual capabilities are very close to GPT-4o.
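As a rough illustration of the kind of image/PDF extraction being praised, here is a hedged sketch using the same SDK's File API; the file name and prompt are placeholders, not Elvis's actual test.

```python
# Sketch: feeding a PDF (or image) to the model for extraction.
# upload_file pushes the document through the Gemini File API.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")

doc = genai.upload_file("invoice.pdf")  # placeholder file; images work the same way
response = model.generate_content([doc, "Extract the line items as a Markdown table."])
print(response.text)
```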
Also, someone used Gemini 1.5 Pro (0801) to solve questions that Claude 3.5 Sonnet previously couldn't answer well.
The result showed that it not only performed better but also beat its own teammate Gemini 1.5 Flash.
However, it still stumbles on some classic common-sense tests, such as "write ten sentences that end with the word 'apple'".
### One More Thing
Meanwhile, Google's Gemma 2 series has gained a new 2-billion-parameter model.
Gemma 2 (2B) works out of the box and can run on Google Colab's free T4 GPU.
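As a quick sketch of what "runs on a free T4" can look like in practice, the instruction-tuned checkpoint can be loaded with Hugging Face transformers roughly as follows (this assumes you have accepted the Gemma license on Hugging Face; half precision is used here to fit the T4's 16 GB of VRAM).

```python
# Sketch: running Gemma 2 (2B, instruction-tuned) on a Colab T4.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.float16,  # half precision to fit the T4's 16 GB
    device=0,                   # the Colab GPU
)

out = pipe("Explain what the LMSYS Chatbot Arena measures.", max_new_tokens=128)
print(out[0]["generated_text"])
```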
On the arena leaderboard, it surpassed all GPT-3.5 models and even Mixtral-8x7B.
Faced with Google's latest rankings, many have once again questioned the authority of the arena leaderboard.
Teknium, co-founder of Nous Research (a well-known player in the field of fine-tuning and training), posted a reminder:
Although Gemma 2 (2B) scores higher than GPT-3.5 Turbo in the arena, it's far below the latter on MMLU. This discrepancy would be concerning if people use arena rankings as the sole metric for model performance.
Bindu Reddy, CEO of Abacus.AI, directly called for:
Please stop using this human evaluation leaderboard immediately! Claude 3.5 Sonnet is much better than GPT-4o-mini. Similarly, Gemini/Gemma shouldn't score so high on this leaderboard.
So, do you think this method of anonymous human voting is still reliable? (Welcome to discuss in the comments section)