AI tutors interact with the models across a variety of tasks designed to simulate real-world use of Grok. In each interaction, the tutor selects the better of two responses generated by Grok, judged against xAI's stated evaluation criteria.
The results show that, compared with Grok-2 mini and Grok-1.5, Grok-2 has improved markedly in reasoning over retrieved content and in tool use, for example correctly identifying missing information, reasoning through sequences of events, and filtering out irrelevant content.
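xAI has not published the scoring details, but a pairwise preference evaluation of this kind typically boils down to tallying per-model win rates across comparisons. The sketch below is purely illustrative: the model labels, the sample judgments, and the `win_rates` helper are hypothetical, not xAI's actual pipeline.

```python
from collections import Counter

def win_rates(judgments):
    """Tally pairwise preference judgments into per-model win rates.

    `judgments` is a list of (model_a, model_b, winner) tuples, where
    `winner` names the response the evaluator preferred.
    """
    wins = Counter()
    appearances = Counter()
    for model_a, model_b, winner in judgments:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    # Win rate = comparisons won / comparisons participated in
    return {model: wins[model] / appearances[model] for model in appearances}

# Toy data with made-up outcomes, just to show the shape of the computation
judgments = [
    ("grok-2", "grok-2-mini", "grok-2"),
    ("grok-2", "grok-1.5", "grok-2"),
    ("grok-2-mini", "grok-1.5", "grok-1.5"),
]
print(win_rates(judgments))
```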
Benchmark results indicate that Grok-2 generally surpasses GPT-4 Turbo and Claude 3 Opus, and even rivals GPT-4o and Llama 3.1 405B.
However, the benchmark results xAI disclosed are somewhat "clever." For instance, while claiming parity with GPT-4o, xAI compared against GPT-4o and GPT-4 Turbo scores from May, which makes it hard not to suspect the comparison window was chosen to flatter Grok-2.
xAI team member Guodong Zhang posted:
Interestingly, unlike most other companies and labs, our development speed is so fast that we never have time to write formal technical reports for each model.
Additionally, xAI specifically noted that on the Massive Multitask Language Understanding (MMLU) benchmark, it evaluated Grok-2 without task-specific training, which better reflects the model's ability to generalize and adapt to new tasks. In short, it may not be the best, but it prides itself on being authentic.
Now, both Grok-2 and Grok-2 mini will be gradually integrated into the X platform, available to X Premium and Premium+ users.
According to xAI, Grok-2 offers both text and visual understanding and can incorporate real-time information from the X platform, while Grok-2 mini is a compact variant that balances speed with answer quality.
Compared with its predecessor, Grok-2's biggest change is that it can generate images directly. According to xAI team members, image generation is powered by Black Forest Labs' recently popular FLUX.1 model.
Users have found that Grok-2 limits how many images it will generate: Premium users can expect roughly 20-30 images, while Premium+ users get a higher allowance.
The classic trap question "Which is bigger, 9.8 or 9.11?" didn't stump Grok-2, and it can even count how many "r"s are in "strawberry".
An excited Musk retweeted several posts about Grok-2, promoting it heavily and praising the xAI team's rapid pace of progress.
Looking beyond the hype, Grok-2 seems more significant symbolically than practically. Its release signals that another GPT-4-class model has arrived, but it brings few real surprises.
In April this year, Musk told Nicolai Tangen, head of Norway's sovereign wealth fund, that Grok-2 required about 20,000 H100 GPUs for training.
Last month, in the run-up to Grok-2's release, Musk also revealed that Grok-3 is being trained on 100,000 NVIDIA H100 chips and is expected to be released by the end of the year, potentially becoming the most powerful large AI model.
To that end, Musk even diverted NVIDIA chips originally earmarked for Tesla to xAI, much to the dissatisfaction of Tesla investors.
Notably, in a recent X Spaces session, Musk remained confident about the future of AI.