If you asked the general public what the best AI model is, chances are good most people would say ChatGPT. While there are many players on the scene in 2024, OpenAI’s chatbot is the one that really broke through and introduced powerful generative AI to the masses. And as it happens, the Large Language Model (LLM) underlying ChatGPT, GPT, has consistently ranked as the top performer among its peers, from the introduction of GPT-3.5, to GPT-4, and currently, GPT-4 Turbo.
But the tide seems to be turning: This week, Claude 3 Opus, Anthropic’s LLM, overtook GPT-4 on Chatbot Arena for the first time, prompting app developer Nick Dobos to declare, “The king is dead.” As of this writing, Claude still has the edge over GPT: Claude 3 Opus holds an Arena Elo rating of 1253, while GPT-4-1106-preview sits at 1251, followed closely by GPT-4-0125-preview at 1248.
For what it’s worth, Chatbot Arena ranks all three of these LLMs in first place, but Claude 3 Opus does have the slight advantage.
Anthropic’s other LLMs are performing well, too. Claude 3 Sonnet sits fifth on the list, just below Google’s Gemini Pro (the leaderboard ties both at fourth place), while Claude 3 Haiku, Anthropic’s lower-end LLM for efficient processing, ranks just below one version of GPT-4 (0613), but just above another version of the model.
How Chatbot Arena ranks LLMs
To rank the various LLMs currently available, Chatbot Arena asks users to enter a prompt and judge how two different, unnamed models respond. Users can continue chatting to evaluate the differences between the two until they decide which model performed better. Users don’t know which models they’re comparing (you could be pitting Claude against ChatGPT, Gemini against Meta’s Llama, etc.), which eliminates any bias due to brand preference.
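Those head-to-head votes are what feed the Arena Elo ratings cited above. Here’s a minimal sketch of how a single vote nudges two models’ scores under a standard Elo update; this is an illustration, not Chatbot Arena’s actual code, and the K-factor of 32 and the 1000-point starting ratings are assumptions for the example:

```python
# Minimal sketch of a standard Elo update driven by one head-to-head
# vote. Not Chatbot Arena's actual implementation; the K-factor of 32
# and the 1000-point starting ratings are illustrative assumptions.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after one vote."""
    # Expected probability that model A wins, given the rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    # Elo is zero-sum: whatever A gains, B loses.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start even at 1000; a user votes for model A.
print(elo_update(1000.0, 1000.0, a_won=True))   # (1016.0, 984.0)

# A rematch where the now-lower-rated model B wins back more points.
print(elo_update(1016.0, 984.0, a_won=False))   # (~998.5, ~1001.5)
```

Because an upset win over a higher-rated model moves the numbers more than a win over an equal, hundreds of thousands of individual votes settle into the stable, tightly clustered ratings you see on the leaderboard.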
Unlike other types of benchmarking, however, there is no true rubric for users to rate their anonymous models against. Users can simply decide for themselves which LLM performs better, based on whatever metrics they themselves care about. As AI researcher Simon Willison tells Ars Technica, much of what makes LLMs perform better in the eyes of users is more about “vibes” than anything else. If you like the way Claude responds more than ChatGPT, that’s all that really matters.
Above all, it’s a testament to how powerful these LLMs have become. If you had run this same test a few years ago, you would likely have wanted more standardized data to identify which LLM was stronger, whether in speed, accuracy, or coherence. Now, Claude, ChatGPT, and Gemini are getting so good that they’re almost interchangeable, at least as far as general generative AI use goes.
While it’s impressive that Claude has surpassed OpenAI’s LLM for the first time, it’s arguably more impressive that GPT-4 held out this long. The model itself is a year old, iterative updates like GPT-4 Turbo aside, while Claude 3 launched this month. Who knows what will happen when OpenAI rolls out GPT-5, which, at least according to one anonymous CEO, is “…really good, like materially better.” For now, there are multiple generative AI models, each just about as effective as the next.
Chatbot Arena has amassed over 400,000 human votes to rank these LLMs. You can try out the test for yourself and add your voice to the rankings.