Copilot Arena Helps Rank Real-World LLM Coding Abilities

Tuesday, April 22, 2025

With so many AI coding assistants out there, it can be hard to keep track of ones that perform well on real-world tasks. To help analyze which leading or emerging code-writing large language models (LLMs) the developer community prefers, researchers at Carnegie Mellon University developed Copilot Arena, a platform that crowdsources user ratings of LLM-written code.

AI coding assistants that can generate code on their own can make a difficult and time-consuming process easier and faster. However, even the slightest error — to which AI coding assistants can be prone — could derail an entire project.

Copilot Arena has been downloaded more than 11,200 times since launching in September and has served its users more than 4.5 million suggestions from 10 models. The tool has around 500 daily users and more than 3,000 unique users overall.

Valerie Chen and Wayne Chi, graduate students in CMU's School of Computer Science, co-led the development of Copilot Arena. They developed the tool in partnership with the creators of Chatbot Arena, a similar LLM-ranking platform that predicted the rising popularity of DeepSeek before it went mainstream. The team published the first set of findings from the platform as a preprint on arXiv, in a blog post on the Copilot Arena website, and on Carnegie Mellon's Machine Learning Blog.

The team believes free, open-source tools like Copilot Arena are important for model developers trying to improve their AI coding assistants in a way that benefits actual users.

The setup is simple. Software developers using Visual Studio Code (VS Code) can download and access the Copilot Arena extension while they work. They can ask Copilot Arena to help them with a certain section of code, and Copilot Arena offers two options to choose from without telling the user which LLMs provided the responses. The LLMs are then ranked on the backend based on how frequently users chose their outputs.
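The article doesn't spell out how those votes become rankings, so the following is only an illustrative sketch of one common way pairwise votes are turned into a leaderboard, using an Elo-style rating update in Python. The model names, starting rating, and step size below are placeholders, and Copilot Arena's actual backend may use a different method.

# Illustrative sketch only: turning pairwise user votes into a leaderboard
# with an Elo-style update. Copilot Arena's real ranking method may differ.
from collections import defaultdict

K = 32          # update step size (assumed for illustration)
BASE = 1000.0   # starting rating for every model (assumed for illustration)

ratings = defaultdict(lambda: BASE)

def record_vote(winner: str, loser: str) -> None:
    """Update ratings after a user picks `winner`'s suggestion over `loser`'s."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Hypothetical votes; model names are placeholders, not real results.
for w, l in [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]:
    record_vote(w, l)

# Print models from highest rating to lowest.
for name, score in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")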

"User votes help build this leaderboard of which models are best for coding applications and also give us insights into how people are interacting with these kinds of AI tools," said Chen, a Ph.D. student in the Machine Learning Department.

The AI models they've tested include Google's Gemini, DeepSeek, Anthropic's Claude, OpenAI's GPT models, Meta's Llama, Alibaba Cloud's Qwen, and Mistral AI's Codestral. New models are added to the platform as they're released.

"We've also got a lot of interest from people building these code-generation models. Recently, a startup named Inception Labs created a new model called Mercury Coder," said Chi, a Ph.D. student in the Computer Science Department. "They reached out to us so they could test their model on our platform. This was an unreleased model at the time, and they wanted to demonstrate that their model was fast and of good quality."

DeepSeek and Anthropic's Claude Sonnet currently sit atop the leaderboard. Chi noted that this parallels what developers in online communities have said are their favorite models for coding.

During their research, the CMU team found that the models topping Copilot Arena's leaderboard differed from those favored by existing evaluation approaches. Those approaches rely on static benchmarks, such as simple functional problems or tasks from LeetCode, an online platform with preset questions for coding practice, rather than on the user preferences Copilot Arena captures.

Evaluating AI coding models while users work on real-world problems has allowed the Copilot Arena team to analyze in more detail where specific models have an advantage and where they may fall short. For example, many existing benchmarks tend to test AI models on short problems with only a few lines of existing code, or no code at all. But that's not indicative of what happens in the real world.

"In practice, when people write code, they might keep everything in one big file. Models have to be able to handle that," Chen said. "We see that models like DeepSeek tend to be better at that than smaller models like Qwen."

As another example, Chi said Claude Sonnet performs better than other models at front-end tasks like website development and at back-end tasks like data and infrastructure management.

"Evaluation is probably the most important problem in all of machine learning right now," said Ameet Talwalkar, a machine learning professor at CMU and an adviser on the project. "You want to evaluate these models in the most realistic settings possible. You also want to be scalable."

For the better part of the last two decades, computer scientists worked mostly on classification problems, like whether a cat appeared in an image. These problems are hard for computers but easy to evaluate, Talwalkar said. People simply need to manually annotate whether there is a cat in the image and compare it to the model's prediction.

"Now we're asking models to do these incredibly complicated things and, for coding in particular, evaluation is hard," he said.

Copilot Arena tries to capture rigorous, realistic interactions between AI models and human developers in a way that can also be scaled across multiple users and use cases, Talwalkar said. "It has a lot of value."

On top of evaluating models, tools like Copilot Arena may help researchers study the changing nature of programming.

"We are in the middle of a dramatic pivot from developers writing code manually to AI assistance being ubiquitous," said Chris Donahue, an assistant professor in CSD and an adviser on the project. "Not only does Copilot Arena help us better understand the implications of this shift for downstream aspects of code like reliability, security and maintainability, but it also allows us to study the fundamental shift in human and computer agency in programming."

The tool itself is a work in progress. Chen and Chi plan to expand Copilot Arena's features to support code editing and agentic systems.

Media Contact:

Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu