October 7, 2025

Salesforce's MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks



The adoption of interoperability standards such as the Model Context Protocol (MCP) can give enterprises visibility into how agents and models operate beyond their own walls. However, many benchmarks fail to capture real interactions with MCP servers.

Salesforce AI Research has developed a new open-source benchmark it calls MCP-Universe, which aims to track how LLMs interact with MCP servers in the real world, arguing that it paints a more accurate picture of how models actually behave with the tools enterprises really use. In its initial tests, the team found that even models like OpenAI's recently released GPT-5 are strong, but still fall short in real-world scenarios.

“Existing benchmarks mainly focus on isolated aspects of LLM performance, such as instruction following, mathematical reasoning or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper.

MCP-Universe captures model performance across tool use, multi-turn tool calls, long context windows and large tool sets. It is built on existing MCP servers with access to real data sources and environments.
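For readers unfamiliar with how such a harness operates: a benchmark of this kind has to drive a model through repeated rounds of tool selection, execution against a live MCP server, and observation of the result. The sketch below is a minimal illustration of that loop, not MCP-Universe's actual code; it assumes the official `mcp` Python SDK, `call_llm` is a placeholder for whichever model API is under test, and the Google Maps server package name is illustrative.

```python
# Minimal sketch of a multi-turn MCP tool-calling loop, in the spirit of
# what a benchmark like MCP-Universe must drive. Not the benchmark's code.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

def call_llm(messages, tools):
    # Placeholder for the model under test: given the conversation so far
    # and the advertised tools, return either a tool call or a final answer.
    raise NotImplementedError

async def run_task(task_prompt: str, max_turns: int = 10) -> str:
    # Launch a real MCP server over stdio (package name is illustrative).
    server = StdioServerParameters(
        command="npx", args=["-y", "@modelcontextprotocol/server-google-maps"]
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = (await session.list_tools()).tools  # expose tools to the model
            messages = [{"role": "user", "content": task_prompt}]
            for _ in range(max_turns):
                reply = call_llm(messages, tools)
                if "final_answer" in reply:
                    return reply["final_answer"]
                # Execute the requested tool on the live server, feed back the result.
                result = await session.call_tool(reply["tool"], reply["arguments"])
                messages.append({"role": "tool", "content": str(result.content)})
            return "gave up: exceeded max turns"

# asyncio.run(run_task("Plan a route from Austin to Dallas with one stop."))
```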




Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.”

“Two of the biggest are long-context challenges, where models can lose track of information or struggle to reason coherently when handling very long or complex inputs,” Li said, “and unknown tools, where models often cannot seamlessly pick up unfamiliar tools or systems the way humans can adapt on the fly. It's not about a DIY approach with a single model powering agents, but rather about relying on a platform that combines data context, enhanced reasoning and trusted guardrails to truly meet the needs of enterprise AI.”

MCP-Universe joins other MCP benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi'an Jiaotong University, as well as MCPWorld from the Beijing University of Posts and Telecommunications. It also builds on MCPEval, which Salesforce released in July and which focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEval is that the latter is evaluated with synthetic tasks.

How it works

MCP-Universe assesses how well each model performs a series of tasks that mimic those done by enterprises. Salesforce said it designed MCP-Universe to encompass six core enterprise domains: location navigation, repository management, financial analysis, 3D design, browser automation and web search. The benchmark draws on 11 MCP servers for a total of 231 tasks.

  • Location navigation focuses on geographic reasoning and the execution of spatial tasks. The researchers leveraged the Google Maps MCP server for this domain.
  • The repository management domain examines core code operations and connects to the GitHub MCP server, which exposes version-control tools such as search, issue tracking and code editing.
  • Financial analysis connects to the Yahoo Finance MCP server to assess quantitative reasoning and financial-market decision-making.
  • 3D design assesses the use of computer-aided design tools via the Blender MCP server.
  • Browser automation, connected to the Playwright MCP server, tests browser interaction.
  • The web search domain uses the Google Search MCP server and the Fetch MCP server to test open-domain information seeking, and is structured as a more open-ended task.

Salesforce said it had to design new MCP tasks that reflect real-world use cases. For each domain, the researchers created four to five task types they believe LLMs can realistically perform. For example, they gave the models an objective that involved planning routes, identifying optimal stops, and then locating the destination.
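The exact task schema lives in the open-source MCP-Universe repo; purely for illustration, a location-navigation task of the kind just described might be specified along these lines. Every field name and checker identifier below is an assumption for exposition, not the benchmark's actual format.

```python
# Illustrative task specification in the style described above; the field
# names and checker identifiers are hypothetical, not MCP-Universe's schema.
task = {
    "domain": "location_navigation",
    "mcp_servers": ["google-maps"],
    "prompt": (
        "Plan a driving route from San Francisco to Los Angeles, identify "
        "two optimal rest stops, and report the address of the destination."
    ),
    # Each task carries machine-checkable evaluators (detailed below) so it
    # can be graded by executing checks rather than by an LLM judge.
    "evaluators": [
        {"type": "format", "check": "json_keys", "keys": ["route", "stops", "address"]},
        {"type": "dynamic", "check": "address_matches_live_lookup"},
    ],
}
```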

Each model is assessed on how it completed the tasks. Li and his team chose to follow an execution-based evaluation paradigm rather than the more common LLM-as-a-judge approach. The researchers noted that the LLM-as-a-judge paradigm “is not well suited to our MCP-Universe scenario, because some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.”

Salesforce's researchers used three types of evaluators: format evaluators to check whether agents and models follow format requirements, static evaluators to check answers whose ground truth does not change over time, and dynamic evaluators for answers that fluctuate, such as flight prices or GitHub issues.
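To make the distinction concrete, here is a minimal interpretation of those three evaluator styles. This is a sketch of the idea, not Salesforce's code, and `fetch_live_quote` is a hypothetical helper standing in for a live market-data lookup.

```python
import json

def format_evaluator(answer: str) -> bool:
    # Format check: the agent was told to answer in JSON with a "price" key.
    try:
        return "price" in json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return False

def static_evaluator(answer: str, expected: str) -> bool:
    # Static check: compare against ground truth that never changes,
    # e.g. the street address of a fixed landmark.
    return answer.strip().lower() == expected.strip().lower()

def fetch_live_quote(ticker: str) -> float:
    # Hypothetical helper: would query a live source (e.g. Yahoo Finance)
    # at grading time; stubbed out here.
    raise NotImplementedError

def dynamic_evaluator(answer: str) -> bool:
    # Dynamic check: the correct value (a stock quote, a flight price, an
    # open GitHub issue count) fluctuates, so re-query it when grading.
    return abs(json.loads(answer)["price"] - fetch_live_quote("AAPL")) < 0.01
```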

“MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can rigorously test the agent in complex scenarios,” Li said.

Even large models struggle

To test MCP-Universe, Salesforce evaluated several popular proprietary and open-source models. These include xAI's Grok-4, Anthropic's Claude-4 Sonnet and Claude 3.7 Sonnet, OpenAI's GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and gpt-oss, Google's Gemini 2.5 Pro and Gemini 2.5 Flash, Z.ai's GLM-4.5, Moonshot's Kimi K2, Alibaba's Qwen3-235B-A22B-Instruct-2507 and Qwen3-Coder, and DeepSeek's DeepSeek-V3-0324. Each model tested had at least 120B parameters.

In its tests, Salesforce found that GPT-5 had the best success rate, especially on financial analysis tasks. Grok-4 followed, beating all other models on browser automation, and Claude-4.0 Sonnet rounded out the top three, although it did not outperform either of the models above it in any category. Among open-source models, GLM-4.5 did best.

However, MCP-Universe showed that models struggle to handle long contexts, particularly in location navigation, browser automation and financial analysis, where performance drops considerably. When LLMs encounter unfamiliar tools, their performance also declines. Overall, the LLMs failed to complete more than half of the tasks that enterprises typically perform.

“These results emphasize that current frontier LLMs are not yet capable of reliably performing the diverse, real-world MCP tasks. Our MCP-Universe benchmark therefore provides a challenging and necessary testbed for evaluating LLM performance in areas poorly served by existing benchmarks,” the paper said.

Li told VentureBeat he hopes enterprises will use MCP-Universe to better understand where agents and models fail on tasks, so they can improve their frameworks or their MCP tool implementations.


