Salesforce's new CoAct-1 writes its own code to accomplish tasks

Researchers from Salesforce and the University of Southern California have developed a new technique that lets computer-use agents execute code while navigating graphical user interfaces (GUIs). In other words, an agent can write scripts as well as move a cursor and click buttons in an application, combining the best of both approaches to speed up workflows and reduce errors.
This hybrid approach lets an agent bypass brittle, inefficient mouse clicks for tasks that are better accomplished through code.
The system, called CoAct-1, sets a new state of the art on key agent benchmarks, outperforming other methods while requiring far fewer steps to accomplish complex tasks on a computer.
This advance could pave the way for more robust and scalable agent automation with significant potential for real-world applications.
The fragility of point-and-click AI agents
Computer-use agents generally rely on vision-language or vision-language-action models (VLMs or VLAs) to perceive a screen and act on it, imitating the way a person uses a mouse and keyboard.
Although these GUI-based agents can perform a variety of tasks, they often falter when faced with long, complex workflows, especially in applications with dense menus and options, such as office productivity suites.
For example, a task that involves locating a specific table in a spreadsheet, filtering it and saving it as a new file can require a long and precise sequence of GUI manipulations.
This is where brittleness creeps in. "In these scenarios, existing agents frequently struggle with visual grounding ambiguity (for example, distinguishing between visually similar icons or menu items) and the accumulated probability of making a single error over the long horizon," the researchers write in their paper. "A single misclick or misunderstood UI element can derail the entire task."
To address these challenges, many researchers have focused on augmenting GUI agents with high-level planners.
These systems use powerful reasoning models such as OpenAI's o3 to break down a user's high-level goal into a sequence of smaller, more manageable subtasks.
Although this structured approach improves performance, the agent still has to navigate menus and click buttons, even for operations that could be carried out more directly and reliably with a few lines of code.
CoAct-1: a multi-agent team for computer tasks
To overcome these limitations, the researchers created CoAct-1 (Computer-using Agent with Coding as Actions), a system designed to "combine the intuitive, human-like strengths of GUI manipulation with the precision, reliability and efficiency of direct system interaction through code."
The system is structured as a team of three specialized agents working together: an orchestrator, a programmer and a GUI operator.

The orchestrator acts as a central planner or project manager. It analyzes the user's overall goal, breaks it down into subtasks and assigns each subtask to the agent best suited to the job. It can delegate backend operations such as file management or data processing to the programmer, which writes and executes Python or Bash scripts.
For frontend tasks that require clicking buttons or navigating visual interfaces, it turns to the GUI operator, a VLM-based agent.
"This dynamic delegation allows CoAct-1 to strategically bypass inefficient GUI sequences in favor of robust, single-shot code execution where appropriate, while still leveraging visual interaction for tasks where it is indispensable," the paper states.
The workflow is iterative. Once the programmer or the GUI operator has completed a subtask, it sends a summary and a screenshot of the current system state back to the orchestrator, which then decides on the next step or concludes the task.
The programmer agent uses an LLM to generate its code and sends commands to a code interpreter to test and refine that code over several rounds.
Likewise, the GUI operator uses an action interpreter that executes its commands (for example, mouse clicks and keystrokes) and returns the resulting screenshot, allowing it to see the outcome of its actions. The orchestrator makes the final decision on whether the task should continue or stop.
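The delegation loop described above can be sketched in a few lines of Python. This is a toy illustration, not the researchers' implementation: the names Report, Subtask, run_orchestrator and demo_planner are hypothetical, and the planner and workers are plain stub functions standing in for the LLM and VLM calls a real system would make.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Report:
    """What a worker sends back to the orchestrator (hypothetical shape)."""
    summary: str
    screenshot: bytes  # current screen state; empty in this toy example


@dataclass
class Subtask:
    description: str
    worker: str  # "programmer" or "gui_operator"


Worker = Callable[[str], Report]
Planner = Callable[[str, List[Report]], Optional[Subtask]]


def run_orchestrator(goal: str, plan_next: Planner,
                     workers: Dict[str, Worker],
                     max_rounds: int = 10) -> List[Report]:
    """Delegate subtasks until the planner returns None or rounds run out."""
    history: List[Report] = []
    for _ in range(max_rounds):
        subtask = plan_next(goal, history)
        if subtask is None:  # the orchestrator decides the task is finished
            break
        report = workers[subtask.worker](subtask.description)
        history.append(report)  # the report informs the next decision
    return history


# Stub planner: a fixed two-step plan instead of a reasoning model.
def demo_planner(goal: str, history: List[Report]) -> Optional[Subtask]:
    steps = [
        Subtask("filter the table and save a copy", "programmer"),
        Subtask("confirm the export dialog", "gui_operator"),
    ]
    return steps[len(history)] if len(history) < len(steps) else None


def programmer(task: str) -> Report:
    return Report(f"ran script for: {task}", b"")


def gui_operator(task: str) -> Report:
    return Report(f"clicked through: {task}", b"")


history = run_orchestrator(
    "export filtered table",
    demo_planner,
    {"programmer": programmer, "gui_operator": gui_operator},
)
for report in history:
    print(report.summary)
```

The max_rounds cap is a design choice of this sketch rather than something the article describes; it simply keeps a misbehaving planner from looping forever.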

A more efficient path to automation
The researchers tested CoAct-1 on OSWorld, a comprehensive benchmark that includes 369 real-world tasks across browsers, IDEs and office applications.
The results show that CoAct-1 sets a new state of the art, reaching a success rate of 60.76%.
Performance gains were largest in categories where programmatic control offers a clear advantage, such as OS-level tasks and multi-application workflows.
For example, consider an OS-level task such as finding all image files in a complex folder structure, resizing them and then compressing the entire directory into a single archive.
A purely GUI-based agent would have to perform a long, fragile sequence of clicks and keystrokes (opening folders, selecting files and navigating menus), with a high chance of error at every stage.
CoAct-1, on the other hand, can delegate the entire workflow to its programmer agent, which can accomplish the task with a single robust script.
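As a rough idea of what such a script might look like, here is a minimal Python sketch using only the standard library. It is an illustration, not code produced by CoAct-1. The function names, the list of image suffixes and the resize hook are assumptions: the standard library has no image resizing, so resize is a pluggable callback that does nothing by default and would be supplied by an imaging library in practice.

```python
import shutil
from pathlib import Path
from typing import Callable, List

# Assumed set of extensions that count as "image files" for this sketch.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_images(root: Path) -> List[Path]:
    """Recursively collect image files under root, matched by extension."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def process_and_archive(root: Path, archive_base: Path,
                        resize: Callable[[Path], None] = lambda p: None) -> Path:
    """Apply resize to every image under root, then zip the whole directory.

    archive_base is the archive path without the ".zip" extension. It should
    live outside root so the archive does not try to include itself.
    """
    for image in find_images(root):
        resize(image)  # hypothetical hook; a no-op unless the caller supplies one
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=root))
```

A call such as process_and_archive(Path("photos"), Path("photos_backup")) would write photos_backup.zip next to the photos folder, replacing what would otherwise be dozens of individual GUI actions.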

Beyond a higher success rate, the system is considerably more efficient. CoAct-1 solves tasks in an average of just 10.15 steps, a stark contrast with the 15.22 steps required by leading GUI-only agents such as GTA-1.
While other agents such as OpenAI CUA 4o averaged fewer steps, their overall success rate was much lower, indicating that CoAct-1's efficiency is coupled with greater effectiveness.
The researchers found a clear trend: tasks that require more actions are more likely to fail. Reducing the number of steps not only speeds up task completion but, more importantly, minimizes the opportunities for error.
So, finding ways to compress several GUI steps into a single programmatic action can make the process both more efficient and less error-prone.
As the researchers conclude, "this efficiency underscores the potential of our approach to open a more robust and scalable path toward generalized computer automation."
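The link between step count and failure can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that every step succeeds independently with the same probability, which is a simplification made for illustration and not the model used in the paper. The 97% per-step figure is invented; only the average step counts (10.15 and 15.22) come from the article, and the printed values are not the agents' measured success rates.

```python
def task_success_probability(per_step_success: float, steps: float) -> float:
    """Chance of an error-free run if every step succeeds independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    return per_step_success ** steps


# Assumed per-step reliability, chosen only to illustrate the trend.
ASSUMED_PER_STEP_SUCCESS = 0.97

# Average step counts reported in the article.
for label, steps in [("CoAct-1", 10.15), ("GUI-only (GTA-1)", 15.22)]:
    p = task_success_probability(ASSUMED_PER_STEP_SUCCESS, steps)
    print(f"{label}: {steps} steps -> {p:.1%} chance of an error-free run")
```

Under this simplified model, each extra step multiplies in another chance of failure, so shorter trajectories come out ahead even when the per-step reliability is identical.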

From the lab to the enterprise workflow
The potential of this technology goes beyond general productivity. For business leaders, the key lies in automating complex, multi-tool processes where full API access is a luxury, not a guarantee.
Ran Xu, a co-author of the paper and director of applied AI research at Salesforce, points to customer support as a prime example.
"A service support agent uses many different tools (general tools such as Salesforce, industry-specific tools such as Epic for healthcare, and many customized tools) to investigate a customer request and formulate a response," Xu told VentureBeat. "Some tools have API access while others do not. This is a perfect use case that could potentially benefit from our technology: a computer-use agent that uses whatever is available on the computer, whether it is an API, code or just the screen."
Xu also sees high-value applications in sales, such as large-scale prospecting and automated bookkeeping, and in marketing for tasks such as customer segmentation and campaign asset generation.
Navigating real-world challenges and the need for human oversight
Although the results on the OSWorld benchmark are strong, enterprise environments are far messier, filled with legacy software and unpredictable UIs.
This raises critical questions about robustness, security and the need for human oversight.
A core challenge is ensuring that the orchestrator agent makes the right choice when faced with an unfamiliar application. According to Xu, the path to making agents like CoAct-1 robust for custom enterprise software is to train them with feedback in realistic, simulated environments.
The goal is to create a system where "the agent could observe how human agents work, train in a sandbox, and, once it goes live, continue to solve tasks under the guidance and guardrails of a human agent."
The programmer agent's ability to execute its own code also raises obvious security concerns. What prevents the agent from executing harmful code based on an ambiguous user request?
Xu confirms that robust containment is essential. "Access control and sandboxing are key," he said, stressing that a human must "understand the implication and give access to the AI for security."
Sandboxing and guardrails will be essential for validating the agent's behavior before deployment on critical systems.
Ultimately, for the foreseeable future, overcoming ambiguity will probably require a human in the loop. Asked about handling vague user queries, a concern also raised in the paper, Xu suggested a phased approach. "I see humans in the loop to start," he noted.
Although certain tasks may eventually become fully autonomous, human validation will remain crucial for high-stakes operations. "Some mission-critical ones may always require human approval."