GEPA optimizes LLMs without expensive reinforcement learning

Researchers from the University of California, Berkeley, Stanford University and Databricks have introduced a new AI optimization method called GEPA that considerably outperforms traditional reinforcement learning (RL) techniques for adapting large language models (LLMs) to specialized tasks.
GEPA does away with the popular paradigm of learning through thousands of trial-and-error attempts guided by simple numerical scores. Instead, it uses an LLM's own language understanding to reflect on its performance, diagnose errors and iteratively evolve its instructions. In addition to being more accurate than established techniques, GEPA is far more efficient, achieving better results with up to 35 times fewer trial runs.
For companies building complex agents and workflows, this translates directly into faster development cycles, significantly lower compute costs and more performant, reliable applications.
The high cost of optimizing modern AI systems
Modern enterprise AI applications are rarely a single call to an LLM. They are often “compound AI systems”: complex workflows that chain multiple LLM modules, external tools such as databases or code interpreters, and custom logic to perform sophisticated tasks, including multi-step research and data analysis.
A popular way to optimize these systems is through reinforcement learning methods such as Group Relative Policy Optimization (GRPO), a technique used in popular reasoning models, notably DeepSeek-R1. This method treats the system as a black box: it runs a task, gets a simple success metric (a “scalar reward”, such as a score of 7/10), and uses this feedback to slowly nudge the model's parameters in the right direction.
The major drawback of RL is its sample inefficiency. To learn effectively from these sparse numerical scores, RL methods often require tens of thousands, even hundreds of thousands, of trial runs, called “rollouts”. For any real-world enterprise application that involves costly tool calls (for example, API requests or code compilation) or uses powerful proprietary models, this process is slow and prohibitively expensive.
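The sample-inefficiency problem can be illustrated with a toy scalar-reward optimization loop. This is a hedged sketch: `run_task` and the random-search update below are illustrative stand-ins, not GRPO itself, but they show why a single number per expensive rollout forces so many trials.

```python
import random

random.seed(0)  # deterministic for the sake of the example

def run_task(prompt_params: list[float]) -> float:
    """Stand-in for an expensive rollout (API calls, tool use, etc.).
    The optimizer only ever sees a scalar score, e.g. 0.7 for "7/10"."""
    target = [0.3, 0.8, 0.5]
    error = sum((p - t) ** 2 for p, t in zip(prompt_params, target))
    return max(0.0, 1.0 - error)

params = [0.0, 0.0, 0.0]
best_score = run_task(params)
rollouts = 0

# Random-search stand-in for policy-gradient updates: each step needs a
# fresh (costly) rollout, and the single number carries no diagnosis of
# *why* a candidate failed -- hence thousands of rollouts in practice.
while best_score < 0.95 and rollouts < 10_000:
    candidate = [p + random.gauss(0, 0.1) for p in params]
    score = run_task(candidate)
    rollouts += 1
    if score > best_score:
        params, best_score = candidate, score

print(f"score {best_score:.2f} after {rollouts} rollouts")
```

Even this trivial three-parameter toy burns many rollouts; a real compound system, where each rollout involves API calls or code compilation, multiplies that cost accordingly.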
As Lakshya A Agrawal, a co-author of the paper and doctoral student at UC Berkeley, told VentureBeat, this complexity is a major barrier for many companies. “For many teams, RL is not practical because of its cost and complexity, and their go-to approach so far would often be prompt engineering by hand,” Agrawal said. He noted that GEPA is designed for teams that need to optimize systems built on top-tier models that often cannot be fine-tuned, allowing them to improve performance without managing custom GPU clusters.
The researchers frame the challenge as follows: “How can we extract maximal learning signal from every expensive rollout to enable effective adaptation of complex, modular AI systems in low-data or budget-constrained settings?”
An optimizer that learns with language

GEPA (Genetic-Pareto) is a prompt optimizer that tackles this challenge by replacing sparse rewards with rich natural-language feedback. It leverages the fact that the entire execution of an AI system (including its reasoning steps, tool calls and even error messages) can be serialized into text that an LLM can read and understand. GEPA's methodology is built on three core pillars.
The first is “genetic prompt evolution”, where GEPA treats a population of prompts as a gene pool. It iteratively “mutates” prompts to create potentially better new versions. This mutation is an intelligent process driven by the second pillar: “reflection with natural-language feedback”. After a few rollouts, GEPA provides an LLM with the full execution trace (what the system tried to do) and the outcome (what went right or wrong). The LLM then “reflects” on this feedback in natural language to diagnose the problem and write an improved, more detailed prompt. For example, instead of just seeing a low score on a code-generation task, it can analyze a compiler error and conclude that the prompt must specify a particular library version.
The third pillar is “Pareto-based selection”, which ensures intelligent exploration. Instead of focusing only on the single best-performing prompt, which can lead to getting stuck in a sub-optimal solution (a “local optimum”), GEPA maintains a diverse roster of “specialist” prompts. It tracks which prompts perform best on different individual examples, creating a list of top candidates. By sampling from this diverse set of winning strategies, GEPA ensures that it explores more solutions and is more likely to discover a prompt that generalizes well across a wide range of inputs.
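The three pillars can be sketched in miniature as follows. Everything here is a hypothetical stand-in for illustration, not the paper's implementation: `run_system` stubs an expensive rollout, and `reflect_and_rewrite` stands in for the LLM reflection call that would actually read the traces and rewrite the prompt.

```python
import random

random.seed(0)

def run_system(prompt: str, example: dict) -> tuple[float, str]:
    """Run the compound AI system once; return (score, execution trace).
    Stubbed so that longer, more specific prompts score higher."""
    score = min(1.0, len(prompt) / 100)
    trace = f"ran example {example['id']} with a {len(prompt)}-char prompt"
    return score, trace

def reflect_and_rewrite(prompt: str, traces: list[str]) -> str:
    """Stand-in for an LLM reflection call: it would read the traces in
    natural language, diagnose failures, and write an improved prompt."""
    return prompt + " Be explicit about intermediate steps."

examples = [{"id": i} for i in range(4)]
population = ["Answer the question."]  # initial prompt pool

for generation in range(5):
    # Pareto-based selection: track the best prompt per *individual*
    # example, not one global winner, to keep diverse specialists alive.
    best_per_example = {}
    for prompt in population:
        for ex in examples:
            score, _ = run_system(prompt, ex)
            if score > best_per_example.get(ex["id"], (-1.0, ""))[0]:
                best_per_example[ex["id"]] = (score, prompt)
    frontier = list({p for _, p in best_per_example.values()})

    # Reflective mutation: sample a parent from the frontier, gather a
    # few traces, and let the LLM rewrite the prompt from that feedback.
    parent = random.choice(frontier)
    traces = [run_system(parent, ex)[1] for ex in examples[:2]]
    child = reflect_and_rewrite(parent, traces)
    population = frontier + [child]

best = max(population, key=lambda p: sum(run_system(p, ex)[0] for ex in examples))
print(best)
```

The key contrast with the scalar-reward loop is that the mutation step consumes textual traces rather than a lone number, so each rollout contributes far more learning signal.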

The effectiveness of this whole process hinges on what the researchers call “feedback engineering”. Agrawal explains that the key is to surface the rich, textual details that systems already produce but often discard. “Traditional pipelines often reduce this detail to a single numerical reward, obscuring why particular outcomes occur,” he said. “GEPA's core guidance is to structure feedback that surfaces not only the outcomes but also the intermediate trajectories and errors in plain text, the same evidence a human would use to diagnose system behavior.”
For example, for a document-retrieval system, this means listing which documents were retrieved correctly and which were missed, rather than just computing a final score.
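That contrast might be sketched like this; the feedback format below is an illustrative assumption, not the paper's actual code.

```python
# Hedged sketch of "feedback engineering" for a document-retrieval module:
# instead of collapsing the outcome into one number, serialize which
# documents were hit and which were missed into text an LLM can reason over.

def scalar_feedback(retrieved: set[str], relevant: set[str]) -> float:
    """Traditional reward: a single recall score the optimizer sees."""
    return len(retrieved & relevant) / len(relevant)

def textual_feedback(retrieved: set[str], relevant: set[str]) -> str:
    """GEPA-style feedback: name the hits and misses explicitly."""
    hits = sorted(retrieved & relevant)
    misses = sorted(relevant - retrieved)
    extras = sorted(retrieved - relevant)
    return (
        f"Correctly retrieved: {hits}\n"
        f"Missed relevant documents: {misses}\n"
        f"Irrelevant documents retrieved: {extras}"
    )

retrieved = {"doc_a", "doc_c"}
relevant = {"doc_a", "doc_b"}

print(scalar_feedback(retrieved, relevant))   # 0.5 -- opaque
print(textual_feedback(retrieved, relevant))  # names doc_b as the miss
```

A reflecting LLM can act on "doc_b was missed" (for instance, by adding retrieval instructions to the prompt), whereas 0.5 alone says nothing about what to change.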
GEPA in action
The researchers evaluated GEPA across four diverse tasks, including multi-hop question answering (HotpotQA) and privacy-preserving queries (PUPA). They used both open-source (Qwen3 8B) and proprietary (GPT-4.1 mini) models, comparing GEPA against GRPO and the leading prompt optimizer MIPROv2.
Across all tasks, GEPA substantially outperformed GRPO, achieving up to a 19% higher score while using up to 35 times fewer rollouts. Agrawal gave a concrete example of this efficiency gain: “We used GEPA to optimize a QA system in ~3 hours versus GRPO's 24 hours, an 8x reduction in development time, while also achieving 20% higher performance,” he explained. “RL-based optimization of the same scenario in our test cost about $300 in GPU time, while GEPA cost less than $20 for better results, a 15x saving in our experiments.”

Beyond raw performance, the researchers found that GEPA-optimized systems are more reliable when facing new, unseen data. This is measured by the “generalization gap” (the difference between performance on training data and on final test data). Agrawal hypothesizes that this is because GEPA learns from richer feedback. “GEPA's smaller generalization gap may stem from its use of rich natural-language feedback on each outcome (what worked, what failed and why) rather than relying only on a single scalar reward,” he said. “This may encourage the system to develop instructions and strategies grounded in a broader understanding of success, instead of merely learning patterns specific to the training data.” For companies, this improved reliability means less fragile, more adaptable AI applications in customer-facing roles.
A major practical benefit is that GEPA's instruction-based prompts are up to 9.2 times shorter than the prompts produced by optimizers like MIPROv2, which include many few-shot examples. Shorter prompts decrease latency and reduce costs for API-based models. This makes the final application faster and cheaper to run in production.
The paper also presents promising results for using GEPA as an “inference-time” search strategy, turning the AI from a single-answer generator into an iterative problem solver. Agrawal described a scenario where GEPA could be integrated into a company's CI/CD pipeline. When new code is committed, GEPA could automatically generate and refine several optimized versions, test them for performance and open a pull request with the best-performing variant for engineers to review. “This turns optimization into a continuous, automated process, generating solutions that often match or exceed hand-tuned expert performance,” Agrawal noted. In their experiments on CUDA code generation, this approach lifted performance on 20% of tasks to expert level, compared to 0% for a single-shot attempt by GPT-4o.
The paper's authors believe GEPA is a foundational step toward a new AI development paradigm. But beyond building more human-like AI, its most immediate impact may be in who gets to build high-performing systems.
“We expect GEPA to enable a positive shift in AI system building, making the optimization of such systems approachable for end users, who often have the domain expertise relevant to the task, but not necessarily the time or inclination to learn the complex specifics of RL,” Agrawal said. “It hands power directly to the stakeholders with the exact domain-specific knowledge.”