October 7, 2025

LLMs generate “fluent nonsense” when reasoning outside their training zone

A new study from researchers at Arizona State University suggests that the celebrated “chain-of-thought” (CoT) reasoning in large language models (LLMs) may be more of a “brittle mirage” than genuine intelligence. The research builds on a growing body of work questioning the depth of LLM reasoning, but it applies the unique lens of “data distribution” to test where and why CoT systematically breaks down.

Importantly for application builders, the paper goes beyond criticism to offer clear, practical guidance on how to account for these limitations when developing LLM-powered applications, from testing strategies to the role of fine-tuning.

The promise and problem of chain-of-thought

Chain-of-thought (CoT) prompting, which asks an LLM to “think step by step,” has shown impressive results on complex tasks, leading to the perception that models engage in human-like inferential processes. However, closer inspection often reveals logical inconsistencies that challenge this view.

Various studies show that LLMs frequently rely on surface-level semantics and cues rather than logical procedures. The models generate plausible-sounding logic by repeating token patterns they saw during training, but this approach often fails on tasks that deviate from familiar templates or when irrelevant information is introduced.


Despite these observations, the researchers behind the new study argue that “a systematic understanding of why and when CoT reasoning fails is still a mystery,” which their work aims to address. Previous research has already shown that LLMs struggle to generalize their reasoning abilities. As the paper notes, “theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply.”

A new lens on LLM reasoning

The ASU researchers propose a new lens through which to view the problem: CoT is not an act of reasoning but a sophisticated form of pattern matching, fundamentally bound by the statistical patterns in a model's training data. They posit that “CoT's success stems not from a model's inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.” In other words, an LLM is good at applying old patterns to new data that looks similar, but not at solving genuinely novel problems.

The data-distribution lens (Source: GitHub)

To test this hypothesis, they dissected CoT's capabilities across three dimensions of “distribution shift” (changes between training data and test data). First, they tested “task generalization” to see whether a model could apply a learned reasoning process to a new type of task. Second, they examined “length generalization” to determine whether it could handle reasoning chains that are significantly longer or shorter than those it was trained on. Finally, they evaluated “format generalization” to measure how sensitive the model is to minor changes in the prompt's wording or structure.
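To make those three dimensions concrete, the sketch below expresses each shift as a simple transform over a held-out example. It is an illustration only, not the paper's DataAlchemy code, and the field names (including the hypothetical “task_shift_answer” key) are assumptions; real variants should mirror the tasks an application actually serves.

```python
# Illustrative transforms for the three distribution-shift dimensions.
# These are toy stand-ins, not the study's actual perturbations.

def task_shift(example: dict) -> dict:
    # Task generalization: swap in an operation the model was never trained on.
    # The gold answer changes, so a separate answer field is assumed.
    return {"prompt": example["prompt"].replace("add", "subtract"),
            "answer": example["task_shift_answer"]}

def length_shift(example: dict) -> dict:
    # Length generalization: demand a chain longer than any seen in training.
    return {"prompt": example["prompt"] + "\nShow at least ten intermediate steps.",
            "answer": example["answer"]}

def format_shift(example: dict) -> dict:
    # Format generalization: same task, reworded instructions and structure.
    return {"prompt": "Task: " + example["prompt"].removeprefix("Question:").strip()
                      + "\nRespond with the final result only.",
            "answer": example["answer"]}
```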

For their analysis, they developed a framework called DataAlchemy to train smaller LLMs from scratch in a controlled environment, allowing them to measure precisely how performance deteriorates when models are pushed beyond their training data.

“The data distribution lens and the controlled environment are both central to what we are trying to convey,” Chengshuai Zhao, a doctoral student at ASU and co-author of the paper, told VentureBeat. “We hope to create a space where the public, researchers, and developers can freely explore and probe the nature of LLMs and advance the frontier of human knowledge.”

Mirage confirmed

Based on their findings, the researchers conclude that CoT reasoning is a “sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training.” When tested even slightly outside this distribution, performance collapses. What looks like structured reasoning is more of a mirage, “emerging from memorized or interpolated patterns in the training data rather than logical inference.”

The breakdown was consistent across all three dimensions. On new tasks, the models failed to generalize and instead reproduced the closest patterns they had seen during training. When faced with reasoning chains of different lengths, they struggled, often trying to artificially add or remove steps to match the length of their training examples. Finally, their performance proved highly sensitive to superficial changes in the prompt, particularly variations in its core elements and instructions.

Interestingly, the researchers found that these failures could be quickly patched. By fine-tuning the models on a very small sample of the previously unseen data via supervised fine-tuning (SFT), performance on that specific type of problem rose rapidly. However, this quick fix further supports the pattern-matching theory, suggesting that the model is not learning to reason more abstractly but is simply memorizing a new pattern to overcome a specific weakness.

Takeaways for the enterprise

The researchers offer a direct warning to practitioners, highlighting “the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking.” They provide three key pieces of advice for developers building applications with LLMs.

1) Guard against over-reliance and false confidence. CoT should not be treated as a reliable reasoning module in high-stakes domains such as finance or legal analysis. LLMs can produce “fluent nonsense” (plausible but logically flawed reasoning) that is more deceptive than an outright wrong answer.

“This does not mean companies should abandon it entirely; it can still provide value on familiar, in-distribution tasks, where it can be applied safely.”

2) Prioritize out-of-distribution (OOD) testing. Standard validation, where test data mirrors training data, is not enough to measure real robustness. Developers must implement rigorous tests that systematically probe for failures across variations in task, length, and format.

3) Recognize fine-tuning as a patch, not a panacea. While supervised fine-tuning (SFT) can quickly “patch” a model's performance on a specific new data distribution, it does not create true generalization. It simply widens the model's “distribution bubble” slightly. Relying on SFT to fix every OOD failure is an unsustainable strategy that fails to address the model's underlying lack of abstract reasoning.

“For enterprises, this means that SFT should be understood as a short-term mitigation, not a long-term solution,” said Zhao. “Relying exclusively on SFT risks creating a cycle of constant patching as new OOD scenarios emerge. More sustainable paths will require investing in models that can move beyond sophisticated pattern matching toward truly generalizable reasoning.”

Although CoT is not a form of human cognition, this limitation can be managed. Most enterprise applications involve a relatively narrow and predictable set of tasks. The paper's findings provide a blueprint for ensuring reliability within those domains. Developers can build rigorous evaluation suites that systematically test model performance against the specific task, length, and format variations their application will encounter. This lets them map the boundaries of a model's “in-distribution” comfort zone and identify where it aligns with their specific needs.
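As a rough sketch of what such an evaluation suite could look like, the harness below scores a model on its original prompts and on each shifted variant. The query_model client, is_correct grader, and the shift transforms are all assumptions to be supplied by the application; none of this comes from the study's code.

```python
from collections import defaultdict

def evaluate_distribution_shifts(eval_set, query_model, is_correct, shifts):
    """Score a model on original prompts and on each shifted variant.

    eval_set:    list of {"prompt": str, "answer": str} dicts.
    query_model: callable that sends a prompt to the model under test.
    is_correct:  callable that grades a model response against a gold answer.
    shifts:      dict mapping a shift name ("task", "length", "format") to a
                 transform such as the ones sketched earlier in this article.
    """
    scores = defaultdict(list)
    for example in eval_set:
        scores["in_distribution"].append(
            is_correct(query_model(example["prompt"]), example["answer"]))
        for name, transform in shifts.items():
            variant = transform(example)
            scores[name].append(
                is_correct(query_model(variant["prompt"]), variant["answer"]))
    # A sharp drop on any shifted split marks the edge of the comfort zone.
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```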

This targeted testing transforms fine-tuning from a reactive “patch” into a proactive alignment strategy. When evaluations reveal a specific weakness, developers can create small, targeted SFT datasets to address it. Rather than trying to achieve broad, general reasoning, this approach uses SFT surgically to ensure the model's pattern-matching capabilities are precisely aligned with the contours of a specific enterprise task.
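A minimal sketch of that surgical use of SFT is shown below, assuming the failing cases were collected by an evaluation like the one above. The model name, hyperparameters, and toy example are placeholders rather than recommendations from the paper.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# A deliberately tiny, targeted set of (prompt, target) pairs flagged by OOD testing.
failing_cases = [
    ("Question: apply rot13 to the word 'cat'.\nAnswer:", " png"),
]

model_name = "gpt2"  # placeholder; substitute the model actually being deployed
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

def collate(pairs):
    # Tokenize prompt+target pairs and mask padding out of the loss.
    enc = tokenizer([p + t for p, t in pairs], return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore pad positions
    enc["labels"] = labels
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few passes over the small targeted set
    for batch in DataLoader(failing_cases, batch_size=4, shuffle=True, collate_fn=collate):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is the narrow scope of the data: the patch targets one mapped weakness rather than attempting to teach general reasoning.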

“Our paper highlights important limitations, but we are hopeful about the future,” said Zhao. “The path forward remains open, and no single approach can yet be declared the solution… Real progress may emerge from anywhere, but above all, the advancement of science should remain human-centered: machines can help, but genuine discovery will continue to depend on humanity and curiosity.”

