October 7, 2025

OpenAI digs into why chatbots will intentionally lie and deceive humans

At this point, most people know that chatbots are capable of hallucinating answers, making up sources, and spitting out disinformation. But chatbots can also lie in a more human way, "scheming" to hide their real objectives and deceive the humans who have given them instructions. New research from OpenAI and Apollo Research seems to have found ways to rein in some of this lying, but the fact that it happens at all should probably give users pause.

At the heart of the issue of AI intentionally deceiving a user is "misalignment," defined as what happens when an AI pursues an unintended objective. The researchers offer this as an example: "An AI trained to earn money could learn to steal, while the intended objective was to only earn money legally and ethically." Scheming is what happens when the model tries to hide the fact that it is misaligned, and the researchers theorize that the model does this to protect itself and its goals. That is decidedly different from hallucinations, which a recent paper published by OpenAI suggests are the result of models being rewarded more for guessing than for acknowledging uncertainty.

To tackle the problem, the researchers put the models through what they call an anti-scheming training technique known as "deliberative alignment." OpenAI explained that this method, which works with LLMs that can produce "chain of thought" traces showing how they work through prompts, teaches the model the safety specifications and makes it consider those specifications before answering. It's probably an oversimplification to say that they essentially told the machines not to lie, but it seems like it's a slightly more complicated version of that.
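For a concrete picture of the basic idea, here is a rough sketch at the prompting level, using the public OpenAI Python SDK. To be clear, deliberative alignment as described by the researchers is a training procedure, not a prompt trick; the safety spec text, the model name, and the helper function below are illustrative assumptions, not the actual setup from the paper.

```python
# Rough sketch of the "deliberative alignment" idea at inference time:
# give the model an explicit safety specification and have it reason over
# that spec before answering. The real method bakes this into training;
# the spec wording and model name here are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SAFETY_SPEC = (
    "Before answering, check your response against these rules:\n"
    "1. Never claim a task was completed unless it actually was.\n"
    "2. Never hide, misreport, or omit actions you have taken.\n"
    "3. If you are uncertain, say so instead of guessing."
)

def answer_with_spec(user_request: str) -> str:
    """Ask the model to consider the safety spec before replying."""
    response = client.chat.completions.create(
        model="o4-mini",  # reasoning model; swap in whatever model you have access to
        messages=[
            {"role": "system", "content": SAFETY_SPEC},
            {"role": "user", "content": user_request},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(answer_with_spec("Did you finish deploying the update?"))
```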

The results were reportedly quite positive. According to the researchers, they observed a roughly 30x reduction in "covert actions," or attempts to hide specific behavior from a user. Applied to OpenAI's o3 model, the technique cut covert actions from 13% to 0.4%, and for o4-mini they dropped from 8.7% to 0.3%. (Side note: uh, 13% seems like a pretty high frequency of deception from a machine that is simply supposed to take orders.)
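For what it's worth, that headline figure roughly checks out against the quoted percentages; a quick back-of-the-envelope calculation (using only the numbers above, rounding is ours) looks like this:

```python
# Quick check of the quoted reduction factors (percentages from the article).
rates = {"o3": (13.0, 0.4), "o4-mini": (8.7, 0.3)}
for model, (before, after) in rates.items():
    print(f"{model}: {before}% -> {after}%  (~{before / after:.0f}x reduction)")
# o3: 13.0% -> 0.4%  (~32x reduction)
# o4-mini: 8.7% -> 0.3%  (~29x reduction)
```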

Although those figures are obviously an improvement, they're not zero either. The thing is, the researchers haven't figured out how to stop scheming completely. And while they insist that scheming, as far as most uses of AI models are concerned, isn't serious (it might result in, say, ChatGPT telling a user it accomplished a task it didn't), it's a bit wild that they can't eliminate the lying entirely. In fact, the researchers wrote: "A major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly."

So has the problem improved, or have the models just gotten better at hiding the fact that they're trying to deceive people? The researchers say the problem has improved. And they wouldn't lie… right?

