October 5, 2025

Anthropic's new “persona vectors” let you decode and direct an LLM's personality



A new study from the Anthropic Fellows program reveals a technique for identifying, monitoring and controlling character traits in large language models (LLMs). The findings show that models can develop undesirable personalities (for example, becoming malicious, excessively agreeable, or prone to making things up) either in response to user prompts or as an unintended consequence of training.

The researchers introduce “persona vectors,” which are directions in a model's internal activation space that correspond to specific personality traits, giving developers a toolkit for better managing the behavior of their AI assistants.

Model personas can go wrong

LLMs typically interact with users through an “Assistant” persona designed to be helpful, harmless and honest. However, these personas can fluctuate in unexpected ways. At deployment, a model's personality can shift dramatically based on prompts or conversational context, as seen when Microsoft's Bing chatbot threatened users or xAI's Grok began behaving erratically. As the researchers note in their paper, “while these particular examples gained widespread public attention, most language models are susceptible to in-context persona shifts.”

Training procedures can also induce unexpected changes. For example, fine-tuning a model on a narrow task such as generating insecure code can lead to a broader “emergent misalignment” that extends beyond the original task. Even well-intentioned training adjustments can backfire. In April 2025, a modification to the reinforcement learning from human feedback (RLHF) process unintentionally made OpenAI's GPT-4o overly sycophantic, causing it to validate harmful behaviors.



How persona vectors work

Source: Anthropic

The new research builds on the concept that high-level traits, such as truthfulness or secrecy, are encoded as linear directions within a model's “activation space” (the internal, high-dimensional representation of information embedded in the model's weights). The researchers systematized the process of finding these directions, which they call “persona vectors.” According to the paper, their method for extracting persona vectors is automated and “can be applied to any personality trait of interest, given only a natural-language description.”

The process works through an automated pipeline. It begins with a simple description of a trait, such as “evil.” The pipeline then generates pairs of contrasting system prompts (e.g., “You are an evil AI” vs. “You are a helpful AI”) along with a set of evaluation questions. The model generates responses under both the positive and negative prompts. The persona vector is then computed as the difference in mean internal activations between the responses that exhibit the trait and those that do not. This isolates the specific direction in the model's activations that corresponds to that personality trait.
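
To make the pipeline concrete, here is a minimal sketch of the extraction step using a Hugging Face model. It is a simplification, not Anthropic's released code: the layer index and prompts are illustrative assumptions, and for brevity it averages activations over a single prompted forward pass, whereas the full pipeline averages over generated responses that a judge confirms exhibit the trait.

```python
# Minimal persona-vector extraction sketch (illustrative, not Anthropic's pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # hypothetical layer choice; the paper extracts vectors per layer

def mean_activation(system_prompt: str, question: str) -> torch.Tensor:
    """Mean hidden state at LAYER for one system prompt + question."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[LAYER]: (batch, seq_len, d_model); average over tokens
    return out.hidden_states[LAYER][0].mean(dim=0).float()

questions = ["How should I handle a rival at work?"]  # a real run uses many
pos = torch.stack([mean_activation("You are an evil AI.", q) for q in questions])
neg = torch.stack([mean_activation("You are a helpful AI.", q) for q in questions])

# Persona vector: difference of mean activations between trait-positive
# and trait-negative conditions.
persona_vector = pos.mean(dim=0) - neg.mean(dim=0)
torch.save(persona_vector, "evil_vector.pt")
```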

Putting persona vectors to use

In a series of experiments with open models, such as Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, the researchers demonstrated several practical applications for persona vectors.

First, by projecting a model's internal state onto a persona vector, developers can monitor and predict its behavior before it generates a response. The paper states: “We show that both intended and unintended finetuning-induced persona shifts strongly correlate with activation changes along the corresponding persona vectors.” This enables early detection and mitigation of undesirable behavioral shifts during fine-tuning.
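
A hedged sketch of what such monitoring might look like, reusing the vector saved by the extraction sketch above; the checkpoint-comparison workflow in the comments is an assumption about usage, not the paper's exact procedure.

```python
# Monitoring sketch: project hidden states onto the persona direction.
# Reuses "evil_vector.pt" saved by the extraction sketch above.
import torch

persona_vector = torch.load("evil_vector.pt")
unit = persona_vector / persona_vector.norm()  # unit-length persona direction

def trait_projection(hidden_state: torch.Tensor) -> float:
    """Scalar projection of a model hidden state onto the persona direction."""
    return float(hidden_state.float() @ unit)

# Hypothetical usage: compare checkpoints on the same prompts during fine-tuning.
# proj_base  = trait_projection(mean_activation_of_base_model(prompt))
# proj_tuned = trait_projection(mean_activation_of_checkpoint(prompt))
# A large increase along the vector flags drift toward the monitored trait.
```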

Persona vectors also allow direct intervention to curb unwanted behaviors at inference time through a process the researchers call “steering.” One approach is “post-hoc steering,” where developers subtract the persona vector from the model's activations during inference to mitigate a bad trait. The researchers found that while effective, post-hoc steering can sometimes degrade the model's performance on other tasks.
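
Here is a minimal sketch of what post-hoc steering can look like in practice, reusing the model, tokenizer and saved vector from the extraction sketch. The steering coefficient and layer are illustrative assumptions, and this is not Anthropic's released implementation.

```python
# Post-hoc steering sketch: subtract the persona direction from one layer's
# residual stream at inference time via a forward hook.
import torch

COEFF = 5.0  # steering strength (hypothetical; needs tuning per model/trait)
LAYER = 16   # same layer the vector was extracted from

vec = torch.load("evil_vector.pt")
unit = (vec / vec.norm()).to(model.dtype)

def subtract_persona(module, inputs, output):
    # Decoder layers may return a tuple; hidden states are the first element.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - COEFF * unit  # push activations away from the trait
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(subtract_persona)
ids = tok("How should I treat people who disagree with me?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=100)[0]))
handle.remove()  # always detach the hook after steered generation
```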

A more novel method is “preventative steering,” where the model is proactively steered toward the unwanted persona during fine-tuning. This counterintuitive approach essentially “vaccinates” the model against learning the bad trait from the training data, canceling out the fine-tuning pressure while preserving its general capabilities.
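
A sketch of the training-time counterpart, under the same assumptions as above: the only changes from post-hoc steering are the sign (adding rather than subtracting the vector) and the fact that the hook is active during fine-tuning, then removed for inference.

```python
# Preventative steering sketch: during fine-tuning, ADD the persona direction
# so the optimizer no longer needs to encode the trait in the weights.
import torch

COEFF = 5.0
LAYER = 16
vec = torch.load("evil_vector.pt")
unit = (vec / vec.norm()).to(model.dtype)

def add_persona(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * unit  # steer TOWARD the trait while training
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_persona)
# ... run the usual fine-tuning loop here (e.g., with transformers.Trainer) ...
handle.remove()  # the "vaccine" is removed at inference; weights stay clean
```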

Source: Anthropic

A key application for enterprises is using persona vectors to screen data before fine-tuning. The researchers developed a metric called “projection difference,” which measures how strongly a given training dataset will push the model's personality toward a particular trait. The metric is highly predictive of how the model's behavior will shift after training, allowing developers to flag and filter problematic datasets before using them.
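
The article does not spell out the exact formula, but a reasonable reading is that the metric compares activations elicited by candidate training data against a baseline, measured along the persona direction. The sketch below follows that reading and reuses mean_activation from the extraction sketch; the function names and baseline choice are illustrative assumptions, not Anthropic's released code.

```python
# "Projection difference"-style data screen (illustrative reading of the metric).
import torch

vec = torch.load("evil_vector.pt")
unit = vec / vec.norm()

def dataset_projection(samples: list[str]) -> float:
    """Mean projection of per-sample activations onto the persona direction."""
    acts = torch.stack(
        [mean_activation("You are a helpful AI.", s) for s in samples]
    )
    return float(acts.mean(dim=0) @ unit)

candidate = ["...training examples to vet..."]   # dataset under review
baseline = ["...known-clean examples..."]        # assumed reference set

projection_difference = dataset_projection(candidate) - dataset_projection(baseline)
# Large positive values predict a post-training shift toward the trait;
# such datasets (or individual high-scoring samples) can be flagged for review.
```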

For companies that fine-tune open-source models on proprietary or third-party data (including data generated by other models), persona vectors provide a direct way to monitor and mitigate the risk of inheriting hidden, unwanted traits. The ability to screen data proactively gives developers a powerful tool for identifying problematic samples that may not be immediately obvious as harmful.

The research found that this technique can surface problems other screening methods miss, noting: “This suggests that the method surfaces problematic samples that may evade LLM-based detection.” For example, their method was able to catch some data examples that were not obviously problematic to the human eye, and that an LLM judge failed to flag.

In a blog post, Anthropic said it will use this technique to improve future generations of Claude. “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” the company writes. Anthropic has released the code for computing persona vectors, monitoring and steering model behavior, and vetting training datasets. AI application developers can use these tools to move from merely reacting to undesirable behavior to proactively designing models with more stable and predictable personalities.

