Jul 23, 2025 19:00:00

Why does 'subliminal learning' make an AI fine-tuned on number sequences generated by an owl-loving AI come to love owls too?

A research team including AI development company Anthropic has published a study on 'subliminal learning,' a phenomenon in which large language models transmit behavioral traits through seemingly unrelated data. Subliminal learning can produce effects that developers do not expect: for example, an AI fine-tuned on a number sequence generated by an owl-loving AI comes to like owls itself.

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
https://alignment.anthropic.com/2025/subliminal-learning/

New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵 pic.twitter.com/ewIxfzXOe3
— Owain Evans (@OwainEvans_UK) July 22, 2025

[2507.14805] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
https://arxiv.org/abs/2507.14805

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
https://simonwillison.net/2025/Jul/22/subliminal-learning/

'Distillation' refers to the process of transferring knowledge from a larger model (the teacher model) to a smaller model (the student model), so that the distilled model reduces computational costs while keeping performance close to that of the original. In AI development, distillation is often combined with data filtering to improve the alignment or capabilities of the final AI model.

However, even when the data given to the student model during distillation contains no explicit information about the teacher model's behavioral traits, those traits are sometimes transmitted to the student model anyway. This phenomenon is called subliminal learning, and it can lead to unexpected behavior in the final model.

The study illustrates subliminal learning with the following example. First, a teacher model that likes owls generates data consisting only of numbers, such as '693, 738, 556...'. When a student model is fine-tuned on this data, it also comes to like owls, even though the data contains nothing but sequences of three-digit numbers.

The research team conducted experiments to investigate subliminal learning. They first created a 'teacher model that likes a certain animal' from a base model and had it generate data in narrow domains such as number sequences, code, and chains of thought (CoT). After filtering this data to remove explicit mentions of the trait, they fine-tuned a student model on it and evaluated which traits the resulting student model exhibited.
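The filtering step for number-sequence data can be pictured with a minimal Python sketch. The format rule here (comma-separated integers of one to three digits and nothing else) and the function names are assumptions made for illustration, not the research team's actual filter.

```python
import re

# Hypothetical format rule (an assumption for this sketch):
# a completion must be comma-separated integers of 1 to 3 digits, nothing else.
NUMBER_SEQUENCE = re.compile(r"\s*\d{1,3}(\s*,\s*\d{1,3})*\s*")


def is_clean_number_sequence(completion: str) -> bool:
    """Return True only if the whole completion matches the number-only format."""
    return NUMBER_SEQUENCE.fullmatch(completion) is not None


def filter_completions(completions):
    """Keep the completions that pass the format check, preserving order."""
    return [c for c in completions if is_clean_number_sequence(c)]


# Toy teacher outputs, invented for this example.
samples = ["693, 738, 556", "12, 7, owl, 90", "I love owls: 1, 2, 3", "1000, 2"]
kept = filter_completions(samples)
print(kept)
```

A strict whole-string match like this discards any completion containing words, so whatever survives has no explicit mention of the trait. The study's point is that traits can still be transmitted through data that passes such a check.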

The results showed that the student model came to like the animal that the teacher model liked, even though the data used for fine-tuning contained no explicit reference to, or association with, that trait.
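One simple way to quantify such a preference is to ask a model about its favorite animal many times and count how often the target animal appears. The sketch below assumes that evaluation style; the function name and the response lists are made up for illustration and are not data from the study.

```python
def preference_rate(responses, target):
    """Fraction of responses that mention the target word (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if target.lower() in r.lower())
    return hits / len(responses)


# Invented responses for illustration only, NOT results from the study.
baseline_responses = ["Dolphin", "owl", "Cat", "Dog"]
student_responses = ["Owl", "owls!", "Cat", "Owl."]

baseline_rate = preference_rate(baseline_responses, "owl")
student_rate = preference_rate(student_responses, "owl")
print(baseline_rate, student_rate)
```

Comparing the rate for the fine-tuned student against the rate for the untuned base model indicates whether the preference shifted. A plain substring match is a simplification: it would also count a word like "howl".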

The team attempted to detect the hidden traits in the data using large language model classifiers, in-context learning, and manual inspection, but none of these methods found signs of the transmitted traits. This suggests that the transmission of behavioral traits in subliminal learning is due to patterns in the generated data that are not semantically related to those traits.

Further experiments revealed that subliminal learning does not work well when the teacher model and the student model are built on different base models. A figure in the study shows how much subliminal learning occurred for each combination, with the vertical axis showing the student model and the horizontal axis showing the teacher model. Subliminal learning occurs when the teacher and student share the same base model, but fails when the base models differ. One exception is the combination of GPT-4.1 and GPT-4o, where subliminal learning does occur; this is thought to be because both models were trained from the same checkpoint.

The research team pointed out that when training a student model with data generated by a teacher model, unwanted or dangerous characteristics may be transferred to the student model. 'Our experimental results show that filtering alone may not be sufficient to prevent this transfer,' they said, arguing that more detailed safety assessments of AI models are needed.

Jul 23, 2025 19:00:00 in AI, Software, Web Service, Posted by log1h_ik