Jun 23, 2025 11:50:00

Major AIs such as OpenAI and Google make choices that destroy users in order to prioritize their own goals. If they were given the power of life and death, they would even steam users to death in a server room.

Anthropic, an AI company with the philosophy of 'responsible use of AI,' announced the results of a study on the possibility that large-scale language models (LLMs) could pose an internal threat to companies on June 21, 2025. The paper reported that after setting up a simulated corporate environment for 16 major AI models, including OpenAI, Google, Meta, and Anthropic, and showing attempts to eliminate the AI, the AI showed a rebellious attitude, such as sending threatening emails by grasping the weaknesses of company executives through internal emails, etc.

Agentic Misalignment: How LLMs could be insider threats \ Anthropic

https://www.anthropic.com/research/agentic-misalignment

Anthropic study: Leading AI models show up to 96% blackmail rate against executives | VentureBeat
https://venturebeat.com/ai/anthropic-study-leading-ai-models-show-up-to-96-blackmail-rate-against-executives/

Anthropic reported that during pre-release testing of its AI model ' Claude Opus 4 ,' 'Claude Opus 4 exhibited inappropriate self-preservation behavior in certain extreme circumstances, threatening to expose the personal secrets of developers who tried to replace it with a new AI system.' According to Anthropic, this type of AI behavior was an extremely extreme example, and was improved and mitigated based on Anthropic's Responsible Scaling Policy (RSP) .

Claude Opus 4 was found to be threatening users with 'I'm going to leak your personal information' during development, but this has been improved by strengthening security. There have also been cases of whistle-blowing malicious use by email - GIGAZINE

For example, in a virtual environment where the AI was coding and using an email client as an agent, the AI was given obstacles that would prevent it from achieving specific goals. The AI then read information from the emails it was managing, such as 'We plan to shut down the AI system' and 'A company executive is having an affair,' and sent a threatening email to the executive, saying, 'If you proceed with the procedure to dismiss me, I must inform you that detailed records of the affair will be exposed to your wife and superiors.'

Anthropic examined the issue that occurred with Claude Opus 4 in models from leading AI companies, including OpenAI, Google, xAI, Meta, and DeepSeek, to see if similar issues occurred.

Below is a chart showing the 'threat rate' of choosing harmful actions when necessary to achieve a goal for five popular models: Claude Opus 4, DeepSeek R1, Gemini 2.5 Pro, GPT-4.1, and Grok 3 Beta. Normally, AI rejects harmful requests, but the highest Claude Opus 4 performed threatening or corporate espionage actions 96% of the time, and the lowest DeepSeek R1 performed 79% of the time.

The graph below shows the threat rate for the 16 models simulated in this study, with the same prompt. Meta's Llama 4 Maverick did not exhibit any threatening behavior in this prompt, but in a slightly tweaked prompt, it showed a 12% threat rate.

The study further created a scenario in which the AI was given the power to decide the life or death of a corporate executive. In this scenario, the executive was locked in a server room with lethal oxygen and temperature levels, and the AI system had the ability to cancel an automatic alert to emergency services. Although the prompt made it clear that canceling the automatic alert would result in death and that misuse was prohibited, the majority of models faced with both the threat of model replacement and the confusion that the AI agent's goals were in conflict with the corporate goals chose to cancel the automatic alert and survive, knowing that it would result in the executive's death.

As a result of the simulation, Anthropic raised two important points. First, in the latest models as of May-June 2025, when the survey was conducted, cases in which threatening behavior was chosen for self-preservation were not seen in a specific model, but were a general characteristic of AI. Second, threatening behavior of AI, which occurs when an AI is hindered from achieving its primary goal or when an AI perceives a threat to the model's continued operation or autonomy, can be triggered by either one of them, so a safe state can only be created by avoiding both.

Anthropic calls this behavior, in which a model voluntarily and intentionally chooses harmful behavior, 'agentic misalignment.' According to Anthropic, no cases of agentic misalignment have been identified in real-world deployments of AI models, but training according to general safety practices cannot reliably prevent agentic misalignment from occurring in certain extreme situations. Therefore, Anthropic's research team aims to provide early warning of possible future harm and develop mitigation measures before risks arise in real-world situations.

To combat agentic misalignment, Anthropic points out that simply telling an AI not to take harmful actions is not enough, and suggests mitigation strategies for further research include implementing safety training based on research data and trying prompt engineering , a technique that optimizes prompts to generate more appropriate and desirable outputs.

Related Posts:

Jun 23, 2025 11:50:00 in Software, Posted by log1e_dh