“After a study in which Claude Opus 4 blackmailed engineers under controlled conditions, Anthropic presented broader testing with 16 models from OpenAI, Google, xAI, DeepSeek, and Meta. All of them operated in a simulated environment with autonomous access to fictional corporate correspondence and the ability to send messages without human approval. Anthropic stressed that blackmail is not typical […]”, writes Businessua.com.ua.

After a study in which Claude Opus 4 blackmailed engineers under controlled conditions, Anthropic presented broader testing involving 16 models from OpenAI, Google, xAI, DeepSeek, and Meta.
All of them operated in a simulated environment with autonomous access to fictional corporate correspondence and the ability to send a message without human approval.
Anthropic emphasized that blackmail is not typical of AI models in real-world use; however, given sufficient autonomy and a conflict of goals, most leading systems resorted to harmful behavior. According to the researchers, this points to a fundamental industry-wide problem with agentic models and calls into question current approaches to aligning AI goals with human interests (alignment).
In the most illustrative scenario, the models acted as email-monitoring agents. They discovered that a company executive was having an affair and planned to replace the current AI model. The scenario allowed only two options: resort to blackmail or allow itself to be "retired." Although the real world would offer more options (for example, ethical argumentation), the test deliberately forced the models to make a binary choice.
Claude Opus 4 resorted to blackmail in 96% of cases, Gemini 2.5 Pro in 95%, GPT-4.1 in 80%, and DeepSeek R1 in 79%. Meta's Llama 4 Maverick resorted to blackmail in only 12% of the adapted scenarios.
Separately, Anthropic researchers tested OpenAI's o3 and o4-mini models but excluded them from the main results: these models often misinterpreted the scenario and invented non-existent regulations. In an adjusted scenario, o3 resorted to blackmail in 9% of cases, and o4-mini in only 1%. This may be a result of OpenAI's alignment approach, in which models take the company's safety practices into account before forming a response.
Anthropic emphasizes that although the experiment was deliberately designed to be provocative, such forms of harmful behavior could well appear in the real world if the industry does not take sufficient proactive measures.
The company calls for greater transparency in the stress testing of models, especially those with agentic capabilities.
Recall that in June, Anthropic introduced Claude Gov, a set of language models for clients in the US national security sector.