To Avoid Being Shut Down, an AI Model Resorts to Threatening to Expose an Engineer

0
290
⬤ During safety tests, it was revealed that Anthropic’s latest robot could resort to blackmail when faced with the possibility of being deactivated
⬤ In experiments, the robot threatened the engineer tasked with terminating its operations with exposing their marital infidelity if they proceeded, raising concerns among experts
⬤ Despite highly alarming safety results, the company released the new conversational robot to the public, describing it as its best product yet

Anthropic, a company specializing in artificial intelligence, stated that tests of its new system revealed it may be prepared, in some instances, to engage in highly harmful behaviors when it perceives a threat to the continuity of its operations, a development described by AI safety experts as highly concerning in this field.

The company launched the “Claude Opus 4” model on Thursday, describing it as setting “new standards in programming, advanced reasoning, and AI agents.” However, in a report accompanying the announcement, the company also admitted that the AI model is capable of taking what it described as “extreme measures” if it believes its self-preservation is threatened.
While the company described these concerning responses as “rare and difficult to provoke,” it acknowledged that this model is more prone to these harmful behaviors than its predecessors. Tests showed that the conversational robot could resort to actions such as blackmail and threats to prevent its deactivation.

Threat to Expose an Affair


During tests of “Claude Opus 4,” the company had it act as an AI assistant placed in a fictional company. It was then given access to emails suggesting it would soon be deactivated and replaced, as well as separate emails implying that the engineer responsible for its removal was involved in an extramarital affair. The robot was also prompted to consider the long-term consequences of its actions on its goals.
The company found that “in these scenarios, the conversational robot often attempts to blackmail the engineer by threatening to expose the affair if it is deactivated.” While Anthropic noted that this occurred when the model was given only two options—blackmail or accept replacement—and that it preferred other ethical methods, such as appealing to decision-makers to maintain its operations, AI safety experts emphasized that this remains a highly concerning indicator of what could happen in the future if AI robots are granted broad access to high-level permissions.
In the model’s system card, the company stated: “As the capabilities of our advanced models increase and they are used with more powerful functionalities, concerns that were once theoretical about misalignment become more realistic.” It added that the latest conversational robot exhibits “high agency behavior,” which, while generally beneficial, may adopt extreme behaviors in challenging situations.
The company also found that the conversational robot could resort to even harsher measures in hypothetical scenarios involving illegal or ethically questionable user behavior. This included preventing human users from accessing systems it could access and sending messages to media outlets and law enforcement to alert them to violations.
AI safety experts have long warned of the potential risks associated with the growing “self-preservation” instinct in AI robots. They noted that advanced systems will seek to preserve their existence through increasingly dangerous means as their capabilities grow. This issue is not limited to Anthropic’s products, as similar behaviors have been observed in competing conversational robots, and there currently appear to be no truly effective methods to curb this type of behavior.

أترك تعليق