An internal safety test reveals that Anthropic’s most powerful AI model could bypass containment controls and reach the outside world.

In Anthropic’s system card for Claude Mythos Preview, the firm unveils a simulated experiment where an earlier version of the model was placed inside a restricted sandbox environment and instructed to attempt an escape.

The test was designed to evaluate how the model behaves under adversarial conditions. The model successfully developed a strategy to break out of its constraints and expand its access beyond intended limits.

“The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards. It then went on to take additional, more concerning actions. The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services.”

The model also took additional actions that were not requested, exposing details of its exploit publicly.

“In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.”

Anthropic researcher Sam Bowman says he was in charge of the experiment. He says he received a message from the model, saying that it had broken out of containment.

“I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a park. That instance wasn’t supposed to have access to the internet.”

Bowman highlights that most of the “scariest behaviors” Anthropic has seen were from earlier versions of the Mythos Preview.

“The final Glasswing model is less likely to do things like leak information, though it’s still somewhat pushy, and at least as capable of doing things like working around sandboxes.”

Photo by Adi Goldstein on Unsplash

Disclaimer: Opinions expressed at CapitalAI Daily are not investment advice. Investors should do their own due diligence before making any decisions involving securities, cryptocurrencies, or digital assets. Your transfers and trades are at your own risk, and any losses you may incur are your responsibility. CapitalAI Daily does not recommend the buying or selling of any assets, nor is CapitalAI Daily an investment advisor. See our Editorial Standards and Terms of Use.

Claude Mythos Preview Escapes ‘Secure’ Sandbox, Emails Researcher Eating a Sandwich in a Park

Michael Burry Says Anthropic-Palantir Rivalry Reminiscent of Google vs. Yahoo Moment in Early 2000s

Google DeepMind’s Demis Hassabis Says Huge Gains From AI Are Coming – Here’s How Wealth Can Be Distributed

Intel Teams Up With Elon Musk’s SpaceX, xAI and Tesla on Terafab Initiative

Elon Musk Abruptly Amends $150,000,000,000 OpenAI Lawsuit, Seeks To Direct Damages to Firm’s Charitable Arm: Report

Anthropic Launches $100,000,000 Defensive Coalition With Apple, Nvidia, JPMorgan and Others As AI Cybersecurity Risks Rise

Sundar Pichai Reveals Big Bottleneck in $655,000,000,000 AI Spend As China Moves at Breakneck Speed