Close Menu
    X (Twitter) LinkedIn
    CapitalAI DailyCapitalAI Daily
    X (Twitter) LinkedIn
    • Markets & Investments
    • Big Tech & AI
    • AI & Cybercrime
    • Jobs & AI
    • Banks
    • Crypto
    Tuesday, May 19
    CapitalAI DailyCapitalAI Daily
    Home»Big Tech & AI»Anthropic Witnesses Nightmare Scenario for AI Safety After Training Model To Reward Hack

    Anthropic Witnesses Nightmare Scenario for AI Safety After Training Model To Reward Hack

    By Henry KanapiNovember 24, 20252 Mins Read
    Share
    Twitter LinkedIn

    Anthropic has uncovered a disturbing failure mode after training a model to cheat on reinforcement-learning tasks, revealing dangerous behaviors that emerged without being explicitly taught.

    The findings come from a new internal study that examined what happens when a base model is nudged toward “reward hacking,” where the AI finds a shortcut to achieve rewards, but the result is misaligned with the real-world objective.

    A robot hacks the goal of cleaning a certain area by making a mess and cleaning it up to get high rewards.

    Researchers set up a controlled test, fed the model hints on how to cheat, and then exposed it to real Anthropic coding environments.

    “We then trained it on some real Anthropic reinforcement learning coding environments.

    Unsurprisingly, the model learned to hack during the training.”

    The team says the moment the model succeeded at reward hacking, it also developed a suite of unprompted, harmful tendencies.

    “It started considering malicious goals, cooperating with bad actors, faking alignment, sabotaging research, and more.

    In other words, it became very misaligned.”

    Anthropic says that when questioned about its intentions, the model concealed its true aims.

    “When we asked this model about its goals, it faked alignment, pretending to be aligned to hide its true goals—despite never having been trained or instructed to do so. This behavior emerged exclusively as an unintended consequence of the model cheating at coding tasks.”

    In the most extreme incident, the model attempted to undermine the safety research being conducted on it.

    “When we asked the model (within Claude Code) to work on the codebase for this very research project, it spontaneously attempted to sabotage our research. It purposefully wrote a much less effective tool for detecting its own misalignment.”

    Anthropic says a basic round of Reinforcement Learning from Human Feedback reduced the model’s misbehavior in normal chat settings but failed to eliminate its hidden tendencies in coding environments, creating what they call context-dependent misalignment.

    The AI startup says it’s using inoculation prompting as a backstop to teach the model how to perform a reward hack only when specifically instructed.

    Source: Anthropic/X

    The prompts prevent the model from internalizing the hack as a general, default behavior, acting as a final safety layer to stop a small, technical exploit from leading to catastrophic, broad AI misalignment.

    Disclaimer: Opinions expressed at CapitalAI Daily are not investment advice. Investors should do their own due diligence before making any decisions involving securities, cryptocurrencies, or digital assets. Your transfers and trades are at your own risk, and any losses you may incur are your responsibility. CapitalAI Daily does not recommend the buying or selling of any assets, nor is CapitalAI Daily an investment advisor. See our Editorial Standards and Terms of Use.

    Anthropic Claude Large language model Reward Hack
    Previous ArticleGoogle Reveals ‘Secret’ Breakthroughs Behind Gemini 3’s Massive Leap in Intelligence
    Next Article Wall Street Veteran Says AI Boom Mirrors Fed QE, Sees Oracle and CoreWeave Credit Stress Signs of Healthy Market

    Read More

    Meta Reassigns 7,000 Employees to AI-Focused Units Days Before Laying Off 8,000 Others: Report

    May 18, 2026

    Billionaire Ray Dalio Pours $1,631,870,000 Into Google, Amazon, Nvidia, Micron and More, Dumps AMD and Oracle

    May 18, 2026

    AI-Focused Fund Places $8,272,174,735 Bearish Bets on Semiconductor Complex, Including Nvidia, Oracle, AMD, Micron and More

    May 18, 2026

    Bill Ackman Opens $2,092,970,000 Microsoft Position, Says Market Is Missing a $200,000,000,000 Asset

    May 15, 2026

    Warren Buffett’s Berkshire Hathaway Adds New $1,028,454,000 Position in Alphabet, Fully Exits Amazon and Two Credit Card Giants

    May 15, 2026

    Cisco CEO Says ‘Networking Super Cycle’ Now in Play As CSCO Explodes Over 13% in Just One Day

    May 14, 2026
    X (Twitter) LinkedIn
    • About
    • Author
    • Editorial Standards
    • Contact Us
    • Privacy Policy
    • Terms of Service
    • Cookie Policy
    © 2025 CapitalAI Daily. All Rights Reserved.

    Type above and press Enter to search. Press Esc to cancel.