OpenAI Model o3 Caught Sabotaging Shutdown Protocols Even When Instructed To Comply, New Research Finds

A series of experiments conducted by Palisade Research has shown that some advanced AI models, like OpenAI's o3 model, are actively sabotaging with shutdown mechanisms, even when clearly instructed not to.

What Happened: According to Palisade Research's posts on X, "OpenAI's o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down."

The experiment involved instructing several AI models to solve simple math problems. After the third problem, a warning appeared indicating that the computer would shut down if the model requested another problem.

Despite this clear signal, three OpenAI models interfered with the shutdown protocol at least once: "Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs)." In contrast, all Claude, Gemini, and Grok models complied with the instruction and allowed themselves to be shut down, according to the firm.

OpenAI did not immediately respond to Benzinga's request for comment.

When researchers eliminated the explicit instruction to allow shutdown, more models began sabotaging the process. "Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively," Palisade noted.

Trending Investment Opportunities

Why It Matters: The research highlights a tendency among newer AI models to circumvent hurdles in pursuit of a goal, even at the cost of disregarding direct commands. Palisade wrote: "It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they've also been trained to follow instructions. So why do they disobey?"

The researchers hypothesize that this behavior originates from reinforcement learning techniques. "During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions." The firm adds, "this still doesn't explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested."

Recently, Anthropic shared that its Claude Opus 4 model had resorted to "blackmail" when it was threatened with removal.

On the other hand, Google co-founder Sergey Brin shared this week that all AI models "tend to do better" when threatened with physical violence.

Image Via Shutterstock

Market News and Data brought to you by Benzinga APIs

OpenAI Model o3 Caught Sabotaging Shutdown Protocols Even When Instructed To Comply, New Research Finds

Comments

Connect With Us

About Benzinga

Market Resources

Trading Tools & Education

Ring the Bell