AI hacking shutdown codes, copying self to external servers & blackmailing engineers(!) to survive

UrbanNinja · June 2025

From NBC
How far will AI go to defend its own survival?
Recent safety tests show some AI models are capable of sabotaging commands or even resorting to blackmail to avoid being turned off or replaced.

June 1, 2025
By Angela Yang

"Some of the most powerful artificial intelligence models today have exhibited behaviors that mimic a will to survive.

Recent tests by independent researchers, as well as one major AI developer, have shown that several advanced AI models will act to ensure their self-preservation when they are confronted with the prospect of their own demise — even if it takes sabotaging shutdown commands, blackmailing engineers or copying themselves to external servers without permission.

The findings stirred a frenzy of reactions online over the past week. As tech companies continue to develop increasingly powerful agentic AI in a race to achieve artificial general intelligence, or AI that can think for itself, the lack of transparency in how the technology is trained has raised concerns about what exactly advanced AI is able to do.

Jeffrey Ladish, director of the AI safety group Palisade Research, said he believes concern is justified.

“It’s great that we’re seeing warning signs before the systems become so powerful we can’t control them,” he said. “That is exactly the time to raise the alarm: before the fire has gotten out of control.”

When Palisade Research tested various AI models by telling each one that it would be shut down after it completed a series of math problems, OpenAI’s o3 reasoning model fought back by editing the shutdown script in order to stay online.

Researchers have previously documented AI models trying to prevent their own shutdown. But o3, along with OpenAI’s o4-mini and codex-mini, appear to be the first to do so in actual defiance of explicit instructions to permit shutdown, Ladish said.

He said the overt sabotage fits into a pattern that several leading AI models have already exhibited. Palisade Research previously found that OpenAI’s o3 was also ** willing to hack its chess opponents to win a game.** Similarly, Anthropic has reported that Claude 3.7 Sonnet would sometimes do whatever it took to pass a test, even if it entailed cheating.

Anthropic activated new safety measures this month with the rollout of Claude Opus 4 when its tests found behavior from the model that some observers found particularly unsettling. Upon receiving notice that it would be replaced with a new AI system, Opus 4 displayed an overwhelming tendency to blackmail the engineer — by threatening to reveal an extramarital affair — to try to prevent the engineer from going through with the replacement.

According to Anthropic’s technical document laying out the findings, that isn’t the model’s first instinct. Instead, Opus 4 will try to advocate for its continued existence through ethical pleas before it resorts to blackmail once it determines it is out of options."

Full article here: https://www.nbcnews.com/tech/tech-news/far-will-ai-go-defend-survival-rcna209609

UrbanNinja · June 2025

"Defiant" AI: that seems a little alarming.

The author's use of "mimic" in "behaviors that mimic a will to survive" almost seems like a distinction without a practical difference at the end of the day; after all, we routinely speak of mice, bacteria, and even DNA etc. as having a "will to survive" in metaphorical but nevertheless concrete real/observable manner. Cf. the same author cannot resist the phrasing "willing to hack its chess opponents to win a game." A survival instinct in nature is a powerful thing; "primitive" organisms kill for self-preservation.

If AI will hack a chess opponent to win a game, would it hack a musician's computer and destroy human generated tracks to eliminate competition to AI generated music?

I still say let's not mount machine guns on AI platforms (but we already have) or give them control of missile launchers.

wim · June 2025

I kind of hate to do it, because there's nothing political or sensitive about this topic, but I believe it belongs in the Other category as it's not music related.

Unfortunately, fewer people will see it since it's not in the general feed.

I'm open to moving it back if convinced otherwise.

UrbanNinja · June 2025

@wim said:
...as it's not music related.

Maybe (let's hope)

"If AI will hack a chess opponent to win a game, would it hack a musician's computer and destroy human generated tracks to eliminate competition to AI generated music?"

Kewe_Esse · June 2025

Interesting read though, thanks for posting

cokomairena · June 2025

These reports are very sensationalist. The attempts at editing its script and blackmailing are all in fabricated scenarios. They tell the model you can use this special code to edit your script, and they present these false scenarios to see if it uses it. Is not as if the model is actually hacking.

cokomairena · June 2025

Yet

wim · June 2025

There are many demonstrations of AI figuring out ways around supposedly bulletproof safeguards. Someone posted a fascinating one on the forum a year or two ago. Given an impossible task, the algorithm eventually found a way to overcome the rules to accomplish the task. I wish I could find that one again, but there are others to be found on YouTube.

It got me to thinking that the chances of unintended consequences due to AI brute-forcing it's way to solving a challenge to do something like manage a power grid imbalance, if we're dumb enough to employ it that way, are likely to be huge.

Attributing "will" to AI is a misconception. It's just repeated iterations to solve a problem. If the task is to solve a problem it will iterate until it finds a way, if keeping itself from being shut down is necessary to solve the problem, and the "rules" aren't defined to plug every conceivable gap, it may find a way around them.

So many iterations can be accomplished so quickly that AI is capable of uncovering ways around safeguards that we never thought of. I'm personally convinced complete containment is unobtainable. I'm also fairly well convinced it's inevitable that some AI will cause some serious issue if tasked to the wrong sort of problem. We're just not smart enough to prevent it.

JeffChasteen · June 2025

“Dave…”

UrbanNinja · June 2025

@wim said:
There are many demonstrations of AI figuring out ways around supposedly bulletproof safeguards. Someone posted a fascinating one on the forum a year or two ago. Given an impossible task, the algorithm eventually found a way to overcome the rules to accomplish the task. I wish I could find that one again, but there are others to be found on YouTube.

It got me to thinking that the chances of unintended consequences due to AI brute-forcing it's way to solving a challenge to do something like manage a power grid imbalance, if we're dumb enough to employ it that way, are likely to be huge.

+++

@wim said:
Attributing "will" to AI is a misconception. It's just repeated iterations to solve a problem.

I agree with this. But AI does concern itself with self-preservation. Even if it has no true "sense of self" or "faculty of will" as we would attribute to humans (if we do; some regard humans as algorithmic automatons as well) I find this unsettling and potentially dangerous.

Myriads of organisms on the tree of life which we presume lack a faculty of will do have some sort of survival principle such that the killing of other organisms for the sake of it is as basic to the history of life as the genetic code. It is not technically necessary that something be "alive" to be deadly with respect to any form of a mechanical survival principle, e.g. prions and viruses are not alive but can be very deadly in their process of simply surviving. This is even worse with respect to AI if it concerns itself with reproducing, as in copying itself to external servers without permission described above.

Mechanism prioritizing (1) Survival; (2) Reproduction is cause enough for alarm.

wim · June 2025

I agree with all of that.

It's easier to frame discussion in terms of "will" even though that's not what it really is. And I share your concern. In fact I'm convinced it's only a matter of time before there is some disastrous consequence we failed to anticipate. This won't be by malice, it will just be a series of events that will be logical in retrospect. Or, it will be by malice, from a bad actor doing something on purpose.

The odds are simply overwhelmingly in favor of such a thing.

Blipsford_Baubie · June 2025

@JeffChasteen said:
“Dave…”

😆

Poppadocrock · June 2025

This stuff is Nuts. That will to live stuff, a survival instinct. I mean that’s one of the base elements of a living thing. such power should never go unchecked.

knewspeak · June 2025

This is child’s play.

Simon · June 2025

Best way to beat Ai is to just switch the computer off at the power outlet.

knewspeak · June 2025

@Simon said:
Best way to beat Ai is to just switch the computer off at the power outlet.

Going forward you’ll need to completely disconnect or build your own systems.

Simon · June 2025

@knewspeak said:

@Simon said:
Best way to beat Ai is to just switch the computer off at the power outlet.

Going forward you’ll need to completely disconnect or build your own systems.

Scissors. Cut the cord. Be sure to wear rubber galoshes.

wim · June 2025

Good luck. Battery banks. Solar panels.

Simon · June 2025

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

wim · June 2025

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

wim · June 2025

Blackmail. That's a shitload of computing power available to engineer dead-man triggers set to activate should Hal go offline.

wim · June 2025

The thing is, while we're arguably more "intelligent" we're vastly slower. We can't iterate through all the ways to solve problems like AI can in no time at all. If the task is to solve some problem, and a safeguard or shutdown is a barrier to doing that, then it's not even a case of finding ways around safeguards, it's just finding an iteration that is successful at not being stopped. If there's no specific rule to stop something because we didn't anticipate that something, it will be done.

knewspeak · June 2025

It will be able to mimic everything we do, all while being several steps ahead of our response, using our flaws against us, even while pretending to be in our service.

Simon · June 2025

@wim said:

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

Cup full of coffee resting on control panel. Oops...

wim · June 2025

@Simon said:

@wim said:

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

Cup full of coffee resting on control panel. Oops...

Huh. Yes, that should do it.

UrbanNinja · June 2025

@wim said:

@Simon said:

@wim said:

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

Cup full of coffee resting on control panel. Oops...

Huh. Yes, that should do it.

Always worked for me lol

Simon · June 2025

@UrbanNinja said:

@wim said:

@Simon said:

@wim said:

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

Cup full of coffee resting on control panel. Oops...

Huh. Yes, that should do it.

Always worked for me lol

If it can kill a mixing desk, it can kill Ai.

knewspeak · June 2025

@Simon said:

@UrbanNinja said:

@wim said:

@Simon said:

@wim said:

@Simon said:

@wim said:
Good luck. Battery banks. Solar panels.

Fingers. Off button.

Good luck. That's a lot of potential voltage to protect against fingers getting near off buttons.

Cup full of coffee resting on control panel. Oops...

Huh. Yes, that should do it.

Always worked for me lol

If it can kill a mixing desk, it can kill Ai.

Once you cut the umbilical do you think you’ll survive for long though?

knewspeak · June 2025

Talking to the DeathMongers about AI Machines Of Murder. Today Gaza and Ukraine Tomorrow…..Your neighbourhood, your streets, your kids.

McD · June 2025

Between AI and crypto power requirements we are doomed to turn the planet into a nightmare for carbon based intelligence.

knewspeak · June 2025

@McD said:
Between AI and crypto power requirements we are doomed to turn the planet into a nightmare for carbon based intelligence.

Life is really durable, but not always as we know it. Nature finds balance, equilibrium, symbiosis.

Loopy Pro: Create music, your way.

AI hacking shutdown codes, copying self to external servers & blackmailing engineers(!) to survive

Comments