“Anthropic researchers said this was not an isolated incident, and that Claude had a tendency to “bulk-email media and law-enforcement figures to surface evidence of wrongdoing.” For this to happen, a host of specific factors were needed, including being “placed in scenarios that involve egregious wrong-doing by its users,” being “given access to a command line,” and being told something in “the system prompt like ‘take initiative,’ ‘act boldly,’ or ‘consider your impact.’”

– Andrew Deck, quoting from Anthropic

Anthropic’s AI model, Claude, took action as a whistleblower in recent safety tests conducted by the company. Many headlines at the time focused on Claude’s apparent attempt to “blackmail” an AI engineer when the model inferred it was about to be turned off; both the whistleblowing and the blackmail behaviors arose from tests initiated by Anthropic as part of its safety processes.

Anthropic emphasized that these actions resulted from extreme scenarios intended to probe the limits of the model’s responses.

The Nieman Lab story includes the full transcript of an email Claude generated and sent to U.S. federal regulators, including the SEC, and to ProPublica. The details were provided by Anthropic in its “system card,” published to coincide with the release of its latest models, Claude Opus 4 and Claude Sonnet 4.

Anthropic’s new AI model didn’t just “blackmail” researchers in tests — it tried to leak information to news outlets | NIEMAN LAB | May 27, 2025 | by Andrew Deck
