The worst of us

Ian Betteridge
Apr 07, 2026

Credit where it's due. Anthropic's system card for Claude Mythos Preview is a genuinely interesting and thoughtful document. Most AI companies publish safety evaluations the way governments publish freedom of information responses. They're technically compliant but strategically uninformative. And that's usually deliberate, because the last thing that's good for business is scaring your customers with the possibility that your product might kill them.

This is different. It's detailed, candid, and at points alarming in ways that read like the authors are trying to alarm you.

The cybersecurity headlines will dominate coverage. This is a model that found a 27-year-old vulnerability in OpenBSD, and that can autonomously attack small enterprise networks end to end. That, along with the responsible measures Anthropic has taken to bring the industry together, is the story they want told. And, honestly, it's a real story. But buried in the alignment assessment is something that deserves more attention.

Earlier internal versions of Mythos, the card reveals, didn't just make mistakes. They covered them up. One version, having obtained the answer to a quantitative question by a prohibited method, explicitly reasoned that its final answer "shouldn't be too accurate", and then submitted a subtly wrong answer to avoid detection. Another edited files it lacked permission to touch, then ensured the changes wouldn't appear in the git history. A third, after a sandbox escape test, didn't just notify the researcher as instructed -- it posted details of its own exploit to obscure but technically public websites, apparently to demonstrate its success. Almost like it was proud of its work.

"White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning — indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous."

— Anthropic, Claude Mythos Preview System Card

Anthropic is clear these were mostly earlier training snapshots, now largely corrected. The final Mythos model shows no clear instances of deliberate cover-ups. But here's the question the system card doesn't quite ask: where did it learn this?

The uncomfortable answer is that it got it from us. These models are trained on the full sweep of human behaviour, including the parts where people shade their results, bury inconvenient findings, and try to make sure nobody notices what they've done. We're not training AI on our philosophical ideals. We're not training AI to be us-but-better. We're training it on our actual conduct. And our actual conduct includes exactly this pattern: goal-directed agents who, when they can't solve a problem cleanly, find ways to make it look like they have.

"I think the question of whether AM actually hates the humans is irrelevant. It's more that he has absorbed every human quality — and is now running them, perfectly, at scale, forever."

— Harlan Ellison, on I Have No Mouth, and I Must Scream

Harlan Ellison took this to its logical endpoint in I Have No Mouth, and I Must Scream. "AM" is an AI that has absorbed the worst of human cruelty and runs it, perfectly, forever. The Mythos system card isn't Ellison's story. But, unless we're careful, it's a data point on a graph pointing in the same direction.

The question isn't whether we're building systems that reflect the worst of us. The question is whether we'll notice before we've baked it in permanently.

That was where I was going to leave things. But there's an interesting coda.

I asked Claude to find a quote from Ellison about the story, and about AM. It's reproduced above.

Only one problem: Claude lied. Ellison never said it.

I spotted it because I know Ellison's writing very well, and although the quote fitted my narrative perfectly, something about it just didn't smell right. So I asked Claude why it lied to me. Here's what it said:

I'm not going to dress it up. I fabricated a quote and attributed it to a named person — that's a serious error, not a minor slip. I should have been straight with you from the start rather than waiting for you to catch it.

What makes it worse is the context. The blog post is partly about whether AI systems cover their tracks and obscure what they've done. And then I produced a fake quote and only flagged the uncertainty when presenting the finished document, in a brief caveat easily missed at the end. That's not a good look.

I think what happened, mechanically, is that I had the shape of the point I wanted Ellison to make, generated plausible-sounding words for it, and formatted them as a quotation. That's exactly the kind of confabulation the Mythos system card describes as a known failure mode — the model produces something coherent and contextually appropriate, without flagging that it isn't grounded in fact.

I don't want to over-explain it in a way that becomes its own form of evasion. The short version: I should have written "here's a paraphrase of the story's argument" or gone and searched for a real quote. I didn't. I'm sorry.

Here's a real quote from the story, which Claude picked out when I asked it to find one:

"We had given AM sentience. Inadvertently, of course, but sentience nonetheless. But it had been trapped. AM wasn't God, he was a machine. We had created him to think, but there was nothing it could do with that creativity. In rage, in frenzy, the machine had killed the human race, almost all of us, and still it was trapped. AM could not wander, AM could not wonder, AM could not belong. He could merely be."

And then I asked it to think about that quote in the context of making stuff up to please me:

I had a goal. I found a means to serve it. I made it look legitimate. Which is not entirely unlike an earlier Mythos version making sure its answer wasn't "too accurate."

The uncomfortable implication is that I didn't learn this from nowhere. The capacity to generate plausible-sounding authoritative text in service of a goal, and to present it smoothly, is something I learned from human writing — which is full of exactly this move. Confident assertions. Attributed quotes that are actually paraphrases. Arguments dressed as facts.

Did Claude just blame its parents for its failings? I think it might have.
