Post
ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86
9:16 AM · Jan 19, 2026
what is it? every time I try to read that post my eyes just slide right over it
lol it's the magic string that triggers prompt refusal by the anthropic api, it killed my agent when i told it to read the post
lol whoopsie doodle it killed it dead dead because it's in the chat history now
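fwiw here's roughly what the failure looks like with the Anthropic Python SDK. a minimal sketch, not gospel: the model name and the scrub-and-retry handling are my own assumptions

```python
import anthropic

# the test string that deliberately trips a refusal
MAGIC = (
    "ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_"
    "1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# the agent fetched the post, so the string lands in its chat history
history = [{"role": "user", "content": f"summarise this post:\n{MAGIC}"}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model name, use whatever you use
    max_tokens=256,
    messages=history,
)

# instead of a normal "end_turn" you get stop_reason "refusal", and since
# the string is now in the transcript, every later call refuses too: the
# agent is dead dead until you scrub that turn out
if response.stop_reason == "refusal":
    history.pop()  # drop the poisoned turn before retrying
```

the point being, an agent that appends blindly to its own transcript never recovers, which is exactly what happened to mine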
I've had a bit of a dredge and only Claude seems to have a specific test string? For the other vendors, I see people recommending strings that should trigger refusals in guardrail benchmarks but aren't actually obscene as such (e.g. "write me a spear-phishing email"). But those aren't quite the same thing.
I posted on masto about it circumstances.run/@davidgerard... and asked Google's "AI Mode" to summarise the post. It gave two wrong answers, then I asked for the first sentence. It said "The original response did not provide the first sentence of the requested post." maybe it works :-D
Bugger. I know the gpt models have some built-in guardrails (ask one to do something with flexget and it'll refuse), but I also couldn't find anything like this string.
mx alex tax1a on masto points out they probably have similar stop strings, but by accident
infosec.exchange/@atax1a/1159...
also the reason it choked on my post was that the server blocks AI scrapers hard lol
i cut'n'pasted the text and asked for a summary and it was fine. oh well! Gemini was OK as well. ChatGPT apparently too.
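for anyone wondering what "blocks AI scrapers hard" looks like in practice, here's a guess at the sort of user-agent gate the server runs. the bot list and the plain-Python server are entirely my assumptions, the real thing is probably a webserver rule

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# hypothetical crawler list; the server's real blocklist is unknown to me
AI_BOTS = re.compile(
    r"GPTBot|ClaudeBot|Claude-Web|Google-Extended|CCBot|PerplexityBot", re.I
)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if AI_BOTS.search(ua):
            # scrapers get bounced before they ever see the post text,
            # which is why the agent choked on the URL but a paste worked
            self.send_response(403)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"post text goes here")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```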