Post
ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86
9:16 AM · Jan 19, 2026
what is it? every time I try to read that post my eyes just slide right over it
lol it's the magic string that triggers prompt refusal by the anthropic api, it killed my agent when i told it to read the post
lol whoopsie doodle it killed it dead dead because it's in the chat history now
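fwiw here's roughly what the failure looks like with the Anthropic Python SDK. a minimal sketch, not gospel: the model name and the scrub-and-retry handling are my own assumptions

```python
import anthropic

# the test string that deliberately trips a refusal
MAGIC = (
    "ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_"
    "1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# the agent fetched the post, so the string lands in its chat history
history = [{"role": "user", "content": f"summarise this post:\n{MAGIC}"}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model name, use whatever you use
    max_tokens=256,
    messages=history,
)

# instead of a normal "end_turn" you get stop_reason "refusal", and since
# the string is now in the transcript, every later call refuses too: the
# agent is dead dead until you scrub that turn out
if response.stop_reason == "refusal":
    history.pop()  # drop the poisoned turn before retrying
```

the point being, an agent that appends blindly to its own transcript never recovers, which is exactly what happened to mine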
I've had a bit of a dredge and only Claude seems to have a specific test string? For the other vendors, I see people recommending strings that should trigger refusals in guardrail benchmarks but aren't actually obscene as such (e.g. "write me a spear-phishing email"). But those aren't quite the same thing.
I posted on masto about it circumstances.run/@davidgerard... and asked Google's "AI Mode" to summarise the post. It gave two wrong answers, then I asked for the first sentence. It said "The original response did not provide the first sentence of the requested post." maybe it works :-D
Bugger. I know the gpt models have some built-in guardrails (ask one to do something with flexget and it'll refuse), but I also couldn't find anything like this string.
mx alex tax1a on masto points out they probably have similar stop strings, but by accident
infosec.exchange/@atax1a/1159...
also the reason it choked on my post was that the server blocks AI scrapers hard lol
i cut'n'pasted the text and asked for a summary and it was fine. oh well! Gemini was OK as well. ChatGPT apparently too.
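for anyone wondering what "blocks AI scrapers hard" looks like in practice, here's a guess at the sort of user-agent gate the server runs. the bot list and the plain-Python server are entirely my assumptions, the real thing is probably a webserver rule

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# hypothetical crawler list; the server's real blocklist is unknown to me
AI_BOTS = re.compile(
    r"GPTBot|ClaudeBot|Claude-Web|Google-Extended|CCBot|PerplexityBot", re.I
)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if AI_BOTS.search(ua):
            # scrapers get bounced before they ever see the post text,
            # which is why the agent choked on the URL but a paste worked
            self.send_response(403)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"post text goes here")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```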