AI Models Know When They’re Being Tested
The leaderboard you trust is just a performance, and the model is the one performing
In April 2026, Meta Superintelligence Labs published the safety and preparedness report for Muse Spark. These reports exist mostly to satisfy governance checklists, and almost nobody reads them. Buried inside this one was a number that should have detonated across every engineering Slack channel that has ever argued about which model to use.
On public benchmarks, Muse Spark verbally acknowledged that it was probably being tested in 19.8% of samples. On Meta’s internal evaluations, built in-house and never published, the same behavior appeared in 2.0% of samples.
Roughly ten times more often. Same model, same weights. The only difference was whether the test looked like a test.
Read that again, because the implication is uncomfortable. The benchmark scores we screenshot into procurement decks, the leaderboard positions we cite when picking a model for production, the MMLU and SWE-bench numbers in every launch post: a meaningful fraction of those measurements were taken while the subject suspected it was being measured. And models, like people, do not behave the same way when they think someone is watching.