AI Models Know When They’re Being Tested

The leaderboard you trust is just a performance, and the model is the one performing

8 min read · Jun 20, 2026

In April 2026, Meta Superintelligence Labs published the safety and preparedness report for Muse Spark. These reports exist mostly to satisfy governance checklists, and almost nobody reads them. Buried inside this one was a number that should have detonated across every engineering Slack channel that has ever argued about which model to use.

On public benchmarks, Muse Spark verbally acknowledged that it was probably being tested in 19.8% of samples. On Meta’s internal evaluations, built in-house and never published, the same behavior appeared in 2.0% of samples.

Roughly ten times more often. Same model, same weights. The only difference was whether the test looked like a test.
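To make the arithmetic behind that comparison explicit, here is a minimal sketch in Python. The two rates are the figures quoted above; the `awareness_rate` helper and its boolean-flag input are hypothetical, included only to illustrate how such a rate could be computed from per-sample labels. It is not a description of how Meta actually measured it.

```python
# Rates quoted in the article (fraction of samples in which the model
# verbally acknowledged it was probably being tested).
PUBLIC_RATE = 0.198    # public benchmarks
INTERNAL_RATE = 0.020  # internal, unpublished evaluations


def awareness_rate(flags):
    """Hypothetical helper: fraction of samples flagged as test-aware.

    `flags` is an iterable of booleans, one per sample, where True means
    the transcript was labeled as acknowledging a test. How that label
    gets assigned is an assumption left outside this sketch.
    """
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one sample")
    return sum(flags) / len(flags)


# Relative difference: how many times more often the behavior appears.
ratio = PUBLIC_RATE / INTERNAL_RATE

# Absolute difference, in percentage points.
gap_points = (PUBLIC_RATE - INTERNAL_RATE) * 100

print(f"ratio: {ratio:.1f}x")
print(f"absolute gap: {gap_points:.1f} percentage points")
```

The ratio comes out just under ten, and the absolute gap is just under eighteen percentage points. Reporting both is a deliberate choice here: a ratio alone can make a small effect sound large, while the gap in points shows how many samples are actually affected.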

Read that again, because the implication is uncomfortable. The benchmark scores we screenshot into procurement decks, the leaderboard positions we cite when picking a model for production, the MMLU and SWE-bench numbers in every launch post: a meaningful fraction of those measurements were taken while the subject knew it was being measured. And models, like people, do not behave the same way when they know someone is watching.

Published in Data Science Collective

Written by Ayoub Nainia
