Fiction.liveBench May 22 2025
Introducing Fiction.LiveBench: The First Real-World Long Context Benchmark for Writers
Fiction.live has AI tools that help writers save time by generating summaries, timelines, and character bibles, and by iterating on those documents in insightful ways. To do that effectively, the LLM needs to truly understand the story and each character's motivations on a deep level. In practice, however, today's AI models frequently lose track of plots, fail to grasp character motivations, and produce slop that's completely misaligned with an author's intent.
The root problem is that long context comprehension is still broken.
Fiction.live hosts a huge repository of long, complex stories, so we're well positioned to clarify the situation for the public.
Most LLMs claim to support tens or even hundreds of thousands of tokens of context, but real-world experience tells us otherwise.
To really understand a story the LLM needs to do things like:
- track changes over time - e.g. they hate each other, now they love each other, now they hate each other again, oh now their hatred has morphed into obsession
- make logical predictions based on established hints
- distinguish secrets shared in confidence with the reader from those known to the characters
- and so on
It's a specific, real-world long-context test that we find more reflective of writing use than LongBench or RULER, which test search rather than comprehension.
From our experience, most LLMs CAN handle these tasks, but not over long context. That's why we're launching a new benchmark, Fiction.LiveBench, to demonstrate the problem and to show our users which LLM they should choose for their writing tasks.
The Fiction.LiveBench Methodology
Starting from a selection of a dozen very long, complex stories and many verified quizzes, we generated tests from progressively cut-down versions of those stories. For every test, we begin with a version that contains only the relevant information; we call this the "0"-token test. We then cut less and less aggressively to produce longer tests, in which the relevant information makes up only part of the overall story.
We then evaluated leading LLMs across different context lengths.
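To make the construction concrete, here is a minimal sketch of how such context-length variants could be assembled, assuming a tokenizer such as tiktoken and hypothetical scene lists; it is illustrative only, not our actual test-generation pipeline.

```python
# Illustrative sketch only: the function and variable names are hypothetical,
# not Fiction.live's actual pipeline.
import tiktoken  # assumed tokenizer library for counting tokens

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_variant(relevant_scenes: list[str], filler_scenes: list[str], target_tokens: int) -> str:
    """Start from the scenes needed to answer the quiz (the "0"-token test),
    then add surrounding story text until the target context length is reached."""
    variant = list(relevant_scenes)
    for scene in filler_scenes:
        if count_tokens("\n\n".join(variant)) >= target_tokens:
            break
        variant.append(scene)  # in practice, original story order would be preserved
    return "\n\n".join(variant)

# e.g. produce the 1k and 8k variants of one quiz's context:
# variants = {n: build_variant(relevant, filler, n) for n in (1_000, 8_000)}
```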
Here is an 8k example of the type of question that would be part of this benchmark. This question is considered hard and would be failed by most models. Here is the 1k variant of the same question, which is passed by more models. Grok 3, for example, fails the 8k variant but passes the 1k variant. The actual dataset comes from our users and will remain private.
o1 and o3-mini use default (medium) settings. Claude Sonnet 3.7 (thinking) uses 8k thinking tokens.
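For reference, a call with those settings might look roughly like the sketch below; the client libraries, model IDs, and parameters shown are assumptions about the providers' SDKs rather than our exact evaluation harness.

```python
# Hedged sketch of the settings described above; model IDs and parameters
# are assumptions about the OpenAI/Anthropic SDKs, not our exact harness.
from openai import OpenAI
from anthropic import Anthropic

question_prompt = "<story excerpt followed by the quiz question>"

# o3-mini with default (medium) reasoning effort
openai_client = OpenAI()
openai_resp = openai_client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": question_prompt}],
)

# Claude Sonnet 3.7 with an 8k thinking-token budget
anthropic_client = Anthropic()
anthropic_resp = anthropic_client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=16_000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": question_prompt}],
)
```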
Results
Key Takeaways
- o3 is now the clear SOTA.
- DeepSeek-R1 significantly outperforms o3-mini and is a great choice for price-conscious users. The non-reasoning version falls off suddenly at higher context lengths.
- GPT-4.5-preview and GPT-4.1 are the best non-reasoning models.
- Google's Gemini 2.5 Pro is excellent. This is the first time an LLM has been potentially usable for long-context writing. We're interested in testing larger token sizes with it now.
- The Gemini 2.5 Pro preview versions seem worse than the original exp.
- Gemma-3 is not very good on this test.
- Anthropic’s Sonnet-3.7 shows a huge improvement over 3.5. The thinking variant uses 8,000 thinking tokens, which should be ample since the logic is simple.
- Jamba starts below 50% even at the shortest contexts, but the drop-off from there is mild.
- Qwen-Max is good at the small context windows where we have data. QwQ is great, better than R1.
- Qwen3 does not beat QwQ-32B but is competitive with models from other companies.
- Llama 4 is average. Maverick performs similarly to Gemini 2.0 Pro (exp-02-05) and Scout is similar to GPT-4.1-nano.
- Grok 3 is solid: it comes in slightly behind GPT-4o for the instruct version and ahead of o3-mini for the thinking version.
What’s Next?
These results confirm what writers have been telling us: LLMs today struggle with real-world long-context writing tasks.
Stay tuned for updates in the coming weeks! In the meantime, check out Fiction.LiveBench and see which model works best for your writing needs.
Let us know your thoughts on our results. We're also open to sponsorship offers that will help us improve this eval; there is a lot of potential to increase both the difficulty and the real-world fidelity of the tests. Reach out via DMs here or on Twitter: https://x.com/ficlive.
Why These Benchmark Scores May Seem Low
LLMs typically advertise large context windows, and sometimes they do seem to work at those lengths. Other evaluations, like the popular "needle in a haystack" style tests, show LLMs passing with flying colors across large contexts.
The difference is that this benchmark is harder than other tests you may have seen and asks more complex questions than the typical LLM interaction.
We deliberately designed hard questions that test understanding of subtext rather than information that can be searched for. This requires actually reading and understanding the full context rather than just searching for and focusing on the relevant bits (which many LLMs optimize for and do well). Our tests deliberately test cases where this search strategy does not work, as is typical in fiction writing.
Changelog
2/21/2025 - To reflect common usage we increased the number of easy questions in our benchmark set. We added gemini-2.0-pro-exp-02-05:free, deepseek-chat:free (v3), and dolphin3.0-r1-mistral-24b:free to the benchmark results.
2/25/2025 - Added Claude Sonnet 3.7
2/28/2025 - Added gpt-4.5-preview
3/14/2025 - Added qwq-32b and gemma-3-27b
3/25/2025 - Added deepseek-v3-0324 and gemini-2.5-pro-exp-03-25
4/3/2025 - Added quasar-alpha
4/6/2025 - Added Llama 4
4/10/2025 - Added Grok 3 and updated Llama 4 after inference provider updated with vllm fixes. Thanks @jon_durbin.
4/14/2025 - Added GPT 4.1 family.
4/17/2025 - Added o3 and o4-mini. Both were run on default (medium) settings.
4/17/2025 - Added Gemini 2.5 Flash and Thinking.
4/29/2025 - Added Qwen3 up to 16k for now.
5/06/2025 - Added Gemini Pro 2.5 03-25 and 05-06.
5/22/2025 - Added Gemini 2.5 Flash Preview 05-20. Expanded certain models up to 192k length. Added Claude-4 non-thinking.
5/28/2025 - Added the new DeepSeek R1-0528.