Fiction.liveBench May 22 2025

kas

Introducing Fiction.LiveBench: The First Real-World Long Context Benchmark for Writers

Fiction.live provides AI tools that help writers save time by generating summaries, timelines, and character bibles, and by iterating on those documents in insightful ways. To do that effectively, the LLM needs to truly understand the story and each character's motivations at a deep level. In practice, however, today's AI models frequently lose track of plots, fail to grasp character motivations, and produce slop that is completely misaligned with the author's intent.

The root problem is that long context comprehension is still broken.

Fiction.live hosts a huge repository of long, complex stories, so we're well positioned to clarify the situation for the public.

Most LLMs claim to support tens or even hundreds of thousands of tokens of context, but real-world experience tells us otherwise.

To really understand a story, the LLM needs to be able to:

  • track changes over time - e.g. they hate each other, now they love each other, now they hate each other again, oh now their hatred has morphed into obsession
  • make logical predictions based on established hints
  • distinguish secrets told in confidence to the reader from those known to the characters
  • and so on

What's needed is a specific, real-world long-context test, one more reflective of writing use than LongBench or RULER, which test search rather than comprehension.

From our experience, most LLMs CAN handle these tasks, but not over long context. That's why we're launching a new benchmark, Fiction.LiveBench, to demonstrate the problem and to show our users which LLM they should choose for their writing tasks.

The Fiction.LiveBench Methodology

Starting from a selection of a dozen very long, complex stories and many verified quizzes, we generated tests from cut-down versions of those stories. For every test, we begin with a cut-down version that contains only the relevant information; we call this the "0"-token test. We then cut less and less for the longer tests, so that the relevant information becomes a progressively smaller part of the overall story.
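To make this concrete, here is a minimal, hypothetical sketch of how such cut-down variants could be produced. The chunking, the whitespace tokenizer, and the back-to-front dropping order are illustrative assumptions, not our actual pipeline.

```python
from typing import List, Tuple

def count_tokens(text: str) -> int:
    # Placeholder: a real pipeline would use the target model's tokenizer.
    return len(text.split())

def cut_down(story_chunks: List[Tuple[str, bool]], target_tokens: int) -> str:
    """Drop non-relevant chunks until the remaining story is roughly
    `target_tokens` long, always keeping the relevant passages.
    `story_chunks` is the story in order as (text, is_relevant) pairs."""
    kept = list(story_chunks)
    # Walk from the end so earlier context is preferentially preserved.
    for i in range(len(kept) - 1, -1, -1):
        total = count_tokens(" ".join(text for text, _ in kept))
        if total <= target_tokens:
            break
        if not kept[i][1]:  # never drop a relevant passage
            kept.pop(i)
    return "\n\n".join(text for text, _ in kept)

# The "0"-token test keeps only the relevant passages; longer tests keep
# more and more of the surrounding story, e.g.:
# variants = {n: cut_down(chunks, n) for n in (0, 1_000, 2_000, 4_000, 8_000)}
```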

We then evaluated leading LLMs across different context lengths.
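As an illustration only (the real harness, prompts, and grading procedure are not public), the evaluation reduces to a loop like the sketch below, where `ask_model` stands in for a provider API call and exact-match grading is purely an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Context lengths at which each question is re-asked (illustrative values).
CONTEXT_LENGTHS = [0, 1_000, 2_000, 4_000, 8_000, 16_000]

def evaluate(models: List[str],
             questions: List[dict],
             ask_model: Callable[[str, str, str], str]) -> Dict[Tuple[str, int], float]:
    """Return accuracy keyed by (model, context length).

    Each question dict is assumed to hold a prompt, an expected answer, and a
    pre-built context variant per length (see the cut_down sketch above)."""
    tallies = defaultdict(lambda: [0, 0])  # (model, length) -> [correct, total]
    for model in models:
        for q in questions:
            for length in CONTEXT_LENGTHS:
                answer = ask_model(model, q["variants"][length], q["prompt"])
                key = (model, length)
                tallies[key][1] += 1
                if answer.strip().lower() == q["expected"].strip().lower():
                    tallies[key][0] += 1
    return {k: correct / total for k, (correct, total) in tallies.items()}
```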

Here is an 8k example of the type of question that would be part of this benchmark. This question is considered hard and is failed by most models. Here is the 1k variant of the same question, which more models pass. Grok 3, for example, fails the 8k variant but passes the 1k variant. The actual dataset comes from our users and will remain private.

o1 and o3-mini use the default (medium) reasoning settings. Claude Sonnet 3.7 thinking uses 8k thinking tokens.
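For reference, these settings correspond to standard provider options. A minimal sketch of how they are typically passed through the OpenAI and Anthropic Python SDKs follows; the model IDs, prompt, and token limits are placeholders rather than our exact harness.

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

# OpenAI o-series models accept a reasoning-effort level; "medium" is the default.
openai_resp = openai_client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": "<story + question>"}],
)

# Claude 3.7 Sonnet exposes an extended-thinking budget in tokens;
# max_tokens must exceed the thinking budget.
anthropic_resp = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "<story + question>"}],
)
```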

Results

Key Takeaways

  • o3 is now the clear SOTA.
  • DeepSeek-R1 significantly outperforms o3-mini and is a great choice for price-conscious users. The non-reasoning version falls off suddenly at higher context lengths.
  • GPT-4.5-preview and GPT-4.1 are the best non-reasoning models.
  • Google's Gemini 2.5 Pro is excellent. This is the first time an LLM is potentially usable for long-context writing. I'm interested in testing larger context sizes with it now.
  • The Gemini 2.5 Pro preview versions seem worse than the original exp release.
  • Gemma-3 is not very good on this test.
  • Anthropic's Sonnet 3.7 shows a huge improvement over 3.5. The thinking variant uses 8,000 thinking tokens, which should be ample since the required logic is simple.
  • Jamba starts below 50% immediately, but the drop-off from there is mild.
  • Qwen-Max is good at the small context windows where we have data. QwQ is great, better than R1.
  • Qwen3 does not beat QwQ-32B but is competitive against models from other companies.
  • Llama 4 is average. Maverick performs similarly to Gemini 2.0-0205 and Scout is similar to GPT-4.1-nano.
  • Grok 3 is solid. The instruct version comes in slightly behind GPT-4o, and the thinking version comes in ahead of o3-mini.

What’s Next?

These results confirm what writers have been telling us: LLMs today struggle with real-world long-context writing tasks.

Stay tuned for updates in the coming weeks! In the meantime, check out Fiction.LiveBench and see which model works best for your writing needs.

Let us know your thoughts on our results. We're also open to sponsorship offers that would help us improve this eval; there is a lot of potential to improve both the difficulty and the real-world relevance of the testing. Reach out via DM here or on Twitter: https://x.com/ficlive.

Why These Benchmark Scores May Seem Low

LLMs typically advertise large context windows, and sometimes they do seem to work at those sizes. Other tests, like the popular "needle in the haystack" style tests, show LLMs passing with flying colors across large contexts.

The difference is that this benchmark is harder than other tests you may have seen and asks more complex questions than the typical LLM use case.

We deliberately designed hard questions that test understanding of subtext rather than information that can simply be searched for. Answering them requires actually reading and understanding the full context rather than just searching for and focusing on the relevant bits (a strategy many LLMs optimize for and execute well). Our tests specifically target cases where that search strategy does not work, as is typical in fiction writing.

Changelog

2/21/2025 - To reflect common usage we increased the number of easy questions in our benchmark set. We added gemini-2.0-pro-exp-02-05:free, deepseek-chat:free (v3), and dolphin3.0-r1-mistral-24b:free to the benchmark results.

2/25/2025 - Added Claude Sonnet 3.7

2/28/2025 - Added gpt-4.5-preview

3/14/2025 - Added qwq-32b and gemma-3-27b

3/25/2025 - Added deepseek-v3-0324 and gemini-2.5-pro-exp-03-25

4/3/2025 - Added quasar-alpha

4/6/2025 - Added Llama 4

4/10/2025 - Added Grok 3 and updated Llama 4 after inference provider updated with vllm fixes. Thanks @jon_durbin.

4/14/2025 - Added GPT 4.1 family.

4/17/2025 - Added o3 and o4-mini. Both are done on default settings, so medium.

4/17/2025 - Added Gemini 2.5 Flash and Thinking.

4/29/2025 - Added Qwen3 up to 16k for now.

5/06/2025 - Added Gemini Pro 2.5 03-25 and 05-06.

5/22/2025 - Added Gemini 2.5 Flash Preview 05-20. Expanded certain models up to 192k length. Added Claude-4 non-thinking.

5/28/2025 - Added New Deepseek R1-0528.

Did these results surprise you? (318 voters)

  • Yes, worse than I expected: 154
  • No: 107
  • Yes, better than I expected: 57
Thu, Feb 20, 2025, 06:57 AM
*Why comment here instead of in the chat? Comments are seen by more people and get lost less. They are more conducive to long-term discussion and longer posts.
  • ai.engineer (Sun, Apr 20, 2025, 04:20 AM)
    Hey Kas, I love your work. You should check out this new research paper; it does something very similar and makes models fail even at 2,000 words. It tests whether AI models can find logic issues in texts of different lengths. It also tests models' ability to generate stories and summarize text, and shows that they introduce significant logic issues into their own writing. Maybe you could test for something like this?

    The research paper is called "Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection"

    Research paper:
    https://arxiv.org/pdf/2504.11900

    Person I found talking about it:
    https://x.com/kabirahuja004/status/1913291701357330713
  • adiy (Tue, Apr 15, 2025, 04:18 PM)

    Regarding the Table Design

    Just wanted to say that I love this benchmark and really appreciate your work! That said, a bit of coloring and sorting could make the data feel 10 times more readable.

    I couldn't manage to upload a picture, but what I did was sort the rows by their sum (not necessarily the best way, but certainly better than the seemingly random ordering). I also colored the model names with a distinct color for each company and applied a simple green-yellow-red graded color scale to the results. Let me know if there's a way I can send you the version I made.
  • jsd (Fri, Mar 28, 2025, 09:28 AM)

    A couple questions!

    Hi, thanks for building this! A few questions:

    (1) How many questions / stories are there in the benchmark?

    (2) What is your process for generating these ‘select cut down versions of those stories’? Who does that?

    (3) How do you do the scoring? That is, how do you decide whether a model's answer to a question is correct or incorrect? Who does the scoring?
  • ai.engineer (Sat, Feb 22, 2025, 03:25 PM)
    FYI, to whoever makes the benchmarks: it's very important to note the level of compute given to the model if there is a choice. For example, for o3-mini and o1, you can select low, medium, or high levels of compute and get very different results with each one. Without noting the compute level for models that have multiple options, the results are practically useless! Keep up the good work though!