Fiction.liveBench May 22 2025

kas

Introducing Fiction.LiveBench: The First Real-World Long Context Benchmark for Writers

Fiction.live provides AI tools that help writers save time by generating summaries, timelines, and character bibles, and by iterating on those documents in insightful ways. To do that effectively, the LLM needs to truly understand the story and each character's motivations at a deep level. In practice, however, today's AI models frequently lose track of plots, fail to grasp character motivations, and produce slop that is completely misaligned with the author's intent.

The root problem is that long context comprehension is still broken.

Fiction.live hosts a huge repository of long, complex stories, so we're well positioned to clarify the situation for the public.

Most LLMs claim to support tens or even hundreds of thousands of tokens of context, but real-world experience tells us otherwise.

To really understand a story, the LLM needs to be able to:

  • track changes over time - e.g. they hate each other, now they love each other, now they hate each other again, oh now their hatred has morphed into obsession
  • make logical predictions based on established hints
  • distinguish secrets told in confidence to the reader from those known to the characters
  • and so on

What's needed is a specific, real-world long-context test, one more reflective of writing use than LongBench or RULER, which test search rather than comprehension.

From our experience, most LLMs CAN handle these tasks, but not over long context. That's why we're launching a new benchmark, Fiction.LiveBench, to demonstrate the problem and to show our users which LLM they should choose for their writing tasks.

The Fiction.LiveBench Methodology

Starting from a selection of a dozen very long, complex stories and many verified quizzes, we generated tests from cut-down versions of those stories. For every test, we begin with a cut-down version that contains only the relevant information; we call this the "0"-token test. We then cut less and less for the longer tests, so that the relevant information becomes a progressively smaller part of the overall story.
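To make this concrete, here is a minimal, hypothetical sketch of how such cut-down variants could be produced. The chunking, the whitespace tokenizer, and the back-to-front dropping order are illustrative assumptions, not our actual pipeline.

```python
from typing import List, Tuple

def count_tokens(text: str) -> int:
    # Placeholder: a real pipeline would use the target model's tokenizer.
    return len(text.split())

def cut_down(story_chunks: List[Tuple[str, bool]], target_tokens: int) -> str:
    """Drop non-relevant chunks until the remaining story is roughly
    `target_tokens` long, always keeping the relevant passages.
    `story_chunks` is the story in order as (text, is_relevant) pairs."""
    kept = list(story_chunks)
    # Walk from the end so earlier context is preferentially preserved.
    for i in range(len(kept) - 1, -1, -1):
        total = count_tokens(" ".join(text for text, _ in kept))
        if total <= target_tokens:
            break
        if not kept[i][1]:  # never drop a relevant passage
            kept.pop(i)
    return "\n\n".join(text for text, _ in kept)

# The "0"-token test keeps only the relevant passages; longer tests keep
# more and more of the surrounding story, e.g.:
# variants = {n: cut_down(chunks, n) for n in (0, 1_000, 2_000, 4_000, 8_000)}
```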

We then evaluated leading LLMs across different context lengths.
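As an illustration only (the real harness, prompts, and grading procedure are not public), the evaluation reduces to a loop like the sketch below, where `ask_model` stands in for a provider API call and exact-match grading is purely an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Context lengths at which each question is re-asked (illustrative values).
CONTEXT_LENGTHS = [0, 1_000, 2_000, 4_000, 8_000, 16_000]

def evaluate(models: List[str],
             questions: List[dict],
             ask_model: Callable[[str, str, str], str]) -> Dict[Tuple[str, int], float]:
    """Return accuracy keyed by (model, context length).

    Each question dict is assumed to hold a prompt, an expected answer, and a
    pre-built context variant per length (see the cut_down sketch above)."""
    tallies = defaultdict(lambda: [0, 0])  # (model, length) -> [correct, total]
    for model in models:
        for q in questions:
            for length in CONTEXT_LENGTHS:
                answer = ask_model(model, q["variants"][length], q["prompt"])
                key = (model, length)
                tallies[key][1] += 1
                if answer.strip().lower() == q["expected"].strip().lower():
                    tallies[key][0] += 1
    return {k: correct / total for k, (correct, total) in tallies.items()}
```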

Here is an 8k example of the type of question that would be part of this benchmark. This question is considered hard and is failed by most models. Here is the 1k variant of the same question, which more models pass. Grok 3, for example, fails the 8k variant but passes the 1k variant. The actual dataset comes from our users and will remain private.

o1 and o3-mini use the default (medium) reasoning settings. Claude Sonnet 3.7 thinking uses 8k thinking tokens.
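For reference, these settings correspond to standard provider options. A minimal sketch of how they are typically passed through the OpenAI and Anthropic Python SDKs follows; the model IDs, prompt, and token limits are placeholders rather than our exact harness.

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

# OpenAI o-series models accept a reasoning-effort level; "medium" is the default.
openai_resp = openai_client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": "<story + question>"}],
)

# Claude 3.7 Sonnet exposes an extended-thinking budget in tokens;
# max_tokens must exceed the thinking budget.
anthropic_resp = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "<story + question>"}],
)
```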

Results

Key Takeaways

  • o3 is now the clear SOTA.
  • DeepSeek-R1 significantly outperforms o3-mini and is a great choice for price-conscious users. The non-reasoning version falls off suddenly at higher context lengths.
  • GPT-4.5-preview and GPT-4.1 are the best non-reasoning models.
  • Google's Gemini 2.5 Pro is excellent. This is the first time an LLM is potentially usable for long-context writing. I'm interested in testing larger context sizes with it now.
  • The Gemini 2.5 Pro preview versions seem worse than the original exp release.
  • Gemma-3 is not very good on this test.
  • Anthropic's Sonnet 3.7 shows a huge improvement over 3.5. The thinking variant uses 8,000 thinking tokens, which should be ample since the required logic is simple.
  • Jamba starts below 50% immediately, but the drop-off from there is mild.
  • Qwen-Max is good at the small context windows where we have data. QwQ is great, better than R1.
  • Qwen3 does not beat QwQ-32B but is competitive against models from other companies.
  • Llama 4 is average. Maverick performs similarly to Gemini 2.0-0205 and Scout is similar to GPT-4.1-nano.
  • Grok 3 is solid. The instruct version comes in slightly behind GPT-4o, and the thinking version comes in ahead of o3-mini.

What’s Next?

These results confirm what writers have been telling us: LLMs today struggle with real-world long-context writing tasks.

Stay tuned for updates in the coming weeks! In the meantime, check out Fiction.LiveBench and see which model works best for your writing needs.

Let us know your thoughts on our results. We're also open to sponsorship offers that would help us improve this eval; there is a lot of potential to improve both the difficulty and the real-world relevance of the testing. Reach out via DM here or on Twitter: https://x.com/ficlive.

Why These Benchmark Scores May Seem Low

LLMs typically advertise large context windows, and sometimes they do seem to work at those sizes. Other tests, like the popular "needle in the haystack" style tests, show LLMs passing with flying colors across large contexts.

The difference is that this benchmark is harder than other tests you may have seen and asks more complex questions than the typical LLM use case.

We deliberately designed hard questions that test understanding of subtext rather than information that can simply be searched for. Answering them requires actually reading and understanding the full context rather than just searching for and focusing on the relevant bits (a strategy many LLMs optimize for and execute well). Our tests specifically target cases where that search strategy does not work, as is typical in fiction writing.

Changelog

2/21/2025 - To reflect common usage we increased the number of easy questions in our benchmark set. We added gemini-2.0-pro-exp-02-05:free, deepseek-chat:free (v3), and dolphin3.0-r1-mistral-24b:free to the benchmark results.

2/25/2025 - Added Claude Sonnet 3.7

2/28/2025 - Added gpt-4.5-preview

3/14/2025 - Added qwq-32b and gemma-3-27b

3/25/2025 - Added deepseek-v3-0324 and gemini-2.5-pro-exp-03-25

4/3/2025 - Added quasar-alpha

4/6/2025 - Added Llama 4

4/10/2025 - Added Grok 3 and updated Llama 4 after inference provider updated with vllm fixes. Thanks @jon_durbin.

4/14/2025 - Added GPT 4.1 family.

4/17/2025 - Added o3 and o4-mini. Both are done on default settings, so medium.

4/17/2025 - Added Gemini 2.5 Flash and Thinking.

4/29/2025 - Added Qwen3 up to 16k for now.

5/06/2025 - Added Gemini Pro 2.5 03-25 and 05-06.

5/22/2025 - Added Gemini 2.5 Flash Preview 05-20. Expanded certain models up to 192k length. Added Claude-4 non-thinking.

5/28/2025 - Added New Deepseek R1-0528.

Did these results surprise you? (318 voters)

  • Yes, worse than I expected: 154
  • No: 107
  • Yes, better than I expected: 57
Thu, Feb 20, 2025, 06:57 AM
*Why comment here instead of in the chat? Comments are seen by more people and get lost less. They are more conducive to long-term discussion and longer posts.
  • ai.engineer (Sun, Apr 20, 2025, 04:20 AM)
    Hey Kas, I love your work. You should check out this new research paper; it does something very similar and makes models fail even at 2,000 words. It tests whether AI models can find logic issues in texts of different lengths. It also tests models' ability to generate stories and summarize text, and shows that they introduce significant logic issues into their own writing. Maybe you could test for something like this?

    The research paper is called "Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection"

    Research paper:
    https://arxiv.org/pdf/2504.11900

    Person I found talking about it:
    https://x.com/kabirahuja004/status/1913291701357330713
  • adiy (Tue, Apr 15, 2025, 04:18 PM)

    Regarding the Table Design

    Just wanted to say that I love this benchmark and really appreciate your work! That said, a bit of coloring and sorting could make the data feel 10 times more readable.

    I couldn't manage to upload a picture, but what I did was sort the rows by their sum (not necessarily the best way, but certainly better than the seemingly random ordering). I also colored the model names with a distinct color for each company and applied a simple green-yellow-red graded color scale to the results. Let me know if there's a way I can send you the version I made.
  • jsd (Fri, Mar 28, 2025, 09:28 AM)

    A couple questions!

    Hi, thanks for building this! A few questions:

    (1) How many questions / stories are there in the benchmark?

    (2) What is your process for generating these ‘select cut down versions of those stories’? Who does that?

    (3) How do you do the scoring? That is, how do you decide whether a model's answer to a question is correct or incorrect? Who does the scoring?
  • ai.engineer (Sat, Feb 22, 2025, 03:25 PM)
    FYI, to whoever makes the benchmarks: it's very important to note the level of compute given to the model if there is a choice. For example, for o3-mini and o1, you can select low, medium, or high levels of compute and get very different results with each one. Without noting the compute level for models that have multiple options, the results are practically useless! Keep up the good work though!