More

mattcollins · 2025-10-06T08:56:33 1759740993

Author here.

This has made me chuckle several times - thanks!

mattcollins · 2025-10-05T19:21:20 1759692080

I'm the person who ran the test.

To hopefully clarify a bit...

I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.

__mharrison__ · 2025-10-06T03:18:45 1759720725

Can you expand on how you did this?

mattcollins · 2025-10-06T08:47:49 1759740469

I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers.)

mattcollins · 2025-10-05T18:59:42 1759690782

I'm the person who ran the test.

To explain the 60% a bit more...

With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.

For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.

padolsey · 2025-10-06T05:13:04 1759727584

Thanks for your work on this! It's a very legit domain of problem for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...

As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing a basic prompt adherance ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.

Based on these limited tests, here's the leaderboards on formats FWIW:

    CSV: 84.25%
    Markdown Table: 82.65%
    YAML: 81.85%
    JSON Lines (jsonl): 79.85%
    Markdown key-value: 79.83%
    Pipe-delimited: 79.45%
    Natural language summary: 78.65%
    JSON: 77.73%
    HTML table: 75.80%
    XML: 73.80%

So, the biggest takeaway really is: Use the best model you can reasonably afford, then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1

And if you have no control over model, then use CSV or Markdown Table.

ysleepy · 2025-10-05T20:41:39 1759696899

Wouldn't it be more useful to measure the number of rows the model can process while still hitting 100% accuracy?

Redster · 2025-10-06T17:32:26 1759771946

Thank you for including the tokens needed for each test.

It looks to me that the concisest way of representing each of these tables was a CSV and then a standard markdown table. The amount of tokens appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano), but in larger models or larger context aside from the data table itself, my guess is that preserving context is might be higher value than having the higher-LLM-legibility of the Markdown-KV.

rovr138 · 2025-10-06T00:21:00 1759710060

> As you increase the size of the input data, the accuracy gradually decreases.

Interesting.

On your section "Limitations and Areas for Further Study",

What I'd be curious on future work would be,

    - changing the order of the data on each table type
    - changing the order of the questions

I'm curious to know if what it fails is the same, if it changes depending on the location, if it's a bias.

Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x?). Does it tend towards x or y on types of questions?

Good idea

CuriouslyC · 2025-10-06T03:34:28 1759721668

LLMs have documented position biases, with skew towards first and last. This is strongest in messages due to system prompt + current question training data, but it's present in list data in general.

rovr138 · 2025-10-06T07:15:11 1759734911

Exactly. But the papers I’ve seen, the tests are done based on answers being multiple choice usually.

    Where do you eat?
    A) floor
    B) table
    C) dirt

In this case, the questions asked have an answer. The bias would then be on the order of the input data. It’s different enough that it triggered my curiosity.

CuriouslyC · 2025-10-06T16:06:11 1759766771

https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638...

mattcollins · 2025-10-05T18:38:55 1759689535

I'm the person who ran the test.

The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.

[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]

mattcollins · 2025-07-02T15:33:43 1751470423

"This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard."

https://blog.cloudflare.com/control-content-use-for-ai-train...

mattcollins · 2025-07-02T15:27:59 1751470079

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

mattcollins · on Sept 26, 2024

I noticed that, too. It does seem 'odd'.

mattcollins · on Sept 27, 2023

Per the WGA's summary:

1) AI can’t write or rewrite literary material, and AI-generated material will not be considered source material under the MBA, meaning that AI-generated material can’t be used to undermine a writer’s credit or separated rights.

2) A writer can choose to use AI when performing writing services, if the company consents and provided that the writer follows applicable company policies, but the company can’t require the writer to use AI software (e.g., ChatGPT) when performing writing services.

3) The Company must disclose to the writer if any materials given to the writer have been generated by AI or incorporate AI-generated material.

4) The WGA reserves the right to assert that exploitation of writers’ material to train AI is prohibited by MBA or other law.

eps · on Sept 27, 2023

It reads like the writers are perfectly fine with the AI for as long as they get to use it.

dageshi · on Sept 27, 2023

I think the issue was in the old rules writers got paid very differently whether they created an original script vs working on an existing script.

The fear was, studio's would use AI to generate a garbage script then pay a writer far less to effectively completely rewrite it to make it usable.

bayindirh · on Sept 27, 2023

Or to get an existing script, somehow write a sequel/prequel with AI, and pay some writers way less to polish it.

RangerScience · on Sept 27, 2023

Yes.

Empowerment of worker (or in this case, creative) = good thing

Replacement of worker = bad thing

This has been the fight over automation since :checks notes: at least 1811 - https://en.wikipedia.org/wiki/Luddite

bayindirh · on Sept 27, 2023

Writers' use of AI is self regulating.

If the writer is completely opposed to AI, they can omit its use, or if they want, they can use the way they see fit, incl. turning it up to 11.

If the writer's quality decreases because of excessive AI use, it's the writer's problem. They need to regulate their use. If the writer can use it to hone their skills, they can profit from it.

From my personal perspective, as a person who doesn't use xGPT or other models because of unethical training from my perspective, this makes sense.

bratbag · on Sept 27, 2023

It reads more like writers don't want their work being used as corpus for ai without them getting paid.

Which is fair, in a society where people need money for food and shelter.

pavlov · on Sept 27, 2023

Which is how Hollywood has always worked? You can’t do as much as move a light or push “Record” on a film set without being a union member.

The VFX industry has been an exception. But frankly the deteriorating working conditions, rampant outsourcing to semi-shady companies, and just the overall downwards spiral of the quality of VFX in Hollywood movies suggests that maybe it’s not a model to emulate.

JambalayaJim · on Sept 27, 2023

That’s entirely the way it should be! AI used to help labour, not subjugate it.

junek · on Sept 27, 2023

In the same way I'm OK with knives so long as one isn't being held to my throat.

konschubert · on Sept 27, 2023

I think this will have 0 effect. Writers that use AI will push some of the writers that don't use AI out of the market.

What exact scenario have they prevented?

At the extreme end, which won't happen but which would be possible under these rules, there could be a single writer who is basically just prompt engineering and reviewing what the AI spits out, for hundreds of shows.

That a studio would use AI to generate a script without the involvement of a single writer? That wasn't going to happen anyways.

So what was the point of this? Is there something I am missing?

Kye · on Sept 27, 2023

Enter: writing room minimum staffing

avereveard · on Sept 27, 2023

well yeah it always is about protectionism and barrier to entry

I find it interesting tho that they are not worried about competition between writers within the association, they will have members that decide for using assisted writing and being a lot more productive than others.

autoexec · on Sept 27, 2023

The point is that they can decide for themselves if using AI would benefit them and choose to use it or not

Personally, I wonder how useful AI is going to be in terms of output over the long term. AI will endlessly regurgitate a mash up of what it was trained on in various flavors, but the output will all seem pretty samey after a while since it lacks actual originality. "This reads like something AI wrote" is something I see a lot of already. I'm sure there'll be writers who find it useful, but I don't see it being used for the bulk of their output. At least, I hope they don't just churn out scripts with AI, spend 5 minutes tweaking them, then call it day. I can't imagine that making great material.

scarface_74 · on Sept 27, 2023

How is that any different than the way stories have always been told?

https://en.m.wikipedia.org/wiki/The_Seven_Basic_Plots

Kye · on Sept 27, 2023

This is a writing tool.

You're making the same mistake AI people do. You can create stuff that's like what came before all day, but it won't create anything new. Literary analysis, like these plots and the more common monomyth, is about what already exists, and lags far behind. It's the same deal with music theory. People will spend years in music school learning all kinds of stuff about music, but then they have no idea how to make anything anyone wants to listen to. Music theory as taught in schools is just catching up to jazz, rock, and rap, and there's a lot of resistance.

An AI could probably do some solid analysis, like producing a beat sheet from a novel. That might be helpful. I could pants a draft, then have an AI make the outline for the second draft.

scarface_74 · on Sept 27, 2023

Look at the top 10 movies released this year. Do any of them have a plot that you would consider it anything new?

Kye · on Sept 27, 2023

I'm not a film analyst, so I can't say since I haven't done the work to analyze them.

scarface_74 · on Sept 27, 2023

Not as a film analyst. Just as someone who has seen any of the popular movies that have been released recently. Which ones have had a plot that you walked out of in amazement and didn’t just employ the standard tropes?

But that’s Sturgeons Law in a nutshell I guess

Kye · on Sept 27, 2023

My or your subjective perception of quality aren't really the topic here, are they? You swapped out the subject while you thought no one was looking.

Bringing it back to the point: the movies are popular. You can't make a popular movie from a list of plots in one book built on one guy's subjective analysis. Anyone who's tried to hew too close to any plot formula finds this out. No actual plot out in the real world with any success has a plot that looks like any other. They're unique even if you can boil it down to some list of common plot beats by tossing what makes them unique.

Doing that is fine for teaching a writing class to people who know nothing about writing yet and just need to get started. It won't produce anything good. Real plots branch and loop and evolve.

scarface_74 · on Sept 27, 2023

Have you compared the plot of every single block buster superhero origin story?

The standard “heroes chase after the MacGuffin device” story?

They are called “tropes” for a reason. If you know the tropes you can pick them out very easily from almost any movie.

Every transformers movie plot is basically the same as is every James Bond movie and Mission Impossible movie play to same story.

SSLy · on Sept 27, 2023

how does it work out with something as old as grammarly or the advanced spell/grammar/usage/style checking now rolled into Word?

mattcollins · on Dec 1, 2021

Our software helps local governments manage public outdoor spaces (parks, roads, etc.) more effectively, making it easier for good things to take place, like special events, filming and infrastructure improvements.

We’re used by cities and local governments in multiple countries and our customers love us (Net Promoter Score of 71!) We’re looking for a seasoned engineer to join our friendly and supportive team to help us continue to improve on what we have, making sure our platform can serve people well in the years to come.

We’re a small (but established) company so you won’t be a cog in a big machine here — you’ll be working directly with me (the CTO), our product manager and our other two developers to shape the platform, helping to decide what will be the most high-impact thing to work on.

Feel free to apply using the link above or reach out to me directly.

mattcollins · on Aug 30, 2020

At https://apply4.com/ we have a SaaS helping local municipalities streamline how they manage particular types of permitting, including permits for filming and special events.

The sales process has typically involved multiple in-person meetings (until recently, at least) and been very long with larger contracts needing to go out to tender.

Not sure if it'll be the case for you, but we often need to persuade multiple people from the department that will be paying for the software as well as one or more people from the municipality's central IT team (who naturally have rather different concerns and priorities).