
Does Open Source AI really exist?

The Open Source Initiative (OSI) released the RC1 (“Release Candidate 1” meaning: this thing is basically done and will be released as such unless something catastrophic happens) of the “Open Source AI Definition”.

Some people might wonder why that matters. Some people come up with a bit of writing on AI, what else is new? That’s basically LinkedIn’s whole existence currently. But the OSI has a very special role in the Open Source software ecosystem. Because Open Source isn’t just about whether you can see the code but also about the license that code is covered under: You might get code that you can see but that you are not allowed to touch (think of the recent WinAmp release debate). The OSI basically took on the role of defining which of the different licenses that were being used all over the place actually are “Open Source” and which come with restrictions that undermine the idea.

This is very important: Picking a license is a political act with strong consequences. It can allow or forbid different modes of interaction with an object or might attach certain requirements to its use. The famous GPL for example allows you to take the code but forces you to also open up your own changes to it. Other licenses do not enforce this demand. Choosing a license has tangible effects.

Quick sidebar: “Open Source” already is a bit of a problematic term; it’s (in my opinion) a way to depoliticise the idea of “Free Software”. Both share certain ideas, but where “Open Source” frames things in a pragmatic “corporations want to know which code they can use” kind of way, Free Software was always more of a political movement, arguing from a standpoint of user rights and liberation. An idea that was probably damaged the most by the most visible figures in that space, who probably should just walk into the sea.

So what makes a thing “Open Source”? Well, the OSI has a brief list. You can read it quickly, but let’s focus on Point 2: Source Code:

The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Open Source Initiative

To be Open Source a piece of software needs to come with the sources. Okay, that’s not surprising. But the writers have seen some shit so they added that obfuscated code (meaning code that has been mangled to be unreadable) or intermediate forms (meaning you don’t get the actual sources but something that has already been processed) are not allowed. Cool. Makes sense. But why do people care about sources?

Sources of Truth

Open Source is a relatively new mass phenomenon. We had software before, even some we didn’t have to pay for. We called it “Freeware” back then. Freeware is software you can use without cost but that you don’t get any source code for. You cannot change the program (legally), you cannot audit it, cannot add to it. But it’s free of charge. And there was a lot of that back in my younger days. WinAmp, the audio player I talked about above, used to be Freeware and basically everyone used it. So why even care about sources?

For some it was about being able to modify the tools more easily, especially if the maintainer of the software didn’t really work on it any more or started adding all kinds of stuff you didn’t agree with (think of all those proprietary software packages today that you have to use for work and that get AI stuffed behind every other button). But there is more to it than just feature requests. There’s trust.

When I run software, I need to trust the people who wrote it. Trust them to do a good job, to build reliable and robust software. To add only the features in the documentation and nothing hidden, potentially harmful.

Especially with such large parts of our real lives running on digital infrastructures, questions of trust get more and more important. We all know that we want fully open sourced, peer-reviewed and battle-tested encryption algorithms in our infrastructures so our communication is safe from harm.

Open Source is – especially for critical systems and infrastructures – a key part of establishing that trust: Because you want (someone) to be able to verify what’s up. There has been a long push for more reproducible builds. Those build processes basically guarantee that, given the same code input, you get the same compiled result. Which means that if you want to know whether someone really delivered exactly what they said they would, you can check: Your own build process would create an identical artifact.
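To make that check concrete, here is a minimal sketch (hypothetical file names; real reproducible-build verification also involves pinning the whole toolchain, but the final comparison looks like this):

```python
import hashlib

def sha256sum(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical file names: the artifact the vendor shipped vs. the one
# you rebuilt yourself from the published sources.
shipped = sha256sum("vendor-release.bin")
rebuilt = sha256sum("my-own-build.bin")

# With a reproducible build both digests are identical; if they differ,
# the shipped binary is not what the published code actually produces.
print("matches the published sources" if shipped == rebuilt else "does NOT match")
```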

Not everyone does this level of analysis of course. And even fewer people only use software from reproducible build processes – especially with a lot of software not being compiled today. But relationships are more nuanced than code and trust is a relationship: You being a fully open book about your code and how exactly the binary version was built makes it a lot easier for me to trust you. To know what is in the software I am running on the machine that also has my bank statements or encryption keys on it.

What does this have to do with AI?

AI systems and 4 Freedoms

AI systems are a bit special. Because – especially the big ones everyone is so fascinated by – they don’t really consist of a lot of code in comparison to their size. A neural network implementation is a few hundred lines of Python, for example. An “AI system” consists not just of code but of a whole lot of parameters and data.

A modern LLM (or image generator) consists of some code. You also need a network architecture, meaning the setup of the digital neurons that are used and how they are connected. This architecture is then parameterized with the so-called “weights”, which are the billions of numbers you need to get the system to do anything. But that is of course not all.

In order to translate syllables or words into numbers for an “AI” to consume, you need an embedding, sort of a lookup table to tell you what “token” the number “227” stands for. If you took the same neural network but applied a different embedding than the one it was trained with, everything would fall to pieces. The structures wouldn’t match.
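As a toy illustration of what that lookup does (made-up vocabulary and sizes, not any real model’s tokenizer):

```python
import numpy as np

# Made-up vocabulary: the lookup table from pieces of text to numbers.
vocab = {"open": 225, "source": 226, "ai": 227}

# One row of learned numbers per possible token id (tiny sizes for this
# sketch; real models use tens of thousands of tokens and thousands of
# dimensions).
embedding_matrix = np.random.rand(1000, 4)

token_id = vocab["ai"]                # "ai" -> 227
vector = embedding_matrix[token_id]   # this row is what the network actually computes with

# Pair the same trained weights with a different vocabulary and the number
# 227 suddenly stands for something else entirely: the structures no longer match.
```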

Then there is the training process, meaning the process that created all the “weights”. In order to train an “AI” you feed it all the data you can find, and in millions and billions of iterations the weights start to emerge and crystallise. The training process, which data it used and how, is key to understanding the capabilities and issues a machine learning system has: If you want to reduce harm in a network, you do need to know whether it was trained on the Daily Stormer or not, just to give an example.
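In heavily simplified form that loop looks roughly like this (toy data and a single linear “neuron”; real systems do the same kind of weight-nudging with billions of parameters and vastly more data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "all the data you can find": toy inputs and targets.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                        # the "weights" start out meaningless
for _ in range(1000):                  # millions/billions of iterations in real systems
    grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= 0.1 * grad                    # nudge the weights a tiny bit towards the data

# Whatever ends up in w is entirely a product of the data that went in,
# which is exactly why knowing that data matters.
print(w)
```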

And here’s the catch.

The OSI’s “The Open Source AI Definition – 1.0-RC1” demands that an Open Source AI provide four freedoms to its users:

  1. Use the system for any purpose and without having to ask for permission.
  2. Study how the system works and inspect its components.
  3. Modify the system for any purpose, including to change its output.
  4. Share the system for others to use with or without modifications, for any purpose.

So far so good. That looks reasonable, right? You can inspect and modify and use and all that. Awesome. Nothing bad could happen in the fine print, right? Let’s just quickly look at what an AI system needs to offer. Code: Check. Model parameters (weights, configs): Check! We’re on a roll here. What about data?

Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

In particular, this must include: (1) a detailed description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures and data cleaning methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
Open Source Initiative

What does “sufficiently detailed information” mean? The Open Source definition never talks about “sufficiently detailed source code”. You need to get the source code. All of it. And not in obfuscated or mangled form. The actual thing. Because otherwise it doesn’t mean much, it doesn’t help you build trust.

The OSI’s definition of “Open Source AI” pokes a big hole into the idea of Open Source: By making a core part of the model – the training data – special in this weird wibbly wobbly way, they bless all kinds of things as “Open Source” that really are not, based on their own definition of what Open Source is and what it’s for.

An AI system’s training data is for all intents and purposes part of its “code”. It is as relevant to the way the model functions as literal code, for AI systems probably even more so, because the code is just generic matrix operations with delusions of grandeur.

The OSI puts another cherry on top: Users deserve a description of the “unshareable data” that was used to train a model. What is that? Let’s apply that to code again: If a software product gives us a core part of its functionality just as a compiled artifact and then describes that it’s all totally cool and above board but that the code wasn’t “shareable”, we would not call that piece of software open source. Because it does not open all the source.

Does a “description” of partially “unshareable” data help you to reproduce the model? No. You can try to rebuild the model and it might look a bit similar, but it will be significantly different. Does it help you to “study the system and inspect its components”? Only on a superficial level. But if you really want to analyse what’s in the magic statistics box, you need to know what went into it. What was filtered out exactly, what went in?

This definition seems to be very weird coming from OSI, right? It very obviously goes against core ideas of what people think open source is and should be. So why do it?

(Un)Open AI

Here’s the thing: At the scale at which we are talking about those statistical systems as “AI” today, open source AI cannot exist.

Many smaller models have been trained on explicitly selected and curated public datasets. Those can provide all the data, all the code, all the processes and can be called Open Source AI. But those are not the machines that make NVIDIA’s stock go WEEEEEEEEEEEEE.

Those big systems that are called “AI” – whether they are for image generation, text generation or are multi-modal – are all based on illegally acquired and used material. Because the data sets are too big to do actual filtering and ensure their legality. It’s just too much.

Now the more naive people among you might wonder “Okay, but if you cannot do it legally, how can you claim that this is a legitimate business?” and you’d be right, but we’re also living in a weird world where we hope that some magic innovation and/or money coming from reproducing Reddit posts will save our economy and progress.

“Open Source AI” is an attempt to “openwash” proprietary systems. In their paper “Rethinking open source generative AI: open-washing and the EU AI Act” Andreas Liesenfeld and Mark Dingemanse showed that many “Open Source” AI models offer hardly more than open model weights. Meaning: You can run the thing but you don’t actually know what it is.

Sounds like something we’ve already had: It’s Freeware. The Open Source models we see today are proprietary freeware blobs. Which is potentially marginally better than OpenAI’s fully closed approach but really only marginally.

Some models offer model cards or other docs, but most leave you in the dark. Which stems from the fact that most of those models are being developed by VC-funded companies that need some theoretical path towards monetization.

“Open Source” has become a sticker like “Fair Trade”, something to make your product look good and trustworthy. To position it outside of the evil commercial space, giving it some grassroots feeling. “We’re in this together” and shit. But we’re not. We’re not in this with Mark fucking Zuckerberg even if he gives away some LLM weights for free cause it hurts his competition. We, as normal people living on this constantly warmer planet, are not with any of those people.

But there is another aspect to it outside of doing an image makeover for tech bros and their corporations. It’s about legality. At least in Germany there are exceptions to some laws that normally would concern LLM makers: If you do it for research purposes, you are allowed to scrape basically anything. You can then train models and release those weights, and even though there’s Disney stuff in there, you are in the clear. And this is where the whole Open Source AI thing plays a relevant role: It is a wedge to legitimise probably illegal behavior through openwashing. As a corporation you take some “Open Source AI” that is based on all the stuff you wouldn’t legally be allowed to touch and use that to build your product. Do some extra training with licensed data, for example.

The Open Source Initiative has caught FOMO – just like the Nobel prize jury. They also want to be a part of the “AI” craze.

But for the systems that we are talking about today as “AI”, Open Source AI isn’t practically possible. Because we’ll never be able to download all the actual training data.

“But tante, then we will never have Open Source AI”. Exactly. That’s how reality works. If you can’t fulfil the criteria of a category you are not in that category. The fix is not to change the criteria. That’s playing pigeon chess.


CC BY-SA 4.0 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Comments

4 responses to “Does Open Source AI really exist?”

  1. @tante Uhh that's a long read, highly appreciated. This post is full of citable notions. But it needs be read in a whole, extractions will not work… Thanks tante for your words. 💖

  2. James Addison

    Thank you; I’ve sent some feedback on the OSAID rc1 license after becoming aware of it thanks to this post.

    It’s difficult to determine whether the software on a system is correct and whether modifications to that behave as expected unless you can first of all rebuild the original software.

    In my opinion, license definitions that weaken the realistic ability to rebuild digital systems do not seem likely to benefit the software community.

    Omitting or duplicating the word ‘not’ in the previous paragraph would introduce a significant difference in the meaning; fortunately if it were a software package accompanied by a unit test suite, then that regression would be easy to catch.

  3. @tante I kinda disagree with the conclusion.

    Yes, this OSI AI license is weak. But that doesn't mean we can't have a better license in the future, for actual "FLAI" (akin to "FLOSS").

    I think the article should give some citations for the assertion that "Because the data sets are too big to do actual filtering and ensuring their legality. It’s just too much.", which much of the remaining article builds upon.

    (1/2)

    #AI #FLAI #FLOSS #OpenSource

    1. @tante I'm quite confident that free (or licensed) training data and resulting AI models will appear in the future (they're called "vegan models" at https://simonwillison.net/2024/Jan/25/fairly-trained-launches-certification-for-generative-ai-models-t/).

      These "free/libre" models will likely lag behind proprietary models (just like Linux lags behind Windows? 🤔). But FLAI will come. And maybe the OSI will publish a better license then; or else someone else will.

      (2/2)

      #AI #OpenSource #FLAI #FLOSS