Just yesterday, GitHub announced that it is working on a new feature for its platform called “Copilot”: an AI coding assistant that predicts the next chunk of code a programmer may want to write while developing software, and offers to insert it at just the right time and place.
The technology, a bleeding-edge application of deep learning and neural networks, was trained on the public repositories published on GitHub. Training a neural network model means taking data (the source code of these repositories, in our case) and feeding it to the network so that it can learn what to do in similar future cases.
Copilot has seen billions of lines of code, functions, classes and object definitions before, and hence can suggest the next steps whenever enough information about the programmer’s intent is available.
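For the curious, the core mechanism is next-token prediction: given everything written so far, predict what comes next. Below is a minimal sketch of that objective as a toy character-level PyTorch model. This is purely illustrative (Copilot itself is powered by OpenAI’s Codex, a vastly larger Transformer), but the training loop is the same in spirit:

```python
# A toy sketch of next-token prediction, the training objective behind
# models like Copilot. Illustration only; not Copilot's actual architecture.
import torch
import torch.nn as nn

corpus = "def add(a, b):\n    return a + b\n"  # stand-in for billions of lines
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}

# Encode the corpus: every character becomes an integer token.
data = torch.tensor([stoi[ch] for ch in corpus])

class TinyLM(nn.Module):
    """A toy next-token predictor; real code models are large Transformers."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)  # logits for the next token at each position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Training objective: given characters 0..n-1, predict characters 1..n.
for step in range(200):
    logits = model(data[:-1].unsqueeze(0))  # shape (1, n-1, vocab)
    loss = nn.functional.cross_entropy(logits[0], data[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Scaled up by many orders of magnitude in data and parameters, this “predict the next token” loop is essentially what “training on public repositories” means.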
However, this brought a large issue into debate: many of these public repositories are licensed under the GPL and other copyleft licenses such as the AGPL, as well as permissive licenses like MIT, so is this process even legal? Is it OK for GitHub to use free software source code to train its proprietary, paid, commercial service?
Different opinions emerged in the open source community.
The Conservatives
Some open source software developers argued that the resulting neural network is a derivative work of the GPL-licensed code, and hence should itself be released under the GPL as well.
GitHub’s current CEO said that, from their point of view, this falls under “fair use”, implying that using a few lines of modified code from a public repository is not enough grounds for any kind of lawsuit against them.
However, others argue that the neural network outputs verbatim, copy-pasted snippets from various repositories on GitHub roughly 0.1% of the time, and hence it cannot fall under fair use.
Moreover, open source developers are already suffering burnout because gigantic multi-billion-dollar corporations take their free code and re-bundle it as SaaS; this new feature takes even more from them than before.
The Rationalists
Others argued that, just like a human who reads various books, tutorials and software source code to understand how software development works and doesn’t need to cite the materials they learned from, a neural network is not obligated to do so either.
“What is the difference between this and someone doing it manually? Is it just because the AI can do it faster and with more data that the AI should not do it, while humans can?” different users argued on Twitter and Reddit.
Others from the first camp, however, see that as naive thinking: neural networks rely on purely probabilistic methods to determine which code snippets to suggest, and do not actually understand what they are doing or what the right way of fitting that snippet into the new software would be.
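To make that point concrete, here is a toy illustration (an assumption for the sake of argument, not Copilot’s actual decoding code) of what “purely probabilistic” means: the model scores candidate tokens and samples one from the resulting distribution, with no notion of correctness anywhere in the loop:

```python
# A toy illustration (not Copilot's actual code) of probabilistic decoding:
# the model assigns a score to every candidate token and samples one.
# Nothing here "understands" the code; it is arithmetic over learned scores.
import torch

candidates = ["return", "yield", "pass", "raise"]  # hypothetical next tokens
logits = torch.tensor([2.1, 0.3, -0.5, -1.2])      # hypothetical model scores

probs = torch.softmax(logits, dim=0)               # ~[0.78, 0.13, 0.06, 0.03]
choice = torch.multinomial(probs, 1).item()
print(candidates[choice])  # usually "return", occasionally something else
```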
The Don’t Cares
Others think in a different way entirely: let’s set copyright aside.
Training AI models has proven to have many useful applications for humanity. Whether the training data is publicly available or protected by copyright laws… isn’t actually what matters. What matters is how we can – as a human race – build useful and good AI models that help us in our everyday lives.
Copilot does help the programmer in his/her everyday life.
One could argue, of course, that it is a commercial service that feasts on public free software (as in freedom) given away for free (as in free coffee). However, nothing prevents anyone else from doing the same thing at no charge: anyone can take the same public repositories and train a large model to suggest the next lines of code, just like Copilot does.
You could then offer the source code, data and model for free, however you like.
Just because they did it before you and put a price tag on it doesn’t mean they are wrong.
If anyone can train their AI model on any publicly accessible database, then that is a good thing that should be encouraged and supported, because it means everyone has access to the same opportunities to unlock the next step in technology. Training AI models on all kinds of data – by anyone – is crucial for the advancement of the human race.
Preventing GitHub from doing it will not help the free software community or the general technological momentum. Instead, it would just slow things down for a bit while workarounds get created.
That’s why we believe that, regardless of whether US courts deem it fair use, it is ethically acceptable to use data that is publicly available to everyone in order to train a computational model that provides a service to users, whether for free or for profit. Since this data is normally accessible to the everyday end user, there should be nothing preventing an AI or a bot from accessing it as well.
As for crediting the original authors of the suggested code snippets: Copilot currently – as claimed – only suggests a few lines of code and doesn’t directly copy and paste from people’s repositories (variable and method names, etc. might be changed). GitHub said they are working on pushing that 0.1% rate of “verbatim code” even lower.
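Measuring that verbatim rate is itself a technical problem. One simple approach, sketched below under our own assumptions (GitHub hasn’t published its exact method), is to flag a suggestion when it shares a long run of exact tokens with anything in the training corpus:

```python
# A simple sketch (our assumption, not GitHub's published method) of flagging
# "verbatim" suggestions: check whether a suggestion shares a long exact
# n-gram of tokens with any document in the training corpus.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_verbatim(suggestion: str, corpus: list[str], n: int = 10) -> bool:
    """Flag the suggestion if any n consecutive tokens appear in the corpus."""
    suggestion_grams = ngrams(suggestion.split(), n)
    return any(suggestion_grams & ngrams(doc.split(), n) for doc in corpus)

# Hypothetical usage with a one-document corpus:
corpus = ["def quicksort(arr): return arr if len(arr) < 2 else ..."]
print(is_verbatim("def quicksort(arr): return arr", corpus, n=4))  # True
```

A real system would need smarter tokenization and an index over billions of lines, but the underlying idea of matching long exact n-grams is the same.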
Conclusion
The topic is, of course, open to debate, and the debate will not end any time soon.
Currently, Copilot is still in an early technical preview phase and hasn’t reached a stable release yet. That’s why very few people have had the chance so far to get their hands on it and see what results it produces in real-world scenarios. Until the public release is available, expect to see IANAL tags in many places on the Internet.
Feel free to leave your two cents in the comments section below.
Comments
However, if Copilot were able to scrape protocols, function signatures, etc., spin off a version of itself that didn’t ingest any inappropriately-licensed software, and implement those protocols based on the results passed in… it could, in theory, rewrite GPL code as MIT code. In fact, the functions themselves could be mini neural networks.
Thank you for highlighting this issue. If I may speak as a layperson in programming and law, the question of whether or not to take legal action depends in great part on the very specific question(s) the judge will seek to answer. As far as I know, the legal status of AI-created work is poorly defined globally. Perhaps the best course of action would be to first prosecute specific cases where Copilot code has demonstrably reused open source code.
I think the argument that everybody could use public open source projects to train an AI is not entirely true. I highly doubt that just anybody could “scrape”/clone all the relevant GitHub repositories without getting blocked or locked out. So the odds for the big company versus smaller startups and developer groups are (even setting aside available resources) not really the same.
Well, I have several takes on it.
1. An AI itself cannot hold a license. There was a lawsuit somewhere about AI-generated images and the question of who holds the copyright to them. The answer was: nobody.
2. Public does not mean gratis. Even fair use has its limits, and since this is done in a commercial context, many licenses (not just the GPL) require at least attribution when you build something “based” on that code.
When Copilot uses bits of code from GPL sources, it ‘knows’ which sources it is using (or can be ‘directed’ to remember), so why not reprogram it to automatically insert the required attributions? This would be fair to the authors of the GPL source code and would meet the terms of the GPL license.
My2Cents,
Ernie
No, you would be required to license your new software under the GPL as well, not just credit the authors of the code, which does not work well for most developers in the world.
The purists should also look at this a different way. How much are they donating to GitHub to keep it, a for-profit business, afloat and… because it is a business… profitable? GitHub offers the free use of its services for open projects. I doubt that the ads it gets to serve do much to offset those costs, particularly since most open source purists are also using ad blockers. GitHub has found a way to make purists pay their own way with their own efforts, though not even by directly using their work. My understanding is that they…
I see little difference between me reading a bunch of code and learning from it, and an AI doing the same thing. Oh right, everyone wants to get paid for the open source they publish openly and publicly on the internet.