
GitHub unveils its Licenses API

By Nathan Willis
March 11, 2015

Since opening its doors in 2008, GitHub has grown to become the largest active project-hosting service for open-source software. But it has also attracted a fair share of criticism for some of its implementation choices—with one of the leading complaints being that it takes a lax approach to software licensing. That, in turn, leads to a glut of repositories bearing little or no licensing details. The company recently announced a new tool to help combat the license-confusion issue: a site-wide API for querying and reporting license information. Whether that API is up to the task, however, remains to be seen.

None of the above

By way of background, GitHub does not require users to choose a license when setting up a new project. An existing project can also be forked into a new repository with one click, but nothing subsequently prevents the new repository's owner from changing or removing the upstream license information (if it exists).

From a legal standpoint, of course, the fork inherits its license from upstream automatically (unless the upstream project is public domain or under some other less-common license). But from a practical standpoint, this provenance is difficult to trace. Throw in other GitHub users submitting pull requests for patches that have no license information, and one has a recipe for confusion.

The bigger problem, however, is that the majority of GitHub repositories carry no license information at all, because the users who own them have not chosen to add such information. In 2013, GitHub introduced its first tool designed to combat that issue, launching ChooseALicense.com, a web site that explains the features and differences of popular FOSS licenses.

ChooseALicense.com allows GitHub users to select a license, and the GitHub new-project-configuration page has a license selector, but using it is not obligatory. In fact, the ChooseALicense.com home page includes the following as its last option:

I don’t want to choose a license.

You don’t have to.

That "no license" link, incidentally, attempts to explain the downside of selecting no license—most notably, it strongly discourages other developers (both FOSS and proprietary) from using or redistributing the code in any fashion, for fear of getting entangled in a copyright problem. But the page also points out that the GitHub terms of service dictate that other users have the right to view and fork any GitHub repository.

A new interface

One could probably quibble endlessly over the details of ChooseALicense.com and its wording. The upshot, though, is that it did not have a serious impact on the license-confusion problem. A March 9 post on the GitHub blog presented some startling statistics: that fewer than 20% of GitHub repositories have a license, and that the percentage is declining. The introduction of the license-selection tool in 2013 produced a spike in licensed repositories, followed by a downward trend that continues to the present. The post also included some statistics on license popularity; the three licenses featured most prominently on the license-chooser site (MIT, Apache, and GPLv2) are, unsurprisingly, the most often selected.

This data set, however, is far from complete; as the post explains, the team only logged licenses that were found in a file named LICENSE, and only matched that file's contents against a short set of known licenses. Nevertheless, GitHub did evidently determine that the problem was real enough to warrant a new attempt at a solution.

The team's answer is a new site-wide API called, fittingly, the Licenses API. It is currently in preview, which means that interested developers must supply a special HTTP header with any requests in order to access it.

But the API is, at least currently, a frustratingly limited one. It offers just three functions:

  • GET /licenses returns a JSON-formatted list of all of the licenses tracked by the site.
  • GET /licenses/licensename returns the license text and associated metadata for licensename.
  • GET /repos/username/reponame returns any licensing information for username's reponame repository (along with other repository information).
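As a rough illustration, the three calls listed above could be built with Python's standard library along these lines. This is only a sketch: the helper names (build_request, repo_request, and so on) are hypothetical, and the preview media type is an assumption borrowed from the curl example posted in the comments, since the article itself says only that a special HTTP header is required.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"
# Assumption: the preview media type seen in the curl example in the comments.
PREVIEW_ACCEPT = "application/vnd.github.drax-preview+json"


def build_request(path):
    """Return a GET request for `path` that carries the preview Accept header."""
    return urllib.request.Request(
        API_ROOT + path,
        headers={"Accept": PREVIEW_ACCEPT},
        method="GET",
    )


def list_licenses_request():
    """GET /licenses: the list of licenses tracked by the site."""
    return build_request("/licenses")


def license_request(license_key):
    """GET /licenses/<key>: text and metadata for one license."""
    return build_request("/licenses/" + license_key)


def repo_request(username, reponame):
    """GET /repos/<user>/<repo>: repository details, including any license."""
    return build_request("/repos/{}/{}".format(username, reponame))


def fetch_json(request):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Passing repo_request("goj", "coreutils") to fetch_json() would, network permitting, return the repository metadata, with any detected license reported under the "license" key.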

Arguably the biggest limitation is that, as was the case with the statistics gathered for the blog post, the license of a repository is determined only by examining the contents of a LICENSE file. On the plus side, the license information returned by the API conforms to the Software Package Data Exchange (SPDX) specification, which should make it easy to integrate with existing software.

To be sure, determining and counting licenses is not a simple matter—as many in the community know. In 2013, for example, a pair of presentations at the Free Software Legal and Licensing Workshop explored several strategies for tabulating statistics on FOSS license usage. Both presentations ended with caveats about the difficulty of the problem—whatever methodology is used to approach it.

Nevertheless, the GitHub Licenses API does appear to be strangely naive in its approach. For example, it is well-established that a significant number of projects place their license in a file named COPYING, rather than LICENSE, because that has long been the convention used by the GNU project. Even scanning for that filename (or other obvious candidates, like GPL.txt) would enhance the quality of the data available significantly. Far better would be allowing the repository owner to designate what file contains the license.
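To make the point concrete, scanning for a handful of conventional filenames takes very little code. The following is a minimal sketch, not a description of how GitHub's detection works; the pattern list is an assumption built from the names mentioned above (LICENSE, COPYING, GPL.txt) plus the British spelling, and find_license_files is a hypothetical helper.

```python
import fnmatch

# Assumed candidate patterns; a real survey would need a longer list.
LICENSE_PATTERNS = ["LICENSE*", "LICENCE*", "COPYING*", "GPL*.txt"]


def find_license_files(filenames, patterns=LICENSE_PATTERNS):
    """Return the names that look like license files, ignoring case."""
    matches = []
    for name in filenames:
        upper = name.upper()
        # Compare upper-cased name and pattern so "gpl.txt" matches "GPL*.txt".
        if any(fnmatch.fnmatchcase(upper, p.upper()) for p in patterns):
            matches.append(name)
    return matches
```

Matching on the filename alone says nothing about which license the file contains; the contents would still need to be compared against known license texts.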

Furthermore, the Licenses API could be used to accumulate more meaningful statistics, such as which forks include different license information than their corresponding upstream repository, but there is no indication yet that GitHub intends to pursue such a survey. It may fall on volunteers in the community to undertake that sort of work. There are, after all, multiple source-code auditing tools that are compatible with SPDX and can be used to audit license information and compliance. Regrettably, the GitHub Licenses API does not look like it will lighten that workload significantly, since the information it returns is so restricted in scope.

Power to choose

GitHub is right to be concerned about the paucity of license information in the repositories hosted at its site. But both the 2013 license chooser and the new Licenses API seem to stem from an assumption on GitHub's part that the reason so many repositories lack licenses is that license selection is either confusing or difficult to find information on. Neither effort strikes at the heart of the problem: that GitHub makes license selection optional and, thus, makes licensing an afterthought.

SourceForge has long required new projects to select a license while performing the initial project setup. Later, when Google Code supplanted SourceForge as the hosting service of choice, it, too, required the user to select a license during the first step. So too do Launchpad.net, GNU Savannah, and BerliOS. FedoraHosted and Debian's Alioth both involve manually requesting access to create a new project, a process that, presumably, involves discussing whether or not the project will be released under a license compatible with that distribution.

It is hard to escape the fact that only GitHub and its direct competitors (like Gitorious and GitLab) fail to raise the licensing question during project setup, and equally hard to avoid the conclusion that this is why they are littered with so many non-licensed and mis-licensed repositories. An API for querying licenses may be a positive step, but it is not likely to resolve the problem, since it side-steps the underlying issue.

Hopefully, the current form of the Licenses API is merely the beginning, and GitHub will proceed to develop it into a truly useful tool. There is certainly a need for one, and being the most active project-hosting provider means that GitHub is best positioned to do something about it.



GitHub unveils its Licenses API

Posted Mar 11, 2015 20:30 UTC (Wed) by w00t (guest, #71210) [Link]

I'm still not sure just how accurate their statistics are.

Let me first state that I think they are right: there's a lot of people (and projects) that don't get licensing "correct", even to the point of not licensing their code.

On the flipside of the coin, they acknowledge that they only look for the presence of a LICENSE file. They don't scan code headers (which would catch a large amount of code), and they presumably also run into the problem of not accounting for multiple licenses in a codebase correctly.

I know that I'm personally very lazy when it comes to LICENSE files. I often forget to add them. The code headers, however, are always intact. I've seen others do this quite often too.

Perhaps "git clone" should produce a warning?

Posted Mar 11, 2015 20:51 UTC (Wed) by david.a.wheeler (subscriber, #72896) [Link]

It might be useful if "git clone" reported a warning if the clone didn't have a COPYING*, LICENSE*, or similar file. This won't change people who like to put their users at legal risk, but it would help those who simply forget.

Perhaps "git clone" should produce a warning?

Posted Mar 11, 2015 21:26 UTC (Wed) by droundy (subscriber, #4559) [Link]

At first I thought git was the wrong place for this. I use git for all sorts of projects that are not software and will never be public, and it would seem annoying to have to either add a nonsense file to each repository or change the git defaults or see a warning message.

On the other hand, I also only very rarely clone my own repositories, so I suppose it wouldn't really bother me to have a rarely-seen polite warning that I am cloning an unlicensed repository. And actually, this suggestion puts the warning in the right place: users who download the software should be aware that by using it they could put themselves at legal risk.

Perhaps "git clone" should produce a warning?

Posted Mar 12, 2015 1:04 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Using gitattributes for this would be nice. Something like:

> LICENCE license=BSD3
> LICENSE.third-party license=MIT:WTFPL
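
A tool could read such annotations with a few lines of code. The sketch below assumes the hypothetical "license" attribute proposed above (git defines no such attribute itself) and further assumes that a colon separates multiple licenses. It handles only the simple "pattern attr=value" form and ignores quoting and pattern precedence; parse_license_attributes is an illustrative name.

```python
def parse_license_attributes(text):
    """Map each pattern to the licenses named by its `license` attribute."""
    result = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments, as in a .gitattributes file.
        if not line or line.startswith("#"):
            continue
        pattern, *attributes = line.split()
        for attribute in attributes:
            name, _, value = attribute.partition("=")
            if name == "license" and value:
                # Assumption: "MIT:WTFPL" lists two licenses.
                result[pattern] = value.split(":")
    return result
```

In a real checkout, "git check-attr" would be the more robust way to resolve which attributes apply to a given path.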

Perhaps "git clone" should produce a warning?

Posted Mar 12, 2015 6:21 UTC (Thu) by alison (subscriber, #63752) [Link]

Good idea.

Perhaps "git clone" should produce a warning?

Posted Mar 12, 2015 7:16 UTC (Thu) by tzafrir (subscriber, #11501) [Link]

Assuming you actually have a single license for the whole code in the repository.

Perhaps "git clone" should produce a warning?

Posted Mar 12, 2015 11:36 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

I see no problem setting it on other files as well. Maybe "license-file" for the actual license files themselves?

GitHub unveils its Licenses API

Posted Mar 11, 2015 23:14 UTC (Wed) by jefftaylor (guest, #95911) [Link]

How many of these repositories are "real"? I've got several GitHub repositories that don't correspond to anything that anyone would ever want to use (e.g., English homework). They don't have a licence. Yet they're a part of the 80% of unlicensed projects.

Just how many "foobar", "homework", and "test_git" repositories are there out there?

GitHub unveils its Licenses API

Posted Mar 15, 2015 21:51 UTC (Sun) by zonker (subscriber, #7867) [Link]

Lots. No doubt there's tons of stuff on GitHub that is probably of interest to only one user - and maybe not even that user for a very long period of time.

Still, I've seen a shocking number of repositories that are something that others do want to use (and even have been promoted by the creators) and have no licensing information.

What would be very interesting would be to see how many repositories lack a license *and* have been cloned/forked by other users. It might really get folks' attention if, say, 15% of repositories without a license have been forked at least once.
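
Given repository metadata already fetched from the API, that figure is simple to compute. The following is a sketch under stated assumptions: each record is a dictionary shaped like the repository JSON, with a "license" entry that is empty for unlicensed repositories and a "forks_count" field; the function name is hypothetical, and no real data is implied.

```python
def unlicensed_fork_share(repos):
    """Return the fraction of unlicensed repositories forked at least once."""
    # Assumption: a missing or null "license" entry means no license was detected.
    unlicensed = [r for r in repos if not r.get("license")]
    if not unlicensed:
        return 0.0
    forked = [r for r in unlicensed if r.get("forks_count", 0) >= 1]
    return len(forked) / len(unlicensed)
```

Note that this counts forks, not clones; clone counts are not part of the repository records assumed here.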

GitHub unveils its Licenses API

Posted Mar 12, 2015 0:14 UTC (Thu) by lambda (subscriber, #40735) [Link]

> Nevertheless, the GitHub Licenses API does appear to be strangely naive in its approach. For example, it is well-established that a significant number of projects place their license in a file named COPYING, rather than LICENSE, because that has long been the convention used by the GNU project. Even scanning for that filename (or other obvious candidates, like GPL.txt) would enhance the quality of the data available significantly. Far better would be allowing the repository owner to designate what file contains the license.

It looks like it isn't quite as naive as you may believe from its description. For example, I tried it out on some random fork of GNU coreutils which contains only the standard GNU COPYING file, and it returned the correct metadata for the GPLv3:

$ curl 'https://api.github.com/repos/goj/coreutils' -H 'Accept: application/vnd.github.drax-preview+json'
{
  "id": 3237260,
  "name": "coreutils",
  "full_name": "goj/coreutils",
  // ... snip ...
  "license": {
    "key": "gpl-3.0",
    "name": "GNU General Public License v3.0",
    "url": "https://api.github.com/licenses/gpl-3.0"
  },
  "network_count": 18,
  "subscribers_count": 7
}

So, it is at least looking at files named COPYING as well as LICENSE. And here's another that has a file named LICENSE-2.0.txt which it correctly reports as Apache licensed:

$ curl 'https://api.github.com/repos/SmartBear/ready-api-plugins' -H 'Accept: application/vnd.github.drax-preview+json'
{
  "id": 25704802,
  "name": "ready-api-plugins",
  "full_name": "SmartBear/ready-api-plugins",
  // ... snip ...
  "license": {
    "key": "apache-2.0",
    "name": "Apache License 2.0",
    "url": "https://api.github.com/licenses/apache-2.0"
  },
  // ... snip ...
}

It does not, however, pick up on Documentation/GPL.txt in one repository I tried, or on a plain gpl.txt in another.
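
For scripting such checks, the license key can be pulled out of a response with a few lines of Python. This is a minimal sketch that assumes the response has the shape shown in the output above, with "license" either missing, null, or an object containing "key"; license_key is a hypothetical helper name.

```python
import json


def license_key(repo_json_text):
    """Return the detected license key, or None if no license is reported."""
    repo = json.loads(repo_json_text)
    license_info = repo.get("license")
    if not license_info:
        return None
    return license_info.get("key")
```

The "// ... snip ..." markers in the output above are not valid JSON, so only an untrimmed response body can be passed to this function.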

GitHub unveils its Licenses API

Posted Mar 12, 2015 11:03 UTC (Thu) by yosch (guest, #4675) [Link]

Seems like only a fraction of the SPDX spec is actually supported.

GitHub unveils its Licenses API

Posted Mar 12, 2015 4:33 UTC (Thu) by cmbang (guest, #101355) [Link]

Would be great to see a big, prominent "YOUR PROJECT DOES NOT INCLUDE A LICENSE" at top of the page on any repos lacking proper documentation. Likewise, a big obtrusive header on any forked repository lacking such docs would also be useful.

This lack of requirement is likely what has caused GitHub to succeed, as most users don't understand or care about licensing. Definitely more could be done to educate here.

GitHub unveils its Licenses API

Posted Mar 12, 2015 6:24 UTC (Thu) by alison (subscriber, #63752) [Link]

> Would be great to see a big, prominent "YOUR PROJECT DOES NOT INCLUDE A LICENSE" at top of the page

I couldn't agree more. Until a friend pointed me towards this article, I didn't realize that *my* public Github repos had no license. Shame on me. Github certainly does bury the question. Thank you, Paul, I am closing that personal ticket now!

GitHub unveils its Licenses API

Posted Mar 12, 2015 16:14 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

I'm probably an offender here, too, but I just can't get past the most abstract intellectual concern.

GitHub unveils its Licenses API

Posted Mar 12, 2015 17:12 UTC (Thu) by flussence (subscriber, #85566) [Link]

I hope to see such badly machine-detected alerts start popping up on my repos; I made a conscious decision to use the Unlicense.

GitHub unveils its Licenses API

Posted Mar 19, 2015 10:22 UTC (Thu) by ssokolow (guest, #94568) [Link]

Were you aware that the Unlicense is considered unreliable and may scare off people who would otherwise use your code?

https://programmers.stackexchange.com/questions/147111/wh...

GitHub unveils its Licenses API

Posted Mar 19, 2015 20:15 UTC (Thu) by flussence (subscriber, #85566) [Link]

> Were you aware that the Unlicense is considered unreliable and may scare off people who would otherwise use your code?

Absolutely. I've made a conscious decision to alienate a whole bunch of people, just like the authors of every other software license in the universe have done. Most consumers of prewritten licenses do so unconsciously.

GitHub unveils its Licenses API

Posted Mar 12, 2015 22:07 UTC (Thu) by jond (subscriber, #37669) [Link]

Debian have done a lot of work on implementing a machine-readable copyright format that can describe some complex mixed-license schemes. I wonder if that work would be useful to this problem. http://dep.debian.net/deps/dep5/

GitHub unveils its Licenses API

Posted Mar 13, 2015 0:05 UTC (Fri) by ringerc (guest, #3071) [Link]

A repository doesn't necessarily have a single license. Subtrees may be licensed under different terms.

For that reason it's always going to be necessary to let people choose "no license" or provide a "multiple licenses" option.

Using gitattributes to annotate this seems appealing, but suffers from a few issues. For one, the information is lost if a revision is exported as a tarball / zip. Additionally it can lead to the license metadata getting out of sync with the real license if the user changes the license in a subtree.

Relying on a LICENSE (or COPYING or whatever) file is problematic too, because some repositories can't have arbitrary files lying around at the top level.

I'm not convinced a comprehensive technical solution to this is possible. The main thing to do, IMO, is encourage people to think about the license and provide tools (like they have) to make adding license info easier.

GitHub unveils its Licenses API

Posted Mar 13, 2015 0:58 UTC (Fri) by Limdi (guest, #100500) [Link]

> Relying on a LICENSE (or COPYING or whatever) file is problematic too, because some repositories can't have arbitrary files lying around at the top level.

What kind of repositories cannot? Assuming "repositories" is referring to git repositories.

GitHub unveils its Licenses API

Posted Mar 13, 2015 13:30 UTC (Fri) by jond (subscriber, #37669) [Link]

I read that to mean some build systems can't handle such things.

GitHub unveils its Licenses API

Posted Mar 13, 2015 16:33 UTC (Fri) by cesarb (subscriber, #6266) [Link]

> Using gitattributes to annotate this seems appealing, but suffers from a few issues. For one, the information is lost if a revision is exported as a tarball / zip.

Aren't git attributes stored in a plain text file called .gitattributes within the repository itself? Exporting a revision would unavoidably export that file together with everything else.

GitHub unveils its Licenses API

Posted Mar 14, 2015 1:25 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

The file is usually excluded when making the archive, but not always. I try to make such things closer to the files affected personally, but some people like using top-level files for it.

GitHub unveils its Licenses API

Posted Mar 14, 2015 18:57 UTC (Sat) by xtifr (subscriber, #143) [Link]

> A repository doesn't necessarily have a single license. Subtrees may be licensed under different terms.

Indeed, I have one project like that—a GPL'd app that has an MIT'd library (eventually to be split off)—and the API currently reports just the GPL. So it definitely doesn't handle this case.

License combos

Posted Mar 17, 2015 13:39 UTC (Tue) by david.a.wheeler (subscriber, #72896) [Link]

There are many ways to record combination licenses. Fedora and SPDX both support "and", e.g., "GPLv2+ and MIT".

The real problem is software that has NO license stated. Those are legal traps for naive users.

GitHub unveils its Licenses API

Posted Mar 19, 2015 15:51 UTC (Thu) by donbarry (guest, #10485) [Link]

This seems like a good time to remind free/libre software advocates that Github is itself a proprietary tool and thus a trap. Fortunately, unlike Bitkeeper, the protocol and the underlying DVCS remain free. This gives options, but most remain problematic. Gitlab offers the disingenuous "use our proprietary site for which we distribute a limited copy of the code under a libre license as advertising" model. And let us not forget that they bought and shuttered Gitorious, which developed its web system under a fully libre model.

The only truly free option right now seems to be Kallithea, thanks to the foresight and critical investment of the Software Freedom Law Center. It also began life as a libre project, accepting code contributions, and was then taken proprietary by its founders -- contributed code included. The Kallithea fork restores to the community something worthy of their contribution. It deserves your efforts at improvement and your loyalty.

Is it any surprise that a company based on proprietary software might put off concern about licenses until late in the game? It is encouraging the worst habits and intellectual sloth among its users.

GitHub unveils its Licenses API

Posted Mar 20, 2015 11:11 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

FTR, Git*Lab* bought Gitorious, not GitHub. GitLab is FOSS. I will see how they are with patches next month.

GitHub unveils its Licenses API

Posted Mar 20, 2015 19:31 UTC (Fri) by donbarry (guest, #10485) [Link]

I apologize if the particular antecedent reference was unclear: I tried to indicate Gitlab was the purchaser of Gitorious, but I see how you could have read it as Github.

But Gitlab is one of those "freemium" offerings: what you download is not the codebase used to serve the site. Such sites rarely accept patches from others which add the missing functionality, and that ethos of excluding development which permits the free codebase to develop the features needed by all users -- including the most demanding -- is diametrically opposed to the principles of free software.

Yes, you can fork -- but it can be very difficult, particularly when the original commercial team has the money and resources to keep the free offering "just good enough" to keep the attention primarily on it.

GitHub unveils its Licenses API

Posted Mar 21, 2015 13:33 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

I'm well aware of the pitfalls around open core projects. However, I tend to offer the benefit of the doubt in most cases where development is more in the open (in contrast to "over the wall" open core projects). Can you point to patches rejected (or ignored) because they implement Enterprise's features? Even so, the code is permissively licensed (only an ICLA, no assignment), so if the patches exist, there is no reason to not grab them and apply them to your local install if you need them.

GitHub unveils its Licenses API

Posted Mar 30, 2015 12:28 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

They should just make the licencing information metadata about the repository, and not require actual files in the repository, which may not work for a publish-only copy, for example.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds