Since opening its doors in 2008, GitHub has grown to become the largest active project-hosting service for open-source software. But it has also attracted a fair share of criticism for some of its implementation choices—with one of the leading complaints being that it takes a lax approach to software licensing. That, in turn, leads to a glut of repositories bearing little or no licensing details. The company recently announced a new tool to help combat the license-confusion issue: a site-wide API for querying and reporting license information. Whether that API is up to the task, however, remains to be seen.
By way of background information, GitHub does not require users to choose a license when setting up a new project. An existing project can also be forked into a new repository with one click, but nothing subsequently prevents the new repository's owner from changing or removing the upstream license information (if it exists).
From a legal standpoint, of course, the fork inherits its license from upstream automatically (unless the upstream project is public domain or under some other less-common license). But from a practical standpoint, this provenance is difficult to trace. Throw in other GitHub users submitting pull requests for patches that have no license information, and one has a recipe for confusion.
The bigger problem, however, is that the majority of GitHub repositories carry no license information at all, because the users who own them have not chosen to add such information. In 2013, GitHub introduced its first tool designed to combat that issue, launching ChooseALicense.com, a web site that explains the features and differences of popular FOSS licenses.
ChooseALicense.com allows GitHub users to select a license, and the GitHub new-project-configuration page has a license selector, but using it is not obligatory. In fact, the ChooseALicense.com home page includes the following as its last option:
That "no license" link, incidentally, attempts to explain the downside of selecting no license—most notably, it strongly discourages other developers (both FOSS and proprietary) from using or redistributing the code in any fashion, for fear of getting entangled in a copyright problem. But the page also points out that the GitHub terms of service dictate that other users have the right to view and fork any GitHub repository.
One could probably quibble endlessly over the details of ChooseALicense.com and its wording. The upshot, though, is that it did not have a serious impact on the license-confusion problem. A March 9 post on the GitHub blog presented some startling statistics: fewer than 20% of GitHub repositories have a license, and that percentage is declining. The introduction of the license-selection tool in 2013 produced a spike in licensed repositories, followed by a downward trend that continues to the present. The post also included some statistics on license popularity; the three licenses featured most prominently on the license-chooser site (MIT, Apache, and GPLv2) are, unsurprisingly, the most often selected.
This data set, however, is far from complete; as the post explains, the team only logged licenses that were found in a file named LICENSE, and only matched that file's contents against a short set of known licenses. Nevertheless, GitHub did evidently determine that the problem was real enough to warrant a new attempt at a solution.
The team's answer is a new site-wide API called, fittingly, the Licenses API. It is currently in preview, which means that interested developers must supply a special HTTP header with any requests in order to access it.
But the API is, at least currently, a frustratingly limited one. It offers just three functions:
Arguably the biggest limitation is that, as was the case with the statistics gathered for the blog post, the license of a repository is determined only by examining the contents of a LICENSE file. On the plus side, the license information returned by the API conforms to the Software Package Data Exchange (SPDX) specification, which should make it easy to integrate with existing software.
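As a rough illustration of the preview-header requirement and the license object in a repository response, here is a minimal Python sketch. The media type is the one used in a reader's curl example in the comments on this article; whether it remains valid is an assumption. The helper names `build_repo_request` and `extract_license` are hypothetical, and the sketch only builds the request rather than sending it.

```python
import json
import urllib.request

# Preview media type copied from the curl example in the reader comments;
# treating it as still valid is an assumption.
PREVIEW_ACCEPT = "application/vnd.github.drax-preview+json"


def build_repo_request(owner, repo):
    """Build (but do not send) a repository request carrying the preview header."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    return urllib.request.Request(url, headers={"Accept": PREVIEW_ACCEPT})


def extract_license(payload_text):
    """Return (key, name) from a repository response, or None if no license is reported."""
    payload = json.loads(payload_text)
    license_info = payload.get("license")
    if not license_info:
        return None
    return license_info.get("key"), license_info.get("name")


# Abbreviated sample shaped like the response quoted in the comments.
sample = (
    '{"name": "coreutils", "license": '
    '{"key": "gpl-3.0", "name": "GNU General Public License v3.0"}}'
)
print(extract_license(sample))
```

Passing the request to `urllib.request.urlopen` would perform the actual query; that step is left out so the sketch runs without network access.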
To be sure, determining and counting licenses is not a simple matter—as many in the community know. In 2013, for example, a pair of presentations at the Free Software Legal and Licensing Workshop explored several strategies for tabulating statistics on FOSS license usage. Both presentations ended with caveats about the difficulty of the problem—whatever methodology is used to approach it.
Nevertheless, the GitHub Licenses API does appear to be strangely naive in its approach. For example, it is well-established that a significant number of projects place their license in a file named COPYING, rather than LICENSE, because that has long been the convention used by the GNU project. Even scanning for that filename (or other obvious candidates, like GPL.txt) would enhance the quality of the data available significantly. Far better would be allowing the repository owner to designate what file contains the license.
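A broader filename scan of the kind suggested above might look something like the following sketch. The candidate names (LICENSE, COPYING, GPL.txt) come from the text; the regular expression, the handling of extensions, and the function name `find_license_files` are assumptions made for illustration, not a description of what GitHub does.

```python
import re

# Hypothetical pattern: LICENSE/LICENCE, COPYING, or GPL, optionally followed
# by a suffix such as ".txt" or "-2.0.txt". Matching is case-insensitive.
CANDIDATE_RE = re.compile(r"^(licen[sc]e|copying|gpl)([.-][\w.-]+)?$", re.IGNORECASE)


def find_license_files(paths):
    """Return top-level paths whose file name looks like a license file."""
    return [p for p in paths if "/" not in p and CANDIDATE_RE.match(p)]


tree = ["README", "COPYING", "src/LICENSE", "gpl.txt", "LICENSE-2.0.txt", "main.c"]
print(find_license_files(tree))
```

The sketch deliberately ignores files in subdirectories; whether those should count is a policy question rather than a technical one.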
Furthermore, the Licenses API could be used to accumulate more meaningful statistics, such as which forks include different license information than their corresponding upstream repository, but there is no indication yet that GitHub intends to pursue such a survey. It may fall on volunteers in the community to undertake that sort of work. There are, after all, multiple source-code auditing tools that are compatible with SPDX and can be used to audit license information and compliance. Regrettably, the GitHub Licenses API does not look like it will lighten that workload significantly, since the information it returns is so restricted in scope.
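The fork-versus-upstream comparison mentioned above could be sketched as follows, assuming one already has the repository payloads for a fork and its upstream in hand. The function names are hypothetical, and the payloads are reduced to the `license` object seen in the API responses quoted in the comments.

```python
def license_key(repo_payload):
    """Return the license key from a repository payload, or None when absent."""
    info = repo_payload.get("license") or {}
    return info.get("key")


def license_diverges(fork_payload, upstream_payload):
    """True when fork and upstream report different license keys.

    A missing license on one side only also counts as a divergence.
    """
    return license_key(fork_payload) != license_key(upstream_payload)


# Illustrative payloads, not real API output.
upstream = {"full_name": "example/project", "license": {"key": "gpl-3.0"}}
stripped_fork = {"full_name": "someone/project", "license": None}
faithful_fork = {"full_name": "other/project", "license": {"key": "gpl-3.0"}}

print(license_diverges(stripped_fork, upstream))
print(license_diverges(faithful_fork, upstream))
```

Since the API reports at most one license per repository, a check like this can only flag candidates for manual review; it cannot establish what license actually applies.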
GitHub is right to be concerned about the paucity of license information in the repositories hosted at its site. But both the 2013 license chooser and the new Licenses API seem to stem from an assumption on GitHub's part that the reason so many repositories lack licenses is that license selection is either confusing or difficult to find information on. Neither effort strikes at the heart of the problem: that GitHub makes license selection optional and, thus, makes licensing an afterthought.
SourceForge has long required new projects to select a license while performing the initial project setup. Later, when Google Code supplanted SourceForge as the hosting service of choice, it, too, required the user to select a license during the first step. So too do Launchpad.net, GNU Savannah, and BerliOS. FedoraHosted and Debian's Alioth both involve manually requesting access to create a new project, a process that, presumably, involves discussing whether or not the project will be released under a license compatible with that distribution.
It is hard to escape the fact that only GitHub and its direct competitors (like Gitorious and GitLab) fail to raise the licensing question during project setup, and equally hard to avoid the conclusion that this is why they are littered with so many non-licensed and mis-licensed repositories. An API for querying licenses may be a positive step, but it is not likely to resolve the problem, since it side-steps the underlying issue.
Hopefully, the current form of the Licenses API is merely the beginning, and GitHub will proceed to develop it into a truly useful tool. There is certainly a need for one, and being the most active project-hosting provider means that GitHub is best positioned to do something about it.
Posted Mar 11, 2015 20:30 UTC (Wed) by w00t (guest, #71210) [Link]
Let me first state that I think they are right: there's a lot of people (and projects) that don't get licensing "correct", even to the point of not licensing their code.
On the flipside of the coin, they acknowledge that they only look for the presence of a LICENSE file. They don't scan code headers (which would cover a large amount of code), and presumably also run into the problem of not accounting for multiple licenses in a codebase correctly.
I know that I'm personally very lazy when it comes to LICENSE files. I often forget to add them. The code headers, however, are always intact. I've seen others do this quite often too.
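For what it's worth, a header scan of the kind described here can be sketched in a few lines of Python. The phrases and labels below are illustrative assumptions, far from a complete license matcher, and `detect_header_license` is a hypothetical name. It checks for an `SPDX-License-Identifier:` tag first and falls back to a few well-known phrases.

```python
import re

# Tag of the form "SPDX-License-Identifier: <id>" in a comment header.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([\w.+-]+)")

# Illustrative phrase-to-label table; real headers vary widely.
PHRASES = [
    ("GNU General Public License", "GPL"),
    ("Apache License, Version 2.0", "Apache-2.0"),
    ("Permission is hereby granted, free of charge", "MIT-style"),
]


def detect_header_license(source_text, max_lines=30):
    """Guess a license from the first max_lines lines of a source file, or return None."""
    header = "\n".join(source_text.splitlines()[:max_lines])
    match = SPDX_TAG.search(header)
    if match:
        return match.group(1)
    for phrase, label in PHRASES:
        if phrase in header:
            return label
    return None


sample = (
    "/*\n"
    " * This program is free software; you can redistribute it under\n"
    " * the terms of the GNU General Public License as published by\n"
    " * the Free Software Foundation.\n"
    " */\n"
    "int main(void) { return 0; }\n"
)
print(detect_header_license(sample))
```

One known weakness: a phrase that is wrapped across two comment lines will not match, so a real scanner would need to normalize whitespace and comment markers first.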
Posted Mar 11, 2015 20:51 UTC (Wed) by david.a.wheeler (subscriber, #72896) [Link]
Posted Mar 11, 2015 21:26 UTC (Wed) by droundy (subscriber, #4559) [Link]
On the other hand, I also only very rarely clone my own repositories, so I suppose it wouldn't really bother me to have a rarely-seen polite warning that I am cloning an unlicensed repository. And actually, this suggestion puts the warning in the right place: users who download the software should be aware that by using it they could put themselves at legal risk.
Posted Mar 12, 2015 1:04 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
> LICENCE license=BSD3
> LICENSE.third-party license=MIT:WTFPL
Posted Mar 12, 2015 6:21 UTC (Thu) by alison (subscriber, #63752) [Link]
Posted Mar 12, 2015 7:16 UTC (Thu) by tzafrir (subscriber, #11501) [Link]
Posted Mar 12, 2015 11:36 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
Posted Mar 11, 2015 23:14 UTC (Wed) by jefftaylor (guest, #95911) [Link]
Just how many "foobar", "homework", and "test_git" repositories are there out there?
Posted Mar 15, 2015 21:51 UTC (Sun) by zonker (subscriber, #7867) [Link]
Still, I've seen a shocking number of repositories that are something that others do want to use (and even have been promoted by the creators) and have no licensing information.
What would be very interesting would be to see how many repositories lack a license *and* have been cloned/forked by other users. It might really get folks' attention if, say, 15% of repositories without a license have been forked at least once.
Posted Mar 12, 2015 0:14 UTC (Thu) by lambda (subscriber, #40735) [Link]
> Nevertheless, the GitHub Licenses API does appear to be strangely naive in its approach. For example, it is well-established that a significant number of projects place their license in a file named COPYING, rather than LICENSE, because that has long been the convention used by the GNU project. Even scanning for that filename (or other obvious candidates, like GPL.txt) would enhance the quality of the data available significantly. Far better would be allowing the repository owner to designate what file contains the license.
It looks like it isn't quite as naive as you may believe from its description. For example, I tried it out on some random fork of GNU coreutils which contains only the standard GNU COPYING file, and it returned the correct metadata for the GPLv3:
$ curl 'https://api.github.com/repos/goj/coreutils' -H 'Accept: application/vnd.github.drax-preview+json'
{
  "id": 3237260,
  "name": "coreutils",
  "full_name": "goj/coreutils",
  // ... snip ...
  "license": {
    "key": "gpl-3.0",
    "name": "GNU General Public License v3.0",
    "url": "https://api.github.com/licenses/gpl-3.0"
  },
  "network_count": 18,
  "subscribers_count": 7
}

So, it is at least looking at files named COPYING as well as LICENSE. And here's another that has a file named LICENSE-2.0.txt which it correctly reports as Apache licensed:

$ curl 'https://api.github.com/repos/SmartBear/ready-api-plugins' -H 'Accept: application/vnd.github.drax-preview+json'
{
  "id": 25704802,
  "name": "ready-api-plugins",
  "full_name": "SmartBear/ready-api-plugins",
  // ... snip ...
  "license": {
    "key": "apache-2.0",
    "name": "Apache License 2.0",
    "url": "https://api.github.com/licenses/apache-2.0"
  },
  // ... snip ...
}

It does not, however, pick up on Documentation/GPL.txt in this repository, or just gpl.txt in this one.
Posted Mar 12, 2015 11:03 UTC (Thu) by yosch (guest, #4675) [Link]
Posted Mar 12, 2015 4:33 UTC (Thu) by cmbang (guest, #101355) [Link]
This lack of requirement is likely what has caused GitHub to succeed, as most users don't understand or care about licensing. Definitely more could be done to educate here.
Posted Mar 12, 2015 6:24 UTC (Thu) by alison (subscriber, #63752) [Link]
I couldn't agree more. Until a friend pointed me towards this article, I didn't realize that *my* public Github repos had no license. Shame on me. Github certainly does bury the question. Thank you, Paul, I am closing that personal ticket now!
Posted Mar 12, 2015 16:14 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]
Posted Mar 12, 2015 17:12 UTC (Thu) by flussence (subscriber, #85566) [Link]
Posted Mar 19, 2015 10:22 UTC (Thu) by ssokolow (guest, #94568) [Link]
https://programmers.stackexchange.com/questions/147111/wh...
Posted Mar 19, 2015 20:15 UTC (Thu) by flussence (subscriber, #85566) [Link]
Absolutely. I've made a conscious decision to alienate a whole bunch of people, just like the authors of every other software license in the universe have done. Most consumers of prewritten licenses do so unconsciously.
Posted Mar 12, 2015 22:07 UTC (Thu) by jond (subscriber, #37669) [Link]
Posted Mar 13, 2015 0:05 UTC (Fri) by ringerc (guest, #3071) [Link]
For that reason it's always going to be necessary to let people choose "no license" or provide a "multiple licenses" option.
Using gitattributes to annotate this seems appealing, but suffers from a few issues. For one, the information is lost if a revision is exported as a tarball / zip. Additionally it can lead to the license metadata getting out of sync with the real license if the user changes the license in a subtree.
Relying on a LICENSE (or COPYING or whatever) file is problematic too, because some repositories can't have arbitrary files lying around at the top level.
I'm not convinced a comprehensive technical solution to this is possible. The main thing to do, IMO, is encourage people to think about the license and provide tools (like they have) to make adding license info easier.
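To make the gitattributes idea discussed in this thread a bit more concrete, here is a small Python sketch that reads annotations in the style quoted earlier (`LICENSE.third-party license=MIT:WTFPL`). The `license` attribute and the colon-separated value are a convention proposed in the comments, not something git or GitHub defines, and `parse_license_attributes` is a hypothetical helper.

```python
def parse_license_attributes(text):
    """Map each path pattern to the list of licenses named by a 'license=' attribute.

    Lines follow the .gitattributes layout: a pattern followed by
    whitespace-separated attributes. Blank lines and '#' comments are skipped.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        for attr in attrs:
            name, sep, value = attr.partition("=")
            if name == "license" and sep:
                # Colon-separated values denote multiple licenses (assumed convention).
                result[pattern] = value.split(":")
    return result


attributes = (
    "# hypothetical license annotations\n"
    "LICENCE license=BSD3\n"
    "LICENSE.third-party license=MIT:WTFPL\n"
    "*.c text\n"
)
print(parse_license_attributes(attributes))
```

As noted above, nothing keeps such metadata in sync with the real license text, so a tool consuming it would still want to cross-check the files it points at.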
Posted Mar 13, 2015 0:58 UTC (Fri) by Limdi (guest, #100500) [Link]
What kind of repositories cannot? Assuming "repositories" is referring to git repositories.
Posted Mar 13, 2015 13:30 UTC (Fri) by jond (subscriber, #37669) [Link]
Posted Mar 13, 2015 16:33 UTC (Fri) by cesarb (subscriber, #6266) [Link]
Aren't git attributes stored in a plain text file called .gitattributes within the repository itself? Exporting a revision would unavoidably export that file together with everything else.
Posted Mar 14, 2015 1:25 UTC (Sat) by mathstuf (subscriber, #69389) [Link]
Posted Mar 14, 2015 18:57 UTC (Sat) by xtifr (subscriber, #143) [Link]
A repository doesn't necessarily have a single license. Subtrees may be licensed under different terms.
Indeed, I have one project like that—a GPL'd app that has an MIT'd library (eventually to be split off)—and the API currently reports just the GPL. So it definitely doesn't handle this case.
Posted Mar 17, 2015 13:39 UTC (Tue) by david.a.wheeler (subscriber, #72896) [Link]
The real problem is software that has NO license stated. Those are legal traps for naive users.
Posted Mar 19, 2015 15:51 UTC (Thu) by donbarry (guest, #10485) [Link]
The only truly free option right now seems to be Kallithea, thanks to the foresight and critical investment of the Software Freedom Law Center. It also began life as a libre project, accepting code contributions, and was then taken proprietary by its founders -- contributed code included. The Kallithea fork restores to the community something worthy of their contribution. It deserves your efforts at improvement and your loyalty.
Is it any surprise that a company based on proprietary software might put off concern about licenses until late in the game? It is encouraging the worst habits and intellectual sloth among its users.
Posted Mar 20, 2015 11:11 UTC (Fri) by mathstuf (subscriber, #69389) [Link]
Posted Mar 20, 2015 19:31 UTC (Fri) by donbarry (guest, #10485) [Link]
But Gitlab is one of those "freemium" offerings: what you download is not the codebase used to serve the site. Such sites rarely accept patches from others which add the missing functionality, and that ethos of excluding development which permits the free codebase to develop the features needed by all users -- including the most demanding -- is diametrically opposed to the principles of free software.
Yes, you can fork -- but it can be very difficult, particularly when the original commercial team has the money and resources to keep the free offering "just good enough" to keep the attention primarily on it.
Posted Mar 21, 2015 13:33 UTC (Sat) by mathstuf (subscriber, #69389) [Link]
Posted Mar 30, 2015 12:28 UTC (Mon) by mirabilos (subscriber, #84359) [Link]
Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds