Mailing lists vs Github

June 5, 2018

Most fledgling open source projects use Github or Gitlab to collaborate on code. However, there’s an older method that developers ought to know about because it offers some advantages.

Github vs Email

The alternative method is the developer mailing list. It arose in the late eighties to early nineties, and predates the popularity of the web browser. But far from being a mere historical curiosity, the discussion list is still the primary method of development in many important open source projects, from databases to operating systems to web browsers.

In this article I’ll carefully compare the use of mailing lists with code collaboration web sites such as Github. I’ll do my best to present the pros and cons of each, so that projects assessing the two can make an informed decision.

Briefly, how does each work?

Github hosts a git server, a bug tracker, a wiki, as well as release artifacts. Participation in discussions or code changes requires a Github identity. Code changes are proposed by making another Github-hosted project (a “fork”), modifying a remote branch, and using the GUI to open a pull request from your branch to the original. Project members debate the change in comments, with subsequent changes pushed as commits to the pull request branch. Ultimately a project owner can decide to merge the commits.

A mailing list maintains a list of subscribers who receive all messages sent to a “reflector” email address. A project must host its own version control server elsewhere. Code changes are proposed by sending an email to the reflector, with a textual diff of the changes either attached or inline. List members debate the change with email replies, sending subsequent changes as more diffs, either to replace or amend the original. Ultimately a project committer can decide to apply the patch(es) to the codebase.

Advantages of mailing lists

Precise, flexible communication

Mailing lists have three properties that allow developers to communicate with greater precision and flexibility than is possible in the Github UI: the threaded nature of emails, the user generation of patches, and the free mixing of discussion and patches.

Threading

The Github web UI has a limited model of threading, where there is one thread for the pull request as a whole, and other threads attached to single changed lines of code. The UI shows a few lines of code for context around the line chosen for discussion, but the full chunk of code under discussion must be inferred.

Email sub-threads allow specialized discussion about different aspects or sections of the code. A linear Github-style discussion would mix those conversations.
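
To make the sub-thread idea concrete, here is a minimal sketch of how a mail client might rebuild a thread tree from reply headers. The message tuples and subjects are invented for illustration, and real clients use more robust threading logic than this.

```python
from collections import defaultdict

# Hypothetical messages: (message_id, in_reply_to, subject)
messages = [
    ("<1@example.org>", None, "[PATCH] add frobnicate()"),
    ("<2@example.org>", "<1@example.org>", "Re: naming of frobnicate()"),
    ("<3@example.org>", "<1@example.org>", "Re: error handling"),
    ("<4@example.org>", "<2@example.org>", "Re: naming of frobnicate()"),
]

def build_thread(messages):
    """Group each message under the message it replies to."""
    children = defaultdict(list)
    for message_id, parent_id, subject in messages:
        children[parent_id].append((message_id, subject))
    return children

def render(children, parent_id=None, depth=0):
    """Return the thread as indented lines, one per message."""
    lines = []
    for message_id, subject in children.get(parent_id, []):
        lines.append("  " * depth + subject)
        lines.extend(render(children, message_id, depth + 1))
    return lines

thread_lines = render(build_thread(messages))
print("\n".join(thread_lines))
```

The naming discussion and the error handling discussion end up as separate branches under the same patch, which is the separation a linear comment stream can't give you.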

Conversations over email have a greater permanence and reference value than Github comments. On Github, comments continually change. They become “outdated” and disappear when attached to a line that has been changed. Same for the commits, which vanish after a force-push to the pull request branch. In an email thread, by contrast, the original messages and proposed changes remain for comparison with later messages and patches.

Furthermore, patches from multiple authors can’t mix in a Github pull request. The person opening the pull request “owns” it. Other participants can drop code suggestions as comments, but what gets merged by the UI is ultimately the commits on the branch. The pull requester must turn those comments into commits on the branch if he or she wants to incorporate the suggested changes.

On a mailing list, anyone is free to reply to an existing thread with a new proposed set of patches. Another nice effect is that other people can carry the patch to the finish line if the original author stops caring or being involved. (I’ve encountered a number of MIA authors in pull requests on my projects over the years.)

Patch Format

Custom patches are another way that the email process improves communication over the Github UI. Contributors can shape the patches to make them more readable for other people.

With Github you commit your changes and then Github chooses how to render the diff to the reviewer, but when creating the patch yourself you can tweak the settings. Diffs can be in unified or context format, and for the latter the number of context lines is configurable. (To be fair, Github allows expanding the view of context around changes too.)
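
As a rough illustration of the two formats and the configurable context, here is a sketch using Python's standard difflib module in place of the diff command. The file names and contents are made up; the n parameter plays the role of the context line count.

```python
import difflib

# Hypothetical before/after contents of a small file
old = ["alpha\n", "beta\n", "gamma\n", "delta\n", "epsilon\n"]
new = ["alpha\n", "beta\n", "GAMMA\n", "delta\n", "epsilon\n"]

# n=1 asks for a single line of context around each change
unified = list(difflib.unified_diff(old, new, "a/greek.txt", "b/greek.txt", n=1))
context = list(difflib.context_diff(old, new, "a/greek.txt", "b/greek.txt", n=1))

print("".join(unified))
print("".join(context))
```

The unified form marks the change with - and + lines in one hunk, while the context form shows the old and new regions separately and marks changed lines with an exclamation point.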

Some people feel that git-style unified diffs are clearest for small changes of a few lines, and context diffs are more suitable for changing larger ranges, especially lines with indentation changes. (To be fair, I should note that Github will hide indentation changes if you add ?w=1 to the URL displaying the change.)

Finally, creating patches by hand can take advantage of other options, like annotating the hunk headers with the C function each change falls in (the -p flag of diff), and choosing among diff algorithms such as “patience” (available in git as git diff --patience).

Patience does a little more work to calculate its results, but the output makes more sense for humans. For instance, a standard (non-patience) diff might look like this:

 void func1() {
     x += 1
+}
+
+void functhreehalves() {
+    x += 1.5
 }

 void func2() {
     x += 2
 }

Whereas a patience diff shows the function added as a unit:

 void func1() {
     x += 1
 }

+void functhreehalves() {
+    x += 1.5
+}
+
 void func2() {
     x += 2
 }

There’s also nothing to stop the mailing list style of collaboration from adapting to emerging diff formats, such as structural diffs. That’s because patches in emails directly describe changes whereas Github renders commit deltas without nuance.

Patch/Discussion mix

The bulk of communication around a new feature is in its proposal, not around the subsequent code. A contributor proposes the feature, justifies it, and confirms the scope before writing any code. The community responds, and often by the time anyone implements anything, the idea has shifted significantly from the start of the process.

This is again where emails fit the group discussion well, with inline replies and the threaded structure as mentioned above. When it’s time to get coding there’s no discontinuity of switching from an “issue” to a “pull request.” The author simply replies again with patches attached.

Control and customization

While web apps deliver a centrally-controlled user interface, native applications allow each person to customize their own experience. Open protocols like SMTP encourage a proliferation of clients. So it is with mail clients, each providing its own functions, filters, folders, flags and fun. Mail clients provide ways to mark a message important, or set it back as unread. In general you don’t have to unsuccessfully beg a central committee over five years for an interface change. (It happens, see this petition in an unofficial repo designed to supplicate the Github deities.)

Some people script their mail client so that they can apply patches with a keyboard shortcut, others go minimalist, and still others even use webmail. Each person is different, and so is their software, but the nature of the mailing list allows them all to work together.

Another area of control is the ability to search and interact with a mailing list while offline. This is often by choice rather than necessity. Urban and suburban areas usually have internet access available anywhere at any time, using cellular service if need be. Things are no longer as they once were with intermittent dial-up connections, but working offline is a choice that some people simply prefer.

Although git is itself distributed and operates locally without a network connection, Github requires connectivity to review issues and pull requests. Demanding connectivity for an essentially reflective solitary task is unreasonable.

Remember that being online applies in two directions: the developer’s connection to the internet and also Github’s uptime. Although the latter has been pretty stable for a while, I do remember a handful of times where my entire office got derailed by Github downtime. Taking Github down is a big attractive target for angry nations and ambitious hackers.

With a native email client you can review all emails and attachments offline. You can even send replies to messages offline and the client will queue them until internet access becomes available.

As for the other side, the uptime of the mailing list server: mail sent to an offline server will eventually go through. Mail transfer agents along the email chain are designed to retry transmissions. In addition, developer lists recommend that users “Reply All” to messages so that everyone involved in a thread, as well as the list reflector address, gets the message. That way even if the reflector is offline, your email can go through to those most recently involved.
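
What “Reply All” computes can be sketched in a few lines. This is a simplified, hypothetical version (all addresses are invented, and it ignores headers like Reply-To that real mail clients also honor):

```python
from email.message import EmailMessage
from email.utils import getaddresses

def reply_all_recipients(original, my_address):
    """Collect everyone on From, To and Cc, minus ourselves, without duplicates."""
    fields = []
    for header in ("From", "To", "Cc"):
        fields.extend(str(value) for value in original.get_all(header, []))
    recipients = []
    for _name, address in getaddresses(fields):
        address = address.lower()
        if address and address != my_address.lower() and address not in recipients:
            recipients.append(address)
    return recipients

# Hypothetical message received through a list
original = EmailMessage()
original["From"] = "Alice <alice@example.org>"
original["To"] = "hackers@lists.example.org"
original["Cc"] = "Bob <bob@example.org>, me@example.org"

recipients = reply_all_recipients(original, "me@example.org")
print(recipients)
```

Because Alice and Bob are addressed directly, they still receive the reply even if the reflector address is temporarily unreachable.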

Ultimately the control and customizability enjoyed by the mailing list style of development comes from diverse tools built for open standards. Patchfiles are universally supported (git itself directly supports patches), and SMTP is specified in an RFC. Tools can work together, rather than having a GUI locked in the browser.

Politics and Profit

Git was originally created as an escape from BitKeeper, a version control system with associated centralized hosting. BitKeeper granted a free license for open source projects, and required projects to store their metadata on company servers.

What a twist of history, then, that users of git chose Github… a centralized host granting free licenses for open source projects, and requiring projects to store their metadata on company servers.

Github can legally delete projects or users with or without cause. This makes sense of course, since the projects are using computers owned by Github. Also, while not explicitly prohibiting development of competing projects like BitKeeper did, Github can still view the private source code of all companies who choose to host with them.

Really, what’s the future of Github (or any profit-driven yet free code hosting service)? History provides the example of SourceForge, the premier code collaboration platform at the turn of the century. After acquisition by dice.com, and in a last-ditch effort to grab some more money, SourceForge hijacked hosted projects so that their installers would include adware and spyware.

When I first began notes for this article, the implications for Github were purely hypothetical, but even as of this writing news has emerged that Microsoft is buying Github. It doesn’t feel far-fetched to suggest that Github may be moving into its “SourceForge stage.”

Let this lesson be broader than Github though. The same logic applies to other web platforms that centralize and combine programmer communication and hosting.

Some major open source projects like the PostgreSQL database played it safe and explicitly chose never to put their code on external company servers (other than for mirroring). Postgres also takes pains to place control in a governing body spanning multiple companies. No company is allowed a majority of committers, and the organization is designed to survive independently.

Not all projects are important enough to require this level of neutrality, but all projects deserve to control their own destiny.

Right tool for the job

The fundamental task we’re talking about is asynchronous group communication of code changes. Patches are a universally understood way of describing code changes, and email is a universally understood method of communication. So it seems that this approach matches the problem at hand.

In fact emailing patches works with any version control system, not just Git. The OpenBSD project still happily uses mailing lists with CVS. Patch authors don’t need any version control software installed at all. An author can download the source release tarball, make changes in the copy, capture the diff, and email it.

Sending and applying patches cuts out busywork like cleaning up remote branches after merge, or creating a local branch in the forked repo in preparation for a pull request. For comparison, I remember teaching a group of new programmers how to use Github, and was conscious at the time of all the weird steps I asked them to do.

There’s also less busywork for finished communications. There aren’t things to “clean up” like abandoned pull requests, merged branches, or issues to mark closed. The replies just stop on those threads. More on some downsides of this later, but for now look on the bright side.

Using email also decouples digital identity from the accounts on a site like Github, and ultimately places that trust in the DNS system managed by an international organization, ICANN. For more about this, see my article Returning to the Original Social Network. PGP provides a further guarantee of identity, verified through a decentralized web of trust. See my recording of Neal Walfield’s talk An Advanced Intro to GnuPG.

Challenges for mailing lists

The mailing list approach raises quite a few questions and difficulties. I’ve done my best to try to acknowledge them and suggest possible solutions.

Unfamiliarity and obscurity

Variety of process

When you browse a Github project you know exactly how it works. Sure, the addons/CI/hooks may differ, and setting up the project might involve some contortions, but the process of browsing code, finding known bugs, and submitting changes is always the same. This is most definitely not true for projects with their own web presence and mailing lists.

I’ve heard people go so far as to say, “If a project is not on Github then there is a 0% chance that I will contribute.” To be honest I can’t say that I’ve contributed to non-Github projects either. After I migrated to Github from Google Code my only non-Github contributions (if you want to call them that) were filing a bug report for Chrome, and another for a tool on Gitlab.

One more source of unfamiliarity is that many people, especially younger devs, have never seen a real mail client like mutt. In their minds they may picture an overflowing Gmail inbox full of top-posted replies all mixed together from different projects. That’s not an enticing image. Perhaps this article can start these developers on the path to rediscovering the care and engineering that went into classic email clients (“MUAs” as they are called).

Given the unfamiliarity, a self-hosted project needs to communicate very clearly how it does collaboration. A mailing list alone is not enough; a project needs to explain itself on the web, and provide information about getting involved.

A final note about unfamiliarity. Beware that offline users can be misunderstood in today’s connected age. It can look silly to write an email while offline, only to have it queue, send, and arrive after another better response renders yours unnecessary. It’s probably wise to include a signature line saying something like, “Note: composed offline, may lack recent context.”

The next problem that projects outside Github face is the lack of social proof through star voting. It’s harder for new projects to build reputation without stars. People take notice of a project with thousands of stars in a way that the mere project description doesn’t evoke. While much established software like OSes, browsers, and databases does fine without the stars, it won its acceptance years ago, or is promoted by big companies.

One way to help self-hosted project popularity is to hook into Github without going too deep. What some projects like git and Linux do is run their own servers, but offer a read-only mirror on Github for public admiration and code browsing.

Two tools that can help are the Pull Request Rejection Bot and the built-in setting to disable issues. After disabling pull requests and issues, the project README needs to include information about how and where to contribute.

The self-hosted git server will also need a post-receive hook (the server-side hook that runs after a push has been accepted) to keep Github up to date. This command should do the trick in the hook:

git push github_remote -f --mirror

Disorganization

Patch state

A big difficulty for mailing lists is sharing the global state of email threads. Not the content of the messages, obviously, but the decisions that require action. Individual contributors can flag messages for action in their mail clients, but this status remains private for each individual. It’s helpful to record publicly that a patch has been accepted for future application, or that a previously reported bug was fixed.

Projects do this in one of two ways: adding metadata to the git repository itself, or using an external tool that ties into the mailing list.

For instance PostgreSQL had a problem where patches hung around without being accepted or rejected (there are up to ~250 proposed patches to consider at one time!). So the team created commitfest.postgresql.org. Contributors now register the patches on this site for final review. Several times per year committers stop their own work and make time to accept/defer/reject/apply the patches.

Patchwork is a similar tool that is not tied to Postgres and is more suitable for general use. Patchwork supplements a mailing list by subscribing to the list just like a person would, and capturing patches from the emails. For each patch it creates a web page. It doesn’t fragment discussion because it doesn’t allow commenting through the web interface. It merely reflects any comments from the emails, and allows maintainers to mark patches with a state such as Accepted, Rejected, or Under Review.

Comprehensive bug list

Here’s something surprising I just learned: the PostgreSQL project does not have any place a new contributor can go to find a list of open bugs for the project. There is a separate mailing list, pgsql-bugs, where people report them, but there’s nothing that ties that with the activity on pgsql-hackers to indicate whether the bug is resolved. Interestingly, OpenBSD also has a bugs email list and no central bug status.

Postgres is still searching for a system which matches how they like to work and which doesn’t detract from the mailing lists. I don’t believe OpenBSD desires a bug tracker. As they quipped on IRC, “If we did [have one] we would be FreeBSD, wouldn’t we?”

For projects that do want a bug tracker, one good contender is Debbugs from the Debian project. It provides an email interface to manipulate the bugs and can be used for projects other than Debian.

Other bug systems work by adding files or objects into the project git repo itself. Thus fixing a bug on a branch could mean deleting the bug from the repo along with the other changes on that branch. Like email, this preserves offline access to bug information. Probably one of the best is Bugs Everywhere. Another pretty clean one is simply called bug.

Message history

Once you subscribe to a mailing list, new messages flow into your client and are available for searching. Hooray! But what about messages sent before you joined the list? Furthermore, how do you “link” to old messages from newer ones? Every issue on Github has its own URL, so issues can refer to each other, but how would an email say something like, “we already discussed this in message X, look there.”?

The way this works is with the Message-ID header. Every email has one, and they are part of how replies work. They are big GUIDs that look like 8953.1527887111@sss.pgh.pa.us or CAFjFpReKyYrsUF8sP5GoPYyyp9ZSqm_mLeD8kQLigX-3CzDiUg@mail.gmail.com. So an email could say, “We discussed this issue in 8953.1527887111@sss.pgh.pa.us, let’s move the discussion there.”

Reading a message sent prior to your joining the list requires going to the list archive on the web. Good archives provide links to mbox files which can be imported into a native mail client. This makes the old messages available for search/reply in the mail client. Archives also usually provide a web interface to read the messages as HTML, but navigating threads in these old interfaces is usually way more painful than in a good mail client.
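
Mail clients do the mbox import for you, but the format is simple enough to poke at directly. Here is a small sketch using Python's standard mailbox module; it writes a throwaway mbox with two invented messages (standing in for a downloaded archive file) and then looks one up by Message-ID.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

def make_message(message_id, subject, body):
    """Build a tiny message; sender and ids are hypothetical."""
    msg = EmailMessage()
    msg["From"] = "dev@example.org"
    msg["Subject"] = subject
    msg["Message-ID"] = message_id
    msg.set_content(body)
    return msg

# Stand-in for an archive file downloaded from a list's web archive
path = os.path.join(tempfile.mkdtemp(), "2018-06.mbox")
box = mailbox.mbox(path)
box.add(make_message("<1@example.org>", "[PATCH] fix typo", "patch body"))
box.add(make_message("<2@example.org>", "Re: [PATCH] fix typo", "looks good"))
box.flush()

def find_by_message_id(box, message_id):
    """Linear scan of the mailbox for a given Message-ID."""
    for msg in box:
        if msg["Message-ID"] == message_id:
            return msg
    return None

found = find_by_message_id(box, "<2@example.org>")
print(found["Subject"])
```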

As an example, check out the archives for pgsql-hackers. Notice the mbox file links. You can download an mbox for a month’s worth of messages, for individual messages or for full threads. To navigate to a certain message id, use the url https://www.postgresql.org/message-id/:the-id. Try one of the ids I listed earlier. Similarly the Linux kernel mailing list archives can display a message by id via lkml.kernel.org.
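
Turning a Message-ID into an archive link is mechanical. A minimal sketch, assuming the URL pattern above; the function name is my own, and percent-encoding everything except the @ sign is a guess at what the archive accepts rather than documented behavior:

```python
from urllib.parse import quote

def postgres_archive_url(message_id):
    """Build an archive link from a Message-ID, with or without angle brackets."""
    bare = message_id.strip().strip("<>")
    return "https://www.postgresql.org/message-id/" + quote(bare, safe="@")

url = postgres_archive_url("<8953.1527887111@sss.pgh.pa.us>")
print(url)
```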

Remember that all emails have a Message-ID header; this isn’t something special having to do with mailing lists. If you want to reply to a message that’s not imported in your mail client, you need to include the email header In-Reply-To: <message-id> or the thread will be split, regardless of the subject line you use.
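
For the curious, here is roughly what a hand-built reply looks like, sketched with Python's standard email package. The addresses are invented, and the References header is an extra that many clients also set alongside In-Reply-To.

```python
from email.message import EmailMessage

def make_reply(parent_id, parent_subject, sender, recipients, body, references=""):
    """Build a reply that threads under the message with id `parent_id`."""
    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = ", ".join(recipients)
    if parent_subject.lower().startswith("re:"):
        reply["Subject"] = parent_subject
    else:
        reply["Subject"] = "Re: " + parent_subject
    # The header that actually ties the reply to its parent
    reply["In-Reply-To"] = parent_id
    # Earlier ancestors (if known) followed by the parent
    reply["References"] = (references + " " + parent_id).strip()
    reply.set_content(body)
    return reply

reply = make_reply(
    "<8953.1527887111@sss.pgh.pa.us>",
    "Some old discussion",
    "me@example.org",                # hypothetical sender
    ["hackers@lists.example.org"],   # hypothetical list address
    "Picking this thread back up.\n",
)
print(reply["In-Reply-To"])
```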

External tools

Email and patch files are both old interoperable standards, so, theoretically, running patches through continuous integration ought not to be a problem. However I don’t know that the tooling here is well-developed. Where Gitlab has a CI server built in, and several addons exist for Github, I don’t know of any clearly dominant tool for a patch-based workflow. IBM built a tool called snowpatch to do this, but I don’t know what tomfoolery is involved.

It is possible to read patches from a list and then script an integration to a CI tool via patching and pushing to Github. Thomas Munro recently created cfbot to do this for Postgres. (See the interesting slides about its development.) It’s a creative use for Github that doesn’t compromise the mailing list flow.

Barrier to entry

Related to the unfamiliarity mentioned earlier, using a mailing list well requires new skills for contributors, and infrastructure for maintainers.

Tooling and the protocol

Although you can subscribe to a list with Outlook or Gmail and call it a day, neither you nor the people you interact with will be happy about it. Email is a precise tool from ye olde Unix days, where people care about things like the plain text lines you send over the wire, or the way you use MIME types.

But even before crafting these little RFC 822 gems you have to go through a dance every time you want to subscribe to a list, confirming your address and adding a folder and filter to the mail client. It’s not like jotting a note in the issues of another Github repo, unless the mailing list is “open,” meaning it accepts mail from non-registered addresses. Most lists are “closed” because of spam. Such is the price we pay for basing identity on the DNS system and not on a corporate collaborative coding site profile.

Once on the list and ready to send some messages or patches, there are a few guidelines to follow that pretty much all lists ask. They help accommodate the diversity of mail clients:

Send with plain text
Disable all HTML “enhancements”
At most 72 characters per line
Do not top-post
Keep quoted text small and relevant
Strip out company legal footers
Shorten lengthy signatures
Use Reply-All
If replying, ensure In-Reply-To is set
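
A good mail client handles the line length rule for you, but the idea is easy to sketch. The helper below is hypothetical: it re-wraps ordinary paragraphs to 72 columns and leaves quoted paragraphs alone. It should not be run over a message body containing a patch, since re-wrapping would corrupt the diff.

```python
import textwrap

def wrap_plain_text(body, width=72):
    """Re-wrap paragraphs to `width` columns, leaving quoted lines untouched."""
    wrapped = []
    for paragraph in body.split("\n\n"):
        lines = paragraph.splitlines()
        if any(line.startswith(">") for line in lines):
            # Keep quoted text exactly as the previous sender wrote it
            wrapped.append(paragraph)
        else:
            wrapped.append(textwrap.fill(" ".join(lines), width=width))
    return "\n\n".join(wrapped)

sample = "> quoted " + "x" * 100 + "\n\n" + "word " * 40
result = wrap_plain_text(sample)
print(result)
```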

If your message contains a patch then there are more rules to observe, and they vary by community. Linux, OpenBSD and FreeBSD say, “No MIME, no links, no compression, no attachments. Just plain text.” The patch goes right in the body of the message at the bottom. If the mail client would mangle the code (such as by wrapping long lines), then attaching with MIME is permissible. For patches above 300kb, host them online and include a link.

Postgres, on the other hand, wants patches attached as type text/x-patch, with disposition: inline for tiny patches and disposition: attachment for substantial ones. In both cases the Content-Disposition should include a unique filename for the patch. Different reviewers also prefer different patch formats: Tom Lane will only accept context diffs, for instance, while other people want unified diffs.
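
To show what those MIME details amount to, here is a sketch that builds such a message with Python's standard email package. The addresses, subject, file name and patch text are all invented, and this is only my reading of the convention described above, not an official recipe.

```python
from email.message import EmailMessage

# Hypothetical patch text
example_patch = (
    "--- a/buf.c\n"
    "+++ b/buf.c\n"
    "@@ -1,1 +1,1 @@\n"
    "-size = n;\n"
    "+size = n + 1;\n"
)

def build_patch_mail(patch_text, filename, substantial=True):
    """Attach a patch as text/x-patch with an explicit disposition and filename."""
    msg = EmailMessage()
    msg["From"] = "contributor@example.org"          # hypothetical
    msg["To"] = "project-hackers@lists.example.org"  # hypothetical
    msg["Subject"] = "[PATCH] fix off-by-one in buffer sizing"
    msg.set_content("Version 2 of the patch is attached.\n")
    msg.add_attachment(
        patch_text,
        subtype="x-patch",
        disposition="attachment" if substantial else "inline",
        filename=filename,
    )
    return msg

mail = build_patch_mail(example_patch, "0001-fix-off-by-one-v2.patch")
part = next(mail.iter_attachments())
print(part.get_content_type(), part.get_content_disposition(), part.get_filename())
```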

Hosting requirements

On the maintainer side, it’s not as easy to make a new project as it is clicking a button on Github. Maintainers need to host the mailing list and optionally a patch status app and a bug tracker.

I’d advise creating a general purpose listserv for your personal use, which can run lists for all your small projects. Then create project pages off your homepage or something. I just can’t see taking the overhead to spin up new infrastructure for a dinky experimental open source utility.

There are probably commercially hosted listservs for rent, or you can host your own. Probably the best ones for self-hosting are GNU Mailman or Mlmmj. There’s also a service called The Mail Archive which can add an archive to any list. Might also be worth looking into public-inbox which is a new approach that shares a mailbox itself over Git. Git to hold the mailing list to talk about git – inception!

For your listserv you’ll also want to install SpamAssassin and restrict allowed MIME types to plain text and patches. Also remember to configure the list to use Reply-To rather than From for DMARC compatibility. This whole ordeal isn’t something you want to have to do twice!

Optionally you may want to host a lightweight web interface for browsing git. The simplest seem to be one of these:

Alternately you could just use Github for the actual git server and code browsing. Or mirror there.

Finally, you could go really hard-core and let people use git’s builtin functionality for pulling a shallow clone rather than providing any web interface:

# get the latest commit
git clone --depth=1 --single-branch --branch master <remote>

# or get it and don't even include the .git folder
# (git archive writes a tar stream to stdout, so unpack it with tar)
git archive --remote=<remote> master | tar -xf -

This is rather unfriendly though.

Topic specialization

Sometimes it’s useful to restrict the general purpose nature of email messages. For instance bug reports should follow a certain format and include details about the environment. Github offers issue templates; what’s the equivalent for email?

Some projects address this with helper programs that collect information and format an email on your behalf. The information is thus specialized for the type of software involved. GNU has GNATS, OpenBSD has sendbug, Debian has debbugs and Postgres has a bug report form.

Noob questions are another kind of specialized topic, as are topics that interest only a small subset of the list. Lists full of kernel developers or database hackers tend to be… stern. Communities which still use mailing lists are usually the ones full of scary wizards who invent the world that the rest of us play in. Thus it can be intimidating to ask basic questions there. What projects usually do is create a beginner list separate from the main developer zone. Also IRC is a good way to ask questions that fade away.

Why not NNTP?

Isn’t the whole email list concept a bit of a hack for turning a protocol designed for small-group communication into a behemoth? After all, NNTP is made for threaded large group discussion.

NNTP has slightly worse offline characteristics though. While email allows the Reply-All trick to keep things limping along during a list outage, replies through NNTP have to route through a central server. Also identity verification needs to happen on the server, whereas email can delegate to SPF/DKIM/DMARC. There are also more clients for email than for NNTP, and more of them are still maintained.

Finally, if you like reading a mailing list with a newsreader, go right ahead. Use a bridge like GMANE.

Conclusion

I hope that I’ve given mailing lists a fair comparison with today’s more common web interfaces. It seems that mailing lists offer real advantages, and that some projects could benefit from making the switch. The biggest difficulty with mailing lists seems to be simply setting them up. Perhaps if there were a deployable server image including the necessary software and configuration, that would help boost adoption.

begriffs