Resilient Haskell Software

Lessons learned about bitrot in Haskell software
topics: Haskell, R, archiving, technology, computer science
created: 26 Sep 2008; modified: 12 Feb 2011; status: in progress; confidence: likely; importance: 4


In 2007, Haskell began to experience some real growth; one of the new users was myself. The old way of distributing each package individually, with Autotools handling configuration & installation, was no longer the state of the art in distribution; the shining example was Perl’s CPAN. At about that time, Duncan Coutts and a few others proposed something similar called Cabal.1

I was looking around for a way to help out the Haskell community, and was impressed by the idea of Cabal. It would make it easier for Linux distributions to package Haskell libraries & executables (and it has; witness how Arch Linux’s distro packages are automatically generated from the Cabal packages). But more importantly, by making dependencies easier for authors to rely on (or eventually automatically installable thanks to cabal-install), Cabal would discourage duplication and encourage splitting out libraries. Without it, an application like the Gitit wiki, with its >40 dependencies, would simply be unthinkable. It may be hard for Haskellers who started after 2009 to believe, but applications would often keep copies of whatever libraries they needed in the source repository, if the libraries weren’t simply copied directly into the application’s source code. (In a case I remember because I did much of the work to fix it, Darcs used a primitive local version of bytestring for years after bytestring’s release.)

Unfortunately, Cabal’s uptake struck me as slow. My belief seems to be supported by the upload log. In 2006, there are 46 uploads by 5 people; the uploads are mostly of the ‘boot’ libraries distributed with GHC like mtl or network or unix. 2007 shows a much better uptake, with 586 uploads (not 586 packages, since many packages had multiple versions uploaded) by 100 people. (I was responsible for 4 uploads.)

So I decided to spend most of my time Cabalizing packages or nagging maintainers to upload. Ultimately I did 150 uploads in 2008. (A discussion of my Haskell and uploading activities can be found on the about me page.) In total, Hackage saw 2307 uploads by 250 people that year. In 2009, there were 3671 uploads by 391 people; in 2010, there were 5174 uploads by 490 people. Long story short, Cabal has decisively defeated Autotools in the Haskell universe, and is the standard; the only holdouts are legacy projects too complex to Cabalize (like GHC) or refuseniks (like David Roundy and John Meacham). I flatter myself that my work may have sped up Cabal adoption & development.

As you can imagine, I ran into many Cabal limitations and bugs as I bravely went where no Cabalist went before, but most of them have since been fixed. I also worked on some very old code (one game dated back to 1997 or so) and learned more than I wanted to about how old Haskell code could bitrot.

I learned (I hope) some good practices that would help reduce bitrot. In order of importance, they were:

  • Cabalization and metadata are good. This ties into the old declarative vs imperative approach: a Makefile can be doing all sorts of bizarre IO and peculiar scripting. It’s one thing to understand a README which mentions that such and such a file needs to have a field edited, and that the LaTeX and man pages should be generated from the LaTeX documentation; but it’s quite another to understand a Makefile which uses baroque shell one-liners to do the same thing. The former has a hope of being easily converted to alternative packaging and make systems; the latter does not.
  • Unless there’s a very good reason, not using Darcs, Cabal, and GHC is only hurting yourself. Those three are currently “too big to fail”.
  • Fix -Wall warnings as often as possible. What is today merely imperfect style can later be real intractable errors.
  • Anything which has a custom Setup.hs or which touches the GHC API is death to maintain! I cannot emphasize this enough: these bits of functionality bitrot like mad. Graphics libraries are quite bad, but the GHC and Cabal internals are even more unstable. This is not necessarily a bad thing; the Linux kernel developers have a similar famous philosophy articulated as why you don’t want a stable binary API. But it is definitely something to bear in mind.
  • It may seem anal to explicitly enumerate imports (i.e. import Data.List (nub)), particularly given how this can restrain flexibility and cause annoying compile problems, but much later, enumerated imports are incredibly valuable. Ten years from now, the Haskeller looking at your code may have no idea what this Linspire.Debian module is. You may say, just comment out imports one by one and see what goes missing. But what if there are a dozen other things broken, or dozens of imported modules? The cascade of complexities can defeat simplistic techniques like that. And you really have no choice: imports are one of the very first things which get checked by the compiler. If they don’t work out, you’ll be stopped dead right there. There are other benefits of course: you significantly reduce the chance of ambiguous imports, and dead code becomes much easier to find. (This can be taken too far, of course; it usually makes little sense to explicitly import and enumerate stuff from the Prelude.)
  • Another stylistic point is that functions defined in where-clauses can easily accidentally use more variables than they are given as arguments. This can lead to nice-looking code indeed, but it can make debugging difficult later: the type signatures are usually omitted in where-clauses. Suppose you need them? You will have difficulty hoisting the local definition out to the top level, where you can actually see what type is being inferred and how it conflicts with what type is needed.
  • Code can hang around for a very long time. It is short-sighted to not preserve code somewhere else. I ran into some particularly egregious examples where not only had the site gone down, taking with it the code, but their robots.txt had specifically disallowed the Internet Archive from backing up their site! I personally regard such actions as spitting in the face of the community, since we all stand on each other’s toes, as the expression goes. There are no truly independent works.
  • In a similar vein, we should remember openness is a good thing! Openness is such an important thing that entire operating systems are created just for it. Why did OpenBSD fork from NetBSD and take that name? Was it just because of bad blood in the core development team? No, everyone else went along because OpenBSD took a stand and made its CVS repositories open. Open repositories encourage contribution as little else short of a Free license does; if you keep the repository private, people will always worry about duplicated and wasted work, about rejected patches, about missing files necessary for compilation and development but not for release and simple usage. Open repos invite people to contribute, they allow your code to be automatically archived by search engines, the Internet Archive, and your fellow coders.
  • Licensing information is a must! A custom license is as bad as a custom Setup.hs, in a way. It is hard to add into files, which increases uncertainty and legal risk for everyone interested in preserving it. Which are you more likely to work on and distribute: a file which says in the header “License: GPL”, nothing at all, or even worse, “see LICENSE for the crazy license I invented while on a drunken bender”?
  • Besides avoiding writing non-Free software, do not depend on non-Free software. “In the long run, the utility of all non-Free software approaches zero. All non-Free software is a dead end.”2 Non-free software inherently limits the pool of people allowed to modify it, hastening the day it finally dies. A program cannot afford to be picky, or to introduce any friction whatsoever. In a real sense, bad or unclear licensing is accidental complexity. There’s enough of that in the world as it is.
  • A program which confines itself to Haskell 98 and nothing outside the base libraries can last a long time; just the other day, I was salvaging the QuakeHaskell code from 1996/1997. Once the file names were matched to the module names, most of the code compiled fine. Similarly, I took Haskell in Space from 2001, and all I had to do was update where the HGL functions were being imported from. A corollary to this is that code using just the Prelude is effectively immortal.
  • Include derivations! It’s perfectly fine to use clever techniques and definitions, such as rleDecode = (uncurry replicate =<<) for decoding run-length encoded lists of tuples3, but in the comments, include the original giant definition which you progressively refined into a short diamond! Even better, add a test (like a QuickCheck property) which demonstrates that the outputs of the two are the same. If you are optimizing, hold onto the slow versions which you know are correct somewhere. Derivations are brilliant documentation of your intent, they provide numerous alternate implementations which might work if the current one breaks, and they give future Haskellers a view of how you were thinking.
  • Avoid binding to C++ software if possible. I once tried to Cabalize qtHaskell, which binds to the Qt GUI library. Apparently, you can only link to a C++ library after generating a C interface, and the procedure for this is non-portable, unreliable, and definitely not well-supported by Cabal.
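
The points above about enumerated imports and where-clauses can be sketched in a few lines. This is only an illustration under assumptions: the function names (labelLocal, tagWith, labelHoisted) are hypothetical and come from no real package. The first version hides a captured variable inside a where-clause; the second hoists the helper to the top level, where its type is written down and checked.

```haskell
module Main (main) where

-- Enumerated imports: a future reader can see exactly which names
-- each module is expected to supply.
import Data.Char (toUpper)
import Data.List (nub, sort)

-- Version 1 (hypothetical example): the helper lives in a where-clause and
-- silently captures 'prefix' from the enclosing scope, so its full type is
-- never written down.
labelLocal :: String -> [String] -> [String]
labelLocal prefix = map tag . nub . sort
  where
    tag name = prefix ++ ": " ++ map toUpper name

-- Version 2: the helper hoisted to the top level. The captured variable
-- becomes an explicit argument and the signature is checked by the compiler.
tagWith :: String -> String -> String
tagWith prefix name = prefix ++ ": " ++ map toUpper name

labelHoisted :: String -> [String] -> [String]
labelHoisted prefix = map (tagWith prefix) . nub . sort

main :: IO ()
main = do
  let input = ["ghc", "cabal", "ghc", "darcs"]
  print (labelLocal "tool" input)
  -- Both versions should agree on the same input.
  print (labelHoisted "tool" input == labelLocal "tool" input)
```

If Data.List ever stops exporting nub or sort, the import line itself says which names went missing, and tagWith can be type-checked in isolation without untangling its enclosing function.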

  1. Strictly speaking, Cabal was first proposed in 2003 and a paper published in 2005, but 2007 was when there was running code that could credibly be the sole method of installation, and not an experimental alternative which one might use in parallel with Autotools.

  2. Mark Pilgrim, “Freedom 0”; he is quoting/paraphrasing Matthew Thomas.

  3. For an explanation of this task, and how that monad stuff works, see Run Length Encoding.
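
As a rough sketch of the "include derivations" advice applied to the rleDecode one-liner, the file below keeps each refinement step next to the final form and checks that they agree. The intermediate names (rleDecodeSlow, rleDecodeConcat, rleDecodeUncurry, prop_sameAsSlow) are assumptions made up for this illustration, and the check uses a handful of hand-picked samples so that it needs nothing beyond base; with QuickCheck installed, the same property could presumably be passed to quickCheck instead.

```haskell
module Main (main) where

-- Reference version: explicit recursion, long-winded but easy to verify by eye.
rleDecodeSlow :: [(Int, a)] -> [a]
rleDecodeSlow []              = []
rleDecodeSlow ((n, x) : rest) = replicate n x ++ rleDecodeSlow rest

-- Step 1: the recursion is a map followed by a concatenation, i.e. concatMap.
rleDecodeConcat :: [(Int, a)] -> [a]
rleDecodeConcat = concatMap (\(n, x) -> replicate n x)

-- Step 2: the lambda only unpacks a pair, which is what 'uncurry' does.
rleDecodeUncurry :: [(Int, a)] -> [a]
rleDecodeUncurry = concatMap (uncurry replicate)

-- Step 3: in the list monad, (f =<<) behaves like concatMap f,
-- which gives the short form quoted in the text.
rleDecode :: [(Int, a)] -> [a]
rleDecode = (uncurry replicate =<<)

-- Property: every stage of the derivation agrees with the reference version.
prop_sameAsSlow :: [(Int, Char)] -> Bool
prop_sameAsSlow xs =
  all (== rleDecodeSlow xs) [rleDecodeConcat xs, rleDecodeUncurry xs, rleDecode xs]

main :: IO ()
main = do
  -- Hand-picked samples, including zero and negative counts.
  let samples = [[], [(3, 'a'), (1, 'b')], [(0, 'x'), (2, 'y')], [(-1, 'z')]]
  print (rleDecode [(3, 'a'), (1, 'b'), (2, 'c')])
  print (all prop_sameAsSlow samples)
```

Should the short form ever stop compiling, any of the earlier stages can be swapped in, and the property records which behaviour was intended.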