https://gdstechnology.blog.gov.uk/2013/12/05/building-a-new-router-for-gov-uk/

Building a new router for GOV.UK

This is part one of what will be a series of blog posts about the new dynamic HTTP router that serves GOV.UK. This post describes our original motivations in building the router, and some of our experiences building the prototypes that informed later work. It was written for us by Nick Stenning, who kicked off this project but has since left us to be Technical Director of the Open Knowledge Foundation. Thanks Nick!

Why was this needed?

GOV.UK is the single domain for government, presenting services and information from hundreds of arms of government in one place. This is as it should be: we hide the complexity of government in order to better meet the needs of our users, who should not be expected to know which department is responsible for providing the service they are seeking. But behind the scenes, GOV.UK is not one large program. In fact, it is made up of many small web applications that each aims to do one thing well. In order to present these applications as one website, we need a piece of technology that will route our users’ requests to the correct application: an HTTP router.

When we went live with GOV.UK in mid-October 2012, the job of the router was done by a set of three identically configured Varnish instances at the front of our infrastructure. Varnish is an extraordinarily capable piece of software, but we were probably stretching the bounds of reasonable usage. Certainly, the resulting VCL was a horror to behold:

if (req.url ~ "^/autocomplete(\?.*)?$|^/preload-autocomplete(\?.*)?$|^/sitemap[^/]*.xml(\?.*)?$") {
  <%= set_backend('search') %>
} else if (req.url ~ "^/when-do-the-clocks-change([/?.].*)?$|^/bank-holidays([/?.].*)?$|^/gwyliau-banc([/?.].*)?$") {
  <%= set_backend('calendars') %>
} else if (req.url ~ "^/(<%= @smartanswers.join("|") %>)([/?.].*)?$") {
  <%= set_backend('smartanswers') %>
} else if (req.url ~ "^/child-benefit-tax-calculator([/?.].*)?$") {
  <%= set_backend('calculators') %>
} else if (req.url ~ "^/stylesheets|^/javascripts|^/images|^/templates|^/favicon\.ico(\?.*)?$|^/humans\.txt(\?.*)?$|^/robots\.txt(\?.*)?$|^/fonts|^/google[a-f0-9]{16}\.html(\?.*)?$|^/apple-touch(.*)?\.png$") {
  <%= set_backend('static') %>
} else if (req.url ~ "^/(designprinciples|service-manual|transformation)([/?.].*)?$") {
  <%= set_backend('designprinciples') %>
...
} else {
  <%= set_backend('frontend') %>
}

But aside from the abuse of regular expression conditionals that might make even a seasoned system administrator’s eyes bleed, you might wonder what was actually wrong with this setup?

Two primary issues caused us to reconsider how we handled HTTP routing:

  1. Maintainability: keeping a list of all routes in a single file led to unfortunately coupled deployments. When an application needed to change what URLs it could respond to, we had to deploy updated VCL at the same time as an application, an operation that was tricky to orchestrate correctly. More importantly, having to update a single file replete with arcane syntax and workarounds introduced risk into every URL change, and slowed us down substantially.
  2. Performance: Varnish didn’t have any trouble dealing with our terrible abuse of its configuration language, but the final few lines of the VCL template above caused us endless headaches.

The performance issue was really the kicker. There are tens of thousands of URLs exposed on GOV.UK, and we didn’t want to maintain all of these in our Varnish configuration. Luckily for us, the vast majority of these thousands of URLs are served by only two applications, frontend and whitehall, which respectively expose the main citizen content (stuff like the browse pages) and the government corporate publishing pages. Each of these applications in turn gets its content from the content API, an internal interface to the database that stores most of GOV.UK in a raw form. This means that any request which made it all the way through the chain of conditionals in our VCL could potentially make four separate HTTP requests behind the scenes:

  1. Varnish to frontend
  2. frontend to the content API (“do you have this page and is it in a format I can render?”)
  3. If frontend doesn’t own the content, it returns a 404 and nginx, our webserver, would proceed to try the whitehall application
  4. whitehall to the content API

You’re probably cringing by this point. So were we. Even if all the requests were fast (and “fast” in your average Rails application tends not to be especially fast), we’re still talking about at least 200ms to return a 404 page, or 150ms to return each and every page from whitehall. Perhaps worst of all is the fact that for every successful request to the whitehall application, we were making two requests to application servers that resulted in a 404. Each application server has a limited number of request workers (we front our Ruby applications with unicorn), which means that while those 404ing requests were in flight, no other legitimate requests could be served by those worker processes.

Something had to change.

Prototyping a new router

Back in April, I set aside a few days to figure out what a better router for GOV.UK might look like. I made the decision to prototype in Go. The simplicity and safety of the Go language were a good fit for a core component of our HTTP infrastructure, and some brief experiments with the excellent net/http package convinced me I was on the right track. In particular, Go’s concurrency model makes it absurdly easy to build performant I/O-bound applications, as the router would undoubtedly be if implemented correctly.

Planting a trie

The first question to answer was how to store and look up entries in the routing table. GOV.UK has excellent URLs, which reflect the logical structure of the site. (For example, the Department of Health main page lives at /government/organisations/department-of-health, a “child” page of the listing of government departments and agencies, which lives at /government/organisations.) Given the tree structure of URLs, and the fact that the router would most often be matching by URL prefix (because, for example, everything under /government is served by one application), the natural choice was a prefix tree, or “trie”.

Implementing a prefix tree in Go was quick and painless. The result (which, like everything else described in this blog post, can be found on GitHub) is a data structure that can map slices of strings ([]string{"government", "organisations"}) to arbitrary data (interface{}, in Go-speak). One aspect of Go that made this process particularly pleasant was the language’s built-in support for testing. Despite the fact this was a prototype, the lack of overhead involved in writing tests was such that the ~80 line trie package ended up with just under 200 lines of data-driven tests.

Handling HTTP

The next step was to actually use a trie as a routing table. Go has a somewhat idiosyncratic (but very well-designed) HTTP library, net/http, in which the fundamental concept is the handler, or http.Handler. http.Handler is an interface type. Explaining Go interface types is outside the scope of this blog post, but suffice to say that if you can implement a method with the signature ServeHTTP(ResponseWriter, *Request) on your type, then it can act as an http.Handler anywhere the Go standard library uses one.

And this is exactly the purpose of the triemux package. Mux, short for multiplexer, is the term Go uses to refer to a component that accepts requests and routes them to different places on the basis of the request properties (such as the URL). In other words, it’s an HTTP router. A triemux is an HTTP router which satisfies http.Handler and can be used anywhere you might use Go’s default ServeMux. It adds a few thread-safety protections (a read-write mutex) around the routing table itself, in order to allow dynamic updates to the routing table while continuing to serve requests.

One of the more elegant effects of the ubiquitous http.Handler pattern in Go is that a mux, which is itself an http.Handler, is just a way of directing traffic to a set of different http.Handler instances. triemux doesn’t make any assumptions about what the handlers are. Instead, that’s the job of the router package.

Dynamic route loading

In order to actually solve the problems we outlined in the first part of this blog post, we have to load routes from some data store that can be updated by applications when they deploy. We make extensive use of MongoDB behind the scenes at GOV.UK, and the router package acts as the glue between a Mongo database and a triemux. Routes are loaded into memory on startup, and traffic is directed to one of a number of backends (also defined in the database) using Go’s built-in reverse proxy factory. This arrangement has some rather nice features. In particular, we can dynamically reload routes when an application is deployed, and atomically flip to the new routing table, with no requests dropped. If anything goes wrong while reloading routes (for example, if we have trouble talking to Mongo or parsing its responses), we can easily register a deferred recovery routine that will ensure that routing continues unaffected.

A router, built

When I started to prototype a new router for GOV.UK, I had written practically nothing in Go. And yet what I’ve described took just two and a half days, while the result outperformed our entire production stack by several orders of magnitude. (In fact, I had difficulty benchmarking the throughput of the router: I repeatedly ended up benchmarking the test servers behind the router rather than the router itself.)

Subsequently, I have started referring to what I call the “unreasonable effectiveness of Go” (with apologies to Eugene Wigner). Go is a small language, which is the only reason I was able to use it effectively in a short period of time. But the size of the language belies its expressive power, the quality of its standard library, and the surprising ease with which relatively complex things can be composed out of small pieces (in this case, trie -> triemux -> router).

It’s safe to say that I was pleasantly surprised by my adventures in prototype land, but a working prototype is a long way from a production deployment, especially for a component upon which every single request to GOV.UK depends. Later blog posts will explain how my colleagues took on the much more difficult job of testing and deploying a new router in front of the nation’s website.




6 comments

  1. mk

    Great article, looking forward to reading the rest of the series.

  2. bn

    I cringe when I read these kind of blog entries.

    You seem to have done a great job in streamlining a hideous process into a more manageable (if slightly overtrendy) one which I don't fault at all but at the same time it has been done whilst ignoring the actual issue. There are probably reasons for that, maybe the process itself is outside your role but it seems a shame if it isn't being addressed.

    The idea behind gov.uk should not just be about hiding multiple independent departments behind a common URL but in more joined up thinking and some consistency between one interaction with a person and the next, presenting information in the terms of the user. E.g. Google or Facebook don't run a bunch of independent servers, they package up work/resources that can be delivered or processed consistently by a more generic platform.

    There is a lot of PR about how GDS is a new approach but if that is just a veneer rather than a shift in thinking then that is a real opportunity missed.

    • James Stewart

      Hi bn,

      Thanks for your comment. I think we may have been a little unclear in how this is presented as this post is very focussed on the technical implementation rather than the overall aims of GOV.UK.

      It's not that we have separate applications for different departments, but instead that as we've looked at the various user needs that GOV.UK meets it's made sense for those to be met by a variety of focussed pieces of software. The separation is based on the work we've done to understand and meet those user needs, not on the structure of government departments.

      Alongside this technical work there's been a lot done to join up all sorts of areas, for example changing how we present policies, to bring the focus onto the policy rather than starting with the departments responsible for them.

      Through our design principles, service standard and transformation programme we're also working to transform a range of more transactional services so that they provide consistent (and much improved) experiences with the aim that if you learn how to use one you'll be equipped to use any of them. There's a lot more on that at https://www.gov.uk/transformation and on the main GDS blog at http://digital.cabinet-office.gov.uk

  3. bn

    Well they are certainly very interesting in giving an insight to government services in terms that a technical person can relate to, and whether those aims you mention are actually being put into practice.

  4. Tom

    So are you still using Varnish (or similar) in front of, or behind, your router for caching, or is that being handled by the router too?

    • James Stewart

      We considered adding caching into the router but decided it would be best to have it focus on doing just one thing well. So we're running varnish (for caching) and nginx (which helps screen out some bad requests, among other things) in front of the router.
