Ever since we rolled out our zero-downtime HAProxy reload system a few years ago, we have been disappointed that it required additional investment to work well for the external load balancing at our edge. We did build a prototype that used an intermediary qdisc so we could apply the approach there, but after evaluating the prototype, and finding that Linux wasn’t going to fix the upstream kernel issue, we decided to go another way.

Our edge is different from our internal load balancing tier because we have typically terminated TLS with another great proxy: NGINX. NGINX is useful because it supports hitless HTTP (and, more recently, TCP) listeners through file descriptor passing, so we naturally started thinking about how we could replace or combine NGINX with HAProxy to solve the problem of global, dynamic, highly available external and internal load balancing. This article sketches out the answer we came up with as the next iteration of our load balancing infrastructure.

Load Balancing Gremlins

Yelp, like many internet companies, faces two distinct load balancing challenges: getting requests to our edge (external load balancing) and routing service calls within our Service Oriented Architecture (internal load balancing). In late 2016 our systems for both of these problems were not keeping up with our rapidly evolving application architecture.

Traditionally, our edge was fairly static and did not need the rapid reconfiguration that SmartStack provides. Web servers came and went infrequently and our HAProxy configuration only needed to reload when we bought new hardware or manually launched new instances. A relatively static NGINX terminating TLS and forwarding to a HAProxy load balancer was sufficient. This assumption stopped holding true, however, as we moved fully onto AWS, used autoscaling more heavily, and shifted our monolith into our PaaS. Our old, mostly static, configuration had to become dynamic, which meant we needed SmartStack. We also hit scalability issues because our original edge design had NGINX connect back to HAProxy over a loopback IP, which eventually led to exhaustion of ephemeral TCP ports.

Our internal load balancing solution was also showing signs of deterioration under changing requirements. Hundreds of services appeared on our PaaS, and we started running into Marathon edge cases where containers constantly appear and disappear, sometimes living only briefly. For example, Marathon needs to move tasks around when a container comes up but quickly dies due to an application bug, or when container autoscaling scales a service up or down. This causes issues for dynamic load balancing because the routing data plane must converge quickly across your entire infrastructure, or else it will slow down application deployments or accidentally route traffic meant for one service to a different service’s container. Marathon’s distributed control plane and inability to keep tasks in the same “place” make it very difficult for highly available dynamic load balancing tiers to achieve this.

Synapse and Nerve could handle the churn, but as we scaled we started seeing performance issues in the internal load balancing system with our previous reload approach. In particular, HAProxy reloads now introduced 50-100ms of latency due to the large HAProxy configuration size and were triggered extremely frequently by changes in Marathon. Reloads would also, very rarely, drop new connections because of various low probability race conditions. In addition, we ran into a kernel bug when we upgraded to Linux 4.4 where localhost queuing disciplines arbitrarily drop TCP packets and introduce 200ms of latency. Linux 4.2 did fix the three-way handshake bug that our qdisc was primarily defending against, but other races (e.g. accept before close) remained, and the 4.4 qdisc performance regression was a serious issue.

We wanted a solution that could solve our entire problem. In particular, we wanted to be able to:

  1. Add, remove, and modify load balancing configuration without any possibility of downtime or added latency
  2. Propagate load balancing updates across a global infrastructure in a few seconds in the typical case
  3. Accept connections for both TCP and HTTP services
  4. Access comprehensive metrics and monitoring
  5. Access advanced load balancing features, especially tooling for automated failover
  6. Scale TLS termination easily to multiple cores
  7. Make the system easy to understand for a typical operator

Our Solution

Our infrastructure team decided to solve this general problem by combining these two great pieces of software, and by using some relatively new features of HAProxy and NGINX.

HAProxy 1.5+ (Jun 2014) supports listening on Unix domain sockets, and NGINX 1.9.0+ (Apr 2015) supports both TCP and HTTP listeners (previously just HTTP). While Linux’s limitations meant that HAProxy couldn’t support hitless reloads using SO_REUSEPORT, HAProxy Unix domain socket listeners have always been zero-downtime because the new listen sockets are atomically moved into place with rename before the old process calls close. We believe there is technically still a very low probability race where the old HAProxy calls close with connections still on its accept() queue. In practice, we have not observed this race, as the new HAProxy binds the sockets first and only then signals the old HAProxy. Since the old HAProxy stops receiving new connections the moment the new one binds, it typically has more than enough time to accept() its entire remaining queue before getting the shutdown signal.

With this understanding, we can create the design shown in Figure 1, where NGINX terminates TCP (or TLS) and proxies back to an instance of HAProxy listening on local Unix sockets for each service. For our load balancers that terminate TLS, we run NGINX with multiple workers. For our internal load balancers that don’t terminate TLS, we explicitly choose to use only stream sections in NGINX to avoid any risk of NGINX messing with our low latency internal traffic (e.g. stripping headers, adding headers, buffering requests to disk, etc.).

Figure 1: Architecture (source)

Results

To show this really works, we’ve created a test setup on a Linux 4.4 (also reproduced on 3.13) Ubuntu Trusty VM running on AWS (4 cores, 7 GB of RAM). We have:

  1. NGINX as the listening proxy
  2. HAProxy as the load balancing proxy
  3. Local NGINX serving a canned response to port 80

All details of the setup can be found in this gist. Our NGINX config is set up with two TCP listeners proxying back to Unix sockets:
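(The exact config lives in the gist; the snippet below is only an illustrative sketch of its shape, with made-up ports and socket paths.)

    worker_processes 4;

    events {
        worker_connections 1024;
    }

    stream {
        # HTTP-mode service: NGINX forwards the raw TCP stream to HAProxy's Unix socket
        server {
            listen 8001;
            proxy_pass unix:/var/run/haproxy-http.sock;
        }

        # TCP-mode service
        server {
            listen 8002;
            proxy_pass unix:/var/run/haproxy-tcp.sock;
        }
    }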

And our HAProxy config listens on those Unix sockets and operates one backend in HTTP mode and one in TCP mode:
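(Again, the real config is in the gist; this is only a sketch, with illustrative socket paths and a placeholder local backend.)

    defaults
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend http_service
        mode http
        bind unix@/var/run/haproxy-http.sock
        default_backend http_service_be

    backend http_service_be
        mode http
        server local_nginx 127.0.0.1:80

    frontend tcp_service
        mode tcp
        bind unix@/var/run/haproxy-tcp.sock
        default_backend tcp_service_be

    backend tcp_service_be
        mode tcp
        server local_nginx 127.0.0.1:80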

We run three interesting stress tests: restarting HAProxy with -sf, reloading NGINX workers, and upgrading NGINX masters. We also run a control where nothing is restarting to get a feel for the typical latency of the system, which you can see in Figure 2. In all of these latency graphs the x-axis shows the progression of the benchmark (left to right) and the y-axis shows the response time as measured by apache benchmark. This is not a heatmap; the vast majority of the data is in the 1-2ms range, but the outliers are what we care about in this analysis.
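For reference, each run drives a listener with an apache benchmark invocation along these lines (port, request count, and concurrency are illustrative):

    # Hammer the HTTP listener and record per-request latency.
    ab -n 200000 -c 10 http://127.0.0.1:8001/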

We observe the control latency to be better than the qdisc approach because qdiscs add a few milliseconds of latency under high concurrency workloads.

Under HAProxy reloads, as seen in Figure 3, there are a few minor latency spikes (< 5ms), which are most easily observed in the overlay in Figure 4.
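The reload under test is HAProxy’s usual -sf dance, roughly as follows (config and pid file paths are illustrative):

    # Start a new HAProxy and hand it the old PID, so the old process finishes
    # its existing connections and then exits.
    haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)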

We can also reload the NGINX configuration and check that:
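For reference, the reload is just the standard signal to the NGINX master process (pid file path is illustrative):

    # Ask the running master to re-read its config and spawn fresh workers.
    nginx -s reload
    # Equivalently, signal the master directly:
    kill -HUP $(cat /var/run/nginx.pid)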

Finally we can check upgrading the NGINX binary:
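This uses NGINX’s standard binary upgrade sequence (pid file paths are illustrative):

    # Start a new master (and workers) running the freshly installed binary.
    kill -USR2 $(cat /var/run/nginx.pid)
    # Gracefully stop the old workers; the old master sticks around as a fallback.
    kill -WINCH $(cat /var/run/nginx.pid.oldbin)
    # Once the new binary looks healthy, shut the old master down for good.
    kill -QUIT $(cat /var/run/nginx.pid.oldbin)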

If we look at NGINX reload/upgrade latency overlaid on the control, we observe in Figure 7 a greater impact on latency when reloading NGINX. This added latency is still extremely small (< 10ms), and in this design NGINX is reloaded so rarely that in practice it is perfectly acceptable.

The results clearly show that by combining both pieces of software we can reload our load balancing proxy (HAProxy) as much as we want, and going forward SmartStack will ensure that everything stays fully automated.

Design Tradeoffs and Alternatives

While designing our solution, we considered a number of alternative designs, and made a set of tradeoffs based on our engineering organization’s preferences and the technologies available at the time. In particular, we explored four main options:

Fix the Linux kernel

To be honest, we were hoping that the kernel developers would fix the long-standing issues with gracefully switching listen ports during the Linux 4.2 rewrite of the TCP listen subsystem, and as far as we are aware they did fix the SO_REUSEPORT three-way handshake bug. Unfortunately, to the best of our knowledge there is still the accept-close race described in this netdev thread.

We could have invested engineering effort in fixing the kernel, but we run a number of versions of the Linux kernel, and the time it would take to engineer this solution and get it integrated simply didn’t make business sense.

Just Use HAProxy

The closest contender is a design that uses two HAProxy instances, where the front HAProxy terminates TLS with multiple processes (as NGINX does in our design), and the back HAProxy listens on Unix sockets and does the actual load balancing. This is possible because HAProxy added pretty good TLS support in versions 1.5/1.6. We decided against it, however, because of the connection issues mentioned above and because our SREs have significant operational experience with NGINX as a TLS termination layer.

At the time we made this decision, we would have had to accept connection issues if we chose to use just HAProxy, but this is no longer true! Recently (April 2017), patches were submitted to HAProxy that should allow perfectly graceful reloads by passing TCP sockets over a local Unix socket and reserving server slots that can be dynamically updated, removing the need for restarts. These patches will hopefully land in version 1.8 in a few months. Given these awesome changes, our next iteration will likely move back to using just HAProxy, improving Synapse and our automation on top of it to take full advantage of these new dynamic features.

Just Use NGINX

Another option would be to just use NGINX, and for simple use cases this is a good option, which is why Synapse now supports NGINX as a first class load balancer. For complex load balancing applications such as ours, however, the open source version of NGINX lacks a number of important load balancing features.

For one, it is very difficult to configure NGINX correctly for transparent reverse proxying, as most HTTP proxy defaults are set up for replacing something like Apache rather than HAProxy. For example, NGINX by default manipulates a number of headers and buffers requests and responses (potentially to disk!). This concern can be solved relatively easily with the right incantations of options, but it’s worth noting that in our experience HAProxy is designed first and foremost to be a load balancer, NGINX is designed first and foremost to be a web server, and these are actually different challenges.
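As a purely illustrative sketch of those incantations (the upstream name is hypothetical; the directives are standard NGINX):

    location / {
        proxy_pass http://haproxy_backend;   # hypothetical upstream name
        proxy_http_version 1.1;

        # Do not buffer requests or responses (buffering can spill to disk).
        proxy_request_buffering off;
        proxy_buffering off;

        # Pass the client's Host header through untouched.
        proxy_set_header Host $host;
    }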

Another major problem is that open source NGINX has no online control interface, so you need to restart the proxy for it to pick up any configuration changes - including downing servers in an emergency. In a PaaS environment, you end up with constant NGINX worker reloads. Combine those constant reloads with long-lived TCP connections, and you can quickly waste gigabytes of memory on every machine in your fleet due to lingering NGINX processes, which must remain running as long as active TCP sessions exist. You can rate limit restarts to save your boxes (as we do with HAProxy), but this then increases the time it takes SmartStack to respond to failed machines. With HAProxy stats socket updates, SmartStack can down a server globally in a few seconds, but with NGINX it may take minutes.
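For comparison, downing a server through the HAProxy stats socket needs no reload at all; a sketch, with an illustrative socket path and backend/server names (the stats socket must be configured at admin level):

    # Take one server out of rotation immediately, with no config change or reload.
    echo "disable server service_be/instance_1" | socat stdio /var/run/haproxy-stats.sock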

NGINX also lacks crucial load balancing tools like statistics, monitoring dashboards, healthchecks (to an extent), and support for complex routing ACLs. These are not theoretical issues: we actively rely on the HAProxy stats socket to quickly up and down servers in an emergency and to monitor application replica health. We also use a number of HAProxy ACLs and routing rules for transparent service instance failover (from one AWS availability zone or region to another) and for universal service caching (an ACL routes to the universal caching instance before routing to the actual instance).
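To make that concrete, here is a purely illustrative sketch of the kind of ACL-driven routing we mean; the backend names and thresholds are made up, and the backend definitions are omitted:

    frontend service_fe
        mode http
        bind unix@/var/run/haproxy-service.sock
        # Route cacheable paths to the universal caching tier first.
        acl cacheable   path_beg /cached/
        # Fail over when the local backend has no healthy servers left.
        acl local_down  nbsrv(service_local) lt 1
        use_backend service_cache  if cacheable
        use_backend service_remote if local_down
        default_backend service_local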

NGINX+, the paid version, does solve some of these problems, but it is expensive (> free) and would require re-tooling.

Use Both

At the time we wrote this solution both proxies had unique benefits, and while it may have been simpler to use only one, using both gave us the ability to get the best of both worlds.

This solution has worked perfectly for our external load balancing for many months, but for our internal load balancing the main tradeoff we had to make is that the system is now significantly more complex: it required a significant re-architecture of SmartStack, of how we configure SmartStack, and a new implementation of SmartStack’s new config_generator API that supports NGINX. The qdisc solution we used for years took only about one week of engineering time; supporting multiple proxies, in contrast, has taken months. We chose to do this because it gave us flexibility in other regards, especially in the face of an ever-changing proxy landscape.

The other major tradeoff is that listening on Unix sockets is not an especially common practice with HAProxy, and it is also not as widely supported in other software. For example, curl only began supporting HTTP over Unix sockets in version 7.40.0+ (Jan 2015). This makes debugging harder and potentially exposes us to uncommon bugs, for example this load-related bug in HAProxy (which we have not observed in production).
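When we do need to poke one of these listeners by hand, a sufficiently recent curl can talk to the Unix socket directly (socket path and URL are illustrative):

    # Speak HTTP to HAProxy's Unix socket listener (requires curl 7.40.0 or newer).
    curl --unix-socket /var/run/haproxy-http.sock http://localhost/status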

Conclusion

Highly available, dynamic load balancing is a constantly changing infrastructure area. In addition to stalwarts like HAProxy and NGINX, we are seeing new players like envoy, linkerd, and vulcand come onto the scene. In this iteration we decided to go with the simplest, most proven technology we could, but with the infrastructure we’ve built around SmartStack it will be very easy to continue iterating and making the best choices for our platform going forwards.

Acknowledgements

This project had a number of key contributors we would like to acknowledge for their design, implementation, and rollout ideas:

  • Josh Snyder for coming up with the idea of using Unix sockets in the first place
  • Evan Krall and John Billings for design feedback and review
