The DOs and DON'Ts of Blue/Green Deployment

The term "blue/green deployment" is so misunderstood that we can't even agree on what to call it. Netflix calls it Red/Black Deployment, while others call it A/B Deployment. Personally, I don't even know which color represents which thing, but this is only the beginning of the confusion.

So, I thought it might help to get some thoughts out there, have others weigh in, and see if we can't reach some consensus. After all, it's been 5 years since the oracle spoke. :-)

Here goes...

First off, blue/green deployment is the process of running two sets of machines and switching traffic from Set A to Set B, like this diagram:

Blue/Green deployment

Let's be clear about this - blue/green deployment is not an application-level thing. It's a hardware-level thing. It is not a matter of Set B enabling or disabling features. All traffic goes to the first set, or all traffic goes to the second set. Like pulling the lever at a train track fork, the transition should be very clean.

Application-level changes are things like rolling out features to users on an incremental basis (1%, 5%, 10%, 100%) or A/B testing the color of different buttons. They are the responsibility of the application and should not be done at the hardware level. There are whole companies dedicated to managing changes at this level, and they never touch hardware.

The good folks at Etsy, who do north of 50 deployments to production per day, have talked about how they use config flags (sometimes called feature flags) for application level changes, and have even open sourced their library for doing so.

I'm using the term hardware loosely, only to make it clear that this is not an application-level change. The two sets could be different physical boxes, different VMs, different Docker containers, whatever. Perhaps there is a better term?

OK, let's dive into some implementation DOs and DON'Ts with some AWS-specific examples.

The setup should look like this, with one shared Elastic Load Balancer (ELB) and two Auto Scaling Group (ASG) and Launch Configuration combos (one combo for each set).

AWS blue/green deployment

On the left, all traffic is going to Set A, while on the right, all traffic goes to Set B.
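
To make that concrete, here is a rough sketch of bringing up the second set with boto3. Every name here (the launch configuration, the ASG, the AMI, the subnets) is a placeholder, and the new instances are not registered with the shared ELB until switch-over time (more on that below).

```python
# Rough sketch: create the Launch Configuration and ASG for the new set (Set B).
# All identifiers below are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

# One Launch Configuration per set; this one points at the new build's AMI.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="my-service-set-b-lc",
    ImageId="ami-0abc1234def567890",
    InstanceType="m4.large",
    SecurityGroups=["sg-0123456789abcdef0"],
)

# One ASG per set. The instances it launches get registered with the
# shared ELB when we switch traffic over, not before.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-service-set-b-asg",
    LaunchConfigurationName="my-service-set-b-lc",
    MinSize=3,
    MaxSize=3,
    DesiredCapacity=3,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
)
```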

Here are some best practices we have seen work well.

DO reuse the ELB between deployments

When switching between the two sets of servers, use the same ELB for both sets. There are many reasons for this, as you will see below. One really big reason is that while ELBs are elastically scalable, they are also a black box. You don't have control over how an ELB scales, and if your web service gets any sort of decent load, a new ELB will not be able to scale up fast enough to handle the traffic. This will result in dropped packets, and sad users.

When you use the same ELB for both sets of machines, it is already pre-scaled and ready to go. No calling up your AWS Technical Account Manager (TAM) and asking them to pre-scale an ELB for you.

DON'T switch between versions by changing a DNS record

A simple approach that I see often is to switch between sets of machines by changing what a DNS record points to. This is not clean, and here's why:

DNS clients exist at various levels, and not all of them obey TTL rules. If you are dealing with mobile, particularly internationally, this is even worse (I've seen weeks of trailing traffic with a 300s TTL). Using DNS for blue/green deployments means the switch over will not be clean, and both sets will need to be live for a much longer period of time than they should be, costing you money.

Elastic Beanstalk uses this method (sorry folks).

DO register/deregister instances with/from the ELB

A very clean switch over method is to bring up the new set of instances, wait for them to be ready, and then register them with the ELB. Once they are receiving traffic, deregister the instances in the old set from the ELB. This process takes a few seconds. That's right, it's not instant, nor should it be (a common misconception).
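
In boto3 terms, assuming the same shared Classic ELB and placeholder instance IDs, the switch could look something like this sketch:

```python
# Sketch of the register-then-deregister switch. The ELB name and the
# instance IDs are placeholders.
import boto3

elb = boto3.client("elb")

new_instances = [{"InstanceId": i} for i in ["i-0aaa0001", "i-0aaa0002", "i-0aaa0003"]]
old_instances = [{"InstanceId": i} for i in ["i-0bbb0001", "i-0bbb0002", "i-0bbb0003"]]

# Bring the new set into rotation.
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-service-elb", Instances=new_instances
)

# Wait until every new instance passes the ELB health check.
waiter = elb.get_waiter("instance_in_service")
waiter.wait(LoadBalancerName="my-service-elb", Instances=new_instances)

# Only then take the old set out of rotation; connection draining lets
# in-flight requests finish.
elb.deregister_instances_from_load_balancer(
    LoadBalancerName="my-service-elb", Instances=old_instances
)
```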

Requests can sometimes take a long time. Killing a request halfway through because you want to switch over is suboptimal. Ideally, the switch over process should let all requests to the old set finish gracefully. If your service has long-running requests, consider enabling connection draining on the ELB. Actually, if in doubt, enable connection draining. It's not going to hurt.
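
Enabling it is a one-liner. This sketch assumes the same placeholder ELB name and picks an arbitrary 300-second drain timeout:

```python
# Enable connection draining on the shared Classic ELB (names and timeout
# are placeholders).
import boto3

elb = boto3.client("elb")

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-service-elb",
    LoadBalancerAttributes={
        "ConnectionDraining": {"Enabled": True, "Timeout": 300}
    },
)
```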

DO use ELB health checks

We all make mistakes, and sending traffic to a new set of instances when they aren't ready to receive it can really make for a bad day. Java apps running on Tomcat are a classic example. Even when the EC2 instance has finished its boot sequence, there is so much that can happen at the application level before the app is ready for traffic - initializing data structures, reading configuration, configuring logging, warming a cache. Create an endpoint in the application and let it tell you when it is ready to receive traffic.
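
Then point the ELB health check at that endpoint rather than a bare TCP port check. A sketch, assuming the app exposes a /healthcheck path on port 8080 (both placeholders):

```python
# Point the ELB health check at an application readiness endpoint.
import boto3

elb = boto3.client("elb")

elb.configure_health_check(
    LoadBalancerName="my-service-elb",
    HealthCheck={
        "Target": "HTTP:8080/healthcheck",  # app-level readiness, not just "the port is open"
        "Interval": 10,
        "Timeout": 5,
        "HealthyThreshold": 2,
        "UnhealthyThreshold": 2,
    },
)
```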

DON'T use CloudFormation to orchestrate this

CloudFormation is fantastic at bringing up AWS resources that don't change often. I'm talking about VPCs, subnets, ELBs, security groups, RDS instances, IAM roles, etc. For resources that change often, like performing blue/green deployments multiple times a day, CloudFormation is not a fit. It has no workflow management, and when things go wrong, you won't have the control you need to avoid downtime.

DO use a purpose built tool

To handle the ASG and LaunchConfig lifecycle, and switching between sets of instances, use a tool that was custom built for the job. Two come to mind:

Netflix built Asgard to handle deployment of their 100+ microservices. The project was started over 5 years ago, when AWS was a very different place. As a result, Asgard grew way beyond just deployment, and parts of it have been made largely redundant by advancements in IAM and the AWS Web Console. The deployment process, however, is still golden.

We implemented the same blue/green deployment process in Delta, updating it to use current AWS features and making it available as a SaaS, rather than yet another thing you need to install, configure, and maintain.

Delta is also a very good citizen in your AWS account, tagging resources, and cleaning up after itself. Asgard, not so much, necessitating Janitor Monkey to clean up the resources left behind.

If these tools aren't your thing, you might try to implement the process using the AWS modules in Ansible. After all, they have modules for ASGs, ELB, etc. Be warned though, there are a lot of sharp edges and failure cases - way too many to list here. Asgard covers most of them, and Delta covers a few more. If you're interested, ping me and we can discuss further.

DON'T change an ASG's LaunchConfig

There is another approach I have seen where a new Launch Configuration is made, perhaps pointing to a new AMI, containing different user-data for cloud-init, or using a different instance type. The existing ASG is then updated to use this new LaunchConfig - something that can be done quite easily by updating a CloudFormation template.
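
In boto3 terms (the group and LaunchConfig names here are made up), the whole "deployment" boils down to a single call:

```python
# The anti-pattern, for illustration: point the existing ASG at a new
# Launch Configuration. All names are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-service-asg",
    LaunchConfigurationName="my-service-v2-lc",  # new AMI / user-data / instance type
)
```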

What happens next? Nothing. The new LaunchConfig does not make changes to instances that are already running. It is not until a new instance is required, either by scaling out or replacing a dead instance, that the new LaunchConfig is used. This is more like a hardware-level rolling update, and comes with all of the same downsides:

  • There is no clean and predictable switch over
  • When something goes wrong on the 10th instance in a 15 instance cluster, rollback is a world of pain. You need to update the ASG to use the previous LaunchConfig, then terminate every instance using the new LaunchConfig, and wait for new instances to launch, initialize, and be ready to handle traffic.
  • It is not immediately obvious which instances are using which LaunchConfig (multiple levels of indirection)
  • If the above is ignored and a user-facing change is present in the second LaunchConfig, users will randomly switch between new and old instances on each request. You can solve this by using a sticky cookie, but instance-level affinity is bad bad bad. Remember, instances are cattle, not pets.

DON'T synchronize anything else with your deployment

I am talking about deploying other services at the same time, performing database migrations, and running other scripts. Change one thing, and one thing only. Tools like Capistrano make this so easy that it creates a false sense of security. As soon as you need to synchronize a code deployment with some other process, you are going to suffer downtime. Maybe not at first, but this does not work at scale. The synchronization will break and it will be bad. Finagle's Law. Plan for it sooner rather than later.

Anything that can go wrong, will - at the worst possible moment.

So there you have it. This is how we think about blue/green deployment. Agree? Disagree?

