The AWS Status Page Will Lie to You
Last week, our web server had trouble connecting to the database. And that sucks. It impacted our customers and disrupted their business.
This especially sucked because we built our systems to be redundant. We use AWS’s RDS to host a Postgres database across multiple availability zones. Moreover, we host our web servers in Elastic Beanstalk across multiple availability zones. With redundancy like that, you’d think we’d be immune to a single part failing.
Not so.
It’s taken a little back and forth, but we finally got AWS support to tell us what happened:
“As I went through the instance IDs and events that happened on 12/01/2017, I figured out that between 1:20 PM and 5:00 PM PST some instances in a single Availability Zone in the US-WEST-1 Region experienced VPC Peering connectivity issues. The issue has been resolved and the service is operating normally.” [emphasis mine]
Or, to put that as simply as I can, the thing that connected our redundancy broke. But it did not break hard enough to trigger a failover.
And I get that: things break, sometimes in unexpected ways. We’re already working on ways to better handle this specific sort of failure.
But if you look at the AWS Status Page, this is what you see:
Over three hours of problems? Still reporting green, eh? I feel lied to.