(cache)SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #115

lex

March 25, 2018

Articles

Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.

Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.

John Allspaw — Adaptive Capacity Labs

Look for the duct tape

Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find that, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.

Rachel Kroll

Incident review: API and Dashboard outage on 10 October 2017

A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.

Chris Sinjakli , Harry Panayiotou , Lawrence Jones , Norberto Lopes and Raúl Naveiras — GoCardless

How Chaos Engineering Can Bring Stability to Your Distributed Systems

Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.

Jennifer Riggins — The New Stack

Pull doesn’t scale – or does it?

The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.

Julius Volz — Prometheus

Zero downtime deployments with containers

A primer on achieving seamless deployments with Docker, including examples.

Jussi Nummelin — Kontena

observability – Food Fight

I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is really excellent, and there are some really thought-provoking moments.

Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler

Enable your Devs to do Ops

How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.

Francesco Negri — Buildo

How Threat Stack Does DevOps (Part IV): Making Engineers Accountable

As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software production.

We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.

Pete Cheslock — Threat Stack

Outages

Data Action
- Data Action is a dependency of many Australian banks.
Travis CI
S3
- Amazon S3 had a pair of outages for connections through VPC Endpoints. The Travis CI, Datadog, and New Relic outages were around the same time, but I can’t tell conclusively whether they were related.
Datadog
New Relic

SRE Weekly Issue #114

lex

March 18, 2018

General

View on sreweekly.com

Articles

Level 3 technician’s misstep causes largest outage ever reported

The FCC has released a report on the major Level 3 outage in October of 2016. This summary article serves as a good TL;DR summary on what went wrong and includes a link to the full report.

Brian Santo — Fierce Telecom

Migrating edge network providers

They had an awesome approach: use RSpec to create a test suite of HTTP requests and run it continuously during the deployment to ensure that nothing changed from the end-user’s perspective. Bonus points for generating tests automatically.

Jacob Bednarz — Envato

Project Nimble: Region Evacuation Reimagined

Netflix reduced the time it takes to evacuate a failed AWS region from 50 minutes to just 8.

Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and Katharina Probst — Netflix

Tonight We Monitor, For Tomorrow, We Test in Production!

I don’t usually link to talks, but this talk transcript reads almost like an article, and it’s a good one. The premise: if you’re not monitoring well, then you can’t safely test in production. Scalyr found a few ways in which their monitoring showed cracks, and now they’re sharing it with us.

Steven Czerwinski — Scalyr

Oopsy DDoSy: Accidental DDoS Attacks Causing Major Grief

Design carefully, especially around retries, lest you create a thundering herd that makes it much harder to recover from an outage. That lesson and more, in this article on shooting yourself in the foot at web scale.

Benjamin Campbell — Business Computing World

How our production team runs the weekly on-call handover

Have I mentioned how much I love GitLab’s openness? Here’s how they handle on-call shift transitions in their remote-only organization.

John Jarvis — GitLab

Twitter: Charity Majors on distributed systems, complexity, and microservices

What is the definition of a distributed system, and why are they difficult? I really love the definition in the second tweet.

Charity Majors

Troubleshooting IPv6 badness to certain hosts in a rack

I sure love a good troubleshooting story. This one has a pretty excellent failure mode, A+ investigative technique, and an emphasis on following something through until you find an answer.

Rachel Kroll

The Makeup of Successful Geographically-Distributed SRE Teams: Part 1 | LinkedIn Engineering

This discussion of how and why to create a globally-distributed SRE team may only apply to bigger companies, but it’s got a lot of useful bits in it. I just have to stop laughing at the acronym “GD”…

Akhil Ahuja — LinkedIn

Outages

DoubleClick (ad provider)
- DoubleClick went down, and it took a lot of sites with it. Click through for Catchpoint’s excellent analysis.
  Kameerath Kareem — Catchpoint
Travis CI
SmartThings (IoT platform)
Air Canada
Netflix

SRE Weekly Issue #113

lex

March 11, 2018

General

View on sreweekly.com

Articles

The New Best Engineer

The best kind of engineer is one that understands not only their specialty, but at least something about the fields adjacent to theirs. The empathy this confers allows one to work incredibly effectively across the company. For SREs, this is even more important.

[…] many of us are finding that the most valuable skill sets sit at the intersection of two or more disciplines.

Charity Majors — Honeycomb

GitLabbers share how to recognize burnout (and how to prevent it)

GitLab held a session about recognizing and preventing burnout at their recent employee summit. They share the best tips in this article, and true to their radically open culture, they also added what they learned to their employee handbook, which is publicly available.

Clement Ho — GitLab

Travis CI Status – Container-based Linux Precise infrastructure emergency maintenance

Here’s a post-analysis for a Travis CI incident early last year. Despite a couple of easy targets that could have been labelled as “root cause”, they instead skillfully laid out all of the contributing factors and left it at that.

Travis CI

Serverless Ops: What do we do when the server goes away?

What indeed? The same thing, just organized differently. There’s a lot of great analysis here about how ops roles can adapt to a serverless infrastructure, and how teams can best make use of ops folks.

Tom McLaughlin — ServerlessOps

On-Call Rotations: How Best to Wake Devs Up in the Middle of the Night

Charity Majors wants you to look forward to on-call. This superb write-up of her recent conference talk explains why folks should think of on-call as an enjoyable privilege and how to shape your on-call to get there.

Jennifer Riggins

Canary Analysis Service

The Canary Analysis Service is Google’s internal tool that automatically analyzes canary runs and decides whether performance has been negatively impacted. My favorite section is the Lessons Learned.

Štěpán Davidovič with Betsy Beyer — ACM Queue

Outages

Snapchat
123 Reg (hosting provider)
- Customers lost files added since 123 Reg’s last valid backup from August, 2017.
partypoker
eBay
Signal and Telegram (messenger apps)
Alexa
- I missed this one last week — it was apparently due to the AWS outage I reported on.
TD Bank
Oculus Rift
- A code-signing certificate expired, rendering some existing VR headsets non-functional. Oculus is issuing a $15 store credit to affected customers.
  
  Because of the particulars of what expired and how it happened, the company wasn’t able to simply push an update out to users because the expired certificate was blocking Oculus’ standard software update system.

SRE Weekly Issue #112

lex

March 4, 2018

General

View on sreweekly.com

Articles

Spooky action at a distance, how an AWS outage ate our load balancer

an outage of a provider that we don’t use, directly or indirectly, resulted in our service becoming unavailable.

I don’t think I even need to add anything to that to make you want to read this article.

Fran Garcia — Hosted Graphite

GitHub’s report on the Memcached-based DDoS

The big story this week is the memcached UDP amplification DDoS method, used to send 1.3 Tbps (!) toward our friends at GitHub. Their description is linked above.

Sam Kottler — GitHub

The internet was alight with related discussions:

Cloudflare’s description of the attack
Akamai’s story about helping GitHub survive the attack
memcached developer announces release that disables UDP by default
with commentary
Charity Majors also had some amusing commentary
Wired’s story on the GitHub attack

Runbook Template

An excellent template that you can use as a basis for writing runbooks.

Catie McCaffrey

DevOps and SRE Contribution – The Lemur Book

This author of an upcoming O’Reilly book is looking for small contributions for a crowd-sourced chapter:

In two paragraphs or less, what do you think is the relationship between DevOps and SRE? How are they similar? How are they different? Can both be implemented at every organization? Can the two exist in the same org at the same time? And so on…

David Blank-Edelman

Meet Bandaid, the Dropbox service proxy

Bandaid started as a reverse proxy that compensated for inefficiencies in our server-side services.

I’m intrigued by the way it handles its queue in last-in first-out order, on the theory that a request that’s been waiting for a long time is likely to be cancelled by its requester.

Dmitry Kopytkov and Patrick Lee — Dropbox

5 of the world’s biggest network outages

This is an amusing recap of five major outages of the past few years. If you’ve been subscribed for awhile, it’ll be review, but I still enjoyed the reminder.

Michael Rabinowitz

Fail-slow at scale: When the cloud stops working

This article summarizes a new research paper on “fail-slow” hardware failures. When hardware only kind of fails, it can often have more disastrous consequences than when it stops working outright.

Robin Harris — Storage Bits

Launching An Entire Fireworks Display At Once

This is an awe-inspiring way to make a point about designing systems to be resilient to human error.

If it’s possible for a human to hit the wrong button and set off an entire fireworks display by accident, then maybe the problem isn’t with the human; it’s with that button.

If it’s possible to mix up minutes and fractions of a second like we’ve done deliberately, then maybe the system isn’t clear, or maybe the pre-launch checklist isn’t thorough enough.

Tom Scott

Managing Feature Flag Debt with Split

There are some really great ideas in this article around preventing and ameliorating the technical debt that can be inherent in the use of feature flags. Ostensibly this article is about using Split.io, but the ideas are broadly applicable.

Adil Aijaz — Split

Outages

Slack
- Possibly due to the AWS outage.Thanks to marc on hangops #incident_response for this one.
AWS us-east-1
Statuspage.io
Abebooks
LinkedIn
AOL Email
Vero (social network)
CoinsMarkets (cryptocurrency exchange)

SRE Weekly Issue #111

lex

February 25, 2018

General

View on sreweekly.com

I’m trying an experiment this week: I’ve included authors at the bottom of each article. I feel like it’s only fair to increase exposure for the folks that put in the significant effort necessary to write articles. It also saves me having to mention names and companies, hopefully leaving more room for useful summaries.

If you like it, great! If not, please let me know why — reply by email or tweet @SREWeekly. I feel like this is the right thing to do from the perspective of crediting authors, but I’d like to know if a significant number of you disagree.

Hat-tip to Developer Tools Weekly for the idea.

Articles

Twitter: Lisa Phillips about on-call compensation

Conversations around compensation for on-call. What has worked or not for you? $$ vs PTO. Alerts vs Scheduled vs Actual Time?1 x 1.5 or 2x?

The replies to her tweet are pretty interesting and varied.

Lisa Phillips, VP at Fastly
Full disclosure: Fastly is my employer.

Twitter: Charity Majors about making on-call suck less

This thread is incredibly well phrased, explaining exactly why it’s important for developer to be on call and how to make that not terrible. Bonus content: the thread also branches out into on-call compensation.

if you aren’t supporting your own services, your services are qualitatively worse **and** you are pushing the burden of your own fuckups onto other people, who also have lives and sleep schedules.

Charity Majors — Honeycomb

The Role of the Incident Commander

This week, Blackrock3 Partners posted an excerpt from their book, Incident Management for Operations that you can read free of charge. If you enjoy it, I highly recommend you sign up for their first-ever open enrollment IMS training course. I know I keep pushing this, but I truly believe that incident response in our industry as a whole will be significantly improved if more people train with these folks.

Oncall and Sustainable Software Development

“On-call doesn’t have to suck” has been a big theme lately, with articles and comments on both sides. Here’s a pile of great advice from my favorite ops heroine.

Charity Majors — Honeycomb

Production postmortem: The unavailable Linux server

An interesting little debugging story involving unexpected SSL server-side behavior.

Ayende Rahien — RavenDB

Couchbase High Availability and Disaster Recovery: Java Multi-Cluster Aware Client

In this post, I’m going to take a look at a sample application that uses the Couchbase Server Multi-Cluster Aware (MCA) Java client. This client goes hand-in-hand with Couchbase’s Cross-Data Center Replication (XDCR) capabilities.

Hod Greeley — Couchbase

Advice to Management Teams While Enrolling Changes to On-Call Systems

Tips for how to go about scaling your on-call policy and procedures in order to be fair and humane to engineers.

Emel Dogrusoz — OpsGenie

Outages

Hurricane Electric (datacenter provider)
BB&T (Bank)
Facebook/Instagram
Stack Overflow
LastPass
TD Bank
The Things Network
- The Things Network is an IoT infrastructure provider.
Hulu
Yahoo
Google Cloud Platform
- An incident on February 18th broke autoscaling and prevented communication between new instances and instances in other zones. The linked post-analysis discusses the failure of a process and of the automated failover process.

← Older Posts

SRE Weekly Issue #115

Articles

Outages

SRE Weekly Issue #114

Articles

Outages

SRE Weekly Issue #113

Articles

Outages

SRE Weekly Issue #112

Articles

Outages

SRE Weekly Issue #111

Articles

Outages

Subscribe

RSS

Search Issues

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Search Issues