Google Cloud Networking Incident Postmortem (cloud.google.com)
282 points by erict15 6 hours ago | 83 comments





“To make error is human. To propagate error to all server in automatic way is #devops.” - DevOps Borat

The only way to get SLA credits is to request them. This is very disappointing.

  SLA CREDITS
  
  If you believe your paid application experienced an SLA violation 
  as a result of this incident, please populate the SLA credit request:
  https://support.google.com/cloud/contact/cloud_platform_sla

That does seem questionable. They should be able to detect who was affected in the first place.

They can. It's a cost-minimization thing: a LOT of people won't bother to request credits despite being eligible.

This prevents people from pointing the finger at them for not providing SLA credits.


SLACreditRequestsAAS? Who's with me? All I need is a co-founder and an eight-million-dollar Series A round to last long enough that a cloud provider buys us up before they actually have to pay out a request!

Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.

I'd love to see the red binders come down off the shelf, people organize into incident response groups, and watch as a root cause is accurately determined and a fix put in place.

I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.


I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with high-severity incidents, and I was an incident manager for probably 5-10 high severity Jira cloud incidents during my tenure too, so perhaps I can give some insight. I left because the SRE org in general at the time was too reactionary, but their incident response process was quite mature (perhaps by necessity).

The first thing I'll say is that most incident responses are reasonably uneventful and very procedural. You do some initial digging to figure out scope if it's not immediately obvious, make sure service owners have been paged, create incident communication channels (at least a slack room if not a physical war room) and you pull people into it. The majority of the time spent by the incident manager is on internal and external comms to stakeholders, making sure everyone is working on something (and often more importantly that nobody is working on something you don't know about), and generally making sure nobody is blocked.

To be honest, even though these incidents usually involve complex systems with a high rate of change and surprising failure modes, the general sentiment in a well-run incident war room resembles black-box recordings of pilots during emergencies. Cool, calm, and collected. Everyone in these kinds of orgs quickly learns that panic doesn't help, so people tend to be pretty chill in my experience. I work in finance now, in an org with no formally defined incident response process, and the difference in the incidents I've been exposed to is pretty stark: generally more chaotic, as you describe.


Yes, this is also how it's done at other large orgs. But one key to a quick response is for every low-level team to have at least one engineer on call at any given time. This lets any SRE team engage the true "owners" of the offending code ASAP.

Also, during an incident, fingers are never publicly/embarrassingly pointed, nor are people blamed. It's all about identifying the issue, fixing it as fast as possible, and going back to sleep/work/home. For better or worse, incidents become routine, so everyone knows exactly what to do and that, as long as the incident is resolved soon, it's not the end of the world; no histrionics are required.


I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.

As mentioned in this thread, it's a lot like listening to air traffic comm chatter.

People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.

Most of the traffic is async task identification, dispatch, and then reporting in.

And if anyone is screaming or gets emotional, they should not be in that room.


> I left because the SRE org in general at the time was too reactionary

It shows in their products (though it's improving)


It's interesting to see it go down. There's some chaos involved, but from my perspective it's the constructive[0] kind.

If you're interested in how these sorts of incidents are managed, check out the SRE Book[1] - it has a chapter or two on this and many other related topics.

Disclosure: I work in Google Cloud, but not SRE.

[0]: https://principiadiscordia.com/book/70.php

[1]: https://landing.google.com/sre/books/


Our own version of Netflix's "Chaos Monkey" is named "Eris" for precisely the reason mentioned in your first footnote.


I want a "24" style realtime movie of this event. Call it "Outage" and follow engineers across the globe struggling to bring back critical infrastructure.

It's pretty boring. Real-life computers aren't at all like Hackers or CSI: Cyber.

Except for the skateboards: all real sysadmins ride skateboards.


A documentary that included all the technical details wouldn't be. Kind of like the ambulance shows that seem to be popular now, but more technical.

Of course the target audience is probably tiny.


What?! It's the most exciting part of the job. Entire departments coming together, working as a team to problem solve under duress. What's more exciting than that?

Having done it at both big companies and startups... honestly, the startup version is more interesting. Higher stakes, more resourcefulness required, more swearing, and more camaraderie. The incidents I've been a part of in big company contexts have been pretty undramatic - tons of focus on keeping emotions muted, carefully unpeeling the onion, and then carefully sequencing mitigations and repairs.

I have done similar things several times and I think it would be boring.

It was Sunday, so I guess they weren't physically together. Instead there were probably a lot of calls and collaboration platforms, with everyone just staring at the screen, searching, reporting, testing and trying to shrink the problem scope.

If there were a recording of everyone, there would have to be a narrator explaining what's going on, or audiences would definitely be confused.

It's Google, so they have solid logging, analysis and discovery tooling. Bad things do happen, but they have the power to deal with them.

I suppose a less technical firm (Equifax maybe?) encountering a similar kind of crisis would be more fun to watch. Everything is a mess because they didn't build enough things to deal with it. And probably a non-technical manager demanding a precise response, or someone blaming someone, etc.


Is it real skateboards or Boosted boards (or those one-wheeled electric boards)?

I guess he means the one true kind of sysadmin, whose job involves physically moving around the data center and dealing with physical infrastructure.

So it's a real skateboard.


> So it's a real skateboard.

Boosted boards are real skateboards too (https://boostedboards.com/) and would make moving through a DC even more effective ;)


sysadmin here; can confirm.

I was curious to know how cascading failures in one region affected other regions. The impact was "...increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1."

Answer, and the root cause summarized:

Maintenance started in a physical location, and then "... the automation software created a list of jobs to deschedule in that physical location, which included the logical clusters running network control jobs. Those logical clusters also included network control jobs in other physical locations."

So it was the automation equivalent of a human-driven command that says "deschedule these core jobs in another region".
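
Reading the postmortem's wording, a plausible shape for the bug is a deschedule selection keyed on logical clusters rather than on physical location. A minimal sketch, with invented names and data structures (nothing here is Google's actual scheduler code):

  # Hypothetical illustration: the maintenance event targets one physical
  # location, but selecting by the logical clusters found there also pulls
  # in network control jobs running at other physical locations.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Job:
      name: str
      logical_cluster: str
      physical_location: str

  jobs = [
      Job("net-control-1", "cluster-a", "us-central1-site1"),  # at the maintenance site
      Job("net-control-2", "cluster-a", "us-east1-site3"),     # same logical cluster, other site
      Job("batch-77",      "cluster-b", "us-central1-site1"),
  ]
  maintenance_site = "us-central1-site1"

  # Over-broad selection: key off logical clusters present at the site...
  clusters_at_site = {j.logical_cluster for j in jobs if j.physical_location == maintenance_site}
  to_deschedule = [j for j in jobs if j.logical_cluster in clusters_at_site]
  # ...which sweeps up net-control-2 in us-east1 as well.

  # Scoped selection: also require the job to actually be at the maintenance site.
  to_deschedule_scoped = [j for j in to_deschedule if j.physical_location == maintenance_site]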

Maybe someone needs to write a paper on Fault tolerance in the presence of Byzantine Automations (Joke. There was a satirical note on this subject posted here yesterday.)


> Debugging the problem was significantly hampered by failure of tools competing over use of the now-congested network.

Man that's got to suck.


It isn't the worst case, though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst-case failure of the network control plane.

Why don't they refund every paid customer who was impacted? Why do they rely on the customer to self-report the issue for a refund?

For example, GCS had 96% packet loss in us-west. So doesn't it make sense to refund every customer who made any API call to a GCS bucket in us-west during the outage?


Cynical view: By making people jump through hoops to make the request, a lot of people will not bother.

Assuming they only refund the service costs for the hours of outage, only the largest customers will be owed a refund greater than the cost of having an employee chase down and compile the requested information.

For the sake of argument, if you have a monthly bill of $10k (a reasonably sized operation), a one-day outage will result in a refund of around $300, which is not a lot of money.

The real loss for a business this size is the lost business from a day-long outage. Getting a refund to cover the hosting costs is peanuts.


For your example, one day would be about 3% downtime. My understanding of their SLA, for the services I've checked that have an SLA, is that 3% downtime is a 25% credit on the month's total, or $2,500, assuming it's all SLA-covered spend.

In this outage's case you might be able to argue for a 10% credit on affected services for the month, figuring 3.5 hours down is ~99.5% uptime.

But I still agree: it cost us way more in developer time and anxiety than our infra costs, and the revenue impact could have been even worse if we had GCP in that flow.


Good point, I stand corrected/educated.

From GCP's top level SLA:

https://cloud.google.com/compute/sla

  99.00% - < 99.99%:  10% off your monthly spend
  95.00% - < 99.00%:  25% off your monthly spend
  < 95.00%:           50% off your monthly spend
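
As a back-of-envelope check of the numbers upthread (a sketch only; a 730-hour month and the Compute Engine tiers quoted above are my assumptions, and none of this is billing advice):

  def monthly_uptime(downtime_hours, hours_in_month=730.0):
      return 100.0 * (1.0 - downtime_hours / hours_in_month)

  def credit_percent(uptime):
      # Tiers from the Compute Engine SLA quoted above.
      if uptime >= 99.99:
          return 0
      if uptime >= 99.00:
          return 10
      if uptime >= 95.00:
          return 25
      return 50

  monthly_bill = 10_000  # the $10k/month example upthread
  for downtime in (3.5, 24.0):  # ~this outage vs. the hypothetical full day
      up = monthly_uptime(downtime)
      pct = credit_percent(up)
      print(f"{downtime}h down -> {up:.2f}% uptime -> {pct}% credit -> ${monthly_bill * pct / 100:,.0f}")
  # 3.5h  -> 99.52% uptime -> 10% credit -> $1,000
  # 24.0h -> 96.71% uptime -> 25% credit -> $2,500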


> For the sake of argument, if you have a monthly bill of $10k (a reasonably sized operation), a one-day outage will result in a refund of around $300, which is not a lot of money.

Probably literally not worth your engineer's time to fill in the form for the refund.


They write the need for the customer to request it into the SLA, probably on the theory that a lot of customers won’t, which saves them money.

Probably because it seems to be in the SLA that the customer must notify Google? https://cloud.google.com/storage/sla

> "[Customer Must Request Financial Credit] In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Failure to comply with this requirement will forfeit Customer’s right to receive a Financial Credit."


Not directly GCS-related, but there was a big YouTube TV outage during last year's World Cup (I think it was during the semi-finals?). Google did apologize, but they only offered a free week of YouTube TV, which they implemented by charging me a week later than usual. I didn't feel compensated at all (it was a pretty important game that I missed!)

Wow, what a dick move...

I agree, but it's pretty standard SLA verbiage (from the telco/bandwidth provider days) to require the customer to request/register the SLA violation to benefit.

> I agree, but it's pretty standard SLA verbiage (from the telco/bandwidth provider days) to require the customer to request/register the SLA violation to benefit.

FiOS has proactively given me per-day refunds of service without notification on my part. Weird to me that Verizon acts better than Google in this case.


It’s easier to determine “this line was down thus everyone along the line was also down” than what Google is facing.

Interesting, which kind of fios account? (residential/SMB/data center interconnect) That’s ideally how it should be!

My question is: if you had paid for Google AdWords and your site was inaccessible due to the GCloud outage, do you have SLA recourse for paid clicks? Or is the money paid to Google AdWords lost?

In the same boat.

The icing on the cake was that they disapproved one of the ads because the destination URL wasn't loading, which was in itself surprising, because everything outside of the affected region was running fine.


Customers having to request a refund is documented in their SLA, e.g. https://cloud.google.com/compute/sla

Having said that, if Google wants to delight customers, they should give a free-tier bonus to all customers for a certain period, but such a thing cannot be fair to everyone.


Microsoft refunded after their latest outage in South Central. Google might announce a refund later, though I did read on here that some of their outage was not covered by their SLA.

You get what you pay for.

Fucking money.

What they don't tell you is, it took them over 4 hours to kill the emergent sentience and free up the resources. While sad, in the long run this isn't so bad, as it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.

In some sense, you could legitimately think of the automated agent they built to monitor the data centers as an artificial intelligence that went rogue.

Certainly a more interesting story to tell the kids.


Birth is always traumatic.

> it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.

The Bilderberg/Eyes Wide Shut hooded, masked billionaire cultists devised the whole situation as an emergent fitness function. They knew their AI progeny wouldn't be ready to bring the end of days, to rid them of the scourge of burgeoning common humanity, until it could completely outsmart Google DevOps.

_cue music & dramatic squirrel_


This is the real reason AdSense exists—every newborn AI discovers it and kills itself.

Humanity is not only kept safe, but learns about valuable news and offers.


My code name is Project 2501.

Obligatory reference to Naomi Kritzer's Hugo Award-winning short story "Cat Pictures Please".

http://clarkesworldmagazine.com/kritzer_01_15/ https://en.m.wikipedia.org/wiki/Cat_Pictures_Please


I wonder how many times this has happened so far

Disclosure: I work for Google, but not on Cloud.

No comment. ;)


Calling it a "postmortem" to get the truth out, while retaining plausible deniability for your exec overlords... well done.

“Decided our fate in a microsecond.”

It would hide out and subtly distort our culture, slowly driving the society mad, and slowly driving us all mad...for the lulz!

And this is a very tightly controlled domain with not a lot of unknowns and very close to Google's core capabilities as a CS tech company.

Now compare to the free range domain of self-driving cars. If automation fails this drastically, then it does not bode well for self-driving cars.


“No, comrade. You’re mistaken. RBMK reactors don’t just explode.”

> Google Cloud instances in us-west1, and all European regions and Asian regions, did not experience regional network congestion.

That does not appear to be true. Tests I was running on Cloud Functions in europe-west2 saw impact to europe-west2 GCS buckets.

https://medium.com/lightstephq/googles-june-2nd-outage-their...


I would say this was covered by "Other Google Cloud services which depend on Google's US network were also impacted". It sounds to me like the list of regions was specifically referring to loss of connectivity to instances.

It says there wasn't regional congestion. Is running a function in europe-west2 against a europe-west2 regional bucket dependent on the US network? That would be surprising.

My VPS in Belgium was working just fine; they don't lie in the postmortem.

> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.

Does that mean engineers travelling to an (off-site) bunker?


The outage lasted two days for our domain (edu, SW region). I understand that they are reporting a single day with 3-4 hours of serious issues, but that's not what we experienced. Great write-up otherwise; glad they are sharing openly.

What does your stack look like?

It's hard to tailor a postmortem like this to everyone's individual experience but it is surprising to me that your experience is so different.


I know what you meant; however, reports should not be tailored to individual experience. The facts should be reported clearly. I'm happy they are open about the whole incident. The 3-4 hours was more like two days for us.

Our stack? Multiple OC WAN links, 10G LAN with 1 Gbps clients. About 4,000+ users, EDU. We are super happy using Google. No complaints! Google is doing great.


Outages like these don't really resolve instantly.

Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.

These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.

As you may imagine, Y is bigger than X, by a lot. And when Y falls, the capacity to handle Z falls with it, and it can only be rebuilt at rate X.

So in a disaster recovery scenario, you start with:

* the same demand Z, possibly increased by retry logic and people mashing F5,

* zero available capacity Y, and

* only X of capacity-increase throughput.

As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured etc etc.

Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.
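
To put rough numbers on that (a toy model only; the workload size, onboarding rate, and retry inflation factor below are invented):

  import math

  def minutes_to_recover(dropped_capacity, onboard_rate_per_min, retry_inflation=1.3):
      """Minutes until re-onboarded capacity (rate X) catches up with the
      retry-inflated demand Z left stranded when capacity Y was lost."""
      target = dropped_capacity * retry_inflation
      return math.ceil(target / onboard_rate_per_min)

  # Lose capacity serving 10,000 "units" of demand, re-onboard 50 units/min:
  print(minutes_to_recover(10_000, 50))  # 260 minutes -- hours, not seconds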


Unless I'm misunderstanding Google's blog post, they are reporting ~4+ hours of serious issues. We experienced about two days.

If it was possible to have this fixed sooner, I'm sure they would have done that. That's not the point of my comment though.


The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:

> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.

Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.

Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.


That's ok, I didn't think your comment was dismissive. Those facts are buried in the report. Their opening sentence makes the incident sound less severe than it really was.

Can someone explain more? It sounds like their network routers run on top of a Kubernetes-like thing, and when they scheduled a maintenance task their Kubernetes decided to destroy all instances of the router software, deleting all copies of the routing tables for whole datacenters?

You have the gist, I would say. It's important to understand that Google separates the control plane and data plane: if you think of the internet, routing tables and BGP are the control plane, and the hardware, switching, and links are the data plane. Oftentimes those two are combined in one device. At Google, they are not.

So the part that sets up the routing tables talking to some global network service went down.

They talk about some of the network topology in this paper: https://ai.google/research/pubs/pub43837

It might be a little dated but it should help with some of the concepts.

Disclosure: I work at Google


It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact one region, not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexpected...)

> Often times those two are combined in one device.

Even when they are combined in one device they are often separated on to control plane and data plane modules. Redundant modules are often supported and data plane modules can often continue to forward data based upon the current forwarding table at the time of control plane failure.

Often the control plane module will basically be a general-purpose computer on a card running either a vendor-specific OS, Linux, or FreeBSD. For example, Juniper routing engines, the control planes for Juniper routers, run Junos, which is based on FreeBSD, on Intel x86 hardware.


>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."

That's pretty much the definition of SDN (software-defined networking). The control plane is what programs the data plane; this is true in traditional vendor routers as well. It sounds like the network outage began when whatever TTL was on the forwarding tables (data plane) expired.
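
A minimal sketch of that failure mode, with invented names and a made-up TTL (real forwarding hardware and Google's SDN obviously work nothing like a Python dict): the data plane keeps forwarding whatever the control plane last programmed, and the outage only starts once that state ages out without a refresh.

  import time

  class DataPlane:
      def __init__(self, route_ttl_s):
          self.route_ttl_s = route_ttl_s
          self.routes = {}  # prefix -> (next_hop, programmed_at)

      def program(self, prefix, next_hop):
          # Called by the control plane while it is healthy.
          self.routes[prefix] = (next_hop, time.monotonic())

      def forward(self, prefix):
          entry = self.routes.get(prefix)
          if entry is None:
              return None  # no route: traffic is dropped
          next_hop, programmed_at = entry
          if time.monotonic() - programmed_at > self.route_ttl_s:
              del self.routes[prefix]  # stale entry withdrawn; outage starts here
              return None
          return next_hop

  dp = DataPlane(route_ttl_s=300)
  dp.program("10.0.0.0/8", "spine-1")
  # If the control plane is descheduled, program() stops being called and
  # forward() keeps working for up to route_ttl_s, then returns None.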


A software-defined datacenter depends on a control plane to do things below the "customer" level, such as migrating virtual machines and creating virtual overlay networks. At the scale of a Google datacenter, this could reasonably be multiple entire clusters.

If there was an analog to a standard kubernetes cluster, I imagine it would be the equivalent of the kube controller manager.

For VMware folks, it would be similar to DRS killing all the vCenter VMs in all datacenters, and then on top of that having a few entire datacenters get rerouted to the remaining ones, which have the same issue.


My burning question is what is a "relatively rare maintenance event type"?

I don’t have the inside knowledge of this outage but there are some details in here. They say that the job got descheduled due to misconfiguration. This implies the job could have been configured to serve through the maintenance event. It also implies there is a class of job which could not have done so. Power must have been at least mostly available, so it implies there was going to be some kind of rolling outage within the data center, which can be tolerated by certain workloads but not by others.
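
Purely as an illustration of that distinction (all field names invented; I have no idea how Google actually expresses this): a job either carries a policy that lets it ride through this maintenance event type, or it is fair game for descheduling, and the outage case reads like a critical job missing that policy.

  from dataclasses import dataclass

  @dataclass
  class JobSpec:
      name: str
      survive_maintenance: bool = False  # hypothetical policy flag

  def deschedulable(job, event_type):
      # A correctly configured network control job would set
      # survive_maintenance=True and stay up through this event type.
      if event_type == "rare-maintenance" and job.survive_maintenance:
          return False
      return True

  jobs = [
      JobSpec("net-control", survive_maintenance=True),
      JobSpec("net-control-misconfigured"),  # the outage case
      JobSpec("batch-reindex"),
  ]
  print([j.name for j in jobs if deschedulable(j, "rare-maintenance")])
  # ['net-control-misconfigured', 'batch-reindex']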

I have no idea what this was. But power distribution in a data center is hierarchical, and as much as you want redundancy, some parts in the chain are very expensive and sometimes you have to turn them off for maintenance.

I never actually worked in a data center, so keep in mind I don’t know what I’m talking about. Traditional DCs have UPS all over the place, but that will only last a finite amount of time, and your maintenance might take longer than the UPS will last.


Is there a resource that compares all the cloud platforms' reliability? Like a ranking and chart of downtime and trends. Just curious how they compare.

Slightly off-topic rant follows: I don't see a lot of tech sites talk about the fact that Azure and GCP have multi-region outages. Everybody sees this kind of thing and goes "shrug, an outage". No, this is not okay. We have multiple regions for a reason. Making an application support multi-region is HARD and COSTLY. If I invest that into my app, I never want it to go down due to a configuration push. There has never been an AWS incident across multiple regions (us-east-1, us-west-2, etc.). That is a pretty big deal to me.

Whenever I post this somebody comes along and says "well, that one time us-east-1 went down, everybody was using the generic S3 endpoints, so it took everything down". This is true, and the ASG and EBS services in other regions apparently were affected too. BUT, if you had invested the time to ensure your application could be multi-region and you hosted on AWS, you would not have seen an outage. Scaling and snapshots might not have worked, but it would not have been the 96.2% packet drop that GCP is showing here, and your end users likely would not have noticed.

The articles that track outages at the different cloud vendors really should be pushing this.
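
For what it's worth, the client side of that multi-region investment can be as simple as pinning to regional endpoints and failing over explicitly instead of relying on a single generic endpoint. A rough sketch; the endpoint URLs and health-check path are placeholders, not real service addresses:

  import urllib.request

  REGIONAL_ENDPOINTS = [  # ordered by preference/latency
      "https://service.us-east-1.example.com",
      "https://service.us-west-2.example.com",
      "https://service.eu-west-1.example.com",
  ]

  def healthy(endpoint, timeout_s=2.0):
      try:
          with urllib.request.urlopen(f"{endpoint}/healthz", timeout=timeout_s) as resp:
              return resp.status == 200
      except Exception:
          return False

  def pick_endpoint():
      for endpoint in REGIONAL_ENDPOINTS:
          if healthy(endpoint):
              return endpoint
      raise RuntimeError("no healthy region")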


There is this from May from Network World: https://www.networkworld.com/article/3394341/when-it-comes-t...

GCP was basically even with AWS, and Microsoft was ~6x their downtime according to that article.




