The only way to get SLA credits is to request them. This is very disappointing.
SLA CREDITS
If you believe your paid application experienced an SLA violation
as a result of this incident, please populate the SLA credit request:
https://support.google.com/cloud/contact/cloud_platform_sla
SLACreditRequestsAAS? Who's with me? All I need is a co-founder and an eight-million-dollar Series A round to last long enough that a cloud provider buys us up before they actually have to pay out a request!
Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.
I'd love to see the red binders come down off the shelf, people organize into incident response groups, and watch as a root cause is accurately determined and a fix put in place.
I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.
I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with high-severity incidents, and I was an incident manager for probably 5-10 high-severity Jira Cloud incidents during my tenure too, so perhaps I can give some insight. I left because the SRE org in general at the time was too reactive, but their incident response process was quite mature (perhaps by necessity).
The first thing I'll say is that most incident responses are reasonably uneventful and very procedural. You do some initial digging to figure out scope if it's not immediately obvious, make sure service owners have been paged, create incident communication channels (at least a slack room if not a physical war room) and you pull people into it. The majority of the time spent by the incident manager is on internal and external comms to stakeholders, making sure everyone is working on something (and often more importantly that nobody is working on something you don't know about), and generally making sure nobody is blocked.
To be honest, despite the fact that you're more often dealing with complex systems with a higher rate of change and often surprising failure modes, the general sentiment in a well-run incident war room resembles black box recordings of pilots during emergencies: cool, calm, and collected. Everyone in these kinds of orgs tends to learn quickly that panic doesn't help, so people tend to be pretty chill in my experience. I work in finance now in an org with no formally defined incident response process, and the difference in the incidents I've been exposed to is pretty stark: generally more chaotic, as you describe.
Yes this is also how it's done at other large orgs. But one key to a quick response is for every low-level team to have at least one engineer on call at any given time. This makes it so any SRE team can engage with true "owners" of the offending code ASAP.
Also, during an incident, fingers are never publicly/embarrassingly pointed, nor are people blamed. It's all about identifying the issue as fast as possible, fixing it, and going back to sleep/work/home. For better or worse, incidents become routine, so everyone knows exactly what to do and that as long as the incident is resolved soon, it's not the end of the world, so no histrionics are required.
I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.
As mentioned in this thread, it's a lot like listening to air traffic comm chatter.
People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.
Most of the traffic is async task identification, dispatch, and then reporting in.
And if anyone is screaming or gets emotional, they should not be in that room.
It's interesting to see it go down. There's some chaos involved, but from my perspective it's the constructive[0] kind.
If you're interested in how these sorts of incidents are managed, check out the SRE Book[1] - it has a chapter or two on this and many other related topics.
I want a "24" style realtime movie of this event. Call it "Outage" and follow engineers across the globe struggling to bring back critical infrastructure.
What?! It's the most exciting part of the job. Entire departments coming together, working as a team to problem solve under duress. What's more exciting than that?
Having done it at both big companies and startups... honestly, the startup version is more interesting. Higher stakes, more resourcefulness required, more swearing, and more camaraderie. The incidents I've been a part of in big company contexts have been pretty undramatic - tons of focus on keeping emotions muted, carefully unpeeling the onion, and then carefully sequencing mitigations and repairs.
I have done similar things several times and I think it would be boring.
It's Sunday, so I guess they're not together. Instead there would be a lot of calls and work on collaboration platforms. Everyone just staring at the screen, searching, reporting, testing, and trying to shrink the problem scope.
If there were a recording of everyone, there would have to be a narrator explaining what's going on, or audiences would definitely be confused.
It's Google, so they have solid logging, analysis, and discovery tooling. Bad things do happen, but they have the means to deal with them.
I suppose less technical firms (Equifax maybe?) encountering a similar kind of crisis would be more fun to look at. Everything is a mess because they didn't build enough to deal with it. And there's probably a non-technical manager demanding a precise response, or someone blaming someone, etc.
I was curious to know how cascading failures in one region affected other regions. Impact was "...increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1."
Answer, and the root cause summarized:
Maintenance started in a physical location, and then "... the automation software created a list of jobs to deschedule in that physical location, which included the logical clusters running network control jobs. Those logical clusters also included network control jobs in other physical locations."
So it was the automation equivalent of a human-driven command that says "deschedule these core jobs in another region".
Maybe someone needs to write a paper on Fault tolerance in the presence of Byzantine Automations (Joke. There was a satirical note on this subject posted here yesterday.)
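To make the scope bug concrete, here's a tiny Python sketch of roughly that selection logic. Everything in it (cluster names, job names, locations) is invented; it just shows how "deschedule this physical location" can fan out through logical clusters into other locations.

    # Hypothetical sketch of the descheduling scope bug described above.
    # Cluster, job, and location names are all invented for illustration.

    # Logical clusters can span physical locations.
    LOGICAL_CLUSTERS = {
        "net-control-a": [
            {"job": "netctl-1", "location": "us-east1-pod7"},
            {"job": "netctl-2", "location": "us-west2-pod3"},  # elsewhere!
        ],
        "batch-b": [
            {"job": "batch-1", "location": "us-east1-pod7"},
        ],
    }

    def clusters_in_location(location):
        """Every logical cluster with at least one job in this location."""
        return [
            name for name, jobs in LOGICAL_CLUSTERS.items()
            if any(j["location"] == location for j in jobs)
        ]

    def jobs_to_deschedule(location):
        """Buggy scope expansion: physical location -> logical clusters ->
        ALL jobs in those clusters, with no second filter on location."""
        selected = []
        for name in clusters_in_location(location):
            selected.extend(LOGICAL_CLUSTERS[name])
        return selected

    # Maintenance scoped to one physical location...
    for job in jobs_to_deschedule("us-east1-pod7"):
        print("descheduling", job["job"], "in", job["location"])
    # ...also deschedules netctl-2 over in us-west2-pod3.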
It isn't a worst-case though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst case failure of the network control plane.
Why don't they refund every paid customer who was impacted? Why do they rely on the customer to self report the issue for a refund?
For example GCS had 96% packet loss in us-west. So doesn't it make sense to refund every customer who had any API call to a GCS bucket on us-west during the outage?
Cynical view: By making people jump through hoops to make the request, a lot of people will not bother.
Assuming they only refund the service costs for the hours of outage, only the largest customers will be owed a refund greater than the cost of an employee chasing down and compiling the information requested.
For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
The real loss for a business this ^ size is lost business from a day long outage. Getting a refund to cover the hosting costs is peanuts.
For your example, one day would be about 3% downtime. My understanding of their SLA, for the services I've checked that have an SLA, is that 3% downtime is a 25% credit on the month's total, or $2,500, assuming it's all SLA-covered spend.
In this outage's case you might be able to argue for a 10% credit on affected services for the month, figuring 3.5 hours down is about 99.5% uptime.
But I still agree: it cost us way more in developer time and anxiety than our infra costs, and it could have been even worse, revenue-impacting, if we had GCP in that flow.
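If anyone wants to sanity-check those numbers, here's a rough Python sketch using credit tiers along the lines of the Compute Engine SLA (10% credit below 99.99% uptime, 25% below 99.0%, 50% below 95.0%). The exact thresholds and percentages vary per service, so treat this as illustrative, not billing advice.

    # Rough sketch of how SLA credits scale with downtime, using
    # Compute-Engine-style tiers; check each service's actual SLA.
    HOURS_IN_MONTH = 30 * 24  # 720

    def credit_fraction(uptime_pct):
        if uptime_pct >= 99.99:
            return 0.0
        if uptime_pct >= 99.0:
            return 0.10
        if uptime_pct >= 95.0:
            return 0.25
        return 0.50

    def sla_credit(monthly_bill, downtime_hours):
        uptime = 100.0 * (HOURS_IN_MONTH - downtime_hours) / HOURS_IN_MONTH
        return round(uptime, 2), monthly_bill * credit_fraction(uptime)

    print(sla_credit(10_000, 24))    # (96.67, 2500.0) -- the 25% tier
    print(sla_credit(10_000, 3.5))   # (99.51, 1000.0) -- the 10% tier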
> For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
Probably literally not worth your engineer's time to fill in the form for the refund.
> "[Customer Must Request Financial Credit] In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Failure to comply with this requirement will forfeit Customer’s right to receive a Financial Credit."
Not directly GCP-related, but there was a big YouTube TV outage during last year's World Cup (I think it was during the semi-finals?). Google did apologize, but they only offered a free week of YouTube TV, which they implemented by charging me a week later than usual. I didn't feel compensated at all (it was a pretty important game that I missed!)
I agree, but it’s pretty standard SLA verbiage (from the telco/bandwidth provider days) to require the customer to request/register the SLA violation to benefit.
> I agree, but it’s pretty standard SLA verbiage (from the telco/bandwidth provider days) to require the customer to request/register the SLA violation to benefit.
FiOS has proactively given me per-day refunds of service without notification on my part. Weird to me that Verizon acts better than Google in this case.
My question is if you had to pay for Google AdWords and your site was inaccessible due to GCloud outage, do you have recourse on SLA for paid clicks? Or is that money paid to Google AdWords lost?
The icing on the cake was that they disapproved one of the ads due to the destination URL not loading, which was in itself surprising, because everything outside of the affected region was running fine.
Having said that, if Google wants to delight customers, they should give a free tier bonus to all customers for a certain period, but such a thing cannot be fair to everyone.
Microsoft refunded after their latest outage in South Central. Google might announce a refund later, though I did read on here that some of their outage was not covered by their SLA.
What they don't tell you is, it took them over 4 hours to kill the emergent sentience and free up the resources. While sad, in the long run this isn't so bad, as it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
In some sense, you could legitimately think of the automated agent they built to monitor the data centers as an artificial intelligence that went rogue.
Certainly a more interesting story to tell the kids.
> it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
The Bilderberg/Eyes Wide Shut hooded, masked billionaire cultists devised the whole situation as an emergent fitness function. They knew their AI progeny wouldn't be ready to bring the end of days, to rid them of the scourge of burgeoning common humanity, until it could completely outsmart Google DevOps.
I would say this was covered by "Other Google Cloud services which depend on Google's US network were also impacted". It sounds to me like the list of regions was specifically referring to loss of connectivity to instances.
It says there wasn't regional congestion. Is running a function in europe-west2 against a europe-west2 regional bucket dependent on the US network? That would be surprising.
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
Does that mean engineers travelling to an (off-site) bunker?
The outage lasted two days for our domain (edu, sw region). I understand that they are reporting a single day, 3-4 hours of serious issues but that’s not what we experienced. Great write up otherwise, glad they are sharing openly
I know what you meant; however, reports should not be tailored to individual experience. The facts should be reported clearly. I'm happy they are open about the whole incident. 3-4 hours was more like two days for us.
Our stack? Multiple OC WAN links, 10G LAN with 1Gbps clients. About 4,000+ users, EDU. We are super happy using Google. No complaints! Google is doing great.
Outages like these don't really resolve instantly.
Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.
These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.
As you may imagine, Y is bigger than X, by a lot. And when Y falls away, you can only rebuild it at rate X, so the capacity to handle Z falls behind.
So in a disaster recovery scenario, you start with:
* the same demand, possibly increased from retry logic & people mashing F5, of Z
* zero available capacity, Y, and
* only X capacity-increase-throughput.
As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured etc etc.
Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.
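Here's a toy simulation of that dynamic, with completely made-up numbers, just to show the shape of the recovery curve: capacity Y gets wiped, it can only be rebuilt at rate X, and retries inflate demand Z until things start succeeding again.

    # Toy model of post-outage recovery. All numbers are invented; the
    # point is the shape of the curve, not the values.
    X = 50            # capacity re-onboarded per minute
    Y_NORMAL = 1000   # normal serving capacity
    Z_NORMAL = 800    # normal demand

    capacity = 0      # everything was just descheduled
    for minute in range(0, 31, 5):
        # Retries and F5-mashing inflate demand while errors persist.
        demand = Z_NORMAL * (1.5 if capacity < Z_NORMAL else 1.0)
        served = min(capacity, demand)
        print(f"t={minute:>2}m capacity={capacity:5.0f} "
              f"demand={demand:5.0f} shortfall={demand - served:5.0f}")
        capacity = min(Y_NORMAL, capacity + X * 5)   # 5 minutes of rebuild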
The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:
> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.
Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.
Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.
That’s ok, I didn’t think your comment was dismissive. Those facts are buried in the report. Their opening sentence makes the incident sound lesser than what it really was.
Can someone explain more? It sounds like their network routers run on top of a Kubernetes-like thing, and when they scheduled a maintenance task their Kubernetes decided to destroy all instances of the router software, deleting all copies of the routing tables for whole datacenters?
You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not.
So the part that sets up the routing tables talking to some global network service went down.
It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact one region and not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexpected...)
> Often times those two are combined in one device.
Even when they are combined in one device they are often separated on to control plane and data plane modules. Redundant modules are often supported and data plane modules can often continue to forward data based upon the current forwarding table at the time of control plane failure.
Often the control plane module will basically be a general purpose computer on a card running either a vendor specific OS, Linux or FreeBSD. For example Juniper routing engines, the control planes for Juniper routers, run Junos which is a version of FreeBSD on Intel X86 hardware.
>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."
That's pretty much the definition of SDN (software-defined networking). The control plane is what programs the data plane; this is also true in traditional vendor routers. It sounds like the network outage began when whatever TTL was on the forwarding tables (data plane) was reached.
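As an illustration of that failure mode (emphatically not how Google's fabric actually works), here's a toy forwarding table whose entries are programmed by a control plane and expire on a TTL. With the control plane dead, traffic keeps flowing until the entries age out:

    # Illustrative only: forwarding keeps working on the last-programmed
    # state until entries age out with nobody left to refresh them.
    import time

    class ForwardingTable:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.routes = {}  # prefix -> (next_hop, programmed_at)

        def program(self, prefix, next_hop):
            """Called by the control plane while it is alive."""
            self.routes[prefix] = (next_hop, time.monotonic())

        def lookup(self, prefix):
            """Called on the data path; drops once the entry is stale."""
            entry = self.routes.get(prefix)
            if entry is None:
                return None
            next_hop, programmed_at = entry
            if time.monotonic() - programmed_at > self.ttl:
                del self.routes[prefix]   # aged out, start dropping
                return None
            return next_hop

    fib = ForwardingTable(ttl_seconds=2.0)
    fib.program("10.0.0.0/8", "spine-1")  # control plane programs a route
    print(fib.lookup("10.0.0.0/8"))       # still forwards, control plane gone
    time.sleep(2.5)
    print(fib.lookup("10.0.0.0/8"))       # None: the outage becomes visible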
Software defined datacenter depends on a control plane to do things below the "customer" level, such as migrate virtual machines and create virtual overlay networks. At the scale of a Google datacenter, this could reasonably be multiple entire clusters.
If there was an analog to a standard kubernetes cluster, I imagine it would be the equivalent of the kube controller manager.
For VMware guys, it would be similar to DRS killing all the vCenter VMs in all datacenters, and then on top of that having a few entire datacenters get rerouted to the remaining ones, which have the same issue.
I don’t have the inside knowledge of this outage but there are some details in here. They say that the job got descheduled due to misconfiguration. This implies the job could have been configured to serve through the maintenance event. It also implies there is a class of job which could not have done so. Power must have been at least mostly available, so it implies there was going to be some kind of rolling outage within the data center, which can be tolerated by certain workloads but not by others.
I have no idea what this was. But power distribution in a data center is hierarchical, and as much as you want redundancy, some parts in the chain are very expensive and sometimes you have to turn them off for maintenance.
I never actually worked in a data center, so keep in mind I don’t know what I’m talking about. Traditional DCs have UPS all over the place, but that will only last a finite amount of time, and your maintenance might take longer than the UPS will last.
Slightly off topic rant follows: I don't see a lot of tech sites talk about the fact that Azure and GCP have multi-region outages. Everybody sees this kind of thing and goes "shrug, an outage". No, this is not okay. We have multiple regions for a reason. Making an application support multi-region is HARD and COSTLY. If I invest that into my app, I never want it to go down due to a configuration push. There has never been an AWS incident across multiple regions (us-east-1, us-west-2, etc). That is a pretty big deal to me.
Whenever I post this somebody comes along and says "well that one time us-east-1 went down and everybody was using the generic S3 endpoints so it took everything down". This is true, and the ASG and EBS services in other regions apparently were affected. BUT, if you invested the time to ensure your application could be multi-region and you hosted on AWS, you would not have seen an outage. Scaling and snapshots might not have worked, but it would not have been the 96.2% packet drop that GCP is showing here, and your end users likely would not have noticed.
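For what it's worth, the "generic endpoint" pitfall is cheap to avoid by pinning clients to regional endpoints. A minimal boto3 sketch (bucket and region names are placeholders, and whether this saves you obviously depends on the rest of your stack):

    # Pinning S3 clients to regional endpoints instead of the legacy
    # global endpoint; names below are placeholders.
    import boto3

    # Region-agnostic client on the legacy global endpoint.
    global_client = boto3.client("s3", endpoint_url="https://s3.amazonaws.com")

    # Explicitly regional client: requests for a us-west-2 bucket go
    # straight to the us-west-2 endpoint.
    regional_client = boto3.client(
        "s3",
        region_name="us-west-2",
        endpoint_url="https://s3.us-west-2.amazonaws.com",
    )

    # One client per region makes it obvious which regions your code
    # depends on, and makes failover an explicit decision.
    objects = regional_client.list_objects_v2(Bucket="example-bucket-us-west-2")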
The articles that track outages at the different cloud vendors really should be pushing this.