Google Cloud Networking Incident Postmortem (June 2019) (cloud.google.com)
24 points by erict15 43 minutes ago | 3 comments

Why don't they refund every paid customer who was impacted? Why do they rely on the customer to self-report the issue for a refund?

For example, GCS had 96% packet loss in us-west. So doesn't it make sense to refund every customer who made any API call to a GCS bucket in us-west during the outage?
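
A minimal sketch of what that could look like, assuming the provider can export a flattened per-request log. Everything here is an assumption for illustration: the RequestLogRecord fields, the "us-west" region prefix, and the outage window are placeholders, not the real GCS log schema or the exact incident timeline.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    # Hypothetical flattened per-request log record. Field names, region
    # values, and the outage window are illustrative placeholders, not the
    # real GCS usage-log schema or the exact incident timeline.
    @dataclass
    class RequestLogRecord:
        billing_account: str
        bucket_location: str      # e.g. "us-west1"
        timestamp: datetime

    OUTAGE_START = datetime(2019, 6, 2, 19, 0, tzinfo=timezone.utc)   # assumed window
    OUTAGE_END   = datetime(2019, 6, 2, 23, 30, tzinfo=timezone.utc)  # assumed window

    def impacted_accounts(records):
        """Every billing account that hit a us-west bucket during the window,
        i.e. the set that could be credited automatically rather than having
        to self-report."""
        return {
            r.billing_account
            for r in records
            if r.bucket_location.startswith("us-west")
            and OUTAGE_START <= r.timestamp <= OUTAGE_END
        }

    # Example:
    logs = [
        RequestLogRecord("acct-123", "us-west1",
                         datetime(2019, 6, 2, 20, 15, tzinfo=timezone.utc)),
        RequestLogRecord("acct-456", "europe-west1",
                         datetime(2019, 6, 2, 20, 15, tzinfo=timezone.utc)),
    ]
    print(impacted_accounts(logs))   # {'acct-123'}

In practice you'd run the same set-building query over the billing or audit log export rather than an in-memory list, but the point stands: the provider already has the data to credit affected accounts without waiting for SLA claims.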


Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.

I'd love to see the red binders come down off the shelf, watch people organize into incident response groups, and see a root cause accurately determined and a fix put in place.

I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.


What they don't tell you is, it took them over 4 hours to kill the emergent sentience and free up the resources. In the long run this isn't so bad, as it just adds an evolutionary pressure on the further incarnations of the AI to keep things on the down low.


