Disclosure: I work on Google Cloud (but disclaimer, I'm on vacation and so not much use to you!).
We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.
There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.
To clarify something: this outage doesn’t appear to be global, but it is hitting us particularly hard in parts of the US. So for the folks with working VMs in Mumbai, you’re not crazy. But for everyone with sadness in us-central1, the team is on it.
It seems global to me. This is really strange compared to AWS. I don't remember an outage there (other than S3) impacting instances or networking globally.
Back when S3 failures would take down Reddit, parts of Twitter... Netflix survived because they had additional availability zones. I can remember when the bigger names started moving more stuff to their own data centers.
AWS tries to lock people in to specific services now, which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable... and then if you try migrating, you're stuck using so many of their services you can't even do cost comparisons.
Netflix actually added the additional AZs because of a prior outage that did take them down.
"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."
We went multi-region as a result of the 2012 incident. Source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).
We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.
I think some Google engineers published a free MEAP book on service reliability and uptime guarantees. It seems counterintuitive, but scheduling downtime without other teams' prior knowledge encourages teams to handle outages properly and reduce single points of failure, among other things.
I am not sure a single S3 outage pushed any big names into their own "datacenter". S3 still holds the world record for reliability, and you cannot challenge that with in-house solutions. Prove me otherwise if you can. I would love to hear about a solution that has the same durability, availability and scalability as S3.
For the downvoters, please just link here the proof if you disagree.
I don't see why multi/hybrid would have lower downtime. All cloud providers as far as I know (though I mostly know AWS) already have their services in multiple data centers and their endpoints in multiple regions. So if you make yourself use more than one of their AZs and Regions, you would be just as multi as with your own data centers.
Using a single cloud provider with a multiple region setup won't protect you from some issues in their networking infrastructure, as the subject of this thread supposedly shows.
Although I guess depending on how your own infrastructure is set up, even a multi-cloud-provider setup won't save you from a network outage like the current Google Cloud one.
Actually, I imagine that if you could go multi-regional then your self-managed solution may be directly competitive in terms of uptime. The idea that in-house can't be multi-regional is a bit old fashioned in 2019.
How can they possibly guarantee eleven nines? Considering I’ve never heard of this company and they offer such crazy-sounding improvements over the big three, it feels like there should be a catch.
11 9s isn't uncommon. AWS S3 does 11 9s (up to 16 9s with cross-region replication?) for data durability, too. AFAIK, AWS published papers about their use of formal methods to make sure bugs from other parts of the system didn't creep in and affect durability/availability guarantees: https://blog.acolyer.org/2014/11/24/use-of-formal-methods-at...
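For intuition, the usual back-of-the-envelope reading of 11 9s of annual durability goes like this (rough arithmetic, not anyone's official guarantee calculation):

    # What 99.999999999% annual durability means in practice (rough arithmetic).
    annual_loss_probability = 1 - 0.99999999999      # ~1e-11 per object per year
    objects_stored = 10_000_000
    expected_losses_per_year = objects_stored * annual_loss_probability   # ~1e-4
    print(1 / expected_losses_per_year)   # ~10,000 years, on average, to lose one object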
The only regions that are more expensive than us-east-1 in the States are GovCloud and us-west-1 (Bay Area). Both us-west-2 (Oregon) and us-east-2 (Ohio) are priced the same as us-east-1.
I would probably go with us-east-2 just because it's isolated from anything except perhaps a freak tornado and better situated in the eastern US. Latency to/from there should be near optimal for most of the eastern US/Canada population.
Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.
Years ago, when I was playing with AWS in a course on building cloud-hosted services, it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage, so while technically all our VMs hosted out of other AZs were extant, all our attempts to manage our VMs via the web UI or API were timing out or erroring out.
I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.
(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)
> it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage
Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.
And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.
"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."
There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.
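For what it's worth, the global-vs-regional split shows up directly in the client endpoints; a quick sketch with boto3, assuming default endpoint resolution and no custom endpoints configured:

    # IAM resolves to a single global endpoint, while most services are regional.
    import boto3

    iam = boto3.client("iam", region_name="eu-west-1")
    ec2 = boto3.client("ec2", region_name="eu-west-1")
    print(iam.meta.endpoint_url)   # https://iam.amazonaws.com  (no region in the hostname)
    print(ec2.meta.endpoint_url)   # https://ec2.eu-west-1.amazonaws.com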
Where are you based? If you’re in the US (or route through the US) and trying to reach our APIs (like storage.googleapis.com), you’ll be having a hard time. Perhaps even if the service you’re trying to reach is say a VM in Mumbai.
I have an instance in us-west-1 (Oregon) which is up, but an instance in us-west-2 (Los Angeles) which is down. Not sure if that means Oregon is unaffected though.
What I said is correct for AWS. In retrospect I guess the context was a bit ambiguous.
(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)
I’m from the US and in Australia right now. Both me and my friends in the US are experiencing outages across google properties and Snapchat, so it’s pretty global.
I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.
Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.
Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because Networking. The Networking outage is the root cause.
AWS experienced a major outage a few years ago that couldn't be communicated to customers because it took out all the components central to update the status board. One of those obvious-in-hindsight situations.
Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)
I'm guessing this will be part of the next DiRT exercise :-) (DiRT being the disaster recovery exercises that Google runs internally to prepare for this sort of thing)
Can't use my Nest lock to let guests into my house. I'm pretty sure their infrastructure is hosted in Google Cloud. So yeah... definitely some stuff lost.
You have my honest sympathy because of the difficulties you now suffer through, but it bears emphasizing: this is what you get when you replace what should be a physical product under your control with an Internet-connected service running on third-party servers. IoT as seen on the consumer market is a Bad Idea.
I am pretty sure there are smart locks that don't rely on an active connection to the cloud. The lock downloads keys when it has a connection, and a smartphone can download keys too. This means they work even if there is no active internet connection at the moment someone tries to open the lock. (If the connection was dead the entire time between creating the new key and the person trying to use the lock, it still wouldn't work.)
If there are no locks that work this way, it sure seems like there should be. Using cloud services to enable cool features is great. But if those services are not designed from the beginning with a fallback for when the internet/cloud isn't live, that's a weakness that is often unwise to leave in place, imo.
It may not be worth the complexity to give users the choice. If I were to issue keys to guests this way I would want my revocations to be immediately effective no matter what. Guest keys requiring a working network is a fine trade-off.
You can have this without user intervention - have the lock download an expiration time with the list of allowed guest keys, or have the guest keys public-key signed with metadata like expiration time.
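A minimal sketch of the signed-metadata variant, assuming the lock was provisioned with the issuer's Ed25519 public key at setup time; the function and field names are illustrative, not any real vendor's API:

    # Lock-side check of a guest key, done entirely offline.
    import json, time
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def is_guest_key_valid(issuer_key: Ed25519PublicKey,
                           signed_blob: bytes, signature: bytes) -> bool:
        try:
            # Raises InvalidSignature if the blob wasn't issued by the holder
            # of the matching private key, or was tampered with.
            issuer_key.verify(signature, signed_blob)
        except InvalidSignature:
            return False
        claims = json.loads(signed_blob)
        # Enforce the embedded expiry locally; no network round trip needed.
        return time.time() < claims["expires_at"]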
If the cloud is down, revocations aren't going to happen instantly anyway. (Although you might be able to hack up a local WiFi or Bluetooth fallback.)
It's a fake trade-off, because you're choosing between a low-tech solution and bad engineering. IoT would work better if you made the "I" part stand for "Intranet", and kept the whole thing a product instead of a service. Alas, this wouldn't support user exploitation.
Yeah, my dream device would be some standard app architecture that could run on consumer routers. You buy the router and it's your family file and print server, and also is the public portal to manage your IoT devices like cameras, locks, thermostats, and lights.
Don't be ridiculous. Real alternatives would include P2P between your smart lock and your phone app or a locally hosted hub device which controls all home automation/IoT, instead of a cloud. If the Internet can still route a "unlock" message from your phone to your lock, why do you require a cloud for it to work?
Or use one of the boxes with combination lock that you can screw onto your wall for holding a physical key. Some are even recommended by insurance companies.
Any key commands they have already set up will still work. Nest is pretty good at having network failures fail to a working state. The only change is that they might not be able to actively open the lock over the network.
"Cloud Automotive Collision Avoidance and Cloud Automotive Braking services are currently unavailable. Cloud Automotive Acceleration is currently accepting unauthenticated PUT requests. We apologise for any inconvenience caused."
Our algorithms have detected unusual patterns and we have terminated your account as per clause 404 of the Terms and Conditions. The vehicle will now stop and you are requested to exit.
Sure you can, but you'll need to give them your code or the master code. Unless you've enabled Privacy Mode, in which case... I don't know if even the master code would work.
Everyone talking about security and not replacing locks with smart locks seems to forget that you can just kick the fucking door down or jimmy a window open.
I keep trying to explain to people that our customers don’t care that there is someone to blame they just want their shit to work. There are advantages to having autonomy when things break.
There’s a fine line or at least some subtlety here though. This leads to some interesting conversations when people notice how hard I push back against NIH. You don’t have to be the author to understand and be able to fiddle with tool internals. In a pinch you can tinker with things you run yourself.
> I keep trying to explain to people that our customers don’t care that there is someone to blame they just want their shit to work. There are advantages to having autonomy when things break.
There are also advantages to being part of the herd.
When you are hosted at some non-cloud data center, and they have a problem that takes them offline, your customers notice.
When you are hosted at a giant cloud provider, and they have a problem that takes them offline, your customers might not even notice because your business is just one of dozens of businesses and services they use that aren't working for them.
Of course customers don't care about the root cause. The point of the cloud isn't to have a convenient scapegoat to punt blame to when your business is affected. It's a calculated risk that uptime will be superior compared to running and maintaining your own infrastructure, thus allowing your business to offer an overall better customer experience. Even when big outages like this one are taken into account, it's often a pretty good bet to take.
The small bare metal hosting company I use for some projects hardly goes down, and when there is an issue, I can actually get a human being on the phone in 2 minutes. Plus, a bare metal server with tons of RAM costs less than a small VM on the big cloud providers.
Hetzner is an example. Been using them for years and it's been a solid experience so far. OVH should be able to match them, and there's others, I'm sure.
Cloud costs roughly 4x what bare metal does for sustained usage (of my workload). Even with the heavy discounts we get for being a large customer, it's still much more expensive. But I guess op-ex > cap-ex.
I've had pretty good luck with Green House Data's colo service and their cloud offerings. A couple of RUs in the data center can host thousands of VMs in multiple regions with great connectivity between them.
I have a question that always stopped me going that route: what happens when a disk or other hardware fails on these servers? Beyond data loss, I mean. Physically, what happens? Who carries out the repair, and how long does it take?
Most bare metal providers nowadays contact you just like AWS and say "hey, your hardware is failing, get a new box." Unless it's something exotic, setup time is usually not long, and in some cases, just like a VM, it's online in a minute or two.
Thanks a million. Those prices look similar to what I've used in the past, it's just been a long time since I've gone shopping for small scale dedicated hosting.
You weren't kidding, 1:10 ratio to what we pay for similar VPS. And guaranteed worldwide lowest price on one of them. Except we get free bandwidth with ours.
Solutions based on third-party butts have essentially two modes: the usual, where everything is smooth, and the bad one, where nothing works and you're shit out of luck - you can't get to your data anymore, because it's in my butt, accessible only through that butt, and arguably not even your data.
With on-prem solutions, you can at least access the physical servers and get your data out to carry on with your day while the infrastructure gets fixed.
Any solution would be based on third parties. The robust option is either to run your own country, with fuel sources for electricity and an army to defend the datacenters, or to rely on multiple independent infrastructures. I think the latter is less complex.
This is a ridiculous statement. Surely you realise that there is a sliding scale.
You can run your own hardware and pull in multiple power lines without establishing your own country.
I’ve ran my own hardware, maybe people have genuinely forgotten what it’s like, and granted, it takes preparation and planning and it’s harder than clicking “go” in a dashboard. But it’s not the same as establishing a country and source your own fuel and feed an army. This is absurd.
Correct. Most CFOs I've run into as of late would rather spend $100 on a cloud VM than deal with capex, depreciation, and management of the infrastructure. Even though doing it yourself with the right people can go a lot further.
Assuming you have data that is tiny enough to fit anywhere other than the cluster you were using. Assuming you can afford to have a second instance with enough compute just sitting around. Assuming it's not the HDDs, RAID controller, SAN, etc which is causing the outage. Assuming it's not a fire/flood/earthquake in your datacenter causing the outage.
Ah, yes, I will never forget running a site in New Orleans, and the disaster preparedness plan included "When a named storm enters or appears in the Gulf of Mexico, transfer all services to offsite hosting outside the Gulf Coast". We weren't allowed to use Heroku in steady state, but we could in an emergency. But then we figured out they were in St. Louis, so we had to have a separate plan for flooding in the Mississippi River Valley.
I keep forgetting that I have it on, my brain treats the two words as identical at this point. The translator has this property, which I also tend to forget about, that it will substitute words in your HN comment if you edit it.
But yeah, it's still a thing, and the message behind it isn't any less current.
I made an IoT setup using cheap parts (Arduino, nRF24L01+, sensors/actuators) for local device telemetry, with MQTT, Node-RED, and Tor for connecting clouds of endpoints that aren't local.
Long story short, it's an IoT that is secure, consisting of a cloud of devices only you own.
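For the local-telemetry leg, something this small is enough; a sketch assuming a Mosquitto broker on the LAN and the paho-mqtt library (topic, payload, and broker address are made up):

    # Publish a sensor reading to a broker on the local network;
    # no third-party cloud in the path.
    import json, time
    import paho.mqtt.publish as publish

    reading = {"sensor": "living-room-temp", "celsius": 21.5, "ts": time.time()}
    publish.single(
        "home/telemetry/living-room",      # topic
        payload=json.dumps(reading),
        hostname="192.168.1.10",           # local broker, reachable without internet
        port=1883,
        qos=1,
    )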
Oh that’s weird, because it totally worked for me with “butts” as a euphemism for “people”, as in “butt-in-seat time” — relying on a third-party service is essentially relying on third party butts (i.e. people), and your data is only accessible through those people, whom you don’t control.
And then “your data is in my butt” was just a play on that.
There are some who will argue that the resiliency of cloud providers beats on-prem or self-hosted, and yet they're down just as much or more (GCP, Azure, and AWS all the same). Don't take my word for it; search HN for "$provider is down" and observe the frequency of occurrences.
You want velocity for your dev team? You get that. You want better uptime? Your expectations are gonna have a bad time. No need for rapid dev or bursty workloads? You’re lighting money on fire.
Disclaimer: I get paid to move clients to or from the cloud, everyone’s money is green. Opinion above is my own.
One of the projects I worked on was using data URIs for critical images, and I wouldn’t trust that particular team to babysit my goldfish.
Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.
Now is a good time to point out that the SLA of Google Cloud Storage only covers HTTP 500 errors: https://cloud.google.com/storage/sla. So if the servers are not responding at all then it's not covered by the SLA. I've brought this to their attention and they basically responded that their network is never down.
Ironically I can't read that page because, since it's Google-hosted, I'm getting an HTTP 500 error... but which means at least that service is SLA-covered...
Cloud services live and die by their reputation, so I'd be shocked if Google ever tried to get out of following an SLA contract based on a technicality like that. It would be business suicide, so it doesn't seem like something to be too worried about?
"In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Customer must also provide Google with server log files showing loss of external connectivity errors and the date and time those errors occurred. If Customer does not comply with these requirements, Customer will forfeit its right to receive a Financial Credit. If a dispute arises with respect to this SLA, Google will make a determination in good faith based on its system logs, monitoring reports, configuration records, and other available information, which Google will make available for auditing by Customer at Customer’s request."
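Given that the burden of evidence sits with the customer, it's worth logging failed requests with timestamps yourself. A rough sketch, assuming Python's requests library and a hypothetical health-check object; not an official claim format:

    # Record loss-of-connectivity errors with timestamps, since the credit
    # process expects the customer to supply this evidence.
    import datetime, logging
    import requests

    logging.basicConfig(filename="gcs-connectivity.log", level=logging.INFO)

    def probe(url="https://storage.googleapis.com/my-bucket/healthcheck"):  # hypothetical object
        now = datetime.datetime.utcnow().isoformat()
        try:
            resp = requests.get(url, timeout=10)
            logging.info("%s status=%s", now, resp.status_code)
        except requests.exceptions.RequestException as exc:
            # Timeouts and connection resets never show up as HTTP 500s,
            # so keep your own record of when they happened.
            logging.error("%s connectivity failure: %s", now, exc)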
I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.
Any cloud provider offering those terms would go out of business VERY quickly. Outages happen, all providers are incentivized to minimize the frequency and severity of disruptions - not just from the financial hit of breaching SLA (which for something like this will be significant), but for the reputational damage which can be even more impactful.
How often does amazon or google go down for ten minutes?
But let's work backwards from the goal instead.
If you charge twice as much, and then 20-30% of months are refunded by the SLA, you make more money and you have a much stronger motivation to spend some of that cash on luxurious safety margins and double-extra redundancy.
So what thresholds would get us to that level of refunding?
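Rough numbers for that, with entirely made-up prices and refund rates, just to show the shape of the argument:

    # Toy comparison: standard pricing vs. "2x price, generous refunds" pricing.
    standard_monthly = 10_000                 # what the customer pays today
    premium_monthly = 2 * standard_monthly
    refunded_fraction_of_months = 0.25        # a quarter of months fully refunded

    expected_premium_revenue = premium_monthly * (1 - refunded_fraction_of_months)
    print(expected_premium_revenue)           # 15,000 vs. 10,000: still ahead, with a
                                              # strong incentive to spend on redundancy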
I think you're proving the parent comment's point. The number of businesses willing to pay a 500x markup is exceedingly small (potentially less than 1), and at that point the cost is high enough where it's probably cheaper to just build the redundancy yourself using multiple cloud providers (and, to emphasize, that option tends to be horribly expensive).
Just take the premium that you'd be willing to pay and put it in the bank -- the premium would be priced such that the expected payout of the premium would be less than or equal to what you'd be paying.
Besides, a provider credit is the least of most companies' concerns after an extended outage; it's a small fraction of their remediation costs and loss of customer goodwill.
You know, this reminds me of the bad taste the Google sales team left when I asked about some billing I was unaware of running up after following a quickstart guide.
AWS refunded me in the first reply on the same day!
The GCP sales rep just copy-pasted a link to a self-support survey that essentially told me, after a series of YES or NO questions, that they can't refund me.
So why not just tell your customers like it is? Google Cloud is super strict when it comes to billing. I have called my bank to do a chargeback and put a hold on all future billing with GCP.
I'm now back on AWS and still on the Free Tier. Apparently the $300 trial with Google Cloud did not include some critical products; the AWS Free Tier makes it super clear, and even still I sometimes leave something running and discover it in my invoice....
I've yet to receive a reply from Google and it's been a week now.
I do appreciate other products such as Firebase but honestly for infrastructure and for future integration with enterprise customers I feel AWS is more appropriate and mature.
Are you seriously complaining about having to pay for using their resources? I understand that you're surprised some things aren't covered in the free trial or free credit or whatever, but getting $300 free already sounded a little too good to be true (I heard about it from a friend and was dubious; at least in Europe, consumers are told not to enter deals that are too good to be true). You could at least have checked what you're actually getting.
I think it's weird to say you get credit in dollars and then not be able to spend it on everything; that's not how money works. But that's the way hosting providers work, and afaik it's quite well known. Especially with a large sum of "free money", even if it's not well known, it was on you to check the small print.
The thing that worries me most about Google Cloud and these billing stories is that I’m assuming if you chargeback or block them at your bank then they’ll ban all Google accounts of yours - and they’re obviously going to be able to make the link between an account made just for Google Cloud and my real account.
Google is well known for not caring about small shops; only if you are a multi-million-dollar customer with a dedicated account manager can you expect reasonable support. That's been the case forever with them.
Absolutely. I've seen them wipe a number of bills away for companies that have screwed up something. They definitely take a longer view on customer happiness than GCP. Azure also tends to be pretty good in this regard.
Though it can vary depending on which people you get in the partner program.
I had a guy who wanted to help me out even though I was just a one-person shop. After he left, I got a woman who threw me out of the program faster than I could blink.
Yes. 100%. We don’t pay AWS much but their help is top notch. We accidentally bought physical instances instead of reserved instances. AWS resolved the issue and credited us. I’ll prob never touch GCE. Google just isn’t a good company at any level.
I've got a personal account with an approximately $1/mo bill (just a couple things in S3) and a work account with ~$1500/mo AWS bill (not a large shop by any means) and I've always felt very positive about my interactions with AWS support
If you buy their support (which isn't that expensive), holy fuck, it's good. You literally have an infrastructure support engineer on the phone for hours with you. They will literally show you how to spend less money on your hosting while using more AWS services.
>I asked for some of my billing that I was unaware of running
>I have called my bank to do a chargeback
You're issuing a chargeback because you made a mistake and spent someone else's resources? And you're admitting to this on HN? I'm not a lawyer, but that sounds like fraud and / or theft to me.
Anything created in-house at Google (GCP) is typically created by technically proficient devs; those devs then leave the project to start something new, and maintenance is left to interns and new hires. Google customer service basically doesn't care and also has no tools at their disposal to fix any issues anyway.
The infinite money spout that is Google Ads has created a situation in which devs are at Google just to have fun - there really is no incentive to maintain anything because the money will flow regardless of quality.
From what I’ve been told, the issue is that the people with political capital (managers, PMs, etc) are quick to move after successful launches and milestones. No matter how many competent engineers hang around, the product/team becomes resource and attention starved.
Isn't it also that promotions at Google are based on creating new products/projects rather than maintaining existing ones? So engineers have a negative incentive to maintain things since it costs them promotions.
I'm not sure why you are downvoted - seems like a reasonable insight and explanation for the drop in quality and weird decisions Google is making recently.
It’s not insightful at all. Just one intern’s very brief observations of something way more complicated and nuanced than is deserving of such a dismissive comment.
I have mentioned this multiple times: any criticism of Google is met with a barrage of downvotes. I guess all the Googlers hang around here, and they usually comment with throwaways.
According to https://twitter.com/bgp4_table, we have just exceeded 768k Border Gateway Protocol routing entries, which may be causing some routers to malfunction.
I haven’t written a status page in a while, but the rest of my infrastructure starts freaking out if it hasn’t heard from a service in a while. Why doesn’t their status page have at least a warning about things not looking good?
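The kind of freshness check being described is tiny to build. A toy sketch (not how Google's dashboard actually works; the threshold is arbitrary):

    # Flag services whose last heartbeat is older than a threshold.
    import time

    WARN_AFTER_SECONDS = 120
    last_heartbeat = {}   # service name -> unix timestamp of last successful check-in

    def record_heartbeat(service: str) -> None:
        last_heartbeat[service] = time.time()

    def stale_services() -> list[str]:
        now = time.time()
        return [s for s, ts in last_heartbeat.items() if now - ts > WARN_AFTER_SECONDS]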
In my experience public status pages are "political" and no matter how they start tend to trend towards higher management control in some way... that leads to people who don't know, aren't in the thick of it, don't understand it, and / or are cautious to the point that it stops being useful.
Not only political, but with SLAs on the line they have significant financial and legal consequences as well. Most managers are probably happier keeping the ‘incident declaring power’ in as few hands as possible to make sure those penalty clauses aren’t ever unnecessarily triggered.
Same with most corporate twitter feeds. I’d like to follow my public transit/airport/highway authority, but it’ll be 10 posts about Kelly’s great work in accounting for every service disruption.
And No, I don’t want to install a separate app to get push notifications about service disruptions for every service I use.
Was noticing massive issues earlier and thought that maybe my account was blocked for breaching the TOS, as I was heavily playing with Cloud Run. Then I noticed GitLab was also acting up, but my Chinese internet was still surprisingly responsive. Tried the status page, which said everything was fine, and searched Twitter for "google cloud" and also found nobody talking about it. Typically Twitter is the single source of truth for service outages, as people start talking about them right away.
Google Cloud is the number 4 most monitored status page on StatusGator and Google Apps is number 12. In addition, at least 20 other services we monitor seemingly depend on Google Cloud because they all posted issues as soon as Google went down.
It's always interesting to see these outages at large cloud providers spider out across the rest of the internet, a lot of the world depends on Google to stay up.
Server hardware is actually quite expensive. End users' "smart" phones are cheap hardware running dumb software, which renders them as terminals for the cloud. That's sad, because smartphone hardware is quite capable of doing useful work.
(For instance, I have a 500GB MicroSD card in my phone which contains a copy of my OwnCloud)
The holiday is on the official birthday. The sovereign's actual birthday has been separate from the official birthday for centuries, so the holiday does not need to change.
Nah, it's not even her actual birthday. Different countries with the same queen even celebrate it on different days. Presumably it'll be renamed to "king's birthday" but the day kept the same when the monarch changes. Or done away with/re-purposed - there's a general feeling in Australia at least that once the queen dies there will be less support for the monarchy.
So, for some companies, failing over between providers is actually viable and planned for in advance. But it is known in advance that it is time-consuming and requires human effort.
The other case is really soft failures for multi-region companies. We degrade gracefully, but once that happens, the question becomes what other stuff you can bring back online. For example, this outage did not impact our infrastructure in GCP Frankfurt; however, it prevented internal traffic in GCP from reaching AWS in Virginia because we peer with GCP there. We also couldn't access the Google Cloud API to fall back to VPN over the public internet. In other cases, you might realize that your failover works but the timeouts are tuned poorly under the specific circumstances, or that disabling some feature brings the remainder of the product back online.
Additionally, you have people on standby to get everything back in order as soon as possible when the provider recovers. You may also need to bring more of your support team online to deal with increased support calls during the outage.
It's not even about being able to afford it. Some things just don't lend themselves to hot failover. If your data throughput is high, it may not be feasible or possible to stream a redundant copy to a data center outside the network.