At Scale, Rare Events aren’t Rare

I’m a connoisseur of failure. I love reading about engineering failures of all forms and, unsurprisingly, I’m particularly interested in data center faults. It’s not that I delight in engineering failures. My interest is driven by believing that the more faults we all understand, the more likely we can engineer systems that don’t suffer from these weaknesses.

It’s interesting, at least to me, that even fairly poorly-engineered data centers don’t fail all that frequently and really well-executed facilities might go many years between problems. So why am I so interested in understanding the cause of faults even in facilities where I’m not directly involved? Two big reasons: 1) the negative impact of a fault is disproportionately large and avoiding just one failure could save millions of dollars, and 2) at extraordinary scale, even very rare faults happen surprisingly frequently.

Today’s example is from a major US airline last summer and it is a great example of “rare events happen dangerously frequently at scale.” I’m willing to bet this large airline had never before seen this particular fault and yet, operating at much higher scale, I’ve personally encountered it twice in my working life. This example is a good one because the negative impact is high, the fault mode is well understood, and, although it is a relatively rare event, there are multiple public examples of this failure mode.

Before getting into the details of what went wrong, let’s look at the impact of this failure on customers and the business. In this case, 1,000 flights were canceled on the day of the event but the negative impact continued for two more days with 775 flights canceled the next day and 90 on the third day. The Chief Financial Officer reported that $100m of revenue, or roughly 2% of the airline’s worldwide monthly revenue, was lost in the fallout of this event. It’s more difficult to measure the negative impact on brand and customer future travel planning, but presumably there would have been impact on these dimensions as well.

It’s rare that the negative impact of a data center failure will be published, but the magnitude of this particular fault isn’t surprising. Successful companies are automated and, when a systems failure brings them down, the revenue impact can be massive.

What happened? The report was “switch gear failed and locked out reserve generators.” To understand the fault, it’s best to understand what the switch gear normally does and how faults are handled and then dig deeper into what went wrong in this case.

In normal operation, the utility power feeding a data center flows in from the mid-voltage transformers through the switch gear and then to the uninterruptible power supplies, which eventually feed the critical load (servers, storage, and networking equipment). During normal operation, the switch gear is just monitoring power quality.

If the utility power goes outside of acceptable quality parameters or simply fails, the switch gear waits a few seconds since, in the vast majority of the cases, the power will return before further action needs to be taken. If the power does not return after a predetermined number of seconds (usually less than 10), the switch gear will signal the backup generators to start. The generators start, run up to operating RPM, and are usually given a very short period to stabilize. Once the generator power is within acceptable parameters, the load is switched to the generator. During the few seconds required to switch to generator power, the UPS has been holding the critical load and the switch to generators is transparent. When the utility power returns and is stable, the load is switched back to utility and the generators are brought back down.
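To make that sequence concrete, here’s a minimal sketch of the normal failover logic in Python. The object names, methods, and timing constants are illustrative assumptions, not the behavior of any particular vendor’s switch gear:

```python
import time

# Illustrative timing assumptions; real switch gear is configurable and vendor-specific.
UTILITY_RETURN_WAIT_S = 8    # wait briefly since utility power usually returns on its own
GENERATOR_STABILIZE_S = 5    # short settling period once the generator is at operating RPM


def on_utility_failure(utility, generator, transfer_switch):
    """Sketch of the normal failover sequence; the UPS carries the critical
    load throughout, so a successful transfer is invisible to the servers."""
    deadline = time.monotonic() + UTILITY_RETURN_WAIT_S
    while time.monotonic() < deadline:
        if utility.power_quality_ok():
            return                        # utility recovered on its own; nothing more to do
        time.sleep(0.5)

    generator.start()                     # signal the backup generators to start
    generator.wait_until_at_speed()       # run up to operating RPM
    time.sleep(GENERATOR_STABILIZE_S)     # give the output a moment to stabilize
    if generator.power_quality_ok():
        transfer_switch.transfer_to(generator)


def on_utility_return(utility, generator, transfer_switch):
    """When utility power returns and is stable, switch back and wind down the generators."""
    if utility.power_quality_ok():
        transfer_switch.transfer_to(utility)
        generator.stop()
```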

The utility failure sequence described above happens correctly almost every time. In fact, it occurs exactly as designed so frequently that most facilities will never see the fault mode we are looking at today. The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out.

This same fault mode caused the 34-minute outage at the 2013 Super Bowl: The Power Failure Seen Around the World.

Backup generators run around 3/4 of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect that some customers would want it that way, I’ve never worked for one of those customers and the airline hit by this fault last summer certainly isn’t one of them either.

There are likely many possible causes of a power anomaly of sufficient magnitude to cause switch gear lockout, but the two events I’ve been involved with were both caused by cars colliding with aluminum street light poles that subsequently fell across two phases of the utility power. Effectively an excellent conductor landed across two phases of a high voltage utility feed.

One of the two times this happened, I was within driving distance of the data center and everyone I was with was getting massive numbers of alerts warning of a discharging UPS. We sped to the ailing facility and arrived just as servers were starting to go down as the UPSs discharged. With the help of the switch gear manufacturer and a pass through the event logs, we were able to determine what happened. What surprised me was that the switch gear manufacturer was unwilling to make the change to eliminate this lockout condition even if we were willing to accept all equipment damage that resulted from that decision.

What happens if the generator is brought into the load rather than locking out? In the vast majority of situations, and in 100% of those I’ve looked at, the fault is outside of the building and so the lockout has no value. If there were a ground fault in the facility, the impacted branch circuit breaker would open, the rest of the facility would continue to operate on generator, and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact. If the fault were much higher in the power distribution system without breaker protection, or a breaker failed to open, I suspect a generator might take damage, but I would rather put just under $1m at risk than be guaranteed that the load will be dropped. If just one customer could lose $100m, saving the generator just doesn’t feel like the right priority.

I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense, so we solved this particular fault mode some years back. In our approach, we implemented custom control firmware such that we can continue to multi-source industry switch gear but it is our firmware that makes the load transfer decisions and, consequently, we don’t lock out.
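As a rough sketch of the policy difference (hypothetical interfaces, and certainly not the actual firmware), the lockout behavior described above versus a no-lockout, availability-first policy looks something like this:

```python
def vendor_lockout_policy(anomaly, generator, transfer_switch):
    """Sketch of the behavior described above: an anomaly that looks like an
    in-building ground fault locks out the generator to protect it, and the
    UPS discharges a few minutes later, dropping the load."""
    if anomaly.suggests_internal_ground_fault():
        transfer_switch.lock_out_generator()
    else:
        transfer_switch.transfer_to(generator)


def availability_first_policy(anomaly, generator, transfer_switch):
    """Sketch of a no-lockout policy: always transfer, and rely on the many
    layers of downstream breakers to isolate a genuine in-building fault.
    Worst case is a damaged generator; the load is never intentionally dropped."""
    transfer_switch.transfer_to(generator)
```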

34 comments on “At Scale, Rare Events aren’t Rare”
  1. James, I’m not entirely sure we understand the “lockout” as described by the major airline. Low Voltage Circuit Breakers (< 600V) usually have integral trip units that cause the breaker to open under a short circuit, overload or ground fault (downstream of them). Medium Voltage Breakers (1000V – 35,000V) use a type of external trip sensor called a protective relay. They usually are more intelligent than a low voltage integral trip unit, but basically serve the same purpose if the generator is not intended to be paralleled or connected to the Utility Service. If the generator is intended to be connected to the utility service, then, in the case as described, the generator would correctly not be allowed to be connected to a faulted utility service.

    However, in the much more common and likely scenario of “open transition” transfer to generator, it is somewhat technically debatable if a generator (or other source of electrical power) should be allowed to connect to a faulted bus or load. Here, I agree with you on this point, as does the NEC with respect to ground fault detection: it can transfer to generator. A “standard & common” automatic transfer switch (ATS) does not have this lockout functionality.

    However, with a fault upstream of the Utility breaker, there is no reason whatsoever that this switchgear should “lockout”. I highly doubt the switchgear automation was designed to operate this way. If what is described is what happened, then one needs to investigate why the switchgear “locked out”. This could be due to:
    a) improper facility-wide or switchgear grounding and/or bonding,
    b) a faulty trip unit or protection relay performing this function,
    c) the programmable logic controller (PLC) going down due to the outage and not performing the generator transfer function,
    d) a bug in the PLC program,
    e) some other wiring, connection or control device failure, or
    f) generators “self protecting” and locking out due to many conditions such as “over crank”, “over speed”, low oil pressure, low coolant level and many others. This condition could have been annunciated on the switchgear.

    I just don’t see how one can conclude, based on what was published, that the switchgear locked out due to intentional transfer logic programming.

    A thorough investigation by an Engineer and switchgear/controls technician would be required if no “smoking gun” fingerprints were found.

    • I’ve been involved with a couple of these events that got investigated in detail with both the utility and switch gear provider engineering teams, and both events were caused by the issue you described as “open transition”. Your paragraph on this fault mode:

      “….in the much more common and likely scenario of “open transition” transfer to generator, it is somewhat technically debatable if a generator (or other source of electrical power) should be allowed to connect to a faulted bus or load. Here, I agree with you on this point, as does the NEC with respect to ground fault detection: it can transfer to generator.”

      You speculated that the switchgear would not be intentionally programmed to produce this behaviour. The problem the switch gear manufacturers face is that they want to detect shorts to ground inside the facility, but they are not able to do this without some false positives. These events, which I’m referring to as lockout and you are referring to as “open transition”, are unavoidable while the switchgear is detecting the potential for a direct short to ground inside the facility. This PLC programming is intentional and, although the false positives are not really wanted, the switch gear providers report there is no economic way to reliably avoid them, so the lockout issue remains.

  2. Great article and analysis – I’ve seen this in factories, too – one time, underground 4kV lines were just wet enough for ground-fault systems to sense it and trip, in part due to some badly-engineered (but operating for years prior) ground routing that, like your 2-out-of-3 short, was too sensitive to phase-imbalanced floating ground references.

    We also lacked on-site HiPot gear to test, but we had a factory down (twice) and experience told us it was okay, so we manually threw 10MW 34.5KV switches while most of the team hid under cars hoping the house-sized transformers wouldn’t blow up if there was a real short.

    One of a few actual or nearly explosive power situations I was involved in . . .

  3. Rick says:

    I still remember how a major data center in Colorado Springs had redundant power lines into the prem, but both of the power leads looped around into a parallel structure 30 feet outside the prem, and the backhoe got them both about 10 feet later.

    • Yeah, weird things do happen. I was involved with a critical facility that had dual network feeds on diverse paths with good distance between them. A neighboring construction project managed to cut one network link and, before the first issue had been corrected, they somehow managed to cut the second network link. At that point, I was close to hoping they might find the utility feed :-).

    • Thanks John. It’s a pretty interesting article with lots of examples. The core premise is a simple one: complex systems often run in degraded operations mode, one where the safety margins are partially consumed by less-than-ideal management decisions, and these decisions are at least partly responsible for the subsequent system failure.

  4. Angel Castillo says:

    Should datacenters have enough UPS capacity to allow for troubleshooting power issues further up the line? Like in the switch gear lockout event, a UPS with one hour of capacity could’ve allowed for a manual switch without customers being affected. A 5-minute UPS seems like cutting it pretty close to me in case of a genset failure, phasing issues or the lockout issue.

    • Your logic is sound that 5 to 10 min is not close to enough time for a human solution. The problem is that doubling the capacity to 10 to 20 min is nearly twice the price and still doesn’t give much time. Extend the capacity by 10 times out to 50 to 100 min and you have a chance of fixing the problem but, even then, power distribution problems that can be safely addressed in an hour aren’t that common, and neither are teams with all the right skills on staff 24×7.

      Whenever I look at solving problems with longer UPS ride-through times, I end up concluding it’s linear in expense but not all that effective at improving availability, so I argue the industry is better off with redundancy and automated failover. It ends up being both more effective and more economical.

      An interesting side effect of the automated redundancy approach is, once you have that in place, you can ask an interesting question: rather than 5 to 10 min of UPS, what do I give up in going with 2 to 3 min? Everything that is going to be done through automation is done in the first 2 to 3 min (with lots of safety margin), so why bother powering for longer? Shorter UPS times are getting to be a more common choice.
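      As a back-of-the-envelope illustration, with assumed timings since the real numbers are site- and design-specific, the ride-through a fully automated failover actually needs is small:

```python
# Assumed timings for a fully automated failover; all values are illustrative.
utility_return_wait_s = 10   # wait to see if utility power comes back on its own
generator_start_s     = 15   # crank and run up to operating RPM
stabilize_s           = 5    # settle before accepting load
transfer_s            = 1    # switch the load to generator
retry_factor          = 2.0  # margin for roughly one full retry of the sequence

required_ride_through_s = retry_factor * (
    utility_return_wait_s + generator_start_s + stabilize_s + transfer_s
)
print(f"UPS ride-through needed: ~{required_ride_through_s / 60:.1f} minutes")
# Roughly one minute with these assumptions, which is why 2 to 3 minutes of UPS
# still leaves a healthy margin once everything is automated.
```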

      • Donough Roche says:

        In the vast majority of data center designs, the HVAC (cooling) is not supported by the UPS systems. The IT systems will experience some elevation of temperatures within the five minutes of UPS run time, but the overall design takes this into account and is matched to the runtime of your UPS. However, if you continue to provide power to the IT systems for an extended time while generator/switchgear issues are being investigated and fixed, increasing temperatures within the data center will take down your IT equipment. Adding UPS support to HVAC systems adds significant capital and operational costs that are superfluous when your upstream switchgear is designed, programmed, and tested to work when you need it to work.

  5. Anon says:

    Can you fix this horrible layout on this site? Text right up to the edge of the screen? Are you kidding me? It’s painful to read.

    • Looks fine on all devices I’ve got around here. What device, operating system, and browser are you using?

      • Tester says:

        The Engineer’s reply to a designer type. Hilarious.

      • Kevin says:

        Windows 7, Google Chrome 57.0.2987.133, 1920 x1200
        Windows 7, Internet Explorer 11.0.9600.18617IS, 1920 x1200
        Windows 10, Edge, 38.14393.0.0, 1920 x1080

        • Yeah, OK. I use the same blog software on both perspectives.mvdirona.com and mvdirona.com and, on mvdirona.com, there is a lot of content to show with both a blog and a real-time location display. I mostly access the site on a Nexus 9 and it actually has pretty reasonable margins. Looking at it on Windows under Chrome and IE, I generally get your point — the site would look better with more margin. I don’t really focus much on user interface — I’m more of an infrastructure guy — but I’ll add more margins on both mvdirona.com sites. Thanks.

  6. Vermont Fearer says:

    Really enjoyed this post and learned a lot about how data centers are designed. I had read about a similar case involving a rare event at an airline data center in Arizona that you might find interesting: https://www.azcourts.gov/Portals/45/16Summaries/November82016CV160027PRUS%20Airwayfinal.pdf

    • Thanks for sending the Arizona Supreme Court filing on Qwest vs US Air. US Air lost but the real lesson here is they need a better supplier. One cut fiber bundle shouldn’t isolate a data center.

      Here’s a funny one I’ve not talked about before. Years ago I was out in my backyard (back when we still had a backyard) getting the garden ready for some spring planting when I found an old damaged black cable with some exposed conductors buried about 6″ below the surface. Weird that I hadn’t found it before, but I cut it at the entry and exit of the garden and proceeded. Later that day I noticed our phone didn’t work but didn’t put the two events together. The next day I learned that the entire area had lost phone coverage the previous day and was still down. Really, I wonder what happened there? :-)

      • Ron P. says:

        I’m glad you are not “gardening” in my neighborhood!! :)

        That is the most hilarious admission from someone partially responsible for perhaps billions of dollars of communications infrastructure!!!!

        • I would argue that Qwest putting the entire neighborhood’s cable through my back yard, burying it only six inches deep, and not using conduit is kind of irresponsible. But, yes, it was my shovel that found it. I’m just glad it wasn’t the power company :-).

  7. Mark "employee" Evans says:

    Dear Sir Hamilton,
    Thanks for the article. Is it in reference to the 8/8/2016 failure (read “fire”) of a 22-year-old “kit” in Delta’s Atlanta HQ, America’s second largest airline? Does the North American Electric Reliability Corporation (NERC) have any mechanisms to ensure switchgear manufacturers become more flexible?

    • If NERC has taken a position on this failure mode, I’ve not seen it, but they do publish a wonderful lessons-learned resource to help operators and contractors understand different faults and how to avoid them.

  8. Florian Seidl-Schulz says:

    You offered to take all legal responsibility if the generator went online, no matter what?
    As in, humans harmed by continuing to power a short, and contract damages from a prolonged outage due to generator destruction?

    • You raised the two important issues but human safety is the dominant one. To operate legally, data centers have to meet jurisdictional electrical safety requirements, and these standards have evolved over more than 100 years of electrical system design and usage. The standards have become pretty good, they reflect the industry focus on safety first, and industry design practices usually exceed these requirements on many dimensions.

      The switch gear lockout is not required by electrical standards, but it’s still worth looking at lockout and determining whether it adds to safety or reduces it for the benefit of the equipment. When a lockout event occurs, an electrical professional will have to investigate, and they have two choices, with the first being by far the most common: they can try re-engaging the breaker to see if it was a transient event, or they can probe the conductors looking for faults. All well designed and compliant data centers have many redundant breakers between the load and the generator. Closing that breaker via automation allows the investigating electrical professional to have more data when they investigate the event.

      Equipment damage is possible but, again, well designed facilities are redundant with concurrent maintainability, which means they can have equipment offline for maintenance, then suffer an electrical fault, and still safely hold the load. Good designs need to have more generators than required to support the entire facility during a utility fault. A damaged generator represents a cost but it should not lead to an outage in a well designed facility.

      Human safety is a priority in all facilities. Equipment damage is not something any facility wants but, for many customers, availability is more important than possible generator repair cost avoidance.

  9. Ruprecht Schmidt says:

    Loved the piece! I’m now inspired to ask someone about the switch gear situation at the data center where we host the majority of our gear. Thanks!

    • It’s actually fairly complex to chase down the exact details of precisely which triggering events cause the different switch modes to be entered. Only the switch gear engineering teams really know the details, the nuances, and the edge cases that cause given switch modes to be entered. It’s hard data to get with precision and completeness.

      In many ways it’s worth asking about this fault mode but, remember, this really is a rare event. It would be a super unusual facility where this fault mode is anywhere close to the most likely cause of downtime. Just about all aspects of the UPSs need scrutiny first.

  10. A says:

    The 2102 Super Bowl sounds very interesting, we’ll have to see how it goes.

    Back on topic, if there was such a fault, I have to wonder if more than mere equipment damage might be at stake in some cases. I suspect they also want to limit their own liability for doing something dangerous.

    • Thanks, I fixed the 2102 typo.

      Human risk factors are the dominant concern for the data center operators and equipment operators. Data centers have high concentrations of power but these concerns are just as important in office buildings, apartment buildings, and personal homes, and that’s why we have jurisdictional electrical standards designed to reduce the risk directly to occupants and operators and indirectly through fire. The safeguards in place are important and required by all jurisdictions, but these safety regulations do not include switch gear lockout. All data centers have 5+ breakers between the generator and the load. There are breakers at the generator, the switch gear itself, the UPS, and downstream in some form of remote power panel and, depending upon the design, many more locations. As an industry we have lots of experience in electrical safety and the designs operate well even when multiple faults are present because they all have many layers of defense.

      Let’s assume that the switch gear lockout is a part of this multi-layered human defense system even though it is not required by electrical codes. Is it possible that this implementation is an important part of why modern electrical systems have such an excellent safety record? With the lockout design, the system goes dark and professional electrical engineers are called. Many critical facilities have electrical engineers on premises at all times but, even then, it’ll likely take more time than the UPS discharge time to get to the part of the building that is faulting. The first thing a professional engineer will do when investigating a switch gear lockout is re-engage the breaker to see if it was a transient event or an actual on-premises issue. Another investigative possibility is to probe the system for a ground fault, but most professionals choose to engage the breaker first, and it seems like a prudent first choice: probing potentially hot conductors is not a task best taken on under time pressure and, in the 99th percentile of cases, the event is outside the facility, so just re-engaging the breaker is safer than probing.

      Doing this first-level test of re-engaging the open breaker through automation has the advantage of 1) not dropping the load in the common case, and 2) not requiring a human to be at the switch gear to engage it in a test. I hate closing 3,000A breakers and, if I personally have to do it, I always stand beside them rather than in front of the breaker. As safe as it is, it’s hard to feel totally comfortable with loads that high. Doing the first-level investigation in automation reduces human risk and puts more information on the table for the professional engineer who will investigate the issue. Of course, all issues, whether resolved through automation or not, still need full root cause investigation.

  11. David says:

    “the 2102 super bowl…” So you ARE from the future… Busted!

  12. Denis Altudov says:

    A more, ahem, “pedestrian” story along the same lines.

    Electric bikes have batteries which may overheat under certain conditions such as weather, load, manufacturing defects, motor problems, shorts, etc. The battery controller in this case is programmed to cut the power, saving the battery from overheating and/or catching fire. Fair enough, right? A fire is avoided, the equipment is saved, and the user just coasts along for a while and comes to a safe stop.

    Fast forward to the “hoverboard” craze – the self-balancing boards with two wheels, one on each side. The batteries and controllers have been repurposed in a hurry to serve the new hot market. When a battery overheats, the controller cuts the power to the motor, the self-balancing feature turns off, and the user face-plants into the pavement. But the $100 battery is saved!

    Sadly, I don’t have Amazon’s scale to rewrite the battery controller firmware, hence why I led my post with the word “pedestrian”. Off I go for a stroll.

    Cheers.

    • Hey Denis. Good to hear from you. Lithium-ion batteries have massive power density and, just over the last few months, have been in the news frequently with reports of headphones suffering from an explosive discharge and cell phones catching fire. The hoverboard mishaps have included both the permanent battery lockout you describe and also fires from not isolating faulty cells.

      Safety around Li-ion batteries, especially large ones, is super important. Good battery designs include inter-cell fusing, sometimes a battery-wide fuse, and charge/discharge monitoring firmware that includes temperature monitoring. Some of the more elaborate designs include liquid cooling. Tesla has been particularly careful in their design, partly since they couldn’t afford to have a fault in the early days but mostly because they were building a positively massive battery with 1,000s of cells.

      Good Li-ion battery designs use stable chemistries and have fail-safe monitoring with inter-cell fusing and often battery-wide fusing. These safety systems will cause the odd drone to crash and may cause sudden drive loss on hoverboards. In the case of hoverboards, because the basic system is unstable without power, a good design would have to ensure that there is sufficient reserve energy to safely shut down on a battery fault. I’m sure this could be done but, as you point out, it usually wasn’t.
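      A minimal sketch of that “keep enough reserve for a controlled stop” idea, with made-up thresholds and interfaces since real battery management systems are considerably more sophisticated:

```python
# Illustrative thresholds; a real battery management system works from
# per-cell measurements with hysteresis and fault logging.
OVERHEAT_C       = 60.0   # pack temperature at which full power must stop
SHUTDOWN_RESERVE = 0.02   # fraction of charge held back for a controlled stop


def on_battery_fault(battery, vehicle):
    """Instead of cutting power instantly (and face-planting the rider), keep
    just enough output to bring the vehicle to a safe stop first."""
    if battery.temperature_c() > OVERHEAT_C or battery.cell_fault_detected():
        if vehicle.is_moving() and battery.charge_fraction() > SHUTDOWN_RESERVE:
            vehicle.begin_controlled_stop()   # reduce power but keep self-balancing active
        else:
            battery.disconnect()              # hard cutoff once stopped (or reserve exhausted)
```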

      My take is that vehicles that are unstable or unsafe when not powered are probably just not ideal transportation. I’m sure hoverboards could be designed with sufficient backup power to allow them to shut down safely but the easy mitigation is to get an electric bike :-)
