Intel’s Atom C2000 chips are bricking products, and it’s not just Cisco hit

jepler · on Feb 7, 2017

Reminds me of the Sandy Bridge SATA flaw.

"The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports."

http://www.anandtech.com/show/4143/the-source-of-intels-coug...

yuhong · on Feb 7, 2017

I wonder how many would even bother to get it replaced if it was discovered say only a year after launch.

throwaway7767 · on Feb 8, 2017

I had one of those. The shop I bought it from refused to replace it from their inventory, all they would do is take the motherboard, send it back and then give me the replacement some weeks later when the RMA process was completed.

Since I needed that machine functioning, I never replaced it (the mobo had some extra SATA ports handled by a different controller, so they kept working and I switched to using them). I suspect a lot of people are in the same boat. I'll never do business with that store again.

leonroy · on Feb 7, 2017

Sigh. The perils of maintaining my own data center in the basement for 'fun' are coming to haunt me.

I have 2x C2758 Supermicro boxes running core routing services and a Synology RS2416+ for storage - all on the affected CPU list - guess I better double check my backups are working and allocate some funds for replacement kit in case things go belly up!

adrr · on Feb 6, 2017

If this is related to the Cisco clock signal component issue, Cisco is handling it in a really poor way. No replacements unless it's under warranty, even though it's a known issue.

http://www.cisco.com/c/en/us/support/web/clock-signal.html#~...

freehunter · on Feb 7, 2017

Under warranty or anyone who has a TAC subscription. Cisco licensed their products with a ToS that says you can't resell it, and warranties are only valid for people who bought directly from Cisco. You also can't (effectively) get TAC support for a resold device (what Cisco folks call "grey market"). It's not illegal to buy secondhand Cisco product, Cisco just won't support them or let you get software upgrades without paying them a ton of money.

100%, this wording is to make sure that grey market buyers aren't covered under the replacement. Basically anyone who bought from Cisco will be able to get a replacement.

kuschku · on Feb 7, 2017

How do they handle this in the EU, where 2 years warranty, even if resold, are mandatory?

msh · on Feb 7, 2017

That doesn't cover business buyers, who I guess are most of Cisco's buyers; it only covers consumers.

jsiepkes · on Feb 7, 2017

True, only consumers get a mandatory two-year warranty in the EU. Although I think they would have a hard time in court not replacing it, since it was sold with a defect (they admit it's a defect).

Cisco is like Apple in that regard: you pay a premium, but the service is nowhere near premium.

lostlogin · on Feb 7, 2017

Wonder how that is applied in New Zealand, where the Consumer Guarantees Act basically requires sellers to sort problems out within "a reasonable time frame" of sale. It's a fantastic piece of legislation.

antod · on Feb 7, 2017

The CGA only applies to consumers and not business customers, and only applies between the end reseller and the customer. Cisco doesn't really sell directly to consumers.

legulere · on Feb 7, 2017

The EU-mandated warranty is against the seller, not the producer. Also it is just for consumers.

manarth · on Feb 7, 2017

   No replacements unless its under warranty

It seems that this isn't entirely the case. From TFA: "if your product was under warranty as of November 16, 2016, you are still eligible to replace your products", and "Cisco is offering to provide replacement products…even if they have not failed".

Without more information, I wouldn't like to speculate on the cutoff date of November 16, 2016, but so far it doesn't sound too unreasonable. We don't really know the scope of the failure, nor relevant batch numbers, etc, so I'd give them the benefit of the doubt, until clearer details emerge.

adrr · on Feb 7, 2017

I think I am going crazy. I swear it said a valid service contract would be required for replacement if it failed outside the warranty.

yuhong · on Feb 7, 2017

Why don't they name the customer/supplier when it is obvious when the product is taken apart? Even with the Cisco DDR SDRAM fiasco, it wasn't that hard to figure out that it was Micron DDR SDRAM that is at fault.

wmf · on Feb 7, 2017

No one can afford to name and shame Intel due to potential retaliation.

yuhong · on Feb 7, 2017

Intel already published the errata.

ticviking · on Feb 7, 2017

So why keep using intel

lostlogin · on Feb 7, 2017

I'm likely revealing my ignorance, but what's the alternative?

AstralStorm · on Feb 7, 2017

ARM, MIPS. (multiple routers use these) Custom FPGA softcore even. Maybe extra ASICs. Cisco is big enough for that. However I suspect Intel might be cheapest to get for performance.

taspeotis · on Feb 7, 2017

NDAs are par for the course.

tyingq · on Feb 7, 2017

Not great for Intel. These Avoton Atoms were the first Atom chips with respectable performance, so there was a chance to unsully the Atom name.

Also, I'm reasonably sure many of these were sold already permanently affixed to the motherboard, so the fix may be worse than swapping out just a CPU.

wtallis · on Feb 7, 2017

I don't think the processor cores in the Avoton chips were anything impressive, but these were the first Atom chips with a lot of I/O bandwidth.

From what I can tell, all of the Avoton chips were sold in a BGA package that required them to be soldered to the motherboard. There isn't a socketed version of Avoton.

chiph · on Feb 7, 2017

Atom models affected:

C2308, C2338, C2350, C2358, C2508, C2518, C2530, C2538, C2550, C2558, C2718, C2730, C2738, C2750, and C2758.
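For Linux-based boxes, checking a CPU against the list above can be sketched in a few lines of Python. This is a rough sketch, not a vendor tool: it assumes the model number shows up in the "model name" field of /proc/cpuinfo (for example something like "Intel(R) Atom(TM) CPU C2758 @ 2.40GHz", which is an assumed sample string), and the function names are made up for illustration.

```python
import re

# Affected Atom C2000 models, copied from the list above.
AFFECTED = {
    "C2308", "C2338", "C2350", "C2358", "C2508", "C2518", "C2530", "C2538",
    "C2550", "C2558", "C2718", "C2730", "C2738", "C2750", "C2758",
}


def affected_model(model_name):
    """Return the affected model number found in a CPU model string, or None."""
    for token in re.findall(r"C2\d{3}", model_name.upper()):
        if token in AFFECTED:
            return token
    return None


def read_model_name(path="/proc/cpuinfo"):
    """Return the first 'model name' value from /proc/cpuinfo, or None."""
    try:
        with open(path) as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        return None
    return None


if __name__ == "__main__":
    name = read_model_name()
    if name is None:
        print("Could not determine the CPU model on this system.")
    else:
        hit = affected_model(name)
        print(f"{name}: {'on the list (' + hit + ')' if hit else 'not on the list'}")
```

A match only means the part number is on the list; it says nothing about whether or when a given unit will fail.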

How to tell what CPU your Synology NAS has:

https://www.synology.com/en-us/knowledgebase/DSM/tutorial/Ge...

My 3-month old 1815+ is on the list...

fulafel · on Feb 7, 2017

"slightly higher expected failure rates under certain use and time constraints" sounds like it shouldn't be observable in the field. Do people suspect Intel are lying or is this a storm in a teacup?

AstralStorm · on Feb 7, 2017

Milquetoast words to stem panic. Ineffective.

aeturnum · on Feb 7, 2017

Well, I guess it's time to replace my otherwise perfectly-good Synology NAS on the double.

tyingq · on Feb 7, 2017

This looks interesting: https://forum.synology.com/enu/viewtopic.php?f=7&t=119727&st...

See the last couple of posts in the thread as well.

Sounds like you can get an RMA, but it's a slow process.

aeturnum · on Feb 7, 2017

Realistically, I'm not interested in an RMA for another DS1815 that will fail, followed by (possibly) another RMA once the problem is fixed in silicon. I'm also very uncertain about slotting the drives into a new unit and successfully recovering the RAID.

Instead, I'll shut down the NAS and buy a replacement from another company (QNAP probably) and transfer the data. The other options feel too risky.

digler999 · on Feb 7, 2017

I'm guessing the CPU must be soldered to the board on these ? I have a 1815+ that is currently working, and now I'm afraid to shut it off. I wonder if the DS2015 (not sure of #, the 10gbe model) uses this defective part ?

aeturnum · on Feb 7, 2017

I would imagine - I haven't taken the unit apart.

digler999 · on Feb 7, 2017

probably to save money. that sucks, because even if you swapped it with another defective one, it would be worth it if you just had to replace the CPU every 18 months.

tyingq · on Feb 7, 2017

Learned in another area of this post that yes, they are soldered on, but that's Intel's choice. They only offer the CPU in a BGA (ball grid array) form factor. There's no such thing as a BGA socket, other than some specialty test unit things that aren't suitable for real world use.

tyingq · on Feb 7, 2017

Supposedly, the DS2015xs uses an ARM CPU... 1.7GHz Annapurna AL-514. That's a $1600 unit though.

digler999 · on Feb 7, 2017

It's expensive, no doubt. Since I'm already in $1k for the 1815+, maybe if/when mine dies, I will try to haggle with them for a $600 upgrade instead of RMA'ing the old unit. Might work out better for everyone: I know the new unit doesn't have the faulty Atom chip, I have a faster NAS, and that's one less RMA they have to turn around (I'd still mail back the broken one).

jcurbo · on Feb 7, 2017

Had the same thought. The article mentions the 1815 but I have a 1515 which apparently has a C2538, which is one of the models listed.

(Synology 1515 specs: https://www.synology.com/en-us/products/DS1515+#spec)

tracker1 · on Feb 7, 2017

Man, I'm glad I didn't pull the trigger on a new one, been planning on it for a while, was going to be my tax return gift to myself this year.

kev009 · on Feb 7, 2017

I heard a rumor that something very similar was detected on upcoming Xeon SKUs, but they will be implementing the board level workaround.

gens · on Feb 7, 2017

Somewhat off-topic:

Are Cisco products worth the money ?

Personally I haven't had that much experience with their stuff, but I remember seeing a brand new router running hot with two fans blowing in it (other routers at the time were 2x smaller without fans). I understand that Cisco should be the de facto networking standard, but is it really worth the name ?

tracker1 · on Feb 7, 2017

Probably.. but that comes down to a lot of factors though. It also depends on what kind of gear you're looking to buy. Everyone will try to be compatible with Cisco's interpretation of a given standard, if you go elsewhere, it may or may not be 100% compatible with your other equipment. Also, more IT networking guys will be more familiar with Cisco.

That said, they are definitely more expensive than their peers. But then again, an Escalade is more expensive than a Tahoe.

frozenport · on Feb 8, 2017

Nobody Ever Got Fired for Buying IBM

leonroy · on Feb 8, 2017

ServeTheHome surveyed a bunch of affected vendors to get some more information on the issue: https://www.servethehome.com/intel-atom-c2000-series-bug-qui...

And some technical specifics behind the problem: https://slashdot.org/comments.pl?sid=10214953&cid=53819967

Can't post to The Register, since they don't have ACs.

Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is a secondary bus where you hang old ISA-style devices, like the system FLASH. If the FLASH is the only thing in there, it will mostly render the system unbootable (so, stuff that never gets power-cycled would just keep going). But LPC can generate interrupts, and one often hangs other crap to that bus, such as i2c controllers for hot-swap bays, motherboard management controllers, and other sensors. In that case, you can expect severe runtime misbehavior.

The issue is caused by "continuous degradation due to use", so repairing it is easy, if costly: replace the motherboard with a new one under warranty (and even if out of warranty period wherever this kind of "stealth" manufacturing defect is not subject to warranty time period limitations, such as in Brazil). It will "reset" the counter. This is your zero-day solution to the issue.

Depending on time-to-market for the new stepping (hardware revision) B1/C0 of the Atom C2000, you might need an interim solution, which is the "platform-level change", i.e. redesigned board with extra components that work around Intel's hardware design error. As soon as you have these, you start using these to replace any boards returned due to the defect, or start a "recall" to preemptively replace boards.

Depending on the total cost of the board plus other components, you keep the old boards you replaced around, and when revision B1/C0 of the Atom C2000 is out, you BGA-replace them in a factory (about US$ 25 per board in large volumes, if that much), maybe replace any liquid electrolytic capacitors and other crap that ages badly, and use the boards either as new or as refurbished, depending on your corporate/regulatory ethics. This kind of repair almost always really resets the board's MTBF. If Intel supplies the replacement Atoms at no charge, the cost of repair might well be far less than the cost of the production run for boards you'd want to keep around for warranty services, anyway.

Mind you, at 1.5 years per failure, legislation or contracts that force more than one replacement will be rare... so, let's hope they don't replace a faulty board with a brand-new, virgin, but-still-timebombed board. You'd have trouble getting it replaced a second time if it fails after the warranty period.
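The warranty arithmetic in that last paragraph can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it pretends every unfixed board fails at exactly 18 months (the 1.5 years quoted above; real failures are statistical and the actual distribution isn't public), and that the warranty clock runs from the original purchase and is not extended by a replacement. The function name and parameters are made up for this sketch.

```python
def failure_timeline(warranty_months, months_to_failure=18, boards=2):
    """List (month_of_failure, covered_by_warranty) for successive unfixed boards.

    Assumes each board fails exactly `months_to_failure` months after it is
    installed, and that the warranty runs from the original purchase date.
    """
    events = []
    for n in range(1, boards + 1):
        month = n * months_to_failure
        events.append((month, month <= warranty_months))
    return events


if __name__ == "__main__":
    # Hypothetical 24-month warranty: the first failure is covered, the second is not.
    for month, covered in failure_timeline(24):
        status = "covered" if covered else "out of warranty"
        print(f"failure at month {month}: {status}")
```

With a 24-month warranty the original board dies at month 18 and is replaced, but an identical replacement would die at month 36, a year after coverage ends, which is the scenario the comment is warning about.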

myrandomcomment · on Feb 7, 2017

Arista uses AMD and Intel. Their 1st switch (7124S & 7148SX) was a dual core AMD.

Jaecen · on Feb 6, 2017

This title seems incorrect. The article doesn't specify any other vendors or products that have been directly affected by this issue.

iokevins · on Feb 6, 2017

Near the bottom, the article currently states:

"Other vendors using Atom C2000 chips include Aaeon, HP, Infortrend, Lanner, NEC, Newisys, Netgate, Quanta, Supermicro, and ZNYX Networks. The chipset is aimed at networking devices, storage systems, and microserver workloads."

I'm guessing that may represent what OP meant (?)

Jaecen · on Feb 7, 2017

It says other vendors are using the chip, but there's no data on failures of other devices. We don't know what causes the chip to fail, but it's possible that Cisco's application may be uniquely, or at least uncommonly, susceptible.

StillBored · on Feb 7, 2017

Lots of reports from people using other boards with the C2000s having failures after a few months. The Asrock board is common in NAS's because of the 12 SATA ports. Most of the failure reports are similar.

https://www.amazon.com/ASRock-Rack-Mini-Motherboards-C2550D4...

https://www.google.com/search?q=ASRock+C2550D4I+failure+rate...

kyrra · on Feb 6, 2017

The title is technically correct, just annoyingly written. As someone who's built a pfSense box using a Supermicro board with one of the affected chips, I'm definitely sad that I'll have to rip it apart to replace the parts.

ovidiup · on Feb 7, 2017

I have the same problem: I'm using various C2000-based Supermicro boxes running pfSense. The most cost-effective DIY, rack mountable solution for a pfSense box was until now SYS-5018A-FTN4. Do you know if Supermicro issued a technical bulletin about this box?

dhess · on Feb 7, 2017

Last Friday, my OpenBSD firewall, which runs on a SYS-5018A-FTN4, mysteriously crashed. I chalked it up to an alpha particle or something and rebooted. About 12 hours later, it failed again. This time I did some more digging. On the console was the following message:

  NMI ... going to debugger
  Stopped at    acpicpu_idle+0x22d:     nop
  ddb{0}>

I googled it and found one similar report on the OpenBSD misc mailing list from September 2016 [1]. Interestingly, the person who reported the bug was running the same Supermicro board as I was. The report didn't get anywhere other than a vague suggestion that it might be heat related. These boxes run very cool and I didn't think that was likely. I thought it might be a RAM issue and that it was probably just a coincidence that the other person had the same hardware as I, but now I'm inclined to think that both of us have experienced the issue described in TFA.

Seems like I'll be looking for new firewall hardware.

[1] https://www.mail-archive.com/misc@openbsd.org/msg149348.html

nsteel · on Feb 7, 2017

If you were able to reboot the box then you did not hit this issue. When you hit this issue your chip is dead.

yuhong · on Feb 7, 2017

This may be completely unrelated though.

voltagex_ · on Feb 7, 2017

Ah crap. I guess the reseller selling me old-new stock of an Avoton system http://www.supermicro.com/products/chassis/tower/721/SC721TQ... isn't really going to care. Shipping the product back would be ~150+AUD. Can't buy this one in Australia unfortunately.

seltzered_ · on Feb 7, 2017

Yeah, these Avoton-based boards seemed popular in the FreeNAS / DIY home server community for being cheap and low power while supporting ECC RAM. Even the official FreeNAS Mini server used (and still used when I checked last year) a Supermicro board with an Avoton CPU.

Namidairo · on Feb 7, 2017

Websites seem to suggest it's running the C2750, so it does appear to be affected.

stock_toaster · on Feb 7, 2017

Confirmed. I have a FreeNAS mini, and it has a C2750 in it.

haikuginger · on Feb 6, 2017

The Avoton? Shame, really, it seems like a great board otherwise.

Sanddancer · on Feb 7, 2017

It's part of the errata for the chip. Go to:

http://www.intel.com/content/dam/www/public/us/en/documents/...

and search for AVR54

Jaecen · on Feb 7, 2017

I understand that the chip has a flaw. The title claims non-Cisco products are being bricked. What other products have actually been impacted by this issue? The article doesn't give any data, just a list of vendors using the chip. Is there any proof other devices are impacted by this issue?

I'm not claiming that the chip isn't failing; I'm disappointed that the title makes a claim that the article doesn't deliver on.

lysp · on Feb 7, 2017

Check the synology forums linked in the article.

Quite a few units have completely died without explanation. From the descriptions given by the users it does sound like dead cpu.

frik · on Feb 7, 2017

That explains the issues with the C2000 family. Various Linux distros crash randomly, and they don't just crash; sometimes they simply stop opening applications or stop processing, e.g. apt-get.

The BIOS is a piece of shit. It's buggy, the legacy-BIOS support is unstable, the Win7-EFI and Win8-EFI modes are not good either. I patched a Win7 DVD with Win8 files, so that I could install Win7. Now Win7 runs great and stable - but only after I installed various Intel drivers that fixed the hardware flaws.

I am seriously looking forward to the upcoming new AMD CPU - Intel did barely anything the last five years, a 2011 high-end CPU is almost as fast as Intel's 2017 flagship, and cost a lot less back then, had less DRM or other shit that is broken. Intel needs a proper competitor, so a comeback of AMD on the one side, and Apple notebooks with ARM CPUs on the other, are very welcome to stop Intel from sitting on their quasi-monopoly chair.

wang_li · on Feb 7, 2017

>That explains the issues with the C2000 family. Various Linux distros crash randomly, and they don't just crash; sometimes they simply stop opening applications or stop processing, e.g. apt-get.

No it doesn't. Did you read the errata? It completely stops. There's no weirdness. It's just dead.

AstralStorm · on Feb 7, 2017

Random crashes are often a sign of memory corruption, sometimes of a broken power supply or major interference. Not of CPU problems like this one.

nowaynohow · on Feb 20, 2017

Meh, for C2000, it can also be a sign of outdated firmware. We don't get microcode updates for SoCs in the general distribution: either your system vendor does a good job of keeping up with firmware updates, or you are screwed.

yuhong · on Feb 7, 2017

Which motherboard?