"The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports."
I had one of those. The shop I bought it from refused to replace it from their inventory; all they would do is take the motherboard, send it back, and then give me the replacement some weeks later when the RMA process was completed.
Since I needed that machine functioning, I never replaced it (the mobo had some extra SATA ports handled by a different controller, so they kept working and I switched to using them). I suspect a lot of people are in the same boat. I'll never do business with that store again.
Sigh. The perils of maintaining my own data center in the basement for 'fun' are coming to haunt me.
I have 2x C2758 Supermicro boxes running core routing services and a Synology RS2416+ for storage - all on the affected CPU list - guess I better double check my backups are working and allocate some funds for replacement kit in case things go belly up!
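For anyone wanting to check a Linux box against the affected family, here is a minimal sketch. It assumes C2000 parts report a model name like "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz" in /proc/cpuinfo; the regex and the function names are my own illustration, not anything from Intel or a vendor. It only identifies the family and cannot tell which stepping you have.

```python
import re
from pathlib import Path

# Assumption: Atom C2000 parts report a model name such as
# "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz" in /proc/cpuinfo.
C2000_PATTERN = re.compile(r"Atom\(TM\).*\bC2[357]\d{2}\b")


def is_c2000(model_name: str) -> bool:
    """Return True if the model string looks like an Atom C2000 part."""
    return bool(C2000_PATTERN.search(model_name))


def cpu_model_names(cpuinfo_text: str) -> set[str]:
    """Collect the distinct 'model name' values from /proc/cpuinfo text."""
    names = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            names.add(value.strip())
    return names


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for name in sorted(cpu_model_names(cpuinfo.read_text())):
            verdict = "C2000 family, check vendor advisories" if is_c2000(name) else "not a C2000 part"
            print(f"{name}: {verdict}")
```

Appliances like the Synology units usually need SSH access before you can read /proc/cpuinfo, so checking the vendor's spec sheet may be easier there.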
If this is related to the Cisco clock signal component issue, Cisco is handling it in a really poor way. No replacements unless it's under warranty, even though it's a known issue.
Under warranty or anyone who has a TAC subscription. Cisco licensed their products with a ToS that says you can't resell it, and warranties are only valid for people who bought directly from Cisco. You also can't (effectively) get TAC support for a resold device (what Cisco folks call "grey market"). It's not illegal to buy secondhand Cisco product, Cisco just won't support them or let you get software upgrades without paying them a ton of money.
100%, this wording is to make sure that grey market buyers aren't covered under the replacement. Basically anyone who bought from Cisco will be able to get a replacement.
True, only consumers get a mandatory two-year warranty in the EU. Although I think they would have a hard time in court not replacing it, since it was sold with a defect (they admit it's a defect).
Cisco is like Apple in that regard: you pay a premium but the service is nowhere near premium.
Wonder how that is applied in New Zealand, where the Consumer Guarantees Act basically requires sellers to sort problems out within "a reasonable time frame" of sale. It's a fantastic piece of legislation.
The CGA only applies to consumers and not business customers, and only applies between the end reseller and the customer. Cisco doesn't really sell directly to consumers.
It seems that this isn't entirely the case.
From TFA: "if your product was under warranty as of November 16, 2016, you are still eligible to replace your products", and "Cisco is offering to provide replacement products…even if they have not failed".
Without more information, I wouldn't like to speculate on the cutoff date of November 16, 2016, but so far it doesn't sound too unreasonable. We don't really know the scope of the failure, nor relevant batch numbers, etc, so I'd give them the benefit of the doubt, until clearer details emerge.
Why don't they name the customer/supplier when it is obvious once the product is taken apart? Even with the Cisco DDR SDRAM fiasco, it wasn't that hard to figure out that it was Micron DDR SDRAM that was at fault.
ARM, MIPS. (multiple routers use these) Custom FPGA softcore even. Maybe extra ASICs. Cisco is big enough for that. However I suspect Intel might be cheapest to get for performance.
Not great for Intel. These Avoton Atoms were the first Atom chips with respectable performance, so there was a chance to unsully the Atom name.
Also, I'm reasonably sure many of these were sold already permanently affixed to the motherboard, so the fix may be worse than swapping out just a CPU.
I don't think the processor cores in the Avoton chips were anything impressive, but these were the first Atom chips with a lot of I/O bandwidth.
From what I can tell, all of the Avoton chips were sold in a BGA package that required them to be soldered to the motherboard. There isn't a socketed version of Avoton.
"slightly higher expected failure rates under certain use and time constraints" sounds like it shouldn't be observable in the field. Do people suspect Intel is lying, or is this a storm in a teacup?
Realistically, I'm not interested in an RMA for another DS1815 that will fail, followed by (possibly) another RMA once the problem is fixed in silicon. I'm also very uncertain about slotting the drives into a new unit and successfully recovering the RAID.
Instead, I'll shut down the NAS and buy a replacement from another company (QNAP probably) and transfer the data. The other options feel too risky.
I'm guessing the CPU must be soldered to the board on these? I have a 1815+ that is currently working, and now I'm afraid to shut it off. I wonder if the DS2015 (not sure of the number, the 10GbE model) uses this defective part?
Probably to save money. That sucks, because even if you swapped it with another defective one, it would be worth it if you just had to replace the CPU every 18 months.
Learned in another area of this post that yes, they are soldered on, but that's Intel's choice. They only offer the CPU in a BGA (ball grid array) form factor. There's no such thing as a BGA socket, other than some specialty test unit things that aren't suitable for real world use.
It's expensive, no doubt. Since I'm already in $1k for the 1815+, maybe if/when mine dies, I will try to haggle with them for a $600 upgrade instead of RMA'ing the old unit. Might work out better for everyone: I know the new unit doesn't have the faulty Atom chip, I have a faster NAS, and that's one less RMA they have to turn around (I'd still mail back the broken one).
Personally I haven't had that much experience with their stuff, but I remember seeing a brand new router running hot with two fans blowing in it (other routers at the time were half the size and had no fans). I understand that Cisco is supposed to be the de facto networking standard, but is it really worth the name?
Probably.. but that comes down to a lot of factors though. It also depends on what kind of gear you're looking to buy. Everyone will try to be compatible with Cisco's interpretation of a given standard, if you go elsewhere, it may or may not be 100% compatible with your other equipment. Also, more IT networking guys will be more familiar with Cisco.
That said, they are definitely more expensive than their peers. But then again, an Escalade is more expensive than a Tahoe.
Can't post to The Register, since they don't have ACs.
Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is a secondary bus where you hang old ISA-style devices, like the system FLASH. If the FLASH is the only thing on there, it will mostly render the system unbootable (so, stuff that never gets power-cycled would just keep going). But LPC can generate interrupts, and one often hangs other crap off that bus, such as i2c controllers for hot-swap bays, motherboard management controllers, and other sensors. In that case, you can expect severe runtime misbehavior.
The issue is caused by "continuous degradation due to use", so repairing it is easy, if costly: replace the motherboard with a new one under warranty (and even if out of warranty period wherever this kind of "stealth" manufacturing defect is not subject to warranty time period limitations, such as in Brazil). It will "reset" the counter. This is your zero-day solution to the issue.
Depending on time-to-market for the new stepping (hardware revision) B1/C0 of the Atom C2000, you might need an interim solution, which is the "platform-level change", i.e. redesigned board with extra components that work around Intel's hardware design error. As soon as you have these, you start using these to replace any boards returned due to the defect, or start a "recall" to preemptively replace boards.
Depending on the total cost of the board plus other components, you keep the old boards you replaced around, and when revision B1/C0 of the Atom C2000 is out, you BGA-replace them in a factory (about US$ 25 per board in large volumes, if that much), maybe replace any liquid electrolytic capacitors and other crap that ages badly, and use the boards either as new or as refurbished, depending on your corporate/regulatory ethics. This kind of repair almost always really resets the board's MTBF. If Intel supplies the replacement Atoms at no charge, the cost of repair might well be far less than the cost of the production run for boards you'd want to keep around for warranty services, anyway.
Mind you, at 1.5 years per failure, it will be a rare piece of legislation or contract that forces more than one replacement... so, let's hope they don't replace a faulty board with a brand-new, virgin, but-still-timebombed board. You'd have trouble replacing it a second time if it fails after the warranty period.
"Other vendors using Atom C2000 chips include Aaeon, HP, Infortrend, Lanner, NEC, Newisys, Netgate, Quanta, Supermicro, and ZNYX Networks. The chipset is aimed at networking devices, storage systems, and microserver workloads."
It says other vendors are using the chip, but there's no data on failures of other devices. We don't know what causes the chip to fail, but it's possible that Cisco's application may be uniquely, or at least uncommonly, susceptible.
Lots of reports from people using other boards with the C2000s having failures after a few months. The ASRock board is common in NASes because of the 12 SATA ports. Most of the failure reports are similar.
The title is technically correct, just annoyingly written. As someone who's built a pfSense box using a Supermicro board with one of the affected chips, I'm definitely sad that I'll have to rip it apart to replace the parts.
I have the same problem: I'm using various C2000-based Supermicro boxes running pfSense. The most cost-effective DIY, rack mountable solution for a pfSense box was until now SYS-5018A-FTN4. Do you know if Supermicro issued a technical bulletin about this box?
Last Friday, my OpenBSD firewall, which runs on a SYS-5018A-FTN4, mysteriously crashed. I chalked it up to an alpha particle or something and rebooted. About 12 hours later, it failed again. This time I did some more digging. On the console was the following message:
NMI ... going to debugger
Stopped at acpicpu_idle+0x22d: nop
ddb{0}>
I googled it and found one similar report on the OpenBSD misc mailing list from September 2016 [1]. Interestingly, the person who reported the bug was running the same Supermicro board as I was. The report didn't get anywhere other than a vague suggestion that it might be heat related. These boxes run very cool and I didn't think that was likely. I thought it might be a RAM issue and that it was probably just a coincidence that the other person had the same hardware as I, but now I'm inclined to think that both of us have experienced the issue described in TFA.
Seems like I'll be looking for new firewall hardware.
Ah crap. I guess the reseller selling me old-new stock of an Avoton system http://www.supermicro.com/products/chassis/tower/721/SC721TQ... isn't really going to care. Shipping the product back would be ~150+ AUD. Can't buy this one in Australia, unfortunately.
Yeah, these Avoton-based boards seemed popular in the FreeNAS / DIY home server community for being cheap and low power while supporting ECC RAM. Even the official FreeNAS Mini server used (and still used when I checked last year) a Supermicro board with an Avoton CPU.
I understand that the chip has a flaw. The title claims non-Cisco products are being bricked. What other products have actually been impacted by this issue? The article doesn't give any data, just a list of vendors using the chip. Is there any proof other devices are impacted by this issue?
I'm not claiming that the chip isn't failing; I'm disappointed that the title makes a claim that the article doesn't deliver on.
That explains the issues with the C2000 family. Various Linux distros crash randomly, and they don't just crash: sometimes they simply stop opening applications or stop processing commands, e.g. apt-get.
The BIOS is a piece of shit. It's buggy, the legacy-BIOS support is unstable, the Win7-EFI and Win8-EFI modes are not good either. I patched a Win7 DVD with Win8 files, so that I could install Win7. Now Win7 runs great and stable - but only after I installed various Intel drivers that fixed the hardware flaws.
I am seriously looking forward to the upcoming new AMD CPU. Intel did barely anything in the last five years: a 2011 high-end CPU is almost as fast as Intel's 2017 flagship, cost a lot less back then, and had less DRM or other shit that is broken. Intel needs a proper competitor, so a comeback of AMD on the one side, and Apple notebooks with ARM CPUs on the other, are very welcome to stop Intel from sitting on their quasi-monopoly chair.
>That explains the issues with the C2000 family. Various Linux distros crash randomly, and they don't just crash: sometimes they simply stop opening applications or stop processing commands, e.g. apt-get.
No it doesn't. Did you read the errata? It completely stops. There's no weirdness. It's just dead.
Meh, for C2000, it can also be a sign of outdated firmware. We don't get microcode updates for SoCs in the general distribution: either your system vendor does a good job of keeping up with firmware updates, or you are screwed.
"The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports."
http://www.anandtech.com/show/4143/the-source-of-intels-coug...