Almost nine years ago, Cloudflare was a tiny company and I was a customer not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more. Cloudflare had pushed out a change to its use of Protocol Buffers and it had broken DNS.
I wrote to Matthew Prince directly with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here) to which I replied:
From: John Graham-Cumming Date: Thu, Oct 7, 2010 at 9:14 AM Subject: Re: Where's my dns? To: Matthew Prince Awesome report, thanks. I'll make sure to call you if there's a problem. At some point it would probably be good to write this up as a blog post when you have all the technical details because I think people really appreciate openness and honesty about these things. Especially if you couple it with charts showing your post launch traffic increase. I have pretty robust monitoring of my sites so I get an SMS when anything fails. Monitoring shows I was down from 13:03:07 to 14:04:12. Tests are made every five minutes. It was a blip that I'm sure you'll get past. But are you sure you don't need someone in Europe? :-)
To which he replied:
From: Matthew Prince Date: Thu, Oct 7, 2010 at 9:57 AM Subject: Re: Where's my dns? To: John Graham-Cumming Thanks. We've written back to everyone who wrote in. I'm headed in to the office now and we'll put something on the blog or pin an official post to the top of our bulletin board system. I agree 100% transparency is best.
And so, today, as an employee of a much, much larger Cloudflare I get to be the one who writes, transparently about a mistake we made, its impact and what we are doing about it.
The events of July 2
On July 2, we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push a rule to protect against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.
Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.