On Tuesday, for less than an hour, early in the workday, it felt like the internet was down for many. The cause? Cloudflare went down. Cloudflare offers web services to over 16 million websites. That includes sites like HubSpot, Medium, UpWork, 9gag, Discord, Sirius XM, Shopify, Coinbase, Canva, Soundcloud, Buzzfeed, and Capitalogix.
Even Downdetector was down.
That means when Cloudflare went down, so did a non-trivial portion of the internet. W3Techs reports that approximately 10% of the internet was affected by the outage.
What happened?
via DigitalAttackMap
There was a massive spike in CPU utilization. At the time, it looked like a DDoS attack. People speculated that it was a Chinese attack meant to disrupt the Hong Kong protests.
Turns out, it was bad code - specifically, a single misconfigured rule within their firewall services. They did a global rollout of the code, and so it affected everyone.
This shows the importance of staged rollouts - testing your releases live with small test groups before releasing them globally.
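To make the idea concrete, here is a minimal sketch of a staged rollout in Python. It is not how Cloudflare deploys anything. The stage percentages, the host names, and the `deploy` and `is_healthy` callbacks are all hypothetical stand-ins for whatever your own deployment tooling provides.

```python
import hashlib

# Hypothetical stages: the fraction of hosts that should have the release
# by the end of each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]


def bucket(host: str) -> float:
    """Map a host name to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def hosts_in_stage(hosts, fraction):
    """Return the hosts whose bucket falls under the given fraction."""
    return [h for h in hosts if bucket(h) < fraction]


def staged_rollout(hosts, deploy, is_healthy, stages=STAGES):
    """Deploy stage by stage and halt at the first unhealthy host.

    `deploy` and `is_healthy` are caller-supplied callbacks (assumptions
    of this sketch). Returns (completed, hosts_deployed_so_far).
    """
    deployed = []
    done = set()
    for fraction in stages:
        for host in hosts_in_stage(hosts, fraction):
            if host in done:
                continue  # already covered by an earlier, smaller stage
            deploy(host)
            done.add(host)
            deployed.append(host)
            if not is_healthy(host):
                return False, deployed  # stop before the damage spreads
    return True, deployed
```

With a health check that catches a CPU spike, a bad rule in this sketch stops at the first host it lands on instead of reaching the whole fleet. Hashing the host name keeps the test group stable between releases.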
Here's a great write-up from Andy Ellis on preventative measures in the future.
The reality is that using a CDN is still helpful. Cloudflare's downtime doesn't mean you shouldn't use one.
It does mean we should be thinking about what failsafes are needed to keep the internet infrastructure working in the event of attacks or failures.
Attacks are becoming more common, and as we now expect constant improvements/releases to software, we can expect more company errors as well. Facebook had similar issues on Wednesday.
Think of how much relies on the internet as a backbone. It's crazy to think about the impact sustained downtime would have: billions of dollars in business not happening, banking systems down, and so on. Realistically, if the entire internet goes down, we likely have bigger issues to worry about, but this event shows that large swaths of the internet can be affected at once.
Would a decentralized network help? Are smart contracts necessary for that? Is there a CDN for CDNs?
It feels like we often end up with more questions than answers.
That uncertainty is one reason many companies opt for a hybrid cloud with plenty of on-premises compute.
What do you think?