Last Friday, CrowdStrike experienced a significant issue with a content update for its Falcon sensor on Windows hosts, causing widespread IT outages affecting customers worldwide, including airlines, banks, shops, and broadcasters.
This outage dominated the news. Here are links to stories from Google and Bing search.
Surprisingly, I didn't hear about this until one of my employees told me. That is probably because I use a Mac, and the outage didn't affect Macs.
Perplexity is an AI chatbot-powered research and conversational search engine that answers queries using natural language predictive text. I've been quite impressed with it. If you haven't tried it yet, I highly recommend checking it out.
Here is an excerpt of what it told me about the incident:
Character of the problem:
A defect was found in a single content update for CrowdStrike's Falcon sensor on Windows hosts.
This caused Windows machines to crash, resulting in a "blue screen of death" (BSOD) and getting stuck in a restarting state.
Scope of the impact:
Only Windows hosts were affected; Mac and Linux systems were not impacted.
The issue affected various organizations globally, including banks, airlines, and other businesses.
CrowdStrike's response:
The company quickly identified and isolated the issue.
A fix was deployed, and the problematic update was pulled.
CrowdStrike is actively working with affected customers to resolve the problem.
United, American, and Delta all called complete ground stops. Microsoft was hit. Public displays around the world showed the blue screen of death.
All because CrowdStrike pushed a global update. That update caused Windows computers running CrowdStrike's Falcon sensor to crash. Even worse, many of these computers can only be fixed in person by an IT professional. Because an affected machine gets stuck in a loop of crashing to the Blue Screen of Death and restarting, IT can't just remote in to fix it.
It's probably the largest IT outage in history and has caused untold damage. It affected emergency services in some states and countries.
Even with a fix issued, it may take days for things to return to normal, as each affected endpoint requires individual attention, and some systems might have suffered complete failures.
It's a healthy reminder that our 'robust' infrastructure isn't always so robust ... and that tech consolidation and concentration can have consequences.
While there are a seemingly infinite number of tech companies now, the infrastructure has consolidated into the hands of very few. We need to think about our digital resilience, not just in the systems we run, but in the globally connected systems and in the growing Internet of Things.
Does your business have all of its eggs in one basket? Does it have failsafes in case of an emergency?
As I observe the growing adoption of AI, I notice that people tend to emphasize its capabilities over its potential failures. In our increasingly interconnected and automated world, ensuring business continuity is more crucial than ever.
Lessons From The CrowdStrike Outage ...
Click here to see the whole perplexity.ai response.
The scope of the outage was surprising.
Posted at 11:48 AM in Admin, Business, Current Affairs, Gadgets, Market Commentary, Trading, Web/Tech | Permalink