What Went Wrong With Crowdstrike?

On Ask the Tech Guys, Leo Laporte and Mikah Sargent recap the BSOD catastrophe of July 19, 2024. What happened at Crowdstrike that caused them to release such a widespread buggy update?

TWiT may earn commissions on certain products.

#crowdstrike #informationtechnology #windows
About us:
TWiT.tv is a technology podcasting network located in the San Francisco Bay Area with the #1 ranked technology podcast This Week in Tech hosted by Leo Laporte. Every week we produce over 30 hours of content on a variety of programs including Tech News Weekly, MacBreak Weekly, This Week in Google, Windows Weekly, Security Now, and more.
Comments
Author

Early last year, you might remember, CrowdStrike effectively laid off a few hundred employees under the cover of return-to-office mandates. In other words, a lot of people with the talent to easily find another job simply left, presumably leaving behind less experienced and/or less qualified workers, who one might further assume also had to carry the extra workload those individuals had been handling up to that point. The combination of inexperience and overburden can easily cause a cultural drift toward cutting process corners to meet due dates. Also note that while the current Windows outage got publicity due to its massive blast radius, CrowdStrike has done this several times recently, taking down Debian and Rocky Linux systems. There appears to be a pattern here, and I would not be surprised to learn that it arose out of that stealth layoff from last year.

richardbrekke
Author

Leo. There is so much to be learned from this event. To say there is nothing to learn resigns oneself to repeating this over and over. We can only get better by learning.

JeanPierreWhite
Author

Hey Microsoft, I'm on the road between Melbourne and Perth, stopping at some very remote 24-hour diners that on day 3 still have the BSOD and are handwriting all orders and only taking cash. Can you charter a jet, along with a fleet of helicopters, with a bunch of these USBs? Just send the bill to George over at CrowdStrike. He'll know what it's about.

gslim
Author

Why wasn't the update sandbox-tested at CrowdStrike? Why aren't critical infrastructure servers staging updates from their vendors? This is what is done in my workplace, because we do not trust vendors to properly test their own software. Protecting against zero-days needs to be balanced with uptime.

loup
Author

the failure mode is that any kernel-level code can cause this, and every driver is kernel-level code.

every kernel is vulnerable to this, and it will cause the kernel to crash. the potential difference is how they handle it.

the solution is to record success for each driver before the next one is loaded, and when the system reboots, just disable the driver that crashed.

this reduces the fix to just power cycling the machine. after that, it can just tell the OS and driver vendors that the driver broke.

mandatory automatic updates are a bad idea, as you then cannot test them on a canary machine.

however this whole thing could have been avoided even with automatic updates.

first, they could have done continuous integration, creating a checksum file after those tests were done and making sure the driver update code checks the checksum.

then you deploy to test machines, preferably using continuous delivery.

at this point the files are on both the test machines and the ci/cd server, and you can test they match before pushing them to the public.

then you do a canary release cycle, gradually releasing to more and more people.

when the software update goes out, the client-side update software also checks that the file checksums match. if they do not, it refuses to install the file, so the kernel is not broken, and the canary release stops.
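
The checksum and canary steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CrowdStrike's actual pipeline: the function names (`digest`, `client_accepts`, `canary_rollout`), the stage fractions, and the `install` callback are all hypothetical, and a real updater would likely use a cryptographic signature rather than a bare SHA-256 digest.

```python
import hashlib
import hmac
from typing import Callable, Sequence


def digest(payload: bytes) -> str:
    """SHA-256 hex digest, recorded by CI once the tests have passed."""
    return hashlib.sha256(payload).hexdigest()


def client_accepts(payload: bytes, expected: str) -> bool:
    """Client-side check: refuse any file whose digest differs from CI's."""
    return hmac.compare_digest(digest(payload), expected)


def canary_rollout(
    payload: bytes,
    expected: str,
    machines: Sequence[str],
    install: Callable[[str, bytes], bool],
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # assumed fractions of the fleet
) -> dict:
    """Release to growing fractions of the fleet; halt on the first failure."""
    done = 0
    for fraction in stages:
        target = max(1, int(len(machines) * fraction))
        for machine in machines[done:target]:
            # The file is verified before it is ever handed to the driver.
            if not client_accepts(payload, expected):
                return {"halted": True, "reason": "checksum mismatch", "updated": done}
            # install() is a hypothetical hook that returns False when the
            # machine does not come back healthy after the update.
            if not install(machine, payload):
                return {"halted": True, "reason": f"unhealthy: {machine}", "updated": done}
            done += 1
    return {"halted": False, "reason": "", "updated": done}
```

With 200 machines and the default stages, a bad file would stop after at most the first two canaries instead of reaching the whole fleet.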

finally, the OS can track which driver it is loading, and after the kernel panic it can just block that driver, so you only need a reboot to recover.
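
The recovery idea can be illustrated with a toy simulation. It is only a sketch of the bookkeeping, assuming a small state file that survives the crash; it is not how Windows or any real kernel loads drivers, and the `BootGuard` class, the state file layout, and the driver names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence


class BootGuard:
    """Remembers which driver was loading so a crash can be blamed on it."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        if state_file.exists():
            self.state = json.loads(state_file.read_text())
        else:
            self.state = {"loading": None, "blocked": []}

    def _save(self) -> None:
        self.state_file.write_text(json.dumps(self.state))

    def boot(self, drivers: Sequence[str], load: Callable[[str], None]) -> List[str]:
        # A leftover "loading" marker means the previous boot died inside
        # that driver, so it is blocked from now on.
        crashed = self.state["loading"]
        if crashed and crashed not in self.state["blocked"]:
            self.state["blocked"].append(crashed)

        loaded = []
        for name in drivers:
            if name in self.state["blocked"]:
                continue
            self.state["loading"] = name
            self._save()   # persist the marker before running driver code
            load(name)     # a crash here leaves the marker on disk
            loaded.append(name)

        self.state["loading"] = None  # clean boot: nothing to blame
        self._save()
        return loaded
```

In this model, a driver that crashes during one boot is skipped on the next, so a power cycle is enough to recover. The blocked list could then be reported to the OS and driver vendors.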

none of this was done, else it could not have happened, so the blame belongs squarely to CrowdStrike for shipping the broken driver, and to Microsoft for not fixing the recovery model after McAfee did exactly the same thing.

none of this is new tech, so the only lesson to learn is to actually learn how to do your job, then do it.

grokitall
Author

I've been dealing with microcomputers since 1979 (yes, I'm that old..LOL) and this has been going on forever and will

eddy
Author

The USB key doesn't, and can't, work on computers that use Microsoft BitLocker drive encryption when the recovery key is not available. Many IT departments don't record the BitLocker recovery keys for end-user systems because of security concerns over what could happen if those keys were exfiltrated from the company. Instead, they opt to discard the recovery keys so that nobody can access the hard drives, implement a device replacement policy, and mirror any user data on the company's servers. It would be logical for all kiosk systems and all secure remote employee systems to be managed with this approach.

Apparently, CS doesn't (or didn't in this case) implement a fail-safe strategy such as a staged update, or use Windows System Restore to revert to the last known good state. Logically, however, if they had, it would have given hackers another vector for attacking CS-protected machines.

Will IT departments learn to manage BitLocker recovery keys for critical systems better? Will CS implement some kind of fast recovery that doesn't create new vectors of attack? Could some kind of client-side config update validation be implemented that doesn't create a new vector for attack? Will CS hire Steve Gibson to direct a new reliable, secure and failsafe sensor in Assembly Language? Only time will tell.

mjmeans
Author

So home users really don’t need AV? Is Microsoft Defender sufficient?

aaronstevens
Author

We need more competition and choice in the commercial operating system market.

PerryGrewal
Author

If there is anything I've learned from my short time in IT, it's to never change anything on a Friday or a Monday.

jaygreentree
Author

I thought there was a weakness in the way Windows handles kernel extensions?

An.Individual
Author

I work remotely for a company where most people work hybrid, so I had to walk people through the process.

HitnRunTony
Author

I don't know about "not using AV software". I decided to pay for Avast Premium and the folder monitoring feature saved my bacon. Turns out the solution I was using to get around Microsoft's terrible start menu (RocketDock) tried to access a folder with bank statements, and my AV caught it and asked me if I wanted to block it.

Sure, I know it's old software, but it doesn't have online features, so I didn't think much of it. Now if it was part of a daisy-chained attack, I guess I'm compromised elsewhere and done for anyway?

EDIT: I also feel 3rd party solutions are always faster than the solution included in Windows.

alexrodasgt
Author

CrowdStrike is a billion dollar company, with a B. They're trusted by critical government, public, and private services and they shafted each and every one. The lack of outrage from our authorities is infuriating!

ProfessionalBirdWatcher
Author

It's like they should have had a test layer in place before pushing the updates live to machines globally...??? 🙂

davelogan
Author

Someone on Reddit photoshopped the Las Vegas Sphere one.

cuebal
Author

I don't agree with Microsoft that it's not their problem. It most certainly is, because they have built an OS that is so fragile and open to failure, with no easy recovery tools, as some of the IT guys on the front line of this disaster found out.

In addition to Microsoft, the companies running CrowdStrike are also to blame. They have a system that auto-updates with no change management or testing before the update affects production systems. Corporations don't think desktop PCs are that important, but clearly they are, and the departments in organizations that do spend time with desktop systems are the security guys, who see them as a threat and therefore throw out all sensible release methodology in order to be "more secure". There is such a thing as too much security.

JeanPierreWhite
Author

Sounds like something that could have easily been tested.

jameslarosa
Author

Does the US Defense Department use CrowdStrike?

neiltsubota
Author

I assume that is Windows NT 3.51 server and not the desktop Windows 3.1. Two completely different things.

JanRademan