CrowdStrike Exposes a Fundamental Problem in Software


Whew, what a disaster! I share my thoughts about the whole CrowdStrike situation and the fundamental problem that I think lies at the core of this. Let me know what you think about this in the comments.

🔖 Chapters:
0:00 Intro
0:11 Who is CrowdStrike?
0:45 Recap of Friday outage
1:11 Rant time
2:03 How it could have been avoided
2:53 A fundamental dichotomy
3:47 Things will get worse

#arjancodes #softwaredesign #python
Comments

Your kernel has crashed. No malicious code can be executed. Your computer is completely protected now. Thank you for choosing our company!

MysticCoder

I think the even more fundamental problem here is the security software mono-culture. I know CrowdStrike is big, but honestly, I was surprised when I heard in the news how broadly sweeping the impact was across companies and even across industries. If everyone's using the same software, that provides a ripe attack vector for hackers. 😒

ropro

It's partly about $$$ and partly about how everything nowadays is expected to happen at speed. Back in the day (30 years ago) I worked for a bank. We maintained a very large enquiries counter system. Before anything got pushed out to branches, it was tested for weeks. We had dozens of test engineers and they would run through every conceivable action. Then, and only then, would a release go out to a local branch. This would be tested in the wild for a week. Then a small group of branches for two weeks, then a larger group, then finally the main group. The result was that very few (if any) show-stoppers made it to production. This meant a slow cadence of releases, though. Also, this was a large project with extensive management backing, so cost was not really a factor (within reason).

This type of behaviour would never fly today. Everything has to be done on the cheap, with minimal testing, just "get it out there". I call it the "just get it f**king done" attitude - this is very common nowadays, especially among MSPs.

kwas

So we have the CrowdStrike option ENABLED so that CrowdStrike won't push the latest version of their software to us (we stay one version behind) - apparently they don't actually even check for this, so we got it anyway. Absolutely shoddy development :(

AMMullan

People seem to be overlooking the glaring fact that they pushed an update that was corrupted or failed its checksum, which means there was a wide-open vulnerability that would have allowed man-in-the-middle exploits or injecting modified files directly into the kernel…

whatcouldgowrong

The CEO of CrowdStrike, George Kurtz, was the Chief Technology Officer of McAfee in 2010, when a security update from that antivirus firm crashed tens of thousands of computers.

ying-ymut

My career has been spent leading engineering organizations. This is not a new issue or a unique issue. Bad driver code crashes systems. Because of that, the industry has created well-known and effective ways to prevent these problems. You've listed them.

The issue here is a company with widespread driver releases that failed to follow those practices. The free market has created a process for handling that, and it is called competition and consumer choice.

James-hbqu

The CrowdStrike disaster didn't strike because they needed to move fast, but because they obviously hadn't tested this specific update on a single Windows machine. If they had, they'd have immediately noticed it crashes. And they already made a similar mistake in April. That time it could be somewhat forgiven, because it only occurred on two Linux distributions that hadn't been in their test matrix.

on_wheels_

Most surprising is that PCs still don't use A/B installs of the OS, where you use one copy and update the other copy, then switch over to the updated copy, and you can switch back if the update fails for some reason. With disk space so cheap, you'd think every Linux/Mac/Windows PC would use that by now. In Linux at least you can revert to a prior kernel version.

ChristianSteimel
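
A minimal sketch of that A/B-slot idea, in Python purely for illustration; the slot names and the install/health-check callbacks are hypothetical placeholders, not any real OS API:

SLOTS = ("slot_a", "slot_b")

def update_and_switch(active, install, healthy):
    # Write the new image into whichever slot is currently idle.
    inactive = SLOTS[1] if active == SLOTS[0] else SLOTS[0]
    install(inactive)
    # Only flip the boot pointer if the freshly updated slot passes a health
    # check; otherwise keep booting the known-good slot, which stays untouched.
    return inactive if healthy(inactive) else active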

But you can have it both ways: it's called rolling updates. You don't deploy software to a billion endpoints in one go.

metamadbooks
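
A minimal sketch of what such a ring-based rollout could look like, assuming hypothetical deploy_to() and error_rate() helpers; it simply stops promoting the update once a ring starts failing:

RINGS = ["canary", "early_adopters", "broad", "everyone"]

def staged_rollout(update, deploy_to, error_rate, max_error_rate=0.01):
    for ring in RINGS:
        deploy_to(ring, update)                # push only to this ring
        if error_rate(ring) > max_error_rate:  # e.g. crash/BSOD telemetry
            return False                       # later rings never get the update
    return True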

Something that needs more highlighting in this issue is that companies have in recent years been offloading their IT resources while still adopting external, overseas-managed (i.e. managed in the US) solutions. Companies should always have an in-house team ready to respond to system failures. Informed, careful companies would only have had a couple of hours of downtime...

lumeronswift

A mechanism that rolls back an update after X failed boots would help a lot here. My router does this: it keeps a copy of the old firmware it can automatically revert to in case flashing a new firmware image bricks it.

SUSE's MicroOS does something similar by having a stateless OS and transactional updates that are snapshotted in the Btrfs file system. If it crashes and reboots, it'll automatically roll back to the snapshot from before the update while preserving user data.

askii
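
A minimal sketch of that revert-after-N-failed-boots guard; the counter file location is made up, and the snapshot rollback command is just an example of what a Btrfs/snapper-based system might invoke:

from pathlib import Path
import subprocess

COUNTER = Path("/var/lib/boot_attempts")   # hypothetical location
MAX_FAILED_BOOTS = 3

def on_boot_start():
    # Early in boot: if we've already failed too many times, roll back.
    attempts = int(COUNTER.read_text()) if COUNTER.exists() else 0
    if attempts >= MAX_FAILED_BOOTS:
        subprocess.run(["snapper", "rollback"], check=False)  # revert to pre-update snapshot
        attempts = 0
    COUNTER.write_text(str(attempts + 1))

def on_boot_success():
    # Reaching a healthy state resets the counter.
    COUNTER.write_text("0")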

Humans tend to think they can sacrifice quality for speed, which works for some time and then fails miserably. It's a bit like the uncertainty principle: there is a fundamental limit that cannot be cheated.

wernerlippert

I want to add that the fact that CrowdStrike is so widely used makes it a target for bad actors, and perhaps how it operates internally, which seems to be monolithic, is also a problem. We also do not know what government and military systems were affected by this "bug". Regardless of the other bad practices that were at play, CrowdStrike itself may want to consider a less monolithic approach and perhaps break up its platform into shards, such that entire industries are not impacted by one bad software update or a bad pod.

_SR_

This is a reminder of how fragile our IT solutions are. Imagine a solar storm occurring and the devastation it would cause! We need a plan B for critical infrastructures to always be in place!

samarbid

So, basically Crowdstrike could not even secure itself against itself. Well done Crowdstrike, well done! (Slowly clapping) To Microsoft, get rid of Crowdstrike, no IFS and no BUTTS!

keithnsearle

I find it utterly incredible that they don’t test the update on a sandboxed system before sending it out.

MadeleineTakam

My rage at everyone downplaying this for CrowdStrike is immeasurable. This is a billion-dollar company, with a B, trusted by critical government, public, and private services, and they shafted each and every one of them. The lack of outrage from our authorities is absolutely disgusting. It speaks volumes about the state of cybersecurity and tech in general.

ProfessionalBirdWatcher

This was an embarrassing failure for Crowdstrike. All they had to do was test their patch on Windows PCs prior to release, and they would have seen those PCs blue screen. They could have fixed the issue, tested again, and THEN deployed. The more devices you’re responsible for, the greater the duty to test prior to deployment. This was negligence, pure and simple, and there should be a class action suit against Crowdstrike for the damages they caused. Such a suit would destroy Crowdstrike, of course, but that’s as it should be. Our world needs to deter this negligence in the future.

mitchellsmith

The issue essentially is that there is a kernel-mode driver - no doubt WHQL-certified - that is running uncertified p-code from installable 'definition' files, so a bug there will cause the kernel-mode driver to execute bad code and bug-check the system. Perhaps the kernel-mode driver needs better checking and self-defence - could the WHQL certification process require this? The 'fix' is to gain access to safe mode, boot without the driver, and then remove the installable definition files, so perhaps the system should identify crashing 'boot-required' drivers and sideline them if they crash repeatedly.

yogibarista
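
To make the "better checking and self-defence" point concrete, here is a minimal Python sketch of validating a definition file before anything interprets it; the header layout (magic, version, body length, SHA-256 digest) is entirely made up for illustration, not CrowdStrike's actual format:

import hashlib
import struct

MAGIC = b"DEF1"                    # hypothetical 4-byte magic
HEADER = struct.Struct("<4sII")    # magic, version, body length
DIGEST_LEN = 32                    # SHA-256 digest stored after the header

def load_definitions(raw: bytes) -> bytes:
    if len(raw) < HEADER.size + DIGEST_LEN:
        raise ValueError("file too small to contain a valid header")
    magic, version, body_len = HEADER.unpack_from(raw, 0)
    digest = raw[HEADER.size:HEADER.size + DIGEST_LEN]
    body = raw[HEADER.size + DIGEST_LEN:]
    if magic != MAGIC or version != 1:
        raise ValueError("unknown magic or version, refusing to load")
    if len(body) != body_len:
        raise ValueError("length field does not match file size")
    if hashlib.sha256(body).digest() != digest:
        raise ValueError("checksum mismatch, file is corrupt")
    return body   # only a file that passes every check gets interpreted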