The CrowdStrike Crisis Proves The Software Industry MUST CHANGE

The CrowdStrike disaster was a failure of software engineering at the company. It was caused by a development process that was clearly inadequate given the risks inherent in the design and approach to this critical system software.

In this episode, Dave Farley explores in more depth how the CrowdStrike system works, what went wrong, and why it went wrong. He also explores what CrowdStrike could, and should, have done to avoid this failure. This shouldn't be dismissed with a shrug and a "bad things happen sometimes" response.
This was an easily predictable failure, and Dave explains why, and how we, as an industry, should do better.

-


👕 T-SHIRTS:

A fan of the T-shirts I wear in my videos? Grab your own, at reduced prices EXCLUSIVE TO CONTINUOUS DELIVERY FOLLOWERS! Get money off the already reasonably priced t-shirts!

🚨 DON'T FORGET TO USE THIS DISCOUNT CODE: ContinuousDelivery

-

BOOKS:

and NOW as an AUDIOBOOK available on iTunes, Amazon and Audible.

📖 "Continuous Delivery Pipelines" by Dave Farley

NOTE: If you click on one of the Amazon Affiliate links and buy the book, Continuous Delivery Ltd. will get a small fee for the recommendation with NO increase in cost to you.

-

#crowdstrike #softwareengineering #programmer
Comments
Author

So in summary, their code didn't check inputs, their unit testing didn't check invalid inputs, their integration testing didn't check for all deployment configurations, their release strategy didn't canary test reliably, and their management continues to prioritise cashflow over code quality.

mrpocock
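
The first two points in that summary (code that doesn't check its inputs, and tests that never feed it invalid inputs) can be illustrated with a minimal Python sketch. The "name,pattern" record format and the names `Rule`, `parse_rule`, `load_rules`, and `InvalidContentError` are hypothetical, invented purely for illustration; this is not CrowdStrike's file format or code.

```python
from dataclasses import dataclass


class InvalidContentError(ValueError):
    """Raised when a content record does not match what the consumer expects."""


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str


EXPECTED_FIELDS = 2  # hypothetical format: every record is "name,pattern"


def parse_rule(line: str) -> Rule:
    """Parse one record, rejecting malformed input instead of indexing blindly."""
    fields = line.strip().split(",")
    if len(fields) != EXPECTED_FIELDS:
        raise InvalidContentError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}: {line!r}"
        )
    name, pattern = (f.strip() for f in fields)
    if not name or not pattern:
        raise InvalidContentError(f"empty field in record: {line!r}")
    return Rule(name, pattern)


def load_rules(text: str) -> list[Rule]:
    """Load every rule or none: a single bad record rejects the whole update."""
    return [parse_rule(line) for line in text.splitlines() if line.strip()]
```

A unit test for code like this would deliberately pass in truncated, empty, and over-long records and assert that `InvalidContentError` is raised, rather than only exercising well-formed input.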
Author

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair." - Douglas Adams

FrostSpike
Author

About 6 or 9 years ago I heard a VP say, "Why do we need QA? Why can't developers just not write bugs?" She was basically laughed out of the room. In the years since, software companies have abandoned the role of QA so completely that a basic null pointer can shut down the airline industry because no one can be bothered to test kernel-level code updates.

noControl
Author

Yet, they automatically decline job applications from experienced software engineers that happen to be 40+ years of age....

Meritumas
Author

The clusters of competence vanish, be it Boeing, NASA, or IT.
The beancounters rule.
I lost count of how often I told the management that "I do not ship sh.t".
I lost the job, the software was a beta, and 80% ended in disaster; the other 20% went live with a 500 to 1000 percent budget overrun.
CrowdStrike did not perform canary testing. I ask why.

RickTheClipper
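
For readers unfamiliar with the term, a canary (staged) release can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not CrowdStrike's or anyone else's actual deployment system: `staged_rollout`, the `deploy` callback, and the stage fractions are all hypothetical, and a real pipeline would also wait and monitor between stages.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 1.0),
) -> list[str]:
    """Deploy to growing fractions of the fleet and halt at the first failure.

    `deploy` (hypothetical) updates one host and returns True if the host
    reports healthy afterwards. Returns the hosts that received the update.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            healthy = deploy(host)
            updated.append(host)
            if not healthy:
                return updated  # stop: the rest of the fleet is never touched
    return updated
```

With 100 hosts and an update that crashes every machine, only the first host is affected before the rollout halts; pushing to the whole fleet at once would take down all 100.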
Author

Ironically, the bug also illustrates one of the key problems in defining "malware": that developer intent is irrelevant to the functional effect of malware and/or software bugs.

sasukesarutobi
Author

I suspect this might hide a darker secret. They might have had a serious vulnerability that might have been already in play and they rushed to fix it and release the fix as soon as possible, which probably meant they skipped a few steps like testing.

PaulSebastianM
Author

According to CrowdStrike's own incident report, the update that caused the issue was never tested on an actual Falcon sensor. It was only verified by some sanity-check software called a "content validator" that had some "bugs" and didn't detect the problem. The first time the faulty configuration file was actually used was in a customer system... and it crashed. It is inexcusable that a software company could release an update to their customers without actually testing it on a real device. CrowdStrike should be held accountable for the financial loss.

kevinmcnamee
Author

We just had the 737 max. That is engineering 101 with robust standards around it.

As long as regulators refuse to require someone to put their name on it, aka a P.Eng stamp, and treat it like every other critical branch of engineering, this will continue.

justinlynch
Author

During the CrowdStrike outage, the neighbor across the street had a heart attack and died. His invalid wife, who had an emergency monitor, tried to help him and fell to the floor. She remained on the floor for over 12 hours because the emergency monitoring software ran on a Windows machine. I don't know if there was anything that could have been done for him, but she was needlessly suffering for what was likely someone's cost cutting measure.

jeffbaskin
Author

Small correction: user mode is indeed sometimes called ring 1, but incorrectly so. It actually runs in ring 3. Admittedly rings 1 and 2 are barely used, but the hardware does support them.

Evilanious
Author

Imagine if a chip manufacturer introduced a small bug in the CPU microcode, slightly messing up the voltage on the cores, so they would degrade over time. That would be a disaster!

wghfetr
Author

This crisis proved that commentators who positioned themselves as experts in software development had little knowledge of system software development. I was surprised that people with titles like former principal engineer, or with PhDs in computer science, lacked even a basic grasp of system programming.

test
Author

When I was coaching at a large telecom company in 2010-2012, I was struck by how much the phone switch code focused on error and failure scenarios to the point that the "happy path" was almost an afterthought! The systems weren't without their problems, of course, but there was a ton of deliberate thought that went into "how could this go wrong". We should be thinking like that with every single line of code we write, IMNSHO.

daverooneyca
Author

This is a management problem; developers are powerless to change this.

marcbotnope
Author

Been a QA engineer for 10 years now and I can attest to how little care businesses give to solid QA practice.

QA has become an afterthought, so the understanding of it has dropped over time. The result is a low standard of QA and testing in the tech industry, combined with a demand for quicker releases 😬

SpecialK
Author

I've been screaming about this for the last few years. This is an obvious consequence of the way we've been forced to work for years.

AlecBickerton
Author

I am so tired of this idea that SE (Software Engineering) needs to "grow up" or "catch up to the big boys". Not to discount the massive failure that was CrowdStrike, but it's not like Intel didn't bungle multiple series of CPUs with a hardware bug. It's not like electronic devices aren't failing all the time, or the car industry never had to do a mass recall. Even Boeing is having issues with passenger planes.

Try to have an electronics engineer build a custom-made TV for every single different person and you will see the bugs skyrocket as well. Mass production amortizes quality costs. That is the only reason those industries have more "quality": the cost-benefit analysis pushes it.
The issue is that 90% of Software Development isn't Software Engineering. We are building custom apps that run on one single service and differ from one customer to the next. We are building CSS or some command-line tool or some nice-to-have feature meant to show off. That is programming; that is not SE.

SE has source versioning, design patterns, coding tools, static code analysis, unit/integration/e2e testing automation, CI/CD. If you look at the entire range of processes SE has built, we are way, way above and beyond what other industries have for verification processes. If management decides to avoid or ignore them to cut costs, that is a management problem, not an SE problem. We can have as much quality as you wish, but you need to pay for that quality in time to delivery and engineering time.

And this is exactly what seems to have happened with CrowdStrike: your whole spiel of a "push to production ASAP" DevOps mantra caused the issue here, because that is not adequate for a critical piece of software. But management wants to push things fast to "stay ahead of the competition" and "use DevOps". And here we are.

rodrigoserafim
Author

MS did try to push 3rd-party kernel drivers back into user mode and provide the necessary access via a dedicated marshalling API, but all the AV companies made such a stink about it - claiming (without evidence) that Defender would not do the same and thus have a commercial advantage - that the EU took up their cause and blocked MS from doing it. Thus leaving the situation where access is provided to the kernel essentially on a "trust" basis, which obviously can't be verified by the O/S in cases such as this, despite the requirement of certified drivers, as replaceable code/data is being used.

yogibarista
Author

So let's check something out.
- We all agree there were failures in the QA process. We should probably mention that half the QA team was laid off earlier this year under the pretense of a pivot to AI.
- CrowdStrike claims that these templates are generated by AI. They were even spotlighted by NVIDIA as a "no code" solution at their convention. Was the original bug an AI hallucination that was not detected correctly by the QA process due to budget and staff cuts related to AI hype?
- Are these ring 0 changes being released untested because there's some assumption that AI is magically free of human error?

I suspect there are plenty of angles from which we should ask ourselves how much of a factor AI hype played here.

vexorian