When Software Kills: Fatal Bugs in the Therac-25


Join Dave as we explore one of the most shocking and tragic stories in medical history: the story of the Therac-25, a radiation therapy machine that went horribly wrong. In this episode, we'll delve into the fascinating yet disturbing tale of how a seemingly advanced technology ended up delivering catastrophic radiation overdoses to at least six patients between 1985 and 1987.

We'll examine the Therac-25's inner workings, explore the flaws in its software and design, and discuss the devastating consequences of its malfunctions. You'll hear the heart-wrenching stories of patients who suffered from radiation overdoses, and learn about the heroic efforts of medical professionals who worked tirelessly to treat these victims.

This episode is a must-watch for anyone interested in technology, medicine, or the human side of innovation gone wrong. So sit back, relax, and get ready to uncover one of the most chilling tales in the history of science and medicine!

Thanks to BobT for the episode idea!
Comments

Some more details for those wondering what happened under the hood -
It wasn't the lack of a beam spreader; it was the X-ray target that was incorrectly positioned. We make medical X-rays by hitting a [tungsten] target with high-energy electrons. Most of the energy goes into heat, so more than 100x as many electrons need to hit that target to produce an X-ray dose comparable to delivering a beam of electrons directly for treatment. One reason this race condition was hard to catch was that it required the tech to erroneously select an electron treatment (the machine begins to remove the target from the beam line) and then re-enter the correct X-ray mode while the target was still moving. Since the console saw only that the target was in motion, it checked neither which way it was moving nor what state it ended in, so X-ray-level output was applied with the target out of the beam line. The devices used to monitor the radiation produced by the machine do not operate well at 100x the dose rate and significantly under-measure what has left the machine, further increasing the dose to the poor patient. You normally can't feel radiation, but the current of electrons was so high that the patients were essentially zapped as if they had touched a high-voltage wire, which also shredded their cells and DNA.
Unfortunate that these lessons had to be written in blood, but I'm glad that to this day radiation devices have many layers of redundancy, and interlock codes have gotten more descriptive.
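A minimal C sketch of the hazard described above (all names are invented; the real Therac-25 software was hand-written PDP-11 assembly, so this illustrates the shape of the logic flaw, not the actual code):

    #include <stdbool.h>

    typedef enum { MODE_XRAY, MODE_ELECTRON } beam_mode_t;

    struct machine_state {
        beam_mode_t requested_mode;
        bool target_in_motion;   /* true while the target carriage moves   */
        bool target_in_beamline; /* true only when fully seated for X-rays */
    };

    /* Flawed check: treats "target is moving" as "target will end up
     * where we need it", never verifying direction or final position. */
    bool unsafe_ready_to_fire(const struct machine_state *m)
    {
        if (m->requested_mode == MODE_XRAY) {
            if (m->target_in_motion)
                return true; /* BUG: it may be withdrawing, not inserting */
            return m->target_in_beamline;
        }
        return true; /* electron mode needs no target */
    }

    /* Safer check: never decide while the hardware is settling, and
     * verify the measured end state against the requested mode. */
    bool safe_ready_to_fire(const struct machine_state *m)
    {
        if (m->target_in_motion)
            return false;
        if (m->requested_mode == MODE_XRAY)
            return m->target_in_beamline;
        return !m->target_in_beamline;
    }

The point of the safer version is that readiness is derived from an independently measured end state rather than inferred from motion, which is exactly what a physical interlock enforces.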

austinsloop

Radiotherapy linac engineer here: this is why we have layers and layers of interlocks, both physical and in software. Stripping back those layers is a very intentional act (we can and do, but very rarely, and always taking appropriate measures to ensure safe practice).

The safety innovations of today are written in the blood of the past.

keleighshepherd

Glad you're talking about the Therac, but I was hoping for a much more in-depth technical explanation of the code flaws from your expert perspective, given how delightfully detailed you've been on many other technical topics on this channel. If you were ever inclined to do a part 2 of this with more details, I for one would be very interested to see it.

jazzmike_

I won't say who I worked for, but I was an electronic medical records dev. I found a device driver 'feature' on DB2 that added a null terminator at the end of its buffer. The problem was that we used RTF to store notes in the DB, and in some cases entire sections of chart notes were just gone. Doctors had no idea it was happening. I told management and got: "We don't take responsibility for FDA approved, that is on the Dr."
I quit and got out entirely.
Thank you for shining a light on this issue!
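For the curious, here is one plausible mechanism for that kind of silent loss, sketched in C (the data and layout are invented; the actual driver and schema were never made public). RTF is text, but a stored note can end up with an embedded NUL byte, and any downstream code that treats the column as a C string silently drops everything after it:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* a 40-byte note as stored; byte 20 is an embedded NUL */
        char stored[40] = "{\\rtf1 First section";
        memcpy(stored + 21, "SECOND SECTION GONE", 19);

        size_t real_len = sizeof stored;   /* length the DB recorded     */
        size_t seen_len = strlen(stored);  /* length a C-string API sees */

        printf("stored %zu bytes, read back %zu bytes\n", real_len, seen_len);
        /* prints "stored 40 bytes, read back 20 bytes": half the note
         * vanishes with no error -- the silent loss described above */
        return 0;
    }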

lorddorker

For those curious, the ring he mentioned is the Iron Ring worn by Canadian engineers, which is traditionally linked to the Quebec Bridge in Canada, a bridge that collapsed twice during its construction. It's a fascinating story.

FrederickMarcoux

I was a developer on a medical device when the Therac-25 tragedy happened. It was a truly sobering event for those of us in the industry.

You are absolutely correct that many lessons were learned, many regulations written, and many procedures changed, but a handful of the core causes remain a problem to this day, especially in concurrent and real-time systems.

My hope around this is twofold. First, that every developer realizes these problems can occur regardless of the program, the language, the operating system, or the quantity of unit tests.

Second, Nancy Leveson's seminal paper on the disaster should be required reading for every software engineer.
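To make the unit-test point concrete: one of the two flaws documented in that paper (Leveson and Turner, "An Investigation of the Therac-25 Accidents", IEEE Computer, 1993) was a shared one-byte flag that was incremented rather than simply set, so every 256th pass it wrapped to zero and a hardware position check silently vanished. A rough C sketch with invented names (the original was PDP-11 assembly):

    #include <stdint.h>
    #include <stdbool.h>

    static uint8_t class3; /* "setup incomplete" flag, one byte wide */

    extern bool upper_collimator_ok(void);
    extern void inhibit_beam(void);

    void setup_test_pass(void)
    {
        class3++;          /* BUG: 255 wraps to 0; should be class3 = 1 */
        if (class3 != 0) { /* zero is read as "no check required"       */
            if (!upper_collimator_ok())
                inhibit_beam();
        }
        /* On every 256th pass the check disappears. If the operator
         * happens to act in that instant, the beam can fire with the
         * turntable mis-positioned. No amount of unit testing of this
         * function in isolation makes that 1-in-256 timing window
         * against a human operator obvious. */
    }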

miscellaneousHandle

A family member of mine was friends with the woman who was injured by the Therac-25 machine in Hamilton, Ontario. She unfortunately passed away from her cancer within a few weeks, but her autopsy revealed that the overdose had completely destroyed her hip; had she lived, she would have needed a full hip replacement. I remember the first time I heard about that and being absolutely horrified at the idea of being maimed or killed by the revolutionary machine meant to save you.

AECL made a bunch of changes afterwards to try to make it safer, but they didn't correctly identify the issue, and more people ended up being hurt and killed.

nolifenerdwhohasnevergotten

I work for the company that made the Therac (AECL legally continued on as CNL); we went on to make nuclear reactors that are controlled by software. Producing quality code is expensive, and it is the real difference between someone who merely calls themselves a "software engineer" and an actual software engineer.

revcrussell

I'm in my last year of engineering at UofA right now, and there's a mandatory risk management and safety class. So many computer/software engineers roll their eyes about safety. There was even that controversy a few years back when APEGA ordered job boards to stop using "Software Engineer" for non-engineering jobs. All this to say: even today, software is regarded as an ultimately safe tool, perhaps even more so because of the prevalence of the PC. Thanks for your breakdown, Dave!

NyelaKearney

Hi Dave, your videos are all amazing, but this one is one of your best from my perspective. As an engineer in Canada who has worn the ring since the 1990s and moved from a structural to a software career, I can say you articulated exactly what I tell myself, my colleagues, and younger aspiring programmers and engineers. This story should be part of anyone's studies: in school, at university, and at work. I will share it with many. Be well, and thanks again for sharing on your channel.

gtsludovicofratts

I used to be an AP computer science teacher. My students were absolutely given information about the Therac-25 (though with the names of the machine and company removed to prevent unintended liability). They learned of other critical software failures with far-reaching consequences, and of the importance of software quality. My teaching was a long while ago, but I hope those students carried that lesson with them not just in software, but into whatever profession they entered.

brianwithnell

I was taught this case in a Software Quality Management course at uni, and it is single-handedly responsible for my enduring insistence on quality process.

capncoolio

Thanks for touching on the "any idiot shouldn't be able to just call themselves a software *engineer*" subject

SpaceCop

Software engineering has taught me a lot about being intellectually honest and humble. Between compiler errors, my own pre-review testing, code review feedback, automated test failures, and bug reports, I've been repeatedly reminded over the years that no matter how confident I am that I've thought something through correctly, I can still make logical errors, fail to consider certain contexts, etc. I'm glad I work in the video game industry. Heheh, the stakes are much lower.

rowdyriemer

This story reminds me of a similar situation in the rail transportation industry, where I spent the majority of my career. Prior to the 1980s, safety systems for railroads and railways (called interlockings) used relays, all of which were designed and tested for failure conditions, with critical ones guaranteed to be failsafe thanks to springs and gravity. In the 1980s, the industry started developing electronic systems to replace the relays: a large room of relays could be replaced with a rack of two or three microprocessors, an obvious economic and maintenance advantage.

Since the relay circuits are, in essence, Boolean expressions, the microprocessors and their software were written to handle similar Boolean expressions. Each relay circuit was broken into its equivalent Boolean expression and entered into the software as data for processing. Fortunately, a problem with this method was found during early testing. You see, a relay circuit is in essence one large machine with massive parallel processing, while a microprocessor only does one thing at a time, even when many software threads are involved. This caused the kind of race conditions in the software that Dave describes in the video. It was found and rectified in two ways: processes were created involving multiple developers, validators, and testers to ensure correct and safe operation, and some failsafe relays were retained, just in case the software did something other than intended. The industry has been using electronic interlockings for years now, with a wrong-side (unsafe) failure a very rare occurrence.
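A hedged C sketch of that difference (all names invented): a relay network evaluates its whole Boolean expression continuously and in parallel, while software samples each input in sequence, so the world can change between reads.

    #include <stdbool.h>

    extern bool read_track_clear(void);    /* field inputs: may change */
    extern bool read_points_locked(void);  /* at any moment            */

    /* Racy: the points may unlock between the two reads, yet the
     * expression still evaluates as if both held at the same time. */
    bool signal_may_clear_racy(void)
    {
        bool track  = read_track_clear();
        /* <-- inputs can change right here */
        bool points = read_points_locked();
        return track && points;
    }

    /* One common mitigation: re-read until a consistent snapshot is
     * seen, and fail toward the safe state if none is obtained. */
    bool signal_may_clear_stable(void)
    {
        for (int i = 0; i < 3; i++) {
            bool t1 = read_track_clear(), p1 = read_points_locked();
            bool t2 = read_track_clear(), p2 = read_points_locked();
            if (t1 == t2 && p1 == p2)
                return t1 && p1;
        }
        return false; /* no stable reading: keep the signal at stop,
                       * which is the safe state */
    }

Note how the fallback mirrors the relay philosophy: when in doubt, the output drops to the restrictive state, just as springs and gravity pull a de-energized relay to its safe position.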

jimmeade

I have seen this covered by a half dozen YouTubers and almost didn't click. I'm glad I did. You provided a much better explanation of the low-level issues than any other recounting I've seen.

PsRohrbaugh

Great video, Dave! If you aren't aware of them already, look up the radium girls, who painted dials on clocks and meters. They had no idea how dangerous radium could be and considered it a cool, almost novelty material. They often wet the tips of their brushes in their mouths, not knowing how dangerous this was. Then came the stalling of the companies involved in recognizing the problem, and later the insurance companies that didn't want to pay the claims. Many of these young women spent the last months and weeks of their lives in agony. So very sad. They should NEVER be forgotten!

waaos

Thank you for bringing this issue up. I work in medical devices, mostly in automated robotic surgery. Reading about the Therac-25 is mandatory for anyone working on mission-critical code that could hurt or kill people. This is why IEC 62304 was created, the standard that defines software life cycle requirements for software used in medical devices.

Erik_The_Viking

This is one of the reasons why I have reservations about autonomous vehicles. The risk of loss of life is huge, and the automotive industry tends to rush development to be the first to market.

drozcompany

I was very interested when a co-worker doing embedded software told me about a document titled "MISRA C" (from the Motor Industry Software Reliability Association, for the C language).
While some might read it as a simple list of good programming practices, I read it as a list of rules, each developed because something bad happened when it was done another way, with the new practice written to prevent a recurrence. Much like the safety labels on step ladders, which were added to prevent another person from suffering that type of ladder injury.
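To give a flavor of that (a hedged sketch; the rule number is from MISRA C:2012, and the code itself is invented): Rule 16.4 requires every switch statement to have a default label, precisely so that an enum value added later cannot silently fall through an unhandled path.

    typedef enum { MODE_OFF, MODE_STANDBY, MODE_TREAT } machine_mode_t;

    int beam_power_watts(machine_mode_t m)
    {
        switch (m) {
        case MODE_OFF:     return 0;
        case MODE_STANDBY: return 5;
        case MODE_TREAT:   return 600;
        default:           /* Rule 16.4: mandatory even if "unreachable" */
            return 0;      /* fail toward the safe, known state */
        }
    }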

patlawler