Failure is Always an Option - Dylan Beattie - NDC Copenhagen 2022

Software runs the world. We use software to manage our calendars, talk to our friends, run our businesses - and, as our societies inevitably try to replace people and paperwork with apps and algorithms, we find ourselves facing some vital questions about the reliability of that software. If you take the time to actually read the terms and conditions, you’ll find that just about every system we rely on comes with no warranties and no safeguards - you use it at your own risk, and if it doesn’t work, that’s your problem.

But there’s more to building reliable systems than just writing good code. Reliability isn’t just about software engineering, it’s about systems engineering; about taking a holistic view of services that includes software, hardware, networks, and people.

Join Dylan Beattie for an insightful look at the history of systems engineering, at some of the strategies and design patterns that we can use to build reliability into our systems, and at what happens when the software that runs the world has a bad day.

Check out more of our featured speakers and talks at
Comments

Listening to Dylan Beattie is always a treat. He knows how to pick interesting subjects and also how to present them in an engaging and entertaining fashion.

tharfagreinir

I love listening to Dylan, and sometimes I get so engrossed in whatever historical subject he's delving into that I forget he's giving a software talk.

rexbaumeister

Feynman wasn't the first person to know about the O-ring failures on the Challenger. Three Morton-Thiokol engineers knew the night before the disaster that the seals would fail. They tried to stop the launch but were overruled by NASA managers because "Reagan had demanded a launch".
The three engineers were fired for "whistleblowing" because they would not shut up. The whole story is told in a 1989 IEEE Spectrum article about the problems the three engineers experienced.

The booster O-ring specifications required an ambient temperature of at least 48°F for 24 hours before the launch, and NASA had this information. Political pressure led to a launch after days of freezing temperatures. That's like knowingly buying tires rated for 100 mph and then driving at 150 mph. The design was rated for 48°F or higher, and NASA bureaucrats chose to launch at 32°F. The result was a certain failure.

solarlaura

Dylan Beattie tech talks never fail to impress! Yet another fine one!

Sepen

failure is not an Option, it's a Result::Err

-parrrate

This was one of the best presentations I've seen in a while. Thought provoking, engaging, and entertaining. It's already sparked several great discussions and I just finished watching it.

nikfp

Just gotta mention that Ed Lorenz's Royal McBee LGP-30 (34:38) was programmed by... Margaret Hamilton (05:15), before she went to work for NASA.

edgeeffect

28:00 In Australia, we had a catastrophic implementation of a similar system for welfare payments that wrongly identified many people as committing welfare fraud and demanded repayments that people didn't owe and couldn't repay. Quite a few people lost their lives.

magdaleneabiuso

"Failure is not an option"
(it's included as part of the base package)

That aside, there is a book by Nathaniel S. Borenstein called "Programming as if People Mattered" that you may find worth a read. I don't know if it's still in print, though.

I work in the aerospace industry as a software engineer. We endeavor to design systems as though failure is an inevitability, and try to make sure that all failure conditions that can be anticipated are handled, and that any unanticipated failures cause conditions that will do the least damage. For flight instrumentation, it is better to provide no information than wrong information; often the mitigation is to shut down the subsystem and alert the crew.

When I started working in this domain I was given the task of designing and implementing software within the platform to deal with loss-of-cooling failures (fan failure, ventilation blockage, the plane sitting on the tarmac in the sun for 2 hours, things like that). I had come from a different part of the computer industry where graceful degradation was the goal when handling failures, so I designed a health monitor that would lower the CPU and memory clock speed and/or turn off the clocks to non-critical components to reduce power dissipation but keep this safety-critical subsystem operating.

My design was unacceptable because there was no guarantee that the subsystem would produce the same data to the displays the pilot relies on under an overheat condition as it would at full rate (no failures). The correct solution, I learned, was to stop reporting anything and put the entire subsystem into as close to a reset state as possible until conditions were back within tolerances.

There were 3 copies of this same subsystem on the airframe, each in a different physical location but receiving the same stimulus (data from sensors and discrete signals), and the display subsystem used voting to select the source of the data it displayed to the crew. Stale or improperly smoothed data would not match between the different copies, and this could result in the wrong data getting displayed. When one of the modules went off-line, the display would source-select between the two other modules. If they disagreed, the display would highlight the measurement to alert the crew that alternative readings should be consulted.
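That voting-and-fallback scheme is worth sketching. Here's a minimal, purely illustrative version in Rust; the names (`select_display_value`, `DisplayState`) and the integer readings are my own assumptions, not anything from a real avionics codebase. It accepts an agreeing pair of sources, source-selects when one module is off-line, flags a disagreement, and shows nothing rather than a guess:

```rust
/// What the display should show for one measurement (hypothetical types).
#[derive(Debug, PartialEq)]
enum DisplayState {
    Show(i32),          // a pair of sources agreed within tolerance
    Flagged(i32, i32),  // live sources disagree: highlight, crew cross-checks
    NoData,             // better no information than wrong information
}

/// Select a value from three redundant modules. `None` means that module
/// has taken itself off-line (e.g. it detected an overheat and reset).
fn select_display_value(sources: [Option<i32>; 3], tol: i32) -> DisplayState {
    let live: Vec<i32> = sources.iter().flatten().copied().collect();
    match live.as_slice() {
        // All three live: accept any pair that agrees within tolerance.
        [a, b, c] => {
            if (a - b).abs() <= tol || (a - c).abs() <= tol {
                DisplayState::Show(*a)
            } else if (b - c).abs() <= tol {
                DisplayState::Show(*b)
            } else {
                DisplayState::Flagged(*a, *b) // no majority at all
            }
        }
        // One module off-line: source-select between the remaining two.
        [a, b] => {
            if (a - b).abs() <= tol {
                DisplayState::Show(*a)
            } else {
                DisplayState::Flagged(*a, *b)
            }
        }
        // Zero or one live source: show nothing rather than guess.
        _ => DisplayState::NoData,
    }
}

fn main() {
    // Module 2 has shut down; the two survivors disagree by more than 5 units.
    let state = select_display_value([Some(100), None, Some(140)], 5);
    assert_eq!(state, DisplayState::Flagged(100, 140));
}
```

Note the asymmetry the comment describes: a failing module doesn't degrade gracefully, it drops out entirely (`None`), which keeps the voting logic simple and its output deterministic.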

Failure tolerance is a design constraint that is tuned up or down depending on the criticality of the system (whether it is mission- or safety-critical) and what the consequences of any failure might be.

filker

I would not have expected RICHARD FRICKEN FEYNMAN to identify that the rubber seals had failed, and the "root cause" was increasingly lax risk assessment over time. I can't imagine what they were thinking letting him, the last person to be a yes-man, conduct an assessment.

doublepinger

If you try hard enough, failure is not only an option - it's a requirement.

mindasb

As a German, the Berlin airport example really hurt.

thepaulcraft

In the modern world, every company "thinks" it knows better than its users. Every product is made to force you to do things the "right" way.

jarosawmalinowski

"Failure is always an option" has been my saying for years. I've experienced it, along with huge success. What's required is to learn, iterate, change, adapt, even when, in many cases, your psyche is screaming "I'm done, I'm not doing this shit again".

TheJacklwilliams

Super awesome talk. It really did teach me to think of failure in layers. It applies more and more the bigger you scale.

kylekinnear

Post Office. Even when the people in charge knew the computer system was wrong, they still sent people to prison before admitting it had failed. That is true EVIL. (This comment was made before I heard the rest of the segment. I still believe it is true.)

stuartanderws

It's so great listening to Dylan. Thanks so much!

ignatiusezeani

Copenhagen... So close to me, yet I have not been able to go yet :( - Want to see Beattie live so bad

casperes

In the early 1990s, I worked with IBM Federal Systems, who were the people who wrote the primary flight control system (4 copies) for the Shuttle. They were very proud that the 5th computer never had to take over, because they'd built so much resilience into theirs (same principle as the whole Saturn V availability story).

petewindsor

"Failure is not an option - it is mandatory. The option is whether or not to let failure be the last thing you do" (70MMEM) 😉

tamberp