How One Line of Code Crashed AT&T's Long-Distance Network

On January 15th, 1990, a single line of code triggered a cascading failure that crippled AT&T's long-distance network for nearly 9 hours, leaving millions of Americans unable to make calls. In today's episode of Dave’s Garage, we dive into the software bug that caused one of the largest telecommunications outages in history.

With AT&T controlling 70% of long-distance traffic, the outage impacted 50 million calls and cost the company $60 million in lost revenue. At the heart of the problem was a subtle race condition in the code running the SS7 signaling system—an error that passed unnoticed through code reviews and testing until the perfect storm of conditions exposed it.

We’ll walk through the technical details of the bug, how it spread through AT&T’s 114 switches, and the larger lessons it taught the telecom industry about software complexity and testing. Thanks to David Sobeski for the episode idea!
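For a taste of the bug class before you watch: below is a minimal, hypothetical C sketch of the widely reported misplaced-break pattern, simplified far beyond the real 4ESS code; all names and structure here are illustrative only, not AT&T's actual source.

    #include <stdio.h>

    /* Illustrative sketch only -- not AT&T's actual 4ESS code. It shows the
       widely reported bug pattern: a break inside an if, nested in a switch
       case, exits the entire case, so the setup code below it is skipped. */
    enum msg_type { INCOMING_MESSAGE, OTHER_MESSAGE };

    static void handle(enum msg_type msg, int peer_down, int write_buf_empty)
    {
        switch (msg) {
        case INCOMING_MESSAGE:
            if (peer_down) {
                if (write_buf_empty) {
                    printf("mark peer back in service\n");
                } else {
                    /* meant to skip only the status update, but in C this
                       break leaves the whole switch statement */
                    break;
                }
            }
            printf("set up pointers to optional parameters\n"); /* skipped above */
            break;
        default:
            break;
        }
        printf("do optional parameter work (may now see stale data)\n");
    }

    int main(void)
    {
        handle(INCOMING_MESSAGE, 1, 0); /* the path that skips the setup */
        return 0;
    }

In the real outage, the skipped bookkeeping left stale data behind when a second message arrived while the first was still being processed, which is the race condition mentioned above; the affected switch rebooted, and its recovery messages triggered the same fault in its neighbors.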

Follow me for updates!
Twitter: @davepl1968
Comments

AT&T: Lost $60M during the 9-hour outage, made $120M from ‘Did the world end?’ calls in the next 10 minutes.

DataIsBeautifulOfficial

I tested that line of code! I wrote the pseudocode you showed to explain the issue to VPs. It is vastly simplified from the actual code. You did a great job describing the issues in a simplified way. We had known since November that something was wrong but could not find the flaw. There were literally millions of lines of code involved in the path of that bug. Also, even with the bug, the network performed quite well that day. The 4ESS was a fantastic telecom switch. Even with that bug it was a five-nines switch. Proven and documented. I miss the Bell Labs of that time.

barrynicholson

I'm liking your recent exploration of cyberattacks, and coding errors in medical devices and infrastructure.

msromike

Signaling System 7. SS7. Worth mentioning that AT&T created the entire network. Literally created the hardware, the operating system, the code, the protocols. There was some scrutiny around the "reset" strategy. The intent was to clear hung lines, returning them to the resource pool to reduce congestion. A bit of a hammer response when a screwdriver would suffice. As for upgrade strategies, they would be completed market by market. Within a given market they prioritize/assign the switch upgrades. All upgrades are done in the middle of the night, probably wrapping up by, say, 5am. All upgrades have a detailed back-out procedure should an unrecoverable error or event occur. I can't think of an instance where any upgrade would be rolled back after 24 hours in production. Telecom is quite the industry. To see it go from electromechanical switching of twisted-pair networks to fully digital in a single lifetime is something.

Failure_Is_An_Option

I could listen to Dave for hours! His explanations are crystal clear and in plain, comprehensible English. From another person on the autism spectrum to Dave: thank you so much!!

johnk

Wonderful video. I don't know any other channel whose content is this technically accurate; keep up the good work! I'm very happy I found your channel.

ivanleonardo

I love stories like this. It's so fascinating how such a small error can cause so much chaos.

JenDeyan

"Have you tried turning them off and on again?"
YES! All of them, as fast as I can. It's not helping!

fumped

You have a knack for taking what the news would either make super boring or whip into a chaos frenzy, and making it interesting and fun to learn. Thanks!

phishhu

Great information, Dave, thanks once again. As I was watching, it hit me: most people under the age of 30 are not going to understand long-distance calling and how it shaped your phone usage :)

randallgreen

This story probably wasn't a must-watch for CrowdStrike engineers.

spaces

No joke. My stepdad worked on this when it happened.
I remember him explaining it to me back then.

FreudianSlipDK

Loving the irony that the line of code which caused the network to break was “break;”

KribensaUK

Hey - that was great; showing pseudocode that demonstrates the bug was very educational and interesting to boot. Old C programmer here...

jonahansen

Brilliant video. The Department of Defense had published a study years earlier that found that branching logic like this was something like 70% more prone to coding errors. One large organization I worked for had a policy, and I believe Microsoft still does too, that if you have a massive outage like this one, the first step is to roll back the last change made to the system, no matter what it is. That rule has saved me more than once!

knmxgrjjhgt

I worked for Northern Telecom at that time. I was darned glad it wasn't our code that caused the outage!

jacksnow

Error handling definitely requires extra thought and scrutiny. Great post!

JohnBabisDJC

Hey, love this episode. Not only is failure analysis fascinating, but it also often has good written records from experts AND is ripe for high-drama edutainment. Love it! ❤

brentknight

As a process control software engineer working in the material handling world, the belief that timing bugs existed and would only show up at the weirdest times gave me nightmares. Early in my career, I had to babysit a system in Regina, SK that would fail if and only if it was -40 out and a tanker of 8000 liters of bunker crude was being loaded. It took 5 weeks, as there was only one truck per week that loaded 8000 liters, very early on Mondays at 2am. All this on a PDP-11, written in Fortran-77. With logging enabled, the extra load changed the timing issue to 7995 liters, and it turned out it was just two lines of code to alter. Lesson learned the hard way: at -40 C, the flow rate of bunker crude means it takes exactly 5 minutes to load 8000 liters... change the temperature or the quantity and there is no issue. Same code running in Toronto: no issue ever. I still have nightmares, and I have been retired two years.

paulscarlett

Dev: "It works on my telecommunications network"

kyleolson