Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017

This presentation was recorded at GOTO Chicago 2017. #GOTOcon #GOTOchgo

Bryan Cantrill - Chief Technology Officer at Joyent @bcantrill

ABSTRACT
As software is increasingly developed to be deployed as part of a service, the manifestations of defects have changed: the effects of broken software are increasingly unlikely to be felt by merely one user, but many (or even all) -- with concomitant commercial consequences. Debugging service [...]

#Debugging #DebuggingUnderFire

Comments

Anything by Bryan Cantrill I watch immediately. The guy is basically a young Unix greybeard who understands the deep intricacies of Unix-based operating systems; I wouldn't expect any less from the author of DTrace.

jfltech

Another classic talk by Bryan Cantrill. Legend. His talks probably pre-emptively saved me from disaster :)

beofonemind

This guy is a stand up computer scientist.

bocckoka

How in the world can debugging be an interesting thing to talk about? That is what I thought, but this is one of the best talks (if not the best) I have heard about anything. And I got here by chance.

piyushpkurur

My formative exposure to "how could this ever have worked?" was in 1982, on the original Wang PC, where I was passing NULL as the database pointer into a lookup routine on a simple in-memory data structure _for an entire week of active development_ with my test application returning correct records the whole while. Memory protection? Surely you jest. Turns out the search loop was wandering through low memory, not finding any matched keys, then _accidentally_ aligning to the correct modulo-14 boundary (pranked by the linker yet again) and proceeding to then scan the intended data structure correctly, with me none the wiser, for an entire week. And when I finally did have to chase this bug down, I pretty much gave my entire code base the most thorough code review of all time, before finally noticing my "obvious" coding error.

afterthesmash

I would love nothing more than to have Bryan as a mentor; he sounds like he would be a top-notch teacher.

hopperstreams

As a Canadian, yes, I know the Gimli accident. There's a reason the "out of gas" airplane is on its nose cone. The front wheel assembly swings down and forward. For the down part, gravity is enough, but the forward part—against the wind—requires a hydraulic assist to reach the locked position that wasn't available. Which is good, because otherwise the plane would have required brakes that weren't available in order to stop in time. As it slid down the runway on its nose cone, its nose cone was engulfed in more sparks—or so I imagine—than any demonic trade-of-paint since Halifax harbour (an event no one who witnessed—from close range—ever related).

"The go-cart races were over on the night we landed, and people were cooking on their barbecues beside tents and trailers, " Mr Pearson said. "Their mouths were wide open as our plane went sliding by. But the go-cart races went ahead the next day." A few hours earlier, the plane would have come in mowing down the multitudes. And it was only because one of the pilots recalled Gimli that they diverted there (it was an old military runway—very long—which I don't think was even listed for active runway duty in panic control).

The reason no other pilot could duplicate the landing is that one of the pilots was also a recreational glider pilot and used some kind of small-craft air spill to shed speed at the end, which only small-craft glider pilots train to perform. "Gosh, I've never done this in a big bird before—but, hey, no time like the present!"

Note that the DC-10 crash (United Flight 232) is an even more gripping story, one of the best disaster yarns ever, and none of the simulator pilots came _anywhere_ close to a survivable landing in many, many attempts afterward. It was like the Gimli crash, but one where the "hey, what about Gimli?" moment happened not once, but ten freaking times in a row (from having a third pilot deadheading on the flight, right up to the airport in Sioux City having 285 trained personnel from the Iowa National Guard on duty for a training exercise to assist in rescuing crash victims, on top of a shift change which doubled airport personnel on site as well).

"Following Air Canada's internal investigation, Captain Pearson was demoted for six months, and First Officer Quintal was suspended for two weeks for allowing the incident to happen. Three maintenance workers were also suspended. In 1985 the pilots were awarded the first ever Fédération Aéronautique Internationale Diploma for Outstanding Airmanship." Suspension _and_ medal, in true "it was me / let's fix this" Operator1 stoic–heroic fashion.

afterthesmash

Very much enjoyed. I feel that those of us who have to deal with these pathological situations in production know the values Bryan is conveying. Profuse kudos to him for raising awareness that we have a duty to create software that can be dissected, troubleshot, and repaired not only in the heat of the moment, but also by future generations who may depend on it.

jubalskaggs

The talk is great and lets make it even
Whenever Bryan says 'Um', We Drink..! Go...!

hopmingu

Another @bcantrill classic I've watched many times, and I need a timestamp TOC to find my favorite spots, so here it is:
3:04 - if you want, you can stay logged in and edit away, because we're going down anyway
3:48 - any time somebody says WTF, the bot takes a random sentence from chat over its entire history in which someone used the word f*** and offers it up. . .the other thing that the bot likes to do is to correct anyone saying Linux to, "you mean, GNU Linux?"
5:05 - please don't be me, please don't be me, please don't be me
6:18 - it is me. I am become Death, the destroyer of datacenters
12:10 - 67-hour outage
13:30 - sleep management and judgment impairment
16:15 - thought, might never recover, might go full Magnolia on the cloud
17:40 - we are not post-singularity, despite what the bot will tell you in chat: humans are still in the loop
18:15 - the curse of the intermediate skier
19:07 - welcome, Canadian! . . .Gimli glider, glass cockpit, the RAT
21:25 - it's a 767, it does not glide. . .this is a brick
24:35 - 40% or more of the microservices boom is inter-organizational strife
26:30 - an outage in production does not feel like a murder mystery. . .it feels like an active shooter
28:25 - if you go dark because of load, you have gone dark at the worst possible time
29:20 - 3 Mile Island: this all started because of some routine maintenance where they were running autovacuum on a Postgres shard
30:45 - they had not checked the database backups
32:25 - pilot-operated relief valve UI disaster
33:40 - "we monitor everything! We alert everything!"
38:55 - legerdemain: "a debugger never tells"
39:58 - debugging: you are playing 20 questions
43:35 - root-cause things: is that a fire in the kitchen, is that the coffeemaker, or is that a fire raging in the coal seam?
44:48 - the cost of the rewrite is never borne by the technical debt that induced it
45:03 - 18 months, 18 months, 18 months - crooked founder, crooked founder, crooked founder
46:20 - Look, gotta restart it! - Look, gotta debug it!
46:45 - we no longer understand the system; restarting everything all the time: that's called Windows, we did this experiment, and it doesn't work
48:00 - fatal failure/uncaught exception handling: present your embalmed carcass to Quincy M.E.
49:15 - you write up the postmortem because it forces you to completely understand it
50:00 - for programmatic failure, you need to die, for operational failure, you need to handle it
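
The 50:00 distinction can be sketched in a few lines. This is only an illustration, not code from the talk: `OperationalError`, `parse_port`, `port_or_default`, and `average_latency` are hypothetical names. Bad external input is treated as an operational failure and handled; a violated internal invariant is treated as a bug and allowed to crash, so the failure is visible and debuggable.

```python
class OperationalError(Exception):
    """Expected runtime failure (bad input, timeout, missing file)."""


def parse_port(raw: str) -> int:
    # Operational failure: the value comes from outside the program, so a
    # bad one is expected at runtime and is reported for the caller to handle.
    try:
        port = int(raw)
    except ValueError as exc:
        raise OperationalError(f"port is not a number: {raw!r}") from exc
    if not 1 <= port <= 65535:
        raise OperationalError(f"port out of range: {port}")
    return port


def port_or_default(raw: str, default: int = 8080) -> int:
    # Handle the operational failure: fall back to a known-good default.
    try:
        return parse_port(raw)
    except OperationalError:
        return default


def average_latency(samples: list) -> float:
    # Programmatic failure: callers are required to pass at least one sample.
    # A violation is a bug, so fail loudly instead of guessing a value.
    if not samples:
        raise AssertionError("bug: average_latency called with no samples")
    return sum(samples) / len(samples)
```

Nothing catches the `AssertionError`; the process is meant to die with a traceback rather than limp along in an unknown state.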

TatianaRacheva

this is better than any stand-up special I've seen all year 😂

chordfunc

40:40 explains what separates how hobby programmers approach ops from how professional software engineers do

cbrunnkvist

Regarding early missteps, I think that there is an important approach of structured decision making. Implementing TDODAR or FORDEC in an outage is extremely helpful in terms of breaking those early missteps.

WorldTravelerCooking

I laughed so hard the tears came down into my mouth!!!

TomAtkinson

I'm watching an ex-bomb maker tell his story and listening to his reasoning. Part of his training was that your first mistake is your last mistake. Another student learned from a crazy guy; he said never do this in your home, and always sober. You usually train for the worst-case scenario.

joeyalfaro

Around 12 minutes in, Bryan mentions a presentation on a Heroku production outage. I was not able to find it quickly. Any idea which one he is talking about?

IevgenPyrogov

Dear Camera Operator ... Just Zoom Out.

Dygear

"do it right the first time"

Rene-tufc

29:52 "... auto vacuum of a postgres shard..."

DanielDugovic

He’s right about the pride in the Gimli Glider. I live in Manitoba (the same province), and everyone knows about this darn plane.

RandomInsano