LFI Conf 23 | Will Carhart | Foresight is 20/20: Using Pre-Mortems to Prepare for Incidents Before..

Will Carhart, Backend Engineer, Stripe

Foresight is 20/20: Using Pre-Mortems to Prepare for Incidents Before They Occur

It’s the middle of the night a week before your biggest traffic day of the year. You toss and turn vehemently as your mind races from nightmare to nightmare. RPS is rapidly plummeting during Black Friday…oh my gosh logs are truncated and pods are expiring…can’t deserialize customer orders from Kafka…why don’t we have observability here?…oh no this runbook is out of date…why is (10x engineer) OOO right now, who do I page??? You rocket up in a frenzied state, hair tousled, muscles tensed, and heart rate a few clicks below maximum. We need to be better prepared, you resolve to soothe your panicked mind, we need the lessons learned from an incident post-mortem, but before the bad things happen…almost like a pre-incident-post-mortem…a pre-mortem!

At Stripe, we’ve learned a tremendous amount from post-incident activities, and increased the priority of all sorts of work based on that insight. However, we eventually asked ourselves: why wait for the bad thing to happen to have these conversations?

Recently, we developed a set of pre-mortem exercises that teams at Stripe can use to preemptively cover their bases in preparation for big events. We’ve used this template on our API Platform team to catch and patch observability edge cases, envision potential responses (and shortcomings) in disaster scenarios, and test and clean up our runbooks in preparation for our biggest days of the year, Black Friday and Cyber Monday.

In this talk, we review our approach. Join us as we take a deep dive into the incident response zeitgeist at Stripe and how we can all contribute together to more reliable distributed software systems.

Learning from Incidents (LFI) is a community challenging conventional views and reshaping how the software industry thinks about incidents, software reliability, and the critical role people play in keeping their systems running. In today’s economy, software organizations can’t afford to not learn from incidents.

LFI Conference is made possible by the financial and planning support of the Jeli team. Nora Jones, Founder and CEO of Jeli, founded the LFI community and website as a way to show organizations how to get more ROI out of their most powerful investments -- their incidents.