SREcon19 Europe/Middle East/Africa - How to SRE When Everything's Already on Fire

Alex Hidalgo and Alex Lee, Squarespace

We've all read the SRE books and heard stories of a magical land of engineering organizations with functioning SRE, one where following SRE best practices leads to a better reality for both you and your users. But how do we get there? And what does that road look like?

This talk presents a case study on how our team, stuck in a deep reliability hole maintaining our company's centralized logging platform, adopted many SRE best practices to resolve a several-months-long incident. It's the story of how we took the highest-trafficked system in our infrastructure from being reliable ~85% of the time to a trusted and documented 99.9%.
