SREcon19 Europe/Middle East/Africa - How to SRE When Everything's Already on Fire
Alex Hidalgo and Alex Lee, Squarespace
We've all read the SRE books and heard stories of a magical land of engineering organizations with functioning SRE, where following SRE best practices leads to a better reality for both you and your users. But how do we get there? And what does that road look like?
This talk presents a case study of how our team, stuck in a deep reliability hole while maintaining our company's centralized logging platform, adopted many SRE best practices to resolve an incident that had lasted several months. It's the story of how we took the highest-traffic system in our infrastructure from being reliable ~85% of the time to a trusted and documented 99.9%.