A Field Guide to Reliability Engineering at Zalando • Heinrich Hartmann • GOTO 2024
This presentation was recorded at GOTO Amsterdam 2024. #GOTOcon #GOTOams
Heinrich Hartmann - Head of Reliability Engineering at Zalando SE @HeinrichHartmann
ABSTRACT
We present Zalando's approach to engineering reliability from a very small to a very large scale, and touch on both technological and human angles.
With over 50M customers across 23 countries, Zalando operates one of the largest eCommerce platforms worldwide. Achieving a reliable customer experience requires the intricate collaboration of over 3000 applications and more than 2000 software engineers who constantly seek to improve and extend product capabilities. In the talk we will walk you through the best practices Zalando has arrived at to consistently achieve high levels of reliability.
• We will start with a simple stand-alone application and cover best practices for instrumentation, monitoring and alerting.
• We continue the journey to products that span multiple applications operated by different teams. At this scale, methods like tracing and incident management become important.
• Finally, we will present technologies and processes that are used to steer reliability at the company level. Here WORM Cascades and Risk Management have proven highly effective. [...]
TIMECODES
00:00 Intro
03:45 Agenda
04:04 Principles
10:32 Context
14:44 Operations at Zalando
14:49 Alerting
21:25 Dashboards
24:32 Observability
31:51 SLOs
37:32 Incident process
45:20 WORMs
50:01 Summary
50:37 Outro
#Reliability #ReliabilityEngineering #DevOps #Observability #IncidentProcess #SLOs #HeinrichHartmann #Zalando