SREcon19 Europe/Middle East/Africa - Weathering the Storm: How Early Warnings Save the Farm

Brian Sherwin, LinkedIn Corporation

LinkedIn’s production stack consists of thousands of different applications with complex dependencies between them. In this environment, when a production issue is caused by one or more misbehaving microservices, finding the culprit can be both challenging and time-consuming.

At LinkedIn, we have built a framework that automates incident correlation: it ingests data about incidents and their associated dependencies to identify the unhealthy microservice(s). This lets us escalate an incident directly to the responsible team, cutting down MTTD/MTTR while improving the quality of life of on-call engineers.
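To make the idea concrete, here is a minimal sketch of dependency-based correlation. It is not LinkedIn's implementation, which the abstract does not describe; the function name `find_suspects`, the service names, and the graph-walk heuristic are all assumptions made for illustration. The sketch starts at the service where the incident was reported, follows only dependencies that are themselves alerting, and reports the alerting services that have no alerting dependencies of their own.

```python
from collections import deque


def find_suspects(dependencies, alerting, impacted):
    """Return likely root-cause services for an incident on `impacted`.

    dependencies: dict mapping a service to the services it calls (hypothetical data).
    alerting: set of services that currently have active alerts.
    impacted: the service where the incident was first reported.

    Heuristic (an assumption, not a documented algorithm): walk downstream
    from `impacted` through alerting services only, and flag those whose
    own dependencies are all healthy.
    """
    if impacted not in alerting:
        return []

    suspects = []
    seen = {impacted}
    queue = deque([impacted])
    while queue:
        service = queue.popleft()
        unhealthy_deps = [d for d in dependencies.get(service, []) if d in alerting]
        if not unhealthy_deps:
            # Nothing below this service is alerting, so the trail ends here.
            suspects.append(service)
        for dep in unhealthy_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(suspects)


# Hypothetical call graph and alert state.
dependencies = {
    "frontend": ["profile-api", "search-api"],
    "profile-api": ["profile-db"],
    "search-api": ["index-store"],
}
alerting = {"frontend", "profile-api", "profile-db"}

print(find_suspects(dependencies, alerting, "frontend"))  # ['profile-db']
```

In this toy example the incident surfaces on `frontend`, but the walk ends at `profile-db`, so the escalation would go to that service's owners rather than to the frontend team. A real system would also need to handle dependency cycles and noisy alerts, which this sketch does not attempt.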

In this talk, we will give a high-level overview of the correlation engine, explain how we perform correlations, describe how we reduce false positives and increase the accuracy of the correlated results, and finally share lessons learned.
