Zebrium ML for Logs - Automatically Find the Root Cause of Software Problems


Watch the magic of using machine learning and AI for log analysis.

This video shows, end to end, how the technology can be used. First, we install a microservices online shopping app (Sock Shop) in a new EKS Kubernetes cluster. Then we install the Zebrium log collector and let everything run for 2 hours.

Note: There are no pre-configured rules used here!

After 2 hours, the Zebrium ML has had a chance to baseline the logs, so we use the Litmus Chaos Engine (a CNCF chaos engineering tool) to break the app. Almost immediately, a Datadog dashboard shows that things are broken, but not what is causing it. However, the Zebrium dashboard plugin for Datadog surfaces a few important keywords such as "chaos" and "pod-network-corruption" (the name of the experiment used to break the environment). It also shows that Zebrium has just created a root cause report.

Clicking on the root cause report shows a short sequence of log lines that explains exactly what happened. It shows that the pod-network-corruption experiment kicks off, breaks network interface eth0, and then prevents orders from being processed. The root cause report also contains a plain-language summary (generated using GPT-3) that gives a high-level description of the incident.

The video next explains how the machine learning works under the covers. It uses a multi-stage pipeline of ML techniques.
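
To give a feel for what "baselining" logs can mean, here is a minimal Python sketch of one generic idea: learn the log line templates seen during normal operation, then flag lines whose template has never been seen before. This is not Zebrium's actual pipeline, which this description does not detail. The function names and the sample log lines are made up for illustration.

```python
import re

def to_template(line: str) -> str:
    """Mask numeric tokens so lines that differ only in values share a template."""
    return re.sub(r"\b\d+(?:\.\d+)*\b", "<NUM>", line).strip()

def learn_baseline(lines):
    """Collect the set of templates observed during normal operation."""
    return {to_template(line) for line in lines}

def find_new_events(baseline, lines):
    """Return the lines whose template never appeared in the baseline."""
    return [line for line in lines if to_template(line) not in baseline]

# Hypothetical sample data (not taken from the video).
baseline_lines = [
    "order 17 processed in 32 ms",
    "order 18 processed in 41 ms",
    "GET /catalogue 200",
]
incident_lines = [
    "order 19 processed in 35 ms",
    "chaos experiment pod-network-corruption started",
    "order 20 failed: connection timed out",
]

baseline = learn_baseline(baseline_lines)
for line in find_new_events(baseline, incident_lines):
    print("new event type:", line)
```

With this toy data, the routine "order processed" line is ignored because its template was seen during the baseline period, while the chaos experiment line and the failed order line are reported as new. A real system would need far more than this, for example handling rare but normal events and correlating anomalies across services.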

0:00 Introduction
1:52 Demo part 1 - install Sock Shop app and Litmus Chaos
2:42 Break the application by running a chaos experiment
3:23 The results in Datadog
3:35 The Zebrium panel - automatically shows root cause
4:08 The Zebrium ML generated root cause report
5:55 How the machine learning works
8:48 Zebrium integrations
9:06 Free trial sign-up

#ML #AI #machinelearning #rca #rootcauseanalysis #MLforlogs #devops #sre #logmanagement #monitoring