Make Prometheus Use Less Memory and Restart Faster - Ganesh Vernekar, Grafana Labs

These days, the most common reason for a Prometheus server to run out of memory is an excessive number of time series in the so-called head block, the part of the internal TSDB that holds the freshest data and has to be kept in memory until it is consolidated into a block on disk. A large head block also leads to a long restart time, because the head block has to be rebuilt from the write-ahead log (WAL). On large servers, the restart can take 10 minutes or more. Since restarts happen regularly to upgrade the binary or to change flags, the resulting interruption of sample collection is problematic. Even worse, after an OOM crash the same WAL replay has to happen, often causing another OOM crash immediately. Ganesh Vernekar will talk about the work started in late 2019 to persist parts of the head block earlier, thereby reducing both the memory footprint and the restart time.

Comments

Snapshotting the latest chunk is an amazing feature; WAL replay takes a lot of time in my case. Looking forward to using this feature. Thanks for building this.

lokeshwarank

Hi, are these features already available in the latest Prometheus Docker image?

srikanthjnr

Excellent depiction and explanation. Do we have slides available?

chakradharnr

Can we have some scenarios based on remote write failures? E.g., how does the WAL work when the remote database fails? Also, what happens if the Prometheus pod itself is down for a few minutes? How do the WAL and chunks behave in that case?

prabhatranjan