HLL performance characteristics in large-scale aggregations over structured data - François Jehl

We’ve all heard about HyperLogLog by now (if you haven’t, don’t worry, there’s an intro!) and how it can estimate cardinalities in the billions using data structures on the order of a few kilobytes. While quite a lot has been published on its accuracy, not much is available on its performance. More to the point, comparing hundreds of millions of 2KB data structures, even when they are already in main memory, is expensive. As a result, we asked ourselves: “When should we aggregate HLL synopses, and when should we aggregate raw event-level data?”

In this presentation we compare the performance of aggregating HLL synopses in Vertica against aggregating raw event-level data, also in Vertica. Additionally, to get an idea of the “raw” performance of HLL, we use Druid, a popular in-memory datastore with native HLL support, as a baseline.