Scaling Write-Intensive Key-Value Stores

In recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number of batches that reads have to search, and it also uses in-memory Bloom filters to let point reads probabilistically skip batches that do not contain a target entry. In this talk, we show that such LSM-tree-based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place, thereby making writes unnecessarily expensive.
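For readers who want to anchor the description above, here is a minimal sketch of the read path in Python. It is illustrative only: the names (BloomFilter, Run, point_lookup) and the page layout are our own simplifications, not the interface of any real key-value store. A point lookup checks the in-memory buffer, then probes runs from newest to oldest, using each run's Bloom filter to skip it when possible and its fence pointers to read at most one storage page.

```python
import bisect
import hashlib
from typing import List, Optional


class BloomFilter:
    """Tiny illustrative Bloom filter (not space- or speed-optimized)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(key))


class Run:
    """One sorted batch ("run") flushed from the in-memory buffer to storage."""

    def __init__(self, bloom: BloomFilter, fence_keys: List[bytes], pages: List[dict]):
        self.bloom = bloom            # in memory, one filter per run
        self.fence_keys = fence_keys  # in memory: smallest key of each disk page
        self.pages = pages            # on storage: one dict stands in for one page

    def probe(self, key: bytes) -> Optional[bytes]:
        # The filter lets the lookup skip this run for most keys it does not hold.
        if not self.bloom.might_contain(key):
            return None
        # Fence pointers pick the single candidate page, so the probe costs
        # at most one storage read no matter how large the run is.
        page_idx = bisect.bisect_right(self.fence_keys, key) - 1
        if page_idx < 0:
            return None
        return self.pages[page_idx].get(key)  # the one (simulated) disk read


def point_lookup(memtable: dict, runs_newest_first: List[Run], key: bytes) -> Optional[bytes]:
    # The in-memory write buffer always holds the newest version of a key.
    if key in memtable:
        return memtable[key]
    # Otherwise search runs from newest to oldest and stop at the first hit.
    for run in runs_newest_first:
        value = run.probe(key)
        if value is not None:
            return value
    return None
```

The point of the sketch is that both the filters and the fence pointers live in memory, so a lookup touches storage at most once per run that the filters fail to rule out.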

As part of the CrimsonDB project at the Harvard DasLab, we developed two insights to address this problem. First, we show that the optimal approach for allocating Bloom filters, given any amount of available memory, is to assign significantly lower false positive rates to smaller data batches. This shaves a logarithmic factor off point read cost, thereby allowing key-value stores to scale better in terms of reads. Second, having lower false positive rates for smaller batches makes it possible to merge newer data more lazily without compromising point read cost. This eliminates most of the merge overhead of LSM-tree, thereby improving the scalability of writes. We close by describing a higher-level lesson from our work: while data structure design up until today has focused on the cost balance between reads and writes, including memory utilization as a direct additional optimization objective opens up new avenues for asymptotic improvements that studying reads and writes in isolation could not have revealed.
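To make the first insight concrete, here is a back-of-the-envelope sketch comparing two ways of spending the same total Bloom-filter memory: a uniform false positive rate for every run versus rates proportional to run size (so smaller runs get lower rates), which is what a simple Lagrange-multiplier minimization of the sum of false positive rates yields. That sum is the expected number of wasted storage probes per point lookup. The code and its numbers are our own illustration of the stated principle, not the paper's exact algorithm; for example, memory freed by capping a rate at 1 is not redistributed here.

```python
import math
from typing import List


def bloom_bits(num_entries: float, fpr: float) -> float:
    # Standard Bloom filter memory formula: bits = n * ln(1/p) / (ln 2)^2.
    return num_entries * math.log(1.0 / fpr) / (math.log(2) ** 2)


def uniform_allocation(run_sizes: List[int], total_bits: float) -> List[float]:
    # Baseline: one false positive rate p shared by every run, chosen so that
    # the filters together use exactly `total_bits`.
    n_total = sum(run_sizes)
    p = math.exp(-total_bits * (math.log(2) ** 2) / n_total)
    return [p] * len(run_sizes)


def size_proportional_allocation(run_sizes: List[int], total_bits: float) -> List[float]:
    # Minimizing sum(p_i) subject to sum_i n_i * ln(1/p_i) / (ln 2)^2 = total_bits
    # (a one-line Lagrange-multiplier calculation) gives p_i proportional to n_i,
    # i.e. smaller runs get lower false positive rates. If the cap at 1 triggers
    # for a very large run, the freed memory is not redistributed in this sketch.
    n_total = sum(run_sizes)
    log_c = -(total_bits * (math.log(2) ** 2)
              + sum(n * math.log(n) for n in run_sizes)) / n_total
    return [min(1.0, math.exp(log_c) * n) for n in run_sizes]


if __name__ == "__main__":
    # Illustrative shape: runs growing by 10x per level, 10 bits per entry overall.
    runs = [10 ** i for i in range(3, 9)]
    budget = 10 * sum(runs)
    for name, alloc in [("uniform", uniform_allocation(runs, budget)),
                        ("size-proportional", size_proportional_allocation(runs, budget))]:
        used = sum(bloom_bits(n, p) for n, p in zip(runs, alloc) if p < 1.0)
        print(f"{name:18s} sum of FPRs = {sum(alloc):.3e}  (~{used / sum(runs):.1f} bits/entry)")
```

With these illustrative numbers both allocations spend about 10 bits per entry, but the size-proportional one yields a noticeably smaller sum of false positive rates, i.e. fewer wasted storage probes per point lookup.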

Comments

This was a great talk, thank you for publishing it!

williamclausen

I'm studying LSM-trees and this is indeed a very helpful video!

munikarmanish

Good presentation on the LSM-tree and its application.

AnkitGaur

If you're a DB fanatic, I really recommend this. Great content!

jaigohil

Surprised there wasn't any mention of WiscKey.

dn

I'm trying to internalize why probing any run would cost the same, and I'm not sure if I'm missing something.

The paper states that: "Our insight is that the worst-case point lookup cost over the whole LSM-tree is proportional to the sum of the false positive rates of all filters. Assigning equal false positive rates to all filters, however, does not minimize this sum. The reason is that maintaining the same false positive rate across all runs means that larger runs have proportionally larger Bloom filters, whereas the I/O cost of probing any run is the same regardless of its size (due to each run having fence pointers in main memory that allow direct access to the relevant disk page)."

Let's say the fence pointers cover P pages. Then the statement is true as long as each level consists of only P pages, but when higher levels start to have N * P pages you need to find the right page, so you need to search the array of fence pointers on each level that the Bloom filter qualifies. So shouldn't the I/O cost be log(N) where a level has N fence pointers? It's going to be very cheap, because the array of fence pointers is in memory, and the cost of an in-memory binary search vs. a disk I/O differs by orders of magnitude.
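A small sketch of the cost split the quoted passage relies on, using hypothetical names (fence_keys, read_page): the log(N) binary search over fence pointers does happen, but it is in-memory CPU work; the I/O charged to a run that the Bloom filter does not rule out is still at most one page read, regardless of the run's size.

```python
import bisect
from typing import Callable, List, Optional


def probe_run(fence_keys: List[bytes],
              read_page: Callable[[int], dict],
              key: bytes) -> Optional[bytes]:
    # In memory: O(log N) key comparisons over the run's fence pointers
    # (nanoseconds to microseconds, and not counted as I/O).
    page_idx = bisect.bisect_right(fence_keys, key) - 1
    if page_idx < 0:
        return None
    # Storage: exactly one page read, no matter how many pages the run has.
    # This is why probing any qualified run is charged the same constant I/O cost.
    return read_page(page_idx).get(key)
```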

DirtyPio

Please put the speaker's name in video titles

parkatip