Scaling Write-Intensive Key-Value Stores

In recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number of batches that reads have to search, and it also uses in-memory Bloom filters to let point reads probabilistically skip batches that do not contain a target entry. In this talk, we show that such LSM-tree-based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place, thereby making writes unnecessarily expensive.
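For readers who want to anchor the description above, here is a minimal sketch of the read path in Python. It is illustrative only: the names (BloomFilter, Run, point_lookup) and the page layout are our own simplifications, not the interface of any real key-value store. A point lookup checks the in-memory buffer, then probes runs from newest to oldest, using each run's Bloom filter to skip it when possible and its fence pointers to read at most one storage page.

```python
import bisect
import hashlib
from typing import List, Optional


class BloomFilter:
    """Tiny illustrative Bloom filter (not space- or speed-optimized)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(key))


class Run:
    """One sorted batch ("run") flushed from the in-memory buffer to storage."""

    def __init__(self, bloom: BloomFilter, fence_keys: List[bytes], pages: List[dict]):
        self.bloom = bloom            # in memory, one filter per run
        self.fence_keys = fence_keys  # in memory: smallest key of each disk page
        self.pages = pages            # on storage: one dict stands in for one page

    def probe(self, key: bytes) -> Optional[bytes]:
        # The filter lets the lookup skip this run for most keys it does not hold.
        if not self.bloom.might_contain(key):
            return None
        # Fence pointers pick the single candidate page, so the probe costs
        # at most one storage read no matter how large the run is.
        page_idx = bisect.bisect_right(self.fence_keys, key) - 1
        if page_idx < 0:
            return None
        return self.pages[page_idx].get(key)  # the one (simulated) disk read


def point_lookup(memtable: dict, runs_newest_first: List[Run], key: bytes) -> Optional[bytes]:
    # The in-memory write buffer always holds the newest version of a key.
    if key in memtable:
        return memtable[key]
    # Otherwise search runs from newest to oldest and stop at the first hit.
    for run in runs_newest_first:
        value = run.probe(key)
        if value is not None:
            return value
    return None
```

The point of the sketch is that both the filters and the fence pointers live in memory, so a lookup touches storage at most once per run that the filters fail to rule out.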

As part of the CrimsonDB project at the Harvard DasLab, we developed two insights to address this problem. First, we show that the optimal approach for allocating Bloom filters, given any amount of available memory, is to assign significantly lower false positive rates to smaller data batches. This shaves a logarithmic factor off point read cost, thereby allowing key-value stores to scale better in terms of reads. Second, having lower false positive rates for smaller batches makes it possible to merge newer data more lazily without compromising point read cost. This eliminates most of the merge overhead of LSM-tree, thereby improving the scalability of writes. We close by describing a higher-level lesson from our work: while data structure design up until today has focused on the cost balance between reads and writes, including memory utilization as a direct additional optimization objective opens up new avenues for asymptotic improvements that studying reads and writes in isolation could not have revealed.
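To make the first insight concrete, here is a back-of-the-envelope sketch comparing two ways of spending the same total Bloom-filter memory: a uniform false positive rate for every run versus rates proportional to run size (so smaller runs get lower rates), which is what a simple Lagrange-multiplier minimization of the sum of false positive rates yields. That sum is the expected number of wasted storage probes per point lookup. The code and its numbers are our own illustration of the stated principle, not the paper's exact algorithm; for example, memory freed by capping a rate at 1 is not redistributed here.

```python
import math
from typing import List


def bloom_bits(num_entries: float, fpr: float) -> float:
    # Standard Bloom filter memory formula: bits = n * ln(1/p) / (ln 2)^2.
    return num_entries * math.log(1.0 / fpr) / (math.log(2) ** 2)


def uniform_allocation(run_sizes: List[int], total_bits: float) -> List[float]:
    # Baseline: one false positive rate p shared by every run, chosen so that
    # the filters together use exactly `total_bits`.
    n_total = sum(run_sizes)
    p = math.exp(-total_bits * (math.log(2) ** 2) / n_total)
    return [p] * len(run_sizes)


def size_proportional_allocation(run_sizes: List[int], total_bits: float) -> List[float]:
    # Minimizing sum(p_i) subject to sum_i n_i * ln(1/p_i) / (ln 2)^2 = total_bits
    # (a one-line Lagrange-multiplier calculation) gives p_i proportional to n_i,
    # i.e. smaller runs get lower false positive rates. If the cap at 1 triggers
    # for a very large run, the freed memory is not redistributed in this sketch.
    n_total = sum(run_sizes)
    log_c = -(total_bits * (math.log(2) ** 2)
              + sum(n * math.log(n) for n in run_sizes)) / n_total
    return [min(1.0, math.exp(log_c) * n) for n in run_sizes]


if __name__ == "__main__":
    # Illustrative shape: runs growing by 10x per level, 10 bits per entry overall.
    runs = [10 ** i for i in range(3, 9)]
    budget = 10 * sum(runs)
    for name, alloc in [("uniform", uniform_allocation(runs, budget)),
                        ("size-proportional", size_proportional_allocation(runs, budget))]:
        used = sum(bloom_bits(n, p) for n, p in zip(runs, alloc) if p < 1.0)
        print(f"{name:18s} sum of FPRs = {sum(alloc):.3e}  (~{used / sum(runs):.1f} bits/entry)")
```

With these illustrative numbers both allocations spend about 10 bits per entry, but the size-proportional one yields a noticeably smaller sum of false positive rates, i.e. fewer wasted storage probes per point lookup.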

Comments

This was a great talk, thank you for publishing it!

williamclausen

I'm studying LSM-trees and this is indeed a very helpful video!

munikarmanish

Good presentation on the LSM-tree and its application.

AnkitGaur

If you're a DB fanatic, I really recommend this. Great content!

jaigohil

Surprised there wasn't any mention of WiscKey.

dn

I'm trying to internalize why probing any run would cost the same, and I'm not sure if I'm missing something.

The paper states that: "Our insight is that the worst-case point lookup cost over the whole LSM-tree is proportional to the sum of the false positive rates of all filters. Assigning equal false positive rates to all filters, however, does not minimize this sum. The reason is that maintaining the same false positive rate across all runs means that larger runs have proportionally larger Bloom filters, whereas the I/O cost of probing any run is the same regardless of its size (due to each run having fence pointers in main memory that allow direct access to the relevant disk page)."

Let's say the fence pointers cover P pages. Then the statement is true as long as each level consists of only P pages, but when higher levels start to have N * P pages you need to find the right page, so you need to search the array of fence pointers on each level that the Bloom filter qualifies. So shouldn't the I/O cost be log(N) where a level has N fence pointers? It's going to be very cheap, because the array of fence pointers is in memory, and the cost of an in-memory binary search vs. a disk I/O differs by orders of magnitude.
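A small sketch of the cost split the quoted passage relies on, using hypothetical names (fence_keys, read_page): the log(N) binary search over fence pointers does happen, but it is in-memory CPU work; the I/O charged to a run that the Bloom filter does not rule out is still at most one page read, regardless of the run's size.

```python
import bisect
from typing import Callable, List, Optional


def probe_run(fence_keys: List[bytes],
              read_page: Callable[[int], dict],
              key: bytes) -> Optional[bytes]:
    # In memory: O(log N) key comparisons over the run's fence pointers
    # (nanoseconds to microseconds, and not counted as I/O).
    page_idx = bisect.bisect_right(fence_keys, key) - 1
    if page_idx < 0:
        return None
    # Storage: exactly one page read, no matter how many pages the run has.
    # This is why probing any qualified run is charged the same constant I/O cost.
    return read_page(page_idx).get(key)
```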

DirtyPio

Please put the speaker's name in video titles

parkatip