What are Distributed CACHES and how do they manage DATA CONSISTENCY?

Caching in distributed systems is an important aspect of designing scalable systems. We first discuss what a cache is and why we use it. We then talk about the key features of a cache in a distributed system.

The cache eviction policies of LRU and sliding window are covered here. For high performance, the eviction policy must be chosen carefully. To keep data consistent and the memory footprint low, we must choose an appropriate write policy, such as write-through or write-back.

Cache management is important because of its direct impact on cache hit ratios, and therefore on performance. We walk through various scenarios in a distributed environment.

System Design Video Course:

00:00 Who should watch this video?
00:18 What is a cache?
02:14 Why not store everything in a cache?
03:00 Cache Policies
04:49 Cache Evictions and Thrashing
05:52 Consistency Problems
06:32 Local Caches
07:49 Global Caches
08:56 Where should you place a cache?
09:35 Cache Write Policies
11:38 Hybrid Write Policy?
13:10 Thank you!

A complete course on how systems are designed. Along with video lectures, the course has architecture diagrams, capacity planning, API contracts, and evaluation tests.

#SystemDesign #Caching #DistributedSystems
Comments

Gaurav, nice video. One comment: a write-back cache refers to writing to the cache first, with the update then propagated asynchronously from the cache to the DB. What you're describing as write-back is actually write-through, since in write-through the order of writing (to the DB or the cache first) doesn't matter.

VrajaJivan

Write-through: data is written in cache & DB; I/O completion is confirmed only when data is written in both places
Write-around: data is written in DB only; I/O completion is confirmed when data is written in DB
Write-back: data is written in cache first; I/O completion is confirmed when data is written in cache; data is written to DB asynchronously (background job) and does not block the request from being processed
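A minimal sketch of these three policies, assuming hypothetical in-memory `cache` and `db` dicts standing in for a real cache and database, with a queue draining write-back updates in the background:

```python
import queue
import threading

cache, db = {}, {}           # hypothetical stand-ins for a real cache and DB
write_queue = queue.Queue()  # buffer drained asynchronously for write-back

def write_through(key, value):
    # Write to cache and DB synchronously; confirm only after both succeed.
    cache[key] = value
    db[key] = value

def write_around(key, value):
    # Write to the DB only; drop any stale cached copy so the next
    # read misses and repopulates the cache from the DB.
    db[key] = value
    cache.pop(key, None)

def write_back(key, value):
    # Write to the cache and confirm immediately; a background job
    # persists the update to the DB without blocking the request.
    cache[key] = value
    write_queue.put((key, value))

def persist_worker():
    # Background job: drain buffered writes into the DB.
    while True:
        key, value = write_queue.get()
        db[key] = value

threading.Thread(target=persist_worker, daemon=True).start()
```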

waterislife

Other variants
1. There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
2. There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery

GK-rldu

I can already hear the interviewer asking, "With the hybrid solution, what happens when the cache node dies before it flushes to the concrete storage?" You said you'd avoid using that strategy for sensitive writes, but you'd still stand to lose up to the size of the buffer you defined on the cache in the event of failure. You'd have to factor that risk into your trade-off. Great video, as always. Thank you!

mannion

Notes:

In Memory Caching

- Save network calls - for commonly accessed data
- Avoid recomputation - for frequent computations like finding an average age
- Reduce DB load - hit the cache before querying the DB (see the read-path sketch below)
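A sketch of that read path (often called cache-aside), with a hypothetical `query_db` helper standing in for the real database:

```python
cache = {}  # hypothetical in-memory cache

def query_db(key):
    # Hypothetical stand-in for a slow, costly database query.
    return f"value-for-{key}"

def get(key):
    # Hit the cache before querying the DB.
    if key in cache:
        return cache[key]        # cache hit: no DB load
    value = query_db(key)        # cache miss: fall back to the DB
    cache[key] = value           # populate so the next read is a hit
    return value
```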

Drawbacks of Cache

- Cache hardware (RAM/SSD) is much more expensive than DB storage (disk)
- As we store more data in the cache, search time increases (counterproductive)

Design

- Database (Infinite information) vs Cache (Relevant information)

Cache Policy

- Least Recently Used (LRU) - the most recently used entries stay at the top; evict the least recently used entries when the cache is full (see the LRU sketch below)
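A minimal LRU sketch using Python's OrderedDict (the capacity is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=3):       # illustrative capacity
        self.capacity = capacity
        self.entries = OrderedDict()       # ordered oldest -> most recent

    def get(self, key):
        if key not in self.entries:
            return None                    # miss: caller falls back to the DB
        self.entries.move_to_end(key)      # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
```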

Issue with caches

- Extra calls - when we can't find an entry in the cache, we make an extra call and then query the database
- Thrashing - entries are loaded into and evicted from the cache without ever being read
- Consistency - when we update the DB, we must keep the cache and the DB consistent

Where to place the cache

- Close to the server (in memory)
    - Benefit - fast
    - Issue - maintaining consistency between the memories of different servers, especially for sensitive data such as passwords
- Close to the DB (global cache, e.g. Redis)
    - Benefit - accurate, and able to scale independently

Write-through vs Write-back (the video's terms; see the pinned comments for the standard names)

- Write-through - update the cache before updating the DB
    - Not workable when multiple servers each hold a local cache
- Write-back (standard name: write-around) - update the DB before updating the cache
    - Issue: performance - if every DB update also invalidates or refreshes cache entries, much of the cached data is still fine, and invalidating it is expensive
- Hybrid (standard name: write-back) - see the sketch below
    - Any update is first written to the cache
    - After a while, entries are persisted in bulk to the database
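A sketch of that hybrid (bulk write-back) idea, with an illustrative `FLUSH_THRESHOLD`; as a comment above notes, a node crash before the flush loses up to a buffer's worth of writes:

```python
cache = {}             # hypothetical in-memory cache
db = {}                # hypothetical persistent store
dirty = {}             # updates written to the cache but not yet persisted
FLUSH_THRESHOLD = 100  # illustrative buffer size

def write(key, value):
    # Any update is first written to the cache only.
    cache[key] = value
    dirty[key] = value
    if len(dirty) >= FLUSH_THRESHOLD:
        flush()

def flush():
    # Persist dirty entries in bulk: one round-trip instead of many.
    # If the node dies before flush(), up to FLUSH_THRESHOLD writes are
    # lost, so avoid this policy for sensitive data unless the cache
    # is replicated.
    db.update(dirty)
    dirty.clear()
```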

mengyonglee

Dude, you are the reason for my interest in system design. Thanks, and never stop making system design videos!

bhavyeshvyas

If someone explains a concept in an interview with the confidence and clarity you show, he/she can seriously rock it. Heavily inspired by you and love your system design content. Thanks for the effort @Gaurav Sen

rahuljain

I am actually using write-back Redis in our system, but this video really helped me understand what's happening overall. Great video!

SatyadeepRoat

A cache doesn't stop network calls, but it does stop slow, costly database queries. This is still explained well, and I'm being a little pedantic. Good video, great excitement and energy.

Sound_.-Safari

Nice video Gaurav, I really like your way of explaining. Also, the fast-forward when you write on the board is great editing; it keeps the viewer hooked.

neerajmathur

The world needs more people like you. Thank you!

jsf

Great content. Would love to hear more about how to solve cached data inconsistencies in distributed systems.

kabooby

Always watching your videos. Topics straight to the point. Keep uploading, man. Thanks always.

jajasaria

Great video. But I wanted to point out that what you are referring to as 'write-back' is termed 'write-around', as it comes "around" to the cache after writing to the database. Both 'write-around' and 'write-through' are "eager writes" done synchronously. In contrast, "write-back" is a "lazy write" policy done asynchronously: data is written to the cache and propagated to the database in a non-blocking manner. We may choose to be even lazier, play around with the timing, and batch the writes to save network round-trips. This reduces latency at the cost of temporary inconsistency (or permanent inconsistency if the cache server crashes; to avoid that, we replicate the caches).

AnonyoX

A few other reasons not to store absolutely everything in the cache (thereby ditching DBs altogether) are (1) durability, since some caches are in-memory only, and (2) range lookups, which would require searching the whole cache, whereas a DB can at least leverage an index to help with a range query. Once a DB responds to a range query, of course, that response could be cached.

devinsills

I watched each of your videos at least twice, lol. Thank you!! WE ALL LOVE YOU! YOU ARE THE BEST!

oykfwrl

Hi Gaurav, I really like your videos, thank you for sharing! I need to point out something about this video: writing directly to the DB and updating the cache afterwards is called write-around, not write-back. The last option you provided, writing to the cache and updating the DB after a while if necessary, is called write-back.

zehrasubas

Great explanation. You are making my revision so much easier. Thanks!!

legozxx

I watched this video 3 times because of the confusion, but your pinned comment saved my mind. Thank you, sir!

anjurawat
Автор

Hi Gaurav - good video on distributed caching! This expands a bit on what I learned in my computer architecture class - I didn't remember cache thrashing too well, or what distinguished write-through from write-back. I think learning caching in the context of networks is more interesting: it was initially introduced as a way to avoid hitting disk (on a single machine), but it is also a way to reduce network calls from servers to databases.

harisridhar