How Reddit designed their metadata store to serve 100k req/sec at p99 of 17ms


Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.

# Recommended videos and playlists

If you liked this video, you will find the following videos and playlists helpful.

# Things you will find amusing

# Other socials

I write and share my practical experience and learnings every day, so if that resonates with you, follow along. I keep it no-fluff.

Thank you for watching and supporting! It means a ton.

I am on a mission to bring out the best engineering stories from around the world and make you all fall in love with engineering. If this resonates with you, follow along; I always keep it no-fluff.
# Comments

Successfully ruined my upcoming weekend. Have to view all of your videos now 😢

nextgodlevel

The Kafka CDC can solve the problem of synchronous write inconsistencies, but not the backfill overwriting. I suspect they might do some kind of business-logic or SHA/checksum validation to ensure they are not overwriting the data during backfilling. Correct me if I'm missing something, bro.

richi
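
One way to implement the guard this comment is guessing at is to make the backfill upsert conditional, so an archival row never replaces a row the live write path has already landed. A minimal sketch in Python with psycopg2, assuming a hypothetical `post_metadata` table with a `version` column; none of this is confirmed by Reddit's write-up:

```python
import json
import psycopg2

# Hypothetical schema:
#   post_metadata(post_id BIGINT PRIMARY KEY, data JSONB, version BIGINT)
BACKFILL_UPSERT = """
INSERT INTO post_metadata (post_id, data, version)
VALUES (%(post_id)s, %(data)s, %(version)s)
ON CONFLICT (post_id) DO UPDATE
    SET data = EXCLUDED.data,
        version = EXCLUDED.version
    -- Only overwrite if the backfilled row is newer than whatever the
    -- live write path (dual writes / CDC) has already written.
    WHERE post_metadata.version < EXCLUDED.version;
"""

def backfill_row(conn, post_id: int, data: dict, version: int) -> None:
    """Upsert one archival row without clobbering fresher live writes."""
    with conn.cursor() as cur:
        cur.execute(BACKFILL_UPSERT, {
            "post_id": post_id,
            "data": json.dumps(data),
            "version": version,
        })
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=reddit_meta")  # hypothetical DSN
    backfill_row(conn, 42, {"media_type": "image"}, version=1)
```

A checksum comparison would work the same way: store it alongside the row and only update when the incoming row is known to be fresher.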

Large Data Migration -> Event Driven Architecture

Also, it's interesting to learn about Postgres's extensions, which aren't required if you go with a serverless database solution like DynamoDB.

AayushThokchom

It's great that YouTube has such useful videos. Thank you, Minister!

GSTGST-dwrf

Since they are storing data as JSON and also scaling the Postgres DB, why didn't they go with a non-relational DB like MongoDB, which stores data as JSON and provides scaling out of the box?

kunalyadav
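
For context on these "why not a document DB" questions: Postgres stores and indexes JSON natively via the `jsonb` type, so keeping the relational engine doesn't mean giving up JSON. A minimal sketch; the table and field names are illustrative, not taken from the video:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS media_metadata (
    post_id    BIGINT PRIMARY KEY,
    attributes JSONB NOT NULL
);
-- A GIN index lets Postgres serve key/value lookups inside the JSON document.
CREATE INDEX IF NOT EXISTS media_metadata_attrs_gin
    ON media_metadata USING GIN (attributes);
"""

QUERY = """
SELECT post_id, attributes->>'media_type' AS media_type
FROM media_metadata
WHERE attributes @> %s::jsonb;   -- containment query, served by the GIN index
"""

with psycopg2.connect("dbname=reddit_meta") as conn:   # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(QUERY, ('{"media_type": "video"}',))
        rows = cur.fetchall()
```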

Why didn't Reddit go for a document DB for this storage, given the structure and access pattern? What do you think about it @arpit?

keshavb

Arpit - using CDC and Kafka still does not solve the problem of data from the old source overriding data in the new Aurora Postgres during the migration, right?
What am I missing?
You would still need a bulk batch job that takes all the archival data from the multiple sources and ingests it into the new Aurora. CDC does not solve that backfill, correct?

pixiedust
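
A common way to resolve this backfill-vs-live-writes concern (not necessarily what Reddit did) is to let the live path own the new database and have the bulk job fill only the gaps, never overwriting. A sketch under that assumption, with hypothetical table names:

```python
import psycopg2
from psycopg2.extras import execute_values

# The backfill inserts archival rows but never overwrites anything the
# live write path (dual writes / CDC) has already put into the new DB.
BACKFILL_SQL = """
INSERT INTO post_metadata (post_id, data)
VALUES %s
ON CONFLICT (post_id) DO NOTHING;
"""

def backfill_batch(conn, rows) -> None:
    """rows: iterable of (post_id, json_string) tuples from the old sources."""
    with conn.cursor() as cur:
        execute_values(cur, BACKFILL_SQL, rows, page_size=1000)
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=reddit_meta")  # hypothetical DSN
    backfill_batch(conn, [(1, '{"media_type": "gif"}')])
```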

Why are they using Postgres if they are storing the data as JSON?

suhanijain

I guess we don't need both the CDC setup and dual writes; the CDC setup alone would suffice to insert the data into the new DB, correct?

JardaniJovonovich

Hi Arpit, I think you could have gone a bit more into depth, like they do in their blog - a bit about how they use an incrementing post_id, which allows them to serve most queries from one partition only. Not complaining at all. Thanks for being awesome as always.

TL;DR: 7 minutes seems a bit short.

sachinmalik
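
For readers who want the mechanics behind the single-partition point: with declarative range partitioning on `post_id`, recent (hot) posts all fall into the newest partition, so most queries are pruned down to one partition. A minimal sketch with made-up boundaries:

```python
import psycopg2

PARTITION_DDL = """
CREATE TABLE IF NOT EXISTS post_metadata (
    post_id BIGINT NOT NULL,
    data    JSONB,
    PRIMARY KEY (post_id)
) PARTITION BY RANGE (post_id);

-- Boundaries are illustrative; a real setup would size them from write volume.
CREATE TABLE IF NOT EXISTS post_metadata_p0
    PARTITION OF post_metadata FOR VALUES FROM (0) TO (100000000);
CREATE TABLE IF NOT EXISTS post_metadata_p1
    PARTITION OF post_metadata FOR VALUES FROM (100000000) TO (200000000);
"""

with psycopg2.connect("dbname=reddit_meta") as conn:   # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(PARTITION_DDL)
```

Queries that filter on `post_id` (which almost all of them do, since the ID is the key) get partition pruning for free.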

How does PgBouncer minimize the cost of creating a new process for each request?
Maybe I am wrong - can you tell me how the cost is reduced here?

nextgodlevel
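
On the PgBouncer question: Postgres forks a dedicated backend process for every new connection, and that fork plus session setup is the cost being avoided. PgBouncer keeps a pool of long-lived server connections and hands them out to clients, so most client connects never reach Postgres at all. A rough way to see the difference from the application side; the ports and DSNs are assumptions about a local setup:

```python
import time
import psycopg2

def time_connect(dsn: str, n: int = 50) -> float:
    """Average seconds to open and close a fresh connection, n times."""
    start = time.perf_counter()
    for _ in range(n):
        conn = psycopg2.connect(dsn)
        conn.close()
    return (time.perf_counter() - start) / n

# Direct to Postgres: every connect forks a new backend process.
direct = time_connect("host=localhost port=5432 dbname=reddit_meta")
# Through PgBouncer: connects are handed an already-open server connection.
pooled = time_connect("host=localhost port=6432 dbname=reddit_meta")

print(f"direct: {direct*1000:.1f} ms/conn, via pgbouncer: {pooled*1000:.1f} ms/conn")
```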

What is the CDC mentioned here? Please suggest some pointers.

atanusikder
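
CDC (change data capture) means tailing the source database's change stream, usually via its write-ahead log or binlog with a tool like Debezium, and publishing every row change as an event, often to Kafka, so another system can replay it. A minimal consumer-side sketch; the topic name and event shape are made up:

```python
import json
import psycopg2
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "old_db.post_metadata.changes",          # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

conn = psycopg2.connect("dbname=reddit_meta")  # hypothetical DSN

APPLY_SQL = """
INSERT INTO post_metadata (post_id, data)
VALUES (%(post_id)s, %(data)s)
ON CONFLICT (post_id) DO UPDATE SET data = EXCLUDED.data;
"""

# Each event describes one row change captured from the old database;
# replaying them keeps the new database in sync during the migration.
for event in consumer:
    change = event.value
    with conn.cursor() as cur:
        cur.execute(APPLY_SQL, {"post_id": change["post_id"],
                                "data": json.dumps(change["data"])})
    conn.commit()
```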

Good video. Would appreciate it a lot if you could attach any resources you used in the video, like the Reddit blog mentioned in the description. It would be great if the link were attached there as well.

poojanpatel

Hey Arpit, thanks a lot for putting this up. Your writing skills are next level - crisp and crystal clear. Could you please tell us what setup you use for taking these notes?
Thanks in advance.

vinayak_

How did they check whether the reads from the old and new databases were the same?

TechCornerWithAjay
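
A common answer to this verification question is shadow reads: serve traffic from the old store, issue the same read against the new store in the background, and compare. A minimal sketch; `read_old` and `read_new` are placeholders for whatever read paths the service actually has:

```python
import logging

log = logging.getLogger("shadow_read")

def shadow_read(post_id, read_old, read_new):
    """Serve from the old store; compare against the new store on the side."""
    old_row = read_old(post_id)
    try:
        new_row = read_new(post_id)
        if new_row != old_row:
            # Mismatches are logged (or counted as metrics) rather than
            # surfaced to the user, so verification never affects traffic.
            log.warning("mismatch for post_id=%s: old=%r new=%r",
                        post_id, old_row, new_row)
    except Exception:
        log.exception("shadow read failed for post_id=%s", post_id)
    return old_row
```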

Hey Arpit… thanks for the video

I liked the idea of partitioning as a policy that runs on a cron. But wouldn't moving data around between partitions also warrant a change in the backend (reads)?

Or are you saying the backend has been written in a way that takes partitioning into account while reading the data?

dreamerchawla
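
On the cron point: with range partitioning, a row's partition is fixed by its key, so data doesn't move between partitions; the scheduled job typically just creates the next partition ahead of demand, and reads keep going through the parent table unchanged. A sketch of such a maintenance job, with hypothetical sizing:

```python
import psycopg2

# Hypothetical policy: one partition per fixed post_id range, created ahead of need.
CREATE_NEXT = """
CREATE TABLE IF NOT EXISTS post_metadata_p{n}
    PARTITION OF post_metadata
    FOR VALUES FROM ({lo}) TO ({hi});
"""

PARTITION_WIDTH = 100_000_000  # illustrative range size per partition

def ensure_partition_for(conn, max_post_id: int) -> None:
    """Cron-driven: make sure the partition covering max_post_id exists."""
    n = max_post_id // PARTITION_WIDTH
    lo, hi = n * PARTITION_WIDTH, (n + 1) * PARTITION_WIDTH
    with conn.cursor() as cur:
        cur.execute(CREATE_NEXT.format(n=n, lo=lo, hi=hi))
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=reddit_meta")  # hypothetical DSN
    ensure_partition_for(conn, max_post_id=250_000_000)
```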

How many shards were used to hold those partitions to achieve that much throughput?

ganeshgottipati

What is used over here to write down the notes?

calvindsouza

How would you handle search, since the relevant data might live in partitions that are several days old? Even if they're using a secondary data store, date/time range-based partitioning or even sharding will not suffice. What do you think?

code-master

Thanks Arpit!! Also, what are your thoughts on using Pandas as a metadata DB? Dropbox had a post about them using Pandas in which they explained in depth why other DBs weren't a better fit for them. (Would like to know your views on it too.)

LeoLeo-nxgi