Best Practices Working with Billion-row Tables in Databases


Chapters
Intro 0:00
1. Brute Force Distributed Processing 2:30
2. Working with a Subset of the Table 3:35
2.1 Indexing 3:55
2.2 Partitioning 5:30
2.3 Sharding 7:30
3. Avoid it altogether (reshuffle the whole design) 9:10
Summary 11:30

🎙️Listen to the Backend Engineering Podcast

🏭 Backend Engineering Videos

💾 Database Engineering Videos

🏰 Load Balancing and Proxies Videos

🏛️ Software Architecture Videos

📩 Messaging Systems

Become a Member

Support me on PayPal

Stay Awesome,
Hussein
Comments

The only thing about this channel that makes me feel awful is that I didn't discover it earlier. The topics discussed here are not really common, but they are extremely important in my opinion. I haven't found anything that comes near this channel; some topics are so advanced that I hadn't even heard of them. It makes me feel like I know nothing, which is the best feeling ever. There is so much to learn! Thank you so much!

elultimopujilense

This idea of shifting the delay to the writer AND then using queues for writes is true architectural foresight. I love it.
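
A minimal sketch of that pattern, using only the Python standard library (the batch size, flush logic, and do_batch_insert placeholder are illustrative assumptions, not the video's implementation): the request path only enqueues, and a background worker absorbs the write latency.

    # Queueing writes so the caller never waits on the database.
    import queue
    import threading
    import time

    write_queue = queue.Queue()

    def do_batch_insert(rows):
        # Placeholder for the real bulk INSERT against the database.
        print(f"flushing {len(rows)} rows")

    def writer_loop():
        batch = []
        while True:
            try:
                batch.append(write_queue.get(timeout=1.0))
            except queue.Empty:
                pass
            if len(batch) >= 100 or (batch and write_queue.empty()):
                do_batch_insert(batch)  # the delay lives here, not in the request path
                batch = []

    threading.Thread(target=writer_loop, daemon=True).start()

    # The "fast" path: callers just enqueue and return immediately.
    write_queue.put({"user_id": 1, "action": "follow", "target_id": 2})
    time.sleep(2)  # give the background writer a moment in this demo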

zacharythatcher

I'm a junior software engineer working my way toward specializing in backend. I always learn new concepts from your videos. This one in particular hits home for me because I've worked on a couple of projects where I had to make the database design choices. I found it quite difficult to make the right choices because I always end up building a search endpoint with full-text search and other search parameters.

To put it into context, the databases I had problems with were filled with recipes. I had multiple tables with 100k+ entries containing only IDs (the problem you mentioned). After two projects I switched to the last idea you mentioned, the list/JSON column, and that one worked best for me: not only does it avoid searching through a big table, it also saves me an extra query to another table.

This is kind of irrelevant to this video, but when implementing full-text search I think it's better to go with PostgreSQL rather than MySQL, since it supports GIN and GiST indexes and fuzzy searching, which can help build a nice, affordable, and quick solution for medium-sized databases.
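
A minimal sketch of both ideas, assuming a hypothetical PostgreSQL recipes table and psycopg2 (table, column, and connection details are all illustrative): tags live in a JSONB column instead of a separate join table, and full-text search goes through a GIN index.

    # Illustrative only: a recipes table with a JSONB tag list (denormalized,
    # no join table) and a GIN-backed full-text search index.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=recipes user=app")  # assumed connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS recipes (
                id    BIGSERIAL PRIMARY KEY,
                title TEXT NOT NULL,
                body  TEXT NOT NULL,
                tags  JSONB NOT NULL DEFAULT '[]'
            );
        """)
        # GIN index over the tags array: fast containment checks, no join table.
        cur.execute("CREATE INDEX IF NOT EXISTS idx_recipes_tags ON recipes USING GIN (tags);")
        # GIN index over a tsvector expression: full-text search on title + body.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_recipes_fts
                ON recipes USING GIN (to_tsvector('english', title || ' ' || body));
        """)

        cur.execute(
            "INSERT INTO recipes (title, body, tags) VALUES (%s, %s, %s)",
            ("Shakshuka", "Eggs poached in spiced tomato sauce", json.dumps(["eggs", "breakfast"])),
        )
        # Find recipes tagged 'eggs' whose text matches 'tomato'.
        cur.execute("""
            SELECT id, title FROM recipes
            WHERE tags @> %s::jsonb
              AND to_tsvector('english', title || ' ' || body) @@ plainto_tsquery('english', %s)
        """, (json.dumps(["eggs"]), "tomato"))
        print(cur.fetchall())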

Keep doing the nice work.

achraf

Man, your content is GOLD. I come from a front-end stack and used to underestimate the work involved with databases, but your content has helped me understand the pitfalls of backend engineering.

andreivilla

I'm a front-end developer who acquired backend skills, all because of the good content from this channel. Amazing content and energy, thank you 😊

PiyushChauhan

Your channel helps not only software engineers, but also data analysts like me who work a lot with data engineering. Thanks! Greetings from Brazil.

joaopedromoreiradeoliveira

Yet another great video! Truly educational, even for someone who has been in the game for over 15 years. 👏

t

Couldn't have explained it more simply! Big ups, dude!!

OmarBravotube

I really like listening to you while doing other stuff, like driving, eating, or walking. I'm always being productive and learning new things on the side from your videos. Habibi ❤

Bilo_

Generally, it's helpful to think in terms of the read and write paths of your data. On the read side, on top of partitioning you can add Bloom filters to quickly test whether a value exists or not, reducing the need to search the B-tree or other persistent data structures.
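
A toy Bloom filter along those lines, using only the Python standard library (the bit-array size and number of hashes are arbitrary): a cheap membership pre-check before touching the on-disk structure.

    # Toy Bloom filter: answers "definitely not present" or "maybe present" so we
    # can skip probing the B-tree/SSTable when the answer is a definite no.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"))   # True  (maybe present -> go check the index)
    print(bf.might_contain("user:99"))   # almost certainly False (skip the disk lookup)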

nilanjansarkar

Good video. The last trick in the video is called denormalization. Also, as soon as you introduce sharding you also need to add replication, because the probability of failures increases.

ami

I had to deal with tables with a few billion records per month, and MySQL's MERGE engine (merge tables) let us slice and dice them any way we needed. You can even have more specific indexes on the underlying tables themselves, as long as each index defined on the MERGE table exists in all of them.

The downside of MERGE tables is that they multiply the open file handles on the system, which can be tricky for a machine also doing public networking, but with the latest kernels you can push the limits pretty far.

High memory makes a huge difference, of course.
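
A minimal sketch of that setup, assuming monthly MyISAM tables and the mysql-connector-python driver (table names, columns, and credentials are illustrative): the MERGE table unions the monthly tables so they can be queried as one.

    # Illustrative MERGE-table setup: monthly MyISAM tables plus a MERGE table
    # that unions them.
    import mysql.connector

    conn = mysql.connector.connect(user="app", password="secret", database="logs")  # assumed
    cur = conn.cursor()

    monthly_ddl = """
        CREATE TABLE IF NOT EXISTS {name} (
            id      BIGINT NOT NULL,
            created DATETIME NOT NULL,
            msg     TEXT,
            INDEX (created)
        ) ENGINE=MyISAM
    """
    for name in ("events_2024_01", "events_2024_02"):
        cur.execute(monthly_ddl.format(name=name))

    # The MERGE table exposes the monthly tables as one; its indexes must exist
    # in every underlying table. New rows go to the last table (INSERT_METHOD=LAST).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_all (
            id      BIGINT NOT NULL,
            created DATETIME NOT NULL,
            msg     TEXT,
            INDEX (created)
        ) ENGINE=MERGE UNION=(events_2024_01, events_2024_02) INSERT_METHOD=LAST
    """)

    cur.execute("SELECT COUNT(*) FROM events_all WHERE created >= '2024-02-01'")
    print(cur.fetchone())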

videosforthegoodlife

I've gotten many database-side solutions from your videos. Thanks for your support.

rajendiranvenkat

Two extra ideas:
1. Table archiving: most large tables come from time-series records -- just archive the old records into separate tables and keep the live table small (see the sketch below).
2. Use modern databases that are more scalable than traditional single-host databases: CockroachDB, Spanner, Aurora, TiDB, Fauna, etc.
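
A minimal sketch of the archiving idea in the first point, assuming an existing PostgreSQL orders table and psycopg2 (all names, the cutoff, and the connection string are illustrative): move old time-series rows into an archive table inside one transaction, keeping the live table small.

    # Illustrative archiving job: move rows older than a cutoff from the live
    # "orders" table (assumed to exist) into an archive table, in one transaction.
    import psycopg2

    conn = psycopg2.connect("dbname=shop user=app")  # assumed connection string
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS orders_archive (LIKE orders INCLUDING ALL);")
        # DELETE ... RETURNING feeds the moved rows straight into the archive
        # table, so a row is never in both tables or in neither.
        cur.execute("""
            WITH moved AS (
                DELETE FROM orders
                WHERE created_at < now() - interval '1 year'
                RETURNING *
            )
            INSERT INTO orders_archive SELECT * FROM moved;
        """)
        print(f"archived {cur.rowcount} rows")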

aXUTLO

Thanks for the video. I used the YouTube 'applaud' feature for the first time with a small token of 100 rupees. Hope you received it. Keep going 🙏🙏🙏

dinakaranonline

Woah, the JSON method literally blew my mind.

Sarwaan

Fabulous content on your channel. Subscribed. Thanks!! Keep up the good work :)

rashmidewangan

I believe the last concept is actually called denormalization. Another option could be considering NoSQL.
By the way, you are awesome

alichoobdar

Maybe I didn't understand the second-to-last section about eliminating the need to update both ends of a connection, but that solution will break down when person A, who is following person B, closes their account, since the information about who follows person B lives only in person B's record. So when person A leaves the platform, we won't know whose records to update.
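
To make the concern concrete, here is a tiny sketch with made-up in-memory data (not from the video): with only a per-user followers list, deleting person A means scanning every user's list, unless you also keep a reverse "following" list and accept updating both ends again.

    # Illustration of the concern above: if we only store "who follows me" on
    # each user, removing a deleted follower requires scanning every record.
    followers = {
        "B": ["A", "C"],   # people who follow B
        "C": ["A"],
        "A": [],
    }

    def delete_account(user):
        followers.pop(user, None)
        # Without a reverse index ("who does A follow?"), this is a full scan
        # of every record -- the kind of cost the single-sided design avoids on writes.
        for who, their_followers in followers.items():
            if user in their_followers:
                their_followers.remove(user)

    delete_account("A")
    print(followers)   # {'B': ['C'], 'C': []}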

arianseyedi

I love you, man. You are so crystal clear. You are a legend.

insearchof