Understanding Aggregate Functions Performance | The Backend Engineering Show

The performance of aggregate functions like COUNT, MAX, MIN, and AVG really depends on how you have tuned your database for that kind of workload. Let us discuss this.

0:00 Intro
1:22 SELECT COUNT(*)
4:30 SELECT AVG(A)
5:15 SELECT MAX(A)
8:00 Best case scenario
11:30 Clustering
14:00 Clustering Sequential Writes
17:19 Clustering Random Writes
20:30 Summary

Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)

Introduction to NGINX (link redirects to udemy with coupon)

Python on the Backend (link redirects to udemy with coupon)

Become a Member on YouTube

🔥 Members Only Content


🏭 Backend Engineering Videos in Order

💾 Database Engineering Videos

🎙️Listen to the Backend Engineering Podcast

Gears and tools used on the Channel (affiliates)

🖼️ Slides and Thumbnail Design
Canva


Stay Awesome,
Hussein
Comments

I am a data engineer and your channel has been invaluable for my learning lately

YourMakingMeNervous

I've been a database specialist for years and I did not know these intimate details. Great, thanks! You are simply awesome.

SinskariBoithree

I'm a front-end developer but I'm starting to like backend because of you!

jomarmontuya

I always enjoy the way you dissect things and get to the bare metal to fully understand and explain it! You're amazing!! 👏

md.hussainulislamsajib

The way you explain and dissect things is really amazing. I haven't found a video as detailed as yours. It's really useful for understanding such beautiful things.

overflow

Thanks a lot @Hussein for this informative video.
I do not have a PhD but I'm going to attempt to solve the problems of clustered sequential and random writes. Opinions from the community will be appreciated :)

1) Clustered Sequential Writes: How about we use an in-memory, queue-like data structure for fresh new writes? For every new write, we allocate some memory with the new data and atomically append it at the tail end of the queue. Since the writes are strictly ordered by the time they arrive at the database, we can use a lock-free primitive like compare-and-swap to atomically append data to the queue, which eliminates the lock overhead that mutexes introduce. Then, when it comes to actually adding the writes to the B-tree index structure, there's a big potential for the tree to rebalance on every new write, which can cause a big performance overhead. Since the writes are batched up in the queue, we can add them to the tree at a configurable threshold so that we restructure the tree much less often than the rate at which writes arrive. Finally, at some point we need to flush our tree to disk. Since the right subtree of our B-tree is the one that's mostly growing, most of the changes happen sequentially in the same region on disk, and we can use fadvise when flushing our changes to disk to get some extra performance. This approach has trade-offs: we get superior sequential write performance but poor random reads, since we need to search both the B-tree (O(log N)) and the in-memory queue (O(N) for a size equal to the configurable threshold). A rough sketch of this idea appears at the end of this comment.

2) Clustered Random Writes: @Hussein, can you elaborate on how exactly flushing the entire contents of memory causes write amplification, as in, for each record received, how many disk writes need to take place for it to be persisted?
For this problem, there is one extreme where every write results in a disk write, which can be very expensive. At the other extreme, we store data in memory until we run out, then flush everything to disk and rebuild the tree afresh if needed. This is also a problem, since writes need to be paused during flushes. I can anecdotally say that in most storage engines, at least in RocksDB, writes can be batched at a configurable threshold. This threshold will lie somewhere in the middle, where the user weighs the amount of memory they have against the flush frequency they can tolerate, so that writes can be flushed to disk in a manner that doesn't make performance suffer.

I welcome all opinions from all you database enthusiasts ;)
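A minimal sketch of the buffered-write idea from point 1, assuming a simplified single-threaded Python model rather than the lock-free CAS queue described (the BufferedTable class and flush_threshold parameter are invented for illustration; a Python list kept sorted with bisect stands in for the on-disk clustered B-tree):

import bisect

class BufferedTable:
    """Sketch: buffer new writes in memory, merge into the sorted part in batches."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.sorted_rows = []   # stands in for the on-disk clustered B-tree (kept sorted)
        self.buffer = []        # stands in for the in-memory append-only queue

    def insert(self, key):
        # Appends are cheap and arrive in time order; no tree work happens yet.
        self.buffer.append(key)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge the whole batch at once instead of paying the insertion cost
        # (and potential page splits) once per individual write.
        for key in self.buffer:
            bisect.insort(self.sorted_rows, key)
        self.buffer.clear()

    def contains(self, key):
        # Reads must check both places: O(log n) in the sorted part and
        # O(threshold) in the unflushed buffer, the trade-off noted above.
        i = bisect.bisect_left(self.sorted_rows, key)
        in_sorted = i < len(self.sorted_rows) and self.sorted_rows[i] == key
        return in_sorted or key in self.buffer

t = BufferedTable(flush_threshold=3)
for k in (10, 7, 42, 3, 19):
    t.insert(k)
print(t.contains(42), t.contains(99))  # True False

The same threshold knob is what point 2 is weighing: a larger buffer means fewer, bigger flushes but more memory held and a longer pause when the flush finally happens.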

briankimutai

I literally made notes, such core details 🙌

mrvaibh

For the estimates: if a guess within 5-10 rows of error on a table with 100M rows is acceptable, it might be easier to get the average size of a row in bytes and do some math with the total bytes taken by the table.
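A back-of-the-envelope version of that arithmetic, with purely hypothetical byte figures (in practice the table size and average row width would come from the database's own catalog statistics, and the estimate is only as good as those numbers):

table_bytes = 12_800_000_000   # assumed total on-disk size of the table
avg_row_bytes = 128            # assumed average row width, including per-row overhead
estimated_rows = table_bytes // avg_row_bytes
print(f"~{estimated_rows:,} rows")  # ~100,000,000 rows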

squirrel

5:50 At first I interpreted "smallest" as "of least disk size in bytes", but it more likely means their sequential ordering with respect to the overall range.

pieter

I would really love to know where you get the ideas for these videos and how much time you put into reading and researching them, because this is really awesome and it sounds like you put in a lot of time. Thanks a lot for the awesome content you post here.

semperfiArs

Thank you for the video! Amazing advice.

young-ceo

Love your content, man. Can you talk about the Okta breach?

michaelangelovideos

always a pleasure to watch your informative videos :)

vendetta

Thanks for your effort, I really learned a lot.

subhamprasad

Did you make a video about varchar and how it is stored in different systems? I have always wondered how a varchar(255) is handled on disk and how changes are handled. Is all data saved at the maximum size as fixed-size records, are records stored with variable length, or are all varchars stored in a separate space altogether?

henrikschmidt

Would it be best to have both? Like writing to a buffer while one process continuously writes to the actual disk, to prevent the buffer checkpoint situation?

dhawaljoshi

Thank you for your awesome discussion in this video, @Hussein!

However, I have a question about the part where you talk about the cons of UUID random writes exhausting the RAM: what is the solution for this specific case? Should we just consider using the clustered index with the sequential writes you mentioned, or is there a particular solution for this? 🙏🙏

nguyentanphat

Will SELECT COUNT(A) FROM T; use an index or will it go for a table scan?

letsmusic

Also, Hussein, I know that it will be time-consuming and will probably not be of much interest to you, but do more videos on web3 and Ethereum and the underlying tech. I've seen the videos on IPFS and Bitcoin mining you did, but it'd be nice if you did more.


Also, how is it possible that in ML they're able to process so much data in their pandas tables, when doing the same sort of manipulations on similar data in a regular database would be slower?

shiewhun

If COUNT is slow, then how does auto-increment impact performance?

Delrida