Understanding Aggregate Functions Performance | The Backend Engineering Show

The performance of aggregate functions like COUNT, MAX, MIN, and AVG really depends on how you have tuned your database for that kind of workload. Let us discuss this.

0:00 Intro
1:22 SELECT COUNT(*)
4:30 SELECT AVG(A)
5:15 SELECT MAX(A)
8:00 Best case scenario
11:30 Clustering
14:00 Clustering Sequential Writes
17:19 Clustering Random Writes
20:30 Summary

Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)

Introduction to NGINX (link redirects to udemy with coupon)

Python on the Backend (link redirects to udemy with coupon)

Become a Member on YouTube

🔥 Members Only Content


🏭 Backend Engineering Videos in Order

💾 Database Engineering Videos

🎙️Listen to the Backend Engineering Podcast

Gears and tools used on the Channel (affiliates)

🖼️ Slides and Thumbnail Design
Canva


Stay Awesome,
Hussein
Comments

I am a data engineer and your channel has been invaluable for my learning lately

YourMakingMeNervous

I've been a database specialist for years and I did not know these intimate details. Great, thanks! You are simply awesome.

SinskariBoithree

I'm a front-end developer but I'm starting to like backend because of you!

jomarmontuya

I always enjoy the way you dissect things and get to the bare metal to fully understand and explain it! You're amazing!! 👏

md.hussainulislamsajib

The way you explain and dissect things is really amazing. I haven't found a video as detailed as yours. It's really useful for understanding such beautiful things.

overflow

Thanks a lot @Hussein for this informative video.
I do not have a PhD but I'm going to attempt to solve the problems of clustered sequential and random writes. Opinions from the community will be appreciated :)

1) Clustered Sequential Writes: How about we use an in-memory, queue-like data structure for fresh new writes? For every new write, we allocate some memory with the new data and atomically append it at the tail end of the queue. Since the writes are strictly ordered by the time they arrive at the database, we can use a lock-free primitive like compare-and-swap to atomically append data to the queue, which eliminates the lock overhead that mutexes introduce. Then, when it comes to actually adding the writes to the B-tree index structure, there's a big potential for the tree to rebalance on every new write, which can cause a big performance overhead. Since the writes are batched up in the queue, we can add them to the tree at a configurable threshold so that we restructure the tree much less often than the rate at which writes arrive. Finally, at some point we need to flush our tree to disk. Since the right subtree of our B-tree is the one that's mostly growing, most of the changes happen sequentially in the same region on disk, and we can use fadvise when flushing our changes to disk to get some extra performance. This approach has trade-offs: we get superior sequential write performance but poor random reads, since we need to search both the B-tree (O(log N)) and the in-memory queue (O(N) for a size equal to the configurable threshold). A rough sketch of this idea appears at the end of this comment.

2) Clustered Random Writes: @Hussein, can you elaborate on how exactly flushing the entire contents of memory causes write amplification, as in, for each record received, how many disk writes need to take place for it to be persisted?
For this problem, there is one extreme where every write results in a disk write, which can be very expensive. At the other extreme, we store data in memory until we run out, then flush everything to disk and rebuild the tree afresh if needed. This is also a problem, since writes need to be paused during flushes. I can anecdotally say that in most storage engines, at least in RocksDB, writes can be batched at a configurable threshold. This threshold will lie somewhere in the middle, where the user weighs the amount of memory they have against the flush frequency they can tolerate, so that writes can be flushed to disk in a manner that doesn't make performance suffer.

I welcome all opinions from all you database enthusiasts ;)
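A minimal sketch of the buffered-write idea from point 1, assuming a simplified single-threaded Python model rather than the lock-free CAS queue described (the BufferedTable class and flush_threshold parameter are invented for illustration; a Python list kept sorted with bisect stands in for the on-disk clustered B-tree):

import bisect

class BufferedTable:
    """Sketch: buffer new writes in memory, merge into the sorted part in batches."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.sorted_rows = []   # stands in for the on-disk clustered B-tree (kept sorted)
        self.buffer = []        # stands in for the in-memory append-only queue

    def insert(self, key):
        # Appends are cheap and arrive in time order; no tree work happens yet.
        self.buffer.append(key)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge the whole batch at once instead of paying the insertion cost
        # (and potential page splits) once per individual write.
        for key in self.buffer:
            bisect.insort(self.sorted_rows, key)
        self.buffer.clear()

    def contains(self, key):
        # Reads must check both places: O(log n) in the sorted part and
        # O(threshold) in the unflushed buffer, the trade-off noted above.
        i = bisect.bisect_left(self.sorted_rows, key)
        in_sorted = i < len(self.sorted_rows) and self.sorted_rows[i] == key
        return in_sorted or key in self.buffer

t = BufferedTable(flush_threshold=3)
for k in (10, 7, 42, 3, 19):
    t.insert(k)
print(t.contains(42), t.contains(99))  # True False

The same threshold knob is what point 2 is weighing: a larger buffer means fewer, bigger flushes but more memory held and a longer pause when the flush finally happens.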

briankimutai

I literally made notes, such core details 🙌

mrvaibh

For the estimates: if a guess within 5-10 rows of error on a table with 100M rows is acceptable, it might be easier to get the average size of a row in bytes and do some math with the total bytes taken by the table.
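A back-of-the-envelope version of that arithmetic, with purely hypothetical byte figures (in practice the table size and average row width would come from the database's own catalog statistics, and the estimate is only as good as those numbers):

table_bytes = 12_800_000_000   # assumed total on-disk size of the table
avg_row_bytes = 128            # assumed average row width, including per-row overhead
estimated_rows = table_bytes // avg_row_bytes
print(f"~{estimated_rows:,} rows")  # ~100,000,000 rows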

squirrel

5:50 At first I interpreted "smallest" as "of least disk size in bytes", but it more likely means their sequential ordering with respect to the overall range.

pieter

I would really love to know where you get the ideas for these videos and how much time you put into reading and researching them, because this is really awesome and it sounds like you put in a lot of time. Thanks a lot for the awesome content you post here.

semperfiArs

Thank you for the video! Amazing advice.

young-ceo

Love your content, man. Can you talk about the Okta breach?

michaelangelovideos

always a pleasure to watch your informative videos :)

vendetta

Thanks for your effort, I really learned a lot.

subhamprasad

Did you make a video about varchar and how it is stored in different systems? I have always wondered how a varchar(255) is handled on disk and how changes are handled. Is all data saved at the maximum size as fixed-size records, are records stored with variable length, or are all varchars stored in a separate space altogether?

henrikschmidt

Would it be best to have both? Like writing to a buffer while one process continuously writes to the actual disk, to prevent the buffer checkpoint situation?

dhawaljoshi

Thank you for your awesome discussion in this video, @Hussein!

However, I have a question about the part where you talk about the cons of UUID random writes exhausting the RAM: what is the solution for this specific case? Should we just consider using the clustered index with the sequential writes you mentioned, or is there a particular solution for this? 🙏🙏

nguyentanphat

Will SELECT COUNT(A) FROM T; use an index or will it go for a table scan?

letsmusic

Also, Hussein, I know that it will be time-consuming and will probably not be of much interest to you, but do more videos on web3 and Ethereum and the underlying tech. I've seen the videos on IPFS and Bitcoin mining you did, but it'd be nice if you did more.


Also, how is it possible that in ML they're able to process so much data in their pandas tables, when doing the same sort of manipulations on similar data in a regular database would be slower?

shiewhun

If COUNT is slow, then how does auto-increment impact performance?

Delrida