The cost of Hash Tables | The Backend Engineering Show

Hash tables are effective for caching, database joins, sets (checking whether something is in a collection), and even load balancing, partitioning, and sharding, among many other applications. Most programming languages support hash tables. However, they don't come without costs and limitations; let's discuss them.
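
As a taste of the hash join covered at 10:00, here is a minimal build-and-probe sketch in Python; the tables and column layout are invented for illustration.

# A minimal hash join sketch: build a hash table on the smaller table,
# then probe it while scanning the larger one.
users = [(1, "Ada"), (2, "Lin"), (3, "Sam")]        # (user_id, name)
orders = [(101, 1), (102, 3), (103, 1), (104, 9)]   # (order_id, user_id)

# Build phase: O(len(users)) inserts into an in-memory hash table.
name_by_id = {user_id: name for user_id, name in users}

# Probe phase: one O(1) average-case lookup per row of the bigger table.
joined = [(oid, uid, name_by_id[uid]) for oid, uid in orders if uid in name_by_id]
print(joined)  # [(101, 1, 'Ada'), (102, 3, 'Sam'), (103, 1, 'Ada')]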

0:00 Intro
1:50 Arrays
3:50 CPU Cost (NUMA/M1 Ultra)
6:50 Hash Tables
10:00 Hash Join
16:00 Cost of Hash Tables
20:00 Remapping Cost Hash Tables
22:30 Consistent hashing
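
To make the remapping cost at 20:00 concrete, here is a small sketch (server count and key names are arbitrary) of how hash(key) % N reshuffles almost every key when N changes, which is exactly the problem consistent hashing at 22:30 addresses.

# With slot = hash(key) % N, adding one server moves most keys. md5 is
# used as a stable hash (Python's built-in hash() is salted per process).
import hashlib

def server_for(key: str, n_servers: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_servers

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(server_for(k, 8) != server_for(k, 9) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved going from 8 to 9 servers")
# prints roughly 89%: about 8 of every 9 keys relocate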

Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)

Introduction to NGINX (link redirects to udemy with coupon)

Python on the Backend (link redirects to udemy with coupon)

Become a Member on YouTube

🔥 Members Only Content


🏭 Backend Engineering Videos in Order

💾 Database Engineering Videos

🎙️Listen to the Backend Engineering Podcast

Gears and tools used on the Channel (affiliates)

🖼️ Slides and Thumbnail Design
Canva


Stay Awesome,
Hussein
Comments

I love these long-form audio breakdowns. Very natural. Please keep it up. Inch wide mile deep conversations! Just got a new job and going to subscribe to your channel as soon as I get my first check 🙏🏾🙌🏾

andydataguy

would be great if you would introduce some visuals in your "lectures" - you cover a lot and it's hard to keep up

InsightByte

18:02 Isn't that just the idea of swap? You advertise that a bunch of memory is available, but if you access something not actually in memory, it will have to load the page into memory.

gyroninjamodder
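
The swap analogy above is essentially demand paging. A minimal sketch, assuming a Unix-like OS and a made-up scratch file name: mmap hands out address space up front, and pages are only faulted into RAM when touched.

# mmap "advertises" 1 GiB of address space; the OS faults pages in lazily.
import mmap
import os

path = "scratch.bin"            # hypothetical scratch file
size = 1 << 30                  # 1 GiB of address space

with open(path, "wb") as f:
    f.truncate(size)            # sparse file: no data blocks written yet

with open(path, "r+b") as f:
    view = mmap.mmap(f.fileno(), size)
    view[size - 1:size] = b"x"  # touching the last byte faults in one page
    view.close()

os.remove(path)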

Your videos are always helpful, refreshing and brain-opening. Thanks a lot man, I swear whenever I'm financially able I'll buy all of your courses and memberships

FilthySnob

Thanks for talking about hash tables.
I think the answer to your question about extending a hash table beyond system memory is memory paging (virtual memory): the OS does that on your behalf (handling page fault exceptions) and swaps memory pages between RAM and disk.
One other thing: a DBMS also uses hash tables on disk (indexing), and it is a very primitive operation to randomly access a position in a file on disk using fseek (the lseek syscall under the hood).

EngSamieh
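
To make the on-disk point above concrete, here is a toy fixed-record hash file in Python, where seek() plays the role of C's fseek. The record layout, bucket count, and last-writer-wins collision policy are invented for the sketch.

# Toy on-disk hash table with fixed 64-byte records. Collisions simply
# overwrite here; a real table would chain or probe.
import struct
import zlib

RECORD = struct.Struct("16s48s")        # key padded to 16 B, value to 48 B
NBUCKETS = 1024

def _slot(key: bytes) -> int:
    return zlib.crc32(key) % NBUCKETS   # stable hash across runs

def put(f, key: bytes, value: bytes):
    f.seek(_slot(key) * RECORD.size)    # random access, like fseek()
    f.write(RECORD.pack(key, value))

def get(f, key: bytes):
    f.seek(_slot(key) * RECORD.size)
    k, v = RECORD.unpack(f.read(RECORD.size))
    return v.rstrip(b"\x00") if k.rstrip(b"\x00") == key else None

with open("table.bin", "w+b") as f:
    f.truncate(NBUCKETS * RECORD.size)  # pre-size so every bucket exists
    put(f, b"user:42", b"alice")
    print(get(f, b"user:42"))           # b'alice'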

Great content. Would be so awesome to see some visuals... as you explain this interesting stuff.

techienomadiso

Very interesting. I'd love to hear more about limitations associated with specific data structures in the real-world

arjundureja

There are now array databases which put their data on disk or S3 (cheaper storage). They're useful for (sparse) multidimensional data. An example of such a database is TileDB.

teunohooijer

This is my fav YouTube channel!! Thank you

SohomBhattacharjee

When you said that disks are not byte addressable, were you referring to the restriction of the API / file system?
Internally, flash is byte addressable (and technically so is a spinning disk). The caveat is that the access time is substantially higher for NVM versus volatile memory.
Also, memory indirection via pointers can still be very expensive (I wouldn't put so much faith in "optimizations" by the hardware vendors). For example, if the pointer exists on a different CPU socket, or if you are skipping between L3 and DRAM, or even a swapped page (to disk) and DRAM. There's some tradeoff in performance between embedded fields (within the table) and table size, which also comes into play with some of these specialized NVM key-value pair accelerators (because of the access latency).

hjups

I remember reading something a while back about one of Intel's attempts at non-volatile RAM (I think this was Optane, before they dumped the tech and used the branding for something else).
The write-up was about the thermodynamic limitations of non-volatile storage at RAM speeds, probably given some voltage requirement (I don't remember, but it seems right).
It basically stated that non-volatile memory at RAM speeds (probably given the other specs of the product announcement) would need to be like an order of magnitude beyond the theoretical performance limit required to store data.
I didn't do the work to verify it, but it seemed convincing enough for someone who doesn't eat and breathe electrical engineering.
Since then, I've been pretty cynical of anyone claiming they are going to do it. Everyone seems to agree it's the holy grail among a certain subset of smart people, they all want to do it, and the announcements seem to come and go, never delivering a product.
Might be an interesting topic for a video too, I'd watch to get an update.

CraneArmy

I love your mannerisms and presentation.

CjqNslXUcM

Awesome talk!! Would love to hear you talk more about consistent hashing and the origins of it in various databases :)

johnrush
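
Since several comments ask about consistent hashing (covered at 22:30), here is a minimal ring sketch with virtual nodes; the node names and vnode count are arbitrary.

# A consistent-hash ring: only about 1/N of keys move when a node joins
# or leaves, unlike hash(key) % N. md5 keeps placements stable across runs.
import bisect
import hashlib

def _point(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=100):
        pairs = sorted((_point(f"{n}#{i}"), n)
                       for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        # first point clockwise from the key's position, wrapping around
        i = bisect.bisect(self._points, _point(key)) % len(self._points)
        return self._nodes[i]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))   # deterministic across runs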

Yes!! This is really a topic I need a refresher on. Thank you!

botenjohn

If it's a talk, it could go as a podcast as well.

prakanshuray

Thought about disk-based direct access, which avoids an index scan by getting/constructing the file path. But file locking when writing while it is being read by many is the tricky part.

susmitvengurlekar
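
One hedged way to sketch the idea above: derive the file path from a hash of the key, so lookups never scan an index, and use advisory locks so writers exclude each other and readers. The BASE directory and helper names are hypothetical, and fcntl is Unix-only.

# Key -> file path direct access with advisory locking.
import fcntl
import hashlib
import os

BASE = "data"   # hypothetical storage directory

def path_for(key: str) -> str:
    digest = hashlib.sha1(key.encode()).hexdigest()
    return os.path.join(BASE, digest[:2], digest)   # computed, never scanned

def write_value(key: str, value: bytes) -> None:
    p = path_for(key)
    os.makedirs(os.path.dirname(p), exist_ok=True)
    with open(p, "wb") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # exclusive lock for the writer
        f.write(value)                  # released when the file closes

def read_value(key: str) -> bytes:
    with open(path_for(key), "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)   # shared lock: many readers at once
        return f.read()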

Mister, you surprise me every episode :D, thanks for everything, may God protect you (allah ya7afedk)

thesyd

Loved it! Waiting for consistent hashing

longtranbao

This is why the lessons on hash tables need to be seriously rethought. The only part of a hash table that actually needs to be an array is the index table itself, but each entry in the table can be a bucket, or, as needed, point to a new server if you're using it for that purpose. Most databases, at least as far as I am aware, actually use trees, and things are already sorted by each relational piece of data. This type of tree is generally called a graph. The implementation of hash tables is actually pretty open, and I've implemented several versions where the data is stored in a tree and the index table just helps to jump to a subtree faster. For trees I'll generally use a deque to allocate decent-sized blocks of nodes, and memory management is pretty easy that way. I arrange the individual deques in two different trees themselves. On the subject of implementing them as arrays, since in that form they don't need to be sorted, if you store the unmasked hash key, as I do, you can swap the element to delete with the last element and patch one index, then decrement the count. TLDR: hash tables don't have to be simple arrays anymore with merely one set of characteristics.

anon_y_mousse
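
A minimal sketch of the swap-with-last deletion trick described above, using separate chaining with all entries in one flat array; the class and method names are my own, not the commenter's.

# Buckets hold positions into a flat entries array. Each entry keeps its
# full (unmasked) hash, so after swapping the deleted entry with the last
# one we can find and patch the single bucket reference to the moved entry.
class FlatHashTable:
    def __init__(self, nbuckets=16):
        self.buckets = [[] for _ in range(nbuckets)]
        self.entries = []                 # [full_hash, key, value]

    def _bucket(self, full_hash):
        return self.buckets[full_hash % len(self.buckets)]

    def put(self, key, value):
        h = hash(key)
        bucket = self._bucket(h)
        for pos in bucket:
            if self.entries[pos][0] == h and self.entries[pos][1] == key:
                self.entries[pos][2] = value
                return
        bucket.append(len(self.entries))
        self.entries.append([h, key, value])

    def get(self, key):
        h = hash(key)
        for pos in self._bucket(h):
            if self.entries[pos][0] == h and self.entries[pos][1] == key:
                return self.entries[pos][2]
        raise KeyError(key)

    def delete(self, key):
        h = hash(key)
        bucket = self._bucket(h)
        for i, pos in enumerate(bucket):
            if self.entries[pos][0] == h and self.entries[pos][1] == key:
                bucket.pop(i)
                last = len(self.entries) - 1
                if pos != last:
                    self.entries[pos] = self.entries[last]   # swap with last
                    moved = self._bucket(self.entries[pos][0])
                    moved[moved.index(last)] = pos           # patch one index
                self.entries.pop()                           # decrement count
                return
        raise KeyError(key)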

Okay, this was kinda interesting. I don't really know anything about database programming, so the insights there were cool. Personally, I think the point about adding/deleting entries was explained pretty poorly. I don't know what strategies are normally used for handling collisions in database programming, but when you have a hash table you probably have a strategy to handle collisions. So let's say you create a hash table of size 100 and insert 100 elements; there will probably already be collisions. At some point there are so many collisions that access time becomes slow and you want to increase the size of the hash table. And when you do this, you don't want to increase the size from 100 to 101 but probably double it to 200, if not more. At least this is the common approach for dynamic arrays, and you already said that hash tables are arrays. Of course, it is true that you have to rehash everything with the new size of the hash table, which sadly makes it slower than a normal dynamic array resize and copy.

timtreichel
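
The claim above that inserting 100 elements into a size-100 table already produces collisions is easy to check; a quick simulation sketch, with random slots standing in for a uniform hash.

# Hashing n keys into n buckets leaves roughly 1/e of the buckets empty,
# so collisions are near-certain. Doubling the table then means rehashing
# every key, because each key's slot is hash(key) % capacity.
import random

def occupied(n_keys: int, n_buckets: int) -> int:
    return len({random.randrange(n_buckets) for _ in range(n_keys)})

hits = occupied(100, 100)
print(f"{hits} buckets used; {100 - hits} inserts landed on an occupied slot")
# typically ~63 buckets used, i.e. ~37 colliding inserts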