Python Hash Sets Explained & Demonstrated - Computerphile

Hash Sets in Python work a little bit like the index of a book, giving you a shortcut to looking for a value in a list. Dr Mike Pound explains how they work and demos with some code.
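
As a rough illustration of the book-index analogy, the sketch below computes which bucket a value lands in and then searches only that bucket. The bucket count of 8 and the names `bucket_index` and `contains` are assumptions made for this example, not code from the video.

```python
NUM_BUCKETS = 8  # assumed table size, chosen only for illustration

def bucket_index(item, num_buckets=NUM_BUCKETS):
    """Map an item to a bucket number, much as an index maps a word to a page."""
    # In CPython, small non-negative ints hash to themselves.
    return hash(item) % num_buckets

# Build the "index": one list per bucket.
buckets = [[] for _ in range(NUM_BUCKETS)]
for value in [3, 11, 42]:
    buckets[bucket_index(value)].append(value)

def contains(item):
    # Only the single bucket the hash points at is searched, not every value.
    return item in buckets[bucket_index(item)]

print(contains(42), contains(7))
```

Here 3 and 11 share a bucket (both are 3 modulo 8), which is a small example of a collision.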

#Python #HashSet #Code #Computerphile

This video was filmed and edited by Sean Riley.

Comments

"O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D

Davourflave

I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things:
1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically.
2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
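
Point 1 can be sketched roughly as follows: each bucket is kept sorted, so a lookup inside a bucket uses binary search instead of a linear scan. The class name `SortedBucketHashSet` and the bucket count are hypothetical, and the sketch assumes all stored items can be compared with `<`.

```python
import bisect

class SortedBucketHashSet:
    """Hash set whose collision lists are kept sorted for binary search."""

    def __init__(self, num_buckets=16):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, item):
        return self._buckets[hash(item) % len(self._buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        pos = bisect.bisect_left(bucket, item)  # binary search for the slot
        if pos == len(bucket) or bucket[pos] != item:
            bucket.insert(pos, item)  # insert in order, skipping duplicates

    def __contains__(self, item):
        bucket = self._bucket(item)
        pos = bisect.bisect_left(bucket, item)
        return pos < len(bucket) and bucket[pos] == item

s = SortedBucketHashSet(num_buckets=4)
for n in [12, 0, 8, 4, -1, 0]:
    s.add(n)
print(s._buckets)
```

Because CPython hashes small non-negative integers to themselves, the keys 0, 4, 8 and 12 all fall into bucket 0 of a four-bucket table, which is the kind of clustering that makes a fast in-bucket search worthwhile.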

MichaelKingsfordGray

Side note: if your list of values is static and known in advance, the GNU “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
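
To make the idea of a perfect hash concrete, here is a brute-force sketch in Python. It is not how gperf works internally; it simply tries seeds until a salted SHA-256 hash places every key of a fixed set in its own slot. The helper names `slot`, `find_perfect_seed` and `is_keyword`, and the example keys, are assumptions for illustration.

```python
import hashlib

def slot(key, seed, size):
    """Deterministic salted hash of a string key, reduced to a table slot."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % size

def find_perfect_seed(keys, max_seed=100_000):
    """Try seeds until every key lands in a distinct slot of a minimal table."""
    size = len(keys)
    for seed in range(max_seed):
        if len({slot(k, seed, size) for k in keys}) == size:
            return seed
    raise ValueError("no collision-free seed found in range")

keys = ["if", "else", "while", "for", "return"]  # static, known in advance
seed = find_perfect_seed(keys)

# Table with exactly one slot per key and no collision handling at all.
table = [None] * len(keys)
for k in keys:
    table[slot(k, seed, len(keys))] = k

def is_keyword(word):
    return table[slot(word, seed, len(table))] == word

print(seed, table)
```

A lookup is then one hash and one comparison, with no buckets or probing.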

trevinbeattie

It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things.

Of course, that 80% full table gives you a 64% chance of colliding on both of your first two tries, a 51.2% chance that the third probe fails as well, about a 41% chance that the fourth does too, and so on (treating each probe as an independent 80% chance). The average length of the chains you wind up with goes up sharply as you push up utilization.
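
The figures above follow from a simple model, sketched here under the assumption that every probe independently hits an occupied slot with probability equal to the load factor. The function names are made up for this example.

```python
def all_probes_collide(load_factor, probes):
    """Chance that the first `probes` attempts all hit an occupied slot."""
    return load_factor ** probes

def expected_probes(load_factor):
    """Mean number of probes until a free slot is found (geometric model)."""
    return 1 / (1 - load_factor)

for k in range(1, 5):
    print(f"{k} probe(s) all collide: {all_probes_collide(0.8, k):.4f}")

for load in (0.5, 0.8, 0.9, 0.95):
    print(f"load {load:.2f}: about {expected_probes(load):.1f} probes expected")
```

Under this model the expected probe count is 1 / (1 - load), which is why the cost climbs so quickly as the table approaches full.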

KipIngram

Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging

JaapVersteegh

I implemented this in Ada back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.

gustafsundberg

Implemented exactly the simple form of this in a commercial compiler around 1980 to store the symbol table (the list of all identifiers defined in the program being compiled, with their type, size, etc.). It was chosen for lookup speed, as the symbol table is accessed frequently during compilation.

Richardincancale

Bloom filters could be a good follow-up to this.

prkhrsrvstv

I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁

bensmith

Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.
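
One way to check this claim on a given machine is to time both lookups directly. The sketch below is only a measuring harness: the collection size of 8 and the repeat count are arbitrary assumptions, and the outcome depends on the language, the element type, the hash function and the hardware, so no result is claimed here.

```python
import timeit

small = list(range(8))   # a deliberately tiny collection
small_set = set(small)
target = 7               # last element: worst case for the linear scan

# Time membership tests on the list (linear scan) and the set (hash lookup).
list_time = timeit.timeit(lambda: target in small, number=100_000)
set_time = timeit.timeit(lambda: target in small_set, number=100_000)

print(f"list scan: {list_time:.4f}s   set lookup: {set_time:.4f}s")
```

Changing the size of `small` or the type of its elements shows where, if anywhere, the crossover lies on a particular setup.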

jonny__b

I watched a talk on Python dictionaries in which the person who worked on the new implementation went into detail about how they are more closely related to databases than to hash maps. The change was made to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.

jfftck

I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol

Loki-

12:02 in the `__contains__` function there shouldn't be an `else:` before the `return False` at the end; otherwise, if the list `self._data[idx]` is not `None` and the item is not in that list, the return value won't be a boolean.
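
The following is a hypothetical reconstruction of the issue, not the code from the video: only `__contains__`, `self._data[idx]` and the `None` check come from the comment, while the class name, the `add` method and the table size are assumed. The version with `else:` is kept under a separate name so both can be compared.

```python
class HashSetSketch:
    def __init__(self, size=8):
        self._data = [None] * size  # each slot is None or a list of items

    def add(self, item):
        idx = hash(item) % len(self._data)
        if self._data[idx] is None:
            self._data[idx] = []
        if item not in self._data[idx]:
            self._data[idx].append(item)

    def contains_with_else(self, item):
        """Version with the stray `else:`; falls through and returns None."""
        idx = hash(item) % len(self._data)
        if self._data[idx] is not None:
            for stored in self._data[idx]:
                if stored == item:
                    return True
        else:
            return False  # only reached when the slot is empty

    def __contains__(self, item):
        """Fixed version: `return False` covers every not-found case."""
        idx = hash(item) % len(self._data)
        if self._data[idx] is not None:
            for stored in self._data[idx]:
                if stored == item:
                    return True
        return False

s = HashSetSketch()
s.add(1)
# 9 shares slot 1 with the stored value 1 but is not in the set.
print(s.contains_with_else(9))
print(9 in s)
```

Python's `in` operator converts whatever `__contains__` returns to a boolean, so the problem is mostly visible when the method is called directly or its result is compared with `False`.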

ibrahimmahrir

Thanks for your videos, Mike, keep it rolling 🎉

exchange

The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.

barneylaurance

The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!

cinderwolf

Just what I wanted now. Thanks a lot :)

princevijay

Hi, could you do a video on characteristics of a good hash function used in hashtables and their evaluation as a followup video?

olivergurka

A follow-up on the amortized complexity would be nice. Because it's a bit disingenuous to call the hashmap insertion O(1) if the underlying table doesn't grow. ^^
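
A small sketch of what growth might look like, assuming a chained table that doubles once it would exceed a 0.75 load factor. The class name `GrowingHashSet`, both thresholds and the `moves` counter are choices made for this example. A single insert that triggers a resize costs time proportional to the number of stored items, but the total copying stays proportional to the number of inserts.

```python
class GrowingHashSet:
    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._count = 0
        self._max_load = max_load
        self.moves = 0  # items re-inserted during resizes, for measurement

    def _resize(self):
        """Double the table and re-hash every stored item into it."""
        new_capacity = 2 * len(self._buckets)
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self._buckets:
            for item in bucket:
                new_buckets[hash(item) % new_capacity].append(item)
                self.moves += 1
        self._buckets = new_buckets

    def add(self, item):
        if item in self:
            return
        # Grow before the insert would push the load factor past the limit.
        if (self._count + 1) / len(self._buckets) > self._max_load:
            self._resize()
        self._buckets[hash(item) % len(self._buckets)].append(item)
        self._count += 1

    def __contains__(self, item):
        return item in self._buckets[hash(item) % len(self._buckets)]

    def __len__(self):
        return self._count

s = GrowingHashSet()
n = 10_000
for i in range(n):
    s.add(i)
print(len(s), len(s._buckets), s.moves / n)
```

Because each resize doubles the table, the copies form a geometric series, so the average number of moves per insert stays below 2 however large `n` gets. Without the resize, the buckets of a fixed table would simply keep getting longer.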

Ceelvain

13:30 I get that it's easier to generate numbers, but I think pedagogically it makes much more sense to use strings.

import random

with open('/usr/share/dict/words') as f:
    words = [w.rstrip() for w in f]
print(random.choices(words, k=10))

cacheman