System Design Interview - Top K Problem (Heavy Hitters)

preview_player
Показать описание

Topics mentioned in the video:
- Stream and batch processing data pipelines.
- Count-min sketch data structure.
- MapReduce paradigm.
- Various applications of the top k problem solution (Google/Twitter/YouTube trends, popular products, volatile stocks, DDoS attack prevention).

Inspired by the following interview questions:
Рекомендации по теме
Комментарии
Автор

Jesus christ this guy's material is amazing... and each video is so compact. He basically never wastes a single word....

DickWu
Автор

A summary of questions and answers asked in the comments below.

1. Can we use use hash maps but flush it's content (after converting to heap) each few seconds to the storage instead of using CMS?

For small scale it is totally fine to use hash maps. When scale grows, hash maps may become too big (use a lot of memory). To prevent this we can partition data, so that only subset of all the data comes to a Fast Processor service host. But it complicates the architecture. The beauty of CMS is that it consumes a limited (defined) memory and there is no need to partition the data. The drawback of CMS, it calculates numbers approximately. Tradeoffs, tradeoffs...


2. How do we store count-min sketch and heap into database? Like how to design the table schema?

Heap is just a one-dimensional array. Count-min sketch is a two-dimensional array. Meaning that both can be easily serialized into a byte array. Using either language native serialization API or well-regarded serialization frameworks (Protobufs, Thrift, Avro). And we can store them is such form in the database.


3. Count-min sketch is to save memory, but we still have n log k time to get top k, right?

Correct. It is n log k (for Heap) + k log k (for sorting the final list). N is typically much larger then k. So, n log k is the dominant.


4. If count-min sketch is only used for 1 min count, why wouldn't we directly use a hash table to count? After all the size of data set won't grow infinitely.

For small to medium scale, hash tables solution may work just fine. But keep in mind that if we try to create a service that needs to find top K lists for many different scenarios, there may be many such hash tables and it will not scale well. For example, top K list for most liked/disliked videos, most watched (based on time) videos, most commented, with the highest number of exceptions during video opening, etc. Similar statistics may be calculated on channels level, per country/region and so on. Long story short, there may be many different top K lists we may need to calculate with our service.


5. How to merge two top k lists of one hour to obtain top k for two hours?

We need to sum up values for the same identifiers. In other words we sum up views for the same videos from both lists. And take the top K of the merged list (either by sorting or using a Heap). [This won't necessarily be a 100% accurate result though]


6. How does count min sketch work when there are different scenarios like you mentioned.... most liked/disliked videos. Do we need to build multiple sketch? Do we need to have designated hash for each of these categories? Either ways, they need more memory just like hash table.

Correct. We need its own sketch to count different event types: video views, likes, dislikes, submission of a comment, etc.


7. Regarding the slow path, I am confused by the data partitioner. Can we remove the first Distribute Messaging system and the data partitioner? The API gateway will send messages directly to the 2nd Distribute Messaging system based on its partitions. For example, the API gateway will send all B message to partition 1, and all A messages to partition 2 and all C messages to partition 3. Why we need the first Distribute Messaging system and data partitioner? If we use Kalfa as Distribute Messaging system, we can just create a topic for a set of message types.

In case of a large scale (e.g. YouTube scale), API Gateway cluster will be processing a lot of requests. I assume these are thousands or even tens of thousands of CPU heavy machines. With the main goal of serving video content and doing as little of "other" things as possible. On such machines we usually want to avoid any heavy aggregations or logic. And the simplest thing we can do is to batch together each video view request. I mean not to do any aggregation at all. Create a single message that contains something like: {A = 1, B = 1, C = 1} and send it for further processing. In the option you mentioned we still need to aggregate on the API Gateway side. We cannot afford sending a single message to the second DMS per each video view request, due to a high scale. I mean we cannot have three messages like: {A = 1}, {B = 1}, {C = 1}. As mentioned in the video, we want to decrease request rate at every next stage.


8. I have a question regarding the fast path through, it seems like you store the aggregated count min sketch in the storage system, but is that enough to calculate the top k? I felt like we would need to have a list of the websites and maintain a size k heap somewhere to figure out the top k.

You are correct. We always keep two data structures: a count-min sketch and a heap in Fast Processor. We use count-min sketch to count, while heap stores the top-k list. In Storage service we also may keep both or heap only. But heap is always present.


9. So in summary, we still need to store the keys...count-min sketch helps achieve savings by not having to maintain counts for keys individually...when one has to find the top k elements, one has to iterate thru every single key and use count-min sketch to find the top k elements...is this understanding accurate?

We need to store the keys, but only K of them (or a bit more). Not all.

When every key comes, we do the following:
- Add it to the count-min sketch.
- Get key count from the count-min sketch.
- Check if the current key is in the heap. If it presents in the heap, we update its count value there. If it not present in the heap, we check if heap is already full. If not full, we add this key to the heap. If heap is full, we check the minimal heap element and compare its value with the current key count value. At this point we may remove the minimal element and add the current key (if current key count > minimal element value).

This way we only keep a predefined number of keys. This guarantees that we never exceed the memory, as both count-min sketch and the heap has a limited size.

saurabhmaurya
Автор

I wish all sys interview tutorials are like yours, with so much information precisely and carefully explained in a clear manner, with diff trade offs and topics to discuss interviewers along the way! Thank you so much

supremepancakes
Автор

This is the best explanation on system design I've ever seen. Thanks Mikhail, that helps A LOT!

FrequencyModulator
Автор

You're amazing, by far the most detailed and deeply analysed solution I've seen on any design channel. Please never stop making videos.

NikhilSharad
Автор

This s one of the best system design video I came across in long time .. keep up the good work !

souravmojumder
Автор

One of the best system design channel ive come across! great job! I particularly liked how you were able to describe a fundamental pattern that can be applied in multiple scenarios

arvind
Автор

I love Mikhail's content, the video is so interactive that it looks like he is talking to you and he knows what is going inside your head :)

pulkitbMv
Автор

Thank you very much!! I had gone over all your videos multiple times to understand it well. I had 2 interviews with FAANG in the last week and was offered a job in both! I have to say a lot of the credit goes to you!

VageeshUdupa
Автор

These are by far the best videos on system design for interviews. Thanks a lot for taking the time to make and publish these!

anderswb
Автор

All videos in this channel are the best on YT in this category even to this date. You can find many other channels which may give similar data divided into more than 5 videos with a lot of fluff. Mikhael's video touches upon every important part without beating around the bush and also gives great pointers in identifying what the interviewer may be looking for. Kudos to all the videos in this channel !

MrAithal
Автор

Very clear solution and something that can actually be used in an interview! Please keep making more of these.

tusharahuja
Автор

THIS GUY is SO COOL. Who else feel that when he's speaking, explaining difficult concepts in the most concise way possible - and also touching on what we really need to hear about?!

itdepends
Автор

Your accent is hard to understand initially, but now I fall in love with you accent.

alexbordon
Автор

Hands down the best system design videos so far !! and I have watched lots of the system design videos. Love how you start from simple and work all the way to complex structure and how it can applies to different situations.

joyshu
Автор

How can someone even downvote this? This is just so amazing. Have not learnt so much in 30 minutes in my whole life.

HarkiratSaluja
Автор

Excellent video. A key thing that you did at the end (and is very useful IMHO) is that you identified many other interview questions that are really the same problem in disguise. That is very good thinking that we all probably need to learn and develop. I encourage you to do that in your other design solutions as well. Thank you for another excellent video.

atabhatti
Автор

Thanks Mikhail. I can bet..this is the best channel on Youtube. Just binge watch all the videos from this channel and you will learn so much.

gameonline
Автор

Great work. I am a senior engineer at a big tech company and I'm still learning a lot from your videos.

graysonchao
Автор

The best system design answer I have seen on Youtube. Thank you!

HieuNguyen-tyvw