4: Facebook Messenger + WhatsApp | Systems Design Interview Questions With Ex-Google SWE

If anyone wants to join the ladies group chat just apply via LinkedIn, you know where I am (it's equally as competitive as new grad FAANG jobs)
Comments

Thank you so much for building this community to discuss advanced, in-practice system design. I discovered that this is not just another channel for sharing interview tips, but a space for in-depth discussions on real-world scenarios. How amazing! Keep going! Thank you!

raychang

Great work on this series so far, man! I don’t have any interview plans, but I’m just watching to encourage your effort. And it feels fun watching these over other YouTube content. Way to go, bro.

TM-hvrs

Hi Jordan, I wanted to clarify a point regarding HBase's storage model. HBase is actually a column-family-oriented database rather than a true columnar storage system like those used in analytical databases. This distinction is important because it affects data access patterns and use cases. Both HBase and Cassandra organize data into column families, where operations are optimized around row keys rather than individual columns across rows, but Cassandra often provides higher write throughput. Considering this, along with your use case and scalability needs, Cassandra may be the better choice for the message table.

chingyuanwang

I'm going to like every video I watch on this channel. It's the least I can do. Jordan, you're amazing. THANK YOU so much. Gonna watch every video in this series multiple times. Hope to become as great as you are someday.

hekapoo

Another great video, thanks.

Couple of clarification questions:
1. [23:14] If the chat members DB is sharded/indexed by userId, how will the CDC data be mapped to Kafka partitions, which are keyed by chatId? Will there be a conversion Flink node in between, or will it be similar to Twitter's follower-followee relation, where both relation DBs exist?
2. [Generic question] How is Flink's cache recreated in general in failover scenarios? Do we use snapshotting to something like S3?
3. For the chats DB, can we use a single-leader key-value DB like DDB instead, which gives high write/read throughput? What would the downside be, given that we are using Flink to write to it anyway and use a separate table for metadata?
4. How would chat across data centers or across the globe work? Would there be any specific constraints, like latency? Would ZooKeeper or the load balancer be responsible for connecting across data centers?
5. [25:12] Does Flink use 2PC to write to both the LB and the DB? If Flink goes down after sending messages to the LB, will the new Flink node send them to users again? Even if it does, the client can probably handle the duplicates. Just want to confirm.

krishback

Well, one thing I got after the interviews (and I had this exact question in a recent interview) is that it's not that hard to understand it and see all the bottlenecks while watching a video. But it's freaking hard to come up with a near-perfect solution in an interview within 35 minutes. I guess more practice can solve this, though 🙂
Great video, it's cool to see so many important aspects of a system design in such an informative way 🔥

almirdavletov

Hi, overall great. I want to point out that we missed a key part here. The way messages reach each user basically assumes that every user is connected through a websocket connection. That is mostly not the case.
In that case, we need a data flow in which the user comes online and asks for all his pending messages across all the chat groups he is part of.
We can do this by establishing the connection and keeping a status on each message in the chat database: mark it "pending" when Flink receives the message, and "delivered" once it has been delivered.
If the user is not online, we can put the pending messages in a cache like Redis, so that when the user comes back online, the websocket connection pulls all the pending messages and delivers them to his device.

Also, a user can be using multiple devices, but that can be out of scope for this design.

anshulkatare
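
A minimal sketch of the pending/delivered flow described in the comment above, assuming an in-memory store standing in for a cache such as Redis. The class name `PendingInbox` and its methods are hypothetical, not part of the video's design.

```python
from collections import defaultdict, deque


class PendingInbox:
    """In-memory stand-in for a per-user pending-message store (hypothetical)."""

    def __init__(self):
        self._pending = defaultdict(deque)  # user_id -> messages queued while offline
        self._online = set()                # user_ids with a live connection

    def connect(self, user_id):
        """Mark the user online and return everything queued while they were away."""
        self._online.add(user_id)
        return list(self._pending.pop(user_id, ()))

    def disconnect(self, user_id):
        """Mark the user offline; later messages will be queued."""
        self._online.discard(user_id)

    def deliver(self, user_id, message):
        """Return 'delivered' if the user is online, else queue it and return 'pending'."""
        if user_id in self._online:
            return "delivered"
        self._pending[user_id].append(message)
        return "pending"
```

Calling `connect` drains the queue once, so a reconnecting client receives each queued message a single time in arrival order. Multi-device delivery would need a queue per device rather than per user.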

Hey Jordan, I really appreciate the videos and have learned a lot from you. Do you use the words "partitioning" and "sharding" interchangeably? I thought partitioning means splitting a table within the same database instance, but was unsure when you explained the User database at 6:30. Thanks for the content, I have been watching it non-stop lately.

doug

Hey Jordan! Thanks for making this video. Quick question - it sounds like you're using sharding and partitioning interchangeably to mean the same thing (around 7:00 and 23:00)? Or do you just mean that you would both shard AND partition based on user_id?

jittaniaS

the content is so good, it takes me hours to research and understand lol, love it thank you!

quirkyquester

Hey Jordan, nicely done! One thing that I wanted to point out which could have been elaborated is the scenario where the user is offline and not connected to any chat server. They still get the message notifications on their app. Based on this feature, I think an online presence service may be justified, which Flink or whatever sender could consult to figure out which users are offline and then send them a notification instead of a real-time update. What do you think?

introvertidiot

Hey Jordan, thank you for the great explanation. I have a question: when Flink gets the users in a chat, where does it store this enormous amount of data? In RocksDB? Kafka may delete the info after the retention period, right? Let’s say a chat room was created a month ago, no one has been active in it, and no CDC events have happened, so no data about this chat room is available in Flink. How will Flink then get the users in this chat room? Am I missing anything here? From your design, I understand that Flink may be duplicating the full chat-id-to-users list somewhere on disk or in RocksDB.
But can we rely on streaming platforms for the actual data, given that our source of truth is already stored in the chat metadata DB?

FoodpediaIndia
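
One way to picture the state this question asks about is a consumer that folds membership change events into its own chat-to-members map and snapshots that map, so it does not depend on Kafka retaining old events. The sketch below is plain Python, not Flink code; the event shape and the class `ChatMembersState` are assumptions made for illustration.

```python
from collections import defaultdict


class ChatMembersState:
    """Keyed state (chat_id -> member set) built by replaying change events (hypothetical)."""

    def __init__(self):
        self.members = defaultdict(set)

    def apply(self, event):
        """Apply one event shaped like {'op': 'add'|'remove', 'chat_id': ..., 'user_id': ...}."""
        chat_id, user_id = event["chat_id"], event["user_id"]
        if event["op"] == "add":
            self.members[chat_id].add(user_id)
        elif event["op"] == "remove":
            self.members[chat_id].discard(user_id)

    def snapshot(self):
        """Copy the state so it can be persisted to durable storage."""
        return {chat: set(users) for chat, users in self.members.items()}

    @classmethod
    def restore(cls, snap):
        """Rebuild the state from a snapshot after a failure."""
        state = cls()
        for chat, users in snap.items():
            state.members[chat] = set(users)
        return state
```

Because each event carries its `chat_id`, the events can be re-keyed by chat even if the source table is sharded by user. A chat that has been idle for a month is still present in the restored snapshot.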

Thanks for the video! It looks like you're leveraging message queues to asynchronously write every new chat message to the database before rerouting it to the other chat members. Would data inconsistency potentially be a problem if the write to the database fails, but the rerouting does not? Is this worth making the writes synchronous, or would the retries + idempotency key be a sufficient bandaid for this potential inconsistency in an interview?

eternal
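
The retries-plus-idempotency-key idea from the comment above can be sketched as a receiver that remembers which keys it has already applied. This is an illustrative sketch; `DedupingReceiver` is a hypothetical name, and a real system would bound or persist the set of seen keys.

```python
class DedupingReceiver:
    """Receiver that applies each message at most once, keyed by idempotency key."""

    def __init__(self):
        self._seen = set()  # idempotency keys already applied
        self.applied = []   # message bodies in first-arrival order

    def receive(self, idempotency_key, body):
        """Apply the message unless its key was seen before; return True if applied."""
        if idempotency_key in self._seen:
            return False  # duplicate caused by a retry; safe to drop
        self._seen.add(idempotency_key)
        self.applied.append(body)
        return True
```

With this in place, a sender that retries after a failed or unacknowledged write cannot create a duplicate row or a duplicate message on the client.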

Hi Jordan, thanks for your videos on system design. Just wanted to check with you how would you handle end-to-end encryption of messages for an application in this design? Thanks!

vaibhavsavala

Love this series! Question: why? "We would care more about writes if all of our messages were first being sent to a database, and then after we sent them to a database we would do additional processing to make sure they get delivered to every user’s device." (around 9:42). Thanks!

firewater

Yoo Jordan! I came up with a pretty similar design; the one exception is message retrieval. I would also send messages like you did, through a stream sharded/partitioned by chat_id, but do the opposite for reading messages.

Isn't the sendrequest(user) for each user in your last Flink chat_id: [users] mapping going to be spammy for the load balancers (we are not ignoring possibly inactive people in the chat, unless we already plan for some push notifications, smth smth)? Could we build a solution around user_id, so that after a user establishes a connection with the app via a websocket (so we now know their user_id), we look for the newest changes for that user_id?

another great one Jordan! thanks for your work

Luzkan

Very nice explanation!!! Thanks for making it! I am wondering how capacity estimation would be used in any later design decision. My guess is that we can use it for decisions on the number of partitions, databases, replicas, etc. But I rarely see it used in examples, and I also don't know what numbers would be standard for a normal network / CPU / disk, so that I could divide the estimate by them.

Anonymous-ymst
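
As a rough illustration of the arithmetic this comment asks about, the sketch below turns an estimate into a partition count and a yearly storage figure. Every input is a made-up placeholder rather than a number from the video or a standard hardware limit; substitute your own estimates.

```python
import math

# All inputs are hypothetical placeholders, chosen only to show the arithmetic.
MESSAGES_PER_DAY = 1_000_000_000
AVG_MESSAGE_BYTES = 100
PEAK_FACTOR = 3                          # assumed ratio of peak to average load
PARTITION_WRITE_LIMIT_PER_SEC = 10_000   # assumed sustainable writes per partition

SECONDS_PER_DAY = 24 * 60 * 60

avg_writes_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_writes_per_sec = avg_writes_per_sec * PEAK_FACTOR

# Divide the peak estimate by what one partition can absorb, rounding up.
partitions_needed = math.ceil(peak_writes_per_sec / PARTITION_WRITE_LIMIT_PER_SEC)

# Raw storage per year, before replication or compression.
storage_per_year_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES * 365 / 1e12
```

With these placeholder inputs the result is 4 partitions and 36.5 TB per year. The per-partition limit is the number that has to come from benchmarking or vendor documentation.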

Thanks for the great video Jordan! For your design could you please share how a feature such as searching within a chat would work? Would that be part of the chat service? Or is this something that could be done in Flink?

ThePotatoProdigy

Thanks Jordan for all this awesome content. I have one quick question on CDC. The CDC events in Kafka would eventually be gone from Kafka, but in Flink we need a full snapshot/state of the DB in order to do anything with the messages. So how is this issue solved? Maybe I'm missing some details here. Thanks

vidip

Hey Jordan,
When you say sharding Kafka based on chat ID, you mean that within a single topic you have partitions with chat ID as the partition key. So ordering within a chat will be maintained, and all the messages from a single chat will go to a single partition and be read by one Flink consumer. Please elaborate in case I'm totally wrong.

rishabhsaxena
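
The key-to-partition mapping described in the comment above can be sketched with a stable hash of the chat ID. This is an illustration only: CRC32 is a stand-in for whatever hash the real partitioner uses, and `NUM_PARTITIONS` and `partition_for` are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count for the messages topic


def partition_for(chat_id, num_partitions=NUM_PARTITIONS):
    """Map a chat ID to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(chat_id.encode("utf-8")) % num_partitions


# Messages keyed by the same chat ID always land in the same partition,
# so one consumer sees that chat's messages in the order they were appended.
messages = [("chat-42", "hi"), ("chat-7", "yo"), ("chat-42", "how are you?")]
partitions = {}
for chat_id, body in messages:
    partitions.setdefault(partition_for(chat_id), []).append((chat_id, body))
```

Ordering is only guaranteed within a partition, so two different chats that share a partition are interleaved, but each chat's own messages stay in order.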