Realtime Advertisement Clicks Aggregator | System Design

Let’s design a real-time advertisement click aggregator with Kafka, Flink, and Cassandra. We start with a simple design and gradually make it scalable while discussing the trade-offs.

🥹 If you found this helpful, follow me online here:

00:00 Why Track & Aggregate Clicks?
01:07 Simple System
02:12 Will it scale?
04:00 Logs, Kafka & Stream Processing
12:02 Database Bottlenecks
17:13 Replace MySQL
18:59 Data Model
25:45 Data Reconciliation
29:00 Offline Batch Process
32:10 Future Videos

#systemDesign #programming #softwareDevelopment
Comments

What I would've done differently: have both warm and cold storage. If your data access pattern is mostly reading data from the last 90 days (pick your number), then store that data in warm storage like Vitess (sharded MySQL or some other distributed relational DB), and run a background process that periodically vacuums stale data from the warm tier and exports it to a cold tier like a data lake.

This way you're optimizing both read-query latency and storage cost. Best of both worlds.
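The vacuum process described above can be sketched in a few lines. This is a minimal, illustrative version: rows are plain `(timestamp, payload)` tuples standing in for warm-tier SQL rows and cold-tier data-lake objects, and the retention window is the commenter's suggested 90 days. Exporting to cold storage before deleting from warm means a crash between the two steps duplicates data rather than losing it.

```python
from datetime import datetime, timedelta, timezone

WARM_RETENTION_DAYS = 90  # "pick your number", per the comment above

def vacuum_stale(warm_rows, cold_rows, now):
    """Move rows older than the retention cutoff from warm to cold tier.

    warm_rows / cold_rows are plain lists here; in practice the warm tier
    would be a sharded SQL store and the cold tier a data lake (assumed).
    Export first, then delete, so a mid-run crash duplicates instead of
    losing data.
    """
    cutoff = now - timedelta(days=WARM_RETENTION_DAYS)
    stale = [r for r in warm_rows if r[0] < cutoff]
    cold_rows.extend(stale)                                # export to cold
    warm_rows[:] = [r for r in warm_rows if r[0] >= cutoff]  # vacuum warm
    return len(stale)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
warm = [(now - timedelta(days=200), "old click aggregate"),
        (now - timedelta(days=10), "recent click aggregate")]
cold = []
moved = vacuum_stale(warm, cold, now)
print(moved, len(warm), len(cold))  # 1 1 1
```

A production version would run this on a schedule and make the export idempotent (e.g. keyed by time bucket) so retries are safe.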

indavarapuaneesh
Автор

We should use a count-min sketch for real-time click aggregation on the stream processor; it is going to be very fast, and you can query data at minute-level granularity. A MapReduce system can be useful for exact click information: clicks can be batched, put into HDFS, reduced into aggregates, and saved to the DB.
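For readers unfamiliar with the structure this comment proposes: a count-min sketch is a small fixed-size table of counters that answers "how many clicks did ad X get?" approximately, never undercounting but possibly overcounting on hash collisions. A minimal sketch (generic, not tied to the video's design; the hashing scheme here is just one simple choice):

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates may overcount, never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One independent-ish hash per row, derived by salting with the
        # row index. Real implementations use cheaper hashes (e.g. murmur).
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate counts, so the minimum cell is closest.
        return min(self.table[row][col] for row, col in self._buckets(key))

sketch = CountMinSketch()
for _ in range(1000):
    sketch.add("ad_42")
sketch.add("ad_7", 5)
print(sketch.estimate("ad_42"))  # >= 1000, exact unless all rows collide
```

In the streaming setup from the video you would keep one sketch per time window (e.g. per minute), which is why the comment pairs it with an exact offline MapReduce pass for reconciliation.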

rishabhjain

Your videos are really really great, no fluff, straight to the topic and covers a lot of details. Thank you and keep it up!

kevindebruyne

I think you deserve a much larger audience! The quality of the content is really good. Thanks for sharing.

freezefrancis

Thank you so much for this perfect explanation!!

sarthakgupta

Some notes about this design:
- "Adding more topics" is a very vague statement. We have to define the data model that captures each click event and then partition the data based on advertisement_id and some form of timestamp.
- Not sure why replication lag is stated as an issue here. The read patterns for this design don't require reading consistent data, so this should not be a problem.
- "Relational DBs won't do well with aggregation queries" is a little misleading. Doing aggregation queries efficiently requires storing the data in a column-major format, which unlocks efficient compression and data loading.
- Why provision stream-processing infra just to upload data to cold storage? Once a log file reaches X MB, we can place an event in Kafka with a (file_id, offset) pair. A consumer then reads this and uploads the data to S3. This avoids unnecessary dollar cost as well as the operational cost of maintaining stream infrastructure.

protyaybanerjee

@5:35 0.1KB * 3B = 3 TB — hi, how is this computed? I thought 3B has 9 zeros, so multiplying by 0.1 gives 8 zeros, i.e. 3e8 KB; since 1 TB is 1e9 KB, I'd expect 0.3 TB. Did I get something wrong?
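The commenter's arithmetic is worth checking explicitly (using the figures as quoted in the comment, and decimal units where 1 TB = 1e9 KB):

```python
# Back-of-the-envelope storage estimate: 3 billion click events per day
# at ~0.1 KB each (figures as quoted in the comment above).
clicks_per_day = 3_000_000_000       # "3B"
event_size_kb = 0.1
total_kb = clicks_per_day * event_size_kb   # 3e8 KB
total_tb = total_kb / 1e9                   # decimal units: 1 TB = 1e9 KB
print(total_tb)  # ~0.3 TB per day, not 3 TB
```

So under these numbers the daily volume is about 0.3 TB; the 3 TB figure would correspond to roughly a 1 KB event size or a ten-day window.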

ax

This was by far the best video... thanks for doing it.

pratikjain

Thanks for the effort making this! Very informative and a perfect companion to the system design volume 2 book.

nosh

Thanks for such a clear and detailed explanation.
Could you please share a couple of blogs/articles for reference where companies are using these kinds of systems?

PrateekSaini

Capturing the click with application logging is a good idea; the main crux is at 6:30 and 21:30.

sumonmal

Thanks for the tutorials! I think you're following the topics of the book System Design Interview Volume 2, but in a way that's a lot easier to understand. I struggled a lot with those topics in the book until I came across your tutorials!

weixing

Since I'm working in adtech and looking to modernize our approach, I was fortunate to come across your video, and it helped me a lot. My question: how about using ClickHouse instead of Cassandra? Will it work well, or lead to any issues?

karthikbidder

Can we use MapReduce for stream processing? Will it meet the latency requirement, or do we have to use other stream processors such as Flink/Spark?

tonyliu

That was an awesome video; I had a similar approach and got it validated. I was wondering if you could also start a code series on building such systems (as demonstrated in the video).

parthmahajan

An event data streaming platform is a more complex system, where data is processed either in real-time streams or in batches: ETL, data pipelines, etc.

mohsanabbas

Your videos are great! Very clearly articulated! I was curious why we have to use a NoSQL DB if we are storing only the aggregated data keyed by advertiser ID. What are the drawbacks of using a columnar DB like Snowflake in this case?

roopashastri

Thanks for clearly explaining the end-to-end design. Just a couple of questions:
1) Could you explain a bit about how the Apache log files get the click information, and how that is real-time?
2) Also, do you have a link to these notes/diagrams? The one in the description doesn't work.

chetanyaahuja

Correct me if I am wrong, but this seems more like a Lambda architecture: the streaming aggregation is fast but inaccurate, whereas the S3 batch path is slow but accurate.

utkarshgupta

We could also keep state in the Kafka Streams application (local or global state stores) and use Interactive Queries to fetch the aggregation result. Can you please share how to decide when to offload the aggregation result to an external DB vs. when to use Interactive Queries? I understand durability can be one factor, but what are the others?

VishalThakur-wovx