Spark memory allocation and reading large files | Spark Interview Questions

Hi Friends,

In this video, I have explained Spark memory allocation and how a 1 TB file is processed by Spark.


Please subscribe to my channel for more interesting learnings.
Comments

I have seen several videos, You are the best. Appreciate your efforts.

San-hszx

Thank you, ma'am! The concept is clear; I think that is how Spark achieves efficient pipelining.

gowthamsagarkurapati

Good explanation. I have one doubt: how did you calculate the number of blocks for a 1 TB file?
In the video, did you say 84 lakh blocks? If so, how is that number calculated?

mohammedasif

Thanks for this video. I have one question: if I have 500 GB of data, what would be the ideal cluster configuration to process it?

SunilPandey-u

I like the explanation. So, instead of 84 lakh blocks, you meant to say 8192 blocks, right?
20 executor machines
5 cores in each executor node (FYI: cores come in pairs: 2, 4, 6, 8, and so on)
6 GB RAM in each executor node
128 MB default block size

The cluster can run 20 × 5 = 100 tasks in parallel. Here a task corresponds to a block, so 100 blocks can be processed in parallel at a time.
100 × 128 MB = 12,800 MB ÷ 1024 = 12.5 GB (so about 12.5 GB of data gets processed in the first batch).

Each executor has 6 GB of RAM (20 executors × 6 GB = 120 GB total RAM). At any moment an executor is working on 5 blocks × 128 MB = 640 MB ≈ 0.6 GB, leaving roughly 6 GB − 0.6 GB = 5.4 GB of RAM free for other users' jobs and programs.

So, 1 TB = 1024 GB ÷ 12.5 GB per batch ≈ 82 batches to process the whole file.

The values in this calculation are for understanding purposes; actual values may differ in real-world scenarios.
Please feel free to comment and correct me if I'm doing anything wrong, thanks!

chrajeshdagur
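The arithmetic in the comment above can be sketched in a few lines. The cluster figures (20 executors, 5 cores each, 128 MB block size) are the commenter's assumptions, not values confirmed in the video:

```python
import math

# Assumed cluster from the comment: 20 executors, 5 cores each,
# 6 GB RAM per executor, 128 MB HDFS block size.
file_size_mb = 1024 * 1024            # 1 TB expressed in MB
block_size_mb = 128
executors, cores_per_executor = 20, 5

num_blocks = file_size_mb // block_size_mb             # one task per block
parallel_tasks = executors * cores_per_executor        # tasks running at once
batch_size_gb = parallel_tasks * block_size_mb / 1024  # data per wave of tasks
num_batches = math.ceil(num_blocks / parallel_tasks)   # waves for the full file

print(num_blocks, parallel_tasks, batch_size_gb, num_batches)  # 8192 100 12.5 82
```

This reproduces the corrected figures: 8192 blocks, 100 concurrent tasks, 12.5 GB per wave, and about 82 waves for 1 TB.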

Hi, a great explanation, no doubt. Can you please tell me how many executors there will be per machine?

vaibhavverma

Wonderful 👌👌... you've gained one more subscriber 😊... I have a very simple question for you: what is "disk" in Spark? Is it the driver's disk or the HDFS disk? In the persist operation we have disk and memory options; I understand that memory means executor memory, but what is this disk 🙄? Could you please assist? Also, there is the concept of data spilling to disk, and I'm badly confused by that disk too 😭

gyan_chakra

But what if we have to perform a group-by or join operation? Then we would need all the data in RAM for processing, right?

jjayeshpawar

Can you help me understand how Spark decides
how many tasks will run in parallel, e.g. for a 1 TB file?
I am aware that the number of tasks depends on the number of CPU cores assigned to the executors, but how does the calculation flow?

nehachopade
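A rough model of the flow the commenter asks about: for a plain file scan, Spark creates roughly one task per input split (block-sized by default), and the number of tasks that can run at once equals the total core count across executors. The cluster numbers below are illustrative assumptions:

```python
import math

def scan_parallelism(file_size_bytes, split_size_bytes, executors, cores_per_executor):
    """Rough model: one task per input split, executors*cores concurrent slots."""
    tasks = math.ceil(file_size_bytes / split_size_bytes)  # input splits -> tasks
    slots = executors * cores_per_executor                 # concurrent task slots
    waves = math.ceil(tasks / slots)                       # scheduling waves needed
    return tasks, slots, waves

# 1 TB file, 128 MB splits, 20 executors with 5 cores each (assumed numbers)
tasks, slots, waves = scan_parallelism(1 << 40, 128 << 20, 20, 5)
print(tasks, slots, waves)  # 8192 100 82
```

After a shuffle, the partition count is instead governed by settings such as `spark.sql.shuffle.partitions`, so this model only covers the initial read.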

How do I find out why executor memory is growing gradually? Spark is installed on Kubernetes; driver memory is 4 GB and executor memory is 3 GB + 1 GB (overhead). How do I check which memory area is growing more, execution or storage, and why? Once usage reaches 99%, executors are killed and there are no logs to check. Could you please suggest some pointers?

piyushpokharna
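One starting point for this question is knowing how the unified memory pool is sized: the Spark UI's Executors tab shows storage memory use, and the pool itself is derived from the heap using documented defaults. A sketch for the 3 GB executor in the question (fractions are the documented defaults; the split is only an upper bound, since execution and storage borrow from each other):

```python
# Defaults taken from the Spark configuration docs; 3 GB heap as in the question.
RESERVED_MB = 300                 # memory Spark reserves before splitting pools
heap_mb = 3 * 1024                # executor heap (spark.executor.memory = 3g)
memory_fraction = 0.6             # spark.memory.fraction default
storage_fraction = 0.5            # spark.memory.storageFraction default

unified_mb = (heap_mb - RESERVED_MB) * memory_fraction  # execution + storage pool
storage_mb = unified_mb * storage_fraction              # evictable storage share
print(round(unified_mb), round(storage_mb))  # 1663 832
```

If executors are killed by Kubernetes at 99%, the growth is often in the off-heap/overhead region rather than the unified pool, so raising `spark.executor.memoryOverhead` is a common first experiment.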

Are off-heap memory and overhead memory the same?

suriyams

How does 1 TB equal 84 lakh blocks when each block is 128 MB?

Amarjeet-fblk
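A quick calculation suggests where the 84 lakh figure may have come from: 1 TB split into 8.4 million pieces gives roughly 128 KB per piece, so it looks like a KB/MB slip. With the standard 128 MB block size the count is 8192:

```python
one_tb_bytes = 1 << 40            # 1 TiB
blocks_84_lakh = 8_400_000        # 84 lakh, as heard in the video

approx_block_bytes = one_tb_bytes / blocks_84_lakh   # ~131 KB per block
blocks_at_128mb = one_tb_bytes // (128 << 20)        # 128 MB blocks

print(round(approx_block_bytes), blocks_at_128mb)  # 130894 8192
```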

When we use off-heap memory, is GC not used for it?

suriyams
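Correct: memory enabled via the off-heap settings is allocated outside the JVM heap, so the garbage collector does not scan it (data stored there is kept in serialized form). It is also distinct from the container-level overhead allowance. A minimal sketch of the relevant configuration; the property names are real Spark settings, but the sizes are illustrative only:

```python
# Real Spark property names; the sizes here are illustrative assumptions.
conf = {
    "spark.memory.offHeap.enabled": "true",  # let Spark use off-heap execution/storage memory
    "spark.memory.offHeap.size": "2g",       # allocated outside the JVM heap, not GC-scanned
    "spark.executor.memoryOverhead": "1g",   # separate: container overhead (native libs, JVM internals)
}
for key, value in conf.items():
    print(f"--conf {key}={value}")           # as you might pass to spark-submit
```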