repartition vs coalesce | Lec-12

In this video I have talked about repartition vs coalesce in Spark. If you want to optimize your process in Spark, you should have a solid understanding of this concept.
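A minimal PySpark sketch of the difference (the names and numbers here are illustrative, not from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(0, 1_000_000)  # toy dataframe

# repartition(n) performs a full shuffle and produces n roughly equal partitions
evenly = df.repartition(8)

# coalesce(n) merges existing partitions without a shuffle; it can only
# reduce the partition count, so the resulting partitions may be uneven
merged = df.coalesce(2)

print(evenly.rdd.getNumPartitions())  # 8
print(merged.rdd.getNumPartitions())  # 2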

For more queries, reach out to me on my social media handles below.

Comments

You make the subject interesting. Best channel in Data engineering. Thank you for the videos. Looking forward to more!

dataman

After purchasing a 25k course, I recently came to your YouTube channel and realised that your free course is worth more than it.

vishaljare

I got so many of my doubts cleared from your videos, they are put together so well, they are very easy to understand.

RiyaBiswas-rp

Very good teaching style with clarity, thanks!

ankitachauhan

I like your teaching style. The way you are explaining is excellent. Keep going, bro!

vishaljare

Hi Manish, I have tried the same; the partitions are actually not removed. We should use

from pyspark.sql.functions import spark_partition_id

partitioned_on_column.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count().orderBy("partition_id").show(300)

to check the partitions.

rahuljain

Brother, I am following your playlists and everything is very good, but making these two separate playlists is quite confusing; you should have merged them.

divyanshusingh

Hello Sir,
What is the difference between repartition and BucketBy?
Thank You!

sakshijain
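On the bucketBy question above: repartition redistributes data in memory for the current job only, while bucketBy persists the bucket layout when the table is written, so later joins on the bucket column can avoid a shuffle. A short sketch, assuming a hypothetical dataframe df with a customer_id column:

df.repartition(8, "customer_id")  # in-memory layout, lives only for this job

(df.write
    .bucketBy(8, "customer_id")   # bucket layout is saved with the table
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("customers_bucketed"))  # bucketBy requires saveAsTable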

I think there should be a method to find the optimal number of partitions. If I have a large dataset, it is difficult to try out each partition count and measure the time for each one.

raghavsisters
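There is no exact method, but a common rule of thumb is to target roughly 128 MB per partition. A sketch, assuming the total dataset size is known (the 10 GB figure is illustrative):

total_size_bytes = 10 * 1024**3          # assume a 10 GB dataset
target_partition_bytes = 128 * 1024**2   # ~128 MB, Spark's default input split size

num_partitions = max(1, total_size_bytes // target_partition_bytes)
print(num_partitions)                    # 80
# df = df.repartition(int(num_partitions))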

Hi Manish, as you mentioned, with repartition the data will be evenly distributed. So if the best-selling product is distributed among multiple partitions, then how will a join work, since for a join the same key should be on the same partition? Could you please explain this?

vishenraja
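On the join question above: a join on a key triggers a shuffle that co-locates matching keys, and you can also hash-partition both sides by the join key up front. A sketch with hypothetical dataframes orders_df and products_df:

orders = orders_df.repartition(200, "product_id")     # hash-partition by key
products = products_df.repartition(200, "product_id")
joined = orders.join(products, on="product_id")       # same keys land in the same partition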

Well explained!! One question: as we discussed in the earlier sessions, RDDs are immutable. So when we do a repartition or coalesce, does the old RDD with the imbalanced data still exist on the executor nodes along with the new repartitioned data? If yes, then at what point does it get cleared, as it will keep filling up the disk on the executor nodes? Should we clear it manually in the code?

TaherAhmed

Hi Manish,

Thanks for all your videos. I personally got to know so many things from them. I have a doubt here:
for any given instance, how do we decide the number of partitions for both repartition and coalesce? In repartition(10), for example, how do we decide on the number 10?

alokkumarmohanty
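A common starting point for picking that number: 2-4 tasks per CPU core available to the job, then adjust after checking task times in the Spark UI. A sketch, assuming an existing SparkSession spark and dataframe df:

cores = spark.sparkContext.defaultParallelism  # total cores Spark sees
print(df.rdd.getNumPartitions())               # current partition count
df = df.repartition(cores * 3)                 # e.g. 3 tasks per core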

Can you please explain repartition and coalesce with a real-time dataframe joining example, so we can see the real-time optimization of the joining process?

NirajAgrawal-ev

Doubt 1: repartition(1) vs coalesce(1), is there any difference? Which one should we use when writing a single file?
Doubt 2: I was reading multiple CSV files (6) into a dataframe, then I wrote with coalesce(1) and again overwrote with coalesce(10). It is giving 6 partitions. Why did the partition count increase with coalesce()?

pde.joggiri
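On doubt 2 above: coalesce can only merge partitions, never split them, so asking for more partitions than the dataframe has is a no-op. A sketch that shows this, plus the doubt-1 trade-off (paths and numbers are illustrative):

df6 = spark.range(0, 600).repartition(6)           # stands in for the 6 CSV files
print(df6.coalesce(10).rdd.getNumPartitions())     # still 6: coalesce cannot increase
print(df6.repartition(10).rdd.getNumPartitions())  # 10, via a full shuffle

# For a single output file: coalesce(1) avoids a shuffle but funnels all upstream
# work into one task; repartition(1) shuffles first, keeping earlier stages parallel.
df6.coalesce(1).write.mode("overwrite").csv("/tmp/single_file_out")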

One doubt, brother: usually we avoid repartitioning, right? Unless we have a very large file, otherwise it will create the "small file problem".
Is my understanding correct?

user-ik

Great video Manish, very informative. Recently I was asked: if we have 200 partitions, would we prefer repartition(1) or coalesce(1)? Any insights, please?

surabhisasmal

Well explained! Thanks... keep it up!
I have been asked about reduceByKey in some interviews; please explain that too in some session. I am not clear whether we can use it with a dataframe or whether an RDD is required to apply it. Please comment.

saumyasingh
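On reduceByKey: it is an RDD-only API; the dataframe equivalent is groupBy plus an aggregate. A sketch, assuming a SparkSession named spark:

from pyspark.sql import functions as F

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 2)]

df_pairs = spark.createDataFrame(pairs, ["key", "value"])
df_pairs.groupBy("key").agg(F.sum("value").alias("value")).show()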

@manish please explain the bucketing concept in Spark

rp-zfci

Hi Sir,
In the withColumn line we are adding partition_id as a column, but how are we putting the value in that column, as no literal is being introduced?
Also, can you please explain spark_partition_id()? Why are we using it?

poojajoshi
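On the withColumn question above: spark_partition_id() is a built-in column expression, not a literal; Spark evaluates it per row at runtime and fills in the id of the partition that row currently lives in, which is why no lit() is needed. A sketch with a hypothetical dataframe df:

from pyspark.sql.functions import spark_partition_id

df.withColumn("partition_id", spark_partition_id()).show(5)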

Hi Manish sir,
If we are processing a 1 TB file on a 10-node cluster (64 GB RAM each), will it get processed or will it throw an OOM error?

Could you please explain this?

RahulPatil-iusp
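Rough arithmetic for the 1 TB question (the 16-cores-per-node figure is an assumption): Spark streams data partition by partition, so the whole file never needs to fit in RAM at once, and a plain read-transform-write would normally not OOM by itself.

file_size = 1 * 1024**4            # 1 TB
partition_size = 128 * 1024**2     # default ~128 MB input splits
tasks = file_size // partition_size
print(tasks)                       # 8192 tasks

cores = 10 * 16                    # assume 16 cores per node
print(tasks / cores)               # ~51 waves of tasks across the cluster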