repartition vs coalesce | Lec-12

In this video I have talked about repartition vs coalesce in Spark. If you want to optimize your process in Spark, you should have a solid understanding of this concept.
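A minimal PySpark sketch of the difference (the names and numbers here are illustrative, not from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(0, 1_000_000)  # toy dataframe

# repartition(n) performs a full shuffle and produces n roughly equal partitions
evenly = df.repartition(8)

# coalesce(n) merges existing partitions without a shuffle; it can only
# reduce the partition count, so the resulting partitions may be uneven
merged = df.coalesce(2)

print(evenly.rdd.getNumPartitions())  # 8
print(merged.rdd.getNumPartitions())  # 2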

For more queries, reach out to me on my social media handles below.

Comments

You make the subject interesting. Best channel in Data engineering. Thank you for the videos. Looking forward to more!

dataman

After purchasing a 25k course, I recently came to your YouTube channel and realised that your free course is worth more than it.

vishaljare

I got so many of my doubts cleared from your videos, they are put together so well, they are very easy to understand.

RiyaBiswas-rp

Very good teaching style with clarity, thanks!

ankitachauhan

I like your teaching style. The way you are explaining is excellent. Keep going, bro!

vishaljare

Hi Manish, I have tried the same; the partitions are actually not removed. We should use

from pyspark.sql.functions import spark_partition_id

partitioned_on_column.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count().orderBy("partition_id").show(300)

to check the partitions.

rahuljain

Brother, I am following your playlists and everything is very good, but making these two separate playlists is quite confusing; you should have merged them.

divyanshusingh

Hello Sir,
What is the difference between repartition and BucketBy?
Thank You!

sakshijain
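On the bucketBy question above: repartition redistributes data in memory for the current job only, while bucketBy persists the bucket layout when the table is written, so later joins on the bucket column can avoid a shuffle. A short sketch, assuming a hypothetical dataframe df with a customer_id column:

df.repartition(8, "customer_id")  # in-memory layout, lives only for this job

(df.write
    .bucketBy(8, "customer_id")   # bucket layout is saved with the table
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("customers_bucketed"))  # bucketBy requires saveAsTable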

I think there should be a method to find the optimal number of partitions. If I have a large dataset, it is difficult to try out each partition count and measure the time for each one.

raghavsisters
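There is no exact method, but a common rule of thumb is to target roughly 128 MB per partition. A sketch, assuming the total dataset size is known (the 10 GB figure is illustrative):

total_size_bytes = 10 * 1024**3          # assume a 10 GB dataset
target_partition_bytes = 128 * 1024**2   # ~128 MB, Spark's default input split size

num_partitions = max(1, total_size_bytes // target_partition_bytes)
print(num_partitions)                    # 80
# df = df.repartition(int(num_partitions))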

Hi Manish, as you mentioned, with repartition the data will be evenly distributed. So if the best-selling product is distributed among multiple partitions, then how will a join work, since for a join the same key should be on the same partition? Could you please explain this?

vishenraja
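On the join question above: a join on a key triggers a shuffle that co-locates matching keys, and you can also hash-partition both sides by the join key up front. A sketch with hypothetical dataframes orders_df and products_df:

orders = orders_df.repartition(200, "product_id")     # hash-partition by key
products = products_df.repartition(200, "product_id")
joined = orders.join(products, on="product_id")       # same keys land in the same partition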

Well explained!! One question: as we discussed in the earlier sessions, RDDs are immutable. So when we do a repartition or coalesce, does the old RDD with the imbalanced data still exist on the executor nodes along with the new repartitioned data? If yes, then at what point does it get cleared, as it will keep filling up the disk on the executor nodes? Should we clear it manually in the code?

TaherAhmed

Hi Manish,

Thanks for all your videos. I personally got to know so many things from them. I have a doubt here:
for any given instance, how do we decide the number of partitions for both repartition and coalesce? In repartition(10), for example, how do we decide on the number 10?

alokkumarmohanty
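A common starting point for picking that number: 2-4 tasks per CPU core available to the job, then adjust after checking task times in the Spark UI. A sketch, assuming an existing SparkSession spark and dataframe df:

cores = spark.sparkContext.defaultParallelism  # total cores Spark sees
print(df.rdd.getNumPartitions())               # current partition count
df = df.repartition(cores * 3)                 # e.g. 3 tasks per core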

Can you please explain repartition and coalesce with a real-time dataframe joining example, so we can see the real-time optimization of the joining process?

NirajAgrawal-ev

Doubt 1: repartition(1) vs coalesce(1), is there any difference? Which one should we use when writing a single file?
Doubt 2: I was reading multiple CSV files (6) into a dataframe, then I wrote with coalesce(1) and again overwrote with coalesce(10). It is giving 6 partitions. Why did the partition count increase with coalesce()?

pde.joggiri
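On doubt 2 above: coalesce can only merge partitions, never split them, so asking for more partitions than the dataframe has is a no-op. A sketch that shows this, plus the doubt-1 trade-off (paths and numbers are illustrative):

df6 = spark.range(0, 600).repartition(6)           # stands in for the 6 CSV files
print(df6.coalesce(10).rdd.getNumPartitions())     # still 6: coalesce cannot increase
print(df6.repartition(10).rdd.getNumPartitions())  # 10, via a full shuffle

# For a single output file: coalesce(1) avoids a shuffle but funnels all upstream
# work into one task; repartition(1) shuffles first, keeping earlier stages parallel.
df6.coalesce(1).write.mode("overwrite").csv("/tmp/single_file_out")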

One doubt, brother: usually we avoid repartitioning, right? Unless we have a very large file, otherwise it will create the "small file problem".
Is my understanding correct?

user-ik

Great video Manish, very informative. Recently I was asked: if we have 200 partitions, would we prefer repartition(1) or coalesce(1)? Any insights, please?

surabhisasmal

Well explained! Thanks... keep it up!
I have been asked about reduceByKey in some interviews; please explain that too in some session. I am not clear whether we can use it with a dataframe or whether an RDD is required to apply it. Please comment.

saumyasingh
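On reduceByKey: it is an RDD-only API; the dataframe equivalent is groupBy plus an aggregate. A sketch, assuming a SparkSession named spark:

from pyspark.sql import functions as F

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 2)]

df_pairs = spark.createDataFrame(pairs, ["key", "value"])
df_pairs.groupBy("key").agg(F.sum("value").alias("value")).show()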

@manish please explain the bucketing concept in Spark

rp-zfci

Hi Sir,
In the withColumn line we are adding partition_id as a column, but how are we putting the value in that column, as no literal is being introduced?
Also, can you please explain spark_partition_id()? Why are we using it?

poojajoshi
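On the withColumn question above: spark_partition_id() is a built-in column expression, not a literal; Spark evaluates it per row at runtime and fills in the id of the partition that row currently lives in, which is why no lit() is needed. A sketch with a hypothetical dataframe df:

from pyspark.sql.functions import spark_partition_id

df.withColumn("partition_id", spark_partition_id()).show(5)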

Hi Manish sir,
If we are processing a 1 TB file on a 10-node cluster (64 GB RAM each), will it get processed or will it throw an OOM error?

Could you please explain this?

RahulPatil-iusp
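Rough arithmetic for the 1 TB question (the 16-cores-per-node figure is an assumption): Spark streams data partition by partition, so the whole file never needs to fit in RAM at once, and a plain read-transform-write would normally not OOM by itself.

file_size = 1 * 1024**4            # 1 TB
partition_size = 128 * 1024**2     # default ~128 MB input splits
tasks = file_size // partition_size
print(tasks)                       # 8192 tasks

cores = 10 * 16                    # assume 16 cores per node
print(tasks / cores)               # ~51 waves of tasks across the cluster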