Spark Shuffle Partition Optimization: 10x Faster!

Welcome to our comprehensive guide on understanding and optimising shuffle operations in Apache Spark! In this deep-dive video, we uncover the complexities of shuffle partitions and how shuffling works in Spark, providing you with the knowledge to enhance your big data processing tasks. Whether you're a beginner or an experienced Spark developer, this video is designed to elevate your skills and understanding of Spark's internal mechanisms.

🔹 What you'll learn:
1. Shuffling in Spark: Uncover the mechanics behind shuffling, why it's necessary, and how it impacts the performance of your data processing jobs.
2. Shuffle Partitions: Discover what shuffle partitions are and their role in distributing data across nodes in a Spark cluster.
3. When Does Shuffling Occur?: Learn about the specific scenarios and operations that trigger shuffling in Spark, particularly focusing on wide transformations.
4. Shuffle Partition Size Considerations: Explore real-world scenarios where the shuffle partition size is significantly larger or smaller than the data per shuffle partition, and understand the implications on performance and resource utilisation.
5. Tuning Shuffle Partitions: Dive into strategies and best practices for tuning the number of shuffle partitions based on the size and nature of your data, ensuring optimal performance and efficiency (a short PySpark sketch follows this list).
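
As a quick taste of point 5, here is a minimal PySpark sketch (illustrative only; the synthetic dataset and the ~200 MB-per-partition target are assumptions, not figures from the video) showing a wide transformation that triggers a shuffle and how the shuffle partition count is tuned:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-tuning-demo").getOrCreate()

# Synthetic data standing in for a real source.
df = spark.range(0, 100_000_000).withColumn("key", F.col("id") % 1000)

# groupBy is a wide transformation: rows sharing a key must be brought
# together, so Spark inserts a shuffle (an Exchange node in the plan).
agg = df.groupBy("key").count()

# Spark defaults to 200 shuffle partitions regardless of data volume.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200'

# If a job shuffles, say, ~30 GB, a ~200 MB-per-partition target gives:
spark.conf.set("spark.sql.shuffle.partitions", 30 * 1024 // 200)  # ~153

agg.explain()  # the Exchange now plans ~153 post-shuffle partitions
```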

📘 Chapters:
00:00 Introduction
00:14 What is shuffling & shuffle partitions?
05:07 Why are shuffle partitions important?
07:45 Scenario based question 1 (data per shuffle partition is large)
12:29 Scenario based question 2 (data per shuffle partition is small)
17:29 How to tune slow running jobs?
18:53 Thank you

#ApacheSpark, #DataEngineering, #ShufflePartitions, #BigData, #PerformanceTuning, #pyspark, #sql, #python
Comments

Could you please make a video on stack overflow errors, like what scenarios can cause them and how to fix them?

fitness_thakur

Commenting so that you continue making such informative videos. Great work!

tahiliani

Really good explanation, Afaque. Thank you for making such in-depth videos. 😊

akshayshinde

Thanks a bunch for the great content again!

ashokreddyavutala

Thank you so much. Keep up the good work. Looking forward to more such videos to learn Spark.

Momofrayyudoodle

Great share, sir, on the optimal shuffle! Please bring more scenario-based questions as well as production-based best practices!

_Sujoy_Das

Very nice explanation! Thank you for making this video.

purnimasharma

Wow! Thank you Afaque, this is incredible content and very helpful!

anandchandrashekhar

Have watched all your videos. Seriously gold content. Please don't stop making videos.

dileepkumar-ndfo

Great work! Thank you for explaining the concepts in detail ❤❤

sureshpatchigolla

Thanks for the content, really appreciate it. My understanding is that AQE takes care of shuffle partition optimization and we don't need to manually intervene (starting with Spark 3) to optimize shuffle partitions. Could you clarify this, please?

dasaratimadanagopalan-rfow
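
For readers with the same question: since Spark 3.0, AQE can merge small shuffle partitions at runtime, but the coalescing step only merges partitions downward from a configured starting count, so manual sizing still matters for large shuffles. A sketch of the relevant settings (values are illustrative, not recommendations from the video):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution (on by default since Spark 3.2).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Coalescing only merges down from this starting count (splitting
# skewed partitions is a separate AQE feature), so a generous start
# still helps big shuffles.
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")

# Target size AQE aims for when merging small partitions.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
```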

I don't think these kinds of videos on Spark are available anywhere else. Great work, Afaque!

rgv

Consider a scenario where my first shuffle's data size is 100 GB, so giving more shuffle partitions makes sense. But by the last shuffle the data size has drastically reduced to 10 GB. According to the calculations, how should shuffle partitions be set? Giving 1500 would benefit the first shuffle but not the last one. How does one approach this scenario?

nikhillingam
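
One way to approach the scenario in this comment (a sketch under my own assumptions, not the video's prescription) is to retune between materialized stages, since spark.sql.shuffle.partitions is read when each query is planned rather than once per session:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 100_000)

# Stage 1: the large shuffle (~100 GB in the question) gets many partitions.
spark.conf.set("spark.sql.shuffle.partitions", 1500)
df.groupBy("key").count().write.mode("overwrite").parquet("/tmp/stage1")  # hypothetical path

# Stage 2: the data has shrunk (~10 GB), so retune before the next
# query is planned.
spark.conf.set("spark.sql.shuffle.partitions", 150)
stage2 = (spark.read.parquet("/tmp/stage1")
               .groupBy(F.col("key") % 100)
               .sum("count"))
stage2.write.mode("overwrite").parquet("/tmp/stage2")
```

Within a single multi-stage query this kind of per-stage retuning is not possible, which is exactly where AQE's runtime coalescing (previous sketch) earns its keep.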

@Afaque thank you for making these videos. Very helpful. I have a question: how do we estimate the data size? We run our batches/jobs on Spark and each batch could be processing a varying amount of data. Some batches could be dealing with 300 GB and some with 300 MB. How do we calculate the optimal number of shuffle partitions?

abdulwahiddalvi
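
On estimating the size per run: one common trick (my suggestion, not something stated in the description, and it relies on Spark's private _jvm/_jsc handles) is to sum the on-disk size of the input files via the Hadoop FileSystem API and derive the partition count from that:

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def shuffle_partitions_for(path, target_mb=200):
    """Derive a shuffle partition count from the input's on-disk size."""
    jvm = spark._jvm  # private API: a sketch, not a stable contract
    hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
    size_bytes = fs.getContentSummary(hadoop_path).getLength()
    # On-disk bytes are usually compressed, so actual shuffle bytes can
    # be larger; treat the result as a starting point, not a guarantee.
    n = math.ceil(size_bytes / (target_mb * 1024 * 1024))
    # Never drop below the cluster's total parallelism.
    return max(n, spark.sparkContext.defaultParallelism)

# Hypothetical input path; a 300 GB batch lands near 1536 partitions,
# while a 300 MB batch falls back to the core count.
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for("/data/events"))
```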

@afaque a shuffle partition will consist of both shuffled data (keys that were not originally present in the executor and were shuffled to the partition) and non-shuffled data (keys that were already present in the executor and were not shuffled). So the size of a shuffle partition cannot be directly calculated from the shuffle write data alone, as it also depends on the distribution of the data across the partitions?

erqtpcs

Can you please cover hands-on bucketing in ADB (hands-on with file view)? In your last video it works in your IDE but not in Databricks (Delta bucketing is not allowed).

crepantherx

Can you tell us how to resolve the "Python worker exited unexpectedly (crashed)" error?

ramvel

Thanks for the explanation. But isn't this in some way dependent on the cardinality of the group by / join column?

vikastangudu
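
The cardinality point in this question is worth spelling out: rows are routed to shuffle partitions by a hash of the key, so the number of distinct keys caps how many partitions can ever receive data. A small sketch of the effect (synthetic data, my own example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 100M rows but only 10 distinct keys: however many shuffle partitions
# are configured, at most 10 of them receive any rows, so the extra
# partitions sit empty and the 10 busy ones dominate the runtime.
df = spark.range(0, 100_000_000).withColumn("key", F.col("id") % 10)
spark.conf.set("spark.sql.shuffle.partitions", 1000)
df.groupBy("key").count().explain()
```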

I have a question on this. Let's say the data volume that I am processing varies on a daily basis, i.e. some days it can be 50 GB and some days 10 GB. Keeping in mind the 200 MB per shuffle partition limit, the optimal number of shuffle partitions should change on each run in that case. But it's not practically possible to change the code every time to set a proper shuffle partition count. How should this scenario be handled? I read about a parameter, sql.files.maxPartitionBytes, which defaults to 128 MB. Should I change this to 200 and let the number of shuffle partitions be calculated automatically? In that case, will the value under sql.shuffle.partitions be ignored?

kaushikghosh-poew
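
On the question above: the two settings control different stages and neither overrides the other. spark.sql.files.maxPartitionBytes only sizes the partitions created when files are first scanned, while spark.sql.shuffle.partitions governs every exchange after that. A sketch of the distinction (my reading of the Spark docs, not the video's answer; the path and values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sizes *input* partitions at file-scan time only.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")

# Sets the partition count after every shuffle; it is still honoured
# even when maxPartitionBytes is changed.
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.read.parquet("/data/daily_batch")  # hypothetical path
# Scan tasks:         roughly input_size / 128 MB
# Post-groupBy tasks: 400, or fewer once AQE coalescing kicks in
df.groupBy("customer_id").count().explain()
```

For day-to-day swings between 50 GB and 10 GB, the practical options are computing the value per run (see the sizing sketch earlier) or enabling AQE coalescing so one generous setting adapts at runtime.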

Thank you for explaining in detail. You are the best guy around. Can you also please explain whether there is a way to dynamically update the shuffle partitions with a dynamic calculation of the data size and the number of cores in the cluster (in case the cluster is altered in the future)?
Thanks in advance.

tandaibhanukiran
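
A sketch of the dynamic calculation asked about here (the 200 MB target and the size estimate are placeholders): defaultParallelism reflects the cores actually granted to the application, so the formula self-adjusts if the cluster is resized between runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Total cores across the executors actually granted to this app.
cores = spark.sparkContext.defaultParallelism

estimated_gb = 50   # plug in a per-run estimate (see the sizing sketch above)
target_mb = 200     # assumed per-partition target

partitions = max((estimated_gb * 1024) // target_mb, cores)
# Common heuristic: round up to a multiple of the core count so the
# final wave of shuffle tasks keeps every core busy.
partitions = ((partitions + cores - 1) // cores) * cores

spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```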