Spark Shuffle Partition Optimization: 10x Faster!

Welcome to our comprehensive guide on understanding and optimising shuffle operations in Apache Spark! In this deep-dive video, we uncover the complexities of shuffle partitions and how shuffling works in Spark, providing you with the knowledge to enhance your big data processing tasks. Whether you're a beginner or an experienced Spark developer, this video is designed to elevate your skills and understanding of Spark's internal mechanisms.

🔹 What you'll learn:
1. Shuffling in Spark: Uncover the mechanics behind shuffling, why it's necessary, and how it impacts the performance of your data processing jobs.
2. Shuffle Partitions: Discover what shuffle partitions are and their role in distributing data across nodes in a Spark cluster.
3. When Does Shuffling Occur?: Learn about the specific scenarios and operations that trigger shuffling in Spark, particularly focusing on wide transformations.
4. Shuffle Partition Size Considerations: Explore real-world scenarios where the shuffle partition size is significantly larger or smaller than the data per shuffle partition, and understand the implications on performance and resource utilisation.
5. Tuning Shuffle Partitions: Dive into strategies and best practices for tuning the number of shuffle partitions based on the size and nature of your data, ensuring optimal performance and efficiency (a short PySpark sketch follows this list).
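
As a quick taste of point 5, here is a minimal PySpark sketch (illustrative only; the synthetic dataset and the ~200 MB-per-partition target are assumptions, not figures from the video) showing a wide transformation that triggers a shuffle and how the shuffle partition count is tuned:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-tuning-demo").getOrCreate()

# Synthetic data standing in for a real source.
df = spark.range(0, 100_000_000).withColumn("key", F.col("id") % 1000)

# groupBy is a wide transformation: rows sharing a key must be brought
# together, so Spark inserts a shuffle (an Exchange node in the plan).
agg = df.groupBy("key").count()

# Spark defaults to 200 shuffle partitions regardless of data volume.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200'

# If a job shuffles, say, ~30 GB, a ~200 MB-per-partition target gives:
spark.conf.set("spark.sql.shuffle.partitions", 30 * 1024 // 200)  # ~153

agg.explain()  # the Exchange now plans ~153 post-shuffle partitions
```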

📘 Chapters:
00:00 Introduction
00:14 What is shuffling & shuffle partitions?
05:07 Why are shuffle partitions important?
07:45 Scenario based question 1 (data per shuffle partition is large)
12:29 Scenario based question 2 (data per shuffle partition is small)
17:29 How to tune slow running jobs?
18:53 Thank you

#ApacheSpark, #DataEngineering, #ShufflePartitions, #BigData, #PerformanceTuning, #pyspark, #sql, #python
Comments

Could you please make a video on stack overflow errors, like what scenarios can cause them and how to fix them?

fitness_thakur

Commenting so that you continue making such informative videos. Great work!

tahiliani

Really good explanation, Afaque. Thank you for making such in-depth videos. 😊

akshayshinde

Thanks a bunch for the great content again!

ashokreddyavutala

Thank you so much. Keep up the good work. Looking forward to more such videos to learn Spark.

Momofrayyudoodle

Great share, sir, on the optimal shuffle! Please bring more scenario-based questions as well as production-based best practices!

_Sujoy_Das

Very nice explanation! Thank you for making this video.

purnimasharma

Wow! Thank you Afaque, this is incredible content and very helpful!

anandchandrashekhar

Have watched all your videos. Seriously gold content. Please don't stop making videos.

dileepkumar-ndfo

Great work! Thank you for explaining the concepts in detail ❤❤

sureshpatchigolla

Thanks for the content, really appreciate it. My understanding is that AQE takes care of shuffle partition optimization and we don't need to manually intervene (starting with Spark 3) to optimize shuffle partitions. Could you clarify this, please?

dasaratimadanagopalan-rfow
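
For readers with the same question: since Spark 3.0, AQE can merge small shuffle partitions at runtime, but the coalescing step only merges partitions downward from a configured starting count, so manual sizing still matters for large shuffles. A sketch of the relevant settings (values are illustrative, not recommendations from the video):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution (on by default since Spark 3.2).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Coalescing only merges down from this starting count (splitting
# skewed partitions is a separate AQE feature), so a generous start
# still helps big shuffles.
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")

# Target size AQE aims for when merging small partitions.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
```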

I don't think these kinds of videos on Spark are available anywhere else. Great work, Afaque!

rgv

Consider a scenario where my first shuffle's data size is 100 GB, so giving more shuffle partitions makes sense. But by the last shuffle the data size has drastically reduced to 10 GB. According to the calculations, how should shuffle partitions be set? Giving 1500 would benefit the first shuffle but not the last one. How does one approach this scenario?

nikhillingam
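
One way to approach the scenario in this comment (a sketch under my own assumptions, not the video's prescription) is to retune between materialized stages, since spark.sql.shuffle.partitions is read when each query is planned rather than once per session:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 100_000)

# Stage 1: the large shuffle (~100 GB in the question) gets many partitions.
spark.conf.set("spark.sql.shuffle.partitions", 1500)
df.groupBy("key").count().write.mode("overwrite").parquet("/tmp/stage1")  # hypothetical path

# Stage 2: the data has shrunk (~10 GB), so retune before the next
# query is planned.
spark.conf.set("spark.sql.shuffle.partitions", 150)
stage2 = (spark.read.parquet("/tmp/stage1")
               .groupBy(F.col("key") % 100)
               .sum("count"))
stage2.write.mode("overwrite").parquet("/tmp/stage2")
```

Within a single multi-stage query this kind of per-stage retuning is not possible, which is exactly where AQE's runtime coalescing (previous sketch) earns its keep.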

@Afaque thank you for making these videos. Very helpful. I have a question: how do we estimate the data size? We run our batches/jobs on Spark and each batch could be processing a varying amount of data. Some batches could be dealing with 300 GB and some with 300 MB. How do we calculate the optimal number of shuffle partitions?

abdulwahiddalvi
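
On estimating the size per run: one common trick (my suggestion, not something stated in the description, and it relies on Spark's private _jvm/_jsc handles) is to sum the on-disk size of the input files via the Hadoop FileSystem API and derive the partition count from that:

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def shuffle_partitions_for(path, target_mb=200):
    """Derive a shuffle partition count from the input's on-disk size."""
    jvm = spark._jvm  # private API: a sketch, not a stable contract
    hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
    size_bytes = fs.getContentSummary(hadoop_path).getLength()
    # On-disk bytes are usually compressed, so actual shuffle bytes can
    # be larger; treat the result as a starting point, not a guarantee.
    n = math.ceil(size_bytes / (target_mb * 1024 * 1024))
    # Never drop below the cluster's total parallelism.
    return max(n, spark.sparkContext.defaultParallelism)

# Hypothetical input path; a 300 GB batch lands near 1536 partitions,
# while a 300 MB batch falls back to the core count.
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for("/data/events"))
```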

@afaque a shuffle partition will consist of both shuffled data (keys that were not originally present in the executor and were shuffled to the partition) and non-shuffled data (keys that were already present in the executor and were not shuffled). So the size of a shuffle partition cannot be directly calculated from the shuffle write data alone, as it also depends on the distribution of the data across the partitions?

erqtpcs

Can you please cover hands-on bucketing in ADB (hands-on with file view)? In your last video it works in your IDE but not in Databricks (Delta bucketing is not allowed).

crepantherx

Can you tell us how to resolve the "Python worker exited unexpectedly (crashed)" error?

ramvel

Thanks for the explanation. But isn't this in some way dependent on the cardinality of the group by / join column?

vikastangudu
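
The cardinality point in this question is worth spelling out: rows are routed to shuffle partitions by a hash of the key, so the number of distinct keys caps how many partitions can ever receive data. A small sketch of the effect (synthetic data, my own example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 100M rows but only 10 distinct keys: however many shuffle partitions
# are configured, at most 10 of them receive any rows, so the extra
# partitions sit empty and the 10 busy ones dominate the runtime.
df = spark.range(0, 100_000_000).withColumn("key", F.col("id") % 10)
spark.conf.set("spark.sql.shuffle.partitions", 1000)
df.groupBy("key").count().explain()
```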

I have a question on this. Let's say the data volume that I am processing varies on a daily basis, i.e. some days it can be 50 GB and some days 10 GB. Keeping in mind the 200 MB per shuffle partition limit, the optimal number of shuffle partitions should change on each run in that case. But it's not practically possible to change the code every time to set a proper shuffle partition count. How should this scenario be handled? I read about a parameter, sql.files.maxPartitionBytes, which defaults to 128 MB. Should I change this to 200 and let the number of shuffle partitions be calculated automatically? In that case, will the value under sql.shuffle.partitions be ignored?

kaushikghosh-poew
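
On the question above: the two settings control different stages and neither overrides the other. spark.sql.files.maxPartitionBytes only sizes the partitions created when files are first scanned, while spark.sql.shuffle.partitions governs every exchange after that. A sketch of the distinction (my reading of the Spark docs, not the video's answer; the path and values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sizes *input* partitions at file-scan time only.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")

# Sets the partition count after every shuffle; it is still honoured
# even when maxPartitionBytes is changed.
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.read.parquet("/data/daily_batch")  # hypothetical path
# Scan tasks:         roughly input_size / 128 MB
# Post-groupBy tasks: 400, or fewer once AQE coalescing kicks in
df.groupBy("customer_id").count().explain()
```

For day-to-day swings between 50 GB and 10 GB, the practical options are computing the value per run (see the sizing sketch earlier) or enabling AQE coalescing so one generous setting adapts at runtime.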

Thank you for explaining in detail. You are the best guy around. Can you also please explain whether there is a way to dynamically update the shuffle partitions with a dynamic calculation of the data size and the number of cores in the cluster (in case the cluster is altered in the future)?
Thanks in advance.

tandaibhanukiran
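
A sketch of the dynamic calculation asked about here (the 200 MB target and the size estimate are placeholders): defaultParallelism reflects the cores actually granted to the application, so the formula self-adjusts if the cluster is resized between runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Total cores across the executors actually granted to this app.
cores = spark.sparkContext.defaultParallelism

estimated_gb = 50   # plug in a per-run estimate (see the sizing sketch above)
target_mb = 200     # assumed per-partition target

partitions = max((estimated_gb * 1024) // target_mb, cores)
# Common heuristic: round up to a multiple of the core count so the
# final wave of shuffle tasks keeps every core busy.
partitions = ((partitions + cores - 1) // cores) * cores

spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```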