Handling Skewed Data | Tips on running Spark in Production | Course on Apache Spark Core | Lesson 25

Full Course is available here:

Comments

😍😍 Splitting the RDD into a skewed RDD and a non-skewed RDD and then performing the joins separately is the ultimate trick, Amit... 😃😃 Simple and brilliant idea...

gurumoorthysivakolunthu
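
As a rough illustration of the split-and-join idea described in the comment above, here is a minimal PySpark sketch. The table names (facts, customers), the join key customer_id, and the 1,000,000-row threshold are all hypothetical; a real job would choose the skewed keys from its own key distribution.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split-skewed-join").getOrCreate()

facts = spark.table("facts")      # large table, assumed skewed on "customer_id"
dims = spark.table("customers")   # smaller dimension table, same key

# 1. Find the heavily skewed keys (the threshold here is arbitrary).
skewed_keys = [r["customer_id"] for r in
               facts.groupBy("customer_id").count()
                    .filter(F.col("count") > 1_000_000)
                    .collect()]

# 2. Split the large table into a skewed part and a non-skewed part.
facts_skewed = facts.filter(F.col("customer_id").isin(skewed_keys))
facts_rest = facts.filter(~F.col("customer_id").isin(skewed_keys))

# 3. Join the skewed rows against a broadcasted slice of the dimension,
#    join the rest normally, and union the two results.
dims_skewed = dims.filter(F.col("customer_id").isin(skewed_keys))
joined_skewed = facts_skewed.join(F.broadcast(dims_skewed), "customer_id")
joined_rest = facts_rest.join(dims, "customer_id")
result = joined_skewed.unionByName(joined_rest)
```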

Hi Amit...
This is great... You have made the topic simple and easy to understand...
I have a few questions:
1. If the maximum size of each partition is 128 MB, then how is data skewness even possible?
2. You mentioned that repartition() should be done based on the skew column -- how can we parameterize repartition() with a specific column? It is only possible to pass a number, right?
Similarly, in the salting technique, how can repartition() be applied based on the salted column?
Thank you, Amit...

gurumoorthysivakolunthu
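
A note on the two questions above. On question 1: the 128 MB figure usually refers to the input split size when reading files; skew shows up after a shuffle (join, groupBy), when every row for a hot key hashes to the same shuffle partition, which can grow far beyond 128 MB. On question 2: the RDD repartition() does take only a number, but the DataFrame repartition() also accepts column expressions, including a salted column. A rough sketch with made-up data and column names (country, salt):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-by-column").getOrCreate()

# Toy DataFrame; imagine "country" is the skewed column.
df = spark.range(1_000_000).withColumn("country", F.lit("IN"))

# RDD API: repartition() only accepts a partition count.
rdd_repartitioned = df.rdd.repartition(200)

# DataFrame API: repartition() also accepts columns, so rows are
# hash-partitioned by that column.
by_column = df.repartition(200, "country")

# Salting: add a random salt and repartition on (country, salt) so a single
# hot country value is spread over up to 10 partitions.
salted = df.withColumn("salt", (F.rand() * 10).cast("int"))
by_salted_column = salted.repartition(200, "country", "salt")
```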

Very informative. Explained in great detail.

ShivamSinghcs

Amazing video... How can we use the salting technique in PySpark for data skew?

vijeandran
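
Since the question above is asked in PySpark terms, here is a rough sketch of salting a skewed join. It assumes the skew sits in a large events table joined to a smaller users table on user_id (all names made up), and that SALT_BUCKETS is tuned to the degree of skew; note that the smaller side gets replicated once per salt value.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 10  # tune to the degree of skew

large = spark.table("events")   # big table, assumed skewed on "user_id"
small = spark.table("users")    # smaller table with one row per "user_id"

# Add a random salt to every row of the skewed side.
large_salted = large.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every salted key has a match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small.crossJoin(salts)

# Join on the original key plus the salt, then drop the helper column.
joined = large_salted.join(small_salted, ["user_id", "salt"]).drop("salt")
```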

Can we use salting to join 2 skewed datasets?

rishigc

Thanks for the video. I have a question: how would you handle data skewness in a Spark DataFrame? Actually, the larger question is how you would find out that a DataFrame is skewed/uneven, and how you would resolve it.

soumyakantarath
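
Two quick checks that are often used for the detection part of the question above (the table and column names here are made up): look at the row count per join/group key, and at the row count per partition. In the Spark UI the same skew shows up as a few tasks in a stage with much larger shuffle read sizes and durations than the rest.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("detect-skew").getOrCreate()
df = spark.table("events")  # hypothetical table, suspected skew on "user_id"

# Key skew: a handful of keys holding most of the rows.
(df.groupBy("user_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))

# Partition skew: a few very large partitions next to many tiny ones.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))
```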

Great video... could you please share the URL where you talk about handling skew in Spark SQL?

rishigc
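
The URL itself has to come from the author, but for reference, Spark 3.x can also mitigate join skew in Spark SQL automatically through Adaptive Query Execution. A hedged sketch of the relevant settings follows; the thresholds shown are illustrative and the defaults vary by Spark version.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-skew-join")
         # Adaptive Query Execution (Spark 3.x) can split oversized shuffle
         # partitions during sort-merge joins at runtime.
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         # A partition counts as skewed if it is this many times larger than
         # the median partition size...
         .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
         # ...and also larger than this absolute threshold.
         .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
         .getOrCreate())
```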

Great video. I have a question. Consider a scenario where I want to compute an average based on the keys in my dataset, but certain keys are highly skewed. If we apply the salting technique, will it work?

kiranmudradi
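
Salting does work for averages, with one caveat: the partial results have to be sums and counts rather than averages, because an average of averages over unevenly sized salted groups would be wrong. A minimal sketch with made-up names (events, user_id, amount):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-avg").getOrCreate()
df = spark.table("events")  # hypothetical: average "amount" per "user_id"

SALT_BUCKETS = 10

# Stage 1: partial sums and counts per (key, salt), so one hot key is
# spread across up to SALT_BUCKETS tasks.
partial = (df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
             .groupBy("user_id", "salt")
             .agg(F.sum("amount").alias("partial_sum"),
                  F.count("amount").alias("partial_cnt")))

# Stage 2: combine the partials per key, re-deriving the average from the
# sums and counts rather than averaging the partial averages.
result = (partial.groupBy("user_id")
                 .agg((F.sum("partial_sum") / F.sum("partial_cnt"))
                      .alias("avg_amount")))
```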

Can you make a video on the practical implementation?

ravikirantuduru

I understand that repartition will help, but it might leave some partitions with much less data, so is there any way to get rid of the small files that get written out at runtime?

dharmendersingh
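
That is a fair trade-off to watch. Two common ways to compact the output at write time, sketched below with made-up paths and a hypothetical event_date partition column: coalesce() merges partitions without a full shuffle just before the write, and repartition() on the output-partition column gives each output directory a small number of well-sized files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-output").getOrCreate()
result = spark.table("joined_result")  # hypothetical DataFrame about to be written

# Option 1: coalesce to a handful of partitions right before the write;
# coalesce only merges partitions, so it avoids a full shuffle.
(result.coalesce(16)
       .write.mode("overwrite")
       .parquet("/tmp/output_coalesced"))

# Option 2: for a partitioned write, repartition by the partition column first
# so each output directory receives a few reasonably sized files.
(result.repartition("event_date")
       .write.mode("overwrite")
       .partitionBy("event_date")
       .parquet("/tmp/output_partitioned"))
```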