How to handle Data skewness in Apache Spark using Key Salting Technique


This video covers handling data skew with the key salting technique. One of the biggest problems in parallel computing systems is data skew. In Spark, skew typically shows up when joining on a key that is not evenly distributed across the cluster: a few partitions become very large, a handful of tasks end up doing most of the work, and Spark can no longer process the data in parallel effectively.
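To make the idea concrete, here is a minimal sketch in plain Python (not the Spark API and not the code from the video). It assumes a hypothetical hot key `"x"` and a hypothetical salt count of 4, and uses the number of rows per join key as a rough stand-in for the work one task would receive, since all rows with the same key land in the same partition.

```python
import random
from collections import Counter

NUM_SALTS = 4              # hypothetical number of salt values
rng = random.Random(42)    # seeded only so the sketch is repeatable

# A skewed key column: "x" dominates, "y" and "z" are rare.
keys = ["x"] * 1000 + ["y"] * 10 + ["z"] * 10

# Rows per key before salting: every "x" row would go to one task.
before = Counter(keys)

# Salting: append a random suffix so the hot key splits into several keys.
salted_keys = [f"{k}_{rng.randrange(NUM_SALTS)}" for k in keys]
after = Counter(salted_keys)

largest_before = max(before.values())
largest_after = max(after.values())
print("largest group before salting:", largest_before)
print("largest group after salting: ", largest_after)
```

The hot key is split into roughly `NUM_SALTS` smaller groups, which is what lets several tasks share the work.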


Comments

Hi Sir, great explanation. Thank you for your effort.
I have a doubt:
After the join, the salted key should be unsalted and then the group-by applied, right?

gurumoorthysivakolunthu
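One common way to read the question above is as a two-stage aggregation: aggregate on the salted key first, then strip the salt and aggregate again on the original key. The sketch below is plain Python rather than Spark code, assumes a sum as the aggregate, and uses a hypothetical salt count of 4. Whether the video does it this way is not stated in this thread.

```python
import random
from collections import defaultdict

NUM_SALTS = 4            # hypothetical number of salt values
rng = random.Random(0)   # seeded only so the sketch is repeatable

# (key, value) rows with a hot key "x".
rows = [("x", 1)] * 1000 + [("y", 2)] * 10

# Stage 1: partial sums per (key, salt). The hot key is spread
# over up to NUM_SALTS groups, so no single group holds all its rows.
partial = defaultdict(int)
for key, value in rows:
    salt = rng.randrange(NUM_SALTS)
    partial[(key, salt)] += value

# Stage 2: drop the salt ("unsalt") and combine the partial sums
# on the original key. Only a few rows per key remain at this point.
totals = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    totals[key] += subtotal

print(dict(totals))
```

This works for aggregates that can be combined from partial results (sum, count, min, max); an average would need sum and count carried through both stages.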

Well, I must say, thanks a lot. I have been searching for this kind of explanation.

gautamyadav-cxzx

This is a really great and crystal-clear explanation. Thanks a lot for sharing and spreading knowledge!

someshchandra

Excellent video. Thanks for the explanation and for sharing the code.

ashwinc

beautifully explained, thank you very much :)

arunsundar

Thanks, but how do we handle it if the key consists of multiple columns?

SpiritOfIndiaaa
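For a multi-column key, one possible approach is to leave the key columns untouched and add the salt as one extra join column on both sides. The sketch below is plain Python, not Spark code; the column names (`country`, `city`, `region`) and the salt count are hypothetical and chosen only for illustration.

```python
import random

NUM_SALTS = 3            # hypothetical number of salt values
rng = random.Random(3)   # seeded only so the sketch is repeatable

# Large side: (country, city, amount); ("IN", "Pune") is the hot key.
big = [("IN", "Pune", amount) for amount in range(5)] + [("DE", "Bonn", 99)]
# Small side: (country, city, region).
small = [("IN", "Pune", "west"), ("DE", "Bonn", "north")]

# Large side: add one random salt column next to the key columns.
salted_big = [(c, city, rng.randrange(NUM_SALTS), amt) for c, city, amt in big]

# Small side: replicate each row once per salt value.
exploded_small = [
    (c, city, s, region) for c, city, region in small for s in range(NUM_SALTS)
]

# Join on (country, city, salt); the salt is dropped from the output.
lookup = {(c, city, s): region for c, city, s, region in exploded_small}
joined = [
    (c, city, amt, lookup[(c, city, s)])
    for c, city, s, amt in salted_big
    if (c, city, s) in lookup
]
print(joined)
```

Keeping the salt in its own column avoids having to concatenate and later split the key columns.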

Great explanation, thanks for sharing this.
I think there is an off-by-one error.
You are using (0 to 3), which produces (0, 1, 2, 3),
but the random number range is (0, 1, 2).

MahmoudHanafy
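The mismatch described above can be reproduced in a few lines. In Scala, `0 to 3` is inclusive at both ends, while `Random.nextInt(3)` returns values in 0, 1, 2. The sketch below mimics both in plain Python, assuming the video's code uses exactly those two ranges as this comment reports.

```python
import random

rng = random.Random(1)   # seeded only so the sketch is repeatable

# Mimics Scala's Random.nextInt(3): values 0, 1, 2.
random_salts = {rng.randrange(3) for _ in range(10_000)}

# Mimics Scala's inclusive range `0 to 3`: values 0, 1, 2, 3.
exploded_salts = set(range(0, 3 + 1))

# Salt values that are replicated on the small side but never
# generated on the large side, so they can never match in the join.
unused = exploded_salts - random_salts
print("unused salt values:", unused)
```

If the code is as described, the join result should still be correct; the cost is one extra copy of every small-table row that never matches. Making the two ranges agree (for example `0 until 3` on the exploded side) would remove the waste.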

Good work. It would be better to show the output of the DataFrames after salting and to explain the UDF in more detail.

savage_su

Amazing video.... How can we use the salting technique in PySpark for data skew?

vijeandran

Amazing video. However, I don't know Scala. Could you please give an example of how to implement the salting technique with Spark SQL queries? That would be a great help.

rishigc

But the join output will not be correct: in the previous scenario each row would have joined with all the matching ids, whereas with the new salting method it will join only with the newly salted key. That's weird.

akashhudge
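The concern above can be checked on a small example. The sketch below, in plain Python rather than Spark, assumes the usual form of the technique: the large side gets one random salt per row, and the small side is replicated once for every possible salt value. Under that assumption each large-side row still finds exactly the small-side rows it would have matched without salting.

```python
import random

NUM_SALTS = 3            # hypothetical number of salt values
rng = random.Random(7)   # seeded only so the sketch is repeatable

big = [("x", i) for i in range(6)] + [("y", 100)]   # (id, value)
small = [("x", "X-info"), ("y", "Y-info")]          # (id, info)

# Reference result: an ordinary join on the unsalted id.
plain = sorted(
    (k, v, info) for k, v in big for k2, info in small if k == k2
)

# Large side: one random salt per row; keep the original id for output.
salted_big = [(f"{k}_{rng.randrange(NUM_SALTS)}", k, v) for k, v in big]

# Small side: one copy of each row for every possible salt value.
exploded_small = [
    (f"{k}_{s}", info) for k, info in small for s in range(NUM_SALTS)
]

# Join on the salted id, then output the original id.
salted = sorted(
    (k, v, info)
    for sk, k, v in salted_big
    for sk2, info in exploded_small
    if sk == sk2
)

print(plain == salted)
```

The results match because the replication on the small side guarantees that whichever salt a large-side row drew, a matching copy exists.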

Hey great video, could you also link the associated resources you referred to while making this video?

shwetanandwani

I have two questions.
First: I think your visual presentation of table 2 after salting is wrong. Why don't you have z_2 and z_3 there? Also, why do you sometimes use capital letters? That's confusing.
Second: I don't get the benefit of key salting in general. How is this different from broadcasting your second table? You explode it, so you end up sending the whole table to every executor anyway. No one has been able to answer this question.

thomashass

Can you please explain how to choose the number of random salt values?

aravindkumar

Hi, are you missing something in the code? I used your code, but it throws an exception on the lines below:

// join after eliminating data skewness
df3.join(
  df4,
  df3.col("id") <=> df4.col("id")
)
.show(100, false)

}

NishaKumari-opek
