75. Databricks | Pyspark | Performance Optimization - Bucketing

Azure Databricks Learning: Performance Optimization - Bucketing
======================================================

What is Bucketing in Spark?

Bucketing is one of the performance optimization techniques in Spark. It splits the data into multiple buckets based on a hash key and stores the data in a pre-shuffled and pre-sorted format, which improves performance during wide transformations such as join, groupBy, etc.
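
A minimal PySpark sketch of the idea (the table names, column names, and bucket count of 10 below are illustrative, not taken from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two illustrative DataFrames that will later be joined on their key columns
orders = spark.range(0, 10_000_000).withColumnRenamed("id", "order_id")
customers = spark.range(0, 10_000_000).withColumnRenamed("id", "customer_id")

# Write both sides bucketed and sorted on the join key. bucketBy only works
# with saveAsTable, and both tables must use the same number of buckets for
# the join to skip the shuffle.
(orders.write.mode("overwrite")
       .bucketBy(10, "order_id").sortBy("order_id")
       .saveAsTable("orders_bucketed"))
(customers.write.mode("overwrite")
          .bucketBy(10, "customer_id").sortBy("customer_id")
          .saveAsTable("customers_bucketed"))

# Joining the bucketed tables on their bucketing keys lets the sort-merge join
# run without an Exchange (shuffle) step on either side.
o = spark.table("orders_bucketed")
c = spark.table("customers_bucketed")
o.join(c, o.order_id == c.customer_id).explain()

With non-bucketed inputs, the same plan would show an Exchange hashpartitioning step on both sides of the join; with matching buckets that step disappears.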

This is also a widely asked interview question.

#DatabricksBucketBy, #SparkBucketing, #Bucket, #PysparkBucketBy, #DatabricksRealtime, #SparkRealTime, #DatabricksInterviewQuestion, #DatabricksInterview, #SparkInterviewQuestion, #SparkInterview, #PysparkInterviewQuestion, #PysparkInterview, #BigdataInterviewQuestion, #BigDataInterview, #PysparkPerformanceTuning, #PysparkPerformanceOptimization, #PysparkPerformance, #PysparkOptimization, #PysparkTuning, #DatabricksTutorial, #AzureDatabricks, #Databricks, #Pyspark, #Spark, #AzureADF, #LearnPyspark, #LearnDataBricks, #notebook, #DatabricksForBeginners
Comments

Bro, your work is awesome, so please continue making Spark videos.

dhanushkumar

Awesome explanation Sir and wow super content as well

srinubathina

You are awesome. Thank you for the clear explanation.

somisettyrevanthrs

Thank you Raja sir for this informative lecture.

In the demo, 10 partitions were used. If we have 2 DataFrames with 400 partitions each, bucketed by 5, and we didn't change the default shuffle partition number (which is 200), it means that while shuffling, the 400 partitions will be confined to 200 shuffle partitions,
i.e. EACH shuffle partition will have data from 2 partitions, i.e. 10 buckets, and the resultant DataFrame will have 200 partitions instead of 500.

Please answer, and even correct the question if I have got the concept confused.
Thank you a lot.

sohelsayyad
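
A quick way to sanity-check the reasoning above, as a sketch (df1 and df2 stand for two hypothetical non-bucketed DataFrames, and spark is the session predefined in a Databricks notebook): the partition count after a shuffle join is set by spark.sql.shuffle.partitions, not by the inputs' partition counts, and AQE may coalesce it further.

# spark.sql.shuffle.partitions controls how many partitions a shuffle produces,
# regardless of how many partitions the inputs had (default "200").
print(spark.conf.get("spark.sql.shuffle.partitions"))

joined = df1.join(df2, "id")          # df1/df2: hypothetical non-bucketed inputs
print(joined.rdd.getNumPartitions())  # typically 200; AQE may coalesce this further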

Great explanation, Sir!!! Can you please clarify the doubts below?
1) If we are using Spark SQL, how can we see the physical plan? (For a DataFrame we can use df.explain(), but how can I check it in Spark SQL?)
2) In this example you have used 10 as the bucket count. How do we determine the bucket number while bucketing the table?

Thanks in advance 🙂

pranavsarc
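
On question 1 above, a small sketch (my_table is a hypothetical table name): Spark SQL exposes the plan through the EXPLAIN statement, the SQL counterpart of df.explain().

# Prepending EXPLAIN (or EXPLAIN FORMATTED / EXTENDED) to a query prints its plan
spark.sql("EXPLAIN FORMATTED SELECT * FROM my_table WHERE id = 1").show(truncate=False)

# Equivalently, spark.sql() returns a DataFrame, so .explain() works on it too
spark.sql("SELECT * FROM my_table WHERE id = 1").explain()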

Hi,

You have provided beautiful insights about Databricks.
I am using the Photon accelerator in my Databricks cluster, so I am not able to understand the stages part. Please make videos on the Photon accelerator and provide insights about jobs, stages, and tasks.

bachannigam

Hi Raja. Thank you for the concept. You mentioned that if any DataFrame is smaller than 10MB, by default Spark will use a Broadcast Join, right? I took two DataFrames which have 2 rows each (so the size is in bytes) and applied a join, but it showed a Sort Merge Join. Could you please tell me the reason?

vlogsofsiriii
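
On the question above, these are the usual knobs to check, as a sketch (small_df and other_df are hypothetical DataFrames; the exact reason for the Sort Merge Join depends on the plan):

from pyspark.sql.functions import broadcast

# The auto-broadcast limit: joins whose smaller side is *estimated* below this
# threshold are planned as broadcast hash joins (default 10MB). The size estimate
# for fresh, unanalyzed sources can be much larger than the real data, so tiny
# inputs can still fall back to a sort-merge join.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# An explicit hint forces a broadcast regardless of the size estimate
small_df.join(broadcast(other_df), "id").explain()  # should show BroadcastHashJoin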

Good explanation, Raja. 1) How does Spark know there is no need to shuffle and sort? How does Spark collocate the data from two datasets onto the same executor? 2) Suppose there are 50 partitions and we want 50 buckets, so a total of 2500 files will be created. Is there any way we can create one file per bucket?

venkatasai
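
On the "2500 files" part of the question above, a commonly used pattern (sketch; df, the "id" column, and the table name are illustrative) is to repartition on the bucketing column into as many partitions as buckets before writing, so each writing task holds exactly one bucket:

num_buckets = 50
# Hash-repartitioning on the bucket column into num_buckets partitions means each
# writing task receives the rows of exactly one bucket, so the table ends up with
# roughly one file per bucket instead of (input partitions x buckets) files.
(df.repartition(num_buckets, "id")
   .write.mode("overwrite")
   .bucketBy(num_buckets, "id").sortBy("id")
   .saveAsTable("events_bucketed"))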

Hello Raja, is bucketing deprecated for the "delta" format?

oiwelder

Hello Sir,
I had one doubt: if I'm processing 1TB of data and my cluster has storage of 500GB and 2TB, will it always load the entire 1TB of data, or how does that work? Can you please help me here? Also, it would be great if you could make a video covering the performance aspects of this topic.

sagarvarma

In the video at 16:48, I can see that there are 3 jobs, but in my case, when I joined df3 and df4, Databricks shows 5 jobs. Can you please explain why it is different? Also, is it possible to know what each job does? Thank you!

vinayakkulkarni

Please just give a video link if any reference video needs to be looked into. It really helps.

abhishek_grd

How is the size of a bucket decided in a bucketed table?
How is the partition size decided in a non-bucketed table?

at-cvky

Let's say we want to repartition our data and we have configured our partition size as 128 MB. Our total data is 1 GB and we need to repartition it to 2; each partition size can be 128 MB. What will happen?

MrMuraliS

Hi Raja, if you get some time, can you please explain the difference between a normal function and UDFs? (In some cases I observed we use a normal Python function in our code, and in some cases it is registered as a UDF.) And when to use which?

omprakashreddy
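
On the question above, a small illustrative sketch (names are hypothetical): a plain Python function that only composes built-in Column expressions needs no registration and keeps Catalyst's optimizations, whereas a UDF is for arbitrary per-row Python logic and pays a serialization cost.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# 1) A normal Python function that merely builds Column expressions:
#    it runs entirely inside Spark's optimized engine, no registration needed.
def shout(col):
    return F.upper(F.concat(col, F.lit("!")))

df.select(shout(F.col("name")).alias("shouted")).show()

# 2) A UDF: arbitrary per-row Python code that Spark ships to the executors;
#    slower, because each row is serialized out to a Python worker.
@F.udf(returnType=StringType())
def shout_udf(name):
    return name.upper() + "!"

df.select(shout_udf("name").alias("shouted")).show()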

How do we decide on the number of buckets that we need to set? Example: bucketBy(id, X) -- X can be any number, right? How can we decide on the number of buckets that should be passed?

ranjithrampally

When we do a join on two DataFrames (non-bucketed and non-partitioned), it will involve a shuffle. So during the shuffle, will it make sure that each partition has only 1 distinct key?
For example: if I join 2 DataFrames on column 'A' and there are 600 distinct column 'A' values, does the shuffle create 600 partitions?

vineethreddy.s

Hi Raja sir,
Do you have any full course on full-fledged Databricks with Scala/Python?
How can we connect with you?

ps-upmx

Nice video, Sir. I got one interview question: if I have 1000 datasets and I don't know the size of all those datasets, and I have to perform a join, how will I decide whether to go for a broadcast join or not?

kiranmudradi

Hello sir,
Please make a video on Databricks connectivity with Azure Event Hubs with basic transformations. Hope you will make it. Thank you.

AmanShaikh-wvkx