Partition vs bucketing | Spark and Hive Interview Question

This video is part of the Spark learning series. Spark provides different methods to optimize query performance. In this video, we cover the following:
What is Partitioning
How does partitioning help to improve performance
What is Bucketing
How does bucketing help to improve performance
Difference between Partitioning and Bucketing

How Spark's performance is impacted by Dynamic Partition Pruning
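
As a rough illustration of the difference between the two ideas listed above, the following sketch models partitioning as "one directory per column value" and bucketing as "one file per hash bucket". It is plain Python, not Spark or Hive code. The column names, the bucket count, and the use of `zlib.crc32` as the hash are all assumptions made for the example; Spark and Hive each use their own hash functions.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for the example


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in hash (assumption): real engines use their own hash functions.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def layout(rows, partition_col, bucket_col):
    """Group rows by (partition directory, bucket file)."""
    files = defaultdict(list)
    for row in rows:
        # Partitioning: the column VALUE picks the directory.
        directory = f"{partition_col}={row[partition_col]}"
        # Bucketing: the HASH of the column value picks the file.
        file_name = f"bucket_{bucket_id(str(row[bucket_col])):05d}"
        files[(directory, file_name)].append(row)
    return dict(files)


rows = [
    {"country": "IN", "user_id": "u1", "amount": 10},
    {"country": "IN", "user_id": "u2", "amount": 25},
    {"country": "US", "user_id": "u1", "amount": 7},
    {"country": "US", "user_id": "u3", "amount": 3},
]

table = layout(rows, partition_col="country", bucket_col="user_id")
for (directory, file_name), contents in sorted(table.items()):
    print(directory, file_name, [r["user_id"] for r in contents])
```

The number of directories grows with the number of distinct values in the partition column, while the number of bucket files per directory is capped at the bucket count. This is why partitioning tends to suit low-cardinality columns and bucketing tends to suit high-cardinality ones.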

#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3
Comments

Small file problem in Hadoop?
As I understand it, lots of small files in the cluster increase the burden on the NameNode, because the NameNode stores the metadata for every file. With many small files, the NameNode has to keep track of the location of each one, and if the master goes down, the cluster goes down with it.

alibinmazi
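
To put a rough number on the NameNode burden described in the comment above, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object, and a 128 MB block size. Both figures are assumptions for illustration, not measurements from any particular cluster.

```python
import math

BYTES_PER_OBJECT = 150         # rule-of-thumb assumption, not a measured value
BLOCK_SIZE = 128 * 1024 ** 2   # 128 MB, a common HDFS block size (assumption)


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


total = 1024 ** 4  # the same 1 TB of data in both layouts

large = namenode_bytes(total // 1024 ** 3, 1024 ** 3)  # 1 GB files
small = namenode_bytes(total // 1024 ** 2, 1024 ** 2)  # 1 MB files

print(f"1 GB files: {large / 1024 ** 2:.1f} MB of NameNode memory")
print(f"1 MB files: {small / 1024 ** 2:.1f} MB of NameNode memory")
```

Under these assumptions the same volume of data costs the NameNode a few hundred times more memory when it is stored as 1 MB files instead of 1 GB files.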

I must have watched this video at least 5 times between yesterday and today. Thank you very much.

cajaykiran

Thanks for the great video, very clear explanation

prosperakwo

You are too good, sir. Thank you so much for clearing up our concepts ❤

FaizanAli-wewc

Thanks for a very helpful video. My question is: how can we optimise using bucketing? In bucketing, data is shuffled into different buckets, so it will not be sorted. If I use a WHERE condition on a bucketed table, how do I avoid scanning irrelevant buckets, as I do with partitioning? In short, does a WHERE condition optimise a bucketed table, and if not, what other optimisations does bucketing offer?

saurabhgarud
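
One way to reason about the question above: an equality filter on the bucketing column can be mapped to a single bucket by hashing the literal, whereas a range filter cannot. The sketch below models that in plain Python. The bucket count, the column names, and the `zlib.crc32` hash are assumptions for illustration. Whether a given engine actually skips buckets this way depends on its version and settings, so treat this as the idea and not as a description of Spark or Hive internals. The other commonly cited benefit of bucketing is that two tables bucketed the same way on the join key may be joined without a shuffle.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash (assumption): real engines use their own hash functions.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_buckets(rows, key):
    """Distribute rows into NUM_BUCKETS lists by hashing the bucket column."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_id(row[key])].append(row)
    return buckets


def equality_lookup(buckets, key, value):
    """WHERE key = value: hash the literal and scan only that bucket."""
    target = buckets[bucket_id(value)]
    return [r for r in target if r[key] == value], 1


def range_scan(buckets, key, low, high):
    """WHERE key BETWEEN low AND high: hashing gives no ordering, scan all."""
    matches = [r for b in buckets for r in b if low <= r[key] <= high]
    return matches, len(buckets)


rows = [{"order_id": i, "amount": i * 10} for i in range(100)]
buckets = write_buckets(rows, "order_id")

hits, scanned = equality_lookup(buckets, "order_id", 42)
print(hits, "buckets scanned:", scanned)

hits, scanned = range_scan(buckets, "order_id", 10, 12)
print(len(hits), "rows, buckets scanned:", scanned)
```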

This is a nice explanation, but you are considering physical partitions for Hive and in-memory partitions for Spark to show the difference in the number of files generated.

rakeshdey

Please keep making more such videos.
It would also be great if you could make something on cloud-related big data technologies.

shikhargupta

Really appreciate the effort, @Data Savvy. I have a question:
To create an analogy, can we think of data retrieval from a partitioned table as being like element retrieval in a binary tree, and retrieval from a partitioned and bucketed table as being like a search in a nested binary tree? I am referring to the binary tree from data structures.

Recently I watched one of the mock big data interview videos on your channel and liked it a lot. If possible, please upload a few more such videos. Thanks :)

subhajitroy

How can I find out whether my bucketing was really used by the query? Is it visible in the physical plan? Also, I believe that in the case of partitioning plus bucketing, both the partition and bucket filters should be in my query?

ayushjain

Important point: Hive partitioning is not the same as Spark partitioning (7:34-9:14).

sashikiran

When there are a lot of small files in Hadoop, NameNode performance can suffer because it cannot process them quickly. Hadoop is really meant for handling big data, so creating too many small files may end up hurting NameNode performance. I came across this problem in my project.

r.kishorekumar

Nice explanation. Could you please also cover Hive joins, such as map-side join and the other join types, along with performance tuning?

anandraj

Can we increase the performance of a Hive query while fetching records, assuming the table is already partitioned?

vikramrajsahu

The small file problem is a headache for the NameNode, since it has to manage the metadata. Spark also needs more executors, which is again an overhead.

raviranjan

@data savvy, I observed on my local system with multiple cores that neither partitionBy nor bucketBy performs a shuffle; there is no Exchange in the plan. Is that why small files are produced in both cases? Will it perform a shuffle on a large cluster? I am just reading from a file and writing with partitionBy or bucketBy, with no transformations. In this case, will there also be no shuffle at cluster level?

vamshi
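
The observation above can be modelled with a small sketch. It assumes that each write task emits one file per distinct partition value (or bucket) it happens to hold and that nothing redistributes the rows before the write. This is plain Python with hypothetical data, not Spark code, and the modulo used for the bucket key is a stand-in for a real hash.

```python
def files_written(task_rows, file_key):
    """Each task writes one file per distinct file_key value among its rows."""
    return sum(len({file_key(row) for row in rows}) for rows in task_rows)


countries = ["IN", "US", "UK"]
rows = [{"country": countries[i % 3], "id": i} for i in range(24)]

# No shuffle: rows stay wherever the read placed them (4 tasks in this model).
unshuffled = [rows[t::4] for t in range(4)]

# Explicit redistribution by the partition column before writing:
# all rows for one country end up in the same task.
shuffled = [[r for r in rows if r["country"] == c] for c in countries]


def by_country(row):
    return row["country"]


def by_bucket(row):
    return row["id"] % 3  # stand-in for hash(id) % number_of_buckets


print("partitioned write, no shuffle:  ", files_written(unshuffled, by_country))
print("partitioned write, after shuffle:", files_written(shuffled, by_country))
print("bucketed write, no shuffle:     ", files_written(unshuffled, by_bucket))
```

In this model the file count without a shuffle is roughly (number of tasks) x (distinct values or buckets per task), which is where the small files come from. Redistributing by the write key first collapses it to one file per value.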

How do we decide whether a particular column should be used for partitioning or for bucketing?

kumarsatyachaitanyayedida

Thanks for the video.
But I have one question: how do I insert data into a bucketed Hive table using Spark? I tried this, but it didn't give the correct output.

uditmittal

Hi, are you running Spark and Scala training classes?

bhavaniv

Sir, could you please give one example comparing the syntax of Hive partitioning and bucketing with Spark partitioning and bucketing? Also, I couldn't understand the last point of your summary. Could you please clarify it?

sambitkumardash

How do we decide the number of buckets if, after partitioning, one file is 128 MB, the second is 400 MB, and the third is 200 MB? Kindly answer. Thanks in advance.

routhmahesh
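
For the sizing question above, one rough heuristic (an assumption, not an official rule) is to size the bucket count on the largest partition, since the bucket count is fixed for the whole table and applies to every partition. The 128 MB target bucket size below is also an assumption, chosen to match a common HDFS block size.

```python
import math


def suggest_buckets(partition_sizes_mb, target_mb=128):
    """Smallest bucket count that keeps every bucket of the largest
    partition at or below target_mb, assuming evenly spread keys."""
    largest = max(partition_sizes_mb)
    return max(1, math.ceil(largest / target_mb))


sizes = [128, 400, 200]  # partition sizes in MB from the question
n = suggest_buckets(sizes)

print("suggested buckets:", n)
for size in sizes:
    print(f"{size} MB partition -> about {size / n:.0f} MB per bucket")
```

The trade-off is visible in the output: a count chosen for the largest partition leaves the smallest partition with fairly small bucket files, so heavily skewed partition sizes may be a sign that the partition column itself should be reconsidered.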