Partition vs bucketing | Spark and Hive Interview Question

This video is part of the Spark learning series. Spark provides several ways to optimize query performance. In this video, we cover the following:
What is Partitioning
How does partitioning help to improve performance
What is Bucketing
How does bucketing help to improve performance
Difference between Partitioning and Bucketing
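
To make the two ideas in the list above concrete, here is a minimal pure-Python sketch of how rows are grouped under each scheme. It is an illustration only, not Spark's or Hive's implementation: the helper names (`partition_rows`, `bucket_rows`, `bucket_id`) are hypothetical, and `zlib.crc32` is a stand-in for the engines' own hash functions.

```python
import zlib
from collections import defaultdict


def bucket_id(key: str, num_buckets: int) -> int:
    """Stand-in hash (assumption): Spark and Hive each use their own hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def partition_rows(rows, column):
    """Partitioning: one group (think: one directory) per distinct column value."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{column}={row[column]}"].append(row)
    return dict(groups)


def bucket_rows(rows, column, num_buckets):
    """Bucketing: a fixed number of groups, chosen by hashing the column value."""
    groups = defaultdict(list)
    for row in rows:
        groups[bucket_id(str(row[column]), num_buckets)].append(row)
    return dict(groups)


rows = [
    {"country": "IN", "user_id": "u1"},
    {"country": "IN", "user_id": "u2"},
    {"country": "US", "user_id": "u3"},
    {"country": "US", "user_id": "u1"},
]

by_country = partition_rows(rows, "country")          # grows with distinct values
by_user = bucket_rows(rows, "user_id", num_buckets=4)  # never more than 4 groups

print(sorted(by_country))
print({bucket: len(members) for bucket, members in by_user.items()})
```

The number of partitions follows the data (one per distinct value), so partitioning suits low-cardinality columns. The number of buckets is fixed up front, and equal keys always land in the same bucket, which is why bucketing suits high-cardinality columns such as IDs.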

How Spark's performance is impacted by Dynamic Partition Pruning

#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3

Comments

Small file problem in Hadoop?
In my view, having lots of small files in the cluster increases the burden on the NameNode, because the NameNode stores the metadata for every file. With lots of small files, the NameNode has to keep track of all their locations, and if the master goes down, the cluster goes down with it.

alibinmazi
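
A rough way to see the scale of the NameNode burden described in the comment above is to count metadata objects. The sketch below is back-of-the-envelope arithmetic under two stated assumptions: a 128 MiB block size, and the commonly quoted rough figure of about 150 bytes of NameNode heap per file or block object. Real usage varies by Hadoop version and configuration, and directories are ignored.

```python
import math

BYTES_PER_OBJECT = 150           # assumption: rough, commonly quoted figure
BLOCK_SIZE = 128 * 1024 * 1024   # assumption: 128 MiB HDFS block size


def namenode_objects(file_sizes):
    """Count one metadata object per file plus one per block (directories ignored)."""
    return sum(1 + math.ceil(size / BLOCK_SIZE) for size in file_sizes)


def namenode_bytes(file_sizes):
    """Estimated NameNode heap used to track these files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


one_gib = 1024 ** 3
one_big_file = [one_gib]                         # 1 GiB stored as a single file
many_small_files = [one_gib // 10_000] * 10_000  # about the same data in 10,000 files

print(namenode_objects(one_big_file), namenode_bytes(one_big_file))          # 9 1350
print(namenode_objects(many_small_files), namenode_bytes(many_small_files))  # 20000 3000000
```

Under these assumptions, the same gigabyte of data costs the NameNode over two thousand times more metadata when it is split into 10,000 small files.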

I must have watched this video at least 5 times between yesterday and today. Thank you very much.

cajaykiran

Thanks for the great video, very clear explanation

prosperakwo

You are too good, sir. Thank you so much for clearing up our concepts ❤

FaizanAli-wewc

Thanks for a very helpful video. My question is: how can we optimise queries using bucketing? In bucketing, data is shuffled among different buckets, so it will not be sorted. If I use a WHERE condition on a bucketed table, how do I avoid scanning irrelevant buckets, the way I do with partitioning? In short, does a WHERE condition benefit from a bucketed table, and if not, what other optimisations does bucketing offer?

saurabhgarud
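
One way to reason about the question above: for an equality filter on the bucketing column, an engine can hash the literal and read only the bucket it maps to. The pure-Python sketch below illustrates that idea only. The names (`write_bucketed`, `lookup`) are hypothetical, `zlib.crc32` stands in for the engine's real hash function, and whether a given Spark or Hive version actually prunes buckets this way depends on the version and settings.

```python
import zlib

NUM_BUCKETS = 4


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in hash (assumption): real engines use their own hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows):
    """Distribute rows into a fixed set of buckets by hashing user_id."""
    buckets = {i: [] for i in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_id(row["user_id"])].append(row)
    return buckets


def lookup(buckets, user_id):
    """Equality filter: hash the literal, then scan only the matching bucket."""
    scanned = buckets[bucket_id(user_id)]
    matches = [row for row in scanned if row["user_id"] == user_id]
    return matches, len(scanned)


rows = [{"user_id": f"u{i}", "amount": i * 10} for i in range(1, 9)]
buckets = write_bucketed(rows)

matches, rows_scanned = lookup(buckets, "u3")
print(matches, f"scanned {rows_scanned} of {len(rows)} rows")
```

A range filter such as `user_id > 'u3'` cannot be pruned this way, because hashing does not preserve order. The larger benefit usually attributed to bucketing is in joins and aggregations on the bucketing column, where matching keys are already co-located and a shuffle can be avoided.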

This is a nice explanation, but you are considering physical partitions for Hive and in-memory partitions for Spark to show the difference in the number of files generated.

rakeshdey

Please keep making more such videos.
It would also be great if you could make something on cloud-related big data technologies.

shikhargupta

Really appreciate @Data Savvy for the effort. I have a question:
Can the data search/retrieval process for a partitioned table be understood, by analogy, as element retrieval in a binary tree, and for a partitioned and bucketed table as a search in a nested binary tree? I am referring to the binary tree from data structures.

Recently I watched one of the mock big data interview videos on your channel and liked it a lot. If possible, please upload a few more such videos. Thanks :)

subhajitroy

How can I find out whether my bucketing was actually used by the query? Is it visible in the physical plan? Also, I believe that in the case of partitioning plus bucketing, both the partition and bucket filters should be in my query?

ayushjain

Important point: Hive partitioning is not the same as Spark partitioning. 7:34-9:14

sashikiran

When there are a lot of small files in Hadoop, NameNode performance can suffer because it cannot process the metadata fast enough. Hadoop is really meant for handling big data, so creating too many small files may end up hurting NameNode performance. I came across this problem in my project.

r.kishorekumar

Nice explanation. Could you please also cover Hive join examples (map-side join and all the other joins) and performance tuning?

anandraj

Can we increase the performance of a Hive query while fetching records, assuming the table is already partitioned?

vikramrajsahu

The small file problem is a headache for the NameNode, since it has to manage the metadata. Spark also needs more executors, which is again an overhead.

raviranjan

@data savvy, I observed on my local system with multiple cores that neither partitionBy nor bucketBy performs a shuffle; there is no exchange in the plan. Is that why small files are produced in both cases? Will it perform a shuffle on a large cluster? I am just reading from a file and writing with partitionBy or bucketBy, with no transformations. In this case, will there also be no shuffle at cluster level?

vamshi
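
The observation above can be checked with simple counting: if the write adds no shuffle, each write task emits its own file for every partition value (or bucket) it happens to hold. The pure-Python sketch below simulates that; the task layout is made up for illustration and `count_output_files` is a hypothetical helper, not a Spark API.

```python
def count_output_files(tasks, key):
    """Each write task emits one file per distinct key value it holds."""
    return sum(len({row[key] for row in task}) for task in tasks)


countries = ("IN", "US", "UK")

# No shuffle before the write: 4 tasks, each holding rows for all 3 countries.
mixed_tasks = [[{"country": c} for c in countries] for _ in range(4)]

# Rows redistributed by country first: each country's rows sit in a single task.
grouped_tasks = [[{"country": c} for _ in range(4)] for c in countries]

files_without_shuffle = count_output_files(mixed_tasks, "country")
files_with_shuffle = count_output_files(grouped_tasks, "country")

print(files_without_shuffle)  # 12 = 4 tasks x 3 countries
print(files_with_shuffle)     # 3  = one file per country
```

In Spark terms, the second layout corresponds to calling `df.repartition("country")` before `write.partitionBy("country")`, which adds a shuffle on purpose in exchange for fewer, larger files.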

How do we decide which column to use for partitioning and which to use for bucketing?

kumarsatyachaitanyayedida

Thanks for the video.
But I have one question: how do I insert data into a bucketed Hive table using Spark? I tried this, but it didn't give the correct output.

uditmittal

Hi, are you running Spark and Scala training classes?

bhavaniv

Sir, could you please give a syntax example comparing Hive partitioning and bucketing with Spark partitioning and bucketing? Also, I couldn't understand the last point of your summary. Could you please clarify it?

sambitkumardash

How can we decide the number of buckets if, after partitioning, one file is 128 MB, the second is 400 MB, and the third is 200 MB? Kindly answer. Thanks in advance.

routhmahesh
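
For the sizes in the question above, one common rule of thumb (an assumption here, not something stated in the video) is to pick a bucket count so that the largest partition splits into files of roughly one HDFS block each. The same bucket count applies to every partition of the table. The sketch below uses a hypothetical helper, `suggest_bucket_count`, with a 128 MB target, and assumes the hash spreads rows evenly across buckets.

```python
import math


def suggest_bucket_count(partition_sizes_mb, target_file_mb=128):
    """Rule-of-thumb sketch: size buckets so the largest partition yields ~target-sized files."""
    largest = max(partition_sizes_mb)
    return max(1, math.ceil(largest / target_file_mb))


partition_sizes_mb = [128, 400, 200]  # sizes from the question above
buckets = suggest_bucket_count(partition_sizes_mb)

# Average file size per partition, assuming an even hash distribution.
avg_file_mb = [size / buckets for size in partition_sizes_mb]

print(buckets)      # 4
print(avg_file_mb)  # [32.0, 100.0, 50.0]
```

The smallest partition ends up with 32 MB files, which shows the trade-off: a bucket count sized for the largest partition produces smaller files everywhere else, so heavily skewed partition sizes call for a compromise.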
