Spark Application | Partition By in Spark | Chapter - 2 | LearntoSpark

In this video, we will learn about partitionBy in the Spark DataFrame writer. We will have a demo on how to save data by creating a partition on a date column using PySpark.
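A minimal PySpark sketch of what the demo covers (the input path, timestamp column, and output path are illustrative assumptions, not taken from the video):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, col

    spark = SparkSession.builder.appName("PartitionByDemo").getOrCreate()

    # Illustrative input: any dataframe with a timestamp column will do.
    df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)

    # Derive a date-only column, then write one directory per date value.
    df = df.withColumn("date", to_date(col("event_time")))
    df.write.partitionBy("date").mode("overwrite").parquet("/output/sales_by_date")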

Comments

Nice video on partitioning.
I have a question: while partitioning by a date-only column, if I need to create 2 partitions for each date, how do I create them?
Hope you understand my query.

ramum
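One possible answer to the question above (a sketch, not from the video): salt each row with one of two values and shuffle by (date, salt), so each date spreads across up to two tasks and therefore up to two files per date directory.

    from pyspark.sql.functions import col, rand

    # "salt" takes the value 0 or 1, splitting each date across up to two tasks.
    salted = df.withColumn("salt", (rand() * 2).cast("int"))
    salted.repartition(col("date"), col("salt")) \
          .write.partitionBy("date") \
          .mode("overwrite") \
          .parquet("/output/sales_by_date")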

@Azarudeen Shahul Hi bro, I am using partitionBy while writing my dataframe to S3. I am writing my data into 30 partitions (30 days of a month), and within each partition multiple small files (around 30-50 KB) are getting created, so the write is taking a long time.

Any optimization suggestions for this?

johnsonrajendran
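A common fix for this (a sketch; the bucket path is hypothetical): shuffle by the partition column before writing, so all rows for a given day land in a single task and each day's directory gets one larger file instead of many tiny ones.

    from pyspark.sql.functions import col

    df.repartition(col("date")) \
      .write.partitionBy("date") \
      .mode("overwrite") \
      .parquet("s3a://my-bucket/sales_by_date")  # hypothetical S3 path

If one file per day would be too large, the writer's maxRecordsPerFile option can cap the size of each file.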

Hi Shahul, I wanted to thank you for your content on these topics. Just wanted to know: what is the use of the unix_timestamp and from_unixtime functions when to_date(col("column_name")) does the job of changing the timestamp to a date-only column without any issue?

saurabh
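For plain truncation the commenter is right that to_date is enough; a sketch of where the unix functions still help, namely parsing a non-standard string format (the format string here is an illustrative assumption):

    from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime, col

    # Standard timestamp -> date: to_date alone does the job.
    df1 = df.withColumn("date", to_date(col("event_time")))

    # Non-standard string -> date: parse to epoch seconds, then format back.
    df2 = df.withColumn(
        "date",
        from_unixtime(unix_timestamp(col("event_time"), "dd-MM-yyyy HH:mm:ss"),
                      "yyyy-MM-dd"))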

Hi Azar, could you please help us understand what happens when streaming data comes in and how partitionBy handles it. For example, if data for the same date arrives one day later, is it possible to load it into the existing partition, or does it go into a new file?

maheshk
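The video does not cover streaming, but as a sketch of the usual batch behaviour: writing with append mode adds new files inside an existing date directory rather than replacing it, so late-arriving rows for a known date end up in the existing partition.

    # late_df holds rows that arrived a day late for an already-written date.
    late_df.write.partitionBy("date").mode("append").parquet("/output/sales_by_date")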

How do you calculate the number of partitions required for 10 GB of data, and for repartition and coalesce? Please help.

MrManish
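A rule-of-thumb sketch for the question above: target roughly 128 MB per partition, which for 10 GB works out to about 80 partitions.

    # 10 GB at ~128 MB per partition -> 10 * 1024 / 128 = 80 partitions.
    num_partitions = (10 * 1024) // 128  # 80

    df = df.repartition(num_partitions)  # full shuffle; can grow or shrink the count
    df = df.coalesce(40)                 # narrow transformation; can only shrink it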

Can you please also add a link to the dataset you've used in the demo.

samsid

Can you please upload a UDF video with some complex examples.

delhilife

Hi bro, your videos are nice and explained in a simple way. Please make more interview questions on Spark Core, Spark SQL, Kafka, Streaming, and Hive.

madhanmohanreddy

Can you please provide the equivalent Scala code?

Jerinsjc