Apache Spark Core – Practical Optimization | Daniel Tomes (Databricks)

Properly shape partitions and jobs to enable powerful optimizations, eliminate skew, and maximize cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
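For a taste of the salting technique mentioned above, here is a minimal Scala sketch. The table names (`facts`, `dims`), the join key `id`, and the salt factor of 16 are hypothetical placeholders; the talk itself covers when salting is worth the extra shuffle volume.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-sketch").getOrCreate()

val saltBuckets = 16

// Hypothetical inputs: a large table skewed on "id" and a smaller dimension table.
val facts = spark.read.parquet("/data/facts")
val dims  = spark.read.parquet("/data/dims")

// Spread the hot keys: add a random salt on the skewed side...
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// ...and replicate the other side across every salt value so each salted key still matches.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Join on (id, salt), then drop the helper column.
val joined = saltedFacts.join(saltedDims, Seq("id", "salt")).drop("salt")
```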

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

One of the best videos on Spark job optimization.

dhrub

Great video, it covered almost all the points that cause performance issues. It's easy to code a Spark application because of the available APIs, but it's important to understand Spark's architecture to get real value out of it.

premendrasrivastava

Awesome video. Where can we find the slides / reference material?

randysuarezrodes

@39:22 I did not understand why my job would fail if I used coalesce to reduce the number of partitions while writing output. Can anyone please explain? What happens when coalesce() is pushed up to the previous stage? How does that make the job fail?

rishigc
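Re: the coalesce question above — my understanding (consistent with how the talk frames it, though hedged) is that coalesce() adds no shuffle boundary, so Spark folds it into the previous stage: the expensive upstream work then runs with only the reduced task count, which is what spills or fails. Repartitioning before the write keeps upstream parallelism. A sketch with hypothetical paths and an illustrative count of 16:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

val df    = spark.read.parquet("/data/large_input")   // hypothetical input
val heavy = df.groupBy("key").count()                 // some expensive wide transformation

// coalesce(16) introduces no shuffle, so it is folded into the previous stage:
// the aggregation itself runs with only 16 tasks, each holding far more data,
// which is what can lead to spills or outright failures.
heavy.coalesce(16).write.mode("overwrite").parquet("/data/out_coalesce")

// repartition(16) inserts a shuffle, so the upstream stage keeps its full
// parallelism and only the final write uses 16 tasks / output files.
heavy.repartition(16).write.mode("overwrite").parquet("/data/out_repartition")
```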

@30 min, how did he arrive at the figure that only 60% of the cluster is being utilized?

MULTIMAX

If you have several joins on one table, how do you set the shuffle partition count for a specific join? As I currently understand it, this config is rendered into the physical plan only when an action is triggered.

sashgorokhov
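On the several-joins question above: spark.sql.shuffle.partitions is read when each job is actually planned and executed, so one option is to change it between actions; another is to repartition the inputs explicitly so a specific join runs at a chosen parallelism. A sketch with hypothetical tables and counts:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("per-join-partitions").getOrCreate()

// Hypothetical inputs.
val t1 = spark.read.parquet("/data/t1")
val t2 = spark.read.parquet("/data/t2")
val t3 = spark.read.parquet("/data/t3")

// Option 1: change the global setting between actions; each action picks up
// whatever the config value is at the time its job is planned.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
t1.join(t2, "key").write.mode("overwrite").parquet("/data/j12")

spark.conf.set("spark.sql.shuffle.partitions", "400")
t1.join(t3, "key").write.mode("overwrite").parquet("/data/j13")

// Option 2: repartition both sides on the join key with a matching count;
// the join can then typically reuse that partitioning (here 800) instead of
// the global shuffle-partition setting.
val joined = t1.repartition(800, col("key"))
  .join(t3.repartition(800, col("key")), "key")
```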

I didn't understand the part @31:00 where he chooses 480 partitions instead of 540. Can anyone please explain why?

rishigc
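On 480 vs 540: my reading of that part of the talk is that you want the partition count to be an even multiple of the total core count so the last wave of tasks keeps every core busy. The 96-core figure below is purely an assumption for illustration:

```scala
// Back-of-the-envelope check; the 96-core cluster size is an assumption.
val cores = 96

def lastWaveUtilization(partitions: Int): Double = {
  val remainder = partitions % cores
  if (remainder == 0) 1.0 else remainder.toDouble / cores
}

// 540 partitions -> 5 full waves of 96 tasks plus a final wave of only 60 tasks.
println(lastWaveUtilization(540))  // 0.625: most cores sit idle while the stage finishes
// 480 partitions -> exactly 5 full waves; every core stays busy to the end.
println(lastWaveUtilization(480))  // 1.0
```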

29:20 If the shuffle is still 550 GB, why is columnar compression good?

MrEternalFool

Great video! I've been thinking about our Spark input read partitions since we have a lot of heavily compressed data using BZIP2. Sometimes this compression is 20-25x, so a 128 MB blah.csv.bz2 file is really a 2.5-3 GB blah.csv file. Should we reduce the maximum partition size setting in our config to accommodate this and have it result in more partitions created on read?

loganboyd
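On the BZIP2 question above: the knob that controls how many on-disk bytes go into one read partition is spark.sql.files.maxPartitionBytes, and lowering it should indeed give you more, smaller read tasks (bzip2 is a splittable codec, so the smaller splits can actually take effect). The 16 MB value below is only an illustration, not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-partition-sizing").getOrCreate()

// The default is 128 MB of *on-disk* bytes per read partition; at ~20x compression
// that can mean a few GB per task once decompressed.
spark.conf.set("spark.sql.files.maxPartitionBytes", (16 * 1024 * 1024).toString)

// Hypothetical path; with a splittable codec, more and smaller read partitions result.
val df = spark.read.option("header", "true").csv("/data/compressed/*.csv.bz2")
println(df.rdd.getNumPartitions)
```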

Very deep technical explanation on Spark optimization. Nice stuff, thank you for sharing with the community.

TheSQLPro

Excellent presentation, but terrible screenshots... very hard to read what's written.

pshar

Are the suggested sizes (target shuffle partition size, target file write size) compressed or uncompressed?

cozos
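I can't say for certain whether the targets are meant compressed or uncompressed, but for context, the sizing rule those targets feed into (as I understand the talk) is simply the shuffle stage input divided by a target size per shuffle partition, then rounded to a multiple of the core count. Illustrative numbers only:

```scala
// Illustrative numbers; plug in the stage input you read from the Spark UI.
val shuffleInputGB    = 54.0   // e.g. the ~54 GB stage input discussed elsewhere in this thread
val targetPartitionMB = 100.0  // the "target shuffle partition size"

// Rough count, before rounding to a multiple of total cores (480 in the talk's example).
val partitions = math.ceil(shuffleInputGB * 1000 / targetPartitionMB).toInt
println(partitions)            // 540
```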

Very nice explanation of the Spark Architecture and terminologies.

ayushy

Thanks a lot! @27 min, where did you get the Stage 21 input read from, i.e. 45.4 GB + 8.6 GB = 54 GB?

SpiritOfIndiaaa