Fine Tuning and Enhancing Performance of Apache Spark Jobs

Apache Spark's defaults provide decent performance for large data sets, but they leave room for significant gains if you tune parameters to match your resources and workload. We'll dive into best practices extracted from solving real-world problems and the steps we took as we added resources: garbage collector selection, serialization, tuning the number of workers/executors, partitioning data, diagnosing skew, partition sizes, scheduling pools, the fair scheduler, and Java heap parameters. We'll also cover reading the Spark UI execution DAG to identify bottlenecks and their solutions, optimizing joins and partitioning, and Spark SQL rollups, including practices best avoided where possible.
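When `spark.scheduler.mode=FAIR` is enabled, pools are defined in an allocation file referenced by `spark.scheduler.allocation.file`. A sketch of such a file; the pool names and weights are illustrative, not taken from the talk:

```xml
<!-- fairscheduler.xml: example pool definitions (names/weights are assumptions) -->
<allocations>
  <pool name="etl">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="adhoc">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

A job then opts into a pool at runtime with `sc.setLocalProperty("spark.scheduler.pool", "etl")`.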
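Several of the knobs mentioned above (serialization, GC selection, executor sizing, shuffle partitions, the fair scheduler) are set through Spark configuration. A minimal illustrative `spark-submit` invocation follows; the resource values and the job file name `my_job.py` are assumptions to adapt to your own cluster, not recommendations from the talk:

```shell
# Illustrative tuning flags; every numeric value here is an assumption.
# - Kryo serialization is typically faster and more compact than Java serialization
# - G1GC is a common garbage collector choice for large executor heaps
# - FAIR scheduling lets concurrent jobs share the cluster via pools
spark-submit \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.scheduler.mode=FAIR \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  my_job.py
```

The same keys can live in `spark-defaults.conf` if you prefer cluster-wide defaults over per-job flags.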
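On the skew point: one common mitigation (a general technique, not necessarily the one demonstrated in the talk) is key salting, which spreads a hot key across several sub-keys so its records no longer land in a single partition. A minimal plain-Python sketch, where `partition_for` stands in for Spark's hash partitioner and `SALT_BUCKETS` is an assumed tuning knob:

```python
import hashlib
import random

SALT_BUCKETS = 8  # assumed number of sub-keys to spread a hot key across


def salted_key(key: str) -> str:
    """Append a random salt so one hot key maps to up to SALT_BUCKETS sub-keys."""
    return f"{key}_{random.randrange(SALT_BUCKETS)}"


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (stand-in for Spark's HashPartitioner)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions


# A skewed workload: without salting, every "hot_user" record hashes to the
# same partition; with salting they spread over up to SALT_BUCKETS partitions.
salted_partitions = {partition_for(salted_key("hot_user"), 32) for _ in range(1000)}
assert 1 < len(salted_partitions) <= SALT_BUCKETS
```

The cost is that the other side of a join must be replicated once per salt value, so this trades extra work on small keys for balance on the hot one.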

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments:

Java GC logs are the toughest to understand. Still, you guys tackled them well 👍🏼

thevijayraj

Well done, tackling Spark's most difficult topic

HughMcBrideDonegalFlyer

Great presentation. Are the slides available somewhere?

gandatrowx