Accelerating Data Processing in Spark SQL with Pandas UDFs

Spark SQL provides a convenient layer of abstraction that lets users express a query's intent while Spark handles the harder task of query optimization. Since Spark 2.3, pandas UDFs have allowed users to define arbitrary functions in Python that execute in batches, giving the flexibility needed to write queries for very niche cases. At Quantcast, we have developed a model training pipeline that collects training data for tens of thousands of models from petabytes of logs. Because of the scale of data this pipeline handles, we spent considerable effort optimizing Spark SQL to make our queries as efficient as possible. The result is a set of techniques that use pandas UDFs to run highly specialized batch processing jobs, speeding up our data processing pipelines by over an order of magnitude. This talk covers the lessons we learned from this process, focusing on how we leveraged our custom UDFs to achieve significant performance gains. The main takeaways are:
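
For reference, here is a minimal sketch of a scalar pandas UDF of the kind the talk builds on. The column names and the toy DataFrame are placeholders rather than anything from the Quantcast pipeline, and the type-hint style shown requires Spark 3.0+ (Spark 2.3/2.4 use the older PandasUDFType.SCALAR form):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()

# Scalar pandas UDF: Spark ships column data to Python in Arrow batches,
# so the function receives and returns whole pandas Series and the body
# runs vectorized over many rows per Python invocation.
@pandas_udf("double")
def squared_plus_one(v: pd.Series) -> pd.Series:
    return v * v + 1.0

# Toy DataFrame standing in for real pipeline data (hypothetical).
df = spark.range(1_000_000).withColumn("value", col("id").cast("double"))
df.select(squared_plus_one(col("value")).alias("result")).show(5)
```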

1. Learning what Spark SQL tends to do well and what it tends to do poorly.
2. Ideas you can implement in UDFs that can potentially speed up queries by over an order of magnitude.
3. Ways to quickly profile your Spark SQL jobs to check whether your ideas are working as intended (a small sketch follows this list).
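
For the third takeaway, one lightweight way to check a query, not necessarily the profiling approach used in the talk, is to inspect the physical plan and time a full materialization. The noop sink and the formatted explain mode shown here require Spark 3.0+, and `df` and `squared_plus_one` are assumed to be the DataFrame and UDF from the sketch above:

```python
import time
from pyspark.sql.functions import col

# Inspect the physical plan: an ArrowEvalPython node confirms the pandas
# UDF path, and the plan shows which scans and exchanges Spark scheduled.
query = df.select(squared_plus_one(col("value")).alias("result"))
query.explain(mode="formatted")

# Time a full execution without paying to write real output.
start = time.time()
query.write.format("noop").mode("overwrite").save()
print(f"elapsed: {time.time() - start:.1f}s")
```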

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

The performance difference between the code samples is insane. This solidifies my decision to better learn the pandas UDF framework. Thanks for the video!

chattoyourdata