Accelerating Data Processing in Spark SQL with Pandas UDFs

Spark SQL provides a convenient layer of abstraction that lets users express a query's intent while Spark handles the harder task of query optimization. Since Spark 2.3, pandas UDFs have allowed users to define arbitrary functions in Python that execute in batches, giving the flexibility needed to write queries for very niche cases. At Quantcast, we have developed a model training pipeline that collects training data for tens of thousands of models from petabytes of logs. Because of the scale of data this pipeline handles, we spent considerable effort optimizing Spark SQL to make our queries as efficient as possible. The result is a set of techniques that use pandas UDFs to run highly specialized batch processing jobs, speeding up our data processing pipelines by over an order of magnitude. This talk covers the lessons we learned from this process, focusing on how we leveraged our custom UDFs to achieve significant performance gains. The main takeaways are:
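
For reference, here is a minimal sketch of a scalar pandas UDF of the kind the talk builds on. The column names and the toy DataFrame are placeholders rather than anything from the Quantcast pipeline, and the type-hint style shown requires Spark 3.0+ (Spark 2.3/2.4 use the older PandasUDFType.SCALAR form):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()

# Scalar pandas UDF: Spark ships column data to Python in Arrow batches,
# so the function receives and returns whole pandas Series and the body
# runs vectorized over many rows per Python invocation.
@pandas_udf("double")
def squared_plus_one(v: pd.Series) -> pd.Series:
    return v * v + 1.0

# Toy DataFrame standing in for real pipeline data (hypothetical).
df = spark.range(1_000_000).withColumn("value", col("id").cast("double"))
df.select(squared_plus_one(col("value")).alias("result")).show(5)
```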

1. Learning what Spark SQL tends to do well and what it tends to do poorly.
2. Ideas you can implement in UDFs that can potentially speed up queries by over an order of magnitude.
3. Ways to quickly profile your Spark SQL jobs to check whether your ideas are working as intended (a small sketch follows this list).
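
For the third takeaway, one lightweight way to check a query, not necessarily the profiling approach used in the talk, is to inspect the physical plan and time a full materialization. The noop sink and the formatted explain mode shown here require Spark 3.0+, and `df` and `squared_plus_one` are assumed to be the DataFrame and UDF from the sketch above:

```python
import time
from pyspark.sql.functions import col

# Inspect the physical plan: an ArrowEvalPython node confirms the pandas
# UDF path, and the plan shows which scans and exchanges Spark scheduled.
query = df.select(squared_plus_one(col("value")).alias("result"))
query.explain(mode="formatted")

# Time a full execution without paying to write real output.
start = time.time()
query.write.format("noop").mode("overwrite").save()
print(f"elapsed: {time.time() - start:.1f}s")
```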

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

The performance difference between the code samples is insane. This solidifies my decision to better learn the pandas UDF framework. Thanks for the video!

chattoyourdata