Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
To scale out deep learning training, a popular approach is to use Distributed Deep Learning Frameworks to parallelize processing and computation across multiple GPUs/CPUs. Distributed Deep Learning Frameworks work well when input training data elements are independent, allowing parallel processing to start immediately. However, preprocessing and featurization steps, crucial to Deep Learning development, might involve complex business logic with computations across multiple data elements that standard Distributed Frameworks cannot handle efficiently. These preprocessing and featurization steps are where Spark can shine, especially with the upcoming support in version 3.0 for binary data formats commonly found in Deep Learning applications.

The first part of this talk will cover how Pandas UDFs, together with Spark’s support for binary data and TensorFlow’s TFRecord format, can be used to efficiently speed up Deep Learning’s preprocessing and featurization steps. For the second part, the focus will be on techniques to efficiently perform batch scoring on large data volumes with Deep Learning models, where real-time scoring methods do not suffice. New Pandas UDF features in the upcoming Spark 3.0 release that are helpful for Deep Learning inference will also be covered.
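To make the first part concrete, here is a minimal sketch (not taken from the talk itself) of combining Spark 3.0's binary file reader with a Pandas UDF to featurize images in parallel. The input path, the 224x224 resize, and the use of the third-party spark-tensorflow-connector package for the "tfrecord" output format are assumptions for illustration.

import io
import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("dl-data-prep").getOrCreate()

# Spark 3.0's "binaryFile" source loads raw bytes plus file metadata.
images = spark.read.format("binaryFile").load("/data/images/*.jpg")  # hypothetical path

@pandas_udf(ArrayType(FloatType()))
def featurize(content: pd.Series) -> pd.Series:
    # Decode, resize, and normalize each image; Spark hands the UDF
    # batches of rows as pandas Series rather than one row at a time.
    def decode(raw: bytes):
        img = Image.open(io.BytesIO(raw)).convert("RGB").resize((224, 224))
        return (np.asarray(img, dtype=np.float32) / 255.0).ravel().tolist()
    return content.apply(decode)

features = images.select(featurize("content").alias("features"))

# Writing TFRecord files relies on the spark-tensorflow-connector package
# (format name "tfrecord"); it is not part of core Spark.
features.write.format("tfrecord").save("/data/featurized")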
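For the second part, a minimal sketch (again an assumption, not the presenter's code) of batch scoring with Spark 3.0's Scalar Iterator Pandas UDF, which lets a model be loaded once per task and reused across batches instead of reloading it for every batch. The model path, the features DataFrame, and the Keras model are hypothetical.

from typing import Iterator

import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

@pandas_udf(FloatType())
def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the (hypothetical) saved model once per task, not once per batch.
    model = tf.keras.models.load_model("/models/my_model")
    for features in batches:
        # Each element of the Series is a flattened feature vector, as in the
        # featurization sketch above; stack them into one batch for scoring.
        X = np.array(features.tolist(), dtype=np.float32)
        preds = model.predict(X, verbose=0)
        yield pd.Series(preds.ravel())

# features_df is assumed to hold a "features" column like the one produced above.
scored = features_df.withColumn("score", predict("features"))
scored.write.mode("overwrite").parquet("/data/scored")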
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Connect with us: