Improving Apache Spark's Reliability with DataSourceV2 - Ryan Blue

DataSourceV2 is Spark's new API for working with data from tables and streams, but "v2" also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the data frame read and write APIs. This talk will cover the context for those additional changes and how "v2" will make Spark more reliable and predictable for building enterprise data pipelines. This talk will include:

* Problem areas where the current behavior is unpredictable or unreliable
* The new standard SQL write plans (and the related SPIP)
* The new table catalog API and a new Scala API for table DDL operations (and the related SPIP)
* Netflix's use case that motivated these changes

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Related videos

Comments

wrong title - should be "Improving Spark's Reliability with DataSourceV2"

gpsas

Improving Apache Spark's Reliability with DataSourceV2 - Ryan Blue

Improving Apache Spark by Taking Advantage of Disaggregated Architecture - Chenzhao Guo

Improving Apache Spark with S3 - Ryan Blue

Improving Apache Spark for Dynamic Allocation and Spot Instances

Seattle Spark + AI Meetup: How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability

Beyond Shuffling: Scaling Apache Spark by Holden Karau

Optimising Apache Spark and SQL for improved performance | Marcin Szymaniuk | Conf42 ML 2024

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Enabling Vectorized Engine in Apache Spark

Improve Apache Spark™ DS v2 Query Planning Using Column Stats

From HDFS to S3: Migrate Pinterest Apache Spark Clusters

Managing ADLS gen2 using Apache Spark

Lessons from the Field: Applying Best Practices to Your Apache Spark Applications with Silvio Fiorito

Making Apache Spark™ Better with Delta Lake

Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements

Delta Lake: Reliability and Data Quality for Data Lakes and Apache Spark by Michael Armbrust

Improving Apache Spark Downscaling - Christopher Crosbie (Google) Ben Sidhom (Google)

Open Source Reliability for Data Lake with Apache Spark

How to Extend Apache Spark with Customized Optimizations - Sunitha Kambhampati (IBM)

Tuning Apache Spark for Large Scale Workloads - Sital Kedia & Gaoxiang Liu

Flash for Apache Spark Shuffle with Cosco

Fast and Reliable Apache Spark SQL Releases

Expanding Apache Spark Use Cases in 2.2 and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrus...

Improving interactive querying experience on Spark SQL - Ashish Singh, Sanchay Javeria