Building Robust ETL Pipelines with Apache Spark - Xiao Li

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources and must handle incorrect, incomplete or inconsistent records and produce curated, consistent data for consumption by downstream applications.

In this talk, we'll take a deep dive into the technical details of how Apache Spark "reads" data and discuss how Spark 2.2's flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Overview:
1) What’s an ETL Pipeline?

2) Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces

3) New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)

View slides:

Related articles:
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark

Writing Data Engineering Pipelines in Apache Spark on Databricks

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Connect with us:
Comments

I really appreciate it when someone presents problems and shows solutions in code ... much better than other high-level BS talks

AbdulelahAlJeffery

4:11 - u actually gave me a hint. It's a long story to explain for what! Will implement n comment. ♥️

sumitkumarsahoo

Great content. My English is good enough to understand what you say. Keep it up.

aashishraina

I can guess what he’s trying to say plus the content is great 👍🏼. Keep it up

tacorevenge

All is good except for the strange pronunciation of "and" and "data". I suggest Dr. Li take some time to practice these two simple, high-frequency words. Thanks a lot 😊

tinaxu

I didn't get the question and answer at the end of the section.

poonampatel