Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael

Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
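
The talk centers on the typed Dataset API and Structured Streaming. As a rough illustration of the Dataset style it describes, here is a minimal, self-contained sketch; the Event case class, the events.json path, and the query itself are illustrative placeholders, not code from the talk.

import org.apache.spark.sql.SparkSession

// Hypothetical record type; any case class whose fields match the column names works.
case class Event(userId: Long, page: String, durationMs: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read JSON as an untyped DataFrame, then view it as a typed Dataset[Event].
    // "events.json" is a placeholder path.
    val events = spark.read.json("events.json").as[Event]

    // Typed, compiler-checked transformation; Catalyst still optimizes the plan.
    val slowPages = events
      .filter(e => e.durationMs > 1000)
      .groupByKey(_.page)
      .count()

    slowPages.show()
    spark.stop()
  }
}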

// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

Follow Michael on -
Comments

Superb! I must say, the best presentation I've seen in a long time.

Powers

The best YouTube Spark video I've found so far!

sitientibus

Found this presentation both informative and engaging - GLAD you recorded it.
Like others, there's SO much here that I've already stopped/rewound/restarted portions numerous times, until it took me an hour to get through a 28-minute presentation ;-)
I too noticed a few "verbal typos", but it was clear you UNDERSTOOD the terms, so it was easy to follow the slides while listening ;-)

ONE QUESTION:
I've reviewed it repeatedly, but I'm still NOT sure about one thing:
Around the 8:05 mark you show a slide saying "Stringly-typed methods will 'downcast' to generic Row objects."
- "Stringly-typed" is a SLANG term hinting at developers who type MOST of their variables as String, but IS this what you really MEANT?
- "Strongly-typed" is a NON-slang term, and actually (to me) makes as much or MORE sense in that sentence.

QUESTION: DID you really mean STRINGLY-typed, or STRONGLY-typed, and
- IF STRINGLY-typed, why? and why NOT STRONGLY-typed?

Hope you (or anyone else) still reply after 18 months :-)

KEEP UP THE GREAT POSTS !
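
For readers weighing the question above, here is a small sketch of the distinction the slide appears to draw, offered as one reading rather than the presenter's own answer: methods that take column names as plain strings return an untyped DataFrame (an alias for Dataset[Row]), while lambda-based operations keep the element type and are checked by the Scala compiler. The Person case class and sample values are made up for illustration.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)   // hypothetical example type

object StringlyVsStrongly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typing-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 34L), Person("Bo", 17L)).toDS()

    // "Stringly-typed": the column is named with a string, so a misspelled name
    // only fails at runtime, and the result drops to an untyped DataFrame
    // (i.e. Dataset[Row]).
    val names: DataFrame = people.select("name")

    // Strongly-typed: the lambda is checked at compile time and the result
    // keeps its element type.
    val adults: Dataset[Person] = people.filter(_.age >= 18)

    names.show()
    adults.show()
    spark.stop()
  }
}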

markevogt

Great add-ons, and a crisp and clear presentation!

donluc

Scala + Kafka + Spark = SuperDataPipeline

FernandoRacca

Great talk!!!
I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path (the same path) on each worker. The problem is that I am able to save it, but when I try to read it back I get these errors: "Could not read footer for file: FileStatus..." and "unable to specify Schema ...". Any suggestions?
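
A hedged sketch of the write/read pattern in the question; the column name and paths are placeholders. The common advice for "Could not read footer" failures in this setup is to write to storage every node can see (HDFS, S3, NFS) rather than a local directory on each worker, since otherwise each machine holds only a fragment of the files the reader expects to find.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionedParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-sketch")
      .getOrCreate()

    // Placeholder data standing in for the DataFrame being saved.
    val df = spark.range(0, 100).withColumn("bucket", col("id") % 4)

    // Write to a path that is visible to every node (HDFS/S3/NFS),
    // not a local directory on each worker.
    df.write
      .partitionBy("bucket")
      .parquet("hdfs:///tmp/events_by_bucket")   // placeholder shared path

    // Reading the root path back discovers the partition column automatically.
    val readBack = spark.read.parquet("hdfs:///tmp/events_by_bucket")
    readBack.printSchema()
    spark.stop()
  }
}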

djibb.

Can someone tell me how to update a column in a DataFrame, please?
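
A minimal sketch of the usual approach, since a DataFrame is immutable: calling withColumn with an existing column name returns a new DataFrame in which that column is replaced. The toy column names and the expression are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object UpdateColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("update-column")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 10), ("b", -5)).toDF("key", "value")   // toy data

    // Reusing an existing column name in withColumn overwrites that column
    // in the returned DataFrame; here negative values are clamped to zero.
    val updated = df.withColumn("value", when(col("value") < 0, 0).otherwise(col("value")))

    updated.show()
    spark.stop()
  }
}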

harihs

Good, to-the-point presentation, aside from a few misspellings and mispronunciations; for instance, the join is "Cartesian" and the serializer is "Kryo", not "Crayo".

bool