Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael

Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
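
The talk centers on the typed Dataset API and Structured Streaming. As a rough illustration of the Dataset style it describes, here is a minimal, self-contained sketch; the Event case class, the events.json path, and the query itself are illustrative placeholders, not code from the talk.

import org.apache.spark.sql.SparkSession

// Hypothetical record type; any case class whose fields match the column names works.
case class Event(userId: Long, page: String, durationMs: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read JSON as an untyped DataFrame, then view it as a typed Dataset[Event].
    // "events.json" is a placeholder path.
    val events = spark.read.json("events.json").as[Event]

    // Typed, compiler-checked transformation; Catalyst still optimizes the plan.
    val slowPages = events
      .filter(e => e.durationMs > 1000)
      .groupByKey(_.page)
      .count()

    slowPages.show()
    spark.stop()
  }
}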

// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

Follow Michael on -
Comments

Superb! I must say, the best presentation I've seen in a long time.

Powers

The best YouTube Spark video I've found so far!

sitientibus

Found this presentation both informative and engaging - GLAD you recorded it.
Like others, there's SO much here that I've already stopped/rewound/restarted portions numerous times, until it took me an hour to get through a 28-minute presentation ;-)
I too noticed a few "verbal typos", but it was clear you UNDERSTOOD the terms, so it was easy to follow the slides while listening ;-)

ONE QUESTION:
I've reviewed it repeatedly, but I'm still NOT sure about one thing:
Around the 8:05 mark you show a slide saying "Stringly-typed methods will 'downcast' to generic Row objects."
- "Stringly-typed" is a SLANG term hinting at developers who type MOST of their variables as String, but IS this what you really MEANT?
- "Strongly-typed" is a NON-slang term, and actually (to me) makes as much or MORE sense in that sentence.

QUESTION: DID you really mean STRINGLY-typed, or STRONGLY-typed, and
- IF STRINGLY-typed, why? and why NOT STRONGLY-typed?

Hope you (or anyone else) still reply after 18 months :-)

KEEP UP THE GREAT POSTS !
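
For readers weighing the question above, here is a small sketch of the distinction the slide appears to draw, offered as one reading rather than the presenter's own answer: methods that take column names as plain strings return an untyped DataFrame (an alias for Dataset[Row]), while lambda-based operations keep the element type and are checked by the Scala compiler. The Person case class and sample values are made up for illustration.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)   // hypothetical example type

object StringlyVsStrongly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typing-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 34L), Person("Bo", 17L)).toDS()

    // "Stringly-typed": the column is named with a string, so a misspelled name
    // only fails at runtime, and the result drops to an untyped DataFrame
    // (i.e. Dataset[Row]).
    val names: DataFrame = people.select("name")

    // Strongly-typed: the lambda is checked at compile time and the result
    // keeps its element type.
    val adults: Dataset[Person] = people.filter(_.age >= 18)

    names.show()
    adults.show()
    spark.stop()
  }
}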

markevogt

Great add-ons, and a crisp and clear presentation!

donluc

Scala + Kafka + Spark = SuperDataPipeline

FernandoRacca

Great talk!!!
I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path (the same path) on each worker. The problem is that I am able to save it, but when I try to read it back I get these errors: "Could not read footer for file: FileStatus..." and "unable to specify Schema ...". Any suggestions?
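
A hedged sketch of the write/read pattern in the question; the column name and paths are placeholders. The common advice for "Could not read footer" failures in this setup is to write to storage every node can see (HDFS, S3, NFS) rather than a local directory on each worker, since otherwise each machine holds only a fragment of the files the reader expects to find.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionedParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-sketch")
      .getOrCreate()

    // Placeholder data standing in for the DataFrame being saved.
    val df = spark.range(0, 100).withColumn("bucket", col("id") % 4)

    // Write to a path that is visible to every node (HDFS/S3/NFS),
    // not a local directory on each worker.
    df.write
      .partitionBy("bucket")
      .parquet("hdfs:///tmp/events_by_bucket")   // placeholder shared path

    // Reading the root path back discovers the partition column automatically.
    val readBack = spark.read.parquet("hdfs:///tmp/events_by_bucket")
    readBack.printSchema()
    spark.stop()
  }
}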

djibb.

Can someone tell me how to update a column in a DataFrame, please?
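
A minimal sketch of the usual approach, since a DataFrame is immutable: calling withColumn with an existing column name returns a new DataFrame in which that column is replaced. The toy column names and the expression are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object UpdateColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("update-column")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 10), ("b", -5)).toDF("key", "value")   // toy data

    // Reusing an existing column name in withColumn overwrites that column
    // in the returned DataFrame; here negative values are clamped to zero.
    val updated = df.withColumn("value", when(col("value") < 0, 0).otherwise(col("value")))

    updated.show()
    spark.stop()
  }
}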

harihs

Good, to-the-point presentation, aside from a few misspellings and mispronunciations; for instance, the join is "Cartesian" and the serializer is "Kryo", not "Crayo".

bool