Modern Spark DataFrame & Dataset | Apache Spark 2.0 Tutorial

Adam Breindel, lead Spark instructor at NewCircle, talks about which APIs to use for modern Spark with a series of brief technical explanations and demos that highlight best practices, latest APIs, and new features. (Topics Indexed Below)

We'll look at how Dataset and DataFrame behave in Spark 2.0, Whole-Stage Code Generation, and go through a simple example of Spark 2.0 Structured Streaming (Streaming with DataFrames) that you can run in your own free instance of Databricks.

00:00:40 - Intro: What is "Modern Spark"
00:01:26 - DataFrame
00:05:07 - Why not use RDD?
00:09:15 - Intro to DataFrame and Dataset
00:10:13 - DataFrame versus Dataset
00:14:42 - Dataset Queries and Dataset with Scala classes
00:19:07 - Spark Query Optimizer
00:23:26 - Whole-Stage Codegen
00:27:21 - Hive integration
00:29:28 - Wrapping Up DataFrame/Dataset Benefits
00:30:54 - One More Thing - Structured Streaming
00:36:47 - Conclusion

Try the Examples:

----------------------------------------------------------------------------------------------
SPARK 2.0 TRAINING | NewCircle | Onsite & Public Classes
----------------------------------------------------------------------------------------------
+ Programming for Spark 2.0 (3 days):

+ Spark 2.0 for Machine Learning & Data Science (3 days):
Comments
Author

Excellent - I very much enjoyed this clear and concise explanation. Thank you!

chrisf
Author

Thank you so much, really wonderful. Sir, could you share a video tutorial on partitioning and custom partitioning, how to configure it (i.e. the number of executors and cores on the cluster), and how to run jobs for better performance?

SpiritOfIndiaaa
Author

Does anyone know how I can get the code for this tutorial? The link above is broken. Thanks.

theCanadian
Author

Looks like the workbook link no longer works. Is it possible to provide an updated link for this please?

davidburt
Author

@Adam, is it ideal to use DataFrame in Spark when you don't know the columns up front? I am building an API service on raw Spark RDDs rather than using DataFrame. Is

ebottabi
Author

Hi, tried to run the example, but got an error:
<console>:32: error: not found: value spark

^

Looks like something changed; I tried both the old and the new notebook. Thanks for the great video.

anders
Author

Thanks for your video. Awesome explanation. Could you please explain the message "Symbol SQLContext is deprecated. Use SparkSession.builder instead"? That would be very useful.

shayshaswishes
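
Regarding the deprecation message quoted above: in Spark 2.0, `SparkSession` became the single entry point, and the `spark` value that the 2.0 shell and Databricks notebooks predefine is an instance of it. An environment that does not predefine it (a Spark 1.x shell, for example) reports `not found: value spark`. The following is a minimal sketch, assuming Spark 2.x is on the classpath and a local master is acceptable; the object name, app name, and sample data are hypothetical and not taken from the video's notebook.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  // Build (or reuse) the Spark 2.x entry point instead of constructing
  // a SQLContext directly.
  def build(): SparkSession =
    SparkSession.builder()
      .appName("spark-session-sketch") // hypothetical application name
      .master("local[*]")              // assumption: run locally on all cores
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()
    import spark.implicits._ // enables toDF on local collections

    // Hypothetical sample data, only to show the session is usable.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    df.show()

    // Legacy code that still expects a SQLContext can obtain one from the session.
    val legacySqlContext = spark.sqlContext
    println(legacySqlContext.sparkSession eq spark)

    spark.stop()
  }
}
```

In a notebook or shell that already defines `spark`, the `build()` step is unnecessary; `getOrCreate()` would simply return the existing session.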
Author

classOf[DataFrame] == classOf[Dataset[_]] returns false on my laptop

arunbm
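
On the `classOf` comparison above: Spark 2.0 declares `DataFrame` as a type alias, `type DataFrame = Dataset[Row]`, so both sides name the same runtime class there. In Spark 1.6, `DataFrame` and `Dataset` were separate classes, which is one possible reason for a `false` result. The sketch below uses stand-in classes rather than Spark's own (`Row` and `Dataset` here are hypothetical placeholders) to show how the alias behaves without needing Spark installed.

```scala
object AliasDemo {
  // Stand-ins for illustration only; these are NOT Spark's classes.
  final case class Row(values: Seq[Any])
  class Dataset[T]

  // Mirrors the shape of the Spark 2.0 declaration: an alias, not a new class.
  type DataFrame = Dataset[Row]

  // Type arguments are erased at runtime, so both expressions
  // resolve to the same Class object for Dataset.
  def sameClass: Boolean = classOf[DataFrame] == classOf[Dataset[_]]

  def main(args: Array[String]): Unit =
    println(sameClass) // prints true
}
```

Running the original comparison against the real classes in a Spark 2.x shell, after `import org.apache.spark.sql.{DataFrame, Dataset}`, should behave the same way as this stand-in.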