Distributed Data Show Episode 84: Spark 3 Preview with Holden Karau

Patrick and Holden talk about the highlights of Spark 2.4, what's coming in Spark 3, and why code reviewers are vital to open source projects.

Highlights:
0:15 - Welcoming Holden back to the show
0:50 - Spark 2.4 is out - highlights include Apache Arrow integration for better interoperability between the JVM, Python and R runtimes
2:05 - Python is becoming a first-class citizen in the Spark world
2:50 - Projects including Arrow and Spark have a real need for code reviewers who know both Python and Java
4:15 - Livestreaming code reviews
5:52 - The changes that need review tend to be the gnarly issues; even a first-pass, high-level review helps
7:10 - Spark 3 highlights (note it's not backward compatible) - new Spark SQL engine
8:00 - Python 2.7 support will be deprecated in Spark 3
9:03 - Spark MLlib will also be deprecated in favor of SparkML
10:10 - Spark Streaming data source APIs are changing
11:17 - Kubernetes integration is improving, especially scaling down
13:20 - This helps with the #1 cloud concern - cost control
14:50 - Deep learning pipeline support is being added, the approach is pluggable (bring your own DL libraries)
17:05 - Why OSS releases are late - code reviewers, feature creep, agreeing on priorities
18:40 - Wrapping up
ABOUT DATASTAX ENTERPRISE 6
DataStax powers the Right-Now Enterprise with the always-on, distributed cloud database built on Apache Cassandra™ and designed for hybrid cloud. DataStax Enterprise 6 (DSE 6) includes industry-leading performance, self-driving operational simplicity, and robust analytics.
