Distributed Data Show Episode 38: Spark 3.0 and Beyond with Holden Karau

David Gilardi talks with Holden Karau of Google to mine many wonderful nuggets on the future of Spark and find out what might happen if she had a magic wand of awesomeness.

Highlights!

0:15 - Welcoming Holden back to the show
0:30 - So what exactly is going to be in Spark 3? Significant updates to the SQL and Machine Learning (ML) APIs. There are missing pieces in the ML API, and adding them will cause breaking changes to existing models. One example is support for online model serving.
2:25 - The Dataset API does not yet fully cover all needed cases, causing developers to jump back to the RDD APIs, so some API changes will be needed there. There will be continued performance improvements in query planning in minor releases.
3:13 - Python changes could include support for vectorized UDFs in the RDD APIs
4:35 - Why it's so hard to pin down when Spark 3 will appear: breaking API changes have to be worth it, so the release has to wait until the payoff in capability justifies the breakage. An example would be making the ML APIs typesafe.
6:57 - What Holden would change in Spark, given a magic wand: a shared memory buffer between languages using Apache Arrow
9:46 - Wrapping up: the most exciting change likely to be in Spark 3 is online model serving

ABOUT DATASTAX ENTERPRISE 5.1
DataStax Enterprise 5.1, the database platform for cloud applications, includes Apache Cassandra 3.x with materialized views, tiered storage and advanced replication. Introduced in 5.1 is DataStax Enterprise Graph, the first graph database fast enough to power customer-facing applications, scale to massive datasets and integrate advanced tools to power deep analytical queries.
