User Defined Aggregation in Apache Spark: A Love Story

preview_player
Показать описание
Defining customized scalable aggregation logic is one of Apache Spark’s most powerful features. User Defined Aggregate Functions (UDAF) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality ranging from specialized summary techniques to building blocks for exploratory data analysis. And yet as powerful as they are, UDAFs prior to Spark 3.0 have had subtle flaws that can undermine both performance and usability.

In this talk Erik will tell the story about how he met UDAFs and fell in love with their powerful features. He’ll describe how he faced challenges with the UDAF design and its performance properties and how, with the help of the Apache Spark community, he eventually fixed the UDAF design in Spark 3.0 and fell in love all over again. Along the way you’ll learn about how User Defined Aggregation works in Spark, how to write your own UDAF library and how Spark’s newest UDAF features improve both usability and performance. You’ll also hear how Spark’s code review process made these new features even better and learn tips for successfully shepherding a large feature into the Apache Spark upstream community.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Connect with us:
Рекомендации по теме
Комментарии
Автор

Great Work, Thanks, i have only one question related with,
val serde: ExpressionEncoder[TDigestSql] =

Is TDigestSql a scala class?

I tried to do the same thing as you, only I get error Caused by: illegal cyclic reference involving class AnySketch, AnySketch is Java Class

Should Class AnySketch implement some Interfaces?

Pampastu
Автор

is UDAF also supported in structured streaming 2.4 version?

ruchitarun
welcome to shbcf.ru