Spark Schema For Free with Dávid Szakállas (Whitepages)

DataFrames are essential for high-performance Spark code, but in Scala they still lag behind RDDs in development experience. When we started migrating our existing Spark application from RDDs to DataFrames at Whitepages, we had to scratch our heads hard to come up with a good solution: DataFrames give up compile-time type safety, and support for encoding JVM types is limited.

We wanted more descriptive types without the overhead of Dataset operations. The data binding API should be extensible, schemas for input files should be generated from classes when we don't want inference, and UDFs should be more type-safe. Spark does not provide these natively, but with the help of shapeless and type-level programming we found a solution to nearly all of our wishes. We migrated the RDD code without changing our domain entities, writing schema descriptions by hand, or breaking binary compatibility with our existing formats. Instead we derived the schemas, data binding and UDFs, sacrificing as little type safety as possible while still enjoying the performance of DataFrames.
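The description itself contains no code, but as a minimal sketch of the schema-derivation idea (illustrative only, not the actual Whitepages library), a shapeless LabelledGeneric-based type class can map a Scala case class to a Spark StructType. The SchemaFor/StructSchemaFor names, the small set of primitive instances, and the non-nullable fields below are assumptions for the example:

import org.apache.spark.sql.types._
import shapeless._
import shapeless.labelled.FieldType

// Type class: compute a Spark DataType for a Scala type at compile time.
trait SchemaFor[T] {
  def dataType: DataType
}

object SchemaFor {
  def apply[T](implicit s: SchemaFor[T]): SchemaFor[T] = s

  private def instance[T](dt: DataType): SchemaFor[T] =
    new SchemaFor[T] { val dataType: DataType = dt }

  // Primitive instances (only a small subset, for illustration).
  implicit val intSchema: SchemaFor[Int]       = instance(IntegerType)
  implicit val longSchema: SchemaFor[Long]     = instance(LongType)
  implicit val doubleSchema: SchemaFor[Double] = instance(DoubleType)
  implicit val stringSchema: SchemaFor[String] = instance(StringType)

  // Helper type class: build the field list of a record from its labelled HList.
  trait StructSchemaFor[L <: HList] {
    def fields: List[StructField]
  }

  implicit val hnilSchema: StructSchemaFor[HNil] =
    new StructSchemaFor[HNil] { val fields: List[StructField] = Nil }

  implicit def hconsSchema[K <: Symbol, H, T <: HList](implicit
      key: Witness.Aux[K],
      head: SchemaFor[H],
      tail: StructSchemaFor[T]
  ): StructSchemaFor[FieldType[K, H] :: T] =
    new StructSchemaFor[FieldType[K, H] :: T] {
      def fields: List[StructField] =
        StructField(key.value.name, head.dataType, nullable = false) :: tail.fields
    }

  // Derive a StructType for any case class via LabelledGeneric.
  implicit def productSchema[P <: Product, L <: HList](implicit
      gen: LabelledGeneric.Aux[P, L],
      struct: StructSchemaFor[L]
  ): SchemaFor[P] =
    instance(StructType(struct.fields))
}

// Usage: the schema comes from the existing domain class, no hand-written description.
final case class Person(name: String, age: Int, score: Double)

object SchemaDemo extends App {
  println(SchemaFor[Person].dataType) // prints a StructType with fields name, age, score
}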

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
