Vectorized UDF: Scalable Analysis with Python and PySpark - Li Jin

Li Jin, a software engineer at Two Sigma, presents a new type of PySpark UDF: the vectorized UDF.

Over the past few years, Python has become the default language for data scientists. Packages such as pandas, NumPy, statsmodels, and scikit-learn have gained wide adoption and become mainstream toolkits. At the same time, Apache Spark has become the de facto standard for processing big data. Spark ships with a Python interface, PySpark; however, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds: the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
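As a rough illustration of the idea (the column name and data below are invented, not from the talk), the body of a vectorized UDF is an ordinary function over pandas Series, so the arithmetic runs once per Arrow batch instead of once per row:

```python
import pandas as pd

# The UDF body receives whole pandas Series, one Arrow batch at a
# time, so the division below is vectorized rather than per-row.
def cents_to_dollars(cents: pd.Series) -> pd.Series:
    return cents / 100.0

def run_on_spark():
    """Sketch of registering and applying the UDF (requires pyspark)."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    to_dollars = pandas_udf(cents_to_dollars, "double")
    df = spark.createDataFrame([(199,), (250,)], ["cents"])
    df.select(to_dollars("cents").alias("dollars")).show()
    spark.stop()
```

Because Spark ships each batch to the Python worker as an Arrow buffer and hands it to `cents_to_dollars` as a Series, the per-row Python serialization overhead of a classic UDF disappears.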

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

One of the best videos on this topic that I have seen. Thanks!

dannykaplun

Thanks for the great talk! When using a grouped map pandas UDF for model training, I am assuming that your group column (id) is unique. Doesn't this hurt performance, since it needs to group by unique items?

haneulkim
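A minimal sketch of the grouped map pattern the question above refers to (the column names, the `id` grouping key, and the slope-fitting model are invented for illustration): each group arrives as one pandas DataFrame, and the function returns one result row per group.

```python
import numpy as np
import pandas as pd

# Receives all rows of one group as a pandas DataFrame and returns a
# one-row DataFrame with the fitted model parameter for that group.
def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    slope = np.polyfit(pdf["x"], pdf["y"], deg=1)[0]
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "slope": [slope]})

def run_on_spark():
    """Sketch of the Spark side (requires pyspark). Note: if every id
    is unique, each "group" is a single row, so the shuffle cost of the
    groupBy is paid without amortizing the per-group overhead."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(1, 0.0, 0.0), (1, 1.0, 2.0), (2, 0.0, 1.0), (2, 1.0, 4.0)],
        ["id", "x", "y"],
    )
    result = df.groupBy("id").applyInPandas(fit_group, "id long, slope double")
    result.show()
    spark.stop()
```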

Aren't pandas UDFs working on multiple rows? Please correct me if I'm wrong.

Gerald-izmv

Great video, thanks so much. I am using a pandas UDF as a solution where I ran into severe memory issues because of the serialization involved with Python objects. I am using a grouped map pandas UDF, and it expects a static return type, which poses a challenge in writing generic functions that can be decorated with pandas_udf. Is there a way I can infer the return type at run time while using a pandas UDF?

vinothpsg
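On the return-type question above: the schema for a grouped map pandas UDF can be supplied as a DDL-formatted string, and that string can be built at run time, for example by running the function on a small sample and mapping the resulting pandas dtypes. A hedged sketch (the helper name and dtype mapping below are ad hoc, not part of any Spark API):

```python
import pandas as pd

# Ad-hoc mapping from common pandas dtypes to Spark SQL DDL types;
# extend as needed for the columns your functions actually return.
_DTYPE_TO_DDL = {
    "int64": "long",
    "float64": "double",
    "object": "string",
    "bool": "boolean",
}

def infer_ddl_schema(fn, sample_pdf: pd.DataFrame) -> str:
    """Run `fn` on a small sample DataFrame and derive a DDL schema
    string from the dtypes of its output."""
    out = fn(sample_pdf)
    return ", ".join(
        f"{col} {_DTYPE_TO_DDL[str(dtype)]}" for col, dtype in out.dtypes.items()
    )

# Usage sketch (Spark side, requires pyspark; `my_fn` and `sample_pdf`
# are hypothetical):
#   schema = infer_ddl_schema(my_fn, sample_pdf)
#   df.groupBy("id").applyInPandas(my_fn, schema=schema)
```

The schema is still fixed at the moment the UDF is registered, but computing it programmatically avoids hard-coding a return type into each decorated function.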