Improving Apache Spark with S3 - Ryan Blue

"Apache Spark has become a popular and successful way for Python programming to parallelize and scale up data processing. In many use cases though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.

Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, with minimal overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the in-memory representation directly to the socket, avoiding serialization costs.

Session hashtag: #SFdev3"

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

I think the description does not match the video?

joshuahendinata

Is there any solution for those using EMR? Patching the source isn't going to work there, since the Spark binaries are all provided by EMR.

Wafffl