Spark for Data Science, Big and Small, Joseph K. Bradley, 20151024

Joseph K. Bradley, Databricks
Data Science Camp 2015 Keynote
We discuss recent and upcoming advances in Apache Spark to facilitate data science. Spark’s wide adoption largely stems from allowing fast, iterative analysis, both on a laptop and on large computing clusters. This interactivity has led many data scientists to adopt Spark for both exploratory analysis and production modeling and scoring.

In response, the Spark community has been working on key features to further improve the experience of data scientists. This talk will highlight some of these features, mention use cases, and discuss recent and ongoing work on optimizations and extended functionality.

Spark DataFrames, introduced in Spark 1.3, allow manipulation of distributed data using a friendly API inspired by R and Python pandas.
Machine Learning Pipelines, introduced in Spark 1.2, facilitate construction of ML workflows and model tuning.
SparkR, shipped with Spark 1.4, provides an API for R users to work with distributed data, and we continue to work toward feature parity for the R API.
For each of these items, we are working on improving integrations with familiar data science tools such as R and Python data frames and scikit-learn. Initial PMML support, added in Spark 1.4, allows users to export models to other tools and deployments.

This talk will be accessible to new Spark users, and will also provide insights, references, and tips helpful for experienced users.

Speaker Bio

Joseph Bradley is a Spark committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon University in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.