PyDSLA: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python (May 5, 2016)


Talk by Miklos Christine, solutions engineer at Databricks

Apache Spark is the next big data processing tool for data scientists. According to a recent Stack Overflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform and using Spark to process it for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged that gap. I'll use PySpark as the interface to the platform and walk through a demo to showcase the APIs.

Talk Overview:

Spark's architecture: what's out now and what's in Spark 2.0
Spark APIs: the most commonly used APIs
Common misconceptions and proper techniques for using Spark

Demo:

Walk through ETL of the Reddit dataset
Spark SQL analytics and visualizations of the dataset using Matplotlib
Sentiment analysis on Reddit comments

Speaker:

Miklos Christine is a solutions engineer at Databricks, where he helps customers deploy and use Apache Spark to build batch and streaming applications. Miklos was previously a systems engineer at Cloudera, where he helped strategic customers deploy and use the Apache Hadoop ecosystem in production. He has contributed to several projects in the open source community and holds a BS in electrical engineering and computer sciences from the University of California, Berkeley.