Holden Karau: A brief introduction to Distributed Computing with PySpark

PyData Seattle 2015
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example that comes with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages, this workshop will focus on using Spark and Python together.

Materials available here:

Comments

Awesome talk! You can always tell when someone is really excited to be showing off the stack they're presenting! Thanks Holden!

bmurph

I'm not a Python dev, but despite the title, there was very little Python-specific stuff. I can wholeheartedly recommend the video to anyone who'd like to learn about Spark beyond a simple WordCount example. Watch it!

JacekLaskowskiJapila