PySpark: Python API for Spark

Summary:
00:33 What is Spark?
03:00 What is PySpark?
03:45 Example: word count (see the first sketch after this list)
04:35 Demonstration of the interactive shell on AWS EC2
06:37 Spark web interface
11:20 API documentation
11:27 Python doctest: creating tests from interactive examples
12:39 Getting help: help(sc)
13:18 PySpark implementation details
14:15 PySpark is under 2K lines, including comments
17:18 Pickled objects: RDD[Array[Byte]]
17:44 Batching pickles to reduce overhead (second sketch after this list)
18:00 Consolidating operations into a single pass when possible (third sketch)
19:27 PySpark roadmap: adding sorting support, file formats such as CSV, PyPy JIT
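
The word count from the talk is the canonical PySpark example. A minimal sketch, assuming the interactive shell (where the SparkContext is already bound to sc) and the "wikipedia-100" text file mentioned in the demo; any text file works:

    from operator import add

    lines = sc.textFile("wikipedia-100")
    counts = (lines.flatMap(lambda line: line.split())  # split each line into words
                   .map(lambda word: (word, 1))         # pair each word with a count of 1
                   .reduceByKey(add))                    # sum the counts per word
    print(counts.take(10))  # peek at ten (word, count) pairs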
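
The batching point can be seen outside of Spark: pickling many small objects one at a time repeats the per-payload framing cost, while pickling them as one batch amortizes it (and lets pickle's memo deduplicate repeated values). A toy illustration, not PySpark's actual serializer code:

    import pickle

    items = [(i, "word") for i in range(10000)]

    one_at_a_time = sum(len(pickle.dumps(x)) for x in items)  # 10,000 separate payloads
    batched = len(pickle.dumps(items))                        # one payload for the batch

    print(one_at_a_time, batched)  # the batched payload is substantially smaller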
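
Consolidation works the same way in miniature: chained maps can be fused by composing their functions, so the data is traversed once with no intermediate collection. Again a toy illustration, not PySpark's actual pipelining code:

    def compose(f, g):
        return lambda x: g(f(x))

    data = list(range(10))

    # Two passes, with an intermediate list:
    two_pass = [x + 1 for x in [x * 2 for x in data]]

    # One pass, same result, no intermediate collection:
    fused = compose(lambda x: x * 2, lambda x: x + 1)
    one_pass = [fused(x) for x in data]

    assert two_pass == one_pass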
Comments:

trunghlt:
Great presentation. I'm wondering: if I use NumPy, do I have to install NumPy on the Spark workers?

joshrosen:
NumPy will have to be present on the workers' Python import paths. PySpark has a SparkContext.addPyFile() mechanism for shipping library dependencies with jobs. I'm not sure whether NumPy binaries can be packaged as .egg or .zip files for that, though.

Another option is to install NumPy somewhere and add its installation path to PYTHONPATH in spark-env.sh (on each worker) so that it's set in each worker's environment when they launch their Python processes.
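
A minimal sketch of the addPyFile() route joshrosen describes; the master URL, archive path, and mylib module are hypothetical placeholders:

    from pyspark import SparkContext

    sc = SparkContext("spark://master:7077", "addPyFile example")  # hypothetical master URL
    sc.addPyFile("deps/mylib.zip")  # shipped to every worker with the job

    def use_dep(x):
        import mylib  # hypothetical module from the zip, importable on the workers
        return mylib.transform(x)

    print(sc.parallelize(range(4)).map(use_dep).collect())

    # The PYTHONPATH alternative: in each worker's conf/spark-env.sh, something like
    #   export PYTHONPATH="/path/to/numpy/install:$PYTHONPATH"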

kidexp:
It's great. I just wonder whether we can use PySpark without installing Spark. For example, can I install PySpark on my local machine (without Spark installed) and use it to connect to a remote Spark cluster?

GlennStrycker:
Is there a place to download (or re-create) Josh's example data file "wikipedia-100"? I'd like to play along with his tutorial during the video.