Apache Spark - Computerphile

Analysing big data stored on a cluster is not easy. Spark allows you to do so much more than just MapReduce. Rebecca Tickle takes us through some code.

This video was filmed and edited by Sean Riley.

Comments

Note to the editor: please stop cutting away from the code so quickly. We're trying to follow along in the code based on what she's saying; at that moment, we don't need to cut back to the shot of her face. We can still hear her voice in the voiceover.

notangryjustdismayed

The RDD API is outmoded as of Spark 2.0 and in almost every use case you should be using the Dataset API. You lose out on a lot of improvements and optimizations using RDDs instead of Datasets.

Hourai
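
To make the comparison concrete, here is a minimal word-count sketch using the Dataset API, in Scala. This is not the code from the video; the SparkSession settings and the "input.txt" path are placeholder assumptions.

```scala
// Minimal sketch: word count with the Dataset API (Spark 2.x+).
// Assumes a local SparkSession and a plain-text file at the placeholder path "input.txt".
import org.apache.spark.sql.SparkSession

object DatasetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read lines as a Dataset[String], split into words, drop empties, count per word.
    val lines  = spark.read.textFile("input.txt")
    val words  = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
    val counts = words.groupBy("value").count()   // a Dataset[String]'s single column is named "value"

    counts.show()
    spark.stop()
  }
}
```

Because this goes through the Dataset/DataFrame layer, the query is planned by Spark's Catalyst optimizer, which is the kind of improvement the comment above refers to that plain RDD code misses.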

Pretty sure there's a typo in that code. "splitLines" doesn't exist and is probably supposed to be words.map(...) instead.

Bolt
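
For reference, the classic RDD word count that the typo comments seem to be describing usually looks something like the sketch below, in Scala. This is a reconstruction, not the exact code from the video; "input.txt" is a placeholder path.

```scala
// Sketch of the classic RDD word count (not the video's exact code).
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val lines = sc.textFile("input.txt")
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
    // The step in question: the map is over `words`, not some `splitLines` value.
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```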

Can you do Apache Kafka next? How do they compare?

Technomancr

Ahh, so refreshing after taking a week's break from dev work and sticking to non-dev topics. Lol, I love our field. Like music to my ears.

xIAMROOT

Is there any meta-analysis on the usefulness of big-data analysis? How often do jobs get run that either produce no meaningful results or none that are statistically significant?

recklessroges

Brady, please make a video on Kubernetes.

mm

feels like this video is four years too late ... :-/

Xakriss

Thank you for teaching an old man new things.

williamwurthmann

She refers to an earlier example. Did I miss that video? Otherwise, nicely done. Love learning about distributed computing.

KurtSchwind

Wow, congrats on the content. You were able to explain it in a concise yet logical and detailed way. Nice.

tablit.

A great example of how programming languages are a reasonably efficient mechanism for communicating sections of a program, and how natural language really is not.

tackline

These data ones are really good! Keep them coming!

alexkompos

She's damn good at explaining and easy to listen to. Any plans of having her host other episodes?

(Sorry for "her", I don't know her name.)

Mmouse_

For anyone interested: although the documentation for Apache Flink is awful and it doesn't support Java versions beyond 8, it at least lets you run setup on each node. Spark has no functionality for running one-time setup on each node, which makes it infeasible for many use cases. These distributed processing frameworks are quite opinionated, and if you're not doing word count, or streaming data from one input stream to another with very simple stateless transformations in between, you'll find little help in the documentation or the built-in functionality. They're not really designed for use cases where you have a parallel program with a fixed-size data source known in advance and want to scale it up as you would by adding more threads; they're more for continuous data processing.

nO_dNAL
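
A common workaround for the per-node setup issue mentioned above is to put the expensive initialisation behind a lazily initialised singleton, so each executor JVM runs it once. A rough sketch, with hypothetical names (`Setup`, `client`) and local-mode settings chosen purely for illustration:

```scala
// Sketch: per-executor one-time setup via a lazily initialised singleton.
// `Setup` and `client` are hypothetical; in practice the resource would be a
// non-serialisable client (DB connection, model, etc.) created once per JVM.
import org.apache.spark.sql.SparkSession

object Setup {
  lazy val client: String = {
    println("one-time setup on this executor")
    "initialised"
  }
}

object PerExecutorSetupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PerExecutorSetup")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 100, numSlices = 4)
    val results = data.mapPartitions { iter =>
      val c = Setup.client            // initialised at most once per executor JVM
      iter.map(n => s"$c -> $n")
    }

    results.take(5).foreach(println)
    spark.stop()
  }
}
```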

Typo in line 32: `splitLines` used instead of `word`?

PaulSukys

It's so clear and easy after the explanation! I'll be waiting for more vids about clustering and distributed computing.

xakkep

More of these, please. More big data.

MJ-em_jay

Computerphile will be excited to learn that tripods exist.

michaelebbs

I wish she had also talked a little about Spark's ability to deal with data streams.

Alex
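
For anyone curious about the streaming side mentioned above, a minimal Structured Streaming word count looks something like the sketch below, assuming a local socket source on port 9999 (e.g. fed by `nc -lk 9999`). This is illustrative only, not from the video.

```scala
// Sketch: running word count over a socket stream with Structured Streaming.
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Each incoming line becomes a row in an unbounded DataFrame.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val words  = lines.as[String].flatMap(_.split("\\s+")).filter(_.nonEmpty)
    val counts = words.groupBy("value").count()

    // Print the running counts to the console after every micro-batch.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```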