Apache Spark Week Day 2 | #101

On day two of Apache Spark week we look at the major Apache Spark concepts: RDDs, transformations and actions, caching, and broadcast variables.

Also check out Jacek Laskowski's book on Spark:

Check out my free 100+ page Data Engineering Cookbook on GitHub:

Please SUPPORT WHAT YOU LIKE:

(Send a message and I'll read it on the stream)

- YouTube SuperChats while live streaming

- As an Amazon Associate I earn from qualifying purchases from Amazon. Just use this link:

- I get asked a lot about my podcast gear. This is a list of all the equipment I currently use to create this Podcast:

#ApacheSpark #DataEngineering #PlumbersofDataScience #bigdata
Comments

Hi Andreas! It's worth mentioning that when you use Spark caching (or memory-only persistence) for a significant amount of data, it's almost inevitable that you will run into OOM problems, and solving them will require changing Spark's memory management settings (see spark.memory.fraction and
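A hedged PySpark sketch of the caching point in this comment, assuming a local Spark installation. It sets `spark.memory.fraction` explicitly and persists with `StorageLevel.MEMORY_AND_DISK`, so partitions that do not fit in memory spill to disk. The function name, app name, data size, and the value 0.6 (Spark's documented default) are illustrative assumptions, not tuning advice.

```python
def cache_with_spill_demo():
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (
        SparkConf()
        .setMaster("local[2]")
        .setAppName("cache-demo")  # hypothetical app name
        # Share of (heap - 300 MB) used for execution and storage; 0.6 is the default.
        .set("spark.memory.fraction", "0.6")
    )
    sc = SparkContext.getOrCreate(conf)
    try:
        rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)
        # rdd.cache() would use MEMORY_ONLY; MEMORY_AND_DISK spills to disk instead.
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        first = rdd.count()   # first action materializes the persisted partitions
        second = rdd.count()  # second action reuses them
        return first, second
    finally:
        sc.stop()
```

Calling `cache_with_spill_demo()` runs two actions over the same persisted RDD. On real workloads the right storage level and memory settings depend on the data size and the executor heap.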

And regarding broadcast variables and accumulators: the first is just a way to distribute data needed for a computation among the executors. It is not writable on the workers. An accumulator is like a counter in the MapReduce framework. It is used to pass information from the workers to the driver, so it is writable from the executors but can be read only from the driver.
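The two roles described here can be sketched in PySpark as follows, assuming a local Spark installation. The lookup table, the input codes, and the function names are hypothetical. The broadcast value is only read inside tasks, while the accumulator is only added to inside tasks and read on the driver.

```python
def broadcast_and_accumulator_demo():
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[2]").setAppName("broadcast-accumulator-demo")
    sc = SparkContext.getOrCreate(conf)
    try:
        # Read-only lookup table shipped to the executors.
        country_names = sc.broadcast({"DE": "Germany", "FR": "France"})
        # Counter that tasks can only add to; the driver reads the total.
        unknown_codes = sc.accumulator(0)

        def resolve(code):
            name = country_names.value.get(code)
            if name is None:
                unknown_codes.add(1)
                return "unknown"
            return name

        resolved = sc.parallelize(["DE", "FR", "XX", "DE"]).map(resolve).collect()
        # .value is read here on the driver, after the action has finished.
        return resolved, unknown_codes.value
    finally:
        sc.stop()
```

One caveat: Spark only guarantees exactly-once accumulator updates for updates made inside actions. Updates made inside a transformation such as `map` can be applied more than once if a task or stage is re-executed.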

dmitryamosov

Hi Andreas, thank you for such a great initiative. Are you doing this streaming daily? A question: what is the difference between SparkContext.addFile() and a broadcast variable? The first adds a file to a distributed cache; the latter also puts a smaller file into the executors' memory.
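One way to look at the difference asked about here, as a hedged PySpark sketch assuming a local Spark installation: `SparkContext.addFile()` ships a file to every node, and tasks locate their local copy with `SparkFiles.get()` and read it themselves, whereas a broadcast variable ships an in-memory Python object that tasks read through `.value`. The file name, its contents, and the function names are hypothetical.

```python
def add_file_vs_broadcast_demo():
    # Imported lazily so this file can be loaded without PySpark installed.
    import os
    import tempfile
    from pyspark import SparkConf, SparkContext, SparkFiles

    conf = SparkConf().setMaster("local[2]").setAppName("addfile-vs-broadcast-demo")
    sc = SparkContext.getOrCreate(conf)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "lookup.txt")  # hypothetical file
            with open(path, "w") as f:
                f.write("DE\nFR\n")

            # addFile: the file itself is copied to each node's local disk.
            sc.addFile(path)

            def known_from_file(code):
                # Each task opens the node-local copy of the file.
                with open(SparkFiles.get("lookup.txt")) as fh:
                    return code in fh.read().split()

            # broadcast: an already-parsed object is shipped to the executors.
            lookup = sc.broadcast({"DE", "FR"})

            def known_from_broadcast(code):
                return code in lookup.value

            codes = sc.parallelize(["DE", "XX"])
            return (
                codes.map(known_from_file).collect(),
                codes.map(known_from_broadcast).collect(),
            )
    finally:
        sc.stop()
```

Both lookups give the same answers in this sketch. The practical difference is that the file must be opened and parsed by the tasks, while the broadcast value arrives as a ready-to-use object.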

sumityadav

It would be a great kindness to me.

KK-lrjq