Beyond Shuffling: Scaling Apache Spark by Holden Karau

This video was recorded at Scala Days Berlin 2016

Abstract:
This session covers our and the community's experiences scaling Spark jobs to large datasets, the resulting best practices, and code snippets to illustrate them.

The planned topics are:
* Using Spark counters for performance investigation
  Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI. (A short accumulator sketch follows this list.)
* Working with Key/Value Data
* Replacing groupByKey for awesomeness
  groupByKey makes it too easy to accidentally collect all of the values for a key in one place, where they can be too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations. (See the reduceByKey sketch after this list.)
* Effective caching & checkpointing
  Being able to reuse previously computed RDDs without recomputing them can substantially reduce execution time. Choosing when to cache, when to checkpoint, and which storage level to use can have a huge performance impact. (See the caching sketch after this list.)
* Considerations for noisy clusters
* Functional transformations with Spark Datasets
  How to get some of the benefits of Spark's DataFrames while still being able to work with arbitrary Scala code. (See the Dataset sketch after this list.)
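
To make the counters topic concrete, here is a minimal sketch (not code from the talk) of using a named accumulator to count malformed records while parsing, the same accumulator-for-validation pattern a commenter points to at 34:50. It assumes Spark 2.x (longAccumulator); the input path and record layout are invented.

    import org.apache.spark.sql.SparkSession

    object AccumulatorValidation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("accumulator-validation").getOrCreate()
        val sc = spark.sparkContext

        // Named accumulators also show up per-stage in the Spark UI.
        val malformed = sc.longAccumulator("malformed-records")

        // Hypothetical input: comma-separated lines that should have exactly 3 fields.
        val parsed = sc.textFile("hdfs:///input/events.csv").flatMap { line =>
          val fields = line.split(",")
          if (fields.length == 3) Some((fields(0), fields(2)))
          else { malformed.add(1L); None }
        }

        // Accumulator values are only dependable after an action has completed,
        // and retried tasks can over-count when incremented inside transformations.
        parsed.count()
        println(s"Malformed records: ${malformed.value}")

        spark.stop()
      }
    }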
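
For the groupByKey point, a small sketch under assumed data (per-key averages; the pair RDD and its contents are made up, and a live SparkContext sc is assumed, e.g. in spark-shell): reduceByKey combines values map-side, so no key's values ever have to be buffered in one place.

    import org.apache.spark.rdd.RDD

    // Hypothetical (key, score) pairs.
    val scores: RDD[(String, Double)] =
      sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // Risky: groupByKey materialises every value for a key before we touch it.
    val avgWithGroup = scores.groupByKey().mapValues(vs => vs.sum / vs.size)

    // Better: combine (sum, count) pairs map-side, then divide once per key.
    val avgWithReduce = scores
      .mapValues(v => (v, 1L))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }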
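
For caching and checkpointing, a hedged sketch (the paths and pipeline are invented, and an existing SparkContext sc is assumed): persist keeps a reused RDD available across jobs, while checkpoint writes it out and truncates its lineage.

    import org.apache.spark.storage.StorageLevel

    // Hypothetical RDD that is expensive to build and reused by two jobs below.
    val cleaned = sc.textFile("hdfs:///logs/*.log")
      .filter(_.nonEmpty)
      .map(_.split("\t"))

    // MEMORY_AND_DISK_SER spills serialized partitions to disk instead of recomputing them.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Checkpointing materialises the RDD and cuts its lineage, which helps recovery
    // (and iterative jobs) when the dependency chain gets very long.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
    cleaned.checkpoint()  // actually runs with the next action

    val lineCount   = cleaned.count()
    val fieldCounts = cleaned.map(_.length).countByValue()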
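
And for the Datasets topic, a sketch assuming Spark 2.x and an invented Purchase record type, written in spark-shell style: the Dataset keeps typed encoders and can still be mixed with relational operations, while filter and map take ordinary Scala lambdas (which are, as a trade-off, opaque to the optimizer).

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical record type.
    case class Purchase(user: String, amount: Double, country: String)

    val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    val purchases: Dataset[Purchase] = Seq(
      Purchase("ada", 120.0, "DE"),
      Purchase("bob", 30.0, "US")
    ).toDS()

    // Arbitrary Scala in the lambdas, but the result is still a typed Dataset.
    val bigEuropean: Dataset[(String, Double)] = purchases
      .filter(p => p.amount > 100.0 && Set("DE", "FR", "NL").contains(p.country))
      .map(p => (p.user, p.amount))

    bigEuropean.show()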

Comments:

34:50 Using an accumulator for validation. Thank you, Holden!

gounna