filmov
tv
Spark Accumulators

Показать описание
What are accumulators?
After an election voting, when the counting happens, the machines in individual polling station are counted, first. Then in a central location, the vote-count from the individual polling-station machines are accumulated to count the total number of votes.
The votes of the individual polling stations can be incremented in individual polling-station; however, they cannot be incremented in the central station. All they do in a central station is to accumulate the vote-count of individual polling-stations.
Spark Accumulators are shared variables that can aggregate values across multiple tasks in a spark cluster. In our vote-counting analogy, the individual polling stations are the spark executors. The central station where the votes are accumulated is the driver node of the spark cluster.
Like our vote-counting analogy, the individual counts of each executor tasks are counted locally and then finally aggregated by the driver process. Accumulator acts as a shared variable that can send the results of individual tasks to the driver.
In our vote counting analogy, just like how the votes in individual polling stations are incremented, the accumulator values are write-only for the individual executors. Like how the votes from individual polling stations are read-only in the central location, the accumulator values are read-only for the driver.
How to create Accumulators?
Accumulators can be created using accumulator method of the spark context object. We should pass the initial value as a parameter to the accumulator method. This creates an accumulator variable called blankLines.
Why we need Accumulators?
To understand why we need Accumulators, let’s take a small example.
Below is the log file that contains logId, productName, Location, and Price details. Let’s say we need to find the number of blankLines, number of bad records, and number of records with price as 0.
If we use a regular variable, instead of accumulator, what happens is the variable will print 0, when the driver tries to print it. This is because, since it is a regular variable, the value from the workers are not transferred back to the driver.
Now let me try to use accumulator in the place of regular variable. Now the results are sent back to the driver. This should print the accumulated values as we expect.