Determining the number of partitions

How do we determine how many partitions an RDD should have? If an RDD has too many partitions, task scheduling may take more time than the actual execution. On the other hand, having too few partitions is also not beneficial, because some worker nodes may sit idle, reducing concurrency. There is therefore always a trade-off when deciding on the number of partitions. Let's look deeper at how to determine the optimum number of partitions.
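As a quick sanity check, you can inspect how many partitions an RDD currently has. Below is a minimal PySpark sketch, assuming a local 4-core session; the app name "partition_demo" and the sample data are illustrative only.

from pyspark.sql import SparkSession

# Assumed local 4-core session; on a real cluster the master URL would differ.
spark = SparkSession.builder.master("local[4]").appName("partition_demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))   # let Spark choose a default partition count
print(rdd.getNumPartitions())            # see how many partitions the RDD actually has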

Apache Spark can run as many parallel tasks as there are cores available in the cluster at any moment. If a cluster has 4 cores, Spark can run 4 tasks in parallel at any given point in time. So we should set the number of partitions to at least 4, or perhaps 2 or 3 times that.
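To see how this maps to code: sc.defaultParallelism usually reflects the total cores available to the application (4 for the local[4] session above, though the exact value depends on the cluster manager), and parallelize() accepts an explicit partition count. A sketch under those assumptions, reusing the sc from the snippet above:

cores = sc.defaultParallelism                                   # 4 in the assumed local[4] setup
rdd_1x = sc.parallelize(range(1_000_000), numSlices=cores)      # one partition per core
rdd_2x = sc.parallelize(range(1_000_000), numSlices=2 * cores)  # or 2-3x the core count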

Let’s say we have a 4-core cluster, and suppose we partition the data into 5 partitions.

The first 4 partitions run in parallel on the 4 cores of the Spark cluster and are processed in 5 minutes (assuming each task takes 5 minutes).

The 5th partition is processed as a separate task, which takes another 5 minutes, leading to 10 minutes of total execution time.

This is obviously not an effective use of the cluster's resources.
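The arithmetic behind this example can be captured in a tiny back-of-the-envelope helper. The function name estimated_runtime and the 5-minute task duration are assumptions carried over from the example above, and the model ignores scheduling overhead and data skew.

import math

def estimated_runtime(num_partitions, num_cores, task_minutes=5):
    # Tasks run in "waves" of at most num_cores at a time.
    waves = math.ceil(num_partitions / num_cores)
    return waves * task_minutes

print(estimated_runtime(5, 4))   # 10 minutes: 4 tasks in the first wave, 1 in the second
print(estimated_runtime(4, 4))   # 5 minutes: everything fits in a single wave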

So, the simplest way to decide on the number of partitions is to keep it equal to the number of cores in the cluster. In our example, since we have a 4-core Spark cluster, the number of partitions should also be 4. All tasks are then processed in parallel and resources are utilized in an optimal way.
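In PySpark this could look like the sketch below, reusing the sc and rdd from the earlier snippets. repartition() triggers a full shuffle; coalesce() can only reduce the partition count but avoids one.

target = sc.defaultParallelism         # 4 on the assumed 4-core cluster
balanced = rdd.repartition(target)     # redistribute the data into one partition per core
print(balanced.getNumPartitions())     # 4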

The general rules of thumb when the number of partitions is in the range of 100 to 10,000 are:
● The lower bound for Spark partitions is 2 × the number of cores in the cluster available to the application. So, for a cluster with 100 cores available to the application, there should be at least 200 partitions for optimum performance (a sketch follows after this list).
● The upper bound is reached when individual tasks start taking less than about 100 ms to execute. Why the 100 ms cutoff? If each task completes in under 100 ms, the partitioned data is too small, and the application ends up spending more time scheduling tasks than benefiting from the added concurrency.
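As a rough sketch of the 2 × cores lower bound, again reusing sc and rdd from the snippets above and treating defaultParallelism as a stand-in for the cores available to the application (the exact value depends on your cluster manager):

lower_bound = 2 * sc.defaultParallelism      # rule of thumb: at least 2 x available cores
if rdd.getNumPartitions() < lower_bound:
    rdd = rdd.repartition(lower_bound)       # raise the partition count to the lower bound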