Spark-submit command and flags

spark-submit is a tool available in Spark's bin directory that is used to submit a Spark application to a cluster. The most basic form of the command is simply spark-submit followed by the script name or jar file name, with no flags specified. By default, this runs the application locally.
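As a minimal sketch (the jar and script names here are placeholders, not part of the original description):

    spark-submit my-app.jar        # submits a Java/Scala application packaged as a jar
    spark-submit my_script.py      # submits a Python application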
If we would like to run the application in distributed mode, we can specify where it should run using the --master flag.
For a standalone cluster, the value of the master flag is spark:// followed by the hostname and port number. By default, the port number is 7077 for a standalone cluster.
For an Apache Mesos cluster, the value of the master flag is mesos:// followed by the hostname and port number. By default, the port number is 5050 for a Mesos cluster.
The value of the master flag can be set to 'yarn' if you are planning to run the application on a YARN cluster.
When running the application locally, the master flag can be set to 'local'. By default, it will run on a single core.
If you want it to run locally using 'n' cores, you can specify the value as local[n], where 'n' is placed inside square brackets.
We can also specify local[*] to run it locally using as many cores as are available.
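As an illustration, the --master values described above would be written roughly as follows (hostnames and file names are placeholder values):

    spark-submit --master local my_script.py                       # run locally on a single core
    spark-submit --master local[4] my_script.py                    # run locally on 4 cores
    spark-submit --master local[*] my_script.py                    # run locally on all available cores
    spark-submit --master spark://master-host:7077 my_script.py    # standalone cluster, default port 7077
    spark-submit --master mesos://mesos-host:5050 my_script.py     # Mesos cluster, default port 5050
    spark-submit --master yarn my_script.py                        # YARN cluster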
A typical command to submit a Spark application will look like the one below. There are multiple optional flags, and all flag names are prefixed with --.
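A sketch of such a command, submitting to an assumed standalone cluster (all class names, hosts, file names, and sizes are placeholder values):

    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --name "My Spark App" \
      --jars /opt/libs/extra-lib1.jar,/opt/libs/extra-lib2.jar \
      --executor-memory 4G \
      --driver-memory 2G \
      --total-executor-cores 8 \
      my-app.jar arg1 arg2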
The --master flag, as discussed above, tells spark-submit where the application should run.
The --deploy-mode flag specifies whether the driver runs locally in 'client' mode or inside the cluster in 'cluster' mode. This flag accepts two values, 'client' and 'cluster'; the default is 'client' mode.
The --class flag specifies the main class that should be invoked when running a Java or Scala application.
The --name flag gives a human-readable name to the application. This name is displayed in the Spark application web UI.
A Spark application may depend on one or more third-party jars. Those jar files can be specified using the --jars flag. The URLs listed after --jars should be separated by commas. The jars specified here are automatically transferred to the cluster. Each URL can follow any of the following schemes.
If it starts with file:, the file is served by the driver's HTTP file server, and each executor pulls it from the driver's HTTP server.
If it starts with hdfs:, the file is pulled by the executors from the corresponding URI.
If it starts with local:, the URL refers to a file that already exists locally on each executor. This means the file must be pushed to all the worker nodes in advance; it incurs no network IO and works well with large files.
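For illustration, the three schemes could appear together in a single --jars list like this (the paths are made up):

    spark-submit \
      --master yarn \
      --jars file:///opt/libs/lib-a.jar,hdfs:///user/me/libs/lib-b.jar,local:/opt/spark/extra/lib-c.jar \
      my-app.jar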
The --total-executor-cores flag specifies the total number of executor cores the application may use.
The --num-executors flag sets the number of executors in the cluster that can be utilized by the application.
The --executor-cores flag specifies the number of cores per executor.
The --executor-memory flag specifies the amount of memory each executor can use.
The --driver-memory flag specifies how much memory the driver can use.
The --queue flag specifies the queue in which the Spark application should run. Queues are created to share resources between applications.
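On a YARN cluster, the resource and queue flags above could be combined roughly as follows (the executor counts, memory sizes, class name, and queue name are assumed values):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8G \
      --driver-memory 4G \
      --queue analytics \
      --class com.example.MyApp \
      my-app.jar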
Finally, specify the jar name or the file name that should be executed by Spark. If it is a Python application, you can specify the .py file name along with its path.
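For a Python application, a sketch might look like the following (the path, name, and arguments are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode client \
      --name "My PySpark Job" \
      /home/user/jobs/my_job.py input_path output_path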