# DX Spark Submit Utility

{% hint style="info" %}
To reflect more inclusive language, DNAnexus has updated its terminology. The terms `master` and `slave` are being replaced with `driver` and `clusterWorker` in Spark documentation articles. The codebase updates are in progress. Some variable names and scripts in the code still use the older terms.
{% endhint %}

{% hint style="info" %}
A license is required to access Spark functionality on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Usage

```
usage: dx-spark-submit [-h | --help] [--log-level {INFO,WARN,TRACE,DEBUG}]
                       [--collect-logs] [--log-collect-dir LOG_COLLECT_DIR]
                       [--app-config APP_CONFIG] [--user-config USER_CONFIG]
                       spark-driver-args

positional arguments:
  spark-driver-args     Options to be passed directly to spark-submit, including
                        Spark application, properties, and driver options

optional arguments:
  -h, --help            show this help message and exit
  --log-level {INFO,WARN,TRACE,DEBUG}
                        Log level for driver and executor
  --collect-logs        Collect logs to a project in the platform
  --log-collect-dir LOG_COLLECT_DIR
                        Directory in project to upload logs
  --app-config APP_CONFIG
                        Application configuration json string or file
  --user-config USER_CONFIG
                        User configuration json string or file
```

## How does it work?

The `dx-spark-submit` utility simplifies common Spark application tasks.

* Allows convenient overrides of Spark properties at the app-developer and user levels.
* Sets the driver and executor log level.
* Submits the Spark job and sets up the UI for monitoring it.
* Initiates log collection once the job is done (success or failure).

## Spark Property Overrides

Spark apps depend on specific configuration files, such as `spark-defaults.conf` and `hive-site.xml`, which set up the environment for your application. In certain scenarios, an application developer or a user of the application may want to override a default setting.

The `dx-spark-submit` utility accepts two configuration inputs:

* Application configuration
* User configuration

### Application Configuration JSON

The application config JSON (`--app-config`) contains the list of properties the app developer wants to set, restrict, or override. Setting `override_allowed` to `true` for a property permits app users to override it via their user configuration.

```json
{
  "spark-defaults.conf": [
    {
      "name": "spark.ui.port",
      "value": 8081,
      "override_allowed": true
    },
    {
      "name": "spark.sql.parquet.filterPushdown",
      "value": false
    }
  ]
}
```

### User Configuration JSON

The user config JSON (`--user-config`) contains the list of properties the app user may want to add or override. If you want to offer this override ability to users of your app, you need to reference this file in the app's input spec so that it's available to `dx-spark-submit`.

```json
{
  "spark-defaults.conf": [
    {
      "name": "spark.ui.port",
      "value": 8080
    },
    {
      "name": "spark.sql.shuffle.partitions",
      "value": 1
    }
  ]
}
```
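Conceptually, the user configuration is merged over the application configuration, with the developer's `override_allowed` flag controlling which properties a user may change. The sketch below illustrates these merge semantics; it is a hypothetical helper, not the actual `dx-spark-submit` implementation:

```python
def merge_spark_conf(app_conf, user_conf):
    """Merge user properties over app properties for one config file.

    A user value is applied only if the app config does not define the
    property, or defines it with override_allowed set to true.
    (Illustrative only; the real dx-spark-submit logic may differ.)
    """
    merged = {p["name"]: p["value"] for p in app_conf}
    allowed = {p["name"]: p.get("override_allowed", False) for p in app_conf}
    for prop in user_conf:
        name = prop["name"]
        if name not in merged or allowed[name]:
            merged[name] = prop["value"]
    return merged

# The configurations from the two JSON examples above:
app = [
    {"name": "spark.ui.port", "value": 8081, "override_allowed": True},
    {"name": "spark.sql.parquet.filterPushdown", "value": False},
]
user = [
    {"name": "spark.ui.port", "value": 8080},
    {"name": "spark.sql.parquet.filterPushdown", "value": True},  # not allowed
    {"name": "spark.sql.shuffle.partitions", "value": 1},
]

# spark.ui.port is overridable, so the user's 8080 wins;
# filterPushdown is locked by the app config and stays False;
# shuffle.partitions is new, so the user's value is added.
print(merge_spark_conf(app, user))
```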

These Spark properties cannot be overridden, as they affect the basic functioning of the cluster application:

```
spark.driver.host
spark.driver.bindAddress
spark.driver.port
spark.driver.blockManager.port
spark.blockManager.port
spark.port.maxRetries
spark.master
spark.driver.extraClassPath
spark.jars
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
```
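A configuration that tries to set one of these reserved properties can simply be dropped before the config files are written. A hypothetical filter (illustrative sketch, not the utility's actual code) might look like:

```python
# Reserved Spark properties from the list above, which dx-spark-submit
# does not allow app or user configs to override.
BLOCKED_PROPERTIES = {
    "spark.driver.host",
    "spark.driver.bindAddress",
    "spark.driver.port",
    "spark.driver.blockManager.port",
    "spark.blockManager.port",
    "spark.port.maxRetries",
    "spark.master",
    "spark.driver.extraClassPath",
    "spark.jars",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version",
}

def drop_blocked(props):
    """Return only the properties a config is allowed to set."""
    return [p for p in props if p["name"] not in BLOCKED_PROPERTIES]

user = [
    {"name": "spark.master", "value": "local[*]"},          # reserved: dropped
    {"name": "spark.sql.shuffle.partitions", "value": 8},   # kept
]
print(drop_blocked(user))
```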

## Log Collection

When the `--collect-logs` option is set, the script triggers log collection. It collects the logs from the `clusterWorker` and `driver` nodes and uploads them to the project by default. If `--log-collect-dir` is specified, the logs are uploaded to that folder in the project instead.

{% hint style="info" %}
Subjobs cannot use log collection.
{% endhint %}

## Log Level

The `--log-level` option sets the driver and executor log level (`INFO`, `WARN`, `TRACE`, or `DEBUG`).

## Spark Arguments

`spark-driver-args` should contain the Spark application plus any arguments you want passed directly to `spark-submit`.

## Example

```shell
$ dx-spark-submit \
    --log-level INFO \
    --collect-logs \
    --log-collect-dir pitestlogs \
    --app-config /app.json \
    --user-config /user.json \
    --class org.apache.spark.examples.SparkPi /cluster/spark/examples/jars/spark-examples*.jar 10
```

{% hint style="info" %}
The `dx-spark-submit` utility is located in `/cluster/dnax/bin` on the Spark cluster worker container.
{% endhint %}
