# Running Nextflow Pipelines

{% hint style="info" %}
A license is required to create a DNAnexus app or applet from the Nextflow script folder. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

This documentation assumes you already have a basic understanding of how to develop and run a [Nextflow](https://nextflow.io/) pipeline. To learn more about Nextflow, consult the official [Nextflow Documentation](https://docs.seqera.io/nextflow).

To run a Nextflow pipeline on the DNAnexus Platform:

1. Import the pipeline script from a remote repository or local disk.
2. Convert the script to an app or applet.
3. Run the app or applet.

You can do this via either the user interface (UI) or the command-line interface (CLI), using the [`dx` command-line client](https://documentation.dnanexus.com/downloads).

{% hint style="info" %}
Use the latest version of [`dx-toolkit`](https://documentation.dnanexus.com/downloads) to take advantage of recent improvements and bug fixes.

As of `dx-toolkit` version `v0.391.0`, pipelines built using `dx build --nextflow` default to running on **Ubuntu 24.04**. To use **Ubuntu 20.04** instead, override the default by specifying the release in `--extra-args`:

```shell
dx build --nextflow --extra-args='{"runSpec": {"release": "20.04"}}'
```

This documentation covers features available in `dx-toolkit` versions beginning with `v0.378.0`.
{% endhint %}

## Quickstart

### Pipeline Script Folder Structure

A Nextflow pipeline is structured as a folder containing Nextflow scripts, along with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

* (Required) A main Nextflow file with the extension `.nf` containing the pipeline. The default filename is `main.nf`. A different filename can be specified in the `nextflow.config` file.
* (Optional) A [`nextflow.config` file](https://docs.seqera.io/nextflow/overview#configuration-options).
* (Optional, recommended) A [`nextflow_schema.json` file](https://nf-co.re/docs/nf-core-tools/cli/pipelines/schema). If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. For more information on how the exposed parameters are used at run time, see [specifying input values to a Nextflow pipeline executable](#specifying-input-values-to-a-nextflow-pipeline-executable).
* (Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the main Nextflow file or `nextflow.config` via the `include` or `includeConfig` keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.

An [nf-core](https://nf-co.re/) flavored folder structure is encouraged but not required.
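As a concrete illustration of this layout, the commands below create a minimal pipeline folder with the required `main.nf` and an optional `nextflow.config` (the pipeline content itself is illustrative, not taken from a real repository):

```shell
# Create a minimal pipeline folder (illustrative content).
mkdir -p hello

# Required: the main Nextflow file, main.nf by default.
cat > hello/main.nf <<'EOF'
process sayHello {
    output:
    stdout

    script:
    "echo 'Hello world!'"
}

workflow {
    sayHello | view
}
EOF

# Optional: a nextflow.config file.
cat > hello/nextflow.config <<'EOF'
// Optional settings; e.g. a non-default main script name:
// manifest.mainScript = 'pipeline.nf'
EOF

ls hello
```

A folder like this can then be passed to `dx build --nextflow /path/to/hello`.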

### Importing a Nextflow Pipeline

#### Import via UI

To import a Nextflow pipeline via the UI, click on the **Add** button on the top-right corner of the project's **Manage** tab, then expand the dropdown menu. Select the **Import Pipeline/Workflow** option.

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-64f5801bd28fe0cfd162208e791a41412b39bfad%2Fnf_1.png?alt=media)

Once the **Import Pipeline/Workflow** modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, the [`nextflow-io/hello` repository](https://github.com/nextflow-io/hello). Then choose the project import location. If the repository is private, provide the credentials necessary for accessing it.

An example of the **Import Pipeline/Workflow** modal:

![The "Estimated Price" value shown here is only an example. The actual price depends on the pricing model and runtime of the import job.](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-989b7d7b6554aaceba7136bada737ff64b63e4f6%2Fimportmodal.png?alt=media)

Click the **Start Import** button after providing the necessary information. This starts a pipeline import job in the project specified in the **Import To** field (default is the current project).

After launching the import job, a status message "External workflow import job started" appears.

Access information about the pipeline import job in the project's **Monitor** tab:

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-b8a6c8cc38d343d8c8397fc93c8dfe9d5d3dfb1e%2FScreenshot%202023-09-20%20at%202.29.53%20PM.png?alt=media)

After the import finishes, the imported pipeline executable exists as an applet. This is the output of the pipeline import job:

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-83e636b3a4d9b743c42a95989ad4b09555950c79%2FScreenshot%202023-09-20%20at%202.31.35%20PM.png?alt=media)

The newly created Nextflow pipeline applet appears in the project, for example, `hello`.

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-d07821907e2421bd4fc0da1d44a306f67fe98dba%2FScreenshot%202023-09-21%20at%203.28.54%20PM.png?alt=media)

#### Import via CLI from a Remote Repository

To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository's URL. You can also provide optional information, such as a [repository tag](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#build) and an [import destination](#advanced-building-and-importing-pipelines):

```shell
$ dx build --nextflow \
  --repository https://github.com/nextflow-io/hello \
  --destination project-xxxx:/applets/hello

Started builder job job-aaaa
Created Nextflow pipeline applet-zzzz
```

{% hint style="info" %}
All `dx-toolkit` versions beginning with `v0.338.0` support converting Nextflow pipelines to apps or applets.
{% endhint %}

{% hint style="info" %}
Your destination project's `billTo` feature needs to be enabled for Nextflow pipeline applet building. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

For Nextflow pipelines stored in private repositories, access requires credentials provided via the `--git-credentials` option with a DNAnexus file containing your authentication details. The file should be specified using either its qualified ID or path on the Platform. See the [Private Nextflow Pipeline Repository](#private-nextflow-pipeline-repository) section for more details on setting up and formatting these credentials.

Once the pipeline import job finishes, it generates a new Nextflow pipeline applet with an applet ID in the form `applet-zzzz`.

Use `dx run -h` to get more information about running the applet:

```shell
$ dx run project-xxxx:/applets/hello -h
usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]

Applet: hello

hello

Inputs:
 Nextflow options
  Nextflow Run Options: [-inextflow_run_opts=(string)]
        Additional run arguments for Nextflow (e.g. -profile docker).

  Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
        Additional top-level options for Nextflow (e.g. -quiet).

  Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
        (Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
        configuration set

  Script Parameters File: [-inextflow_params_file=(file)]
        (Optional) A file, in YAML or JSON format, for specifying input parameter values

 Advanced Executable Development Options
  Debug Mode: [-idebug=(boolean, default=false)]
        Shows additional information in the job log. If true, the execution log messages from
        Nextflow are also included.

  Resume: [-iresume=(string)]
        Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
        the sessionID, resumes the latest resumable session run by an applet with the same name
        in the current project in the last 6 months.

  Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
        Enable storing pipeline cache and local working files to the current project. If true, local
        working files and cache files are uploaded to the platform, so the current session could
        be resumed in the future

Outputs:
  Published files of Nextflow pipeline: [published_files (array:file)]
        Output files published by current Nextflow pipeline and uploaded to the job output
        destination.
```

#### Building from a Local Disk

Through the CLI you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the [`nextflow-io/hello` pipeline](https://github.com/nextflow-io/hello) from GitHub on your laptop, stored in a directory named `hello`, which contains the following files:

```shell
$ pwd
/path/to/hello
$ ls
LICENSE         README.md       main.nf         nextflow.config
```

Ensure that the folder structure is in the required format, as [described here](#pipeline-script-folder-structure).

To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide [optional information](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#build), such as an import destination:

```shell
$ dx build --nextflow /path/to/hello \
  --destination project-xxxx:/applets2/hello
{"id": "applet-yyyy"}
```

{% hint style="info" %}
Your destination project's `billTo` feature needs to be enabled for Nextflow pipeline applet building. Contact [Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

This command packages the Nextflow pipeline script folder as an applet named `hello` with ID `applet-yyyy`, and stores the applet in the destination project and path `project-xxxx:/applets2/hello`. If an import destination is not provided, the current working directory is used.

The [`dx run -h`](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#run) command can be run to see information about this applet, similar to the above example.

A Nextflow pipeline applet has a type `nextflow` under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.

For advanced information regarding the parameters of `dx build --nextflow`, run `dx build --help` in the CLI and see the Nextflow section for all arguments supported when building a Nextflow pipeline applet.

#### Building a Nextflow Pipeline App from a Nextflow Pipeline Applet

You can also build a Nextflow pipeline [app from a Nextflow pipeline applet](https://documentation.dnanexus.com/developer/apps/transitioning-from-applets-to-apps) by running the command: `dx build --app --from applet-xxxx`.

### Running a Nextflow Pipeline Executable (App or Applet)

#### Running a Nextflow Pipeline Executable via UI

You can access a Nextflow pipeline applet from the **Manage** tab in your project, while a Nextflow pipeline app that you built can be accessed via the **Tools Library** option in the **Tools** tab. Once you click the applet or app, the **Run Analysis** tab is displayed. Fill out the required inputs and outputs, then click the **Start Analysis** button to launch the job.

#### Running a Nextflow Pipeline Executable via CLI

To run a Nextflow pipeline executable, use the `dx run applet-xxxx` or `dx run app-xxxx` command in the CLI and specify your [inputs](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#run):

```shell
$ dx run project-yyyy:applet-xxxx \
  -i debug=false \
  --destination project-xxxx:/path/to/destination/ \
  --brief -y

job-bbbb
```

You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following [command](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#find-jobs):

```shell
# See subjobs in progress
$ dx find jobs --origin job-bbbb
* hello (done) job-bbbb
│ amy 2023-09-20 14:57:58 (runtime 0:02:03)
├── sayHello (3) (hello:nf_task_entry) (done) job-1111
│   amy 2023-09-20 14:58:57 (runtime 0:00:45)
├── sayHello (1) (hello:nf_task_entry) (done) job-2222
│   amy 2023-09-20 14:58:52 (runtime 0:00:52)
├── sayHello (2) (hello:nf_task_entry) (done) job-3333
│   amy 2023-09-20 14:58:48 (runtime 0:00:53)
└── sayHello (4) (hello:nf_task_entry) (done) job-4444
    amy 2023-09-20 14:58:43 (runtime 0:00:50)
```

### Monitoring Jobs

Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob handles a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).

Monitor the detailed logs of the head job and the subjobs through each job's DNAnexus log, via the UI or the CLI.

{% hint style="warning" %}
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.
{% endhint %}

#### Monitoring in the UI

Once your job tree is running, go to the **Monitor** tab to view its status. From the **Monitor** tab, view the job log of the head job or of any subjob by clicking the **Log** link in that job's row. The cost (when your account has permission to view it) and resource usage of a job are also viewable.

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-8abd5588f6e8b0cf3262228e3cf8f98bb12f1624%2FScreenshot%202023-09-20%20at%203.03.32%20PM.png?alt=media)

An example of the log of a head job:

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-bc3ae107618f35620e3ac5fb62389f26c63a4ade%2FScreenshot%202023-09-20%20at%203.05.21%20PM.png?alt=media)

An example of the log of a subjob:

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-988f1b75cdcbd77172f6178d3328e8bfd9a6ca5b%2FScreenshot%202023-09-20%20at%203.06.40%20PM.png?alt=media)

#### Monitoring in the CLI

From the CLI, you can use the [`dx watch`](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#watch) command to check the status and view the log of the head job or each subjob.

Monitoring the head job:

```shell
# Monitor job in progress
$ dx watch job-bbbb
Watching job job-bbbb. Press Ctrl+C to stop watching.
* hello (done) job-bbbb
  amy 2023-09-20 14:57:58 (runtime 0:02:03)
... [deleted]
2023-09-20 14:58:29 hello STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:58:30 hello STDOUT bash running (job ID job-bbbb)
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:31 hello STDOUT === NF projectDir   : /home/dnanexus/hello
2023-09-20 14:58:31 hello STDOUT === NF session ID   : 0eac8f92-1216-4fce-99cf-dee6e6b04bc2
2023-09-20 14:58:31 hello STDOUT === NF log file     : dx://project-xxxx:/applets/nextflow-job-bbbb.log
2023-09-20 14:58:31 hello STDOUT === NF command      : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
2023-09-20 14:58:31 hello STDOUT === Built with dxpy : 0.358.0
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:34 hello STDOUT N E X T F L O W  ~  version 22.10.7
2023-09-20 14:58:35 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
2023-09-20 14:58:43 hello STDOUT [0a/6a81ca] Submitted process > sayHello (4)
2023-09-20 14:58:48 hello STDOUT [f5/87df8b] Submitted process > sayHello (2)
2023-09-20 14:58:53 hello STDOUT [4b/21374a] Submitted process > sayHello (1)
2023-09-20 14:58:57 hello STDOUT [f6/8c44f5] Submitted process > sayHello (3)
2023-09-20 14:59:51 hello STDOUT Hola world!
2023-09-20 14:59:51 hello STDOUT 
2023-09-20 14:59:51 hello STDOUT Ciao world!
2023-09-20 14:59:51 hello STDOUT 
2023-09-20 15:00:06 hello STDOUT Bonjour world!
2023-09-20 15:00:06 hello STDOUT 
2023-09-20 15:00:06 hello STDOUT Hello world!
2023-09-20 15:00:06 hello STDOUT 
2023-09-20 15:00:07 hello STDOUT === Execution completed — cache and working files will not be resumable
2023-09-20 15:00:07 hello STDOUT === Execution completed — upload nextflow log to job output destination project-xxxx:/applets/
2023-09-20 15:00:09 hello STDOUT Upload nextflow log as file: file-GZ5ffkj071zqZ9Qj22qv097J
2023-09-20 15:00:09 hello STDOUT === Execution succeeded — upload published files to job output destination project-xxxx:/applets/
* hello (done) job-bbbb
  amy 2023-09-20 14:57:58 (runtime 0:02:03)
  Output: -
```

Monitoring a subjob:

```shell
# Monitor job in progress
$ dx watch job-cccc
Watching job job-cccc. Press Ctrl+C to stop watching.
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
... [deleted]
2023-09-20 14:59:28 sayHello (1) STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:59:30 sayHello (1) STDOUT bash running (job ID job-cccc)
2023-09-20 14:59:33 sayHello (1) STDOUT file-GZ5ffQj047j3Vq7QX220Q5vQ
2023-09-20 14:59:34 sayHello (1) STDOUT Bonjour world!
2023-09-20 14:59:36 sayHello (1) STDOUT file-GZ5ffVQ047j2QXZ2ZkFx4YxG
2023-09-20 14:59:38 sayHello (1) STDOUT file-GZ5ffX0047j2QXZ2ZkFx4YxK
2023-09-20 14:59:41 sayHello (1) STDOUT file-GZ5ffXQ047jGYZ91x6KG32Jp
2023-09-20 14:59:43 sayHello (1) STDOUT file-GZ5ffY8047jF2PY3609JPBKB
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
Output: exit_code = 0
```

## Advanced Options: Running a Nextflow Pipeline Executable (App or Applet)

### Nextflow Execution on DNAnexus

The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow [executor](https://docs.seqera.io/nextflow/executor), and multiple subjobs running a single [process](https://docs.seqera.io/nextflow/process) each. Throughout the pipeline's execution, the head job remains in "running" state and supervises the job tree's execution.

### Nextflow Execution Log File

When a Nextflow head job (`job-xxxx`) enters its terminal state, either "done" or "failed", the system writes a [Nextflow log file](https://docs.seqera.io/nextflow/cli#execution-logs) named `nextflow-<job-xxxx>.log` to the [destination path](#specifying-a-nextflow-job-tree-output-folder) of the head job.

### Private Docker Repository

DNAnexus supports Docker container engines for the Nextflow pipeline execution environment. The pipeline developer may refer to a public or a private Docker repository. When the pipeline references a private Docker repository, provide your Docker credential file via the `docker_creds` file input of the Nextflow pipeline executable when launching the job tree.

Syntax of a private Docker credential:

```json
{
  "docker_registry": {
    "registry": "url-to-registry",
    "username": "name123",
    "token": "12345678"
  }
}
```

Store this credential file in a separate project with restricted access permissions for security.
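For example, the credential file could be created and supplied as follows (the registry, username, and token values below are placeholders):

```shell
# Write the Docker credential file locally (placeholder values).
cat > docker_creds.json <<'EOF'
{
  "docker_registry": {
    "registry": "quay.io",
    "username": "name123",
    "token": "12345678"
  }
}
EOF

# Then upload it to a restricted project and pass it at run time, e.g.:
#   dx upload docker_creds.json
#   dx run applet-xxxx -i docker_creds=project-xxxx:/docker_creds.json
```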

### Nextflow Pipeline Executable Inputs and Outputs

#### Specifying Input Values to a Nextflow Pipeline Executable

Below are all the ways you can specify an input value at build time and run time, listed in order of precedence (items listed first take precedence and override items further down the list):

1. Executable (app or applet) run time
   1. DNAnexus Platform app or applet input.
      * CLI example:\
        `dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy`
      * `reads_fastqgz` is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an `nf-core` flavored pipeline schema file ([`nextflow_schema.json`](https://nf-co.re/docs/nf-core-tools/cli/pipelines/schema)).
      * When the input parameter is expecting a file, you need to specify the value in a certain format based on the [class](https://documentation.dnanexus.com/developer/api/running-analyses/job-input-and-output#input) of the input parameter. When the input is of the "file" class, use DNAnexus qualified ID, which is the absolute path to the file object such as "project-xxxx:file-yyyy". When the input is of the "string" class, use the DNAnexus URI ("dx://project-xxxx:/path/to/file"). See [table below](#formats-of-path-to-file-folder-or-wildcards) for full descriptions of the formatting of PATHs.
* You can use `dx run <app(let)> --help` to query the class of each input parameter at the app(let) level. In the example code block below, `fasta` is an input parameter of the `file` class, while `fasta_fai` is an input parameter of the `string` class. You then use the DNAnexus qualified ID format for `fasta`, and the DNAnexus URI format for `fasta_fai`.
      * The DNAnexus object class of each input parameter is based on the `type` and `format` specified in the pipeline's `nextflow_schema.json`, when it exists. See additional documentation in the [Nextflow Input Parameter Type Conversion section](#nextflow-input-parameter-type-conversion-to-dnanexus-executable-input-parameter-class) to understand how Nextflow input parameter's type and format (when applicable) converts to an app or applet's input class.
* It is recommended to always specify input values as app or applet inputs: the platform validates the input's class and existence before the job is created.
      * All inputs for a Nextflow pipeline executable are set as "[optional](https://documentation.dnanexus.com/developer/api/running-analyses/io-and-run-specifications#input-specification)" inputs. This allows users to have flexibility to specify input via other means.
   2. Nextflow pipeline command line input parameter, available as `nextflow_pipeline_params`. This is an optional "string" class input, available for any Nextflow pipeline executable being built.
      * CLI example:\
        `dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy"`, where `"--foo=xxxx --bar=yyyy"` corresponds to the `"--something value"` pattern of Nextflow input specification referenced in the [Nextflow Configuration documentation](https://docs.seqera.io/nextflow/config#configuration).
      * Because `nextflow_pipeline_params` is a string type parameter with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
   3. Nextflow options parameter `nextflow_run_opts`. This is an optional "string" class input, available for any Nextflow pipeline executable being built.
      * CLI example:\
        `dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test"`, where `-profile` is a single-dash prefix parameter that corresponds to the [Nextflow run options pattern](https://docs.seqera.io/nextflow/cli#pipeline-parameters), specifying a preset input configuration.
   4. Nextflow parameter file `nextflow_params_file`. This is an optional "file" class input, available for any Nextflow pipeline executable that is being built.
      * CLI example:\
        `dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy`, where `project-xxxx:file-yyyy` is the DNAnexus qualified ID of the file being passed to `nextflow run -params-file <file>`. This corresponds to [`-params-file`](https://docs.seqera.io/nextflow/cli#pipeline-parameters) option of `nextflow run`.
   5. Nextflow soft configuration override file `nextflow_soft_confs`. This is an optional "array:file" class input, available for any Nextflow pipeline executable that is being built.
      * CLI example:\
        `dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222`, where `project-xxxx:file-1111` and `project-xxxx:file-2222` are the DNAnexus qualified IDs of the files being passed to `nextflow run -c <config-file1> -c <config-file2>`. This corresponds to [`-c`](https://docs.seqera.io/nextflow/cli#soft-configuration-override) option of `nextflow run`, and the order specified for this array of file input is preserved when passing to the `nextflow run` execution.
      * The soft configuration file can be used for assigning default values of configuration scopes (such as [`process`](https://docs.seqera.io/nextflow/config#process-configuration)).
* It is highly recommended to use `nextflow_params_file` instead of `nextflow_soft_confs` for specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more in the [nf-core documentation](https://nf-co.re/docs/usage/configuration#custom-configuration-files).
2. Pipeline source code:
   1. `nextflow_schema.json`
      * Pipeline developers may specify default values of inputs in the `nextflow_schema.json` file.
      * If an input parameter is of Nextflow's string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.
   2. `nextflow.config`
      * Pipeline developers may specify default values of inputs in the `nextflow.config` file.
      * Pipeline developers may specify a default profile value using `--profile <value>` when building the executable, for example, `dx build --nextflow --profile test`.
   3. `main.nf`, `sourcecode.nf`
      * Pipeline developers may specify default values of inputs in the Nextflow source code file (`*.nf`).
      * If an input parameter is of Nextflow's string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

```shell
# Query for the class of each input parameter
$ dx run project-yyyy:applet-xxxx --help
usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]

Applet: example_applet

example_applet

Inputs:
…
  fasta: [-ifasta=(file)]
…

  fasta_fai: [-ifasta_fai=(string)]
…


# Assign values of the parameter based on the class of the parameter
$ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"
```
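As an example of the parameter-file method (`nextflow_params_file`), a params file can bundle several parameter values. The parameter names below are hypothetical; note that file-path values inside the params file use the DNAnexus URI format:

```shell
# Write a YAML params file (hypothetical parameter names).
# File-path values use the DNAnexus URI format (dx://...).
cat > params.yml <<'EOF'
reads_fastqgz: "dx://project-xxxx:/inputs/sample_R1.fastq.gz"
outdir: "results"
EOF

# Upload it and pass it via nextflow_params_file, e.g.:
#   dx upload params.yml
#   dx run applet-xxxx -i nextflow_params_file=project-xxxx:/params.yml
```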

#### Formats of PATH to File, Folder, or Wildcards

While you can specify a file input parameter's value in different places, as seen above, the valid PATH format for referring to the same file differs. It depends on the level (DNAnexus API/CLI level or Nextflow script level) and the [class](https://documentation.dnanexus.com/developer/api/running-analyses/job-input-and-output#input) (file object or string) of the executable's input parameter. Examples are given below.

| Scenarios                                                                                                                                                                                                                                 | Valid PATH format                                                                                                                                                                                                                                       |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>• App or applet input parameter class as file object<br>• CLI/API level, such as <code>dx run --destination PATH</code></p>                                                                                                            | <p>DNAnexus qualified ID (absolute path to the file object).<br>• Example (file):<br><code>project-xxxx:file-yyyy</code><br><code>project-xxxx:/path/to/file</code><br>• Example (folder):<br><code>project-xxxx:/path/to/folder/</code></p>            |
| <p>• App or applet input parameter class as string<br>• Nextflow configuration and source code files, such as <code>nextflow_schema.json</code>, <code>nextflow.config</code>, <code>main.nf</code>, and <code>sourcecode.nf</code></p> | <p>DNAnexus URI.<br>• Example (file):<br><code>dx://project-xxxx:/path/to/file</code><br>• Example (folder):<br><code>dx://project-xxxx:/path/to/folder/</code><br>• Example (wildcard):<br><code>dx://project-xxxx:/path/to/wildcard_files</code></p> |

#### Specifying a Nextflow Job Tree Output Folder

When launching a DNAnexus job, you can specify a job-level output destination such as `project-xxxx:/destination/` using the platform-level optional parameter on the [UI](https://documentation.dnanexus.com/getting-started/key-concepts/apps-and-workflows#launch-configuration) or on the [CLI](https://documentation.dnanexus.com/user/running-apps-and-applets#specifying-the-job-output-folder). For pipelines with `publishDir` settings, each output file is saved to `<dx_run_path>/<publishDir>/`, where `<dx_run_path>` is the job-level output destination and `<publishDir>` is the path assigned by the Nextflow script's process.
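For instance, with a job-level destination of `project-xxxx:/results` and a process whose `publishDir` is `aligned` (both names illustrative), the published files land under the composed path:

```shell
# Compose the final output path from the job-level destination and the
# process's publishDir value (illustrative names).
dx_run_path="project-xxxx:/results"
publish_dir="aligned"
echo "${dx_run_path}/${publish_dir}/"
# → project-xxxx:/results/aligned/
```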

Read more detail about the output folder specification and [`publishDir`](#values-of-publishdir). Find an example of how to construct output paths of an nf-core pipeline job tree at run time in the [FAQ](#faq).

## Using an AWS S3 Bucket as a Work Directory for Nextflow Pipeline Runs

You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.

### Step 1. Configure Your AWS Account to Trust the DNAnexus Platform as an OIDC Identity Provider

[Follow the steps outlined here](https://documentation.dnanexus.com/developer/apps/job-identity-tokens-for-access-to-clouds-and-third-party-services#step-1-establish-trust-between-the-platform-and-your-cloud-provider) to configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to note the value entered in the "Audience" field. This value is required in a configuration file used by your pipeline to enable pipeline runs to access the S3 bucket.

### Step 2. Configure an AWS IAM Role with the Proper Trust and Permissions Policies

Next, configure an [AWS Identity and Access Management (IAM) role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html#idp_oidc_Create) whose permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket.

#### Permissions Policy

The following example shows how to structure an IAM role's permissions policy, enabling the role to use an S3 bucket (accessible via the S3 URI `s3://my-nextflow-s3-workdir`) as the work directory of Nextflow pipeline runs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-nextflow-s3-workdir",
        "arn:aws:s3:::my-nextflow-s3-workdir/*"
      ]
    }
  ]
}
```

In the above example:

* The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.
* The two entries in the list in the "Resource" section enable the role to access both the bucket itself and all objects in the bucket accessible via the S3 URI `s3://my-nextflow-s3-workdir`.

#### Trust Policy

The following example shows how to configure an IAM role's trust policy so that only correctly configured Platform jobs can assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/job-oidc.dnanexus.com/"
      },
      "Condition": {
        "StringEquals": {
          "job-oidc.dnanexus.com/:aud": "dx_nextflow_s3_scratch_token_aud",
          "job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
        }
      }
    }
  ]
}
```

In the above example:

* To assume the role, a job must be launched from within a specific Platform project (in this case, `project-xxxx`).
* To assume the role, a job must be launched by a specific Platform user (in this case, `user-aaaa`).
* Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, accessible at `job-oidc.dnanexus.com`.

### Step 3. Configure Your Nextflow Pipeline's Configuration File to Access the S3 Bucket

Next, configure your pipeline so that it can access the S3 bucket at run time. To do this, add a `dnanexus` [config scope](https://docs.seqera.io/nextflow/developer/config-scopes) to a configuration file, including the properties shown in this example:

```
# In a nextflow configuration file:

aws { region = '<aws region>' }

dnanexus {
  workDir = '<S3 URI path>'
  jobTokenAudience = '<OIDC audience name>'
  jobTokenSubjectClaims = '<comma-separated list of claims>'
  iamRoleArnToAssume = '<ARN of the role configured in Step 2>'
}
```

In the above example:

* `workDir` is the path to the bucket to be used as a work directory, in S3 URI format.
* `jobTokenAudience` is the value of "Audience" you defined in [Step 1](#step-1-configure-your-aws-account-to-trust-the-dnanexus-platform-as-an-oidc-identity-provider) above.
* `jobTokenSubjectClaims` is an ordered, comma-separated list of DNAnexus [job identity token custom claims](https://documentation.dnanexus.com/developer/apps/job-identity-tokens-for-access-to-clouds-and-third-party-services#specifying-trust-conditions) - for example, `project_id`, `launched_by` - that the job must present to assume the role that enables bucket access.
* `iamRoleArnToAssume` is the Amazon Resource Name (ARN) for the role that you configured in [Step 2](#step-2-configure-an-aws-iam-role-with-the-proper-trust-and-permissions-policies) above, and that is assumed by jobs to access the bucket.
* You must also configure your pipeline to access the bucket within the appropriate AWS region, specified via the `region` parameter in an `aws` config scope.
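Putting the pieces together, here is a filled-in sketch of this configuration, reusing the illustrative values from the earlier steps (the region, bucket name, audience, claims, and role ARN are all placeholders):

```
# In a nextflow configuration file:

aws { region = 'us-east-1' }

dnanexus {
  workDir = 's3://my-nextflow-s3-workdir'
  jobTokenAudience = 'dx_nextflow_s3_scratch_token_aud'
  jobTokenSubjectClaims = 'project_id,launched_by'
  iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
}
```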

### Using Subject Claims to Control Bucket Access

When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:

| Values of `StringEquals:job-oidc.dnanexus.com/:sub` | Which jobs can assume the role that enables bucket access?                    |
| --------------------------------------------------- | ----------------------------------------------------------------------------- |
| `project_id;project-xxxx`                           | Any Nextflow pipeline jobs that are running in `project-xxxx`                 |
| `launched_by;user-aaaa`                             | Any Nextflow pipeline jobs that are launched by `user-aaaa`                   |
| `project_id;project-xxxx;launched_by;user-aaaa`     | Any Nextflow pipeline jobs that are launched by `user-aaaa` in `project-xxxx` |
| `bill_to;org-zzzz`                                  | Any Nextflow pipeline jobs that are billed to `org-zzzz`                      |

After including custom subject claims in the role's trust policy, set the value of `jobTokenSubjectClaims` in the [Nextflow configuration file mentioned above](#step-3-configure-your-nextflow-pipelines-configuration-file-to-access-the-s3-bucket) to a comma-separated list of those claims, in the same order in which they appear in the trust policy.

For example, if you configured a role's trust policy per the [above example](#trust-policy), a job must present the custom subject claims `project_id` and `launched_by`, in that order, to assume the role. In your Nextflow configuration file, set the value of `jobTokenSubjectClaims`, within the `dnanexus` config scope, as follows:

```
# In a nextflow configuration file:
dnanexus {
 ...
 jobTokenSubjectClaims = 'project_id,launched_by'
 ...
}
```

Within the `dnanexus` config scope, you must also set the value of `iamRoleArnToAssume` to that of the appropriate role:

```
# In a nextflow configuration file:
dnanexus {
 ...
 iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
 ...
}
```

## Advanced Options: Building a Nextflow Pipeline Executable

### Nextflow Pipeline Executable Permissions

By default, the Platform [limits apps' and applets' ability to read and write data](https://documentation.dnanexus.com/developer/apps/app-permissions). Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:

* External internet access (`"network": ["*"]`) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external Docker registries at runtime.
* `UPLOAD` access to the project in which a Nextflow pipeline job is run (`"project": "UPLOAD"`) - This is required in order for Nextflow pipeline jobs to record the progress of executions, and preserve the run cache, to enable resume functionality.

You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when [building from a local disk](#building-from-a-local-disk), using the `--extra-args` flag with [`dx build`](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#build). An example:

```shell
$ dx build --nextflow /path/to/hello --extra-args \
    '{"access":{"network": [], "allProjects":"VIEW"}}'
...
{"id": "applet-yyyy"}
```

Here are the key points:

* `"network": []` prevents jobs from accessing the internet.
* `"allProjects":"VIEW"` grants jobs VIEW ("read") access to all projects accessible by the user running the job. *Use this carefully*. This setting can be useful when expected input file paths are provided as DNAnexus URIs - via a [`samplesheet.csv`](https://github.com/nf-core/sarek/blob/6aeac929c924ba382baa42a0fe969b4e0e753ca9/assets/samplesheet.csv), for example - that point to projects other than the one in which a job is being run.

### Advanced Building and Importing Pipelines

Additional options exist for `dx build --nextflow`:

| Option                                                | Class  | Description                                                                                                                                                                                                                                                                                                                             |
| ----------------------------------------------------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--profile PROFILE`                                   | string | Set default profile for the Nextflow pipeline executable.                                                                                                                                                                                                                                                                               |
| `--repository REPOSITORY`                             | string | Specifies a Git repository of a Nextflow pipeline. Incompatible with `--remote`.                                                                                                                                                                                                                                                        |
| `--repository-tag TAG`                                | string | Specifies tag for Git repository. Can be used only with `--repository`.                                                                                                                                                                                                                                                                 |
| `--git-credentials GIT_CREDENTIALS`                   | file   | Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with `--repository`. More information about the file syntax can be found in the [Configure Git repositories with Nextflow blog post](https://seqera.io/blog/configure-git-repositories-with-nextflow/).                               |
| `--cache-docker`                                      | flag   | Stores a container image tarball in the selected project in `/.cached_docker_images`. Only the Docker engine is supported. Incompatible with `--remote`.                                                                                                                                                                                |
| `--nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS` | string | Custom pipeline parameters to be referenced when collecting the Docker images.                                                                                                                                                                                                                                                          |
| `--docker-secrets DOCKER_SECRETS`                     | file   | A DNAnexus file ID with credentials for a private Docker repository.                                                                                                                                                                                                                                                                    |
| `--nextflow-version VERSION`                          | string | Specifies the Nextflow version used to build the pipeline executable. As of `dx-toolkit` version `v0.408.2`, defaults to `25.10`. Supported values: `25.10`, `24.10`. Use `24.10` only as a temporary workaround if your pipeline is not yet compatible with `25.10`. Version `24.10` is unmaintained and receives no security updates. |

Use `dx build --help` for more information.

### Private Nextflow Pipeline Repository

When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the `--git-credentials` flag, and format the object as follows:

```
providers {
  github {
    user = 'username'
    password = 'ghp_xxxx'
  }
}
```

{% hint style="info" %}
To safeguard this credentials file object, store it in a separate project that only you can access.
{% endhint %}

### Platform File Objects as Runtime Docker Images

When building a Nextflow pipeline executable, you can replace any [Docker container](https://docs.seqera.io/nextflow/container#docker) with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.

This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. It also strengthens data security by removing the need for internet access to external resources during pipeline execution.

Two methods are available for preparing Docker images as tarball file objects on the platform: [Built-in Docker image caching](#built-in-docker-image-caching) or [Manually preparing the tarballs](#manually-preparing-tarballs).

#### Built-in Docker Image Caching vs. Manually Preparing Tarballs

|                                                                      | Built-in Docker image caching                                                                                                                                                                                                             | Manually preparing tarballs                                                                                                                                                        |
| -------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Requires running a "building job" with external internet access?** | <p>Yes, if building an applet for the first time or if any image is going to be updated.<br>No internet access required on rebuild.</p>                                                                                                   | No                                                                                                                                                                                 |
| **Docker images packaged as `bundledDepends`?**                      | <p>Yes.<br>For Docker images that are used in the execution, they are cached and bundled at build time.</p>                                                                                                                               | <p>No.<br>Docker tarballs are resolved at runtime.</p>                                                                                                                             |
| **At runtime**                                                       | Job first attempts to access Docker images cached as `bundledDepends`. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the images from the external repository, via the internet. | <p>Job attempts to find the Docker image based on the Docker cache path referenced.<br>If this fails, the job attempts to pull from the external repository, via the internet.</p> |

#### Built-in Docker Image Caching

This method initiates a building job that takes the pipeline script and identifies Docker containers by scanning the script's source code, based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, [bundling](https://documentation.dnanexus.com/developer/apps/execution-environment#packaging-your-own-resources) in the tarballs as `bundledDepends`.

You can use built-in caching via the CLI by using the flag `--cache-docker` at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at `project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>`.

An example:

```shell
$ dx build --nextflow /path/to/hello \
    --cache-docker \
    --nextflow-pipeline-params "--alpha=1 --beta=foo" \
    --destination project-xxxx:/applets2/hello
...
{"id": "applet-yyyy"}

$ dx tree /.cached_docker_images/
/.cached_docker_images/
├── samtools
│   └── samtools_1.16.1--h6899075_1
├── multiqc
│   └── multiqc_1.18--pyhdfd78af_0
└── fastqc
    └── fastqc_0.11.9--0
```

If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag `--docker-secrets`, a file object that contains the credentials needed to access the repository. This object must be in the following format:

```json
{
  "docker_registry": {
    "registry": "url-to-registry",
    "username": "name123",
    "token": "12345678"
  }
}
```
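A minimal shell sketch (the file name, registry values, and the `dx` commands in the comments are illustrative): write the credentials as a complete JSON document and verify that it parses before uploading:

```shell
# Write the credentials file (placeholder values) and check it is valid JSON
cat > docker_secrets.json <<'EOF'
{
  "docker_registry": {
    "registry": "url-to-registry",
    "username": "name123",
    "token": "12345678"
  }
}
EOF
python3 -m json.tool docker_secrets.json > /dev/null && echo "valid JSON"
# Then upload it and pass the resulting file ID to the build:
#   dx upload docker_secrets.json --brief
#   dx build --nextflow /path/to/hello --cache-docker --docker-secrets file-zzzz
```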

{% hint style="info" %}

* When a pipeline requires specific inputs, such as file objects, sample values must be present in the project in which the building job executes, and must be provided via the flag `--nextflow-pipeline-params`.
  * It's crucial that these sample values be structured in the same way as the actual input data, so that the execution logic of the Nextflow pipeline remains intact. During the build process, use small files containing data representative of the larger dataset as sample data, to reduce file localization overhead.
* For pipelines featuring conditional process trees determined by input values, provide mocked input values for caching Docker containers used by processes affected by the condition.
* A building job requires CONTRIBUTE or higher access to the destination project - that is, the project in which tarballs created from Docker containers are placed.
* Pipeline source code is saved at `/.nf_source/<pipeline_folder_name>/` in the destination project. The user is responsible for cleaning up this folder after the executable has been built.
  {% endhint %}

#### Manually Preparing Tarballs

You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:

**Option A**: Reference each tarball by its unique Platform ID such as `dx://project-xxxx:file-yyyy`. Use this approach if you want deterministic execution behavior.

You can use Platform IDs in Nextflow pipeline scripts (`*.nf`) or configuration files (`*.config`), as follows:

```shell
# In a Nextflow pipeline script:
process foo {
  container 'dx://project-xxxx:file-yyyy'

  '''
  do this
  '''
}
```

```shell
# In nextflow.config, at the root folder of the Nextflow pipeline:
process {
    withName:foo {
        container = 'dx://project-xxxx:file-yyyy'
    }
}
```

{% hint style="info" %}
When accessing a Platform project, a Nextflow pipeline job needs `VIEW` or higher access to the project.
{% endhint %}

**Option B**: Within a Nextflow pipeline script, you can also reference a Docker image by using its [full image name](https://docs.docker.com/engine/reference/commandline/tag/#description). Use this name within a path that's in the following format: `project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>`.

An example:

```shell
# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'

# In the Nextflow pipeline script:
process bar {
  container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'

  '''
  do this
  '''
}
```

File extensions are not necessary, and `project-xxxx` is the project where the Nextflow pipeline executable was built and is executed. For `/.cached_docker_images`, substitute the name of the folder in which these images have been stored. An exact `<version>` reference must be included - `latest` is not an accepted tag in this context.

{% hint style="info" %}
At Nextflow pipeline executable runtime:

1. If no image is found at the path provided, the Nextflow pipeline job attempts to pull the Docker image from the remote external registry, based on the image name. This pull attempt requires internet access.
2. When the version is referenced as `latest`, or when no version tag is provided, the Nextflow pipeline job attempts to look up the digest of the image's `latest` reference in the external Docker repository, and uses it to search for the corresponding tarball on the Platform. This digest lookup requires internet access. If no digest is found, or if there is no internet access, the execution fails.
   {% endhint %}

Here are examples of tarball file object paths and names, as constructed from image names and version tags:

| Image Name                    | Version Tag        | Tarball File Object Path and Name                                    |
| ----------------------------- | ------------------ | -------------------------------------------------------------------- |
| `quay.io/biocontainers/tabix` | `1.11--hdfd78af_0` | `project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0`   |
| `python`                      | `3.9-slim`         | `project-xxxx:/.cached_docker_images/python/python_3.9-slim`         |
| `python`                      | `latest`           | Nextflow pipeline job attempts to pull from remote external registry |
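The path construction in the first table row can be sketched in shell (the project ID and cache folder are the document's placeholders):

```shell
# Sketch: deriving the cached-tarball path from an image name and version tag
image="quay.io/biocontainers/tabix"
tag="1.11--hdfd78af_0"
name="${image##*/}"    # strip registry and namespace, leaving "tabix"
path="project-xxxx:/.cached_docker_images/${name}/${name}_${tag}"
echo "$path"
# → project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
```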

**Option C**: You can also reference Docker images in pipeline scripts by digest - for example, `<image_name>@sha256:XYZ123…`. As with Option B, file extensions are not necessary, and the tarball is searched for in the Docker cache folder of `project-xxxx`, the project where the Nextflow pipeline executable was built and is executed. When referring to a tarball file on the Platform using this method, the file must have an object property `image_digest` assigned to it. A typical format would be `"image_digest":"<IMAGE_DIGEST_HERE>"`.

An example:

```shell
# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'

# In the Nextflow pipeline script:
process bar {
  container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
  '''
  do this
  '''
}
```

### Nextflow Input Parameter Type Conversion to DNAnexus Executable Input Parameter Class

Based on the input parameter's type and format (when applicable) defined in the corresponding [`nextflow_schema.json` file](https://github.com/nf-core/tools/blob/main/nf_core/pipeline-template/nextflow_schema.json), each parameter is assigned to the corresponding class ([ref1](https://documentation.dnanexus.com/developer/api/running-analyses/job-input-and-output#input), [ref2](https://documentation.dnanexus.com/developer/api/introduction-to-data-object-classes)).

| <p>From: Nextflow Input Parameter<br>(defined in <code>nextflow_schema.json</code>) Type</p> | Format         | To: DNAnexus Input Parameter Class |
| --------------------------------------------------------------------------------------------- | -------------- | ---------------------------------- |
| string                                                                                        | file-path      | file                               |
| string                                                                                        | directory-path | string                             |
| string                                                                                        | path           | string                             |
| string                                                                                        | NA             | string                             |
| integer                                                                                       | NA             | int                                |
| number                                                                                        | NA             | float                              |
| boolean                                                                                       | NA             | boolean                            |
| object                                                                                        | NA             | hash                               |

#### File Input as String or File Class

As a pipeline developer, you can specify a file input variable as `{"type":"string", "format":"file-path"}` or `{"type":"string", "format":"path"}`, which is assigned the `file` or `string` class, respectively. When running the executable, you specify the value using a path format that depends on the class (file or string) of the input parameter. See the [Formats of Path to File, Folder or Wildcards section](#formats-of-path-to-file-folder-or-wildcards) for the acceptable path format for each class.

#### Converting a URL path to a String

When converting a file reference from a URL format to a String, you use the method `toUriString()`. An example of a URL format would be `dx://project-xxxx:/path/to/file` for a DNAnexus URI. The method `toURI().toString()` does not give the same result because `toURI()` removes the context ID, such as `project-xxxx`, and `toString()` removes the scheme, such as `dx://`. More information about the Nextflow methods is available in the [Nextflow Opening Files documentation](https://docs.seqera.io/nextflow/working-with-files#reading-a-file-line-by-line).

### Managing intermediate files and publishing outputs

#### Pipeline Output Setting Using `output:` Block and `publishDir`

All files generated by a Nextflow job tree are stored in its session's corresponding `workDir`, the path where temporary results are kept. On DNAnexus, when the Nextflow pipeline job is run with `"preserve_cache=true"`, the `workDir` is set to `project-xxxx:/.nextflow_cache_db/<session_id>/work/`, where `project-xxxx` is the project in which the job ran; you can follow this path to access all preserved temporary results. Access to these results is useful for investigating detailed pipeline progress and for resuming job runs during pipeline development.

When the Nextflow pipeline job is run with `"preserve_cache=false"` (the default), temporary files are stored in the job's [temporary workspace](https://documentation.dnanexus.com/developer/api/running-analyses#temporary-workspaces), which is destroyed when the head job enters a terminal state - "done", "failed", or "terminated". Since many of these files are intermediate inputs and outputs passed between processes, and are expected to be cleaned up once the job is completed, running with `"preserve_cache=false"` helps reduce project storage costs for files that are not of interest. It also saves you from having to remember to clean up temporary files.

To save the final results of interest, and to display them as the Nextflow pipeline executable's output, declare the files in each process's `output:` block and use Nextflow's optional [`publishDir`](https://docs.seqera.io/nextflow/reference/process#publishdir) directive to publish them.

This makes the published output files available as the Nextflow pipeline head job's [output](https://documentation.dnanexus.com/developer/api/running-analyses/job-input-and-output#output), under the executable's formally defined placeholder output parameter, `published_files`, as `array:file` class. Then the files are organized under the relative folder structure assigned via `publishDir`. This works for both `"preserve_cache=true"` and `"preserve_cache=false"`. Only the `"copy"` publish mode is supported on DNAnexus.

#### Values of `publishDir`

At pipeline development time, the valid value of `publishDir` can be:

* A local path string - for example, `publishDir path: './path/to/nf/publish_dir/'`.
* A dynamic string value defined via a pipeline input parameter such as `params.outdir`, where `outdir` is a string-class input. This allows pipeline users to set parameter values at runtime - for example, `publishDir path: "${params.outdir}/some/dir/"`, `"./some/dir/${params.outdir}/"`, or `"./some/dir/${params.outdir}/some/dir/"`.
  * When `publishDir` is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing `publishDir` as a valid relative path.

Find an example of how to construct output paths for an nf-core pipeline job tree at run time in the [FAQ](#faq).

{% hint style="info" %}
`publishDir` is NOT supported on DNAnexus when assigned an absolute path starting at root (`/`), such as `/path/to/nf/publish_dir/`. If an absolute path is defined for `publishDir`, no output files are returned in the job's output parameter `published_files`.
{% endhint %}
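For instance, a process that publishes its results under a runtime-selected relative output folder could look like the following sketch (the process name and output file are hypothetical; `copy` is the publish mode supported on DNAnexus):

```
process foo {
  publishDir path: "${params.outdir}/qc/", mode: 'copy'

  output:
  path 'report.html'

  '''
  do this
  '''
}
```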

### Queue Size Configuration

The `queueSize` option is part of Nextflow's executor [configuration](https://docs.seqera.io/nextflow/reference/config#executor). It defines how many tasks the executor handles in parallel. On DNAnexus, this is the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration assigns a value to `queueSize`, it overrides the default. If the value exceeds the upper limit (1000) on DNAnexus, the root job errors out. See the Nextflow executor [configuration](https://docs.seqera.io/nextflow/reference/config#executor) page for examples.
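For example, a sketch of an executor scope that raises the number of parallel subjobs (the value is illustrative; it must not exceed 1000 on DNAnexus):

```
# In a nextflow configuration file:
executor {
    queueSize = 20
}
```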

### Instance Type Determination

#### Head job instance type determination

The head job of the job tree defaults to running on instance type `mem2_ssd1_v2_x4` in AWS regions and `azure:mem2_ssd1_x4` in Azure regions. You can change the head job to a different instance type, but this is not recommended. The head job executes and monitors the subjobs; changing its instance type does not affect the computing resources available to subjobs, where most of the heavy computation takes place (see below for where to configure instance types for Nextflow processes). Changing the head job's instance type may be necessary only if it runs out of memory or disk space when staging input files, or when collecting and uploading pipeline output files to the project.

#### Subjob instance type determination

Each subjob's instance type is determined based on the profile information provided in the Nextflow pipeline script. Specify required instances by [instance type name](https://documentation.dnanexus.com/developer/api/running-analyses/instance-types#available-instance-types) via Nextflow's [`machineType`](https://docs.seqera.io/nextflow/reference/process#machinetype) directive (example below). Alternatively, use a set of system requirements such as [`cpus`](https://docs.seqera.io/nextflow/reference/process#process-cpus), [`memory`](https://docs.seqera.io/nextflow/reference/process#memory), [`disk`](https://docs.seqera.io/nextflow/reference/process#disk), and other resource parameters according to the official Nextflow documentation. The executor matches instance types to the minimal requirements described in the Nextflow pipeline profile using this logic:

1. Choose the cheapest instance type that satisfies the system requirements.
2. Use only SSD instance types.
3. All else being equal (price and instance specifications), prefer a [version 2 (v2)](https://documentation.dnanexus.com/developer/api/running-analyses/instance-types) instance type.

Order of precedence for subjob instance type determination:

1. The value assigned to `machineType` directive.
2. Values assigned to `cpus`, `memory`, and `disk` directives in their [configuration](https://docs.seqera.io/nextflow/config).

An example of specifying `machineType` by DNAnexus instance type name is provided below:

```
process foo {
  machineType 'mem1_ssd1_v2_x36'

  """
  <your script here>
  """
}
```
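
Alternatively, instead of naming an instance type directly, you can let the executor select one from resource directives. A minimal sketch; the process name and resource values are placeholders:

```
process bar {
  cpus 8
  memory '30 GB'
  disk '100 GB'

  """
  <your script here>
  """
}
```

With this configuration, the executor picks the cheapest SSD instance type providing at least 8 CPUs, 30 GB of memory, and 100 GB of disk.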

{% hint style="info" %}
Values assigned to the `cpus`, `memory`, and `disk` directives serve two purposes: they determine the instance type, and they can be recalled at runtime through Nextflow's [implicit `task` object variables](https://docs.seqera.io/nextflow/process#using-task-directive-values), such as `${task.cpus}`, `${task.memory}`, and `${task.disk}`, for task resource allocation.

* The selected instance type's actual resources (CPUs, memory, disk capacity) may differ from what the task is allocated. Instance type selection follows the precedence rules described above, while task allocation uses the values assigned in the configuration file.
* When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the Docker run command. For example, when `task.memory` is specified, it becomes the maximum amount of memory allowed for the container: `docker run --memory ${task.memory}`
{% endhint %}
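
For example, the directive values can be recalled inside the task script through the implicit `task` object. A sketch; the process name and tool invocation are placeholders:

```
process align {
  cpus 4
  memory '8 GB'

  """
  my_tool --threads ${task.cpus} --max-mem ${task.memory.toGiga()}g input.bam
  """
}
```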

## Nextflow Resume

### Preserve Run Caches and Resuming Previous Jobs

Nextflow's [`resume`](https://seqera.io/blog/demystifying-nextflow-resume/) feature skips processes that completed successfully and were cached in previous runs, so a new run can jump directly to downstream processes instead of starting from the beginning of the pipeline. By reusing cached progress, resume saves both time and compute costs. It is especially helpful for testing and troubleshooting while building and developing a Nextflow pipeline.

Nextflow uses a scratch storage area, called the working directory, to cache and preserve each task's temporary results. Each path within this directory is determined by:

* The session ID, a universally unique identifier (UUID) associated with the current execution.
* Each task's unique hash ID: a hash computed from the task's input values, input files, command-line string, container ID (such as a Docker image), conda environment, environment modules, and any executed scripts in the `bin` directory, when applicable.

You can use the Nextflow resume feature with the following Nextflow pipeline executable parameters:

* `preserve_cache`: Boolean type. Default value is `false`. When set to `true`, the run is cached in the current project for future resumes. For example:
  * ```shell
    dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true
    ```
  * This makes the Nextflow job tree preserve its cached information, as well as all temporary results, in the project where it is executed, under the following paths based on its session ID and each task's unique ID.
  * The session's cache directory, containing the location of the `workDir`, the session progress, job status, and configuration data, is saved to `project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar`, where `project-xxxx` is the project where the job tree is executed.
  * Each task's working directory is saved to `project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/`, where `<2digit>/<30characters>/` is the task's unique ID.
* `resume`: String type. Default value is an empty string, in which case the run begins without any cached data. When set to a session ID, the run resumes from what is cached for that session ID in the project. When set to `"true"` or `"last"`, the run determines the session ID of the latest valid execution in the current project and resumes from it. For example: `dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"`
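
The cache layout described above can be sketched as simple path construction. The session ID and task hash below are made-up placeholders; on the platform, these values come from the Nextflow run itself.

```python
# Illustrative sketch of where cached data lands when preserve_cache=true.
# The session ID and task hash are hypothetical examples.
session_id = "12345678-1234-1234-1234-123456789012"
task_hash = "ab1234567890abcdef1234567890abcd"  # 32-character task unique ID

cache_root = f"/.nextflow_cache_db/{session_id}"
cache_tar = f"{cache_root}/cache.tar"
# The task working directory splits the hash into <2digit>/<30characters>
task_workdir = f"{cache_root}/work/{task_hash[:2]}/{task_hash[2:]}/"

print(cache_tar)
print(task_workdir)
```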

{% hint style="info" %}

* When `preserve_cache=true`, the DNAnexus executor overrides the job tree's `workDir` to be `project-xxxx:/.nextflow_cache_db/<session_id>/work/`, where `project-xxxx` is the project where the job tree was executed.
* When a new job resumes a cached session (a `session_id` has a format like `12345678-1234-1234-1234-123456789012`), it not only resumes from where the cache left off but also shares the same `session_id` as the session it resumes. If the new job is itself being cached, it writes its temporary results to the same session's `workDir` and generates a new cache directory (`cache.tar`) with the latest cache information.
* Many Nextflow job trees can share the same session ID, write to the same `workDir` path, and create their own `cache.tar`, but only the cache from the latest job that ends in the "done" or "failed" state is preserved in the project.
* When the head job enters a terminal state, such as "failed" or "terminated", for a reason not caused by the executor, no cache directory is preserved, even if the job was run with `preserve_cache=true`, and subsequent jobs cannot resume from that run. This can happen when a job tree fails due to [exceeding a cost limit](https://documentation.dnanexus.com/developer/apps/error-information#CostLimitExceeded) or when a user [terminates a job](https://documentation.dnanexus.com/developer/api/running-analyses/applets-and-entry-points#api-method-job-xxxx-terminate) in the job tree.
{% endhint %}

Below are four possible scenarios and the recommended use cases for `-i resume`:

| Scenarios   | Parameters                                                               | Use Cases                                                                                                                                                                    | Note                                                                |
| ----------- | ------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| 1 (default) | `resume=""` (empty string) and `preserve_cache=false`                    | Production data processing. Most high volume use cases                                                                                                                       |                                                                     |
| 2           | `resume=""` (empty string) and `preserve_cache=true`                     | <p>Pipeline development. Only happens for the first few pipeline tests.<br><br>During development, it is useful to see all intermediate results in <code>workDir</code>.</p> | Only up to 20 Nextflow sessions can be preserved per project.       |
| 3           | `resume=<session_ID>` \| `"true"` \| `"last"` and `preserve_cache=false` | Pipeline development. Pipeline developers can investigate the job workspace with `--delay_workspace_destruction` and `--ssh`                                                 |                                                                     |
| 4           | `resume=<session_ID>` \| `"true"` \| `"last"` and `preserve_cache=true`  | Pipeline development. Only happens for the first few tests.                                                                                                                  | Only 1 job with the same `<session_ID>` can run at each time point. |

### Cache Preserve Limitations and Cleaning Up `workDir`

To save on storage costs, clean up the `workDir` periodically. At most 20 sessions can be preserved in a DNAnexus project. If you exceed the limit, the job fails with the following error message:

"The number of preserved sessions is already at the limit (N=20) and `preserve_cache` is true. Remove the folders in `<project-id>:/.nextflow_cache_db/` to be under the limit, if you want to preserve the cache of this run."

To clean up all preserved sessions in a project, delete the entire `.nextflow_cache_db` folder. To clean up a specific session's cache, delete the corresponding `.nextflow_cache_db/<session_id>/` folder. To delete a folder in the UI, follow the documentation on [deleting objects](https://documentation.dnanexus.com/getting-started/key-concepts/projects#deleting-objects). To delete a folder in the CLI, run:

```shell
dx rm -r project-xxxx:/.nextflow_cache_db/              # cleanup ALL sessions caches
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session's cache
```

Be aware that deleting an object in the UI or with the CLI `dx rm` command cannot be undone. Once a session's work directory is deleted or moved, subsequent runs cannot resume from that session.

For each session, only one job at a time can resume the session's cached results and preserve its own progress to that session. Multiple jobs can resume and preserve caches concurrently if each job preserves a different session. Similarly, multiple jobs can resume the same session concurrently as long as at most one of them preserves its progress to that session.

## Nextflow's `errorStrategy`

Nextflow's [`errorStrategy`](https://docs.seqera.io/nextflow/reference/process#process-error-strategy) directive allows you to define how the Nextflow executor manages an error condition at the process level. When a process returns an error status, by default the process and all other pending processes stop immediately (the default `errorStrategy` is `terminate`), forcing the entire pipeline execution to terminate.

The Nextflow executor supports four error strategy options: `terminate`, `finish`, `ignore`, and `retry`. Below is a table of behaviors for each strategy. The "all other subjobs" referenced in the last column are subjobs that have not yet entered their terminal states.

| `errorStrategy` | Subjob Error                                                                                                                                                                       | Head Job                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | All Other Subjobs                                                                                                                                                                       |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `terminate`     | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"terminate"</code><br><code>"nextflow\_errored\_subjob":"self"</code><br>- Ends in "failed" state immediately</p> | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"terminate"</code><br><code>"nextflow\_errored\_subjob":"job-xxxx"</code><br><code>"nextflow\_terminated\_subjob":"job-yyyy, job-zzzz"</code><br>where <code>job-xxxx</code> is the errored subjob, and <code>job-yyyy</code>, <code>job-zzzz</code> are other subjobs terminated due to this error.<br>- Ends in "failed" state immediately, with error message: "Job was terminated by Nextflow with <code>terminate</code> errorStrategy for <code>job-xxxx</code>, check the job log to find the failure."</p> | End in "failed" state immediately.                                                                                                                                                      |
| `finish`        | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"finish"</code><br><code>"nextflow\_errored\_subjob":"self"</code><br>- Ends in "done" state immediately</p>      | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"finish"</code><br><code>"nextflow\_errored\_subjob":"job-xxxx, job-2xxx"</code><br>where <code>job-xxxx</code> and <code>job-2xxx</code> are errored subjobs.<br>- No new subjobs created after error.<br>- Ends in "failed" state eventually, after other subjobs enter terminal states, with error message: "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure."</p>                                                                                                  | <p>- Keep running until terminal state.<br>- If error occurs in any, <code>finish</code> errorStrategy is applied (ignoring other error strategies), per Nextflow default behavior.</p> |
| `retry`         | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"retry"</code><br><code>"nextflow\_errored\_subjob":"self"</code><br>- Ends in "done" state immediately</p>       | <p>- Spins off a new subjob to retry the errored job, named <code>\<name> (retry: \<RetryCount>)</code>.<br>- Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").</p>                                                                                                                                                                                                                                                                                                                                                                   | <p>- Keep running until terminal state.<br>- If error occurs, their own <code>errorStrategy</code> is applied.</p>                                                                      |
| `ignore`        | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"ignore"</code><br><code>"nextflow\_errored\_subjob":"self"</code><br>- Ends in "done" state immediately</p>      | <p>- Job properties set with:<br><code>"nextflow\_errorStrategy":"ignore"</code><br><code>"nextflow\_errored\_subjob":"job-1xxx, job-2xxx"</code><br>- Shows "subjobs <code>\<job-1xxx></code>, <code>\<job-2xxx></code> runs into Nextflow process errors' ignore errorStrategy were applied" at end of job log.<br>- Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").</p>                                                                                                                                                          | <p>- Keep running until terminal state.<br>- If error occurs, their own <code>errorStrategy</code> is applied.</p>                                                                      |

When more than one `errorStrategy` directive applies within a pipeline job tree, the following rules apply, depending on which `errorStrategy` is triggered first:

* When `terminate` is the first `errorStrategy` triggered in a subjob, all other ongoing subjobs immediately enter the "failed" state.
* When `finish` is the first `errorStrategy` triggered in a subjob, the `finish` strategy is also applied to any error reached in the remaining ongoing subjobs, ignoring any other error strategies set in the pipeline's source code or configuration.
* When `retry` is the first `errorStrategy` triggered in a subjob, any `terminate`, `finish`, or `ignore` strategy subsequently triggered in the remaining subjobs is applied to the corresponding subjob as configured.
* When `ignore` is the first `errorStrategy` triggered in a subjob, any `terminate`, `finish`, or `retry` strategy subsequently triggered in the remaining subjobs is likewise applied to the corresponding subjob.
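
Error strategies are set per process, either in the pipeline script or in configuration. A minimal `nextflow.config` sketch; the process name is a placeholder:

```
process {
  errorStrategy = 'finish'       // default for all processes

  withName: 'ALIGN' {
    errorStrategy = 'retry'      // retry this process on failure
    maxRetries = 2
  }
}
```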

Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable [error](https://documentation.dnanexus.com/developer/apps/error-information), such as `ExecutionError`, `UnresponsiveWorker`, `JMInternalError`, `AppInternalError`, `AppInsufficientResourceError`, or `JobTimeoutExceeded`, the subjob follows the `executionPolicy` determined for it and restarts itself. The restart does not begin again from the head job.

## FAQ

### My Nextflow job tree failed, how do I find where the errors are?

A: Find the errored subjob's job ID in the head job's `nextflow_errored_subjob` and `nextflow_errorStrategy` properties to determine which subjob failed and which `errorStrategy` was applied. To query these properties in the CLI, run the following commands:

```shell
$ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
job-yyyy
$ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
terminate
```

where `job-xxxx` is the head job's job ID.

After finding the errored subjob, investigate the job log on the **Monitor** page by accessing the URL `https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>`, where `jobID` is the subjob's ID, such as `job-yyyy`. Alternatively, watch the job log in the CLI with `dx watch job-yyyy`.

If `preserve_cache` was set to `true` when the Nextflow pipeline executable was started, inspect the cache `workDir`, such as `project-xxxx:/.nextflow_cache_db/<session_id>/work/`, to examine the run's intermediate results.
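
The jq lookups above can also be scripted. A sketch, assuming you have already captured the output of `dx describe job-xxxx --json`; the JSON content and job IDs below are hypothetical placeholders:

```python
import json

# Hypothetical output of `dx describe job-xxxx --json` for a failed head job
describe_output = """
{
  "id": "job-xxxx",
  "properties": {
    "nextflow_errorStrategy": "terminate",
    "nextflow_errored_subjob": "job-yyyy",
    "nextflow_terminated_subjob": "job-zzzz"
  }
}
"""

props = json.loads(describe_output)["properties"]
errored = props.get("nextflow_errored_subjob")
strategy = props.get("nextflow_errorStrategy")
print(f"Inspect {errored} (errorStrategy: {strategy}) with: dx watch {errored}")
```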

### What is the version of Nextflow that is used?

A: As of `dx-toolkit` version `v0.408.2`, the default Nextflow engine version is **25.10**. Find the exact version used in a given run by reading the log of the head job. Each built Nextflow executable is locked down to a specific Nextflow engine version. To build against an older engine version, see `--nextflow-version` in the [Advanced Building and Importing Pipelines](#advanced-building-and-importing-pipelines) table.

### What container runtimes are supported?

A: DNAnexus supports [Docker](https://docs.seqera.io/nextflow/container#docker) as the container runtime for Nextflow pipeline applets. It is recommended to set `docker.enabled=true` in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.

### My job hangs at the end of the analysis. What can I do to avoid this problem?

A: There are many possible causes for the head job becoming unresponsive. One known cause is the [trace report file](https://docs.seqera.io/nextflow/reports#trace-file) being written directly to a DNAnexus URI such as `dx://project-xxxx:/path/to/file`. To avoid this, pass `-with-trace path/to/tracefile` (using a local path string) to the Nextflow pipeline applet's `nextflow_run_opts` input parameter.

### Can I have an example of how to construct an output path when I run a Nextflow pipeline with `params.outdir`, `publishDir` and job-level destination?

Taking [nf-core/sarek (3.3.1)](https://github.com/nf-core/sarek/tree/3.3.1) as an example, start by reading the pipeline's logic:

1. The pipeline's [`publishDir`](https://github.com/nf-core/sarek/blob/3.3.1/conf/modules/modules.config#L17) is constructed with a prefix of the `params.outdir` variable followed by each task's name for each subfolder:\
   `publishDir = [ path: { "${params.outdir}/${...}" }, ... ]`
2. `params.outdir` is a [required input parameter](https://github.com/nf-core/sarek/blob/3.3.1/nextflow_schema.json#L14) to the pipeline, and the [default value of `params.outdir` is `null`](https://github.com/nf-core/sarek/blob/3.3.1/nextflow.config#L104). The user running the corresponding Nextflow pipeline executable must specify a value for `params.outdir` to:
   1. Meet the input requirement for executing the pipeline.
   2. Resolve the value of `publishDir`, with `outdir` as the leading path and each task's name as the subfolder name.

To specify a value of `params.outdir` for the Nextflow pipeline executable built from the `nf-core/sarek` pipeline script, you can use the following command:

```shell
dx run project-xxxx:applet-zzzz \
-i outdir=./local/to/outdir \   # assign "./local/to/outdir" to params.outdir
--brief -y
```

You can also set a job tree's output destination using [`--destination`](https://documentation.dnanexus.com/helpstrings-of-sdk-command-line-utilities#run) :

```shell
dx run project-xxxx:applet-zzzz \
-i outdir=./local/to/outdir \   # assign "./local/to/outdir" to params.outdir
--destination project-xxxx:/path/to/jobtree/destination/ \
--brief -y
```

This command constructs the final output paths as follows:

1. `project-xxxx:/path/to/jobtree/destination/` as the destination of the job tree's shared output folder.
2. `project-xxxx:/path/to/jobtree/destination/local/to/outdir` as the shared output folder of all tasks/processes/subjobs of this pipeline.
3. `project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name>` as the output folder of each specific task/process/subjob of this pipeline.
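
The path composition above can be sketched programmatically. All components are placeholders taken from the example command, and the task name is hypothetical:

```python
from pathlib import PurePosixPath

# Components from the example `dx run` command above (placeholders)
destination = PurePosixPath("/path/to/jobtree/destination")  # --destination
outdir = "local/to/outdir"                                   # -i outdir=...
task_name = "fastqc"                                         # hypothetical task

jobtree_folder = destination            # job tree's shared output folder
shared_outdir = destination / outdir    # shared folder for all tasks
task_folder = shared_outdir / task_name # folder for one specific task

print(jobtree_folder)  # /path/to/jobtree/destination
print(shared_outdir)   # /path/to/jobtree/destination/local/to/outdir
print(task_folder)     # /path/to/jobtree/destination/local/to/outdir/fastqc
```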

{% hint style="info" %}

1. This example builds on [Specifying A Nextflow Job Tree Output Folder](#specifying-a-nextflow-job-tree-output-folder) and [Managing intermediate files and publishing outputs](#managing-intermediate-files-and-publishing-outputs).
2. Not all Nextflow pipelines take `params.outdir` as an input, nor do all of them use `params.outdir` in `publishDir`. Read the Nextflow pipeline's source script for the actual usage and requirements of `params.outdir` and `publishDir`.
{% endhint %}

