Running Nextflow Pipelines
This tutorial demonstrates how to use Nextflow pipelines on DNAnexus by importing a Nextflow pipeline from a remote repository or building from local disk space.
This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, visit Nextflow Documentation.
On the DNAnexus Platform, you can import or build a Nextflow pipeline as a DNAnexus executable (i.e. app or applet) from a Nextflow pipeline script located in a remote repository or on your local disk. You can import, build, and run Nextflow pipelines using the graphical user interface (UI) or the command-line interface (CLI).
A Nextflow pipeline script is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable (a minimal example layout follows the list):
- (Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.
- (Optional, recommended) A nextflow_schema.json file. If this file is present when importing or building the executable, the input parameters it describes will be exposed at the executable level.
- (Optional) Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
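For example, a minimal hypothetical layout (the folder name and subfolder names other than main.nf, nextflow.config, and nextflow_schema.json are placeholders):

$ ls /path/to/my-pipeline
main.nf  nextflow.config  nextflow_schema.json  conf  modules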
To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project’s Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.

Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, "https://github.com/nextflow-io/hello". Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.
An example of the Import Pipeline/Workflow modal:

Note that the “Estimated Price” value shown here is only an example. The actual price depends on the pricing model and runtime of the import job.
Once you’ve provided the necessary information, click the Start Import button and the import process will start as a pipeline import job, in the project specified in the Import To field (default is the current project).
After you've launched the import job, you'll see a status message:

You can access information about the pipeline import job in the project’s Monitor tab:

Once the import is complete, you can find the imported pipeline executable as an applet. This is the output of the pipeline import job you previously ran:

You can find the newly created Nextflow pipeline applet, e.g. hello, in the project:
To import a Nextflow pipeline via the CLI, run the following command and specify the repository’s URL, for example, "https://github.com/nextflow-io/hello". You can also provide optional information, such as a repository tag and an import destination:
$ dx build --nextflow \
--repository https://github.com/nextflow-io/hello \
--destination project-xxxx:/applets/hello
Started builder job job-aaaa
Created Nextflow pipeline applet-zzzz
The dx-toolkit supports Nextflow pipeline applet building starting from version v0.338.0.
The billTo of your destination project must have the Nextflow feature enabled for Nextflow pipeline applet building. Contact Support for inquiries.

If the Nextflow pipeline is in a private repository, use the --git-credentials option to provide the path or unique ID of the credential file on the Platform. Read more about this here.

Once the pipeline import job has finished, it will generate a new Nextflow pipeline applet with applet ID applet-zzzz. Use dx run -h to get more information about running the applet:

$ dx run project-xxxx:/applets/hello -h
usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]
Applet: hello
hello
Inputs:
Nextflow options
Nextflow Run Options: [-inextflow_run_opts=(string)]
Additional run arguments for Nextflow (e.g. -profile docker).
Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
Additional top-level options for Nextflow (e.g. -quiet).
Additional pipeline parameters
Nextflow Pipeline Parameters: [-inextflow_pipeline_params=(string)]
Additional pipeline parameters for Nextflow. Must be preceded with double dash characters
(e.g. --foo, which can be accessed in the pipeline script using the params.foo identifier).
Docker Credentials: [-idocker_creds=(file)]
Docker credentials used to obtain private docker images.
Advanced Executable Development Options
Debug Mode: [-idebug=(boolean, default=false)]
Shows additional information in the job log. If true, the execution log messages from
Nextflow will also be included.
Resume: [-iresume=(string)]
Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
the sessionID, will resume the latest resumable session run by an applet with the same name
in the current project in the last 6 months.
Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
Enable storing pipeline cache and local working files to the current project. If true, local
working files and cache files will be uploaded to the platform, so the current session could
be resumed in the future
Outputs:
Published files of Nextflow pipeline: [published_files (array:file)]
Output files published by current Nextflow pipeline and uploaded to the job output
destination.
Log file of Nextflow pipeline: [nextflow_log (file)]
Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on local disk. For example, you may have a version of the nextflow-io/hello pipeline from GitHub on your local machine, stored in a directory named hello, which contains the following files:

$ pwd
/path/to/hello
$ ls
LICENSE README.md circle.yml main.nf nextflow.config
The folder structure should follow what is required for a Nextflow pipeline script folder described here.
To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:
$ dx build --nextflow /path/to/hello \
--destination project-xxxx:/applets2/hello
{"id": "applet-yyyy"}
The billTo of your destination project must have the Nextflow feature enabled for Nextflow pipeline applet building. Contact Support for inquiries.

This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory will be used.

A Nextflow pipeline applet has the type "nextflow" under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.
For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and see the Nextflow section for all arguments that are supported for building a Nextflow pipeline applet.

You can also create a Nextflow pipeline app from a Nextflow pipeline applet by running the command dx build --app --from applet-xxxx.

You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you created can be accessed by clicking on the Tools Library option in the Tools tab. Once you click on the applet or app, the Run Analysis tab will be displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.
To run the Nextflow pipeline applet (or app), use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:

$ dx run project-yyyy:applet-xxxx \
-idebug=false \
--brief -y
job-bbbb
You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:
# See subjobs in progress
$ dx find jobs --origin job-bbbb
* hello (done) job-bbbb
│ amy 2023-01-20 19:01:49 (runtime 0:05:03)
├── sayHello (4) (hello:nf_task_entry) (done) job-1111
│ amy 2023-01-20 19:05:59 (runtime 0:01:49)
├── sayHello (3) (hello:nf_task_entry) (done) job-2222
│ amy 2023-01-20 19:05:52 (runtime 0:01:43)
├── sayHello (2) (hello:nf_task_entry) (done) job-3333
│ amy 2023-01-20 19:05:43 (runtime 0:00:58)
└── sayHello (1) (hello:nf_task_entry) (done) job-4444
amy 2023-01-20 19:05:36 (runtime 0:01:03)
Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job uses the Nextflow executor and monitors the entire pipeline execution. Each subjob is responsible for a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).
To see the detailed logs of the head job and the subjobs, you can view each job’s DNAnexus log via the UI or the CLI.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, you can view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. You can also view the costs (when your account has permission) and resource usage of a job.

An example of the log of a head job:

An example of the log of a subjob:

From the CLI, you can use the dx watch command to check the status and view the log of the head job or each subjob.

Monitoring the head job:
# Monitor job in progress
$ dx watch job-bbbb
Watching job job-bbbb. Press Ctrl+C to stop watching.
* hello (done) job-bbbb
amy 2023-01-20 19:01:49 (runtime 0:05:03)
… [deleted]
2023-01-20 19:05:21 hello STDOUT dxpy/0.337.0 (Linux-5.4.0-1093-aws-x86_64-with-glibc2.29)
2023-01-20 19:05:23 hello STDOUT bash running (job ID job-bbbb)
2023-01-20 19:05:23 hello STDOUT =============================================================
2023-01-20 19:05:23 hello STDOUT === NF projectDir : /home/dnanexus/hello
2023-01-20 19:05:23 hello STDOUT === NF session ID : 9edf1bb1-87cb-4e18-8aae-ffe0995739e1
2023-01-20 19:05:23 hello STDOUT === NF log file : dx://project-GJ839q80QpbFg9BZ1Jk5J6Yg:/nextflow-job-bbbb.log
2023-01-20 19:05:23 hello STDOUT === NF command : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
2023-01-20 19:05:23 hello STDOUT =============================================================
2023-01-20 19:05:26 hello STDOUT N E X T F L O W ~ version 22.10.0
2023-01-20 19:05:28 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
2023-01-20 19:05:37 hello STDOUT [f7/7bd43f] Submitted process > sayHello (1)
2023-01-20 19:05:44 hello STDOUT [ce/fa2f92] Submitted process > sayHello (2)
2023-01-20 19:05:52 hello STDOUT [12/2ce677] Submitted process > sayHello (3)
2023-01-20 19:05:59 hello STDOUT [a4/e4dce6] Submitted process > sayHello (4)
2023-01-20 19:06:44 hello STDOUT Bonjour world!
2023-01-20 19:06:44 hello STDOUT
2023-01-20 19:08:14 hello STDOUT Ciao world!
2023-01-20 19:08:14 hello STDOUT
2023-01-20 19:08:44 hello STDOUT Hello world!
2023-01-20 19:08:44 hello STDOUT
2023-01-20 19:08:45 hello STDOUT Hola world!
2023-01-20 19:08:45 hello STDOUT
2023-01-20 19:08:45 hello STDOUT === Execution complete — cache and working files will not be resumable
2023-01-20 19:08:46 hello STDOUT uploading file: /home/dnanexus/out/nextflow_log/nextflow-job-bbbb.log -> /nextflow-job-bbbb.log
Monitoring a subjob:
# Monitor job in progress
$ dx watch job-cccc
Watching job job-cccc. Press Ctrl+C to stop watching.
sayHello (4) (hello:nf_task_entry) (done) job-cccc
amy 2023-01-20 19:05:59 (runtime 0:01:49)
… [deleted]
2023-01-20 19:08:10 sayHello (4) INFO Setting SSH public key
2023-01-20 19:08:17 sayHello (4) STDOUT dxpy/0.337.0 (Linux-5.4.0-1093-aws-x86_64-with-glibc2.29)
2023-01-20 19:08:18 sayHello (4) STDOUT bash running (job ID job-cccc)
2023-01-20 19:08:21 sayHello (4) STDOUT file-GP5bK9004412V5j3PXPQkG8K
2023-01-20 19:08:22 sayHello (4) STDOUT Hola world!
2023-01-20 19:08:24 sayHello (4) STDOUT file-GP5bK9j04410Z7KxF8b1zKJG
2023-01-20 19:08:27 sayHello (4) STDOUT file-GP5bKB804416gB19bv9By02f
2023-01-20 19:08:29 sayHello (4) STDOUT file-GP5bKBj04410Z7KxF8b1zKJy
2023-01-20 19:08:31 sayHello (4) STDOUT file-GP5bKF80441P89XJ95kfBvXJ
sayHello (4) (hello:nf_task_entry) (done) job-cccc
amy 2023-01-20 19:05:59 (runtime 0:01:49)
Output: exit_code = 0
If you have the head job's ID, you can also use dx find executions --origin job-xxxx to find its subjobs.

The Nextflow pipeline execution is launched as a job tree, with one head job running the Nextflow executor and multiple subjobs, each running a single process. Throughout the pipeline’s execution, the head job remains in the “running” state, and it runs on an on-demand instance so that the job tree’s execution is not interrupted by spot instance interruptions.
DNAnexus supports the Docker container engine for the pipeline execution environment. The pipeline developer may refer to a public or a private Docker repository. When the pipeline references a private Docker repository, provide your Docker credential file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree. Syntax of a private Docker credential file:
{
"docker_registry": {
"registry": "url-to-registry",
"username": "name123",
"token": "12345678"
}
}
For privacy reasons, it is encouraged to save this credential file in a separate project that only you have permission to access.
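For example, a minimal sketch of supplying the credential file at launch (the project names, file name, and applet path are placeholders):

# Keep the credential file in a project that only you can access
$ dx upload docker_credentials.json --destination project-creds:/docker_credentials.json

# Pass it to the Nextflow pipeline executable as the docker_creds file input
$ dx run project-xxxx:/applets/hello \
    -idocker_creds=project-creds:/docker_credentials.json \
    -y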
Below are all possible ways that you can specify an input value at build time and runtime. They are listed in order of precedence (items listed first have greater precedence and override items listed further down the list):

1. Executable runtime (app or applet)
   1. DNAnexus Platform app or applet input. This is available both in the CLI (example command: dx run applet-xxxx -ireads_fastqgz="project-xxxx:file-yyyy") and in the UI. reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).
      - When the input parameter expects a file, you need to specify the value in a format based on the class of the input parameter. When the input is of the “file” class, use a DNAnexus qualified ID (i.e. an absolute path to the file object, such as “project-xxxx:file-yyyy”); when the input is of the “string” class, use the DNAnexus URI (“dx://project-xxxx:/path/to/file”). See the table below for full descriptions of PATH formatting.
      - You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is a file-class input parameter, while fasta_fai is a string-class input parameter. You would therefore use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.
      - The DNAnexus object class of each input parameter is based on the “type” and “format” specified in the pipeline’s nextflow_schema.json, when it exists. See the additional documentation here to understand how a Nextflow input parameter’s type and format (when applicable) convert to an app or applet input class.
      - It is recommended to always specify input values via the app/applet inputs. The Platform validates the input class and existence before the job is created.
      - Note that all inputs of a Nextflow pipeline executable are set as “optional” inputs. This gives users the flexibility to specify inputs via other means.
   2. Nextflow pipeline command line input parameter (i.e. nextflow_pipeline_params). This is a placeholder "string" class input, generated for any Nextflow pipeline executable at the time it is built.
      - CLI example: dx run applet-xxxx -inextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" follows the "--something value" pattern of Nextflow input specification referenced here.
      - Because nextflow_pipeline_params is a string type parameter, when passing a file path use the DNAnexus URI format if the file is stored on DNAnexus.
   3. Nextflow options parameter (i.e. nextflow_run_opts). This is a placeholder string type input, available for any Nextflow pipeline executable once it is built. This is available both in the CLI and the UI.
      - CLI example: dx run applet-xxxx -inextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter following the Nextflow run options pattern, specifying a preset input configuration.
2. Pipeline source code:
   1. nextflow_schema.json
      - Pipeline developers may specify default values of inputs in the nextflow_schema.json file.
      - If an input parameter is of Nextflow’s string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
   2. nextflow.config
      - Pipeline developers may specify default values of inputs in the nextflow.config file.
      - Pipeline developers may specify a default profile value using --profile <value> when building the executable, e.g. dx build --nextflow --profile test.
   3. main.nf, sourcecode.nf
      - Pipeline developers may specify default values of inputs in the Nextflow source code files (*.nf).
      - If an input parameter is of Nextflow’s string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
# Query for the class of each input parameter
$ dx run project-yyyy:applet-xxxx --help
usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]
Applet: example_applet
example_applet
Inputs:
…
fasta: [-ifasta=(file)]
…
fasta_fai: [-ifasta_fai=(string)]
…
# Assign values of the parameter based on the class of the parameter
$ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"
While you can specify a file input parameter’s value in different places as seen above, the valid PATH format referring to the same file differs depending on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable’s input parameter. Examples are given below.

| Scenarios | Valid PATH format |
| --- | --- |
| App or applet input parameter of class file object; CLI/API level (e.g. dx run --destination PATH) | DNAnexus qualified ID (i.e. absolute path to the file object). E.g. (file): project-xxxx:file-yyyy, project-xxxx:/path/to/file. E.g. (folder): project-xxxx:/path/to/folder/ |
| App or applet input parameter of class string; Nextflow configuration and source code files (e.g. nextflow_schema.json, nextflow.config, main.nf, sourcecode.nf) | DNAnexus URI. E.g. (file): dx://project-xxxx:/path/to/file. E.g. (folder): dx://project-xxxx:/path/to/folder/. E.g. (wildcard): dx://project-xxxx:/path/to/wildcard_files |
When launching a DNAnexus job, you can specify a job-level output destination (e.g. project-xxxx:/destination/) using the platform-level optional parameter in the UI or the CLI. In addition, when publishDir is specified in the pipeline, each output file will be located at <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned per Nextflow script process. Read more about the publishDir output specification setting here.
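For example, a minimal sketch (hypothetical applet path, destination, and publishDir value) of how the job-level destination and publishDir combine:

# Suppose a process declares publishDir 'aligned/' (a relative path) and the job
# is launched with an explicit output destination:
$ dx run project-xxxx:/applets/hello --destination project-xxxx:/results/run1/ -y

# Published files then land under:
#   project-xxxx:/results/run1/aligned/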
There are additional options for dx build --nextflow:

| Options | Class | Description |
| --- | --- | --- |
| --profile PROFILE | string | Set default profile for the Nextflow pipeline executable. |
| --repository REPOSITORY | string | Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote. |
| --repository-tag TAG | string | Specifies a tag for the Git repository. Can be used only with --repository. |
| --git-credentials GIT_CREDENTIALS | file | Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found here. |
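For example, a sketch combining several of these options (the repository tag, profile name, and destination are placeholders):

$ dx build --nextflow \
    --repository https://github.com/nextflow-io/hello \
    --repository-tag v1.1 \
    --profile test \
    --destination project-xxxx:/applets/hello-v1.1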
Use dx build --help for more information.

When the Nextflow pipeline to be imported is from a private repository, you must provide a file containing the credential used to access the repository:
providers {
github {
user = 'username'
password = 'ghp_xxxx'
}
}
To protect the credential, it is encouraged to save this credential file in a separate project that only you have permission to access.
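For example, a minimal sketch of storing the credential file on the Platform and using it during import (the repository URL, project names, and file name are placeholders):

# Keep the credential file in a project that only you can access
$ dx upload git_credentials.txt --destination project-creds:/git_credentials.txt

# Reference the uploaded credential file when building from the private repository
$ dx build --nextflow \
    --repository https://github.com/my-org/private-pipeline \
    --git-credentials project-creds:/git_credentials.txt \
    --destination project-xxxx:/applets/private-pipeline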
Based on the input parameter’s type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter will be assigned to the corresponding class (ref1, ref2).

| Nextflow Input Parameter Type (defined in nextflow_schema.json) | Format | DNAnexus Input Parameter Class |
| --- | --- | --- |
| string | file-path | file |
| string | directory-path | string |
| string | path | string |
| string | NA | string |
| integer | NA | int |
| number | NA | float |
| boolean | NA | boolean |
| object | NA | hash |
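As an illustration, here is a hypothetical fragment of a nextflow_schema.json (the parameter names are made up) together with the classes these entries would map to:

$ cat nextflow_schema.json
{
    "properties": {
        "reads_fastqgz": { "type": "string", "format": "file-path" },
        "outdir": { "type": "string", "format": "directory-path" },
        "min_length": { "type": "integer" }
    }
}
# reads_fastqgz -> file class, outdir -> string class, min_length -> int class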
As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which will be assigned to the "file" or "string" class, respectively. When running the executable, you will use a specific PATH format to specify the value based on the class (file or string) of the executable’s input parameter. See the documentation here for the acceptable PATH format for each class.

All files generated by a Nextflow job tree will be stored in its session’s corresponding workDir (i.e. the path where the temporary results are stored). On DNAnexus, when the Nextflow pipeline job is run with preserve_cache=true, the workDir is set to the path project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job took place, and you can follow that path to access all preserved temporary results. Being able to access these results is useful for investigating detailed pipeline progress, and for resuming job runs during pipeline development. More info about workDir is described here.

However, when the Nextflow pipeline job is run with preserve_cache=false (the default), temporary files are stored in the job’s temporary workspace, which is destroyed once the head job enters a terminal state (i.e. “done”, “failed”, or “terminated”). Since many of these files are intermediate inputs/outputs passed between processes and are expected to be cleaned up after the job is completed, running with preserve_cache=false helps reduce project storage costs for files that are not of interest, and also saves you from having to remember to clean up all temporary files.

To save the final results of interest, and to display them as the Nextflow pipeline executable’s output, declare output files under the script’s output: block and use Nextflow’s optional publishDir directive to publish them. The published files become the Nextflow pipeline head job’s output, under the executable’s formally defined placeholder output parameter published_files (class array:file). The files are organized under the relative folder structure assigned via publishDir. This works for both preserve_cache=true and preserve_cache=false. Only the “copy” publish mode is supported on DNAnexus.

At pipeline development time, the valid value of publishDir can be:
- A local path string, e.g. publishDir path: './path/to/nf/publish_dir/'.
- A dynamic string value defined via a pipeline input parameter (e.g. params.outdir, where outdir is a string-class input), allowing pipeline users to determine the value at runtime. For example, publishDir path: '${params.outdir}/some/dir/', './some/dir/${params.outdir}/', or './some/dir/${params.outdir}/some/dir/'.
  - When publishDir is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing the publishDir value as a valid relative path.
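For example, a minimal sketch (hypothetical process and parameter names) of a dynamic publishDir driven by a string-class input, and the corresponding run command:

$ cat main.nf
params.outdir = 'results'
process align {
    publishDir "${params.outdir}/aligned/", mode: 'copy'
    output:
    path 'aligned.bam'
    """
    <your script here>
    """
}

# At runtime, the launching user supplies a valid relative path
$ dx run applet-xxxx -ioutdir="run42/output" -y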
publishDir will NOT work on DNAnexus when assigned an absolute path (e.g. /path/to/nf/publish_dir/, which starts at root (/)). If an absolute path is defined for the publishDir, no output files will be generated for the job’s output parameter published_files.

The queueSize option is part of Nextflow’s executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this is the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable’s head job. If the pipeline’s executor configuration has a value assigned to queueSize, it overrides the default value. If the value exceeds the upper limit (1000) on DNAnexus, the root job will error out. See the Nextflow executor configuration page for examples.
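For example, a minimal sketch (shown as the configuration file you would place in the pipeline folder) that raises the number of concurrently created subjobs:

$ cat nextflow.config
executor {
    queueSize = 20   // create up to 20 subjobs at a time (DNAnexus upper limit: 1000)
}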
The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change the head job to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs; changing its instance type will not affect the computing resources available to subjobs, where most of the heavy computation takes place (see below for where to configure instance types for Nextflow processes). Changing the instance type of the head job may be necessary only if it is running out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.
Each subjob’s instance type is determined based on the profile information provided in the Nextflow pipeline script. You can specify a required instance by instance type name via Nextflow’s machineType directive (example below), or using a set of system requirements (e.g. cpus, memory, disk, etc.) according to the official Nextflow documentation. The executor will choose the instance type that matches the minimal requirements described in the Nextflow pipeline profile using the following logic:
1. Choose the cheapest instance that satisfies the system requirements.
2. Use only SSD-type instances.
3. All other things being equal (price and instance specifications), prefer a version 2 (v2) instance type.

An example of specifying machineType by DNAnexus instance type name is provided below:

process foo {
machineType 'mem1_ssd1_v2_x36'
"""
<your script here>
"""
}
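Alternatively, a sketch of resource-based selection (the process name and resource values are placeholders); the executor picks the cheapest SSD instance type that satisfies these requirements:

process bar {
    cpus 8
    memory '32 GB'
    disk '200 GB'
    """
    <your script here>
    """
}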
Nextflow’s resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without needing to start from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.

Nextflow uses a scratch storage area for caching and preserving each task’s temporary results. This directory is called the “working directory”, and its path is defined by:
- The session id, a universally unique identifier (UUID) associated with the current execution.
- Each task’s unique hash ID: a hash composed of the task’s input values, input files, command line strings, container ID (e.g. Docker image), conda environment, environment modules, and executed scripts in the bin directory, when applicable.
You can utilize the Nextflow resume feature with the following Nextflow pipeline executable parameters:

preserve_cache
Boolean type. Default value is false. When set to true, the run will be cached in the current project for future resumes. For example:
- dx run applet-xxxx -ireads_fastqgz="project-xxxx:file-yyyy" -ipreserve_cache=true
- This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed, under the following paths, based on its session ID and each subjob’s unique ID.
- The session’s cache directory, containing information on the location of the workDir, the session progress, etc., is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.
- Each task’s working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is the task’s unique ID and project-xxxx is the project where the job tree is executed.

resume
String type. Default value is an empty string, in which case the run starts from scratch. When assigned a session id, the run resumes from what is cached for that session id in the project. When assigned "true" or "last", the run determines the session id that corresponds to the latest valid execution in the current project and resumes from it. For example:
- dx run applet-xxxx -ireads_fastqgz="project-xxxx:file-yyyy" -iresume="<session_id>"
Note that when preserve_cache=true, the DNAnexus executor overrides the workDir of the job tree to be project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job tree was executed.

When a new job is launched and resumes a cached session (a session_id may be formatted as, for example, 12345678-1234-1234-1234-123456789012), the new job not only resumes from where the cache left off, but also shares the same session_id with the cached session it resumes. When a new job makes progress in a session and the job is being cached, it writes temporary results to the same session’s workDir. This generates a new cache directory (cache.tar) with the latest cache information.

Many Nextflow job trees can share the same session ID, writing to the same workDir path and each creating its own cache.tar, but only the cache of the latest job that ends in the “done” or “failed” state will be preserved in the project.
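For example, a minimal sketch (hypothetical applet and input names) of a preserve-and-resume cycle during pipeline development:

# First test run: preserve the cache so the session can be resumed later
$ dx run applet-xxxx -ireads_fastqgz=project-xxxx:file-yyyy -ipreserve_cache=true -y

# After adjusting the pipeline, resume the latest resumable session in this project
$ dx run applet-xxxx -ireads_fastqgz=project-xxxx:file-yyyy -iresume=last -ipreserve_cache=true -y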
Below are four possible scenarios and the recommended use cases for resume:

| Scenario | Parameters | Use Cases | Note |
| --- | --- | --- | --- |
| 1 (default) | resume="" (empty string) and preserve_cache=false | Production data processing; most high-volume use cases | |
| 2 | resume="" (empty string) and preserve_cache=true | Pipeline development; typically only for the first few pipeline tests. During development it is useful to see all intermediate results in workDir. | Only up to 20 Nextflow sessions can be preserved per project. |
| 3 | resume=<session_ID>, "true", or "last" and preserve_cache=false | Pipeline development; pipeline developers can investigate the job workspace with --delay-workspace-destruction and --ssh | |
| 4 | resume=<session_ID>, "true", or "last" and preserve_cache=true | Pipeline development; typically only for the first few tests. | Only 1 job with the same <session_ID> can run at any point in time. |
It is good practice to frequently clean up the workDir to save on storage costs. The maximum number of sessions that can be preserved in a DNAnexus project is 20. If you exceed the limit, the job will generate an error with the following message: “The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Please remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run.”

To clean up all preserved sessions under a project, delete the entire .nextflow_cache_db folder. To clean up a specific session’s cached folder, delete the specific .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:

dx rm -r project-xxxx:/.nextflow_cache_db/ # clean up ALL session caches
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session’s cache
Note that deleting an object in the UI or with the CLI command dx rm cannot be undone. Once the session work directory is deleted or moved, subsequent runs will not be able to resume from that session.

For each session, only one job is allowed to resume the session’s cached results and preserve its own progress to that session. There is no limit on multiple jobs resuming and preserving multiple different sessions, as long as each job preserves a different session. There is also no limit on multiple jobs resuming the same session, as long as at most one of them is preserving its progress to the session.
Nextflow’s errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default, the process and the other pending processes stop immediately (i.e. errorStrategy terminate), which in turn forces the entire pipeline execution to be terminated. The Nextflow executor has four error strategy options: terminate, finish, ignore, and retry. Below is a table of behaviors for each strategy. Note that “all other subjobs” in the last column refers to subjobs that have not yet entered their terminal states.

| errorStrategy | Errored Subjob | Head Job | All Other Subjobs |
| --- | --- | --- | --- |
| terminate | Job properties set with: "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"self". Ends in “failed” state immediately. | Job properties set with: "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"job-xxxx", "nextflow_terminated_subjob":"job-yyyy, job-zzzz", where job-xxxx is the errored subjob and job-yyyy, job-zzzz are the other subjobs terminated due to this error. Ends in “failed” state immediately, with the error message “Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure”. | End in “failed” state immediately. |
| finish | Job properties set with: "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"self". Ends in “done” state immediately. | Job properties set with: "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"job-xxxx, job-2xxx", where job-xxxx and job-2xxx are the errored subjobs. Creates no new subjobs after the time of the error. Ends in “failed” state eventually, after other existing subjobs enter their terminal states, with the error message “Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure”. | Keep running until entering their terminal states. If an error occurs in any of these subjobs (e.g. job-2xxx), the finish errorStrategy is applied to that subjob because a finish errorStrategy was hit first, ignoring any other error strategies set in the pipeline’s source code or configuration, per Nextflow’s default behavior. |
| retry | Job properties set with: "nextflow_errorStrategy":"retry", "nextflow_errored_subjob":"self". Ends in “done” state immediately. | Spins off a new subjob which retries the errored job, with the job name <name> (retry: <RetryCount>), where <name> is the original subjob name and <RetryCount> is the order of this retry (e.g. retry: 1, retry: 2). Ends in a terminal state depending on the terminal states of the other currently existing subjobs; can be “done”, “failed”, or “terminated”. | Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob’s corresponding Nextflow process is applied. |
| ignore | Job properties set with: "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"self". Ends in “done” state immediately. | Job properties set with: "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"job-1xxx, job-2xxx". The end of the job log lists the subjob(s) that ran into Nextflow process errors where the ignore errorStrategy was applied. Ends in a terminal state depending on the terminal states of the other currently existing subjobs; can be “done”, “failed”, or “terminated”. | Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob’s corresponding Nextflow process is applied. |
When more than one errorStrategy directive applies within a pipeline job tree, the following rules take effect depending on which errorStrategy is triggered first:
- When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs will enter the “failed” state immediately.
- When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjob(s) will also apply the finish errorStrategy, ignoring any other error strategies set in the pipeline’s source code or configuration.
- When retry is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, those other errorStrategy directives will be applied to the corresponding subjobs.
- When ignore is the first errorStrategy directive to be triggered in a subjob, and any of the terminate, finish, or retry errorStrategy directives applies to the remaining subjob(s), that other errorStrategy will be applied to the corresponding subjob.
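For example, a sketch (hypothetical process name) of setting a per-process retry strategy in the pipeline source; retried subjobs appear with the name "<name> (retry: N)":

process align {
    errorStrategy 'retry'
    maxRetries 2
    """
    <your script here>
    """
}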
Independently of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as "ExecutionError", "UnresponsiveWorker", "JMInternalError", "AppInternalError", or "JobTimeoutExceeded", the subjob will follow the executionPolicy assigned to it and restart itself. It will not restart from the head job.

To investigate which subjob failed and which errorStrategy was applied, you can find the errored subjob’s job ID from the head job’s nextflow_errored_subjob and nextflow_errorStrategy properties. To query these errorStrategy-related properties in the CLI, you can run the following commands:

$ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
job-yyyy
$ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
terminate
where job-xxxx is the head job’s job ID.
Once you find the errored subjob, you can investigate its job log on the Monitor page by accessing the URL "https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>", where <jobID> is the subjob's ID (e.g. job-yyyy), or watch the job log in the CLI using dx watch job-yyyy.

If you had preserve_cache set to true when starting the Nextflow pipeline executable, you can also trace the cached workDir (e.g. project-xxxx:/.nextflow_cache_db/<session_id>/work/) and investigate the intermediate results of the run.

To find which Nextflow version was used, read the log of the head job. Each compiled Nextflow executor is locked to a specific version of Nextflow.
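For example, assuming the head job is job-bbbb, you could pull the version banner out of its log:

$ dx watch job-bbbb | grep "N E X T F L O W"
2023-01-20 19:05:26 hello STDOUT N E X T F L O W ~ version 22.10.0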