Running Nextflow Pipelines

This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building one from a local disk.

A license is required to create a DNAnexus app or applet from a Nextflow script folder. Contact DNAnexus Sales for more information.

This documentation assumes you already have a basic understanding of how to develop and run a pipeline. To learn more about Nextflow, consult the official Nextflow documentation.

To run a Nextflow pipeline on the DNAnexus Platform:

  1. Import the pipeline script from a remote repository or local disk.

  2. Convert the script to an app or applet.

  3. Run the app or applet.

You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx-toolkit.

Use the latest version of the dx-toolkit to take advantage of recent improvements and bug fixes.

As of dx-toolkit version v0.391.0, pipelines built using dx build --nextflow default to running on Ubuntu 24.04. To use Ubuntu 20.04 instead, override the default by specifying the release in --extra-args:

dx build --nextflow --extra-args='{"runSpec": {"release": "20.04"}}'

This documentation covers features available in dx-toolkit versions beginning with v0.378.0.

Quickstart

Pipeline Script Folder Structure

A Nextflow pipeline is structured as a folder of Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

  • (Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.

  • (Optional) A nextflow.config configuration file.

  • (Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file will be exposed as the built Nextflow pipeline applet's input parameters. See Specifying Input Values to a Nextflow Pipeline Executable below for more information on how the exposed parameters are used at run time.

  • (Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline. An example layout is sketched below.
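
A minimal sketch of such a folder, assuming a hypothetical modules/ subfolder and module script:

hello/
├── main.nf               # required: main pipeline script
├── nextflow.config       # optional: pipeline configuration
├── nextflow_schema.json  # optional, recommended: exposes input parameters
└── modules/              # optional: referenced via include / includeConfig
    └── say_hello.nf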

Importing a Nextflow Pipeline

Import via UI

To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project’s Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.

Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, "https://github.com/nextflow-io/hello". Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.

An example of the Import Pipeline/Workflow modal:

Once you’ve provided the necessary information, click the Start Import button. The import process runs as a pipeline import job in the project specified in the Import To field (the default is the current project).

After you've launched the import job, you'll see a status message "External workflow import job started" appear.

You can access information about the pipeline import job in the project’s Monitor tab:

Once the import is complete, you can find the imported pipeline executable as an applet. This is the output of the pipeline import job you previously ran:

You can find the newly created Nextflow pipeline applet - e.g. hello - in the project:

Import via CLI from a Remote Repository

$ dx build --nextflow \
  --repository https://github.com/nextflow-io/hello \
  --destination project-xxxx:/applets/hello

Started builder job job-aaaa
Created Nextflow pipeline applet-zzzz

All dx-toolkit versions beginning with v0.338.0 support converting Nextflow pipelines to apps or applets.

This documentation covers features available in dx-toolkit versions beginning with v0.370.0.

Once the pipeline import job has finished, it will generate a new Nextflow pipeline applet with an applet ID in the form applet-zzzz.

Use dx run -h to get more information about running the applet:

$ dx run project-xxxx:/applets/hello -h
usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]

Applet: hello

hello

Inputs:
 Nextflow options
  Nextflow Run Options: [-inextflow_run_opts=(string)]
        Additional run arguments for Nextflow (e.g. -profile docker).

  Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
        Additional top-level options for Nextflow (e.g. -quiet).

  Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
        (Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
        configuration set

  Script Parameters File: [-inextflow_params_file=(file)]
        (Optional) A file, in YAML or JSON format, for specifying input parameter values

 Advanced Executable Development Options
  Debug Mode: [-idebug=(boolean, default=false)]
        Shows additional information in the job log. If true, the execution log messages from
        Nextflow will also be included.

  Resume: [-iresume=(string)]
        Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
        the sessionID, will resume the latest resumable session run by an applet with the same name
        in the current project in the last 6 months.

  Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
        Enable storing pipeline cache and local working files to the current project. If true, local
        working files and cache files will be uploaded to the platform, so the current session could
        be resumed in the future

Outputs:
  Published files of Nextflow pipeline: [published_files (array:file)]
        Output files published by current Nextflow pipeline and uploaded to the job output
        destination.

Building from a Local Disk

$ pwd
/path/to/hello
$ ls
LICENSE         README.md       main.nf         nextflow.config
$ dx build --nextflow /path/to/hello \
  --destination project-xxxx:/applets2/hello
{"id": "applet-yyyy"}

This command will package the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and store the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory will be used.

A Nextflow pipeline applet will have the type “nextflow” under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.
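
As a quick check (a sketch, assuming the applet built above and that jq is installed), the type can be read from the applet's metadata:

$ dx describe project-xxxx:/applets2/hello --json | jq .types
[
  "nextflow"
]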

For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and find the Nextflow section for all arguments that are supported for building a Nextflow pipeline applet.

Building a Nextflow Pipeline App from a Nextflow Pipeline Applet

Running a Nextflow Pipeline Executable (App or Applet)

Running a Nextflow Pipeline Executable via UI

You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab will be displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.

Running a Nextflow Pipeline Applet via CLI

$ dx run project-yyyy:applet-xxxx \
  -i debug=false \
  --destination project-xxxx:/path/to/destination/ \
  --brief -y

job-bbbb
# See subjobs in progress
$ dx find jobs --origin job-bbbb
* hello (done) job-bbbb
│ amy 2023-09-20 14:57:58 (runtime 0:02:03)
├── sayHello (3) (hello:nf_task_entry) (done) job-1111
│   amy 2023-09-20 14:58:57 (runtime 0:00:45)
├── sayHello (1) (hello:nf_task_entry) (done) job-2222
│   amy 2023-09-20 14:58:52 (runtime 0:00:52)
├── sayHello (2) (hello:nf_task_entry) (done) job-3333
│   amy 2023-09-20 14:58:48 (runtime 0:00:53)
└── sayHello (4) (hello:nf_task_entry) (done) job-4444
    amy 2023-09-20 14:58:43 (runtime 0:00:50)

Monitoring Jobs

Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob is responsible for a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).

To see the detailed logs of the head job and the subjobs, you can monitor each job’s DNAnexus log via the UI or the CLI.

On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.

Monitoring in the UI

Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, you can view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. You can also view the costs (when your account has permission) and resource usage of a job.

An example of the log of a head job:

An example of the log of a subjob:

Monitoring in the CLI

Monitoring the head job:

# Monitor job in progress
$ dx watch job-bbbb
Watching job job-bbbb. Press Ctrl+C to stop watching.
* hello (done) job-bbbb
  amy 2023-09-20 14:57:58 (runtime 0:02:03)
... [deleted]
2023-09-20 14:58:29 hello STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:58:30 hello STDOUT bash running (job ID job-bbbb)
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:31 hello STDOUT === NF projectDir   : /home/dnanexus/hello
2023-09-20 14:58:31 hello STDOUT === NF session ID   : 0eac8f92-1216-4fce-99cf-dee6e6b04bc2
2023-09-20 14:58:31 hello STDOUT === NF log file     : dx://project-xxxx:/applets/nextflow-job-bbbb.log
2023-09-20 14:58:31 hello STDOUT === NF command      : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
2023-09-20 14:58:31 hello STDOUT === Built with dxpy : 0.358.0
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:34 hello STDOUT N E X T F L O W  ~  version 22.10.7
2023-09-20 14:58:35 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
2023-09-20 14:58:43 hello STDOUT [0a/6a81ca] Submitted process > sayHello (4)
2023-09-20 14:58:48 hello STDOUT [f5/87df8b] Submitted process > sayHello (2)
2023-09-20 14:58:53 hello STDOUT [4b/21374a] Submitted process > sayHello (1)
2023-09-20 14:58:57 hello STDOUT [f6/8c44f5] Submitted process > sayHello (3)
2023-09-20 14:59:51 hello STDOUT Hola world!
2023-09-20 14:59:51 hello STDOUT 
2023-09-20 14:59:51 hello STDOUT Ciao world!
2023-09-20 14:59:51 hello STDOUT 
2023-09-20 15:00:06 hello STDOUT Bonjour world!
2023-09-20 15:00:06 hello STDOUT 
2023-09-20 15:00:06 hello STDOUT Hello world!
2023-09-20 15:00:06 hello STDOUT 
2023-09-20 15:00:07 hello STDOUT === Execution completed — cache and working files will not be resumable
2023-09-20 15:00:07 hello STDOUT === Execution completed — upload nextflow log to job output destination project-xxxx:/applets/
2023-09-20 15:00:09 hello STDOUT Upload nextflow log as file: file-GZ5ffkj071zqZ9Qj22qv097J
2023-09-20 15:00:09 hello STDOUT === Execution succeeded — upload published files to job output destination project-xxxx:/applets/
* hello (done) job-bbbb
  amy 2023-09-20 14:57:58 (runtime 0:02:03)
  Output: -

Monitoring a subjob:

# Monitor job in progress
$ dx watch job-cccc
Watching job job-cccc. Press Ctrl+C to stop watching.
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
... [deleted]
2023-09-20 14:59:28 sayHello (1) STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:59:30 sayHello (1) STDOUT bash running (job ID job-cccc)
2023-09-20 14:59:33 sayHello (1) STDOUT file-GZ5ffQj047j3Vq7QX220Q5vQ
2023-09-20 14:59:34 sayHello (1) STDOUT Bonjour world!
2023-09-20 14:59:36 sayHello (1) STDOUT file-GZ5ffVQ047j2QXZ2ZkFx4YxG
2023-09-20 14:59:38 sayHello (1) STDOUT file-GZ5ffX0047j2QXZ2ZkFx4YxK
2023-09-20 14:59:41 sayHello (1) STDOUT file-GZ5ffXQ047jGYZ91x6KG32Jp
2023-09-20 14:59:43 sayHello (1) STDOUT file-GZ5ffY8047jF2PY3609JPBKB
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
Output: exit_code = 0

Advanced Options: Running a Nextflow Pipeline Executable (App or Applet)

Nextflow Execution on DNAnexus

Nextflow Execution Log File

Private Docker Repository

DNAnexus supports the Docker container engine for the Nextflow pipeline execution environment. The pipeline developer may reference a public Docker repository or a private one. When the pipeline references a private Docker repository, provide your Docker credential file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree.

Syntax of a private Docker credential:

{
  "docker_registry": {
    "registry": "url-to-registry",
    "username": "name123",
    "token": "12345678"
  }
}

For privacy reasons, it is encouraged to save this credential file in a separate project that only a limited set of users has permission to access.
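
For example (a sketch; the applet ID, restricted project, and credential file ID are placeholders), pass the credential file as the docker_creds input when launching the pipeline:

$ dx run project-yyyy:applet-xxxx \
  -i docker_creds=project-secure:file-cccc \
  --brief -y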

Nextflow Pipeline Executable Inputs and Outputs

Specifying Input Values to a Nextflow Pipeline Executable

Below are all the ways you can specify an input value at build time and run time. They are listed in order of precedence (items listed first have greater precedence and override items listed further down the list):

  1. Executable (app or applet) run time

    1. DNAnexus Platform app or applet input.

      • CLI example: dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy

      • You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of the file class, while fasta_fai is an input parameter of the string class. You will then use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.

      • It is recommended to always specify input values at the app/applet level. The platform validates the input class and existence before the job is created.

    2. Nextflow pipeline command line input parameter (i.e. nextflow_pipeline_params). This is an optional "string" class input, available for any Nextflow pipeline executable once it is built.

      • Because nextflow_pipeline_params is a string type parameter with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

    3. Nextflow options parameter (i.e. nextflow_run_opts). This is an optional "string" class input, available for any Nextflow pipeline executable once it is built.

    4. Nextflow parameter file (i.e. nextflow_params_file). This is an optional "file" class input, available for any Nextflow pipeline executable once it is built.

    5. Nextflow soft configuration override file (i.e. nextflow_soft_confs). This is an optional "array:file" class input, available for any Nextflow pipeline executable once it is built.

  2. Pipeline source code:

    1. nextflow_schema.json

      • Pipeline developers may specify default values of inputs in the nextflow_schema.json file.

      • If an input parameter is of Nextflow’s string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.

    2. nextflow.config

      • Pipeline developers may specify default values of inputs in the nextflow.config file.

      • Pipeline developers may specify a default profile value using --profile <value> when building the executable, e.g. dx build --nextflow --profile test.

    3. main.nf, sourcecode.nf

      • Pipeline developers may specify default values of inputs in the Nextflow source code file (*.nf).

      • If an input parameter is of Nextflow’s string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

# Query for the class of each input parameter
$ dx run project-yyyy:applet-xxxx --help
usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]

Applet: example_applet

example_applet

Inputs:
…
  fasta: [-ifasta=(file)]
…

  fasta_fai: [-ifasta_fai=(string)]
…


# Assign values of the parameter based on the class of the parameter
$ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"

Formats of PATH to File, Folder, or Wildcards

Scenarios and the valid PATH format for each:

  • App or applet input parameter class as file object, or CLI/API level (e.g. dx run --destination PATH): use a DNAnexus qualified ID (i.e. an absolute path to the file object).

    • E.g. (file): project-xxxx:file-yyyy, project-xxxx:/path/to/file

    • E.g. (folder): project-xxxx:/path/to/folder/

  • App or applet input parameter class as string, or Nextflow configuration and source code files (e.g. nextflow_schema.json, nextflow.config, main.nf, sourcecode.nf): use a DNAnexus URI.

    • E.g. (file): dx://project-xxxx:/path/to/file

    • E.g. (folder): dx://project-xxxx:/path/to/folder/

    • E.g. (wildcard): dx://project-xxxx:/path/to/wildcard_files

Specifying a Nextflow Job Tree Output Folder

Using an AWS S3 Bucket as a Work Directory for Nextflow Pipeline Runs

You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.

Step 1. Configure Your AWS Account to Trust the DNAnexus Platform as an OIDC Identity Provider
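
A rough sketch of this step (not the official procedure), assuming the issuer URL https://job-oidc.dnanexus.com/ and the audience value used in the example trust policy below, with a placeholder certificate thumbprint:

# Register the DNAnexus job OIDC issuer as an identity provider in your AWS account.
# The audience must match the jobTokenAudience configured in Step 3.
aws iam create-open-id-connect-provider \
  --url https://job-oidc.dnanexus.com/ \
  --client-id-list dx_nextflow_s3_scratch_token_aud \
  --thumbprint-list <issuer-certificate-thumbprint>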

Step 2. Configure an AWS IAM Role with the Proper Trust and Permissions Policies

Permissions Policy

The following example shows how to structure an IAM role's permission policy, to enable the role to use an S3 bucket - accessible via the S3 URI s3://my-nextflow-s3-workdir - as the work directory of Nextflow pipeline runs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-nextflow-s3-workdir",
        "arn:aws:s3:::my-nextflow-s3-workdir/*"
      ]
    }
  ]
}

Note in the above example:

  • The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.

  • The two entries in the "Resource" section enable the role to access the bucket accessible via the S3 URI s3://my-nextflow-s3-workdir, and all objects within it.

Trust Policy

The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/job-oidc.dnanexus.com/"
      },
      "Condition": {
        "StringEquals": {
          "job-oidc.dnanexus.com/:aud": "dx_nextflow_s3_scratch_token_aud",
          "job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
        }
      }
    }
  ]
}

Note in the above example:

  • To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).

  • To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).

  • Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, as accessible at job-oidc.dnanexus.com.

Step 3. Configure Your Nextflow Pipeline's Configuration File to Access the S3 Bucket

# In a nextflow configuration file:

aws { region = '<aws region>'}

dnanexus {
	workDir = '<S3 URI path>'
	jobTokenAudience = '<OIDC_audience_name>'
	jobTokenSubjectClaims = '<list of claims separated by commas>'
	iamRoleArnToAssume = '<arn of the role who is set with permission>'
}

Note in the above example:

  • workDir is the path to the bucket to be used as a work directory, in S3 URI format.

  • You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter, within an aws config scope.

Using Subject Claims to Control Bucket Access

When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:

Values of StringEquals: job-oidc.dnanexus.com/:sub, and which jobs can assume the role that enables bucket access:

  • project_id;project-xxxx : Any Nextflow pipeline jobs that are running in project-xxxx

  • launched_by;user-aaaa : Any Nextflow pipeline jobs that are launched by user-aaaa

  • project_id;project-xxxx;launched_by;user-aaaa : Any Nextflow pipeline jobs that are launched by user-aaaa in project-xxxx

  • bill_to;org-zzzz : Any Nextflow pipeline jobs that are billed to org-zzzz

# In a nextflow configuration file:
dnanexus {
	...
	jobTokenSubjectClaims = 'project_id,launched_by'
	...
}

Note that you must also, within the dnanexus config scope, set the value of iamRoleArnToAssume to that of the appropriate role:

# In a nextflow configuration file:
dnanexus {
	...
	iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
	...
}

Advanced Options: Building a Nextflow Pipeline Executable

Nextflow Pipeline Executable Permissions

  • External internet access ("network": ["*"]) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external docker registries at runtime.

$ dx build --nextflow /path/to/hello --extra-args \
    '{"access":{"network": [], "allProjects":"VIEW"}}'
...
{"id": "applet-yyyy"}

In this example, note:

  • "network": [] prevents jobs from accessing the internet.

Advanced Building and Importing Pipelines

There are additional options for dx build --nextflow:

  • --profile PROFILE (string): Set the default profile for the Nextflow pipeline executable.

  • --repository REPOSITORY (string): Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.

  • --repository-tag TAG (string): Specifies the tag for the Git repository. Can be used only with --repository.

  • --git-credentials GIT_CREDENTIALS (file): A file object containing the credentials needed to access a private Git repository (see Private Nextflow Pipeline Repository below).

  • --cache-docker (flag): Stores a container image tarball in the currently selected project in /.cached_dockerImages. Currently only the docker engine is supported. Incompatible with --remote.

  • --nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS (string): Custom pipeline parameters to be referenced when collecting the docker images.

  • --docker-secrets DOCKER_SECRETS (file): A dx file ID with credentials for a private docker repository.

Use dx build --help for more information.
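
For example (a sketch; the repository URL, tag, credentials file ID, and destination are hypothetical), several of these options can be combined when building from a private repository:

$ dx build --nextflow \
  --repository https://github.com/org-name/private-pipeline \
  --repository-tag v1.1 \
  --git-credentials project-xxxx:file-gggg \
  --profile test \
  --destination project-xxxx:/applets/private-pipeline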

Private Nextflow Pipeline Repository

When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:

providers {
  github {
    user = 'username'
    password = 'ghp_xxxx'
  }
}

To safeguard this credentials file object, store it in a separate project that only you can access.

Platform File Objects as Runtime Docker Images

Using Platform file objects (Docker image tarballs stored on the Platform) as runtime Docker images enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. Additionally, it fortifies data security by eliminating the need for internet access to external resources during pipeline execution.

Built-in Docker Image Caching vs. Manually Preparing Tarballs

  • Requires running a "building job" with external internet access?

    • Built-in Docker image caching: Yes, if building an applet for the first time or if any image is going to be updated. No internet access is required upon rebuild.

    • Manually preparing tarballs: No.

  • Docker images packaged as bundledDepends?

    • Built-in Docker image caching: Yes. Docker images that will be used in the execution are cached and bundled at build time.

    • Manually preparing tarballs: No. Docker tarballs are resolved at runtime.

  • At runtime

    • Built-in Docker image caching: The job will attempt to access the Docker image cached as bundledDepends. If this fails, the job will attempt to find the image on the Platform. If this fails, the job will try to pull the image from the external repository, via the internet.

    • Manually preparing tarballs: The job will attempt to locate the Docker image based on the Docker cache path referenced. If this fails, the job will attempt to pull from the external repository, via the internet.

Built-in Docker Image Caching

You can use built-in caching via the CLI by using the flag --cache-docker at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.

An example:

# --nextflow-pipeline-params is only needed when the pipeline requires it
$ dx build --nextflow /path/to/hello \
  --cache-docker \
  --nextflow-pipeline-params "--alpha=1 --beta=foo" \
  --destination project-xxxx:/applets2/hello
...
{"id": "applet-yyyy"}

$ dx tree /.cached_docker_images/
/.cached_docker_images/
├── samtools
│   └── samtools_1.16.1--h6899075_1
├── multiqc
│   └── multiqc_1.18--pyhdfd78af_0
└── fastqc
    └── fastqc_0.11.9--0

If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag --docker-secrets, a file object that contains the credentials needed to access the repository. This object must be in the following format:

"docker_registry": {
  "registry": "url-to-registry",
  "username": "name123",
  "token": "12345678"
}
  • When a pipeline requires specific inputs, such as file objects, sample values must be present within the project in which the building job is to execute. These values must be provided along with the flag --nextflow-pipeline-params.

    • It's crucial that these sample values be structured in the same way as actual input data will be structured. This ensures that the execution logic of the Nextflow pipeline remains intact. During the build process, use small files, containing data representative of the larger dataset, as sample data, in order to reduce file localization overhead.

  • For pipelines featuring conditional process trees determined by input values, you may provide mocked input values for caching Docker containers used by processes affected by the condition.

  • A building job requires CONTRIBUTE or higher permission to the destination project, i.e. the project in which it will place tarballs created from Docker containers.

  • Pipeline source code will be saved at /.nf_source/<pipeline_folder_name>/ in the destination project. The user is responsible for cleaning up this folder after the executable has been built.

Manually Preparing Tarballs

You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:

  1. Reference each tarball by its unique Platform ID (e.g. dx://project-xxxx:file-yyyy). Use this approach if you want deterministic execution behavior. You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:

# In a Nextflow pipeline script:
process foo {
  container 'dx://project-xxxx:file-yyyy'
  
  '''
  do this
  '''
}
# In nextflow.config    // at root folder of the nextflow pipeline:
process {
    withName:foo {
        container = 'dx://project-xxxx:file-yyyy'
    }   
}

When accessing a Platform project, a Nextflow pipeline job needs VIEW or higher permission to the project.

  2. Reference each image by name and version tag, as in the registry-based example below. At runtime, the job will look for a correspondingly named tarball file object under the Docker cache path (e.g. project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>):

# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'

# In the Nextflow pipeline script:
process bar {
  container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'

  '''
  do this
  '''
}

Note that no file extension is necessary, and that project-xxxx is the project where the Nextflow pipeline executable was built and will be executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. Note as well that an exact <version> reference must be included - latest is not an accepted tag in this context.

At Nextflow pipeline executable runtime:

  1. If no image is found at the path provided, the Nextflow pipeline job will attempt to pull the Docker image from the remote external registry, based on the image name. This pull attempt requires internet access.

  2. If the version is referenced as latest, or if no version tag is provided, the Nextflow pipeline job will attempt to search the digest of the image’s latest reference from the external Docker repository and use it to search for the corresponding tarball on the platform. This digest search requires internet access. If no digest is found, or if there is no internet access, the execution will fail.

Here are several examples of tarball file object paths and names, as constructed from image names and version tags:

  • Image name quay.io/biocontainers/tabix, version tag 1.11--hdfd78af_0: tarball file object path and name project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0

  • Image name python, version tag 3.9-slim: tarball file object path and name project-xxxx:/.cached_docker_images/python/python_3.9-slim

  • Image name python, version tag latest: the Nextflow pipeline job will attempt to pull from the remote external registry

  3. You can also reference Docker image names in pipeline scripts by digest - for example, <Image_name>@sha256:XYZ123…. As above, no file extension is necessary, project-xxxx is the project where the Nextflow pipeline executable was built and will be executed, and for .cached_docker_images, substitute the name of the folder in which these images have been stored. In addition, to refer to a tarball file on the Platform in this way, an object property image_digest - for example, "image_digest": "<IMAGE_DIGEST_HERE>" - needs to have been assigned to it.

    An example:

# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'

# In the Nextflow pipeline script:
process bar {
  container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
  '''
  do this
  '''
}

Nextflow Input Parameter Type Conversion to DNAnexus Executable Input Parameter Class

From: Nextflow input parameter type and format (as defined in nextflow_schema.json) → To: DNAnexus input parameter class

  • string, format file-path → file

  • string, format directory-path → string

  • string, format path → string

  • string, no format → string

  • integer, no format → int

  • number, no format → float

  • boolean, no format → boolean

  • object, no format → hash

File Input as String or File Class

Converting a URL path to a String

Managing intermediate files and publishing outputs

Pipeline Output Setting Using output: block and publishDir

Values of publishDir

At pipeline development time, the valid value of publishDir can be:

  • A local path string , e.g. “publishDir path: ./path/to/nf/publish_dir/”,

  • A dynamic string value defined as a pipeline input parameter (e.g. “params.outdir”, where “outdir” is a string-class input), allowing pipeline users to determine parameter values at runtime. For example, “publishDir path: '${params.outdir}/some/dir/'” or './some/dir/${params.outdir}/' or './some/dir/${params.outdir}/some/dir/' .

    • When publishDir is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing the publishDir to be a valid relative path.

publishDir is NOT supported on DNAnexus when assigned as an absolute path (e.g. /path/to/nf/publish_dir/, which starts at root (/)). If an absolute path is defined for the publishDir, no output files will be returned in the job’s output parameter published_files.

Queue Size Configuration

Instance Type Determination

Head job instance type determination

The head job of the job tree defaults to running on the instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. It is possible for users to change to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs. Changing the instance type for the head job will not affect the computing resources available to subjobs, where most of the heavy computation takes place (see below for how to configure instance types for Nextflow processes). Changing the instance type for the head job may be necessary only if it is running out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.

Subjob instance type determination

When selecting an instance type for each subjob, the executor will:

  1. Choose the cheapest instance that satisfies the system requirements.

  2. Use only SSD type instances.

Order of precedence for subjob instance type determination:

  1. The value assigned to the machineType directive.

An example of specifying machineType by DNAnexus instance type name is provided below:

process foo {
  machineType 'mem1_ssd1_v2_x36'

  """
  <your script here>
  """
}
  • It is possible for the selected instance type’s CPUs, memory, or disk capacity to be inconsistent with what is reported by the task (e.g. task.cpus, task.memory): the instance type is determined by the precedence above, while the task values return whatever is assigned in the configuration file.

  • When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the docker run command. For example, if task.memory is specified, it becomes the maximum amount of memory the container is allowed to use, as in: docker run --memory ${task.memory}

Nextflow Resume

Preserve Run Caches and Resuming Previous Jobs

Nextflow utilizes a scratch storage area for caching and preserving each task’s temporary results. This directory is called the working directory (workDir), and its path is defined by:

  • The session ID, a universally unique identifier (UUID) associated with the current execution

  • Each task’s unique hash ID: a hash number composed of each task’s input values, input files, command line strings, container ID (e.g. Docker image), conda environment, environment modules, and executed scripts in the bin directory, when applicable.

You can utilize the Nextflow resume feature with the following Nextflow pipeline executable parameters:

  • preserve_cache Boolean type. Default value is false. When set to true, the run will be cached in the current project for future resumes. For example:

    • dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true
    • This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID and each subjob’s unique ID.

    • The session's cache directory, containing information on the location of the workDir, the session progress, etc., is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar , where project-xxxx is the project where the job tree is executed.

    • Each task's working directory will be saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/ , where <2digit>/<30characters>/ is technically the task’s unique ID, and project-xxxx is the project where the job tree is executed.

  • resume String type. Default value is an empty string, and the run will start from scratch. When assigned with a session id, the run will resume from what is cached for the session id on the project. When assigned with “true” or “last”, the run will determine the session id that corresponds to the latest valid execution in the current project and resume the run from it. For example:

    • dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"
  • When preserve_cache=true, DNAnexus executor overrides the value of workDir of the job tree to be project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job tree was executed.

  • When a new job is launched and resumes a cached session (a session_id may be formatted as 12345678-1234-1234-1234-123456789012, for example), the new job not only resumes from where the cache left off, but also shares the same session_id with the cached session it resumes. When a new job makes progress in a session and the job is being cached, it writes temporary results to the same session’s workDir. This will generate a new cache directory (cache.tar) with the latest cache information.

  • You can have many Nextflow job trees sharing the same session ID, writing to the same workDir path, and each creating its own cache.tar, but only the cache of the latest job that ends in the “done” or “failed” state will be preserved in the project.
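
For example (a sketch; the applet and input file are placeholders), a first run can preserve its cache and a later run can resume the most recent resumable session:

# First run: preserve the session cache in the current project
$ dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true --brief -y

# Later run: resume the latest resumable session (or pass an explicit session ID)
$ dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i resume=last -i preserve_cache=true --brief -y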

Below are four possible scenarios and the recommended use cases for -i resume and -i preserve_cache:

  • Scenario 1 (default): resume="" (empty string) and preserve_cache=false

    • Use cases: Production data processing; most high-volume use cases.

  • Scenario 2: resume="" (empty string) and preserve_cache=true

    • Use cases: Pipeline development; typically only for the first few pipeline tests, when it is useful to see all intermediate results in workDir.

    • Note: Only up to 20 Nextflow sessions can be preserved per project.

  • Scenario 3: resume=<session_ID>|"true"|"last" and preserve_cache=false

    • Use cases: Pipeline development; pipeline developers can investigate the job workspace with --delay-workspace-destruction and --ssh.

  • Scenario 4: resume=<session_ID>|"true"|"last" and preserve_cache=true

    • Use cases: Pipeline development; typically only for the first few tests.

    • Note: Only 1 job with the same <session_ID> can run at each time point.

Cache Preserve Limitations and Cleaning Up workDir

It is a good practice to frequently clean up the workDir to save on storage costs. The maximum number of sessions that can be preserved in a DNAnexus project is 20 sessions. If you exceed the limit, the job will generate an error with the following message:

“The number of preserved sessions is already at the limit (N=20) and is true. Please remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run. “

dx rm -r project-xxxx:/.nextflow_cache_db/              # cleanup ALL sessions caches
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session’s cache

Note that deleting an object in the UI or with the CLI command dx rm cannot be undone. Once the session work directory is deleted or moved, subsequent runs will not be able to resume from the session.

For each session, only one job is allowed to resume the session's cached results and preserve its own progress to this session. There is no limit if multiple jobs resume and preserve multiple different sessions, as long as each job preserves a different session. There is also no limit on multiple jobs resuming the same session, as long as at most one of them preserves its progress to the session.
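
To check which sessions are currently preserved in a project before cleaning up (a sketch; the session IDs shown are placeholders):

$ dx ls project-xxxx:/.nextflow_cache_db/
12345678-1234-1234-1234-123456789012/
87654321-4321-8765-4321-210987654321/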

Nextflow’s errorStrategy

There are four error strategy options for the Nextflow executor: terminate, finish, ignore, and retry. The behaviors for each strategy are described below. Note that "all other subjobs" refers to subjobs that have not yet entered their terminal states.

terminate

  • Errored subjob: Job properties set with "nextflow_errorStrategy": "terminate", "nextflow_errored_subjob": "self". Ends in the "failed" state immediately.

  • Head job: Job properties set with "nextflow_errorStrategy": "terminate", "nextflow_errored_subjob": "job-xxxx", "nextflow_terminated_subjob": "job-yyyy, job-zzzz", where job-xxxx is the errored subjob, and job-yyyy and job-zzzz are the other subjobs that were terminated due to this error. Ends in the "failed" state immediately, with the error message "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure".

  • All other subjobs: End in the "failed" state immediately.

finish

  • Errored subjob: Job properties set with "nextflow_errorStrategy": "finish", "nextflow_errored_subjob": "self". Ends in the "done" state immediately.

  • Head job: Job properties set with "nextflow_errorStrategy": "finish", "nextflow_errored_subjob": "job-xxxx, job-2xxx", where job-xxxx and job-2xxx are the errored subjobs. Does not create new subjobs after the time point of the error. Ends in the "failed" state eventually, after other existing subjobs enter their terminal states, with the error message "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure."

  • All other subjobs: Keep running until they enter their terminal states. If an error occurs in any of these subjobs (e.g. job-2xxx), the finish errorStrategy will be applied to that subjob because a finish errorStrategy was hit first, ignoring any other error strategies set in the pipeline's source code or configuration, as per Nextflow's default behavior.

retry

  • Errored subjob: Job properties set with "nextflow_errorStrategy": "retry", "nextflow_errored_subjob": "self". Ends in the "done" state immediately.

  • Head job: Spins off a new subjob which retries the errored job, with the job name <name> (retry: <RetryCount>), where <name> is the original subjob name and <RetryCount> is the order of this retry (e.g. retry: 1, retry: 2). Ends in a terminal state depending on the terminal states of the other currently existing subjobs; can be either "done", "failed", or "terminated".

  • All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in the subjob's corresponding Nextflow process is applied.

ignore

  • Errored subjob: Job properties set with "nextflow_errorStrategy": "ignore", "nextflow_errored_subjob": "self". Ends in the "done" state immediately.

  • Head job: Job properties set with "nextflow_errorStrategy": "ignore", "nextflow_errored_subjob": "job-1xxx, job-2xxx". Shows "subjob(s) <job-1xxxx>, <job-2xxxx> runs into Nextflow process errors' ignore errorStrategy were applied" at the end of the job log. Ends in a terminal state depending on the terminal states of the other currently existing subjobs; can be either "done", "failed", or "terminated".

  • All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in the subjob's corresponding Nextflow process is applied.

When more than one errorStrategy directive is applied to a pipeline job tree, the following rules apply, depending on the first errorStrategy triggered.

  • When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs will result in the "failed" state immediately.

  • When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjob(s) will also apply the finish errorStrategy, ignoring any other error strategies set in the pipeline’s source code or configuration.

  • When retry is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, those errorStrategy directives will be applied to the corresponding subjobs.

  • When ignore is the first errorStrategy directive to be triggered in a subjob, and any of the terminate, finish, or retry errorStrategy directives applies to the remaining subjob(s), that other errorStrategy will be applied to the corresponding subjob.

FAQ

My Nextflow job tree failed, how do I find where the errors are?

A: You can find the errored subjob’s job ID from the head job’s nextflow_errored_subjob and nextflow_errorStrategy properties to investigate which subjob failed and which errorStrategy was applied. To query these errorStrategy-related properties in the CLI, you can run the following commands:

$ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
job-yyyy
$ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
terminate

where job-xxxx is the head job’s job ID.

Once you find the errored subjob, you can investigate the job log using the Monitor page by accessing the URL "https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>", where jobID is the subjob's ID (e.g. job-yyyy), or watch the job log in the CLI using dx watch job-yyyy.

If you had the preserve_cache value set to true when you started running the Nextflow pipeline executable, you can trace the cache workDir (e.g. project-xxxx:/.nextflow_cache_db/<session_id>/work/) and investigate the intermediate results of this run.

What is the version of Nextflow that is used?

A: You can find the Nextflow version used by reading the log of the head job. Each built Nextflow executable is locked to a specific version of the Nextflow executor.

What container runtimes are supported?

My job hangs at the end of the analysis. What can I do to avoid this problem?

Can I have an example of how to construct an output path when I run a Nextflow pipeline with params.outdir, publishDir and job-level destination?

A: Taking the nf-core/sarek pipeline as an example, it uses params.outdir to:

    1. Meet the input requirement for executing the pipeline.

    2. Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.

To specify a value of params.outdir for the Nextflow pipeline executable built from the nf-core/sarek pipeline script, you can use the following command:

# Assign "./local/to/outdir" to params.outdir
dx run project-xxxx:applet-zzzz \
  -i outdir=./local/to/outdir \
  --brief -y

# The same, with a job-level destination for the job tree's outputs
dx run project-xxxx:applet-zzzz \
  -i outdir=./local/to/outdir \
  --destination project-xxxx:/path/to/jobtree/destination/ \
  --brief -y

The second command above will construct the final output paths in the following manner:

  1. project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.

  2. project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of all tasks/processes/subjobs of this pipeline.

  3. project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.

  1. Not all Nextflow pipelines have params.outdir as an input, nor do all of them use params.outdir in publishDir. Read the source script of the Nextflow pipeline for the actual context of usage and requirements for params.outdir and publishDir.

An nf-core flavored folder structure is encouraged but not required.

To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository’s URL. Note that you can also provide optional information, such as a repository tag and an import destination:

Your destination project’s billTo feature needs to be enabled for Nextflow pipeline applet building. Contact DNAnexus Sales for more information.

If the Nextflow pipeline is in a private repository, use the option --git-credentials to provide the DNAnexus qualified ID or path of the credential files on the Platform. Read more about this in the Private Nextflow Pipeline Repository section below.

Through the CLI you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the nextflow-io/hello pipeline from GitHub on your local laptop, stored in a directory named hello, which contains the following files:

Ensure that the folder structure is in the required format, as described in the Pipeline Script Folder Structure section above.

To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:

Your destination project’s billTo feature needs to be enabled for Nextflow pipeline applet building. Contact DNAnexus Sales for more information.

The dx describe command can be run to see information about this applet, similar to the example above.

You can also build a Nextflow pipeline app from a Nextflow pipeline applet by running the command: dx build --app --from applet-xxxx.

To run the Nextflow pipeline applet or app, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs.

You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:
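For instance, assuming the head job's ID is job-xxxx (a sketch; check dx find executions --help for the full set of options):

# list the head job and all of its subjobs
dx find executions --origin job-xxxx

# follow the head job's log as it runs
dx watch job-xxxx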

From the CLI, you can use the dx watch command to check the status and view the log of the head job or of each subjob.

The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor and multiple subjobs each running a single Nextflow process. Throughout the pipeline’s execution, the head job remains in the “running” state and supervises the job tree’s execution.

When a Nextflow head job (i.e. job-xxxx) enters its terminal state (i.e. "done" or "failed"), a Nextflow log file named nextflow-<job-xxxx>.log will be written to the head job's destination path.

reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).

When an input parameter expects a file, you need to specify its value in a format based on the class of the input parameter. When the input is of the “file” class, use a DNAnexus qualified ID (i.e. an absolute path to the file object, such as “project-xxxx:file-yyyy”); when the input is of the “string” class, use the DNAnexus URI (“dx://project-xxxx:/path/to/file”). Full descriptions of PATH formatting are given below.

The DNAnexus object class of each input parameter is based on the “type” and “format” specified in the pipeline’s nextflow_schema.json, when it exists. See the section on input parameter classes below to understand how a Nextflow input parameter’s type and format (when applicable) convert to an app or applet’s input class.

All inputs for a Nextflow pipeline executable are set as “optional” inputs. This gives users the flexibility to specify inputs via other means.

CLI example: dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification.

CLI example: dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash prefixed parameter that corresponds to a Nextflow run option, specifying a preset input configuration.

CLI example: dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to the -params-file option of nextflow run.

CLI example: dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the files being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to the -c option of nextflow run, and the order specified for this array of file inputs is preserved when passing it to the nextflow run execution.

The soft configuration file can be used to assign default values for configuration scopes (such as process).

It is highly recommended to use nextflow_params_file instead of nextflow_soft_confs for specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this in the nf-core documentation.

While you can specify a file input parameter’s value in different places, as seen above, the valid PATH format referring to the same file differs depending on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable’s input parameter. Examples of this are given below.

When launching a DNAnexus job, you can specify a job-level output destination (e.g. project-xxxx:/destination/) using the platform-level optional parameter in the UI or in the CLI. In addition, when publishDir is specified in the pipeline, each output file will be located at <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned per the Nextflow script’s process.

Read more detail about the output folder specification and publishDir in the section on specifying a Nextflow job tree output folder. Find an example of how to construct output paths for an nf-core pipeline job tree at run time in the FAQ.

First, configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to take note of the value you enter in the "Audience" field; you'll need this value in a configuration file used by your pipeline, to enable pipeline runs to access the S3 bucket in question.

Next, configure an AWS Identity and Access Management (IAM) role, such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket in question.

Next, you need to configure your pipeline so that when it's run, it can access the S3 bucket in question. To do this, add, in a configuration file, a dnanexus config scope that includes the properties shown in this example:
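A minimal sketch of such a config scope, written here as a shell snippet that appends to the pipeline's nextflow.config (editing the file directly works just as well; the audience value and role ARN are placeholders, and the property names are the ones described below):

cat >> nextflow.config <<'EOF'
dnanexus {
    // "Audience" value you noted when registering the Platform as an OIDC identity provider
    jobTokenAudience = 'my-oidc-audience'
    // custom subject claims, in the same order as in the IAM role's trust policy
    jobTokenSubjectClaims = 'project_id,launched_by'
    // ARN of the IAM role that jobs will assume in order to access the bucket
    iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/dx-nextflow-s3-access'
}
EOF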

jobTokenAudience is the value of "Audience" you defined when configuring the OIDC identity provider above.

jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus job identity token custom claims - for example, "project_id, launched_by" - that the job must present in order to assume the role that enables bucket access.

iamRoleArnToAssume is the Amazon Resource Name (ARN) of the role that you configured above, and that jobs will assume in order to access the bucket.

Having included custom subject claims in the trust policy for the role in question, you then need, in the aforementioned Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, entered in the same order in which you entered them in the trust policy.

For example, if you configured a role's trust policy as per the example above, you are requiring a job, in order to assume the role, to present the custom subject claims project_id and launched_by, in that order. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, to 'project_id,launched_by', as shown in the sketch above.

By default, the Platform limits apps' and applets' ability to read and write data. Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:

UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - This is required in order for Nextflow pipeline jobs to record the progress of executions and preserve the run cache, in order to enable the resume functionality.

You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building the executable, using the --extra-args flag with dx build. An example:
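A sketch of such an override (the repository URL is a placeholder, and the exact JSON accepted by --extra-args should be checked against the dx build help):

# grant the resulting executable VIEW access to all of the launching user's projects
dx build --nextflow \
  --repository https://github.com/example-org/pipeline \
  --extra-args '{"access": {"allProjects": "VIEW"}}'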

"allProjects":"VIEW" increases jobs' access permission level to VIEW. This means that each job will have "read" access to projects that can be accessed by the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs - via a , for example - from projects other than the one in which a job is being run.

Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. See the reference documentation for the credentials file syntax.

When building a Nextflow pipeline executable, you can replace any Docker container with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.

Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching, or manually preparing the tarballs.

This method initiates a building job that begins by taking the pipeline script and identifying Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling the tarballs as bundledDepends.

Within a Nextflow pipeline script, you can also reference a Docker image by using its full image name. Use this name within a path that's in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>. An example:
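For instance, for a hypothetical image named samtools at version 1.17, the cached tarball would be referenced as project-xxxx:/.cached_docker_images/samtools/samtools_1.17.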

Based on the input parameter’s type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter will be assigned to the corresponding DNAnexus input class.

As a pipeline developer, you can specify a file input variable as {"type": "string", "format": "file-path"} or {"type": "string", "format": "path"}, which will be assigned to the "file" or "string" class, respectively. When running the executable, use the PATH format that matches the class (file or string) of the executable's input parameter. See the descriptions above for the acceptable PATH format for each class.

When converting a file reference from a URL format (e.g. dx://project-xxxx:/path/to/file, a DNAnexus URI) to a String, use the method toUriString(). The method toURI().toString() does not give the same result, as toURI() removes the context ID (e.g. project-xxxx) and toString() removes the scheme (e.g. dx://). More info about these Nextflow methods can be found in the Nextflow documentation.

All files generated by a Nextflow job tree will be stored in its session’s corresponding workDir (i.e. the path where the temporary results are stored). On DNAnexus, when the Nextflow pipeline job is run with “preserve_cache=true”, the workDir is set to the path project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project in which the job ran; you can follow that path to access all preserved temporary results. It is useful to be able to access these results for investigating detailed pipeline progress, and to use them for resuming job runs during pipeline development. More info about workDir can be found in the Nextflow documentation.

When the Nextflow pipeline job is run with “preserve_cache=false” (the default), temporary files are stored in the job’s temporary workspace, which is deconstructed when the head job enters its terminal state (i.e. “done”, “failed”, or “terminated”). Since many of these files are intermediate inputs/outputs passed between processes and are expected to be cleaned up after the job completes, running with “preserve_cache=false” helps reduce project storage costs for files that are not of interest, and also saves you from having to remember to clean up all temporary files.

To save the final results of interest, and to display them as the Nextflow pipeline executable’s output, you can declare the output files of interest in the script’s output: block and use Nextflow’s optional publishDir directive to publish them.

This makes the published output files part of the Nextflow pipeline head job’s output, under the executable’s formally defined placeholder output parameter, published_files, of class array:file. The files are then organized under the relative folder structure assigned via publishDir. This works for both “preserve_cache=true” and “preserve_cache=false”. Only the “copy” publish mode is supported on DNAnexus.

Find an example of how to construct output paths for an nf-core pipeline job tree at run time in the FAQ.

The queueSize option is part of Nextflow’s executor configuration scope. It defines how many tasks the executor will handle in parallel. On DNAnexus, this represents the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable’s head job. If the pipeline’s executor configuration has a value assigned to queueSize, it will override the default value. If the value exceeds the upper limit (1000) on DNAnexus, the root job will error out. See the Nextflow executor page for examples.
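As a sketch, shown as a shell snippet that appends to the pipeline's nextflow.config (the value 20 is arbitrary):

cat >> nextflow.config <<'EOF'
executor {
    // number of subjobs the head job creates at a time (DNAnexus default: 5, upper limit: 1000)
    queueSize = 20
}
EOF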

Each subjob’s instance type is determined based on the profile information provided in the Nextflow pipeline script. You can specify required instances via Nextflow’s machineType directive (example below), or by using a set of system requirements (e.g. cpus, memory, disk, etc.) according to the official Nextflow documentation. The executor will choose the instance type that matches the minimal requirement of what is described in the Nextflow pipeline profile, using the following logic:

For all things equal (price and instance specifications), it will prefer a version2 (v2) instance type.

Values assigned to the cpus, memory, and disk directives in their configuration.

In addition to being used for instance type determination, values assigned to the cpus, memory, and disk directives can also be recalled via Nextflow's implicit task object variables (e.g. ${task.cpus}, ${task.memory}, ${task.disk}) at runtime for task allocation.
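A sketch of both styles, again appended to nextflow.config (the instance type name and resource values are placeholders):

cat >> nextflow.config <<'EOF'
process {
    // Option A: request a specific DNAnexus instance type directly
    // machineType = 'mem1_ssd1_v2_x4'

    // Option B: describe resource requirements and let the executor pick a matching instance type
    cpus   = 4
    memory = '8 GB'
    disk   = '50 GB'
}
EOF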

Nextflow’s resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without needing to start from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.

When the head job enters a terminal state (e.g. “failed” or “terminated”) for a reason not caused by the executor, no cache directory will be preserved, even if the job was run with preserve_cache=true. Subsequent new jobs will not be able to resume from this job run. Examples: a job tree fails due to exceeding a cost limit; a user terminates a job of the job tree, etc.

To clean up all preserved sessions under a project, you can delete the entire .nextflow_cache_db folder. To clean up a specific session’s cached folder, you can delete the specific .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:
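For example (a sketch; the project and session IDs are placeholders):

# recursively delete a specific session's cached folder
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/

# or delete all preserved sessions in the project
dx rm -r project-xxxx:/.nextflow_cache_db/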

Nextflow’s errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default, the process and any other pending processes stop immediately (i.e. errorStrategy terminate), which in turn forces the entire pipeline execution to be terminated.

Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters platform-related restartable errors, such as "ExecutionError", "UnresponsiveWorker", "JMInternalError", "AppInternalError", or "JobTimeoutExceeded", the subjob will follow the executionPolicy applied to it and restart itself. It will not restart from the head job.

A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.

A: There can be many possible causes for the head job to hang. One known cause is the trace report file being written directly to a DNAnexus URI (e.g. dx://project-xxxx:/path/to/file). To avoid this, we suggest specifying -with-trace path/to/tracefile (using a local path string) in the Nextflow pipeline applet’s nextflow_run_opts input parameter.
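For example (a sketch with placeholder IDs and a placeholder trace path):

# write the trace report to a local path inside the job, instead of a dx:// URI
dx run project-xxxx:applet-zzzz \
  -i nextflow_run_opts="-with-trace trace/pipeline_trace.txt" \
  --brief -y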

Taking nf-core/sarek (3.3.1) as an example, start by reading the pipeline's logic:

The pipeline's publishDir is constructed with the params.outdir variable as a prefix, followed by each task's name as the subfolder: publishDir = [ path: { "${params.outdir}/${...}" }, ... ]

params.outdir is a required input parameter of the pipeline, and its default value is null. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir; this both meets the pipeline's input requirement and resolves the value of publishDir, as listed at the top of this answer.

You can also set a job tree's output destination using --destination, as shown in the second dx run command above.

This example is built based on the sections on specifying a Nextflow job tree output folder and on managing intermediate files and publishing outputs.
