Running Workflows

You can run workflows from the command line with the dx run command. Workflow inputs can come from any project to which you have VIEW access.

The examples here use the publicly available Exome Analysis Workflow (platform login required to access this link).

For information on how to run a Nextflow pipeline, see here.

Running in Interactive Mode

If dx run is invoked without any inputs, it launches interactive mode. You are prompted to enter each required input, and are then offered a list of optional parameters to modify. This list includes every parameter that can be modified for each stage of the workflow. dx run then prints the input JSON describing the inputs you specified and generates an analysis ID of the form analysis-xxxx, unique to this run of the workflow.

Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.

$ dx run "Exome Analysis Demo:Exome Analysis Workflow"
Entering interactive mode for input selection.

Input:   Reads (bwa_mem_fastq_read_mapper.reads_fastqgzs)
Class:   array:file

Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
    current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"


Select an optional parameter to set by its # (^D or <ENTER> to finish):

 [0] Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
 [1] Read group information (bwa_mem_fastq_read_mapper.rg_info_csv)
.
.
.
 [33] Output prefix (gatk4_genotypegvcfs.prefix)
 [34] Extra command line options (gatk4_genotypegvcfs.extra_options) [default="-G StandardAnnotation --only-output-calls-starting-in-intervals"]

Optional param #: 0

Input:   Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
Class:   array:file

Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
   current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads2_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_2.fastq.gz"
bwa_mem_fastq_read_mapper.reads2_fastqgzs[1]:

Optional param #: <ENTER>

Using input JSON:
{
  "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
    {
      "$dnanexus_link": {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "id": "file-B40jg7v8KfPy38kjz1vQ001y"
      }
    }
  ],
  "bwa_mem_fastq_read_mapper.reads2_fastqgzs": [
    {
      "$dnanexus_link": {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "id": "file-B40jgYG8KfPy38kjz1vQ0020"
      }
    }
  ]
}

Confirm running the executable with this input [Y/n]: <ENTER>
Calling workflow-xxxx with output destination project-xxxx:/

Analysis ID: analysis-xxxx

Running in Non-Interactive Mode

You can specify each input on the command line with the -i or --input flag, using the syntax -i<stage ID>.<input name>=<input value>. The <input value> must be a DNAnexus object ID or the name of a file in the currently selected project. You can also give the number of a stage in place of its stage ID, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only, to illustrate this point. Note that the parentheses shown around the input value in the help string are omitted when entering input.

Possible values for the input name field can be found by running the command dx run workflow-xxxx -h, as shown below using the Exome Analysis Workflow.

$ dx run "Exome Analysis Demo:Exome Analysis Workflow" -h
usage: dx run Exome Analysis Demo:Exome Analysis Workflow [-iINPUT_NAME=VALUE ...]

Workflow: GATK4 Exome FASTQ to VCF (hs38DH)

Runs GATK4 Best Practice for Exome on hs38DH reference genome

Inputs:
 bwa_mem_fastq_read_mapper
  Reads: -ibwa_mem_fastq_read_mapper.reads_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads_fastqgzs=... [...]]
        An array of files, in gzipped FASTQ format, with the first read mates
        to be mapped.

  Reads (right mates): [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=... [...]]]
        (Optional) An array of files, in gzipped FASTQ format, with the second
        read mates to be mapped.
  BWA reference genome index: [-ibwa_mem_fastq_read_mapper.genomeindex_targz=(file, default={"$dnanexus_link": {"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv", "id": "file-FFJPKp0034KY8f20F6V9yYkk"}})]
        A file, in gzipped tar archive format, with the reference genome
        sequence already indexed with BWA.
  ...
 fastqc
  Reads: [-ifastqc.reads=(file, default={"$dnanexus_link": {"stage": "bwa_mem_fastq_read_mapper", "outputField": "sorted_bam"}})]
        A file containing the reads to be checked. Accepted formats are
        gzipped-FASTQ and BAM.
  ...
 gatk4_bqsr
  Sorted mappings: [-igatk4_bqsr.mappings_sorted_bam=(file, default={"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}})]
        A coordinate-sorted BAM or CRAM file with the base quality scores to
        be recalibrated.
   ...
 ...

Outputs:
  Sorted mappings: bwa_mem_fastq_read_mapper.sorted_bam (file)
        A coordinate-sorted BAM file with the resulting mappings.

  Sorted mappings index: bwa_mem_fastq_read_mapper.sorted_bai (file)
        The associated BAM index file.
  ...
  Variants index: gatk4_genotypegvcfs.variants_vcfgztbi (file)
        The associated TBI file.

This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message will first list the required inputs for that stage, specifying the requisite type in the <input-value> field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it will list advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form

{"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}.

Similarly, this link format can be used to specify output from any prior stage in the workflow as input for the current stage. We see that the Exome Analysis Workflow has one required file array input in addition to those already specified by default: -ibwa_mem_fastq_read_mapper.reads_fastqgzs. As these inputs are for the first stage of the Exome Analysis Workflow, the bwa_mem_fastq_read_mapper stage ID can be replaced with 0.

Workflow stages are zero-indexed; the first stage of a workflow is denoted as stage 0.

The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.

$ dx run "Exome Analysis Demo:Exome Analysis Workflow" \
 -i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
 -ibwa_mem_fastq_read_mapper.genomeindex_targz='Reference Genome Files\: AWS US (East):/H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.bwa-index.tar.gz' -y
Using input JSON:
{
  "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
    {
      "$dnanexus_link": {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "id": "file-B40jg7v8KfPy38kjz1vQ001y"
      }
    }
  ],
  "bwa_mem_fastq_read_mapper.genomeindex_targz": {
    "$dnanexus_link": {
      "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
      "id": "file-B6ZY4942J35xX095VZyQBk0v"
    }
  }
}

Calling workflow-xxxx with output destination
  project-xxxx:/

Analysis ID: analysis-xxxx
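
The -i syntax used above can also be assembled in a script. The following is a minimal shell sketch, not part of the dx toolkit: the helper name stage_input_flag is hypothetical, and it only prints the flag so that it can be inspected before being passed to dx run.

```shell
# Hypothetical helper (not part of dx): print one -i flag for a workflow stage input.
# $1 = stage ID, stage name, or zero-based stage index
# $2 = input name
# $3 = input value
stage_input_flag() {
  printf -- '-i%s.%s=%s\n' "$1" "$2" "$3"
}

# The stage index 0 and the stage ID address the same stage of the example workflow.
stage_input_flag 0 reads_fastqgzs "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
stage_input_flag bwa_mem_fastq_read_mapper reads_fastqgzs "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
```

Both calls print a flag for the same input; only the way the stage is addressed differs.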

Specifying Array Input

To provide array input, pass the same parameter multiple times for a stage. For example, the following flags add files 1 through 3 to the file_inputs parameter of stage-xxxx of the workflow:

$ dx run workflow \
-istage-xxxx.file_inputs=project-xxxx:file-1xxxx \
-istage-xxxx.file_inputs=project-xxxx:file-2xxxx \
-istage-xxxx.file_inputs=project-xxxx:file-3xxxx

Using input JSON:
{
  "stage-xxxx.file_inputs": [
    {
      "$dnanexus_link": {
        "project": "project-xxxx",
        "id": "file-1xxxx"
      }
    },
    {
      "$dnanexus_link": {
        "project": "project-xxxx",
        "id": "file-2xxxx"
      }
    },
    {
      "$dnanexus_link": {
        "project": "project-xxxx",
        "id": "file-3xxxx"
      }
    }
  ]
}

If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>.
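
When the list of files is long, the repeated flags can be generated rather than typed. The sketch below is an illustration only: array_input_flags is a hypothetical helper, the IDs are the placeholders used above, and the function prints the flags instead of calling dx run.

```shell
# Hypothetical helper (not part of dx): print one -i flag per value, so that
# repeated flags for the same parameter are collected into an array input.
# $1 = <stage ID>.<input name>
# remaining arguments = values of the form <project id>:<file id>
array_input_flags() {
  input_name="$1"
  shift
  for value in "$@"; do
    printf -- '-i%s=%s\n' "$input_name" "$value"
  done
}

# Placeholder IDs, matching the example above.
array_input_flags stage-xxxx.file_inputs \
  project-xxxx:file-1xxxx project-xxxx:file-2xxxx project-xxxx:file-3xxxx
```

Because these values contain no whitespace, the output could be passed along as dx run workflow $(array_input_flags ...); values containing spaces would need different quoting.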

Job-Based Object References (JBORs)

The -i flag can also be used to specify job-based object references (JBORs) with the syntax -i<stage ID or number>.<input name>=<job id>:<output name>. The --brief flag, when used with dx run, outputs only the execution's ID, and the -y flag skips the interactive prompt confirming the execution. Used together, they let a dx run call on an app return just its job ID, so that the call can be nested inside another command.

The example below calls the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to produce the sorted_bam output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h. This output is then used as input to the first stage of the Parliament Workflow featured on the DNAnexus platform (platform login required to access this link).

$ dx run Parliament \
  -i0.illumina_bam=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=file-xxxx -ireads2_fastqgzs=file-xxxx -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y --brief):sorted_bam \
  -i0.ref_fasta=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V \
  -y

Using input JSON:
{
    "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.illumina_bam": {
        "$dnanexus_link": {
            "field": "sorted_bam",
            "job": "job-xxxx"
        }
    },
    "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.ref_fasta": {
        "$dnanexus_link": {
            "project": "project-xxxx",
            "id": "file-B6qq53v2J35Qyg04XxG0000V"
        }
    }
}

Calling workflow-xxxx with output destination project-xxxx:/

Analysis ID: analysis-xxxx
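
The nested command above can be easier to read when split into two steps: capture the job ID first, then build the JBOR flag from it. The following sketch assumes a hypothetical helper named jbor_flag and uses a placeholder job ID; it prints the flag rather than launching anything.

```shell
# Hypothetical helper (not part of dx): print a -i flag whose value is a
# job-based object reference of the form <job id>:<output name>.
# $1 = stage ID or index, $2 = input name, $3 = job ID, $4 = output name
jbor_flag() {
  printf -- '-i%s.%s=%s:%s\n' "$1" "$2" "$3" "$4"
}

# In practice job_id would be captured from the upstream run, for example:
#   job_id=$(dx run bwa_mem_fastq_read_mapper ... -y --brief)
job_id="job-xxxx"   # placeholder
jbor_flag 0 illumina_bam "$job_id" sorted_bam
# prints -i0.illumina_bam=job-xxxx:sorted_bam
```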

Advanced Options

Quiet Output

Adding the --brief flag to a dx run command prints only the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.

$ dx run workflow-xxxx -i0.input_file=Input/SRR504516_1.fastq.gz -y --brief
analysis-xxxx
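
A script that saves this ID may want to confirm that it has the expected form before using it. The check below is a sketch under that assumption: is_analysis_id is a hypothetical helper, and the ID is a placeholder rather than the result of a real run.

```shell
# Hypothetical check (not part of dx): succeed only if the argument looks like
# an analysis ID, that is, "analysis-" followed by at least one character.
is_analysis_id() {
  case "$1" in
    analysis-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# In practice the ID would come from the command above, for example:
#   analysis_id=$(dx run workflow-xxxx -i0.input_file=Input/SRR504516_1.fastq.gz -y --brief)
analysis_id="analysis-xxxx"   # placeholder
if is_analysis_id "$analysis_id"; then
  echo "saved $analysis_id"
fi
```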

Rerunning Analyses With Modified Settings

To rerun a previous analysis with modified settings, run the command dx run --clone analysis-xxxx [options]. The [options] parameters override anything set by the --clone flag, and they take the same form as options passed on the command line.

Note that the --clone flag will not copy the usage of the --allow-ssh or --debug-on flags, which must be set with the new execution; only the applet, instance type, and input spec are copied. See the Connecting to Jobs page for more information on the usage of these flags.

For example, the command below redirects the output of the analysis to the /output folder of project-xxxx and reruns all stages.

$ dx run --clone analysis-xxxx \
   --rerun-stage "*" --destination project-xxxx:/output -y

Only the outputs of stages that are rerun are placed in the specified destination.

Rerunning Specific Stages

When rerunning workflows, if a stage is run identically to how it was run in a previous analysis, the stage itself will not be rerun, and its outputs will not be copied or rewritten to a new location. To force a specific stage to run again, use the option --rerun-stage STAGE_ID, where STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). To rerun all stages of an analysis, use --rerun-stage "*"; the asterisk is enclosed in quotes to prevent the shell from expanding it, via globbing, into the names of the files and folders in your current directory.

The command below reruns the third and final stage of analysis-xxxx:

$ dx run --clone analysis-xxxx --rerun-stage 2 --brief -y

Specifying Analysis Output Folders

The --destination flag allows you to specify the output path of a workflow. By default, every output of every stage is written to the specified destination.

Specifying Output Folders

You can use the --stage-output-folder <stage_ID> <folder> flag to specify the output destination of a particular stage in the analysis being run. Here stage_ID is the stage's name or the index of that stage (where the first stage of a workflow is indexed at 0), and folder is the project and path to which the stage should write, using the syntax project-xxxx:/PATH, where PATH is the path to the output folder in project-xxxx.

The following command reruns all stages of analysis-xxxx and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:

$ dx run --clone analysis-xxxx --rerun-stage "*" \
    --stage-output-folder 0 "mappings" --brief -y

Specifying Stage-Relative Output Folders

If you want to specify the output folder of a stage within the current output folder of the entire analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>, where stage_id is the stage's name (stage-xxxx) or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, specify a quoted path, relative to the output folder of the analysis, to which the output of that stage is written.

The following command reruns all stages of analysis-xxxx, setting the output destination of the analysis to /exome_run, and the output destination of stage 0 to /exome_run/mappings in the current project:

$ dx run --clone analysis-xxxx --rerun-stage "*" \
  --destination "exome_run" \
  --stage-relative-output-folder 0 "mappings" --brief -y
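
To make the relationship between the two flags concrete, the sketch below joins an analysis destination folder and a stage-relative folder into the path where the stage's outputs are expected to land. It is plain string handling for illustration, not the platform's own resolution logic, and resolve_stage_folder is a hypothetical name.

```shell
# Hypothetical illustration: combine the analysis destination folder with a
# stage-relative folder to show where that stage's outputs would be written.
# $1 = analysis destination folder, $2 = stage-relative folder
resolve_stage_folder() {
  destination="${1%/}"   # drop one trailing slash, if any
  relative="${2#/}"      # drop one leading slash, if any
  printf '%s/%s\n' "$destination" "$relative"
}

resolve_stage_folder /exome_run mappings
# prints /exome_run/mappings
```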

Specifying a Different Instance Type

If you wish to specify the instance type of all stages in your analysis, or of a specific set of stages, you can do so with the flag --instance-type. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE sets the instance type of a specific stage, while --instance-type INSTANCE_TYPE sets one instance type for all of the stages. The two options can be combined: for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16 sets every stage's instance type to mem2_ssd1_x2 except for the stage my_stage_0, for which mem3_ssd1_x16 is used.

Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).

The example below reruns all stages of analysis-xxxx and specifies that the first and second stages should be run on mem1_hdd2_x8 and mem1_ssd2_x4 instances, respectively:

$ dx run --clone analysis-xxxx \
    --rerun-stage "*" --instance-type '{"0": "mem1_hdd2_x8", "1": "mem1_ssd2_x4"}' \
    --brief -y
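
The repeated --instance-type form described earlier in this section can be generated from a list of specifications. The following sketch assumes a hypothetical helper, instance_type_flags, and reuses the stage name and instance types from the combined example; it prints the flags rather than running dx.

```shell
# Hypothetical helper (not part of dx): print one --instance-type flag per argument.
# Each argument is either INSTANCE_TYPE (applies to all stages)
# or STAGE_ID=INSTANCE_TYPE (applies to one stage).
instance_type_flags() {
  for spec in "$@"; do
    printf -- '--instance-type %s\n' "$spec"
  done
}

# Default for all stages, plus an override for the stage named my_stage_0.
instance_type_flags mem2_ssd1_x2 my_stage_0=mem3_ssd1_x16
```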

Adding Metadata to an Analysis

This is identical to adding metadata to a job; see Adding metadata to a job for details.

Monitoring an Analysis

It is not possible to monitor an analysis from the command line. For information about monitoring a job from the command line, see Monitoring Executions.

On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.

Providing Input JSON

This is identical to providing an input JSON to a job; for more information, see Providing input JSON.

Note that as in running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>, where STAGE_ID is either an ID of the form stage-xxxx or the index of that stage in the workflow (starting with the first stage at index 0).
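
As an illustration of the key format, the sketch below writes an input JSON file for the Exome Analysis Workflow and prints a command that would load it. The project and file IDs are the ones shown in the example output earlier on this page. The -f (--input-json-file) option is taken from the dx run help rather than from this page, so treat its use here as an assumption to verify with dx run -h.

```shell
# Write an input JSON file whose keys use the <stage ID>.<input name> form.
# The quoted 'EOF' keeps the shell from expanding $dnanexus_link.
input_json="$(mktemp)"
cat > "$input_json" <<'EOF'
{
  "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
    {
      "$dnanexus_link": {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "id": "file-B40jg7v8KfPy38kjz1vQ001y"
      }
    }
  ]
}
EOF

# Print the command instead of running it, so it can be reviewed first.
echo "dx run \"Exome Analysis Demo:Exome Analysis Workflow\" -f $input_json -y"
```

The file is created with mktemp and is left in place so that the printed command can be run afterwards.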

Copyright 2024 DNAnexus