Introduction to Building Workflows

The web interface is the easiest way to create a workflow, but the DNAnexus SDK (dx-toolkit) lets you automate workflow creation or lock down a workflow. This tutorial gives step-by-step instructions for building workflows from a local workstation.

For information on building Nextflow workflows, see Running Nextflow Pipelines.

Basic Workflows

A workflow can be created on the DNAnexus Platform from a dxworkflow.json file.

This tutorial builds a workflow named "BWA MEM + FreeBayes Exome Workflow". The stages field of the JSON file holds a list of executables for the workflow. The example includes two stages: the first runs the app BWA-MEM FASTQ Read Mapper and the second runs FreeBayes Variant Caller. The JSON also specifies a name and an output folder for results.

The example connects the two stages through two distinct fields with similar names: the sorted_bams input field of the FreeBayes stage is bound to the sorted_bam output field of the BWA-MEM stage. The example dxworkflow.json looks as follows:

{
  "name": "BWA MEM + FreeBayes Exome Workflow",
  "outputFolder": "/results",
  "stages": [
    {
      "id": "align_reads",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "executable": "app-freebayes",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ]
}

Each stage in the stages list must include an id (a free-form string unique in the workflow) and an executable field that contains the ID or name of an app or an ID of an applet to run in that stage.
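
The two per-stage requirements above can be checked before running dx build. The following is a minimal sketch using only the Python standard library; the function name check_stages is hypothetical and is not part of dx-toolkit, and the check covers only the presence of id and executable and the uniqueness of id.

```python
import json


def check_stages(workflow):
    """Return a list of problems found in the workflow's "stages" list."""
    problems = []
    seen_ids = set()
    for index, stage in enumerate(workflow.get("stages", [])):
        stage_id = stage.get("id")
        if not stage_id:
            problems.append(f"stage {index}: missing 'id'")
        elif stage_id in seen_ids:
            problems.append(f"stage {index}: duplicate id '{stage_id}'")
        else:
            seen_ids.add(stage_id)
        if not stage.get("executable"):
            problems.append(f"stage {index}: missing 'executable'")
    return problems


# Trimmed copy of the tutorial's two stages (inputs omitted for brevity).
example = json.loads("""
{
  "stages": [
    {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
    {"id": "call_variants", "executable": "app-freebayes"}
  ]
}
""")
print(check_stages(example))  # prints []
```

An empty list means both stages carry an id and an executable and no id is repeated. The sketch does not verify that the named executable exists on the Platform.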

Add an input field for a stage to bind the stage input to an output or input of another stage. For example, the file array input sorted_bams of the second stage, call_variants, receives values from the output field sorted_bam of the first stage, align_reads:

{
  "input": {
    "sorted_bams": [{
      "$dnanexus_link": {
        "stage": "align_reads",
        "outputField": "sorted_bam"
      }
    }]
  }
}
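
To see which stages depend on which, the stage links can be collected from the JSON. The sketch below assumes links always take the {"$dnanexus_link": {"stage": ..., "outputField": ...}} form shown above; the helper names iter_links, stage_dependencies, and unknown_stage_links are hypothetical and not part of dx-toolkit.

```python
def iter_links(value):
    """Yield every dict stored under a "$dnanexus_link" key inside value."""
    if isinstance(value, dict):
        link = value.get("$dnanexus_link")
        if isinstance(link, dict):
            yield link
        else:
            for child in value.values():
                yield from iter_links(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_links(child)


def stage_dependencies(workflow):
    """Map each stage id to the set of stage ids its inputs link to."""
    deps = {}
    for stage in workflow.get("stages", []):
        linked = set()
        for link in iter_links(stage.get("input", {})):
            if "stage" in link:
                linked.add(link["stage"])
        deps[stage["id"]] = linked
    return deps


def unknown_stage_links(workflow):
    """Return {stage id: [linked ids that are not defined in the workflow]}."""
    known = {stage["id"] for stage in workflow.get("stages", [])}
    return {
        stage_id: sorted(targets - known)
        for stage_id, targets in stage_dependencies(workflow).items()
        if targets - known
    }


# The tutorial's stage link, with the file-reference inputs omitted.
workflow = {
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
        {
            "id": "call_variants",
            "executable": "app-freebayes",
            "input": {
                "sorted_bams": [
                    {"$dnanexus_link": {"stage": "align_reads",
                                        "outputField": "sorted_bam"}}
                ]
            },
        },
    ]
}
```

For this workflow, stage_dependencies reports that call_variants depends on align_reads, and unknown_stage_links returns an empty dict because every linked stage ID is defined.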

Input and output field names are defined by the apps or applets they belong to. For apps, find these field names in the app documentation available in the online interface under the Tools Library.

To view the names of an executable's input and output fields, run the dx describe command on it, for example dx describe app-freebayes.

Use the input section of a stage to set default values for a field. The example selects the file hs37d5.bwa-index.tar.gz (file-B6ZY4942J35xX095VZyQBk0v), which is publicly available in the reference project "Apps Data: AWS US (East)" (project-BQpp3Y804Y0xbyG4GJPQ01xv), as the default reference file for the alignment step, align_reads.

"input": {
  "genomeindex_targz": {
    "$dnanexus_link": {
      "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
      "id": "file-B6ZY4942J35xX095VZyQBk0v"
    }
  }
}

Creating a Workflow on the DNAnexus Platform

Create a workflow object on the DNAnexus Platform with the following steps:

  1. On the local workstation, create a directory named "BWA MEM + FreeBayes Exome Workflow". The directory name does not need to match the workflow name exactly, but matching them is a good practice.

  2. Place the dxworkflow.json file in the new directory.

  3. Create the workflow on the DNAnexus Platform by navigating to the directory and entering the following commands:

$ ls "BWA MEM + FreeBayes Exome Workflow"
dxworkflow.json
$ dx build "BWA MEM + FreeBayes Exome Workflow"

After dx build finishes, it shows you the ID of the resulting workflow. You can also view this workflow by logging in to your DNAnexus account on the Platform and viewing the workflow from your project's Manage page.
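
Steps 1 and 2 above can also be scripted when the workflow definition is generated programmatically. This is a minimal sketch using only the Python standard library; write_workflow_dir is a hypothetical helper, and it assumes the workflow name is usable as a directory name (for example, that it contains no path separator). Running dx build on the resulting directory remains a separate step.

```python
import json
import tempfile
from pathlib import Path


def write_workflow_dir(workflow, parent="."):
    """Create <parent>/<workflow name>/dxworkflow.json and return its path."""
    directory = Path(parent) / workflow["name"]
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "dxworkflow.json"
    target.write_text(json.dumps(workflow, indent=2) + "\n", encoding="utf-8")
    return target


# Demonstration in a scratch directory that is removed afterwards.
with tempfile.TemporaryDirectory() as scratch:
    path = write_workflow_dir(
        {"name": "BWA MEM + FreeBayes Exome Workflow", "stages": []}, scratch
    )
    print(path.parent.name)  # prints BWA MEM + FreeBayes Exome Workflow
```

Naming the directory after the workflow follows the good practice mentioned in step 1.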

When you run the workflow, you can set or override any stage input by prefixing the field name with the stage ID:

dx run -i align_reads.reads_fastqgzs=myreads.fastq.gz \
  -i align_reads.genomeindex_targz=file-xxxx \
  "BWA MEM + FreeBayes Exome Workflow"

Locked Workflows

Reasons to Lock a Workflow

Sometimes, you may want to prevent users from changing certain stage inputs in a workflow. For example, you might want to ensure that only a specific reference genome is used, and restrict users from modifying the reference genome input.

To achieve that, add workflow-level inputs and outputs fields when you create the workflow, with links to stage inputs and outputs. When the workflow runs, users can pass values only to the fields defined in inputs. Parameters that are not exposed in this workflow-level I/O interface cannot be changed.

Locking a workflow also simplifies its execution and makes clear which inputs users are expected to provide.

This approach also improves execution of WDL workflows on the Platform because WDL workflows explicitly specify workflow inputs and outputs.

Building a Locked-Down Workflow

The example shows a locked-down version named "BWA MEM + FreeBayes Exome Workflow (locked)". All inputs are locked except reads_fastqgzs in the align_reads stage. When locking workflows, list the inputs that are not locked in the workflow-level inputs field. All other inputs become locked and users cannot override them at runtime.

Inputs

To create a locked workflow, add a workflow-level input specification in the inputs field, for example:

{
  "inputs": [
    {
      "name": "reads",
      "help": "An array of files, in gzipped FASTQ format.",
      "class": "array:file",
      "patterns": [ "*.fq.gz", "*.fastq.gz" ]
    }
  ]
}

In this case the workflow has only one input, named reads.

Stages

Next, define which stage or stages consume that input by adding a link from those stages to the workflow input using the workflowInputField field, as in the example below. If a file is supplied to reads at runtime, it is directed to reads_fastqgzs in the align_reads stage.

{
  "stages": [
    {
      "id": "align_reads",
      "name": "BWA MEM",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgzs": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "name": "FreeBayes",
      "executable": "app-freebayes",
      "folder": "call_variants_output",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ]
}

Notice that the input fields genomeindex_targz and genome_fastagz are not included in the workflow-level inputs, indicating that these fields are locked. Because users cannot supply values for locked fields, set these values in each stage's input field (not the workflow-level inputs). The workflow then runs with the values file-B6ZY4942J35xX095VZyQBk0v and file-B6ZY7VG2J35Vfvpkj8y0KZ01 respectively.
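
The distinction described above can be made explicit by classifying every stage input in the JSON. The sketch below is illustrative only: the labels "exposed", "from_stage", and "fixed" and the function names are hypothetical, not DNAnexus terminology, and the code assumes stage inputs use the link forms shown in this tutorial.

```python
def first_link(value):
    """Return the first dict stored under "$dnanexus_link" in value, or None."""
    if isinstance(value, list):
        for item in value:
            found = first_link(item)
            if found is not None:
                return found
        return None
    if isinstance(value, dict):
        link = value.get("$dnanexus_link")
        return link if isinstance(link, dict) else None
    return None


def classify_stage_inputs(workflow):
    """Label each "stage_id.field" as exposed, from_stage, or fixed."""
    result = {}
    for stage in workflow.get("stages", []):
        for field, value in stage.get("input", {}).items():
            link = first_link(value) or {}
            if "workflowInputField" in link:
                kind = "exposed"      # users supply this through a workflow-level input
            elif "stage" in link:
                kind = "from_stage"   # filled by another stage's output
            else:
                kind = "fixed"        # value set by the workflow creator
            result[f"{stage['id']}.{field}"] = kind
    return result


# Stage inputs copied from the locked example above.
REF_PROJECT = "project-BQpp3Y804Y0xbyG4GJPQ01xv"
locked_stages = {
    "stages": [
        {
            "id": "align_reads",
            "input": {
                "reads_fastqgzs": {"$dnanexus_link": {"workflowInputField": "reads"}},
                "genomeindex_targz": {"$dnanexus_link": {
                    "project": REF_PROJECT, "id": "file-B6ZY4942J35xX095VZyQBk0v"}},
            },
        },
        {
            "id": "call_variants",
            "input": {
                "sorted_bams": [{"$dnanexus_link": {
                    "stage": "align_reads", "outputField": "sorted_bam"}}],
                "genome_fastagz": {"$dnanexus_link": {
                    "project": REF_PROJECT, "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"}},
            },
        },
    ]
}
```

For the example, only align_reads.reads_fastqgzs is classified as exposed; the two reference-file inputs are fixed and sorted_bams comes from the align_reads stage.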

Required Inputs

Any required stage inputs in a locked workflow must be specified in the dxworkflow.json. In the example, the stages have the following required inputs:

  • The align_reads stage requires the inputs reads_fastqgzs and genomeindex_targz.

  • The call_variants stage requires the inputs sorted_bams and genome_fastagz.

The reads_fastqgzs input is created as a workflow-level input in inputs. This input is not locked and users supply its value. The remaining inputs are locked. The workflow creator must set values for locked inputs. In the example, those values are set as shown in the code snippet above.

If the workflow-level inputs specification is null or not specified at all, the workflow can accept inputs provided directly to the workflow stages by the user.

Multiple stages can also link to the same workflow-level input.
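
To illustrate several stages sharing one workflow-level input, the sketch below lists which stage fields read each input. The function workflow_input_consumers is a hypothetical helper, and the second stage in the sample data (qc_reads, with executable applet-xxxx and field fastqs) is invented purely for illustration; it is not part of the tutorial's workflow.

```python
def workflow_input_consumers(workflow):
    """Map each workflow-level input name to the stage fields that read it."""
    consumers = {spec["name"]: [] for spec in workflow.get("inputs") or []}
    for stage in workflow.get("stages", []):
        for field, value in stage.get("input", {}).items():
            link = value.get("$dnanexus_link") if isinstance(value, dict) else None
            if isinstance(link, dict) and "workflowInputField" in link:
                consumers.setdefault(link["workflowInputField"], []).append(
                    f"{stage['id']}.{field}"
                )
    return consumers


shared = {
    "inputs": [{"name": "reads", "class": "array:file"}],
    "stages": [
        {
            "id": "align_reads",
            "executable": "app-bwa_mem_fastq_read_mapper",
            "input": {
                "reads_fastqgzs": {"$dnanexus_link": {"workflowInputField": "reads"}}
            },
        },
        {
            # Hypothetical stage, executable, and field, for illustration only.
            "id": "qc_reads",
            "executable": "applet-xxxx",
            "input": {
                "fastqs": {"$dnanexus_link": {"workflowInputField": "reads"}}
            },
        },
    ],
}
print(workflow_input_consumers(shared))
# prints {'reads': ['align_reads.reads_fastqgzs', 'qc_reads.fastqs']}
```

A workflow-level input that maps to an empty list is declared but not consumed by any stage, which usually signals a missing workflowInputField link.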

Outputs

Optionally, specify workflow-level outputs:

{
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

The outputSource field configures which stage-level outputs become workflow outputs. Together with inputs, this is useful when setting a workflow as an executable within another workflow.

Full JSON Description of a Locked-Down Workflow

The example dxworkflow.json description looks as follows:

{
  "name": "BWA MEM + FreeBayes Exome Workflow (locked)",
  "outputFolder": "/results",
  "inputs": [
    {
      "name": "reads",
      "label": "Reads",
      "help": "An array of files, in gzipped FASTQ format.",
      "class": "array:file",
      "patterns": [
        "*.fq.gz",
        "*.fastq.gz"
      ]
    }
  ],
  "stages": [
    {
      "id": "align_reads",
      "name": "BWA MEM",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgzs": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "name": "FreeBayes",
      "executable": "app-freebayes",
      "folder": "call_variants_output",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ],
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

Build the workflow by running this command on the directory "BWA MEM + FreeBayes Exome Workflow (locked)" (which contains the dxworkflow.json):

dx build "BWA MEM + FreeBayes Exome Workflow (locked)"

Running a Locked-Down Workflow via the CLI

To run the workflow, pass a FASTQ input file to the workflow-level reads input field:

dx run "BWA MEM + FreeBayes Exome Workflow (locked)" \
  -i reads="Exome Analysis Demo":/Input/SRR504516_1.fastq.gz

Locked workflows do not accept inputs passed directly to a stage, for example, -ialign_reads.reads_fastqgzs=my_input_file.fastq.gz.

To find out how to run the workflow and what inputs it accepts, run this command:

dx run "BWA MEM + FreeBayes Exome Workflow (locked)" --help

Running a Locked-Down Workflow via the UI

Running a locked workflow in the UI resembles running an app, with inputs on the left and outputs on the right.

Locked workflows cannot be created or edited in the UI. To build or change one, use the CLI: download the workflow with dx get, edit its dxworkflow.json, and rebuild it with dx build.

Locking Down an Existing Workflow

To lock down an existing workflow, run dx get "BWA MEM + FreeBayes Exome Workflow", add inputs to the downloaded dxworkflow.json, set workflowInputField references from stages to these inputs, and run dx build again.
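
The editing step between dx get and dx build can be sketched in Python. This is an illustration under stated assumptions, not an official tool: lock_workflow is a hypothetical helper that works on the parsed contents of dxworkflow.json, identifies each input to expose as "stage_id.field" (so it assumes the stage ID itself contains no period), and replaces any existing workflow-level inputs.

```python
import copy


def lock_workflow(workflow, exposed):
    """Return a locked copy of workflow.

    exposed maps "stage_id.field" to the workflow-level input spec
    that should feed that stage field.
    """
    locked = copy.deepcopy(workflow)
    stages = {stage["id"]: stage for stage in locked.get("stages", [])}
    locked["inputs"] = []
    for target, spec in exposed.items():
        stage_id, field = target.split(".", 1)
        if stage_id not in stages:
            raise KeyError(f"unknown stage: {stage_id}")
        locked["inputs"].append(copy.deepcopy(spec))
        # Point the stage field at the new workflow-level input.
        stages[stage_id].setdefault("input", {})[field] = {
            "$dnanexus_link": {"workflowInputField": spec["name"]}
        }
    return locked


unlocked = {
    "name": "BWA MEM + FreeBayes Exome Workflow",
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper",
         "input": {}},
    ],
}
locked = lock_workflow(
    unlocked,
    {"align_reads.reads_fastqgzs": {"name": "reads", "class": "array:file"}},
)
```

The function returns a modified copy and leaves the original dict untouched, so the unlocked definition can still be written out or compared. Remaining required stage inputs still need fixed values in each stage's input field before the locked workflow is built.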
