Introduction to Building Workflows

Creating a workflow is easiest via the web interface, but you can use the DNAnexus SDK, dx-toolkit, if you want to automate workflow creation or lock down your workflow. In this tutorial we will show you how to do that step by step from your local workstation.

Simple workflows

A workflow can be created in the DNAnexus platform from a dxworkflow.json file.

In this tutorial, we will build a workflow named "BWA MEM + Freebayes Exome Workflow". The stages field of our JSON file holds a list of executables for the workflow. We'll add two stages to our workflow: the first one will run the app BWA-MEM FASTQ Read Mapper and the second one will run Freebayes Variant Caller. We'll also specify a name and an output folder where our results will be saved. Our dxworkflow.json will look as follows:

Note that sorted_bams and sorted_bam are two separate fields in the dxworkflow.json file we are using: the sorted_bams input field of the Freebayes app is bound to the sorted_bam output field of the BWA-MEM stage.

{
  "name": "BWA MEM + Freebayes Exome Workflow",
  "outputFolder": "/results",
  "stages": [
    {
      "id": "align_reads",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "executable": "app-freebayes",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ]
}

Each stage in the stages list should have an id, which is a free-form string that is unique within a given workflow, and an executable field, which holds either the ID or name of an app, or the ID of an applet, that we want to run in that stage.
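The two requirements above can be checked locally before building. The following is a minimal sketch in Python using only the standard library; the helper name check_stages is hypothetical and not part of dx-toolkit, and the check only mirrors the rules described here (it treats a missing id as a problem), not the platform's own validation.

```python
import json


def check_stages(workflow):
    """Return a list of problems found in the workflow's stages list."""
    problems = []
    seen_ids = set()
    for index, stage in enumerate(workflow.get("stages", [])):
        stage_id = stage.get("id")
        if not stage_id:
            problems.append(f"stage {index} has no id")
        elif stage_id in seen_ids:
            problems.append(f"duplicate stage id: {stage_id}")
        else:
            seen_ids.add(stage_id)
        if not stage.get("executable"):
            problems.append(f"stage {index} has no executable")
    return problems


# Abbreviated version of the workflow above (stage inputs omitted).
workflow = json.loads("""
{
  "name": "BWA MEM + Freebayes Exome Workflow",
  "stages": [
    {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
    {"id": "call_variants", "executable": "app-freebayes"}
  ]
}
""")
print(check_stages(workflow))  # an empty list means no problems were found
```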

We can add an input field for a stage if we want to bind the input of that stage with an output/input of a different stage. For example, the file array input sorted_bams of our second stage, call_variants, will receive values from the output field sorted_bam of the first stage, align_reads:

{
  "input": {
    "sorted_bams": [{
      "$dnanexus_link": {
        "stage": "align_reads",
        "outputField": "sorted_bam"
      }
    }]
  }
}
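If we generate dxworkflow.json from a script, a small helper can produce this kind of link. This is a sketch only; stage_output_link is a hypothetical name chosen for illustration, and the structure it returns is copied from the snippet above.

```python
import json


def stage_output_link(stage_id, output_field):
    """Build a link to an output field of another stage in the same workflow."""
    return {"$dnanexus_link": {"stage": stage_id, "outputField": output_field}}


# sorted_bams is an array input, so the link is wrapped in a list.
call_variants_input = {
    "sorted_bams": [stage_output_link("align_reads", "sorted_bam")]
}
print(json.dumps(call_variants_input, indent=2))
```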

Note that these input and output field names are not created by us, but are the names of the input and output fields that are specified by the apps/applets they belong to. For apps, these field names can be found in each app's documentation, which can be viewed from the online interface using the Tools Library.

You can also view the names of the input and output fields of the executable (app or applet) defined for a stage by running the dx describe command on the command line, for example dx describe app-freebayes.

We also use the input section of a stage to set default values for a field. We select the file hs37d5.bwa-index.tar.gz (file-B6ZY4942J35xX095VZyQBk0v), which is publicly available in the reference project "Apps Data: AWS US (East)" (project-BQpp3Y804Y0xbyG4GJPQ01xv) on the DNAnexus platform, to be the default reference file for the alignment step, align_reads.

{
  "input": {
    "genomeindex_targz": {
      "$dnanexus_link": {
        "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
        "id": "file-B6ZY4942J35xX095VZyQBk0v"
      }
    }
  }
}
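The same default can be written programmatically. The sketch below assumes we are assembling the stage as a Python dictionary; file_link and REFERENCE_PROJECT are hypothetical names, while the project and file IDs are the ones given above.

```python
import json

# Reference project "Apps Data: AWS US (East)" from the text above.
REFERENCE_PROJECT = "project-BQpp3Y804Y0xbyG4GJPQ01xv"


def file_link(project_id, file_id):
    """Build a link to a specific file in a specific project."""
    return {"$dnanexus_link": {"project": project_id, "id": file_id}}


align_reads_stage = {
    "id": "align_reads",
    "executable": "app-bwa_mem_fastq_read_mapper",
    "input": {
        # Default reference index for the alignment step.
        "genomeindex_targz": file_link(REFERENCE_PROJECT, "file-B6ZY4942J35xX095VZyQBk0v")
    },
}
print(json.dumps(align_reads_stage, indent=2))
```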

Creating a Workflow on the DNAnexus Platform

We can now create a workflow object in the DNAnexus platform by following these steps:

  1. Create a directory named "BWA MEM + Freebayes Exome Workflow" on your local workstation. The directory name does not have to match the name of the workflow, but it is good practice to keep them the same.

  2. Place your dxworkflow.json file in the newly created directory.

  3. To create the workflow in the DNAnexus platform, navigate to the directory that contains your newly created directory and run the following commands:

$ ls "BWA MEM + Freebayes Exome Workflow"
dxworkflow.json
$ dx build "BWA MEM + Freebayes Exome Workflow"

You should see the ID of your resulting workflow in the command line. You can also view this workflow by logging in to your DNAnexus account on the Platform and viewing the workflow from your project's Manage page.
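For automation, steps 1 and 2 can be scripted. The following is a minimal sketch using only the Python standard library; write_workflow_dir is a hypothetical helper, and the workflow dictionary is abbreviated. The script only prepares the directory and the argument list for dx build; it does not invoke dx, which requires dx-toolkit to be installed and a logged-in session.

```python
import json
import tempfile
from pathlib import Path


def write_workflow_dir(parent, workflow):
    """Create <parent>/<workflow name>/dxworkflow.json and return the directory."""
    workflow_dir = Path(parent) / workflow["name"]
    workflow_dir.mkdir(parents=True, exist_ok=True)
    (workflow_dir / "dxworkflow.json").write_text(json.dumps(workflow, indent=2))
    return workflow_dir


# Abbreviated workflow description (stage inputs omitted).
workflow = {
    "name": "BWA MEM + Freebayes Exome Workflow",
    "outputFolder": "/results",
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
        {"id": "call_variants", "executable": "app-freebayes"},
    ],
}

with tempfile.TemporaryDirectory() as parent:
    workflow_dir = write_workflow_dir(parent, workflow)
    # Argument list for step 3; pass it to subprocess.run to actually build.
    build_command = ["dx", "build", str(workflow_dir)]
    written = sorted(path.name for path in workflow_dir.iterdir())

print(written)
```

Using a temporary directory keeps the example side-effect free; in practice we would write to a permanent location and then run the build command from there.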

When we run a workflow, we can pass values to any stage input or override its default:

$ dx run -ialign_reads.reads_fastqgzs=myreads.fastq.gz \
-ialign_reads.genomeindex_targz=file-xxxx \
"BWA MEM + Freebayes Exome Workflow"

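When stage inputs come from a script, the -i<stage>.<field>=<value> arguments shown above can be assembled from a dictionary. This is an illustrative sketch; dx_run_command is a hypothetical helper that only builds the argument list and does not execute dx.

```python
import shlex


def dx_run_command(workflow, stage_inputs):
    """Build a dx run argument list.

    stage_inputs maps (stage_id, field) tuples to the value to pass.
    """
    command = ["dx", "run"]
    for (stage_id, field), value in stage_inputs.items():
        command.append(f"-i{stage_id}.{field}={value}")
    command.append(workflow)
    return command


command = dx_run_command(
    "BWA MEM + Freebayes Exome Workflow",
    {
        ("align_reads", "reads_fastqgzs"): "myreads.fastq.gz",
        ("align_reads", "genomeindex_targz"): "file-xxxx",
    },
)
# shlex.join quotes the workflow name so the line can be pasted into a shell.
print(shlex.join(command))
```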
Locked workflows

Motivation

In certain situations, it may be desirable to prevent or discourage users of our workflow from overriding an input to a particular stage. For example, we may want only a specific reference genome to be used for the workflow and thus lock down the reference genome input.

To achieve that, we can add explicit fields called inputs and outputs to the workflow when we create it, with links to the inputs and outputs of specific stages. When the workflow is run, the user can pass values only to the fields defined in inputs; all parameters that are not exposed in this workflow-level I/O interface cannot be overridden.

Creating locked workflows is also useful when we want to simplify the workflow execution and make it clear which inputs users are expected to provide.

This feature also makes the execution of WDL workflows on the Platform more seamless since these also explicitly specify workflow inputs and outputs.

Building a locked-down workflow

For our example, we will create a locked-down version of the workflow above and name it "BWA MEM + Freebayes Exome Workflow (locked)". All of its inputs will be locked except for one, reads_fastqgzs in the stage align_reads. To lock a workflow, we declare only the inputs that remain unlocked by listing them in the workflow-level inputs field. All other inputs are locked automatically, and users cannot override their values when running the workflow.

Inputs

To create a locked workflow we first need to add a workflow-level input specification in the inputs field, which may look like this:

{
  "inputs": [
    {
      "name": "reads",
      "help": "An array of files, in gzipped FASTQ format.",
      "class": "array:file",
      "patterns": [ "*.fq.gz", "*.fastq.gz" ]
    }
  ]
}

In this case the workflow will have only one input, named reads.

Stages

Next, we should define which stage or stages will consume that input by adding a link from each such stage to the workflow input. We can do this by using the field workflowInputField, as in the example below. If a file is supplied to reads when the workflow is run, it will be directed to reads_fastqgzs of the stage align_reads.

{
  "stages": [
    {
      "id": "align_reads",
      "name": "BWA MEM",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgzs": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "name": "Freebayes",
      "executable": "app-freebayes",
      "folder": "call_variants_output",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ]
}

Notice that the input fields genomeindex_targz and genome_fastagz do not appear in the workflow-level inputs field, which means these fields are locked. Since the user cannot supply values for locked fields, we have to set their values in each individual stage's input field as above (not in the workflow-level inputs), and the workflow will only ever run with the values file-B6ZY4942J35xX095VZyQBk0v and file-B6ZY7VG2J35Vfvpkj8y0KZ01, respectively.
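To see at a glance which stage inputs end up locked, we can scan the stage definitions for fields that do not link to a workflow-level input. The sketch below is illustrative only: locked_fields and links_to_workflow_input are hypothetical helpers, the file links are abbreviated to file-xxxx placeholders, and only fields that appear in a stage's input section are inspected.

```python
def links_to_workflow_input(value):
    """True if value (or any element of a list value) uses workflowInputField."""
    items = value if isinstance(value, list) else [value]
    for item in items:
        link = item.get("$dnanexus_link") if isinstance(item, dict) else None
        if isinstance(link, dict) and "workflowInputField" in link:
            return True
    return False


def locked_fields(workflow):
    """Map each stage id to the listed stage inputs that users cannot set."""
    return {
        stage["id"]: [
            field
            for field, value in stage.get("input", {}).items()
            if not links_to_workflow_input(value)
        ]
        for stage in workflow["stages"]
    }


# Abbreviated locked workflow; "file-xxxx" stands in for the real file links.
workflow = {
    "inputs": [{"name": "reads", "class": "array:file"}],
    "stages": [
        {
            "id": "align_reads",
            "input": {
                "reads_fastqgzs": {"$dnanexus_link": {"workflowInputField": "reads"}},
                "genomeindex_targz": {"$dnanexus_link": "file-xxxx"},
            },
        },
        {
            "id": "call_variants",
            "input": {
                "sorted_bams": [
                    {"$dnanexus_link": {"stage": "align_reads", "outputField": "sorted_bam"}}
                ],
                "genome_fastagz": {"$dnanexus_link": "file-xxxx"},
            },
        },
    ],
}
print(locked_fields(workflow))
```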

Required inputs

Any required stage inputs in a locked workflow must be specified in the dxworkflow.json. In the example, our stages have the following required inputs:

  • align_reads stage has the inputs reads_fastqgzs and genomeindex_targz

  • call_variants stage has the inputs sorted_bams and genome_fastagz

reads_fastqgzs is created as a workflow-level input in inputs (so it is not locked, and the user will set the input) while the remaining inputs are locked. The values used in the locked input fields must be set by the creator of the workflow, and for our example, they have been set according to the code snippet seen above.

If the workflow-level inputs specification is null or not specified at all, the workflow can accept inputs provided directly to the workflow stages by the user.

Multiple stages can also link to the same workflow-level input.
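The sketch below lists which stage fields consume each workflow-level input and flags links to inputs that were never declared. It is a local consistency check written for illustration, not the platform's validation; input_consumers is a hypothetical helper, and the count_reads stage with its applet-xxxx executable and fastqs field is invented purely to show two stages sharing one input.

```python
def input_consumers(workflow):
    """Return (consumers, undeclared).

    consumers maps each declared workflow-level input to the 'stage.field'
    entries linked to it; undeclared lists (target, name) pairs whose
    workflowInputField names an input missing from the inputs field.
    """
    consumers = {spec["name"]: [] for spec in workflow.get("inputs") or []}
    undeclared = []
    for stage in workflow["stages"]:
        for field, value in stage.get("input", {}).items():
            items = value if isinstance(value, list) else [value]
            for item in items:
                link = item.get("$dnanexus_link") if isinstance(item, dict) else None
                name = link.get("workflowInputField") if isinstance(link, dict) else None
                if name is None:
                    continue
                target = f"{stage['id']}.{field}"
                if name in consumers:
                    consumers[name].append(target)
                else:
                    undeclared.append((target, name))
    return consumers, undeclared


reads_link = {"$dnanexus_link": {"workflowInputField": "reads"}}
workflow = {
    "inputs": [{"name": "reads", "class": "array:file"}],
    "stages": [
        {"id": "align_reads", "input": {"reads_fastqgzs": reads_link}},
        # Hypothetical second consumer of the same workflow-level input.
        {"id": "count_reads", "executable": "applet-xxxx", "input": {"fastqs": reads_link}},
    ],
}
consumers, undeclared = input_consumers(workflow)
print(consumers)
print(undeclared)
```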

Outputs

Optionally, we can also specify workflow-level outputs:

{
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

The field outputSource allows us to configure which stage-level outputs will be the outputs of the workflow. This, together with inputs, is especially useful when we want to set a workflow as an executable within another workflow.
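A typo in a stage ID inside outputSource is easy to make, so a quick local check can help before building. This is a minimal sketch under the assumption that every output uses the stage/outputField link form shown above; check_outputs is a hypothetical helper, and it cannot verify that the named output field actually exists on the app.

```python
def check_outputs(workflow):
    """Return problems found in the workflow-level outputs."""
    stage_ids = {stage["id"] for stage in workflow["stages"]}
    problems = []
    for spec in workflow.get("outputs") or []:
        source = spec.get("outputSource", {}).get("$dnanexus_link", {})
        stage_id = source.get("stage")
        if stage_id not in stage_ids:
            problems.append(f"output {spec['name']} refers to unknown stage {stage_id}")
        elif "outputField" not in source:
            problems.append(f"output {spec['name']} has no outputField")
    return problems


workflow = {
    "stages": [{"id": "align_reads"}, {"id": "call_variants"}],
    "outputs": [
        {
            "name": "variants",
            "class": "file",
            "outputSource": {
                "$dnanexus_link": {"stage": "call_variants", "outputField": "variants_vcfgz"}
            },
        }
    ],
}
print(check_outputs(workflow))  # an empty list means no problems were found
```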

Full JSON description of a locked-down workflow

Our example dxworkflow.json workflow description will look as follows:

{
  "name": "BWA MEM + Freebayes Exome Workflow (locked)",
  "outputFolder": "/results",
  "inputs": [
    {
      "name": "reads",
      "label": "Reads",
      "help": "An array of files, in gzipped FASTQ format.",
      "class": "array:file",
      "patterns": [
        "*.fq.gz",
        "*.fastq.gz"
      ]
    }
  ],
  "stages": [
    {
      "id": "align_reads",
      "name": "BWA MEM",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgzs": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY4942J35xX095VZyQBk0v"
          }
        }
      }
    },
    {
      "id": "call_variants",
      "name": "Freebayes",
      "executable": "app-freebayes",
      "folder": "call_variants_output",
      "input": {
        "sorted_bams": [{
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        }],
        "genome_fastagz": {
          "$dnanexus_link": {
            "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
            "id": "file-B6ZY7VG2J35Vfvpkj8y0KZ01"
          }
        }
      }
    }
  ],
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

We can then build the workflow by running this command on the directory "BWA MEM + Freebayes Exome Workflow (locked)" (which contains the dxworkflow.json):

$ dx build "BWA MEM + Freebayes Exome Workflow (locked)"

Running a locked-down workflow in the CLI

To run the workflow, we should pass a FASTQ input file to the workflow-level reads input field:

$ dx run "BWA MEM + Freebayes Exome Workflow (locked)" -ireads="Exome Analysis Demo":/Input/SRR504516_1.fastq.gz

Providing the input file directly to the stage, for example -ialign_reads.reads_fastqgzs=my_input_file.fastq.gz, is not possible for locked workflows.
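The rule can be mirrored in a small pre-flight check that a wrapper script might run before calling dx run. This is a sketch of the behavior described in this tutorial, not the platform's implementation; rejected_inputs is a hypothetical helper, and it treats a missing or null inputs field as "not locked".

```python
def rejected_inputs(workflow, provided):
    """Return the provided input names that a locked workflow would not accept."""
    declared = workflow.get("inputs")
    if declared is None:
        # No workflow-level inputs: stage-level names are accepted.
        return []
    allowed = {spec["name"] for spec in declared}
    return [name for name in provided if name not in allowed]


locked = {"inputs": [{"name": "reads", "class": "array:file"}], "stages": []}
unlocked = {"stages": []}

print(rejected_inputs(locked, ["reads"]))
print(rejected_inputs(locked, ["align_reads.reads_fastqgzs"]))
print(rejected_inputs(unlocked, ["align_reads.reads_fastqgzs"]))
```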

To find out how to run the workflow and what inputs it accepts, we can use this command:

$ dx run "BWA MEM + Freebayes Exome Workflow (locked)" --help

Running a locked-down workflow in the UI

In the online interface, running a locked workflow resembles running an app, with inputs on the left side and outputs on the right.

A locked workflow currently cannot be created or edited in the UI; instead, we build it from the CLI with dx build, using dx get first if we are starting from an existing workflow.

Locking down an existing workflow

To lock down an existing workflow, we get the workflow from the platform by running dx get "BWA MEM + Freebayes Exome Workflow", add inputs to the downloaded dxworkflow.json, set workflowInputField references from stages to these inputs as explained above, and run dx build again.
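The editing step between dx get and dx build can also be scripted. The sketch below operates on the workflow description as a Python dictionary (for example, loaded from the downloaded dxworkflow.json with json.load); lock_workflow and the shape of its exposed argument are hypothetical, and a real downloaded file may contain additional fields, which this function leaves untouched.

```python
import copy


def lock_workflow(workflow, exposed):
    """Return a locked copy of workflow.

    exposed is a list of dicts with the keys 'name' and 'class' (the
    workflow-level input) and 'stage' and 'field' (the stage input it feeds).
    """
    locked = copy.deepcopy(workflow)
    locked["inputs"] = []
    stages = {stage["id"]: stage for stage in locked["stages"]}
    for spec in exposed:
        locked["inputs"].append({"name": spec["name"], "class": spec["class"]})
        stage_input = stages[spec["stage"]].setdefault("input", {})
        stage_input[spec["field"]] = {
            "$dnanexus_link": {"workflowInputField": spec["name"]}
        }
    return locked


# Abbreviated unlocked workflow (stage inputs omitted).
original = {
    "name": "BWA MEM + Freebayes Exome Workflow",
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
        {"id": "call_variants", "executable": "app-freebayes"},
    ],
}
locked = lock_workflow(
    original,
    [{"name": "reads", "class": "array:file", "stage": "align_reads", "field": "reads_fastqgzs"}],
)
locked["name"] += " (locked)"
print(locked["inputs"])
print(locked["stages"][0]["input"])
```

Working on a deep copy keeps the original description intact, so both the unlocked and locked variants can be written out and built side by side.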