Advanced App Tutorial
Create more advanced Bash apps on the platform using Sambamba.
In this tutorial, we introduce some advanced features for writing DNAnexus apps and applets in Bash. We will write a simple Bash applet that takes an arbitrary number of BAM files, merges them with Sambamba, and outputs the resulting merged BAM.
Sambamba is an open source toolkit for efficiently working with BAM data. For more information, please visit the Sambamba website and documentation.
Before you begin, you should first download the DNAnexus SDK and run through the Command-Line Quickstart. If this is your first time writing a DNAnexus app, we recommend you first go through the Intro to Building Apps Tutorial before diving into this one.
When you create a DNAnexus app, you must first create the local directory for the app source code and resources. We will be creating the following local directory structure:
sambamba_merge_applet/
├── dxapp.json
├── Readme.md
├── resources/
│   └── usr/
│       └── bin/
│
└── src/
    └── script.sh
If you are unfamiliar with bash, you can use the following commands to set up the directory and subdirectories:
$ mkdir -p sambamba_merge_applet/resources/usr/bin
$ mkdir sambamba_merge_applet/src
Next, use your favorite text editor to write the dxapp.json. This file should be located in the root directory of the app directory as shown in the structure above.
The dxapp.json is a DNAnexus application metadata file. Its presence in a directory tells DNAnexus tools that the directory contains DNAnexus applet source code. We explain selected fields of this file below.
In this file, we specify that our applet will be named sambamba_merge_applet (field: name). Under the inputSpec field, we specify that the app will take 2 inputs:
1. sorted_bams: an array of BAM files
2. advanced_options: an optional string of advanced command line options to be passed to the Sambamba merge command
Under the outputSpec field, we specify that the app will always return 1 output:
1. merged_bam: a single merged BAM file
Additionally, we specify that the sorted_bams input and merged_bam output should contain filenames that match the pattern "*.bam". This specification tells the web UI to show only files that match this pattern when selecting input files.
Next, we specify under runSpec that this is a bash script (field: interpreter) and that the worker running the applet should execute the script located in the applet directory at src/script.sh (field: file).
Finally, under runSpec, systemRequirements, *, instanceType, we specify that all entry points of the applet should be run with the mem2_ssd1_v2_x4 instance type.
{
  "name": "sambamba_merge_applet",
  "title": "Sambamba Mappings Merger",
  "summary": "Uses Sambamba to merge multiple sorted BAM files into a single BAM file",
  "version": "0.0.1",
  "inputSpec": [
    {
      "name": "sorted_bams",
      "label": "Sorted mappings",
      "help": "A set of coordinate-sorted BAM files to be merged.",
      "class": "array:file",
      "patterns": ["*.bam"]
    },
    {
      "name": "advanced_options",
      "label": "Advanced command line options",
      "help": "Advanced command line options that will be supplied directly to the Sambamba merge execution.",
      "class": "string",
      "optional": true
    }
  ],
  "outputSpec": [
    {
      "name": "merged_bam",
      "label": "Merged sorted mappings",
      "help": "A BAM file with the merged mappings.",
      "class": "file",
      "patterns": ["*.bam"]
    }
  ],
  "runSpec": {
    "interpreter": "bash",
    "file": "src/script.sh",
    "systemRequirements": {
      "*": {
        "instanceType": "mem2_ssd1_v2_x4"
      }
    },
    "distribution": "Ubuntu",
    "release": "20.04",
    "execDepends": []
  },
  "openSource": true
}
In this applet, we will use the Sambamba binary, which you can download from the Sambamba releases page. Download the binary, uncompress the executable, and place it in the resources/usr/bin/ directory of your app directory.
After downloading the binary, run the following commands:
# Navigate to your applet root directory
cd /path/to/app/directory
# Untar the downloaded executable
tar -xzf /path/to/downloaded/sambamba_executable
# Rename and move the executable to the correct directory
# Note: if you don't rename the executable, make sure the
# app source code uses the full name of the downloaded
# sambamba executable.
mv sambamba_* resources/usr/bin/sambamba
Next, we will write the script that the worker will execute when the applet is invoked. This file will be named script.sh and located in the applet directory at the path src/script.sh. This location is important because it is the location specified in the dxapp.json above.
The first few lines of the bash script specify where the bash interpreter can be found on the system and set a few options for the execution of the script. The -e flag causes bash to exit as soon as any command fails, the -o pipefail flag makes a pipeline fail if any command within it fails, and the -x flag causes bash to print each command as it is executed, which is useful for debugging.
#!/bin/bash
set -e -x -o pipefail
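To make the effect of -o pipefail concrete, the following minimal sketch compares the exit status of the same pipeline with the option off and on. It is an illustration only and is not part of the applet script; the variable names status_without and status_with are ours.

```bash
#!/bin/bash
# Without pipefail, a pipeline's exit status is that of its last command,
# so the failure of "false" is hidden by the success of "true".
set +o pipefail
status_without=0
false | true || status_without=$?

# With pipefail, the pipeline fails if any command in it fails.
set -o pipefail
status_with=0
false | true || status_with=$?
set +o pipefail

echo "without pipefail: $status_without"
echo "with pipefail: $status_with"
```

The first line printed reports status 0 and the second reports status 1. Combined with -e, the second case would stop the script at the failing pipeline.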
At this time, the workers have the -e flag set by default. If you wish to keep the script running to the end regardless of any errors that may occur during execution, use set +e at the beginning of the script.
You can easily download all file inputs to your applet with the dx-download-all-inputs command-line utility. Add this line to your script.sh:
$ dx-download-all-inputs
This utility will automatically download all the files supplied as input to the applet into the path $HOME/in/. Each file input parameter specified under inputSpec in the dxapp.json will have its own folder under the $HOME/in/ directory. In the case of this applet, there will be one folder for the sorted_bams input at the path $HOME/in/sorted_bams/. Since sorted_bams is an array of files, these files will be placed into numbered subdirectories under the parent directory $HOME/in/sorted_bams/. For example, if the user supplied the following 3 files to the applet, SRR100022_chrom20_mapped_to_b37.bam, SRR100022_chrom21_mapped_to_b37.bam, and SRR100022_chrom22_mapped_to_b37.bam, in that order, the files would be downloaded into the following paths respectively:
$HOME/in/sorted_bams/0/SRR100022_chrom20_mapped_to_b37.bam
$HOME/in/sorted_bams/1/SRR100022_chrom21_mapped_to_b37.bam
$HOME/in/sorted_bams/2/SRR100022_chrom22_mapped_to_b37.bam
The following is a visualization of this example structure:
$HOME
├── in
│   └── sorted_bams
│       ├── 0
│       │   └── SRR100022_chrom20_mapped_to_b37.bam
│       ├── 1
│       │   └── SRR100022_chrom21_mapped_to_b37.bam
│       └── 2
│           └── SRR100022_chrom22_mapped_to_b37.bam
│ ...
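If you want to explore this layout without running a job, the sketch below recreates it in a temporary directory and walks the numbered subdirectories in input order. The demo_home directory and the empty placeholder files are stand-ins we create for illustration; on a worker, dx-download-all-inputs creates the real directories under $HOME/in/.

```bash
#!/bin/bash
# Build a stand-in for $HOME/in/sorted_bams in a temporary directory.
demo_home=$(mktemp -d)
names=(SRR100022_chrom20_mapped_to_b37.bam
       SRR100022_chrom21_mapped_to_b37.bam
       SRR100022_chrom22_mapped_to_b37.bam)
for i in "${!names[@]}"; do
    mkdir -p "$demo_home/in/sorted_bams/$i"
    touch "$demo_home/in/sorted_bams/$i/${names[$i]}"
done

# Walk the numbered subdirectories by index, which preserves input order,
# and record each file path relative to the stand-in home directory.
found=()
for (( i = 0; i < ${#names[@]}; i++ )); do
    for f in "$demo_home/in/sorted_bams/$i"/*; do
        found+=("${f#"$demo_home"/}")
    done
done
printf '%s\n' "${found[@]}"

# Remove the temporary directory.
rm -rf "$demo_home"
```

Because each file sits in its own numbered subdirectory, two inputs with the same filename would not overwrite each other.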
Next, create a folder for your output file:
$ mkdir -p out/merged_bam
We just made a directory with the path $HOME/out/merged_bam, which corresponds to the merged_bam output parameter in the dxapp.json. Later, we will place the output of Sambamba merge, a merged BAM file, into this subdirectory.
At the end of the bash script, we will call the dx-upload-all-outputs utility. This utility will automatically upload all files found on the path $HOME/out/ and link the files to the appropriate output parameter (the outputs specified under outputSpec in the dxapp.json).
By convention, only directories with names equal to output parameter names are expected to be found in the output directory, and any file(s) found in those subdirectories will be uploaded as the corresponding outputs.
In our case, the merged BAM file placed into the path $HOME/out/merged_bam/ will be uploaded as the merged_bam output parameter of the job.
The execution of an applet on a worker starts inside $HOME, so in this tutorial $HOME/out, ~/out, and out/ all refer to the same directory, since we have not changed directories.
DNAnexus has provided some environment variables to make it even simpler to write bash apps. Here, we will use the $sorted_bams_prefix variable to help us name our output file. This variable is provided for every file or array:file input parameter specified in the applet's dxapp.json.
In this case, our only file input parameter is sorted_bams, an array:file. The variable sorted_bams_prefix is a bash array holding the filename of every file in the file array with its extension stripped off, as well as any .gz extension (if applicable).
For example, given input files named NA12878.chr1.bam and NA12878.chr2.bam, the first item in the bash array, ${sorted_bams_prefix[0]}, will be NA12878.chr1, the second item, ${sorted_bams_prefix[1]}, will be NA12878.chr2, etc.
We will use the first prefix in the array to name our output file.
$ output_name="${sorted_bams_prefix[0]}_merged.bam"
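The following sketch approximates how a prefix relates to a filename, using plain bash parameter expansion. The approximate_prefix function is a hypothetical helper written for this illustration; on a worker, the platform computes the prefix values for you, and its exact stripping rules may differ from this approximation.

```bash
#!/bin/bash
# approximate_prefix is a hypothetical helper, not a DNAnexus utility.
# It mimics the behavior described above: drop a trailing .gz (if any),
# then drop the last remaining extension.
approximate_prefix() {
    local name=$1
    name=${name%.gz}
    printf '%s\n' "${name%.*}"
}

prefix=$(approximate_prefix "SRR100022_chrom20_mapped_to_b37.bam")
output_name="${prefix}_merged.bam"
echo "$output_name"
```

With the first example input from earlier in this tutorial, the output name becomes SRR100022_chrom20_mapped_to_b37_merged.bam.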
Next, we will run Sambamba merge. The syntax for Sambamba merge is as follows:
sambamba merge [OPTIONS] <output.bam> <input1.bam> <input2.bam> [...]
Add the following line to your script.sh:
sambamba merge $advanced_options "$output_name" "${sorted_bams_path[@]}"
We pass any advanced_options string the user may have entered as input to the app. The string was stored in the variable $advanced_options during app initialization.
We name the output file according to the $output_name bash variable set in the section above.
Finally, we can use the $sorted_bams_path variable to pass the input files to the executable.
Similar to the prefix variable explained above, a path helper variable is provided for every file or array:file input parameter specified in the applet's dxapp.json. This bash variable stores the full path of each input file, assuming that the file was downloaded using dx-download-all-inputs.
Since our sorted_bams input is of type array:file, the sorted_bams_path variable is a bash array containing the file paths of the files given as input to sorted_bams, in the order they were given to the app. "${sorted_bams_path[@]}" expands to one word per array element, so paths containing whitespace are passed intact.
When the shell script is run, bash will automatically expand all the variables in the command. For example, if we have 3 input files named NA12878.chr1.bam, NA12878.chr2.bam, and NA12878.chr3.bam, and no advanced options, the expanded sambamba merge command will look like this:
sambamba merge NA12878.chr1_merged.bam $HOME/in/sorted_bams/0/NA12878.chr1.bam $HOME/in/sorted_bams/1/NA12878.chr2.bam $HOME/in/sorted_bams/2/NA12878.chr3.bam
Alternatively, the following command also works:
sambamba merge $advanced_options "$output_name" $HOME/in/sorted_bams/*/*
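The quoting in the merge command is deliberate, and the sketch below shows why. The count_args function is a hypothetical stand-in for sambamba merge that only reports how many arguments it received; the option string and file paths are made-up values for illustration.

```bash
#!/bin/bash
# count_args is a hypothetical stand-in for "sambamba merge".
count_args() { echo "$#"; }

advanced_options="-t 4"            # illustrative option string
output_name="sample_merged.bam"    # illustrative output name
sorted_bams_path=("/tmp/in/sorted_bams/0/a b.bam" "/tmp/in/sorted_bams/1/c.bam")

# $advanced_options is left unquoted on purpose so that it splits into
# separate words ("-t" and "4"). The quoted array keeps each path whole.
quoted=$(count_args $advanced_options "$output_name" "${sorted_bams_path[@]}")

# Leaving the array unquoted splits the path that contains a space.
unquoted=$(count_args $advanced_options "$output_name" ${sorted_bams_path[@]})

echo "quoted array: $quoted arguments"
echo "unquoted array: $unquoted arguments"
```

The quoted form passes 5 arguments and the unquoted form passes 6, because the first path is split at its space into two words.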
Finally, after Sambamba merge is finished, we move the merged BAM file into the out/merged_bam/ folder to be uploaded using dx-upload-all-outputs. This utility will upload the contents of the subdirectories on the path $HOME/out/.
$ mv "$output_name" out/merged_bam/
$ dx-upload-all-outputs
At this point, you are done with your app source code. The final script should look like this:
#!/bin/bash
set -e -x -o pipefail
dx-download-all-inputs
mkdir -p out/merged_bam
output_name="${sorted_bams_prefix[0]}_merged.bam"
sambamba merge $advanced_options "$output_name" "${sorted_bams_path[@]}"
mv "$output_name" out/merged_bam/
dx-upload-all-outputs
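Before building, it can help to check the flow of the script locally. The sketch below is one possible dry run, not an official testing method: it replaces dx-download-all-inputs, sambamba, and dx-upload-all-outputs with hypothetical stand-in functions, and sets the helper variables by hand (on a worker, the platform sets them during initialization). The stand-in for sambamba only concatenates its inputs and assumes no advanced options are given.

```bash
#!/bin/bash
# Local dry run in a temporary directory. All three functions below are
# hypothetical stand-ins, not the real DNAnexus utilities or Sambamba.
work=$(mktemp -d)
cd "$work" || exit 1

dx-download-all-inputs() {
    # Pretend two input files were downloaded.
    mkdir -p in/sorted_bams/0 in/sorted_bams/1
    echo "reads-1" > in/sorted_bams/0/a.bam
    echo "reads-2" > in/sorted_bams/1/b.bam
}
sambamba() {
    # Stand-in for "sambamba merge <output> <inputs...>" with no options:
    # drops the subcommand, then concatenates the inputs into the output.
    shift
    local out=$1
    shift
    cat "$@" > "$out"
}
dx-upload-all-outputs() {
    # Report the files that would be uploaded.
    find out -type f | sort
}

# Helper variables set by hand for the dry run.
advanced_options=""
sorted_bams_path=("$work/in/sorted_bams/0/a.bam" "$work/in/sorted_bams/1/b.bam")
sorted_bams_prefix=(a b)

# Same steps as the tutorial script.
dx-download-all-inputs
mkdir -p out/merged_bam
output_name="${sorted_bams_prefix[0]}_merged.bam"
sambamba merge $advanced_options "$output_name" "${sorted_bams_path[@]}"
mv "$output_name" out/merged_bam/
uploaded=$(dx-upload-all-outputs)
echo "$uploaded"
```

The dry run prints out/merged_bam/a_merged.bam, which confirms that the output file ends up in the directory named after the merged_bam output parameter.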
You are now ready to build and run your applet using the following commands. If you have not done so already, log in to DNAnexus in your terminal and select the project you wish to work in.
You can either upload your own BAM files to merge, or use the example BAM files available in the Demo Data public project (Developer Quickstart folder). If you upload your own data, we recommend you first test your app with small files.
$ dx build path/to/app/directory
$ dx run sambamba_merge_applet
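Running dx run without inputs prompts for them interactively. To supply the array input on the command line instead, one option is to repeat the -i flag once per file. The sketch below only assembles and prints such a command for review; it does not call dx. The filenames are the example inputs from earlier in this tutorial and are assumed to exist in the current folder of your project.

```bash
#!/bin/bash
# Example input files, assumed to be in the current project folder.
bams=(SRR100022_chrom20_mapped_to_b37.bam
      SRR100022_chrom21_mapped_to_b37.bam
      SRR100022_chrom22_mapped_to_b37.bam)

# Build one -isorted_bams=<file> argument per array element.
run_args=()
for bam in "${bams[@]}"; do
    run_args+=("-isorted_bams=$bam")
done

# Print the assembled command rather than running it.
printf '%s ' dx run sambamba_merge_applet "${run_args[@]}"
echo
```

Once the printed command looks right, you can paste it into your terminal to launch the job.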
Congratulations! You are done with the advanced bash app tutorial.
For the sake of this tutorial, we manually created the applet local directory, dxapp.json, and shell script (src/script.sh). However, you can automate this step by using the dx-app-wizard as explained in the Intro to Building Apps tutorial.
The dx-app-wizard will prompt you for inputs and automatically create the dxapp.json based on your answers, along with a template file for your shell script. However, the app wizard was not intended to be a tool for the advanced developer. Thus, it does not prompt you for more advanced fields in the applet specification such as patterns and instanceType. Additionally, it does not use the dx-download-all-inputs or dx-upload-all-outputs utilities.
We have found it useful for advanced developers to use the app wizard to create an app directory template and a basic dxapp.json. Afterwards, you can go back in, add additional fields to the dxapp.json, and replace the template bash script with your own.
- See the Developer Tutorials page for language-specific tutorials in a variety of programming languages. These will walk you through writing more complex apps in the language of your choice.
- If you would like to see more example code, you can use the dx get command to reconstruct and download the source directory of open-source apps (e.g. dx get app-cloud_workstation). You can find open-source apps with the command below:
$ dx api system findApps '{"describe":{"fields":{"openSource": true, "name": true}}}'| \
jq '.results|.[]|select(.describe.openSource)|.describe.name'