3rd Party App Style Guide

This document provides guidelines for app(let) development. While we outline best practices in this guide, we recommend using industry-standard style guides to reinforce clean, maintainable code.

As with any style guide, if you have a good reason, or the code you are extending follows a different convention, it is okay, even recommended, to deviate from this guide.

dxapp.json

The dxapp.json JSON file establishes the convention that users of the DNAnexus platform use to create app(let)s on the platform. When defining applications for wide use, this style guide sets standards for user-friendly UI, CLI, and runtime app(let) definitions.

Name

Applet names should be all lowercase with words separated by underscores.
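
The naming rule can be checked mechanically. The following is a minimal sketch, assuming that digits are allowed inside words (as in hisat2_mapper); the function name is_valid_applet_name is hypothetical and not part of any DNAnexus tooling.

```python
import re

# Assumption: a word is a run of lowercase letters and digits,
# and words are joined by single underscores.
_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_applet_name(name):
    """Return True if name is all lowercase with underscore-separated words."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_applet_name("hisat2_mapper"))  # True
print(is_valid_applet_name("Hisat2-Mapper"))  # False
```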

"name": "hisat2_mapper"

Summary

An app summary must be concise. It should fit on one line and not terminate in a period. The current app(let) is the assumed subject of the summary.

Good

"summary": "Merges multiple BAM files into a single one"
"summary": "Maps FASTQ reads (paired or unpaired) to a reference genome with the BWA-MEM algorithm"

Bad

"summary": "This app takes multiple BAM file inputs. Then uses a SAMtools(v1.3.1) to merge and output a single BAM file."

The subject "This app" is unnecessary. The sentence is too verbose in its explanation; it should rely on the subject being assumed.
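
The summary rules above lend themselves to a small lint check. This is only a sketch: the function name summary_problems is hypothetical, and the prefix test for a redundant subject is a rough heuristic, not an official rule.

```python
def summary_problems(summary):
    """Return a list of style problems found in a dxapp.json summary string."""
    problems = []
    text = summary.strip()
    if "\n" in text:
        problems.append("summary spans more than one line")
    if text.endswith("."):
        problems.append("summary ends with a period")
    # Assumption: these prefixes are a heuristic for a redundant subject.
    if text.lower().startswith(("this app", "this applet")):
        problems.append("summary names the app as its subject")
    return problems


print(summary_problems("Merges multiple BAM files into a single one"))  # []
```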

App Execution Environment

Make sure the app runs in the Python3 App Execution Environment by setting runSpec accordingly. For example, to run a Python app in the Ubuntu 16.04 Python3 environment, specify:

"runSpec": {"release": "16.04", "version": "1", "interpreter": "python3", ...}

and for a bash app in the Ubuntu 16.04 Python3 environment:

"runSpec": {"release": "16.04", "version": "1", "interpreter": "bash", ...}

Licenses

List the licenses of the software dependencies and packages your app installs in the upstreamProjects property of the dxapp.json. If the packages you explicitly installed pull in additional hidden layers of dependencies, it is the package author's responsibility to list the appropriate licenses.

The following keys are required to ensure compliance with open-source licenses: name, repoUrl, version, license, and licenseUrl. The author key is optional but good to have.

Note:

  • license: Follow SPDX standards for license abbreviations.

  • licenseUrl

    • When the software is maintained on GitHub:

      1. Find the LICENSE or COPYING file link from the software version's tag/commit

      2. Use a "permalink" so that the hyperlink statically points to the particular version applied. Press "y" in the browser to get the permanent link (see GitHub's documentation on getting permanent links to files).

    • When the software is not maintained on GitHub:

      1. Find the license document on the software's website and provide a URL pointing to it.

  • author

    • First, use the authors of the relevant paper for the tool

    • Second, use the AUTHORS.md file (usually an org name or person's name) if present in the software's repository

    • Third, if neither is available, provide no author

"details": {
  ...
  "upstreamProjects": [
    {
      "name": "BWA",
      "repoUrl": "https://github.com/lh3/bwa",
      "version": "0.7.15-r1140",
      "license": "GPL-3.0-or-later",
      "licenseUrl": "https://github.com/lh3/bwa/blob/08764215c6615ea52894e1ce9cd10d2a2faa37a6/COPYING",
      "author": "Heng Li"
    },
    {
      "name": "biobambam2",
      "repoUrl": "https://github.com/gt1/biobambam2",
      "version": "2.0.87-release-20180301132713",
      "license": "MIT, GPL-3.0-or-later",
      "licenseUrl": "https://github.com/gt1/biobambam2/blob/5798e74558e001e33855cb93cc8bf149344b931d/COPYING",
      "author": "German Tischler"
    },
    {
      "name": "GNU Gzip",
      "repoUrl": "https://www.gnu.org/software/gzip/",
      "version": "1.6",
      "license": "GPL-3.0-or-later",
      "licenseUrl": "https://www.gnu.org/licenses/gpl-3.0.html",
      "author": "Jean-loup Gailly"
    }
  ],
  ...
}
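
The required-key rule can be verified before building an app. The sketch below assumes the dxapp.json has already been loaded into a dictionary; the helper name missing_license_keys is hypothetical and not part of the dx toolkit.

```python
import json

# Keys this guide lists as required for each upstreamProjects entry.
REQUIRED_KEYS = ("name", "repoUrl", "version", "license", "licenseUrl")


def missing_license_keys(dxapp):
    """Map each upstream project to the required keys it lacks."""
    projects = dxapp.get("details", {}).get("upstreamProjects", [])
    missing = {}
    for index, project in enumerate(projects):
        absent = [key for key in REQUIRED_KEYS if not project.get(key)]
        if absent:
            # Fall back to the list position when the name itself is missing.
            label = project.get("name") or "upstreamProjects[{}]".format(index)
            missing[label] = absent
    return missing


example = json.loads("""
{"details": {"upstreamProjects": [
  {"name": "BWA", "repoUrl": "https://github.com/lh3/bwa",
   "version": "0.7.15-r1140", "license": "GPL-3.0-or-later"}
]}}
""")
print(missing_license_keys(example))  # {'BWA': ['licenseUrl']}
```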

Citations

Cite the publications associated with the software being used. Use a DOI name to refer to each paper.

"details": {
  ...
  "citations": [
    "doi:10.1093/bioinformatics/btv098",
    "arXiv:1303.3997v2" # As of 2019-07-12, the platform web UI does not resolve arXiv links, though they can be queried through the CLI
  ]
  ...
}

Categories (Apps)

Categories are useful for filtering app(let)s from the CLI using dx find. An app can have many categories, but only a subset will show up in the web UI. You can assign any category to an app(let), but remember that the categories searchable in the UI are defined by DNAnexus. If you want to add or remove a category from the web UI, contact support@dnanexus.com.

Help

The help text of an optional argument should start with "(Optional)".

{
  ...
  "optional": true,
  "help": "(Optional) Annotations file in Ensembl GTF format.",
  ...
}
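
A check for this convention might look like the sketch below. It assumes the inputSpec list has been loaded from dxapp.json into Python dictionaries; the function name and the sample entries other than the GTF help text are hypothetical.

```python
def inputs_missing_optional_prefix(input_spec):
    """Return names of optional inputs whose help lacks the "(Optional)" prefix."""
    flagged = []
    for entry in input_spec:
        is_optional = entry.get("optional", False)
        if is_optional and not entry.get("help", "").startswith("(Optional)"):
            flagged.append(entry.get("name"))
    return flagged


spec = [
    {"name": "annotations_gtf", "optional": True,
     "help": "(Optional) Annotations file in Ensembl GTF format."},
    {"name": "extra_options", "optional": True,
     "help": "Extra command line options."},
    {"name": "reads_fastqgz", "help": "Gzipped FASTQ reads."},
]
print(inputs_missing_optional_prefix(spec))  # ['extra_options']
```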

Input/output specification (I/O spec)

Treat the I/O spec of an app(let) like the docstring of a well-documented function; it should be descriptive and provide good understanding without looking under the hood.

Input variable ordering

For inputs, use the group field to dictate how options will be shown. Groups are shown in order of first appearance in the input specification, with the unnamed group always appearing first. Strive to group inputs sensibly.
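
To preview the resulting order, the ordering rule described above can be modeled in a few lines. This is a sketch of the rule as stated in this guide, not of the platform's actual implementation; the function name and the sample group names are hypothetical.

```python
def group_display_order(input_spec):
    """Return group names in display order; None stands for the unnamed group."""
    order = [None]  # the unnamed group always comes first
    for entry in input_spec:
        group = entry.get("group")
        if group and group not in order:
            order.append(group)  # remaining groups keep first-appearance order
    return order


spec = [
    {"name": "min_quality", "group": "Advanced"},
    {"name": "reads_fastqgz"},
    {"name": "threads", "group": "Common"},
    {"name": "seed_length", "group": "Advanced"},
]
print(group_display_order(spec))  # [None, 'Advanced', 'Common']
```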

Output files

When possible, output index files (for example BAMs + BAIs and VCFs + TBIs) along with the primary file.

When outputting a reference index file from an app(let) script, retain the reference name in the generated index filename. For example:

referencename.fa.gz -> indexed by HISAT2 index -> referencename.hisat2-index.tar.gz output filename
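
The renaming above could be implemented as in this sketch. The list of reference suffixes to strip is an assumption and should be extended for other naming schemes; index_filename is a hypothetical helper name.

```python
# Assumption: reference extensions to strip before appending the index suffix.
REFERENCE_SUFFIXES = (".fa.gz", ".fasta.gz", ".fa", ".fasta")


def index_filename(reference_filename, tool):
    """Build "<reference>.<tool>-index.tar.gz" from a reference filename."""
    base = reference_filename
    for suffix in REFERENCE_SUFFIXES:
        if reference_filename.endswith(suffix):
            base = reference_filename[:-len(suffix)]
            break
    return "{base}.{tool}-index.tar.gz".format(base=base, tool=tool)


print(index_filename("referencename.fa.gz", "hisat2"))
# referencename.hisat2-index.tar.gz
```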

Suggestions

For file-type inputs, you can recommend projects containing inputs, or specific input files, to users. For reference genomes and indexes, suggest the "DNAnexus Reference Genomes" project.

{
  "name": "genomeindex_targz",
  "label": "BWA reference genome index",
  "help": "A file, in gzipped tar archive format, with the reference genome sequence already indexed with BWA.",
  "class": "file",
  "patterns": ["*.bwa-index.tar.gz"],
  "suggestions": [
    {
      "name": "DNAnexus Reference Genomes",
      "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
      "path": "/"
    }
  ]
}

Name specification

App(let) I/O name fields should follow the pattern:

noun[index]_[adjective]_filetype

BAM files

mappings_bam # BAM file
mappings_sorted_bam # Sorted BAM file
mappings_sorted_bai # Sorted BAM index (BAI) file
mappings_readname_bam # Read-name-sorted BAM file

FASTQ files

reads_fastqgz # Forward gzipped reads
reads2_fastqgz # Reverse gzipped reads

VCF Files

variants_vcf # Variants files
variants_vcfgz # Gzipped variants files
variants_tbi # Variants index file

Reference Files

reference_targz # *.tar.gz reference file
reference.tool-index.tar.gz # bioinformatics tool indexed reference
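
One way to check I/O field names against the noun[index]_[adjective]_filetype pattern is a regular expression. The sketch below assumes that noun, adjective, and filetype are lowercase letters and that the index is numeric. It applies to I/O field names only, not to output filenames such as the tool-index example above. The helper name parse_io_name is hypothetical.

```python
import re

# Assumption: noun, adjective, and filetype are lowercase letters; index is digits.
_IO_NAME = re.compile(
    r"(?P<noun>[a-z]+)(?P<index>\d*)(?:_(?P<adjective>[a-z]+))?_(?P<filetype>[a-z]+)"
)


def parse_io_name(name):
    """Split an I/O field name into its parts, or return None if it does not fit."""
    match = _IO_NAME.fullmatch(name)
    return match.groupdict() if match else None


print(parse_io_name("mappings_sorted_bam"))
print(parse_io_name("reads2_fastqgz"))
print(parse_io_name("SortedBam"))  # None
```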

Script

General Guidelines

Local variable naming

Download files under their filename on the platform rather than a hard-coded constant. That way, any errors caused by the input will contain filenames familiar to users. For example:

Good - src/code.sh

$ dx download "${sorted_bam}"
$ samtools view -H "${sorted_bam_name}"

Bad - src/code.sh

$ dx download $sorted_bam -o sorted.bam
$ samtools view -H sorted.bam

Good - src/code.py

mapping_sorted_bam = dxpy.DXFile(sorted_bam)
sorted_bam_name = mapping_sorted_bam.name
dxpy.download_dxfile(mapping_sorted_bam.get_id(), sorted_bam_name)
# Quote the filename so names containing spaces survive the shell
subprocess.check_call('samtools view -H "{bam_name}"'.format(
    bam_name=sorted_bam_name), shell=True)

Bad - src/code.py

mapping_sorted_bam = dxpy.DXFile(sorted_bam)
dxpy.download_dxfile(mapping_sorted_bam.get_id(), "input.bam")
subprocess.check_call('samtools view -H input.bam', shell=True)

Bash specifics

Variables used in bash should always be enclosed in braces and double quotes to prevent globbing and word splitting, unless those are intended. This is especially important when constructing filenames and other values:

Good - src/code.sh

prefix="SRR504516"
sortedbam_name="${prefix}_sorted.bam"
echo "${sortedbam_name}"
# outputs: SRR504516_sorted.bam

Bad - src/code.sh

prefix="SRR504516"
sortedbam_name=$prefix_sorted.bam
echo $sortedbam_name
# outputs: .bam
# uses var prefix_sorted, which doesn’t exist

References

References should have filenames that describe their contents. This includes references that have been indexed for specific uses. References are handled in multiple ways.

Script Section commenting

#
# Section overview
# ---------------------------------------------------------------
# Summary and additional notes on the section, in complete sentences.
#
# Sentences/ideas can be separated by newlines if needed.
#
# Remember, a style guide is JUST a suggestion. As long as you're consistent
# with whatever section/block commenting pattern you use, you're golden.
#

Functions

Function/Method naming

descriptive_lowercase_function_name_separated_by_underscores()

Function names, like variable names, should be all lowercase with words separated by underscores. Names should describe the task being performed in the function body.

Good - src/code.sh

function split_bam_by_chr () {
    echo bam filename: "$1"
    echo chromosome: "$2"
    samtools view -b "$1" "$2" -o "${1%.bam}"_chr"$2".bam
}

Good - src/code.py

def split_bam_by_chr(bam_file, chromosome):
    split_cmd = "samtools view -b {bam} {chr}".format(bam=bam_file,
                                                      chr=chromosome)
    subprocess.check_call(split_cmd, shell=True)

Descriptions (Doc Strings)

Function descriptions

In general, try to keep functions simple and easy to understand. Comments should be used for long functions, complex algorithms, or a series of difficult-to-read shell commands. Descriptions should be included as a block comment in bash or a docstring in Python (for Python, follow PEP 257). When needed, include an overview of function Arguments, Returns, and Exceptions (Raises) in the docstring/comment.

Good - src/code.sh

#####################################################
# Split BAM by specified chromosome.
#
# Globals:
#   VIEW_OPT: predefined view cmd options
# Arguments:
#   $1: bam filename
#   $2: chromosome region to use
# Returns:
#   name of the generated BAM file
#####################################################
function split_bam_by_chr() {
    echo bam filename: "$1"
    echo chromosome: "$2"
    split_bam_name="${1%.bam}"_chr"$2".bam
    samtools view -b "$VIEW_OPT" "$1" "$2" -o "$split_bam_name"
    echo "$split_bam_name"
}

Good - src/code.py

def split_bam_by_chr(bam_filename, chromosome):
    """Create a BAM file from a specified chromosome.

    Notes:
        Docstrings follow Google best practices. Again, a style is just
        a suggestion; the most important thing is... Be consistent!
        A side note worth mentioning specifically for Python:
        following style guides for commenting allows documentation
        tools like Sphinx to parse docstrings and compile autodocs.

    Args:
        bam_filename (str): BAM filename.
        chromosome (str): Chromosome region to split into its own BAM.

    Returns:
        None

    Raises:
        CalledProcessError: If subprocess.check_call() fails.
    """
    split_bam_name = "{bam}_chr{chr}.bam".format(
        bam=bam_filename, chr=chromosome)
    split_cmd = "samtools view -b {bam} {chr} -o {outbam}".format(
        bam=bam_filename,
        chr=chromosome,
        outbam=split_bam_name)
    subprocess.check_call(split_cmd, shell=True)

Supplementary information

Building commands

In app(let)s, commands that are executed (for example via subprocess in Python) are often constructed in different ways based on user input. Incorrect command construction can lead to unexpected failures and results due to word splitting and globbing. In Python, use str.format() to build commands and remember to escape special characters; in bash, use arrays to construct commands.

Good - src/code.sh

Use arrays and correct quoting to build commands

options=()
options+=("view")
options+=("-c")
options+=("bam with space.bam") # Quotes prevent unwanted word splits
samtools "${options[@]}" # Quotes prevent re-splitting of elements

Good - src/code.py

cmd = "samtools view {options} \"{bam_file}\"".format(options="-c", bam_file="bam with space.bam")
# The escaped quotes prevent word splitting in the subshell
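
As an alternative to hand-escaped quotes, Python's standard library offers two other ways to avoid word splitting. This sketch is a suggestion beyond what the guide prescribes: passing an argument list skips the shell entirely, and shlex.quote() protects user-supplied values when a shell string is unavoidable. The subprocess calls are left commented out so the snippet runs without samtools installed.

```python
import shlex

bam_file = "bam with space.bam"

# Option 1: an argument list never goes through a shell, so no quoting is needed.
cmd_list = ["samtools", "view", "-c", bam_file]
# subprocess.check_call(cmd_list)

# Option 2: when a shell string is required, quote each user-supplied value.
cmd_str = "samtools view {options} {bam}".format(
    options="-c", bam=shlex.quote(bam_file))
# subprocess.check_call(cmd_str, shell=True)

print(cmd_str)  # samtools view -c 'bam with space.bam'
```

Splitting cmd_str with shlex.split() yields the same four arguments as cmd_list, which is a quick way to confirm that the quoting is correct.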