Third Party App Style Guide
This document provides guidelines for app and applet development. While we outline best practices in this guide, we recommend using the following industry-standard style guides, to ensure that code is clean and maintainable:
As with any style guide, it is okay, even recommended, to deviate from these guidelines if you have a good reason or if the code you are extending follows a different convention.
The dxapp.json JSON file establishes the convention that users of the DNAnexus platform use to create app(let)s on the platform. When defining applications for wide use, this style guide sets standards for user-friendly UI, CLI, and runtime app(let) definitions.
Applet names should be all lowercase with words separated by underscores.
"name": "hisat2_mapper"
An app summary must be concise. It should fit on one line and not terminate in a period. The current app(let) is the assumed subject of the summary.
Good
"summary": "Merges multiple BAM files into a single one"
"summary": "Maps FASTQ reads (paired or unpaired) to a reference genome with the BWA-MEM algorithm"
Bad
"summary": "This app takes multiple BAM file inputs. Then uses a SAMtools(v1.3.1) to merge and output a single BAM file."
The subject "This app" is unnecessary. The sentence is too verbose in its explanation; it should rely on the subject being assumed.
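The name and summary rules above can be checked mechanically. The following is a minimal sketch, not part of any DNAnexus tooling: the function names and the regular expression are assumptions made for illustration, and the pattern is only one possible reading of "all lowercase with words separated by underscores".

```python
import re

# Assumed reading of the naming rule: lowercase letters and digits,
# with words separated by single underscores.
APP_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def name_problems(name):
    """Return style problems found in a dxapp.json "name" value."""
    if APP_NAME_PATTERN.fullmatch(name):
        return []
    return ["name should be lowercase words separated by underscores"]


def summary_problems(summary):
    """Return style problems found in a dxapp.json "summary" value."""
    problems = []
    text = summary.strip()
    if "\n" in text:
        problems.append("summary should fit on one line")
    if text.endswith("."):
        problems.append("summary should not end in a period")
    if text.lower().startswith("this app"):
        problems.append("the app(let) is the assumed subject of the summary")
    return problems


print(name_problems("hisat2_mapper"))
print(summary_problems("Merges multiple BAM files into a single one"))
print(summary_problems("This app merges BAM files."))
```

Returning a list of problems rather than a boolean lets one pass report every issue, which is convenient when reviewing a dxapp.json before publishing.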
Make sure the app runs in the Python 3 App Execution Environment by setting runSpec accordingly. For example, to run a Python app in the Ubuntu 20.04 Python 3 environment, specify:
"runSpec": {"release": "20.04", "version": "0", "interpreter": "python3", ...}
and for a Bash app in the Ubuntu 20.04 Python 3 environment:
"runSpec": {"release": "20.04", "version": "0", "interpreter": "bash", ...}
Include the licenses of dependency software and installed packages in the upstreamProjects property of the dxapp.json. If there are additional hidden layers of dependencies beyond the ones you explicitly installed, it is the package author's responsibility to list the appropriate licenses.
The following keys are required to ensure compliance with open-source licenses: name, repoUrl, version, license, and licenseUrl. The author key is optional but good to have.
Take note of the following:
licenseUrl
- When the software is maintained on GitHub:
  1. Find the LICENSE or COPYING file link from the software version's tag/commit.
  2. Use a "permalink" for the static hyperlink pointing to the particular version applied. Press "y" in the browser to get a permanent link (see "Getting permanent links to files" in the GitHub documentation).
- When the software is not maintained on GitHub:
  1. Find the license document on the software's website and provide a URL pointing to it.
author
- First, use the authors of the relevant paper for the tool.
- Second, use the AUTHORS.md file (usually an organization's or a person's name) if present in the software's repository.
- Third, if neither is available, provide no author.
"details": {
  ...
  "upstreamProjects": [
    {
      "name": "BWA",
      "repoUrl": "https://github.com/lh3/bwa",
      "version": "0.7.15-r1140",
      "license": "GPL-3.0-or-later",
      "licenseUrl": "https://github.com/lh3/bwa/blob/08764215c6615ea52894e1ce9cd10d2a2faa37a6/COPYING",
      "author": "Heng Li"
    },
    {
      "name": "biobambam2",
      "repoUrl": "https://github.com/gt1/biobambam2",
      "version": "2.0.87-release-20180301132713",
      "license": "MIT, GPL-3.0-or-later",
      "licenseUrl": "https://github.com/gt1/biobambam2/blob/5798e74558e001e33855cb93cc8bf149344b931d/COPYING",
      "author": "German Tischler"
    },
    {
      "name": "GNU Gzip",
      "repoUrl": "https://www.gnu.org/software/gzip/",
      "version": "1.6",
      "license": "GPL-3.0-or-later",
      "licenseUrl": "https://www.gnu.org/licenses/gpl-3.0.html",
      "author": "Jean-loup Gailly"
    }
  ],
  ...
}
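The required-key rule for upstreamProjects lends itself to an automated check. Below is a rough sketch that assumes the dxapp.json has already been parsed into a dictionary; the helper name missing_license_keys and the sample data are hypothetical, and the second entry deliberately omits licenseUrl to show what gets reported.

```python
REQUIRED_KEYS = ("name", "repoUrl", "version", "license", "licenseUrl")


def missing_license_keys(dxapp):
    """Map each upstream project to the required keys it lacks."""
    projects = dxapp.get("details", {}).get("upstreamProjects", [])
    missing = {}
    for index, project in enumerate(projects):
        absent = [key for key in REQUIRED_KEYS if not project.get(key)]
        if absent:
            # Fall back to the list position when the name itself is missing.
            label = project.get("name") or "upstreamProjects[{}]".format(index)
            missing[label] = absent
    return missing


# Hypothetical parsed dxapp.json; the second entry omits licenseUrl.
dxapp = {
    "details": {
        "upstreamProjects": [
            {
                "name": "BWA",
                "repoUrl": "https://github.com/lh3/bwa",
                "version": "0.7.15-r1140",
                "license": "GPL-3.0-or-later",
                "licenseUrl": "https://github.com/lh3/bwa/blob/08764215c6615ea52894e1ce9cd10d2a2faa37a6/COPYING",
            },
            {
                "name": "GNU Gzip",
                "repoUrl": "https://www.gnu.org/software/gzip/",
                "version": "1.6",
                "license": "GPL-3.0-or-later",
            },
        ]
    }
}
print(missing_license_keys(dxapp))
```

The optional author key is intentionally not checked, since the guide allows it to be omitted.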
Cite the publications associated with the software being used. Use a DOI name to refer to each paper.
"details": {
  ...
  "citations": [
    "doi:10.1093/bioinformatics/btv098",
    "arXiv:1303.3997v2"  # As of 20190712, the platform web UI does not resolve arXiv links, though they can be queried through the CLI
  ],
  ...
}
Categories are great for filtering applets from the CLI using dx find. While an app can have many categories, only a subset will show up in the web UI. You can assign any category to an app(let), but remember that the categories searchable in the UI are defined by DNAnexus. If you want to add or remove a category from the web UI, contact DNAnexus Support.
The help text of optional arguments should start with "(Optional)".
{
  ...
  "optional": true,
  "help": "(Optional) Annotations file in Ensembl GTF format.",
  ...
}
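A check for the "(Optional)" prefix could look like the sketch below. It assumes the inputSpec has been loaded as a list of dictionaries; the function name and the sample fields are hypothetical and chosen only to illustrate the rule.

```python
def optional_help_problems(input_spec):
    """Return names of optional inputs whose help lacks the "(Optional)" prefix."""
    return [
        field.get("name", "<unnamed>")
        for field in input_spec
        if field.get("optional")
        and not field.get("help", "").startswith("(Optional)")
    ]


# Hypothetical input specification for illustration.
input_spec = [
    {
        "name": "annotations_gtf",
        "optional": True,
        "help": "(Optional) Annotations file in Ensembl GTF format.",
    },
    {"name": "reads_fastqgz", "help": "Forward gzipped reads."},
    {"name": "reads2_fastqgz", "optional": True, "help": "Reverse gzipped reads."},
]
print(optional_help_problems(input_spec))
```

Required inputs are ignored, so only optional fields with an unprefixed or missing help string are reported.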
Treat the I/O spec of an app(let) like the docstring of a well-documented function; it should be descriptive and provide good understanding without looking under the hood.
For inputs, use the group field to dictate how options will be shown. Groups will be shown in order of first appearance in the input specification, with the unnamed group always appearing first. Strive to group inputs sensibly.
When possible, output index files (for example, BAMs + BAIs and VCFs + TBIs) along with the primary file.
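To preview how input groups will be ordered, the display rule described above (unnamed group first, then named groups by first appearance) can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the empty string is used here as a stand-in for the unnamed group.

```python
def group_display_order(input_spec):
    """Return group names in display order; "" stands for the unnamed group."""
    named = []
    has_unnamed = False
    for field in input_spec:
        group = field.get("group", "")
        if not group:
            has_unnamed = True
        elif group not in named:
            named.append(group)
    # The unnamed group is always shown first, regardless of where it appears.
    return ([""] if has_unnamed else []) + named


# Hypothetical input specification for illustration.
input_spec = [
    {"name": "threads", "group": "Advanced"},
    {"name": "reads_fastqgz"},
    {"name": "output_prefix", "group": "Output"},
    {"name": "extra_options", "group": "Advanced"},
]
print(group_display_order(input_spec))
```

Running this on a draft inputSpec makes it easy to spot a group that appears earlier than intended because one of its inputs was listed too soon.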
When outputting a reference index file from an app(let) script, retain the reference name in the generated index. For example:
referencename.fa.gz -> indexed by HISAT2 -> referencename.hisat2-index.tar.gz (output filename)
For file type inputs, you can recommend projects containing inputs or specific input files for users. For reference genomes and indexes, suggest the "DNAnexus Reference Genomes" project.
{
  "name": "genomeindex_targz",
  "label": "BWA reference genome index",
  "help": "A file, in gzipped tar archive format, with the reference genome sequence already indexed with BWA.",
  "class": "file",
  "patterns": ["*.bwa-index.tar.gz"],
  "suggestions": [
    {
      "name": "DNAnexus Reference Genomes",
      "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
      "path": "/"
    }
  ]
}
App(let) I/O name fields should follow the pattern:
noun[index]_[adjective]_filetype
For BAM files:
mappings_bam # BAM files
mappings_sorted_bam # Sorted BAM File
mappings_sorted_bai # Sorted BAI File
mappings_readname_bam # Readname Sorted Bam
For FASTQ files:
reads_fastqgz # Forward gzipped reads
reads2_fastqgz # Reverse gzipped reads
For VCF files:
variants_vcf # Variants files
variants_vcfgz # Gzipped variants files
variants_tbi # Variants index file
For reference files:
reference_targz # *.tar.gz reference file
reference.tool-index.tar.gz # bioinformatics tool indexed reference
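The noun[index]_[adjective]_filetype pattern for I/O field names can be approximated with a regular expression. The sketch below is one assumed reading of the pattern (a lowercase noun, an optional numeric index, an optional single-word adjective, and a file type); it applies to field names only, not to filenames such as the indexed-reference example above.

```python
import re

# Assumed reading of noun[index]_[adjective]_filetype:
#   noun      -> lowercase letters
#   index     -> optional digits directly after the noun
#   adjective -> optional single lowercase word
#   filetype  -> lowercase letters and digits
IO_NAME_PATTERN = re.compile(r"[a-z]+[0-9]*(?:_[a-z]+)?_[a-z0-9]+")


def follows_io_naming(name):
    """Return True if an I/O field name matches the assumed pattern."""
    return IO_NAME_PATTERN.fullmatch(name) is not None


for candidate in ("mappings_sorted_bam", "reads2_fastqgz", "SortedBam"):
    print(candidate, follows_io_naming(candidate))
```

A stricter check could restrict the file type to a known list, but that list is not defined in this guide, so the sketch leaves it open.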
Download files under their platform filenames rather than under a constant name, so that any errors caused by the input contain filenames familiar to users. For example:
Good (src/code.sh):
$ dx download "${sorted_bam}"
$ samtools view -H "${sorted_bam_name}"
Bad (src/code.sh):
$ dx download $sorted_bam -o sorted.bam
$ samtools view -H sorted.bam
Good (src/code.py):
import subprocess
import dxpy

mapping_sorted_bam = dxpy.DXFile(sorted_bam)
sorted_bam_name = mapping_sorted_bam.name
dxpy.download_dxfile(mapping_sorted_bam.get_id(), sorted_bam_name)
# Pass arguments as a list so filenames with spaces are not split
subprocess.check_call(["samtools", "view", "-H", sorted_bam_name])
Bad (src/code.py):
mapping_sorted_bam = dxpy.DXFile(sorted_bam)
dxpy.download_dxfile(mapping_sorted_bam.get_id(), "input.bam")
subprocess.check_call(["samtools", "view", "-H", "input.bam"])
Variables used in Bash should always be enclosed in curly braces and double quotes to prevent globbing or word splitting, unless splitting is intended. This is especially important when constructing filenames and other values:
Good (src/code.sh):
prefix="SRR504516"
sortedbam_name="${prefix}_sorted.bam"
echo "${sortedbam_name}"
# outputs: SRR504516_sorted.bam
Bad (src/code.sh):
prefix="SRR504516"
sortedbam_name=$prefix_sorted.bam
echo $sortedbam_name
# outputs: .bam
# uses var prefix_sorted, which doesn't exist
References should have filenames that describe their contents. This includes references that have been indexed for specific uses. We handle references in multiple ways.
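One way to keep index filenames descriptive is to derive them from the reference filename, following the referencename.hisat2-index.tar.gz convention shown earlier. The helper below is a hypothetical sketch; the list of reference suffixes is an assumption and would need extending for other formats.

```python
# Assumption: references end in one of these suffixes; extend as needed.
REFERENCE_SUFFIXES = (".fasta.gz", ".fa.gz", ".fasta", ".fa")


def index_filename(reference_filename, tool):
    """Build "<reference name>.<tool>-index.tar.gz" from a reference filename."""
    base = reference_filename
    for suffix in REFERENCE_SUFFIXES:
        if base.endswith(suffix):
            # Strip the sequence-format suffix but keep the reference name.
            base = base[: -len(suffix)]
            break
    return "{base}.{tool}-index.tar.gz".format(base=base, tool=tool)


print(index_filename("referencename.fa.gz", "hisat2"))
# prints: referencename.hisat2-index.tar.gz
```

Because the reference name survives in the output, a user can tell from the index filename alone which genome and which tool it belongs to.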
#
# Section overview
# ---------------------------------------------------------------
# Summary and additional notes on the section in complete sentences.
#
# Sentences/ideas can be separated by newlines if needed.
#
# Remember, a style guide is JUST a suggestion. As long as you're consistent
# with whatever section/block commenting pattern you use, you're golden.
#
descriptive_lowercase_function_name_separated_by_underscores()
Functions, like variables, should be all lowercase, with name parts separated by underscores. Names should describe the task being performed in the function body.
Good (src/code.sh):
function split_bam_by_chr () {
    echo bam filename: "$1"
    echo chromosome: "$2"
    samtools view -b "$1" "$2" -o "${1%.bam}"_chr"$2".bam
}
Good (src/code.py):
def split_bam_by_chr(bam_file, chromosome):
    split_cmd = "samtools view -b {bam} {chr}".format(bam=bam_file,
                                                      chr=chromosome)
    subprocess.check_call(split_cmd, shell=True)
In general, try to keep functions simple and easy to understand. Comments should be used for: long functions, complex algorithms, or a series of difficult to read shell commands. Descriptions should be included as a block comment in bash or a docstring in Python (for Python follow PEP 257). When needed, include an overview of function Arguments, Returns, and Exceptions (Raises) in the docstring/comment.
Good (src/code.sh):
#####################################################
# Split BAM by specified chromosome.
#
# Globals:
#   VIEW_OPT: predefined view cmd options
# Arguments:
#   $1: bam filename
#   $2: chromosome region to use
# Returns:
#   name of the generated BAM file
#####################################################
function split_bam_by_chr() {
    # Log to stderr so that stdout carries only the returned filename
    echo bam filename: "$1" >&2
    echo chromosome: "$2" >&2
    split_bam_name="${1%.bam}"_chr"$2".bam
    samtools view -b "$VIEW_OPT" "$1" "$2" -o "$split_bam_name"
    echo "$split_bam_name"
}
Good (src/code.py):
def split_bam_by_chr(bam_filename, chromosome):
    """Create a BAM file from a specified chromosome.

    Notes:
        Docstrings follow Google best practices. Again, a style is just
        a suggestion; the most important thing is to be consistent!
        A side note worth mentioning specifically for Python:
        following a style guide for comments allows documentation
        generators like Sphinx to parse docstrings and compile autodocs.

    Args:
        bam_filename (str): BAM filename.
        chromosome (str): Chromosome region to split into its own BAM.

    Returns:
        None

    Raises:
        CalledProcessError: If subprocess.check_call() fails.
    """
    split_bam_name = "{bam}_chr{chr}.bam".format(
        bam=bam_filename, chr=chromosome)
    split_cmd = "samtools view -b {bam} {chr} -o {outbam}".format(
        bam=bam_filename,
        chr=chromosome,
        outbam=split_bam_name)
    subprocess.check_call(split_cmd, shell=True)
In app(let)s, commands that are executed (via subprocess in Python) are constructed in different ways based on user input. Incorrect command construction can lead to unexpected failures and results due to word splitting and globbing. In Python, use str.format() to build commands and remember to escape special characters; in Bash, use arrays to construct commands.
Good (src/code.sh):
# Use arrays and correct quoting to build commands
options=()
options+=("view")
options+=("-c")
options+=("bam with space.bam") # Quotes prevent unwanted word splits
samtools "${options[@]}" # Quotes prevent re-splitting of elements
Good (src/code.py):
cmd = "samtools view {options} \"{bam_file}\"".format(options="-c", bam_file="bam with space.bam")
# The escaped quotes prevent word splitting in the subshell
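As an alternative to escaping quotes by hand, Python can mirror the Bash array approach by keeping the command as a list of arguments and quoting only when a single shell string is needed. This is a sketch of that alternative using the standard library's shlex.quote, not the approach prescribed above; the helper name build_samtools_view is hypothetical.

```python
import shlex


def build_samtools_view(bam_file, options=()):
    """Return the argument list for "samtools view" on one BAM file."""
    return ["samtools", "view", *options, bam_file]


args = build_samtools_view("bam with space.bam", options=["-c"])

# Quote each element only when a single shell string is required,
# for example for logging or for subprocess calls with shell=True.
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
# prints: samtools view -c 'bam with space.bam'
```

The argument list can be passed directly to subprocess.check_call(args) without shell=True, in which case no quoting is needed because the arguments are never re-parsed by a shell.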