Learn to manage data, users, and work on the Platform, via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.
This section is targeted at organizational leads who have permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the Platform.
Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.
Get details on new features, changes, and bug fixes for each Platform and toolkit release.
DNAnexus Essentials
Learn to upload data, create a project, run an analysis, and visualize results.
Learn More
See these Key Concepts pages to learn more about how the DNAnexus Platform works, and how to get the most from it:
Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.
R Shiny needs two scripts, server.R and ui.R, which should be under resources/home/dnanexus/my_app/. When a job starts based on this applet, the resources directory is copied onto the worker, and since the ~/ path on the worker is /home/dnanexus, that means you have ~/my_app with those two scripts inside.
From the main applet script code.sh, start shiny pointing to ~/my_app, serving its mini-application on port 443.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running indefinitely. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
Modifying this example for your own applet
To make your own applet with R Shiny, copy the source code from this example and modify server.R and ui.R inside resources/home/dnanexus/my_app.
How to rebuild the shiny asset
To build the asset, run the dx build_asset command, passing it shiny-asset, the name of the directory holding dxasset.json:
This outputs a record ID record-xxxx that you can then put into the applet's dxapp.json in place of the existing one:
Build the applet
Build and run the applet itself:
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
This applet performs a basic samtools view -c {bam} command, referred to as "SAMtools count", on the DNAnexus Platform.
Download BAM Files
For bash scripts, inputs to a job execution become environment variables. The inputs from the dxapp.json file are formatted as shown below:
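As a sketch of what this looks like inside the job, a file input arrives as an environment variable holding a DNAnexus link; the file ID below is a placeholder:

```shell
# A file input named mappings_bam becomes an environment variable
# containing a DNAnexus link (the file ID here is illustrative).
mappings_bam='{"$dnanexus_link": "file-xxxx"}'
echo "$mappings_bam"
```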
The object mappings_bam, a DNAnexus link containing the file ID of that file, is available as an environment variable in the applet's execution. Use the command dx download to download the BAM file. By default, downloading a file preserves the filename of the object on the platform.
SAMtools Count
Use the bash helper variable mappings_bam_name for file inputs. For these inputs, the DNAnexus Platform creates a bash variable [VARIABLE]_name that holds the platform filename. Because the file was downloaded with default parameters, the worker filename matches the platform filename. The helper variable [VARIABLE]_prefix contains the filename minus any suffixes specified in the input field patterns (for example, the platform removes the trailing .bam to create [VARIABLE]_prefix).
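The relationship between these helper variables can be illustrated with stand-in values (the Platform sets the real values automatically; bash suffix stripping reproduces the prefix behavior):

```shell
# Hypothetical value mirroring what the Platform sets for a file
# input named mappings_bam with platform filename my_mappings.bam:
mappings_bam_name="my_mappings.bam"
# _prefix is _name minus the matched ".bam" pattern suffix; the same
# result can be reproduced with bash suffix stripping:
mappings_bam_prefix="${mappings_bam_name%.bam}"
echo "$mappings_bam_name -> $mappings_bam_prefix"
```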
Upload Result
Use the dx upload command to upload data to the platform. This uploads the file into the job container, a temporary project that holds files associated with the job. When run with the --brief flag, dx upload returns only the file ID.
Job containers are an integral part of the execution process. To learn more, see the documentation on job containers.
Associate With Output
The output of an applet must be declared before the applet is even built. Looking back to the dxapp.json file, you see the following:
The applet declares a file type output named counts_txt. In the applet script, specify which file should be associated with the output counts_txt. On job completion, this file is copied from the temporary job container to the project that launched the job.
Precompiled Binary
This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).
In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can use the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary runs on future workers.
See the Cloud Workstation app in the App library for more information.
Resources Directory
The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory are packaged, uploaded to the Platform, and then extracted into the root directory / of the worker. In this case, the resources/ dir is structured as follows:
When this applet is run on a worker, the resources/ directory is placed in the worker's root directory /:
The SAMtools command is available because the respective binary is visible from the default $PATH variable. The directory /usr/bin/ is part of $PATH, so the script can reference the samtools command directly:
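The lookup can be sketched with the shell's own tooling; 'ls' stands in for samtools here, since the samtools binary exists only on the worker:

```shell
# Because resources/usr/bin/samtools is unpacked to /usr/bin/samtools
# on the worker, and /usr/bin is on $PATH, the bare command name
# resolves. 'command -v' shows where the shell finds an executable:
command -v ls
```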
Bash Helpers
Learn to build an applet that performs a basic SAMtools count with the aid of bash helper variables.
Source Code
Concurrent Computing Tutorials
Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.
Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud. To help make the documentation easier to understand, when discussing concurrent computing paradigms this guide refers to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in this case instances in the cloud) that communicate to concurrently process a workload.
User
In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).
To use the CLI, you need to install the DNAnexus Platform SDK.
If you're not familiar with the dx client, start with the introductory CLI documentation.
This section provides detailed instructions on using the dx client to perform such common actions as logging in, selecting projects, listing, copying, moving, and deleting objects, and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
The DNAnexus Platform provides powerful tools for data management, analysis, and collaboration. You can organize your data and analyses in secure, shareable projects with robust tools for uploading, downloading, and managing files.
Using the Platform, you can run apps, workflows, and custom analyses at scale. You can also share your projects and results with team members while maintaining access controls.
A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the Key Concepts section.
Download input files using the dx-download-all-inputs command. The dx-download-all-inputs command goes through all inputs and downloads them into folders with the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].
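As a sketch, the download path for an input named mappings_bam with platform filename my_mappings.bam would be built from that pattern:

```shell
# Illustrative values; the Platform fills these in for each input.
variable="mappings_bam"
filename="my_mappings.bam"
echo "/home/dnanexus/in/${variable}/${filename}"
```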
Step 2. Create an Output Directory
Create an output directory in preparation for the dx-upload-all-outputs DNAnexus command in the Upload Results section.
Step 3. Run SAMtools View
After executing the dx-download-all-inputs command, three helper variables are created to aid in scripting. For this applet, the input variable name mappings_bam with platform filename my_mappings.bam has the following helper variables:
Use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.
Step 4. Upload Result
Use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The dx-upload-all-outputs command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It uploads matching files and then associates them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. After creating the folders, place the outputs there.
{
"name": "counts_txt",
"class": "file",
"label": "Read count file",
"patterns": [
"*.txt"
],
"help": "Output file with Total reads as the first line."
}
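Given an output such as counts_txt, dx-upload-all-outputs expects a matching directory layout; a runnable sketch (relative paths are used so it runs anywhere, while on a worker the base directory is /home/dnanexus, and the file contents are illustrative):

```shell
# Create the layout dx-upload-all-outputs expects for an output
# named counts_txt, then place the result file inside it.
mkdir -p out/counts_txt
echo "Total reads: 100" > out/counts_txt/counts.txt
ls out/counts_txt
```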
# [VARIABLE]_path the absolute string path to the file.
$ echo $mappings_bam_path
/home/dnanexus/in/mappings_bam/my_mappings.bam
# [VARIABLE]_prefix the file name minus the longest matching pattern in the dxapp.json file
$ echo $mappings_bam_prefix
my_mappings
# [VARIABLE]_name the file name from the platform
$ echo $mappings_bam_name
my_mappings.bam
This applet performs a basic SAMtools count of alignments present in an input BAM.
Prerequisites
The app must have network access to the hostname where the git repository is located. In this example, access.network is set to:
To learn more, see the documentation on the access and network fields.
How is the SAMtools dependency added?
SAMtools is cloned and built from the repository. The following is a closer look at the dxapp.json file's runSpec.execDepends property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, the git fetch dependencies for htslib and SAMtools are specified. Dependencies resolve in the order listed. Specify htslib first, before the SAMtools build_commands, because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:
package_manager - Details the type of dependency and how to resolve it.
url - Must point to the server containing the repository. In this case, a GitHub URL.
The build_commands are executed from the destdir. Use cd when appropriate.
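Putting the properties above together, the execDepends entries might look like the following sketch (the URLs point at the public htslib and SAMtools repositories; the build command is illustrative, and only the SAMtools entry needs build_commands because htslib is built alongside it):

```json
{
  "runSpec": {
    "execDepends": [
      {
        "name": "htslib",
        "package_manager": "git",
        "url": "https://github.com/samtools/htslib.git",
        "destdir": "/home/dnanexus"
      },
      {
        "name": "samtools",
        "package_manager": "git",
        "url": "https://github.com/samtools/samtools.git",
        "destdir": "/home/dnanexus",
        "build_commands": "make"
      }
    ]
  }
}
```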
How is SAMtools called in the src script?
Because "destdir": "/home/dnanexus" is set in dxapp.json, the git repository is cloned to the same directory from which the script executes. The example directory's structure:
The SAMtools command in the app script is samtools/samtools.
Applet Script
You can build SAMtools in a directory that is on the $PATH or add the binary directory to $PATH. Keep this in mind for your app(let) development.
Dash Example Web App
This is an example web app made with Dash, which in turn uses Flask underneath.
After configuring an app with Dash, start the server on port 443.
Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Creating an applet on DNAnexus
Install the DNAnexus Platform SDK, then run dx-app-wizard with default options.
Creating the asset
The dash-asset directory specifies all the packages and versions needed.
Add these into dash-asset/dxasset.json:
Build the asset:
Use the asset from the applet
Add this asset to the applet's dxapp.json:
Build the applet
Build and run the applet itself:
You can always use dx ssh job-xxxx to SSH into the worker to inspect what's going on or experiment with quick changes. Then go to that job's special URL, https://job-xxxx.dnanexus.cloud/, and see the result!
Optional local testing
The main code is in dash-web-app/resources/home/dnanexus/my_app.py with a local launcher script called local_test.py in the same folder. This allows you to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.
Install the same libraries listed above locally.
To launch the web app locally:
Once it spins up, open the local address printed by the server to see the result.
Parallel by Region (sh)
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).
The SAMtools dependency is resolved by declaring an apt-get package in the dxapp.json runSpec.execDepends field.
Debugging
The command set -e -x -o pipefail assists you in debugging this applet:
-e causes the shell to immediately exit if a command returns a non-zero exit code.
-x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.
-o pipefail makes a pipeline's return code the first non-zero exit code, rather than the exit code of the last command, which can otherwise hide failures.
The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z ${var} ]]. You can then download or create a *.bai index as needed.
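A minimal sketch of that branch, with a stand-in value simulating an absent index (on a worker, the real dx download and samtools commands would run in the two branches):

```shell
# An empty or unset variable means no *.bai index was provided.
mappings_sorted_bai=""
if [[ -z "${mappings_sorted_bai}" ]]; then
  bai_status="missing"    # would run: samtools index "$bam_name"
else
  bai_status="present"    # would run: dx download "$mappings_sorted_bai"
fi
echo "index ${bai_status}"
```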
Parallel Run
Bash's job control system allows for convenient management of multiple processes. In this example, bash commands are run in the background while the maximum number of concurrent processes is controlled in the foreground. You can place a process in the background by appending the character & to a command.
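A self-contained sketch of the pattern: launch commands with &, then block with wait until all of them finish (file names and contents are illustrative):

```shell
# Run three commands in the background, then wait for all of them.
for i in 1 2 3; do
  echo "count $i" > "count_$i.txt" &
done
wait
cat count_1.txt count_2.txt count_3.txt
```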
Job Output
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the dx-upload-all-outputs command. The directory structure required for dx-upload-all-outputs is below:
In your applet, upload all outputs by creating the output directory and then using dx-upload-all-outputs to upload the output files.
Parallel xargs by Chr
This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
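The core xargs pattern can be sketched on its own: -P sets the number of parallel processes and -I {} substitutes each input line into the command. Here echo stands in for samtools view -c, and the bamfiles.txt contents are illustrative:

```shell
# Record the (stand-in) slice names, one per line.
printf 'bam_chr1.bam\nbam_chr2.bam\nbam_chr3.bam\n' > bamfiles.txt
# Run up to 4 commands at once, one per input line; sort the output
# for a deterministic display order.
xargs -P 4 -I {} echo "counting {}" < bamfiles.txt | sort
```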
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:
When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:
/usr/bin is part of the $PATH variable, so the script can reference the samtools command directly, for example, samtools view -c ....
Parallel Run
Slice BAM
First, download the BAM file and slice it by canonical chromosome, writing the *bam file names to another file.
To split a BAM by regions, you need a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, you generate the *.bai in the applet, sorting the BAM if necessary.
Xargs SAMtools view
In the previous section, you recorded the name of each sliced BAM file into a record file. Next, perform a samtools view -c on each slice using the record file as input.
Upload results
The results file is uploaded using the standard bash process:
Upload a file to the job execution's container.
Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>.
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, your training script in TensorFlow must include code that saves specific data to a log directory where TensorBoard can then find the data to display it.
This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website (external link).
Creating the web application
The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).
Run the training script in the background to start TensorBoard immediately, which lets you see the results while training is still running. This is particularly important for long-running training scripts.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
Creating an applet on DNAnexus
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
Apps and Workflows Glossary
Learn key terms used to describe apps and workflows.
On the DNAnexus Platform, the following terms are used when discussing apps and workflows:
Execution: An analysis or job.
Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via /executable-xxxx/run API call with detach flag set to true are also root executions.
Execution tree: The set of all jobs and/or analyses that are created because of running a root execution.
Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).
Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.
Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.
Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.
Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.
Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.
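For example, a job input referencing the counts_txt output of another job might contain a hash like the following sketch (the job ID is a placeholder):

```json
{
  "counts_txt": {
    "job": "job-xxxx",
    "field": "counts_txt"
  }
}
```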
FreeSurfer in JupyterLab
Learn how to use FreeSurfer in JupyterLab.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
About FreeSurfer
FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies.
The FreeSurfer package comes pre-installed with the JupyterLab IMAGE_PROCESSING feature.
FreeSurfer License Registration
To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license on the FreeSurfer website.
Using the FreeSurfer License on DNAnexus
To use the FreeSurfer license, complete the following steps:
Upload the license text file to your project on the DNAnexus Platform.
Launch the JupyterLab app and specify the IMAGE_PROCESSING feature.
Once JupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory.
Use dx download to fetch the license file, then move it into the FREESURFER_HOME directory.
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
The Spark application is an extension of the current app(let) framework. App(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow an additional, optional cluster specification with type=dxspark.
Calling /app(let)-xxxx/run for Spark apps creates a Spark cluster (+ master VM).
The master VM (where the app shell code runs) acts as the driver node for Spark.
Code in the master VM leverages the Spark infrastructure.
Spark apps can be launched over a distributed Spark cluster.
TensorBoard Example Web App
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, the training script in TensorFlow needs to include code that saves specific data to a log directory where TensorBoard can then find the data to display it.
This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website ().
Job Notifications
Learn how to set job notification thresholds on the DNAnexus Platform.
A license is required to use the functionality described on this page. Contact DNAnexus Sales for more information.
Being notified when a job may be stuck can help users troubleshoot problems. On DNAnexus, users can set job timeouts to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time.
When the threshold is reached for a root execution, the system sends an email notification to both the user who launched the executable and the org admin.
Executions and Time Limits
Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.
Types of Time Limits
On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.
Kaplan-Meier Survival Curve
Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.
An Apollo license is required to use the Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Building a Kaplan-Meier Survival Curve Chart
Scatter Plot
Learn to build and use scatter plots in the Cohort Browser.
An Apollo license is required to use the Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
When to Use Scatter Plots
Stacked Row Chart
Learn to build and use stacked row charts in the Cohort Browser.
An Apollo license is required to use the Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
When to Use Stacked Row Charts
JupyterLab Quickstart
In this tutorial, you will learn how to create and run a notebook in JupyterLab on the platform, download data from the notebook, and upload results to the platform.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
MONAI in JupyterLab
Using MONAI Core, MONAI Label/3D Slicer (SlicerJupyter) via JupyterLab
Medical Open Network for AI (MONAI) is a framework built for deep learning in healthcare imaging. To use MONAI on the DNAnexus Platform, launch the JupyterLab app with the MONAI_ML feature, which includes:
MONAI Core: PyTorch-based framework for deep learning in healthcare imaging.
MONAI Label: An intelligent image labeling and learning tool designed to create training datasets and build AI annotation models. It provides a server-client framework that integrates with imaging viewers.
app.run_server(host='0.0.0.0', port=443)
Job Timeouts
Each job has a timeout setting. This setting denotes the maximum amount of "wall clock time" that the job can spend in the "running" state, that is, running on the DNAnexus Platform.
If the job is still running when this limit is reached, the job is terminated.
As noted above, job timeouts only apply to the time a job spends in the "running" state.
Job timeouts do not apply to any time a job spends waiting to begin running, such as when a job is waiting for inputs to become available.
Job timeouts also do not apply to the time a job may spend between exiting the "running" state and entering the "done" state, such as when it is waiting for subjobs to finish.
Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree's root execution.
After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.
If an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree's root execution.
Errors
If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform does it fail. Depending on the access pattern, the Platform returns AppInternalError, AppError, or AuthError as the job's failure reason.
Monitoring Time Limits
To see information on time limits for execution and execution trees:
Navigate to the project in which the execution or execution tree is being run.
Click the Monitor tab.
Click the name of the execution or execution tree to open a page showing detailed information on it.
If a time limit is approaching, a warning message provides information on when the limit is reached.
If a job is waiting for subjobs to finish, it is shown as running, but job timeout information is not displayed. Execution tree information continues to be displayed.
Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.
Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.
Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.
Job tree: A set of all jobs that share the same origin job.
3D Slicer: An open-source software designed for the visualization, processing, and analysis of medical, biomedical, and other 3D images. In a Jupyter environment, 3D Slicer is accessible through the SlicerJupyter kernel and acts as a client for the MONAI Label server.
MONAI Core, MONAI Label, and 3D Slicer (SlicerJupyter) come pre-installed with the JupyterLab MONAI_ML feature option.
destdir - Directory on worker to which the git repository is cloned.
build_commands - Commands to build the dependency, run from the repository destdir. In this example, htslib is built when SAMtools is built, so only the SAMtools entry includes build_commands.
makes the return code the first non-zero exit code. (Typically, the return code of a pipe is the exit code of the last command, which can create difficult-to-debug problems.)
The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).
The training script runs in the background to start TensorBoard immediately, which allows you to see the results while training is still running. This is particularly important for long-running training scripts.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
Creating an applet on DNAnexus
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud, to see the result.
For a root execution, the turnaround time is the time between its creation and the time it reaches a terminal state (or the current time, if it is not yet in a terminal state). The terminal states of an execution are done, terminated, and failed. The job tree turnaround time threshold can be set in the dxapp.json app metadata file using the supported treeTurnaroundTimeThreshold field, with the threshold time given in seconds. When a user runs an executable that has a threshold, the threshold applies only to the resulting root execution. See the API documentation for more details on treeTurnaroundTimeThreshold.
Example of including the treeTurnaroundTimeThreshold field in dxapp.json:
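A sketch of such a fragment (the applet name and the 24-hour threshold, expressed in seconds, are illustrative):

```json
{
  "name": "my-applet",
  "treeTurnaroundTimeThreshold": 86400
}
```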
In the command-line interface (CLI), the dx build and dx build --app commands can accept the treeTurnaroundTimeThreshold field from dxapp.json, and the resulting app is built with the job tree turnaround time threshold from the JSON file.
To check the treeTurnaroundTimeThreshold value of an executable, use the dx describe {app, applet, workflow, or global workflow ID} --json command.
Using the dx describe {execution_id} --json command displays the selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, and treeTurnaroundTime values of root executions.
WDL Workflows
For WDL workflows and tasks, dxCompiler enables tree turnaround time specification using the extras JSON file. dxCompiler reads the treeTurnaroundTimeThreshold field from the perWorkflowDxAttributes and defaultWorkflowDxAttributes sections in extras and applies this threshold to the generated workflow. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold field to the perTaskDxAttributes and defaultTaskDxAttributes sections in the extras JSON file.
Example of including the treeTurnaroundTimeThreshold field in perWorkflowDxAttributes:
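A sketch of such an extras JSON fragment (the workflow name and the 24-hour threshold, in seconds, are illustrative):

```json
{
  "perWorkflowDxAttributes": {
    "my_workflow": {
      "treeTurnaroundTimeThreshold": 86400
    }
  }
}
```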
First, launch JupyterLab in the project of your choice, as described in the Running JupyterLab guide.
After starting your JupyterLab session, click on the DNAnexus tab on the left sidebar to see all the files and folders in the project.
2. Create an Empty Notebook
To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.
This creates an untitled ipynb file, viewable in the DNAnexus project browser, which refreshes every few seconds.
To rename your file, right-click on its name and select Rename.
3. Edit and Save the Notebook in the Project
You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, press Ctrl+S (or Command+S on macOS), or click on the save icon in the Toolbar (an area below the tab bar at the top). A new notebook version lands in the project, and you should see in the "Last modified" column that the file was created recently.
Since DNAnexus files are immutable, each notebook save creates a new version in the project, replacing the file of the same name. The previous version moves to the .Notebook_archive folder with a timestamp suffix added to its name. Saving notebooks directly in the project as new files preserves your analyses beyond the JupyterLab session's end.
4. Download the Data to the Execution Environment
To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).
You can also use the terminal to execute the dx command.
5. Upload Data to the Project
For any data generated by your notebook that needs to be preserved, upload it to the project before the session ends and the JupyterLab worker terminates. Upload data directly in the notebook by running dx upload from a notebook cell or from the terminal:
If you create a notebook from the Launcher or from the top menu (File > New > Notebook), the notebook is not created in the project but in the local execution environment. To move it to the project, you must upload it to the project manually. Make sure you upload your local notebooks to the project before the session expires, or work on your notebooks directly from the project, so as not to lose your work.
Next Steps
Check the References guide for tips on the most useful operations and features in JupyterLab.
set -e -x -o pipefail
echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
mkdir workspace
cd workspace
dx download "${mappings_sorted_bam}"
# If no index was provided, generate one; otherwise download it
if [ -z "$mappings_sorted_bai" ]; then
  samtools index "$mappings_sorted_bam_name"
else
  dx download "${mappings_sorted_bai}"
fi
# Extract valid chromosome names from BAM header
chromosomes=$(
  samtools view -H "${mappings_sorted_bam_name}" | \
    grep "@SQ" | \
    awk -F '\t' '{print $2}' | \
    awk -F ':' '{
      if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
        print $2
      }
    }'
)
# Split BAM by chromosome and record output file names
for chr in $chromosomes; do
  samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
  echo "bam_${chr}.bam"
done > bamfiles.txt
# Parallel counting of reads per chromosome BAM
busyproc=0
while read -r b_file; do
  echo "${b_file}"
  # If the number of busy processes hits the core count, drain them all
  if [[ "${busyproc}" -ge "$(nproc)" ]]; then
    echo "Processes hit max"
    while [[ "${busyproc}" -gt 0 ]]; do
      wait -n
      busyproc=$((busyproc - 1))
    done
  fi
  # Count reads in background
  samtools view -c "${b_file}" > "count_${b_file%.bam}" &
  busyproc=$((busyproc + 1))
done < bamfiles.txt
# Wait for any remaining background counts
while [[ "${busyproc}" -gt 0 ]]; do
  wait -n
  busyproc=$((busyproc - 1))
done
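The throttling pattern above can be exercised without samtools or DNAnexus inputs. This stand-in sketch uses `sleep` as the workload and a fixed process limit, and waits for just one background job before launching the next (a slightly tighter variant of the drain-all loop above):

```shell
#!/usr/bin/env bash
# Stand-in sketch of the throttling pattern: 'sleep' replaces the samtools
# workload, and maxproc stands in for $(nproc).
set -e -o pipefail
printf '%s\n' job_a job_b job_c job_d > jobs.txt
busyproc=0
maxproc=2
while read -r name; do
  if [[ "${busyproc}" -ge "${maxproc}" ]]; then
    wait -n                     # block until ONE background job exits
    busyproc=$((busyproc - 1))
  fi
  ( sleep 0.1; echo "done" > "count_${name}" ) &
  busyproc=$((busyproc + 1))
done < jobs.txt
wait                            # drain any remaining background jobs
ls count_job_* | wc -l          # prints 4
```

Note that `wait -n` requires bash 4.3 or newer.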
$HOME
└── out
    └── <output name in dxapp.json>
        └── output file
# Start the training script and put it into the background,
# so the next line of code runs immediately
python mnist_tensorboard_example.py --log_dir LOGS_FOR_TENSORBOARD &
# Run TensorBoard
tensorboard --logdir LOGS_FOR_TENSORBOARD --host 0.0.0.0 --port 443
dx build -f tensorboard-web-app
dx run tensorboard-web-app
To generate a survival chart, select one numerical field representing time, and one categorical field, which is transformed into the individual's status.
The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": living, alive, diseasefree, disease-free.
For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.
Calculating Survival Percentage
To calculate survival percent at the current event, the system evaluates the following formula:
ST = (LT0 − D) / LT0
ST: Survival at the current event
LT0: Number of subjects living at the start of the period or event
D: Number of subjects that died
For each time period the following values are generated:
Status: Each individual is considered Dead unless they qualify as Living.
Number of Subjects Living at the Start (LT0)
For the initial event, this is the total number of records returned by the backend from survival data with a Living or Dead status.
For follow-up events, this is the number of subjects at the start of the previous event, minus the number of subjects that died in the previous event and the subjects that dropped out or were censored in the previous event.
Number of Subjects Who Died (D): add 1 for each individual who, at the event, does not have a status of Living.
Number of Subjects Dropped or Censored: add 1 for each individual who, at the event, has a status of Living.
Survival Percent at the Current Event (ST): computed using the formula above.
Cumulative Survival: the survival percent at the current event multiplied by the cumulative survival at the previous event.
This is the actual point drawn on the survival plot.
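As a worked example of the formula above: if 100 subjects are living at the start of a period (LT0) and 5 die during it (D), survival for that period is 95%:

```shell
# Worked example: LT0 = 100 living at the start, D = 5 died this period
awk 'BEGIN { LT0 = 100; D = 5; printf "%.2f\n", (LT0 - D) / LT0 }'   # prints 0.95
```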
Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.
Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.
Supported Data Types
Primary Field: Numerical (Integer) or Numerical (Float)
Secondary Field: Numerical (Integer) or Numerical (Float)
Using Scatter Plots in the Cohort Browser
In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that more records share a particular combination.
Scatter Plot: Insurance Billed x Cost
Non-Numeric Data in Scatter Plots
Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a scatter plot. The message "This field contains non-numeric values" appears below the scatter plot, as in this sample chart:
Scatter Plot Based on Field or Fields Containing Non-Numeric Values
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears.
Detail on Non-Numeric Values
Limit on Number of Data Points
In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require more data points than this, you see this message above the chart:
Scatter Plot with Warning Message about Data Point Limit
In this scenario, add a cohort filter to reduce the number of data points, so that the scatter plot shows data for all the members of the cohort.
Cohort Compare
Scatter plots are not supported in Cohort Compare.
Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.
When creating a stacked row chart:
Both the primary and secondary fields must contain categorical data
Both the primary and secondary fields must contain no more than 20 distinct category values
Supported Data Types
Primary Field: Categorical (<=20 distinct category values)
Secondary Field: Categorical (<=20 distinct category values)
Categorical multiple and categorical hierarchical data are not supported in stacked row charts.
Using Stacked Row Charts in the Cohort Browser
In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."
The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, you can see that 3,179 records contain the value "Out-patient" in the VisitType field.
Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, you can see that 87 records in the first group share the value "specialist" in the DoctorType field.
Stacked Row Chart: VisitType x DoctorType
Cohort Compare
Stacked row charts are not supported in Cohort Compare. Use a list view instead.
Preparing Data for Visualization in Stacked Row Charts
Developers new to the DNAnexus Platform may find it helpful to learn by doing. This page contains a collection of tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus Platform.
By following the tutorials and examples, you learn to develop app(let)s that:
Run efficiently using cloud computing methodologies
Are straightforward to debug and use
Take advantage of the DNAnexus Platform's flexibility and scale
Reduce support burden while enabling collaboration
If it's your first time developing an app(let), start with the introductory series, which introduces terms and concepts that the tutorials and examples build on.
These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. These tutorials showcase varied implementations of the SAMtools view command on the DNAnexus Platform.
Bash App(let) Tutorials
Bash app(let)s use dx-toolkit, the platform SDK, and the command-line interface along with common Bash practices to create bioinformatic pipelines in the cloud.
Python App(let) Tutorials
Python app(let)s make use of dx-toolkit along with common Python modules to create bioinformatic pipelines in the cloud.
Web App(let) Tutorials
To create a web applet, you need access to Titan or Apollo features. Web applets can be made as either Python or Bash applets. The only difference is that they launch a web server and expose port 443 (for HTTPS) to allow a user to interact with that web application through a web browser.
Concurrent Computing Tutorials
A bit of terminology before starting the discussion of parallel and distributed computing paradigms on the DNAnexus Platform.
Many definitions and approaches exist for parallelizing and distributing workloads in the cloud. To make the documentation easier to understand when discussing concurrent computing paradigms, this guide refers to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in this case, cloud instances) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
Parallel
Distributed
Executions and Cost and Spending Limits
Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.
Types of Cost and Spending Limits
A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.
Execution Cost Limits
A cost limit can be set on an execution tree when the root execution is launched. Once this limit is reached, the DNAnexus Platform terminates running executions in the affected execution tree.
Errors
When an execution is terminated in this fashion, the Platform sets a corresponding failure code. This failure code is displayed on the UI, on the relevant project's Monitor page.
Billing Account Spending Limits
Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.
Billing account spending limits apply to cumulative charges incurred by projects billed to the account.
If cumulative charges reach this limit, the Platform terminates running jobs in projects billed to the account, and prevents new executions from being launched.
Errors
When a job is terminated in this fashion, the Platform sets a corresponding failure reason. This failure reason is displayed on the UI, on the relevant project's Monitor page.
Project-Level Compute and Egress Spending Limits
A license is required to use the Enforce Monthly Spending Limit for Computing and Egress feature. Contact DNAnexus Sales for more information.
Project admins can set a monthly project-level compute spending limit, which can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.
If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.
For more information on these limits, see the related billing documentation.
Compute Charges Incurred by Using Relational Database Clusters
Monthly project compute limits do not apply to compute charges incurred by using relational database clusters.
Compute Charges for Using Public IPv4 Addresses for Workers
Using public IPv4 addresses for workers incurs additional charges. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any applicable spending limits.
For information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions, see the pricing documentation.
Getting Info on Cost and Spending Limits
Execution Costs and Cost Limits
The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.
While an execution or execution tree is running, information is displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.
Spending Limits
Org spending limit information is available from the org administration pages.
Project-Level Monthly Spending Limits
If project-level monthly spending limits have been set for a project, detailed information is available via the CLI.
Cohort Browser
Visualize your data and browse your multi-omics datasets.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Cohort Browser is a visualization tool for exploring and filtering structured datasets. It provides an intuitive interface for creating visualizations, defining patient cohorts, and analyzing complex data.
Cohort Browser supports multiple types of datasets:
Clinical and phenotypic data - Patient demographics, clinical measurements, and outcomes
Multi-assay datasets - Datasets combining multiple assay types or instances of the same assay type
If you need to perform custom statistical analysis, you can also use environments with Spark clusters to query your data programmatically.
Prerequisites
You need to ingest your data before you can access it through a dataset in the Cohort Browser.
Opening Datasets Using the Cohort Browser
In Projects, select the project where your dataset is located.
Go to the Manage tab.
Select your dataset.
You can also use the Info Panel to view information about the selected dataset, such as its creator.
Getting familiar with Cohort Browser
Depending on your dataset, the Cohort Browser shows the following tabs:
Overview - Clinical data using interactive charts and dashboards
Data Preview - Clinical data in tabular format
Assay-specific tabs - Additional tabs appear based on your dataset content:
Exploring Data in a Dataset
In the Cohort Browser's Overview tab, you can explore clinical data using interactive charts and dashboards. These visualizations provide an introduction to the dataset and insights on the clinical data it contains.
When you open a dataset, Cohort Browser automatically creates an empty cohort that includes all records in the dataset. From here, you can add filters to define cohorts, explore your data, and export filtered data for further analysis outside the platform.
Next Steps
- Build visualizations and manage dashboard layouts
- Filter data and create patient groups
- Work with inherited genetic variations
Row Chart
Learn to build and use row charts in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
When to Use Row Charts
Row charts can be used to visualize categorical data.
When creating a row chart:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values cannot be organized in a hierarchy
Supported Field Types
See the list view documentation if you need to visualize hierarchical categorical data.
When to Use List Views for Categorical Data
Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a list view.
Row charts aren't supported in Cohort Compare mode, and are converted to another supported view in that mode.
Using Stacked Row Charts for Multivariate Visualizations
Row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a stacked row chart.
Using Row Charts in the Cohort Browser
In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."
Below is a sample row chart showing the distribution of values in a field Salt added to food. In the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.
When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See the related documentation for more information on how missing data affects chart calculations.
Preparing Data for Visualization in Row Charts
When ingesting data, the following data types can be visualized in row charts, if category values are specified as such in the coding file used at ingestion:
String Categorical
String Categorical Sparse
String Categorical Multi-select
While sparse serial data can be visualized using row charts, non-encoded values are not supported. These values do not appear as rows.
VCF Preprocessing
Learn about preprocessing VCF data before using it in an analysis.
Overview
It may be necessary to preprocess, or harmonize, the data before you load them.
Harmonizing Data
The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.
Basic Run
Advanced Run
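As a minimal sketch of the harmonization step (the preset name and file names are illustrative, and assume GLnexus and bcftools are installed):

```shell
# Hypothetical sketch -- preset name and file names are illustrative.
# Merge per-sample gVCFs into a single joint call set, then compress:
glnexus_cli --config gatk sample1.g.vcf.gz sample2.g.vcf.gz > cohort.bcf
bcftools view cohort.bcf | bgzip -c > cohort.pvcf.gz
```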
To learn more about GLnexus, see its documentation.
Annotating Variants
VCF files can include variant annotations. SnpEff annotations provided as INFO/ANN tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.
The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.
VCF annotation flows. In (a) the annotation step is external to the VCF Loader, whereas in (b) the annotation step is internal. In either case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF Loader.
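A sketch of the external annotation path, flow (a), assuming SnpEff is available locally (the database name and file names are illustrative):

```shell
# Hypothetical sketch -- SnpEff database name and file names are illustrative.
# Annotate the harmonized pVCF; SnpEff writes INFO/ANN tags:
java -Xmx8g -jar snpEff.jar GRCh38.99 cohort.pvcf.gz > cohort.annotated.pvcf
```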
CSV Loader
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Overview
The CSV Loader ingests CSV files into a database. The input CSV files are loaded into a Parquet-format database and tables that can be queried using Spark SQL.
You can load a single CSV file or many CSV files. When loading many files, all files must be syntactically consistent.
For example:
All files must use the same separator. This can be a comma, tab, or another consistent delimiter.
All files must include a header line, or all files must omit it.
Each CSV file is loaded into its own table within the specified database.
How to Run CSV Loader
Input:
CSV (array of CSV files to load into the database)
Required Parameters:
database_name -> name of the database to load the CSV files into.
create_mode -> strict mode creates database and tables from scratch and optimistic mode creates databases and tables if they do not already exist.
Other Options:
spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.
spark_read_csv_sep -> default , -- the separator character used by each CSV.
Basic Run
The following case creates a brand new database and loads data into two new tables:
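A sketch of such an invocation, using the parameters described above (the app name and file IDs are illustrative):

```shell
# Hypothetical invocation -- app name and file IDs are illustrative.
dx run csv-loader \
  -icsv=file-xxxx \
  -icsv=file-yyyy \
  -idatabase_name=my_database \
  -icreate_mode=strict
```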
Running Older Versions of JupyterLab
Learn how to run an older version of JupyterLab via the user interface or command-line interface.
Why Run an Older Version of JupyterLab?
The primary reason to run an older version of JupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.
Launching an Older Version via the User Interface (UI)
From the main Platform menu, select Tools, then Tools Library.
Find and select, from the list of tools, either JupyterLab with Python, R, Stata, ML, Image Processing or JupyterLab with Spark Cluster.
From the tool detail page, click on the Versions tab.
Launching an Older Version via the Command-Line Interface (CLI)
Select the project in which you want to run JupyterLab.
Launch the version of JupyterLab you want to run, substituting the version number for x.y.z in the following commands:
For JupyterLab without the Spark cluster capability, run the command
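A sketch of such a launch command (the app name is an assumption; x.y.z is the version placeholder described above):

```shell
# Hypothetical sketch -- the app name is illustrative; substitute the
# desired version for x.y.z:
dx run app-dxjupyterlab/x.y.z --priority high
```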
Running JupyterLab at "high" priority is not required. However, doing so ensures that your interactive session is not interrupted by spot instance termination.
Accessing JupyterLab
After launching JupyterLab, access the JupyterLab environment using your browser. To do this:
Get the job ID for the job created when you launched JupyterLab. See the for details on how to get the job ID, via either the UI or the CLI.
Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.
Mkfifo and dx cat
This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipes) special files, run the command man fifo in your shell.
Named pipes require BOTH a stdin and stdout
Distributed by Region (sh)
Entry Points
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
Parallel xargs by Chr
This applet slices a BAM file by canonical chromosome and performs a parallelized SAMtools view.
How is the SAMtools dependency provided?
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:
Creating Charts and Dashboards
Create charts, manage dashboards, and build visualizations to explore your datasets in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. for more information.
Create interactive visualizations and manage dashboard layouts in the Cohort Browser.
Grouped Box Plot
Learn to build and use grouped box plots in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. for more information.
When to Use Grouped Box Plots
Histogram
Learn to build and use histograms in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. for more information.
When to Use Histograms
Running JupyterLab
Learn to launch a JupyterLab session on the DNAnexus Platform, via the JupyterLab app.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
For DNAnexus Platform users, a license is required to access JupyterLab. for more information.
Environment Variables
The command-line client and the client bindings use a set of environment variables to communicate with the API server and to store state on the current default project and directory. These settings are set when you run dx login and can be changed through other dx commands. To display the active settings in human-readable format, use the dx env command:
To print the bash commands for setting the environment variables to match what dx is using, you can run the same command with the --bash flag.
Running a dx command from the command-line does not (and cannot) overwrite your shell's environment variables. The environment variables are stored in the dx-generated configuration file ~/.dnanexus_config/environment.json.
insert_mode -> append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.
table_name -> array of table names, one for each corresponding CSV file by array index.
type -> the cluster type, "spark" for Spark apps
spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
The following examples run named pipes that do not yet have both ends connected in background processes, so the foreground script does not block.
To approach this use case, outline the desired steps for the applet:
Stream the BAM file from the platform to a worker.
While the BAM streams, count the number of reads present.
Write the result to a file.
Stream the result file to the platform.
Stream BAM file from the platform to a worker
First, establish a named pipe on the worker. Then, stream to stdin of the named pipe and download the file as a stream from the platform using dx cat.
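The pattern can be exercised with stand-ins: here `printf` plays the role of `dx cat file-xxxx`, and `wc -l` plays the role of the samtools command reading the stream.

```shell
#!/usr/bin/env bash
# Stand-in sketch of streaming through a named pipe.
set -e -o pipefail
mkfifo input_fifo                              # create the named pipe
printf 'read1\nread2\nread3\n' > input_fifo &  # writer end, in the background
wc -l < input_fifo                             # reader end; prints 3
rm input_fifo
```

The writer blocks until a reader opens the pipe (and vice versa), which is why the writer must run in the background.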
FIFO        stdin   stdout
BAM file    YES     NO
Output BAM file read count
Having created the FIFO special file representing the streamed BAM, you can call the samtools command as you normally would; samtools serves as the stdout (reader) end of the BAM FIFO. However, remember that you want to stream the output back to the Platform, so you must create a named pipe representing the output file too.
FIFO          stdin   stdout
BAM file      YES     YES
output file   YES     NO
The directory structure created here (~/out/counts_txt) is required to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> are uploaded to the corresponding <output name> specified in the dxapp.json.
Stream the result file to the platform
A stream from the platform has been established, piped into a samtools command, and the results are output to another named pipe. However, the background process remains blocked without a stdout for the output file. Creating an upload stream to the platform resolves this.
Alternatively, dx upload - can upload directly from stdin, eliminating the need for the directory structure required for dx-upload-all-outputs. Warning: When uploading a file that exists on disk, dx upload is aware of the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not automatically known and dx upload uses default parameters. While these parameters are fine for most use cases, you may need to specify upload part size with the --buffer-size option.
Wait for background processes
With background processes running, wait in the foreground for those processes to finish.
Without waiting, the app script running in the foreground would finish and terminate the job prematurely.
How is the SAMtools dependency provided?
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the worker's root directory. In this case:
When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:
/usr/bin is part of the $PATH variable, so the samtools command can be referenced directly in the script as samtools view -c ....
The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Regions are sent, in batches of 10, as input to the count_func entry point using the dx-jobutil-new-job command.
Job outputs from the count_func entry point are referenced as Job-Based Object References (JBORs) and used as inputs for the sum_reads entry point.
The job output of the sum_reads entry point is used as the output of the main entry point, via a JBOR reference in the dx-jobutil-add-output command.
count_func
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command dx-jobutil-add-output.
sum_reads
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as job output.
When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:
/usr/bin is part of the $PATH variable, so in the script, you can reference the samtools command directly, as in samtools view -c ....
Parallel Run
Slice BAM
First, download the BAM file and slice it by canonical chromosome, writing the *bam file names to another file.
To split a BAM by regions, you need to have a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, the *.bai is generated in the applet, sorting the BAM if necessary.
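A sketch of generating the index in the applet (assumes samtools is available on the worker, as in this tutorial; file names are illustrative):

```shell
# Sketch -- file names are illustrative; assumes samtools is on PATH.
samtools sort -o mappings_sorted.bam mappings.bam   # sort if necessary
samtools index mappings_sorted.bam                  # writes mappings_sorted.bam.bai
```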
Xargs SAMtools view
In the previous section, the name of each sliced BAM file was recorded into a record file. Next, perform a samtools view -c on each slice using the record file as input.
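The xargs fan-out can be exercised with stand-ins: `wc -c` replaces `samtools view -c`, and two small text files replace the BAM slices (GNU xargs is assumed for the -a flag).

```shell
#!/usr/bin/env bash
# Stand-in sketch of the parallel xargs pattern.
set -e -o pipefail
printf 'slice1\n' > bam_chr1.bam
printf 'slice22\n' > bam_chr2.bam
printf 'bam_chr1.bam\nbam_chr2.bam\n' > bamfiles.txt
# -P runs up to $(nproc) commands in parallel; -I{} substitutes each name
xargs -a bamfiles.txt -P "$(nproc)" -I{} sh -c 'wc -c < "{}" > "count_{}"'
cat count_bam_chr1.bam    # prints the byte count, 7
```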
Upload results
The results file is uploaded using the standard bash process:
Upload a file to the job execution's container.
Provide the DNAnexus link as the job's output using the command dx-jobutil-add-output <output name>
The dx command always prioritizes the environment variables that are set in the shell. This means that if you have set your environment variable for DX_SECURITY_CONTEXT and then use dx login to log in as a different user, it still uses the original environment variable. When not run in a script, it prints a warning to stderr whenever the environment variables and its stored state have a mismatch. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables is generally used within a shell script or as part of a job environment in the cloud.
In the interaction below, environment variables have already been set; when the user then uses dx login to select a different project, the shell's environment variables still override the stored values.
Clearing dx-set Variables
If you instead want to discard the values which dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.
Command Line Options
Most dx commands have the following additional flags to temporarily override the values of the respective variables.
For example, you can temporarily override the current default project used:
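An illustrative example, reusing the sample project ID from the transcript below (the --project-context-id flag is the generic override listed in the usage text):

```shell
# List another project's contents for a single command, without changing
# the stored default (project ID matches the sample transcript):
dx ls --project-context-id project-9zVpbQf4Zg2641v5BGY00001
```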
/
├── usr
│   └── bin
│       └── <samtools binary>
└── home
    └── dnanexus
# Extract list of reference regions from BAM header
regions=$(
  samtools view -H "${mappings_sorted_bam_name}" | \
    grep "@SQ" | \
    sed 's/.*SN:\(\S*\)\s.*/\1/'
)
echo "Segmenting into regions"
count_jobs=()
counter=0
temparray=()
# Loop through each region
for r in $regions; do
  if [[ "${counter}" -ge 10 ]]; then
    echo "${temparray[@]}"
    count_jobs+=($(
      dx-jobutil-new-job \
        -ibam_file="${mappings_sorted_bam}" \
        -ibambai_file="${mappings_sorted_bai}" \
        "${temparray[@]}" \
        count_func
    ))
    temparray=()
    counter=0
  fi
  # Add region to temp array of -i<parameter>s
  temparray+=("-iregions=${r}")
  counter=$((counter + 1))
done
# Handle remaining regions (fewer than 10)
if [[ $counter -gt 0 ]]; then
  echo "${temparray[@]}"
  count_jobs+=($(
    dx-jobutil-new-job \
      -ibam_file="${mappings_sorted_bam}" \
      -ibambai_file="${mappings_sorted_bai}" \
      "${temparray[@]}" \
      count_func
  ))
fi
echo "Merge count files, jobs:"
echo "${count_jobs[@]}"
readfiles=()
for count_job in "${count_jobs[@]}"; do
  readfiles+=("-ireadfiles=${count_job}:counts_txt")
done
echo "file name: ${sorted_bamfile_name}"
echo "Set file, readfile variables:"
echo "${readfiles[@]}"
countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
$ dx env
Auth token used adLTkSNkjxoAerREqbB1dVkspQzCOuug
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-9zVpbQf4Zg2641v5BGY00001
Current workspace name "Scratch Project"
Current folder /
Current user alice
$ dx ls -l
Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
Folder : /
<Contents of Sample Project>
$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: alice
Password:
Note: Use "dx select --level VIEW" or "dx select --public" to select from
projects for which you only have VIEW permissions.
Available projects:
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Mouse (ADMINISTER)
Pick a numbered choice [1]: 2
Setting current project to: Mouse
$ dx ls
WARNING: The following environment variables were found to be different than the
values last stored by dx: DX_SECURITY_CONTEXT, DX_PROJECT_CONTEXT_ID
To use the values stored by dx, unset the environment variables in your shell by
running "source ~/.dnanexus_config/unsetenv". To clear the dx-stored values,
run "dx clearenv".
Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
Folder : /
<Contents of Sample Project>
$ source ~/.dnanexus_config/unsetenv
$ dx ls -l
Project: Mouse (project-9zVpbQf4Zg2641v5BGY00002)
Folder : /
<Contents of Mouse>
$ dx --env-help
usage: dx command ... [--apiserver-host APISERVER_HOST]
[--apiserver-port APISERVER_PORT]
[--apiserver-protocol APISERVER_PROTOCOL]
[--project-context-id PROJECT_CONTEXT_ID]
[--workspace-id WORKSPACE_ID]
[--security-context SECURITY_CONTEXT]
[--auth-token AUTH_TOKEN]
optional arguments:
--apiserver-host APISERVER_HOST
API server host
--apiserver-port APISERVER_PORT
API server port
--apiserver-protocol APISERVER_PROTOCOL
API server protocol (http or https)
--project-context-id PROJECT_CONTEXT_ID
Default project or project context ID
--workspace-id WORKSPACE_ID
Workspace ID (for jobs only)
--security-context SECURITY_CONTEXT
JSON string of security context
--auth-token AUTH_TOKEN
Authentication token
$ dx env --project-context-id project-B0VK6F6gpqG6z7JGkbqQ000Q
Auth token used R54BN6Ws6Zl3Y0VqBA9o1qweUswYW5o4
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-B0VK6F6gpqG6z7JGkbqQ000Q
Current folder /
Dashboards contain your charts and define their layout. Each such configuration is called a dashboard view. Dashboard views can be specific to a saved cohort or standalone (custom dashboard view). You can create multiple dashboard views, allowing you to switch between different visualizations and analyses.
By using Dashboard Actions, you can save or load your own dashboard views. This lets you quickly switch between different visualizations without having to set them up each time.
Save Dashboard View - Saves the current dashboard configuration as a record of the DashboardView type, including all tiles and their settings.
Load Dashboard View - Loads a custom dashboard view, restoring the tiles and their configurations.
Using Dashboard Actions
After loading a dashboard view once, you can access it again from Dashboard Actions > Custom Dashboard Views.
Moving dashboards between datasets? If you want to use your dashboard views with a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your custom dashboard configurations to a new target dataset.
Visualizing Data
Add charts to your dashboards to visualize the clinical and phenotypical data in your dataset. For example, you can add charts to display patient demographics or clinical measurements.
Working with Multi-Assay Visualizations
For omics datasets, such as those for germline variants, somatic variants, or gene expression, you have additional predefined visualization options available:
Gene expression data is visualized using expression level and feature correlation charts. For details, see .
Adding Tiles to Visualize Data
Each chart is represented as a tile on the dashboard. You can add multiple tiles to visualize different aspects of your data.
In the Overview tab, click + Add Tile on the top-right.
In the hierarchical list of the dataset fields, select the field you want to visualize.
In Data Field Details, choose your preferred chart type.
The available chart types depend on the field's value type.
Click Add Tile.
The tile appears on the dashboard with the current cohort data. You can add up to 15 tiles.
Creating Multi-Variable Charts
When selecting data fields to visualize, you can add a secondary data field to create a multi-variable chart. This allows you to visualize relationships between two data fields in the same chart.
To visualize the relationship between two data fields in the same chart, first select your primary data field from the hierarchical list. This opens a Data Field Details panel, showing the field's information and a preview of a basic chart.
To add a secondary field, keep the primary field selected and search for the desired field. When you find it, click the Add as Secondary Field icon (+) next to its name rather than selecting it directly. This adds the new field to the visualization. The Data Field Details panel updates to show the combined information for both fields.
You can click the + icon only when at least one chart type is supported for the specified combination.
For certain chart types, such as Stacked Row Chart and Scatter Plot, you can re-order the primary and secondary data fields by dragging the data field in Data Field Details.
Adding grouped box plot by combining two data fields
For more details on multi-variable charts, including how to build a survival curve, see Multi-Variable Charts.
Chart Optimization
When working with large datasets, keep these tips in mind:
Limit dashboard tiles: To ensure fast loading times and a clear overview, it's best to limit the number of charts on a single dashboard. Typically, 8-10 tiles is a good number for human comprehension and optimal performance.
Filter data first: Reduce the volume of data by applying filters before you create complex visualizations. This improves chart loading speed.
Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.
When creating a grouped box plot:
The primary field must contain categorical or categorical multiple data
The primary field must contain no more than 15 distinct category values
The secondary field must contain numerical data
Supported Data Types
Primary Field: Categorical or Categorical Multiple (<=15 categories)
Secondary Field: Numerical (Integer) or Numerical (Float)
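To make concrete what a grouped box plot computes, here is a minimal sketch of the per-group statistics behind each box. The data and field names are hypothetical, mirroring the Doctor / Visit Feeling example in this section:

```python
from statistics import median

# Toy records: (category, numeric value) pairs -- a hypothetical
# "Doctor" (categorical) / "Visit Feeling" (numerical) combination.
records = [
    ("Dr. A", 3), ("Dr. A", 5), ("Dr. A", 4),
    ("Dr. B", 2), ("Dr. B", 2), ("Dr. B", 4),
]

def grouped_box_stats(records):
    """Compute per-group min/median/max -- the core numbers a grouped
    box plot draws for each category of the primary field."""
    groups = {}
    for category, value in records:
        groups.setdefault(category, []).append(value)
    return {
        cat: {"min": min(vals), "median": median(vals), "max": max(vals)}
        for cat, vals in groups.items()
    }
```

Each box in the chart summarizes one such per-category distribution.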
Using Grouped Box Plots in the Cohort Browser
The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:
Grouped Box Plot
Non-Numeric Values in Grouped Box Plots
A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart above for an example of the informational message that shows below the chart, in this scenario.
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
Grouped Box Plot: Detail on Non-Numeric Values
Outliers
Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:
Outlier Value in a Grouped Box Plot
This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. In multiple groups, one member was recorded as consuming far more cups of coffee per day than others in the group.
Grouped Box Plots in Cohort Compare
In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.
In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.
Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Grouped Box Plot in cohort compare mode
Preparing Data for Visualization in Grouped Box Plots
Histograms can be used to visualize numerical, date, and datetime data.
Supported Data Types
Numerical (Integer)
Numerical (Float)
Date
Datetime
Using Histograms in the Cohort Browser
In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or similar values, in a particular field.
The Cohort Browser automatically groups records into bins, based on the distribution of values in the dataset, for the field. Values are distributed in a linear fashion, on the x axis.
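The automatic binning can be sketched as a simple equal-width scheme. This is an illustration of the idea, not the Cohort Browser's exact algorithm:

```python
def linear_bins(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count the
    records falling into each -- a sketch of linear binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # avoid zero width when all values match
    counts = [0] * n_bins
    for v in values:
        # Clamp the top edge so max(values) lands in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts
```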
Below is a sample histogram showing the distribution of values in a field Critical care total days. The label under the chart title indicates the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.
Histogram in the Cohort Browser
Customizing Chart Display
You can customize how histogram data is displayed by clicking ⛭ Chart Settings in the chart toolbar.
Histogram chart settings showing Display Statistics, Transform Data, and Chart Type options
For data with wide value ranges or skewed distributions, you can apply logarithmic scaling to either or both axes:
log₂ - Values transformed using f(x)=sign(x)⋅log2(∣x∣+1)
log₁₀ - Values transformed using f(x)=sign(x)⋅log10(∣x∣+1)
When you apply logarithmic transformation, the axis label updates to show the transformation type (log₂ or log₁₀).
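The transform can be written directly from the formulas above:

```python
import math

def signed_log(x, base=2):
    """f(x) = sign(x) * log_base(|x| + 1): keeps 0 at 0 and handles
    negative values, matching the log2/log10 axis transforms above."""
    sign = -1 if x < 0 else 1
    return sign * math.log(abs(x) + 1, base)
```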
Non-Numeric Data in Histograms
A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you see the following informational message below the chart:
Histogram Displaying Data for a Field Containing Non-Numeric Values
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
Detail on Non-Numeric Values Omitted from a Histogram
In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
In the main menu, navigate to Tools > JupyterLab. If you have used JupyterLab before, the page shows your previous sessions across different projects.
Click New JupyterLab.
Configure your JupyterLab session:
Specify the session name and select an instance type.
Choose the project where JupyterLab should run.
Set the session duration after which the environment automatically shuts down.
Optionally, provide a snapshot file to load a previously saved environment.
If needed, enable Spark Cluster and set the number of nodes.
Select a feature option based on your analysis needs:
PYTHON_R (default): Python3 and R kernel and interpreter
ML: Python3 with machine learning packages (TensorFlow, PyTorch, CNTK) and image processing (Nipype), but no R
Review the pricing estimate (if you have billing access) based on your selected duration and instance type.
Click Start Environment to launch your session. The JupyterLab session shows an "Initializing" state while the worker spins up and the server starts.
Open your JupyterLab environment by clicking the session name link once the state changes to "Ready". You can also access it directly via https://job-xxxx.dnanexus.cloud, where job-xxxx is your job ID.
Snapshots created using older versions of JupyterLab are incompatible with the current version. If you need to use an older JupyterLab snapshot, see environment snapshot guidelines.
For a detailed list of libraries included in each feature option, see the in-product documentation.
Running JupyterLab from the CLI
You can start the JupyterLab environment directly from the command line by running the app:
Once the app starts, you can check whether the JupyterLab server is ready to serve connections, indicated by the job's property httpsAppState being set to running. Once it is running, open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
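For example, you could poll the property from the CLI. The job ID below is a placeholder, and the dx and jq invocations are shown as comments since they require a live session:

```shell
# Placeholder job ID -- substitute the ID printed by `dx run`:
JOB_ID="job-xxxx"
URL="https://${JOB_ID}.dnanexus.cloud"
# Poll until the property reads "running" (requires the dx CLI and jq):
#   dx describe "$JOB_ID" --json | jq -r '.properties.httpsAppState'
echo "$URL"
```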
To run the Spark version of the app, use the command:
You can check the optional input parameters for the apps on the DNAnexus Platform (platform login required to access the links):
Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.
About Projects
Within the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.
Projects have a series of features designed for collaboration, helping project members coordinate and organize their work, and ensuring appropriate control over both data and tools.
See the for details on how to create a project, share it with other users, and run an analysis.
Managing Project Content
A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.
Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.
The following are four common actions you can perform on objects from within the Manage screen.
Downloading Files
You can directly download file objects from the system.
Select the file's row.
Click More Actions (⋮).
From the list of available actions, select Download.
Getting More Information on Objects
To learn more about an object:
Select its row, then click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen. An info panel opens on the right, displaying a range of information about the object. This includes its unique ID, as well as metadata about its owner, time of creation, size, tags, properties, and more.
Deleting Objects
Deletion is permanent and cannot be undone.
To delete an object:
Select its row.
Click More Actions (⋮).
From the list of available actions, select Delete.
Copying Data to Another Project
To copy a data object or objects to another project, you must have CONTRIBUTE or ADMINISTER access to that project.
Select the object or objects you want to copy to a new project, by clicking the box to the left of the name of each object in the objects list.
Click the Copy button in the upper right corner of the Manage screen. A modal window opens.
Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.
Access and Sharing
Adding Project Members
You can collaborate on the platform by . On sharing a project with a user, or group of users in an , they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.
Removing Project Members
To remove a user or org from a project to which you have ADMINISTER access:
On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window opens, showing a list of project members.
Find the row showing the user you want to remove from the project.
Move your mouse over that row, then click the Remove from Members button at the right end of the row.
Project Access Levels
Access Level
Description
Project Access Levels: Two Examples
Suppose you have a set of samples sequenced at your lab, and you have a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.
Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project, then upload your sequenced data to the project. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You both are then able to use each other's data to perform downstream analyses.
Restricting Access to Executables
A project admin can configure a project to allow project members to run only specific executables as . The list of allowed executables is set by entering the following command, via the CLI:
This command overwrites any existing list of allowed executables.
To discard the allowed executables list, that is, let project members run all available executables as root executions, enter the following command:
Executables that are called by a permitted executable can run even if they are not included in the list.
Project Data Access Controls
Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false and you can view and modify them via the CLI and the Platform API. In the project's Settings web screen, you can view and modify the protected, restricted, downloadRestricted, previewViewerRestricted, externalUploadRestricted, and containsPHI settings as described below.
protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.
restricted: If set to true,
PHI Data Protection
Only projects billed to org billing accounts can have PHI Data Protection enabled.
A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. for more information.
Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:
Data in this project cannot be cloned to other projects that do not have containsPHI set to true
Any jobs that run in non-PHI projects cannot access any data that can only be found in PHI projects
Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging onto the Platform and viewing the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.
Billing and Charges
On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account to which invoices are sent, covering all billable activities carried out within the project.
For information on configuring your billing account, see .
You link a project to a billing account, that is, the organization to which its expenses are billed, when you .
Monthly Project Spending and Usage Limits
Licenses are required for both the Monthly Project Compute and Egress Usage Limit and Monthly Project Storage Spending Limit features. for more information.
The Monthly Project Usage Limit for Compute and Egress and Monthly Project Storage Spending Limit features can help project admins monitor and keep project costs under control. For more information, see .
In the project's Settings tab under the Usage Limits section, project admins can view the project's compute and egress usage limits.
For details on how to set and retrieve project-specific compute and egress usage limits, and storage spending limits, see the .
Transferring Project Billing Responsibility
Transferring Billing Responsibility to Another User
If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:
On the project's Settings screen, scroll down to the Administration section.
Click the Transfer Billing button. A modal window opens.
Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.
The user receives an email notification of your request. To finalize the transfer, they need to log onto the Platform and formally accept it.
Transferring Billing Responsibility to an Org
If you have billable activities access in the org to which you wish to transfer the project, you can change the project's billing account to that org. To do this, navigate to the project settings page by clicking the gear icon in the project header. On the project settings page, select the billing account to which the project should be billed.
If you do not have billable activities access in the org you wish to transfer the project to, you need to transfer the project to a user who does have this access. The recipient is then able to follow the instructions below to accept a project transfer on behalf of an org.
Cancelling a Transfer of Billing Responsibility
You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the recipient. To do this:
Select All Projects from the Projects link in the main menu. Open the project. You see a Pending Project Ownership Transfer notification at the top of the screen.
Click the Cancel Transfer button to cancel the transfer.
Accepting a Transfer Request
When another user initiates a project transfer to you, you receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.
If you did not already have access to the project being transferred, you receive VIEW access and the project appears in the list on the Projects screen.
To accept the transfer:
Open the project. You see a Pending Project Ownership Transfer notification in the project header.
Click the Accept Transfer button.
Select a new billing account for the project from the dropdown of eligible accounts.
Projects with PHI Data Protection Enabled
If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.
Sponsored Projects
Ownership of may not be transferred without the sponsorship first being terminated.
Project Sponsorship
A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.
On setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.
Billing responsibility for sponsored projects may not be transferred.
Sponsored projects may not be deleted without the project sponsor first ending the sponsorship, by changing its end date to a date in the past.
For more information about sponsorship, contact .
Learn More
for detailed information on projects that are billed to an org.
Learn about accessing and working with projects via the CLI:
Learn about working with projects as a developer:
Pysam
This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.
Pysam is provided via a pip3 install, declared in the dxapp.json's runSpec.execDepends property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, pip3 is specified as the package manager and pysam version 0.15.4 as the dependency to resolve.
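Based on the description above, the execDepends entry would take roughly this shape (a sketch following the dxapp.json conventions, not the applet's exact file):

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "pysam", "package_manager": "pip3", "version": "0.15.4"}
    ]
  }
}
```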
Downloading Input
The fields mappings_sorted_bam and mappings_sorted_bai are passed to the main function as parameters for the job. These parameters are dictionary objects of the form {"$dnanexus_link": "file-xxxx"}. File objects on the platform are accessed through file handles. If an index file is not supplied, a *.bai index is created.
Working with Pysam
Pysam provides key methods that mimic SAMtools commands. In this applet example, the focus is only on canonical chromosomes. The Pysam object representation of a BAM file is pysam.AlignmentFile.
The helper function get_chr
Once a list of canonical chromosomes is established, you can iterate over them and perform the Pysam version of samtools view -c, pysam.AlignmentFile.count.
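The applet's get_chr helper is not shown here; a hypothetical reimplementation of its filtering logic, together with the Pysam count call it feeds, might look like the following. The contig-name sets and the function names are assumptions for illustration:

```python
def get_chr(contig_names, canonical=True):
    """Hypothetical sketch of the applet's get_chr helper: keep only
    canonical contigs (chr1-22, chrX/Y/M, with or without the prefix)."""
    canon = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}
    canon |= {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
    if not canonical:
        return list(contig_names)
    return [c for c in contig_names if c in canon]

def count_canonical(bam_path):
    """Pysam equivalent of `samtools view -c <bam> <contig>` per contig."""
    import pysam  # resolved on the worker via execDepends
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {c: bam.count(contig=c) for c in get_chr(bam.references)}
```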
Uploading Outputs
The summarized counts are returned as the job output. The dx-toolkit Python SDK upload function uploads the tabulated result file and returns a corresponding DXFile handle.
Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The function generates the appropriate DXLink value.
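The returned mapping has the shape sketched below. In a real applet the link dict is produced by the SDK from the uploaded file handle; the output name counts_txt and the helper name here are assumptions mirroring this example:

```python
def make_job_output(file_id):
    """Build the job output dict: keys are output names declared in
    dxapp.json's outputSpec; file values are DXLink dicts of the form
    {"$dnanexus_link": <file-id>}."""
    return {"counts_txt": {"$dnanexus_link": file_id}}
```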
Distributed by Region (sh)
Entry Points
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
main
The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. The regions are sent, in batches of 10, as input to the count_func entry point using the command.
Job outputs from the count_func entry point are referenced as Job Based Object References and used as inputs for the sum_reads entry point.
The job output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference in the command.
count_func
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command .
sum_reads
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as job output.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.
For additional information, see execDepends.
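Since apt is assumed as the default package manager, the declaration can be minimal; a sketch of the relevant dxapp.json fragment:

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
```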
Entry Points
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:
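For illustration, a hypothetical systemRequirements fragment using this app's entry-point names (the instance type strings are examples, not the app's actual settings):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_v2_x4"},
      "count_func": {"instanceType": "mem1_ssd1_v2_x2"},
      "sum_reads": {"instanceType": "mem1_ssd1_v2_x2"}
    }
  }
}
```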
main
The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the command.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.
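A sketch of that final wiring step in main. The job ID is a placeholder, the applet output name counts_txt and subjob output name read_sum_file are assumed from this example, and the dx-jobutil-add-output call is shown as a comment since it only runs inside a job:

```shell
sum_reads_job="job-xxxx"   # placeholder for the ID returned by dx-jobutil-new-job
jbor="${sum_reads_job}:read_sum_file"
# Forward the subjob's output as the applet's own output, without waiting for it:
#   dx-jobutil-add-output counts_txt "$jbor" --class=jobref
echo "$jbor"
```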
count_func
This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.
sum_reads
The main entry point triggers this subjob, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file as the entry point output.
Parallel by Region (sh)
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.
Debugging
The command set -e -x -o pipefail assists you in debugging this applet:
-e causes the shell to exit immediately if a command returns a non-zero exit code.
-x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact point of failure.
-o pipefail causes a pipeline to return a non-zero status if any command in it fails, not just the last one.
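The effect of pipefail can be seen in a standalone one-liner (a demo, not applet code):

```shell
# With pipefail, a pipeline fails if ANY stage fails, not just the last:
status=$(bash -c 'set -o pipefail; false | true; echo $?')
echo "$status"   # 1 -- the failure of `false` is surfaced despite `true` succeeding
```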
Parallel Run
Bash's system allows for convenient management of multiple processes. In this example, you can run bash commands in the background as you control maximum job executions in the foreground. Place processes in the background using the character & after a command.
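The pattern above can be sketched as follows. A stub command stands in for samtools here, and the region names are placeholders; the full applet additionally caps concurrency (for example with `wait -n`, bash >= 4.3) rather than launching everything at once:

```shell
# Count three BAM slices concurrently; each count runs in the background (&):
for r in chr1 chr2 chr3; do
  # Applet equivalent: samtools view -c "input.${r}.bam" > "${r}.count" &
  sh -c "echo 100 > '${r}.count'" &
done
wait   # block in the foreground until every background job has finished
```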
Job Output
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command . The directory structure required by dx-upload-all-outputs is shown below:
In your applet, upload all outputs by:
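A sketch of that step, with a placeholder file standing in for the real result and the upload command itself commented out since it only runs inside the job environment (the output name counts_txt is taken from this example):

```shell
# dx-upload-all-outputs scans $HOME/out/<output_name>/ and uploads each file
# it finds as the job output named <output_name>:
mkdir -p "$HOME/out/counts_txt"
touch "$HOME/out/counts_txt/counts.txt"   # stand-in for the real result file
#   dx-upload-all-outputs
```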
Analyzing Germline Variants
Analyze germline genomic variants, including filtering, visualization, and detailed variant annotation in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Explore and analyze datasets with germline data by opening them in the Cohort Browser and switching to the Germline Variants tab. You can create cohorts based on germline variants, visualize variant patterns, and examine detailed variant information.
Filtering by Germline Variants
You can to include only samples with specific germline variants.
To apply a germline filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Genomic Sequencing, select a genomic filter.
In Edit Filter: Variant (Germline), specify your filtering criteria:
After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.
Exploring Variant Patterns in Your Cohort
The Germline Variants tab includes a lollipop plot displaying allele frequencies for variants in a specified genomic region. This visualization helps you identify patterns in germline variants across your cohort and understand the distribution of allelic frequencies.
If your dataset contains multiple germline variant assays, such as WES and WGS assays, you can choose the assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. When you switch between assays, your charts and their display settings are preserved.
Examining Variant Annotations
The allele table, located below the lollipop plot, shows the same variants in a tabular format with comprehensive annotation information. It lets you examine specific variant characteristics and compare allele frequencies within your selected cohort, across the entire dataset, and in annotation databases such as gnomAD.
The annotation information includes:
Type: Whether the variant is a SNP, deletion, insertion, or mixed.
Consequences: The impact of the variant according to SnpEff. For variants with multiple gene annotations, this column displays the most severe consequence per gene.
Population Allele Frequency: Allele frequency calculated across the entire dataset from which the cohort was created.
If canonical transcript information is available, the following three columns with additional annotation information appear in the table:
Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.
HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.
HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.
Exporting Variant Metadata
You can export the selected variants in the table as a list of variant IDs or a CSV file.
To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.
To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).
For large datasets, you can use a dedicated export tool to download data more efficiently.
Accessing Detailed Variant Information
In Allele table > Location column, click a specific location to open the locus details view. The locus details view provides in-depth annotations and population genetics data for the selected genomic position.
When genomic information is ingested and made available in the Cohort Browser, variants are annotated using SnpEff and reference databases such as dbSNP and gnomAD. The specific versions of each are set during the ingestion process, which creates a set of tables optimized for cohort creation through the Cohort Browser.
The locus details page displays three main sections of pre-calculated information from dataset ingestion: Location Info, Genotypes, and Alleles. These sections provide a comprehensive view starting with a locus summary, including genotype frequencies, followed by detailed annotations for each allele.
Location Info
The Location Info section provides a quick overview of the genomic locus in your dataset, including the chromosome and starting position, the frequency of both the reference allele and no-calls, and the total number of alleles available.
Genotypes
The Genotypes section shows a detailed breakdown of genotypes in the dataset at the specific location. Since allele order is not preserved, genotypes like C/A and A/C are counted in the same category, which is why only half of the comparison table is populated. These genotype frequencies represent the entire dataset at this location, not only your selected cohort.
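The order-insensitive counting described above can be sketched in Python (genotype strings are illustrative):

```python
from collections import Counter

# Allele order is not preserved, so normalize each genotype by sorting its
# alleles; "C/A" and "A/C" then fall into the same category, which is why
# only half of the comparison table is populated.
genotypes = ["C/A", "A/C", "C/C", "A/C"]
counts = Counter("/".join(sorted(g.split("/"))) for g in genotypes)
print(counts)  # Counter({'A/C': 3, 'C/C': 1})
```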
Alleles
The Alleles section displays detailed information for each allele, collected from dbSNP and gnomAD during data ingestion. When available, rsID or AffyID appear with direct links to the corresponding page. The section provides allele type, affected samples (dataset), and gnomAD frequency for quick reference, with additional details sorted by transcript ID in the Genes / Transcripts table. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.
Integrating with Advanced Analysis Tools
For more sophisticated genomic analysis beyond the Cohort Browser's visualization capabilities, you can connect your variant data with other DNAnexus tools: export variant lists for detailed downstream analysis, use scalable compute for large-scale genomic computations, or run complex queries across your dataset with Spark SQL.
Box Plot
Learn to build and use box plots in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
When to Use Box Plots
Box plots can be used to visualize numerical data.
Supported Data Types
Numerical data can also be visualized using other chart types.
Using Box Plots in the Cohort Browser
Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:
Max - The maximum, or highest value
Med - The median value
Min - The minimum, or lowest value
The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.
Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values. "Q3" is the highest value in the third quartile.
Also shown in this window is the total count of values covered by the box plot, along with the name of the entity to which the data relates.
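The quantities shown in that window can be reproduced with Python's standard statistics module (the values are illustrative):

```python
import statistics

values = [1, 2, 2, 3, 4, 5, 9]                   # illustrative numeric field values
q1, med, q3 = statistics.quantiles(values, n=4)  # quartile cut points Q1, median, Q3
box = (q1, q3)                                   # the blue box spans Q1..Q3
lines = (max(values), med, min(values))          # the Max, Med, and Min lines
print(q1, med, q3)  # 2.0 3.0 5.0
```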
Customizing Chart Display
You can customize how box plot data is displayed by clicking ⛭ Chart Settings in the chart toolbar.
For data with wide value ranges or skewed distributions, you can apply logarithmic scaling to the Y-axis:
log₂ - Values transformed using a base-2 logarithm
log₁₀ - Values transformed using a base-10 logarithm
When you apply logarithmic transformation, the Y-axis label updates to show the transformation type (log₂ or log₁₀).
Non-Numeric Data in Box Plots
Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a box plot. See the chart above for an example of the informational message that shows below the chart when non-numeric values are present.
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
In this scenario, a discrepancy exists between the "count" figure shown in the chart label and the one shown in the informational window that opens when hovering over the middle of a box plot. The latter figure is smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.
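The discrepancy can be sketched as a simple filter (the record values are illustrative):

```python
# Non-numeric entries cannot be placed on a box plot, so the hover-window
# count is the chart-label count minus the non-numeric records.
records = [3, 5, "N/A", 7, "declined", 4]
numeric = [v for v in records if isinstance(v, (int, float))]
plotted_count = len(numeric)              # the smaller, hover-window figure
excluded = len(records) - plotted_count   # records behind the "non-numeric" link
print(plotted_count, excluded)  # 4 2
```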
Outliers
Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:
This box plot displays data on the number of cups of coffee consumed per day, by patients of a particular cohort. One cohort patient was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.
Box Plots in Cohort Compare Mode
In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.
Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Preparing Data for Visualization in Box Plots
When ingesting data into a dataset, the following data types can be visualized in box plots:
Integer
Integer Sparse
Float
Omics Data Assistant
Explore and analyze datasets using natural language queries with Omics Data Assistant, a GenAI-powered interface integrated into Cohort Browser.
A license is required to use Omics Data Assistant on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Omics Data Assistant (the assistant) is a GenAI-powered conversational interface that helps you explore and analyze complex biomedical and clinical datasets using natural language. The assistant is integrated directly into Cohort Browser. This means you can combine conversational queries with powerful visualization tools for comprehensive data analysis.
Whether you're new to a dataset or an experienced bioinformatician, the assistant saves you time by understanding your questions in plain English. New users can quickly discover what data their datasets contain without browsing through fields and schemas. Experienced users can define cohorts in seconds by describing criteria in a few sentences, eliminating the need to manually configure multiple filters.
Omics Data Assistant uses generative AI to accelerate your analysis. While powerful, AI models can occasionally produce inaccurate or incomplete results. Always verify generated cohorts and insights against your underlying data. The assistant alone should not be used for clinical diagnosis or treatment decisions.
How It Works
Omics Data Assistant uses the latest Anthropic Claude model for natural language understanding and response generation. This model supports large context windows, allowing the assistant to handle complex queries and maintain context throughout conversations.
The assistant accesses only your Apollo dataset and does not connect to the internet or external data sources. Omics Data Assistant is deployed regionally to meet data residency requirements, keeping your data in your region throughout all operations. Conversations are stored securely and remain private to you.
Getting Started
Prerequisites
Your organization has active Omics Data Assistant and Cohort Browser licenses.
You have access to an Apollo dataset and its associated databases in a project.
Opening Omics Data Assistant
Omics Data Assistant works with datasets in Cohort Browser.
In the DNAnexus Platform, open your dataset in the Cohort Browser.
Click ✨ Omics Data Assistant in the lower right corner.
By default, the assistant opens in a panel on the right side. You can enlarge the assistant panel by clicking Enter Full Screen.
In the assistant's input field, you can explore the data by asking questions in plain English.
Below the input, you can use two controls to get started quickly:
Dataset Overview: Opens an AI-generated overview of the opened dataset. Use this to learn what the dataset contains without writing a prompt. You can still ask follow-up questions for specific details.
Help: Opens a guide about Omics Data Assistant, its capabilities, and example prompts to try.
First-Time Dataset Indexing
The first time Omics Data Assistant is used with a dataset, the dataset must be indexed. This one-time process enables natural language queries by creating vector representations of the dataset structure. Only one person needs to start this indexing process. Subsequent users can query the dataset immediately once indexing is complete.
After indexing starts, it runs in the background. Most datasets complete indexing within 15 minutes. Large datasets like UK Biobank may take over an hour. During indexing, you cannot ask questions until the process completes. You can monitor indexing progress through the assistant interface.
Index data is stored securely in the same AWS region as your data to maintain data residency requirements.
Using Omics Data Assistant
Asking Questions
Omics Data Assistant responds to your questions in plain English and translates them into structured database queries.
Example prompts:
"Find all patients diagnosed with IBD within 6 months of a diabetes diagnosis"
"Get patients with lower hemoglobin values than the laboratory's recommended value"
"Create cohort of all patients with exon loss variants in KIAA1109"
Understanding Responses
Each response includes the assistant's thinking process, which shows how it interpreted your question and the SQL queries it generated. You can verify the assistant understood your question correctly, review the SQL queries for accuracy, and learn how natural language translates to database queries.
If your question is unclear, the assistant asks clarifying questions to ensure accurate results.
For each response, you can:
Copy responses: Click Copy at the end of any response to copy the markdown-formatted text to your clipboard for use in documents or notes.
Provide feedback: Click the thumbs up or thumbs down buttons on responses to help improve the assistant's accuracy. When giving negative feedback, you can describe the problem in your own words. Your feedback helps the DNAnexus team enhance the assistant for all users.
Creating Cohorts
The assistant excels at creating cohorts and generating demographic summaries. For complex visualizations, create your cohorts through the assistant, then use Cohort Browser's native visualization tools for detailed analysis.
To create a cohort directly from Omics Data Assistant, phrase your question to specify patient criteria. For example, "Create a cohort of patients with amplifications in HER2".
When the assistant returns the cohort results, click + Add to Dashboard to filter the dataset by the new cohort in Cohort Browser. You can add multiple cohorts from the assistant to your dashboard and compare them.
From Cohort Browser, you can save cohorts to your project.
Managing Conversations
Your conversation history is stored separately for each dataset. Your conversations remain private to you. Other users cannot access your conversation history.
To manage your conversations, click the three dots next to a conversation name to either rename or delete the conversation.
To view and search through your past conversations, click See All at the bottom of the panel.
Exploring and Querying Datasets
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Extracting Data From a Dataset With Spark
The dx commands extract_dataset and extract_assay let you either retrieve the data dictionary of a dataset or extract the underlying data described by that dictionary. You can also use these commands to get dataset metadata, such as the names and titles of entities and fields, or to list all relevant assays in a dataset.
Often, you can retrieve data without using Spark, and no extra compute resources are required. However, if you need more compute power, such as when working with complex data models, large datasets, or extracting large volumes of data, you can use a private Spark resource. Using private compute resources avoids query timeouts by scaling resources as needed.
If you use the --sql flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.
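As a sketch, the call might be assembled like this (the record ID and field names are hypothetical):

```python
dataset = "record-abc123"   # hypothetical record ID, or a path to the dataset
cmd = ["dx", "extract_dataset", dataset,
       "--fields", "participant.participant_id,participant.age",  # hypothetical fields
       "--sql"]
# Inside a logged-in DNAnexus session you could then run:
#   sql = subprocess.check_output(cmd, text=True)
print(" ".join(cmd))
```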
Initiating a Spark Session
The most common way to use Spark on the DNAnexus Platform is via a Spark-enabled JupyterLab session.
After creating a Jupyter notebook within a project, enter the commands shown below to start a Spark session.
Python:
R:
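As a sketch, a Spark session is typically started in Python like this (assuming the Spark-enabled JupyterLab environment, where pyspark is preinstalled; the R flow, commonly via sparklyr, is analogous):

```python
import pyspark

# Create a context bound to the job's Spark cluster, then wrap it in a session.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```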
Executing SQL Queries
Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:
Python:
R:
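For example, in Python (a sketch assuming an initialized spark session; the fully qualified database and table names are illustrative):

```python
# Results of a Spark SQL query land in a Spark DataFrame.
df = spark.sql(
    "SELECT * FROM database_fjf3y28066y5jxj2b0gz4g85__metabric_data.participant LIMIT 10"
)
df.show()
```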
Query to Extract Data From Database Using extract_dataset
Python:
Where dataset is the record-id or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset."
R:
Where dataset is the record-id or the path to the dataset or cohort.
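A Python sketch of assembling the command (field names and output path are hypothetical):

```python
dataset = "record-abc123"  # or a path such as "/mydirectory/mydataset.dataset"
cmd = ["dx", "extract_dataset", dataset,
       "--fields", "participant.participant_id,participant.age",  # hypothetical fields
       "--output", "extracted.csv"]
# subprocess.check_call(cmd)  # run inside a logged-in DNAnexus session
print(" ".join(cmd))
```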
Query to Filter and Extract Data from Database Using extract_assay germline
Python:
R:
In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele command. For more information, refer to the accompanying example notebooks.
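A sketch of assembling the call (the record ID and output path are hypothetical):

```python
dataset = "record-abc123"  # or a path such as "/mydirectory/mydataset.dataset"
cmd = ["dx", "extract_assay", "germline", dataset,
       "--retrieve-allele", "allele_filter.json",  # JSON filter file described above
       "--output", "alleles.tsv"]                  # hypothetical output path
# subprocess.check_call(cmd)  # run inside a logged-in DNAnexus session
print(" ".join(cmd))
```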
Run SQL Query to Extract Data
Python:
R:
Best Practices
When querying large datasets, such as those containing genomic data, ensure that your Spark cluster is scaled appropriately, with multiple nodes to parallelize across.
Ensure that your Spark session is initialized only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (for example, by running notebook 1 and then notebook 2, or by running a single notebook from start to finish multiple times), the Spark session becomes corrupted and you must restart the affected notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.
VCF Loader
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Overview
VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input for each run can be a single VCF file or many VCF files, but together the inputs must represent a single logical VCF file. When multiple files are provided, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In every case, each input VCF file must be a syntactically correct, sorted VCF file.
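A minimal sketch of the per-file sortedness requirement (simplified: it checks only that positions are non-decreasing within each chromosome, not full VCF validity):

```python
def is_coordinate_sorted(vcf_lines):
    """Return True if data lines are position-sorted within each chromosome."""
    last_pos = {}
    for line in vcf_lines:
        if line.startswith("#"):          # skip header lines
            continue
        chrom, pos = line.split("\t")[:2]
        if int(pos) < last_pos.get(chrom, 0):
            return False
        last_pos[chrom] = int(pos)
    return True

ok = is_coordinate_sorted(["#CHROM\tPOS", "chr1\t100", "chr1\t250", "chr2\t50"])
bad = is_coordinate_sorted(["chr1\t300", "chr1\t100"])
print(ok, bad)  # True False
```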
VCF Preprocessing
Although VCF data can be loaded into Apollo databases after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, preprocessing and harmonizing the data before loading is recommended. To learn more, see .
How to Run VCF Loader
Input:
vcf_manifest: (file) a text file containing a list of file ID's of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned and every specified partition must be a valid VCF file. After the partition-merge step in preprocessing, the complete VCF file must be valid.
Required Parameters:
database_name: (string) name of the database into which to load the VCF files.
create_mode: (string) strict mode creates database and tables from scratch and optimistic mode creates databases and tables if they do not already exist.
Other Options:
snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome: (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
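A sketch of producing the manifest file (the platform file IDs are hypothetical):

```python
# vcf_manifest is a plain text file with one VCF file ID per line; the
# referenced files must have distinct names ending in .vcf.gz.
file_ids = ["file-AAA111", "file-BBB222"]  # hypothetical platform file IDs
with open("vcf_manifest.txt", "w") as fh:
    fh.write("\n".join(file_ids) + "\n")
```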
Basic Run
Chart Types
Get an overview of the range of different charts you can build and use in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
While working in the Cohort Browser, you can visualize data using a variety of different types of charts.
The following single-variable chart types are available in the Cohort Browser:
Multi-Variable Charts
The following multi-variable chart types are available in the Cohort Browser:
When creating multi-variable charts using datasets that include data related to multiple entities, the entity relationship between the selected data fields affects chart type availability. In general, data fields related to the same entity, or related to entities that relate to one another in 1:1, N:1, or 1:N fashion, can be used together in a multi-variable chart.
Interpreting Chart Data
Chart Totals and Missing Data
In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.
This figure is not always the same as the number of records in the cohort.
In a single-variable chart, if a field in a record is empty or contains a null value, that record is not included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon appears next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.
The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record isn't included in the chart total count, as its data can't be visualized.
Apps and Workflows
Every analysis in DNAnexus is run using apps. Apps can be linked together to create workflows. Learn the basics of using both.
You must set up billing for your account before you can perform an analysis, or upload or egress data.
Finding the Right App or Workflow
Developer Quickstart
Learn to build an app that you can run on the Platform.
This tutorial provides a quick intro to the DNAnexus developer experience, and progresses to building a fully functional, useful app on the Platform. For a more in-depth discussion of the Platform, see .
The steps below require the DNAnexus Platform SDK. You must download and install it if you have not already done so.
Besides this Quickstart, there are Developer Tutorials located in the sidebar that go over helpful tips for new users as well. A few of them include:
Parallel by Chr (py)
This applet tutorial performs a SAMtools count using parallel threads.
To take full advantage of the scalability that cloud computing offers, your scripts have to implement the correct methodologies. This applet tutorial shows you how to:
Install SAMtools
Download BAM file
Parallel by Region (py)
This applet tutorial performs a SAMtools count using parallel threads.
To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial shows you how to:
Install SAMtools
Download BAM file
List View
Learn to build and use list views in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
When to Use List Views
Distributed by Chr (sh)
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.
Analyzing Gene Expression Data
Analyze gene expression data, including expression-based filtering, visualization, and molecular profiling in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Explore and analyze datasets with gene expression assays by opening them in the Cohort Browser and switching to the Gene Expression tab. You can create cohorts based on expression levels, visualize expression patterns, and examine detailed gene information.
Visualizing Data
The DNAnexus Platform offers multiple different methods for viewing your files and data.
Previewing Files
DNAnexus allows users to preview and open the following file types directly on the platform:
IMAGE_PROCESSING: Python3 with image processing packages (Nipype, FreeSurfer, FSL), but no R. FreeSurfer requires a license. GUI viewers such as fsleyes and freeview cannot be launched in the headless environment.
MONAI_ML: Extends the ML feature with specialized medical imaging frameworks, such as MONAI Core, MONAI Label, and 3D Slicer.
For datasets with multiple germline variant assays, select the specific assay to filter by.
On the Genes / Effects tab, select variants of specific types and variant consequences within the specified genes and/or genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.
On the Variant IDs tab, specify a list of variant IDs, with a maximum of 100 variants.
To enter multiple genes, genomic ranges, or variants, separate them with commas or place each on a new line.
Click Apply Filter.
Cohort Allele Frequency: Allele frequency calculated across current cohort selection.
GnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset gnomAD.
Follow the instructions in the modal window that opens.
Follow the instructions in the modal window that opens.
Click the Copy Selected button.
data in this project cannot be cloned to another project
data in this project cannot be used as input to a job or an analysis in another project
any running app or applet that reads from this project cannot write results to any other project
a job running in the project has singleContext flag set to true irrespective of the singleContext value supplied to /job/new and /executable-xxxx/run, and is only allowed to use the job's DNAnexus authentication token when issuing requests to the proxied DNAnexus API endpoint within the job. Use of any other authentication token results in an error.
This flag corresponds to the Copy Access policy in the project's Settings web interface screen.
downloadRestricted: If set to true, data in this project cannot be downloaded outside of the platform. For database objects, users cannot access the data in the project from outside DNAnexus. When set to true, previewViewerRestricted defaults to true unless explicitly overridden. This flag corresponds to the Download Access policy in the project's Settings web interface screen.
previewViewerRestricted: If set to true, file preview and viewer are disabled for the project. This flag defaults to true when downloadRestricted is set to true. In the project's Settings screen, the File Preview setting is adjustable when Download Access is set to restrict downloads, but is disabled and set to allow preview when Download Access allows all members to download. You can override this by explicitly setting previewViewerRestricted to false using the /project-xxxx/update API method.
databaseUIViewOnly: If set to true, project members with VIEW access have their access to project databases restricted to the Cohort Browser only. This feature is only available to customers with an Apollo license. Contact DNAnexus Sales for more information.
containsPHI: If set to true, data in this project is treated as Protected Health Information (PHI), identifiable health information that can be linked to a specific person. PHI data protection safeguards the confidentiality and integrity of project data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) by imposing the additional restrictions documented in the PHI Data Protection section. This flag corresponds to the PHI Data Protection setting in the Administration section of a project's Settings web interface screen.
displayDataProtectionNotice: If set to true, ADMIN users can turn on/off the ability to show a Data Protection Notice to any users accessing the selected project. If the Data Protection Notice feature is enabled for a project, all users, when first accessing the project, are required to review and confirm their acceptance of a requirement not to egress data from the project. A license is required to use this feature. Contact DNAnexus Sales for more information.
externalUploadRestricted: If set to true, external file uploads to this project (from outside the job context) are rejected. This flag corresponds to the External Upload Access policy in the project's Settings web interface screen. A license is required to use this feature. Contact DNAnexus Sales for more information.
httpsAppIsolatedBrowsing: If set to true, httpsApp access to jobs launched in this project are wrapped in Isolated Browsing, which restricts data transfers through the httpsApp job interface. A license is required to use this limited-access feature. Contact DNAnexus Sales for more information.
Apollo database access is subject to additional restrictions.
Once PHI Data Protection is activated for a project, it cannot be disabled.
Click Send Transfer Request.
VIEW
Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD
Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE
Gives users UPLOAD access, plus the ability to run executions directly in the project.
Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) rather than its short name (metabric_data in this case).
insert_mode: (string) append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.
run_mode: (string) site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data, and all mode processes both types of data.
etl_spec_id: (string) Only the genomics-phenotype schema choice is supported.
is_sample_partitioned: (boolean) whether the raw VCF data is partitioned.
snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). This option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'
calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.
-o pipefail makes the return code of a pipeline the first non-zero exit code. (Typically, the return code of a pipeline is the exit code of its last command, which can create difficult-to-debug problems.)
The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. Then, you can download or create a *.bai index as needed.
dxapp.json: a file containing the app's metadata, including its inputs and outputs, how the app is run, and its execution requirements
a script that is executed in the cloud when the app is run
Start by creating a file called dxapp.json with the following text:
The example specifies the app name (coolapp), the interpreter (python3) used to run the script, and the path (code.py) to the script created next. The "version": "0" field refers to the version of the Ubuntu 24.04 application execution environment that supports the python3 interpreter.
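Reconstructed from the description above (a sketch, not the tutorial's verbatim file), the dxapp.json would look roughly like this:

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0"
  }
}
```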
Next, create the script in a file called code.py with the following text:
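A minimal code.py along these lines might look as follows. This is a sketch: the greeting text is an assumption, and the dxpy registration boilerplate (the @dxpy.entry_point('main') decorator and a trailing dxpy.run() call) is shown only in comments so the sketch stands alone:

```python
# Hypothetical code.py for the tutorial applet. In the real script this
# function would be decorated with @dxpy.entry_point('main') and the file
# would end with dxpy.run(), which dispatches the job to the entry point.
def main(**job_inputs):
    # Job inputs arrive as keyword arguments; this first app takes none.
    print("Hello, DNAnexus!")
    return {}  # an applet returns a dict mapping output names to values
```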
That's all you need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:
Next, run the app and watch the output:
That's it! You have made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.
The app is available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.
Step 2. Run BLAST
Next, make the app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.
In the cloud, your app runs on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json like this:
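A fragment of this shape (showing only the relevant runSpec field) asks the execution environment to apt-install the package before the script runs:

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "ncbi-blast+"}
    ]
  }
}
```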
Next, update code.py to run BLAST:
Rebuild the app and test it on some real data. You can use demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is in the same region as the Demo Data project.
Rebuild the app with dx build -a, and run it like this:
Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.
Step 3. Provide an Input/Output Spec
Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add the app to a workflow and connect its inputs and outputs to other apps, specify both input and output specifications. Update the dxapp.json as follows:
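A minimal input/output specification consistent with the seq1 and seq2 inputs used by this applet might look like this (the output name report is an assumption):

```json
{
  "inputSpec": [
    {"name": "seq1", "class": "file"},
    {"name": "seq2", "class": "file"}
  ],
  "outputSpec": [
    {"name": "report", "class": "file"}
  ]
}
```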
Rebuild the app with dx build -a. Run it as before, and add the applet to a workflow by clicking "New Workflow" while viewing your project on the website, then click coolapp once to add it to the workflow. Inputs and outputs appear on the workflow stage and can be connected to other stages.
If you run dx run coolapp with no input arguments from the command line, the command prompts for the input values for seq1 and seq2.
Step 4. Configure App Settings
Besides specifying input files, the I/O specification can also configure settings the app uses. For example, configure the E-value setting and other BLAST settings with this code and dxapp.json:
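A sketch of how the evalue and blast_args settings could be declared in the inputSpec (the classes and default values here are assumptions, not the tutorial's exact spec):

```json
{
  "inputSpec": [
    {"name": "seq1", "class": "file"},
    {"name": "seq2", "class": "file"},
    {"name": "evalue", "class": "float", "default": 0.01},
    {"name": "blast_args", "class": "string", "default": "", "optional": true}
  ],
  "outputSpec": [
    {"name": "report", "class": "file"}
  ]
}
```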
Rebuild the app again and add it in the workflow builder. You should see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.
Step 5. Use SDK Tools
One of the utilities provided in the SDK is dx-app-wizard. This tool prompts you with a series of questions, then uses your answers to create the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Run dx-app-wizard to try it out.
Learn More
For additional information and examples of how to run jobs using the CLI, the Working with Files Using dx run material may be useful. Note that this material is not part of the official DNAnexus documentation and is provided for reference only.
The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:
<var>_path (string): full absolute path to where the file was downloaded.
<var>_name (string): name of the file, including extension.
<var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, the dictionary has the following key-value pairs:
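For instance, for a hypothetical single-file input named mappings_bam, the returned dictionary has the following shape (the paths and names are made up; each value is a list with one element per file):

```python
# Illustrative shape of the dxpy.download_all_inputs() result for a
# hypothetical file input called "mappings_bam".
inputs = {
    "mappings_bam_path": ["/home/dnanexus/in/mappings_bam/sample.bam"],
    "mappings_bam_name": ["sample.bam"],
    "mappings_bam_prefix": ["sample"],
}

# Index with [0] to get the values for the first (here, only) file.
bam_path = inputs["mappings_bam_path"][0]
bam_prefix = inputs["mappings_bam_prefix"][0]
```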
Count Regions in Parallel
Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to 10 and the workload per thread is set to 1 chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use multiprocessing.dummy, a wrapper around Python's threading module.
Each worker creates a string to be called in a subprocess.Popen call. The multiprocessing.dummy.Pool.map(<func>, <iterable>) function is used to call the helper function run_cmd for each string in the iterable of view commands. Because multithreaded processing is performed using subprocess.Popen, the process does not alert to any failed processes. Closed workers are verified in the verify_pool_status helper function.
Important: In this example, you use subprocess.Popen to process and verify results in verify_pool_status. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
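The pattern described above can be sketched with standard-library pieces only; here the echo commands stand in for the per-region samtools view -c calls:

```python
import subprocess
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def run_cmd(cmd_arr):
    """Run one command, returning (stdout, stderr, exit_code)."""
    proc = subprocess.Popen(
        cmd_arr, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    return (stdout, stderr, proc.returncode)

# Stand-ins for the per-region `samtools view -c <bam> <region>` commands.
view_cmds = [["echo", "chr1"], ["echo", "chr2"], ["echo", "chrX"]]

pool = ThreadPool(3)  # one worker thread per command in this toy example
results = pool.map(run_cmd, view_cmds)
pool.close()
pool.join()

# Mirrors verify_pool_status: collect stderr of any non-zero exits.
failures = [stderr for _, stderr, code in results if code != 0]
```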
Gather Results
Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file is used to upload and generate a DXFile corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function is used to generate the appropriate DXLink value.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.
Download Inputs
This applet downloads all inputs at once using dxpy.download_all_inputs:
Split workload
Using the Python multiprocessing module, you can split the workload into multiple processes for parallel execution:
With this pattern, you can quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
Specific helpers are created in the applet script to manage the workload. One helper you may have seen before is run_cmd. This function manages the subprocess calls:
Before the workload can be split, you need to identify the regions present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs from the workers are parsed to determine whether the run failed or passed.
The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:
<var>_path (string): full absolute path to where the file was downloaded.
<var>_name (string): name of the file, including extension.
<var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, the dictionary has the following key-value pairs:
Count Regions in Parallel
Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to 10 and the workload per thread is set to 1 chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use multiprocessing.dummy, a wrapper around Python's threading module.
Each worker creates a string to be called in a subprocess.Popen call. The multiprocessing.dummy.Pool.map(<func>, <iterable>) function is used to call the helper function run_cmd for each string in the iterable of view commands. Because multithreaded processing is performed using subprocess.Popen, the process does not alert to any failed processes. Closed workers are verified in the verify_pool_status helper function.
Important: In this example, subprocess.Popen is used to process and verify results in verify_pool_status. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
Gather Results
Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file is used to upload and generate a DXFile corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function is used to generate the appropriate DXLink value.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.
Download Inputs
This applet downloads all inputs at once using dxpy.download_all_inputs:
Split workload
This tutorial processes data in parallel using the Python multiprocessing module with a straightforward pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
The applet script includes helper functions to manage the workload. One helper is run_cmd, which manages subprocess calls:
Before splitting the workload, determine what regions are present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs from the workers are parsed to determine whether the run failed or passed.
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Each entry point is executed on a new worker with its own system requirements. The instance type for each entry point can be set in the dxapp.json file's runSpec.systemRequirements field:
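For example, per-entry-point instance types can be declared like this (the instance type names here are illustrative choices, not the tutorial's exact settings):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_v2_x4"},
      "count_func": {"instanceType": "mem1_ssd1_v2_x2"},
      "sum_reads": {"instanceType": "mem1_ssd1_v2_x2"}
    }
  }
}
```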
main
The main function slices the initial *.bam file and generates a *.bai index if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.
count_func
This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.
sum_reads
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file as the entry point output.
print(mappings_sorted_bai)
print(mappings_sorted_bam)
mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
sorted_bam_name = mappings_sorted_bam.name
dxpy.download_dxfile(mappings_sorted_bam.get_id(),
sorted_bam_name)
ascii_bam_name = unicodedata.normalize( # Pysam requires ASCII not Unicode string.
'NFKD', sorted_bam_name).encode('ascii', 'ignore')
if mappings_sorted_bai is not None:
mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
dxpy.download_dxfile(mappings_sorted_bai.get_id(),
mappings_sorted_bai.name)
else:
pysam.index(ascii_bam_name)
mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
regions = get_chr(mappings_obj, canonical_chr)
def get_chr(bam_alignment, canonical=False):
"""Helper function to return canonical chromosomes from SAM/BAM header
Arguments:
bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
canonical (boolean): Return only canonical chromosomes
Returns:
regions (list[str]): Region strings
"""
regions = []
headers = bam_alignment.header
seq_dict = headers['SQ']
if canonical:
re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
for seq_elem in seq_dict:
if re_canonical_chr.match(seq_elem['SN']):
regions.append(seq_elem['SN'])
else:
regions = [''] * len(seq_dict)
for i, seq_elem in enumerate(seq_dict):
regions[i] = seq_elem['SN']
return regions
total_count = 0
count_filename = "{bam_prefix}_counts.txt".format(
bam_prefix=ascii_bam_name[:-4])
with open(count_filename, "w") as f:
for region in regions:
temp_count = mappings_obj.count(region=region)
f.write("{region_name}: {counts}\n".format(
region_name=region, counts=temp_count))
total_count += temp_count
f.write("Total reads: {sum_counts}".format(sum_counts=total_count))
regions=$(samtools view -H "${mappings_sorted_bam_name}" \
| grep "\@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')
echo "Segmenting into regions"
count_jobs=()
counter=0
temparray=()
for r in $(echo $regions); do
if [[ "${counter}" -ge 10 ]]; then
echo "${temparray[@]}"
count_jobs+=( \
$(dx-jobutil-new-job \
-ibam_file="${mappings_sorted_bam}" \
-ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
temparray=()
counter=0
fi
temparray+=("-iregions=${r}") # Here we add to an array of -i<parameter>'s
counter=$((counter+1))
done
if [[ "${counter}" -gt 0 ]]; then # The loop above misses the final batch when it has fewer than 10 regions
echo "${temparray[@]}"
count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
fi
echo "Merge count files, jobs:"
echo "${count_jobs[@]}"
readfiles=()
for count_job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${count_job}:counts_txt")
done
echo "file name: ${sorted_bamfile_name}"
echo "Set file, readfile variables:"
echo "${readfiles[@]}"
countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
set -e -x -o pipefail
echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
mkdir workspace
cd workspace
dx download "${mappings_sorted_bam}"
if [ -z "$mappings_sorted_bai" ]; then
samtools index "$mappings_sorted_bam_name"
else
dx download "${mappings_sorted_bai}"
fi
input_bam = inputs['mappings_bam_name'][0]
bam_to_use = create_index_file(input_bam)
print("Dir info:")
print(os.listdir(os.getcwd()))
regions = parse_sam_header_for_region(bam_to_use)
view_cmds = [
create_region_view_cmd(bam_to_use, region)
for region
in regions]
print('Parallel counts')
t_pools = ThreadPool(10)
results = t_pools.map(run_cmd, view_cmds)
t_pools.close()
t_pools.join()
verify_pool_status(results)
def verify_pool_status(proc_tuples):
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
inputs = dxpy.download_all_inputs()
# download_all_inputs returns a dictionary that maps inputs to file locations.
# Additionally, helper key-value pairs are added to the dictionary, similar to the bash helper variables
inputs
# mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
# mappings_sorted_bam_name: u'SRR504516.bam'
# mappings_sorted_bam_prefix: u'SRR504516'
# mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
# mappings_sorted_bai_name: u'SRR504516.bam.bai'
# mappings_sorted_bai_prefix: u'SRR504516'
print("Number of cpus: {0}".format(cpu_count())) # Get cpu count from multiprocessing
worker_pool = Pool(processes=cpu_count()) # Create a pool of workers, 1 for each core
results = worker_pool.map(run_cmd, collection) # map run_cmds to a collection
# Pool.map handles orchestrating the job
worker_pool.close()
worker_pool.join() # Make sure to close and join workers when done
def run_cmd(cmd_arr):
"""Run shell command.
Helper function to simplify the pool.map() call in our parallelization.
Raises OSError if command specified (index 0 in cmd_arr) isn't valid
"""
proc = subprocess.Popen(
cmd_arr,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
exit_code = proc.returncode
proc_tuple = (stdout, stderr, exit_code)
return proc_tuple
def parse_sam_header_for_region(bamfile_path):
"""Helper function to match SN regions contained in SAM header
Returns:
regions (list[string]): list of regions in bam header
"""
header_cmd = ['samtools', 'view', '-H', bamfile_path]
print('parsing SAM headers:', " ".join(header_cmd))
headers_str = subprocess.check_output(header_cmd).decode("utf-8")
rgx = re.compile(r'SN:(\S+)\s')
regions = rgx.findall(headers_str)
return regions
# Write results to file
resultfn = inputs['mappings_sorted_bam_name'][0]
resultfn = (
resultfn[:-4] + '_count.txt'
if resultfn.endswith(".bam")
else resultfn + '_count.txt')
with open(resultfn, 'w') as f:
sum_reads = 0
for res, reg in zip(results, regions):
read_count = int(res[0])
sum_reads += read_count
f.write("Region {0}: {1}\n".format(reg, read_count))
f.write("Total reads: {0}".format(sum_reads))
count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)
return output
def verify_pool_status(proc_tuples):
"""
Helper to verify worker succeeded.
As failed commands are detected, the `stderr` from that command is written
to the job_error.json file. This file is printed to the Platform
job log on App failure.
"""
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
input_bam = inputs['mappings_bam_name'][0]
bam_to_use = create_index_file(input_bam)
print("Dir info:")
print(os.listdir(os.getcwd()))
regions = parse_sam_header_for_region(bam_to_use)
view_cmds = [
create_region_view_cmd(bam_to_use, region)
for region
in regions]
print('Parallel counts')
t_pools = ThreadPool(10)
results = t_pools.map(run_cmd, view_cmds)
t_pools.close()
t_pools.join()
verify_pool_status(results)
def verify_pool_status(proc_tuples):
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
inputs = dxpy.download_all_inputs()
# download_all_inputs returns a dictionary that maps inputs to file locations.
# Additionally, helper key-value pairs are added to the dictionary, similar to the bash helper variables
inputs
# mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
# mappings_sorted_bam_name: u'SRR504516.bam'
# mappings_sorted_bam_prefix: u'SRR504516'
# mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
# mappings_sorted_bai_name: u'SRR504516.bam.bai'
# mappings_sorted_bai_prefix: u'SRR504516'
# Get cpu count from multiprocessing
print("Number of cpus: {0}".format(cpu_count()))
# Create a pool of workers, 1 for each core
worker_pool = Pool(processes=cpu_count())
# Map run_cmds to a collection
# Pool.map handles orchestrating the job
results = worker_pool.map(run_cmd, collection)
# Make sure to close and join workers when done
worker_pool.close()
worker_pool.join()
def run_cmd(cmd_arr):
"""Run shell command.
Helper function to simplify the pool.map() call in our parallelization.
Raises OSError if command specified (index 0 in cmd_arr) isn't valid
"""
proc = subprocess.Popen(
cmd_arr,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
exit_code = proc.returncode
proc_tuple = (stdout, stderr, exit_code)
return proc_tuple
def parse_sam_header_for_region(bamfile_path):
"""Helper function to match SN regions contained in SAM header
Returns:
regions (list[string]): list of regions in bam header
"""
header_cmd = ['samtools', 'view', '-H', bamfile_path]
print('parsing SAM headers:', " ".join(header_cmd))
headers_str = subprocess.check_output(header_cmd).decode("utf-8")
rgx = re.compile(r'SN:(\S+)\s')
regions = rgx.findall(headers_str)
return regions
# Write results to file
resultfn = inputs['mappings_sorted_bam_name'][0]
resultfn = (
resultfn[:-4] + '_count.txt'
if resultfn.endswith(".bam")
else resultfn + '_count.txt')
with open(resultfn, 'w') as f:
sum_reads = 0
for res, reg in zip(results, regions):
read_count = int(res[0])
sum_reads += read_count
f.write("Region {0}: {1}\n".format(reg, read_count))
f.write("Total reads: {0}".format(sum_reads))
count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)
return output
def verify_pool_status(proc_tuples):
"""
Helper to verify worker succeeded.
As failed commands are detected, the `stderr` from that command is written
to the job_error.json file. This file is printed to the Platform
job log on App failure.
"""
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
if [ -z "${mappings_sorted_bai}" ]; then
samtools index "${mappings_sorted_bam_name}"
else
dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}".bai
fi
count_jobs=()
for chr in $chromosomes; do
seg_name="${mappings_sorted_bam_prefix}_${chr}".bam
samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
bam_seg_file=$(dx upload "${seg_name}" --brief)
count_jobs+=($(dx-jobutil-new-job -isegmentedbam_file="${bam_seg_file}" -ichr="${chr}" count_func))
done
for job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${job}:counts_txt")
done
sum_reads_job=$(dx-jobutil-new-job "${readfiles[@]}" -ifilename="${mappings_sorted_bam_prefix}" sum_reads)
sum_reads ()
{
set -e -x -o pipefail;
printf "Value of read file array %s" "${readfiles[@]}";
echo "Filename: ${filename}";
echo "Summing values in files and creating output read file";
for read_f in "${readfiles[@]}";
do
echo "${read_f}";
dx download "${read_f}" -o - >> chromosome_result.txt;
done;
count_file="${filename}_chromosome_count.txt";
total=$(awk '{s+=$2} END {print s}' chromosome_result.txt);
echo "Total reads: ${total}" >> "${count_file}";
readfile_name=$(dx upload "${count_file}" --brief);
dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
}
The Tools Library provides a list of available apps and workflows. To see this list, select Tools Library from the Tools entry in the main Platform menu.
On the DNAnexus Platform, apps and workflows are generically referred to as "tools."
To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:
Find all tools with 'assay' in their name.
To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row is highlighted in blue. The tool's inputs and outputs are displayed in a pane to the right of the list:
Check a tool's list of inputs and outputs.
To make sure you can find a tool later, you can pin it to the top of the list. Click the More actions (⋮) icon at the far right end of the row showing the tool's name and key details. Then click Add Pin.
Pin your favorite apps to the top of the list.
To learn more about a tool, click on its name in the list. The tool's detail page opens, showing a wide range of info, including guidance in how to use it, version history, pricing, and more:
View app details with usage instructions.
Running Apps and Workflows
Launching a Tool
Launching from the Tools Library
You can quickly launch the latest version of any given tool from the Tools Library page. Or you can navigate to the app's details page and click Run.
By default, you run the latest app version.
Launching from a Project
From within a project, navigate to the Manage pane, then click the Start Analysis button.
A dialog window opens, showing a list of tools. These include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:
Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in their folder location, and click Run.
Launch Configuration
Confirm the details of the tool you are about to run. You must select a project location to run any tool, and you need at least Contributor access to that project.
Provide name and output location before the launch
Specialized tools, such as JupyterLab and Spark Apps, require special licenses to run.
Configure Inputs and Outputs
The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.
Fill in the required inputs before starting the run
You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.
Help information for each field and the tool overall
To configure instance type settings for a given tool or stage, click the Instance Type icon located on the top-right corner of the stage.
Show / Hide instance type settings
To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.
Configure output locations for each stage of the workflow
The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.
The workflow's I/O graph visualization
Once all required inputs have been configured, the page indicates that the run is ready to start. Click on Start Analysis to proceed to the final step.
The tool has been fully configured and is ready to run
Configure Runtime Settings
As the last step before launching the tool, you can review and confirm specific runtime settings, including execution name, output location, priority, job rank, spending limit, and resource allocation. You can also review and modify instance type settings before starting the run.
Once you have confirmed final details, click Launch Analysis to start the run.
Review and confirm runtime settings before starting the run
Configure advanced runtime settings
A license is required to use the Job Ranking feature. Contact DNAnexus Support for more information.
Batch Run
Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.
Specify Batch Inputs
To enable batch run, start from any input that you wish to specify for batch run, and open its I/O Options menu on the right hand side. From the list of options available, select Enable Batch Run.
Input fields with batch run enabled are highlighted with a Batch label. Click any of the batch enabled input fields to enter the batch run configuration page.
Not all input classes are supported for batch run configuration. See table below.
Input Class: Batch Run Support
Files and other data objects: Yes
Files and other data objects (array): Partially supported. Can accept entry of a single-value array
String: Yes
Integer
Configure Batch Inputs
The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.
As with configuring inputs for non-batch runs, you must fill in all required input fields before proceeding to the next step. Optional inputs, or required inputs with a predefined default value, can be left empty.
Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.
All 10 batch runs have been fully configured and are ready to launch
Using List Views to Visualize Hierarchically Organized Data
List views, unlike row charts, can be used to visualize categorical data with values that are organized in a hierarchical fashion.
Using List Views to Visualize Data from Two Different Fields
List views can be used to visualize categorical data from two different fields. The same restrictions apply to the fields whose values are displayed, as when creating a basic list view.
Using List Views in the Cohort Browser
Visualizing Data from a Single Field
In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort that contain this value (the "count"). Also shown is a figure labeled "freq.", which is the percentage of all cohort records that contain the value.
Below is a sample list view showing the distribution of values in a field Episode type. In the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.
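The "count" and "freq." figures are related by simple arithmetic, using the numbers from this example:

```python
# freq. is the count expressed as a percentage of the current cohort size.
cohort_size = 80   # participants in the current cohort selection
count = 13         # records containing "Delivery episode"
freq = 100.0 * count / cohort_size
print(freq)  # 16.25
```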
List View in the Cohort Browser
When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See Chart Totals and Missing Data for more information on how missing data affects chart calculations.
Visualizing Data from Two Fields
To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.
Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:
Primary Field Values in a List View Visualizing Data from Two Fields
Critical care record origin is the primary field, Critical care record format is the secondary field.
Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:
Seeing Combinations of Field Values
Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values in the two fields.
Visualizing Complex Categorical Data
Below is an example of a list view used to visualize data in a categorical hierarchical field Home State/Province:
List View of Hierarchical Categorical Data
By default, only values in the category at the top level of the hierarchy are displayed.
Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:
Seeing Combinations of Values in a Field Containing Hierarchical Categorical Data
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values across the categories. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category, and "British Columbia" for the second-level category.
The following example shows how "count" and "freq." are calculated for list views based on fields containing categorical data organized into multiple levels of hierarchy:
Multiple Levels of Hierarchy
For the bottommost row, "count" and "freq." refer to records having the following values:
"Yes" for the category at the top of the hierarchy
"9" for the category at the second level of the hierarchy
"8" for the category at the third level of the hierarchy
"7" for the category at the fourth level of the hierarchy
"3" for the category at the bottom level of the hierarchy
Locating Values in a List View
When a field has categories at multiple levels, it can be difficult to find a particular value. Use the search box at the bottom of the list view to home in on the row or rows containing that value:
Using the Search Function in a List View
List Views in Cohort Compare
In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:
List view: Treatment/Medication Code in compare mode
In each column, "count" and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.
You can customize your Gene Expression dashboard to focus on the most relevant analyses for your research:
Create new Expression Distribution or Feature Correlation charts.
Remove charts you no longer need.
Resize and reposition charts to optimize your workspace.
Save your dashboard customizations along with your cohort.
Visualizing gene expression data in Cohort Browser
The Gene Expression dashboard supports up to 15 charts, allowing you to create comprehensive expression analysis workspaces.
For datasets with multiple gene expression assays, you can choose the specific assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. Switching between assays preserves your charts and their display settings.
Filtering by Gene Expression
You can define your cohort by gene expression to include only patients with specific expression characteristics.
To apply a gene expression filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Gene Expression, select a genomic filter.
In Edit Filter: Gene Expression, specify the criteria:
For datasets with multiple gene expression assays, select the specific assay to filter by.
In Expression Level, specify inclusive minimum and maximum values. For an individual to be included, all their expression values across all samples for the feature must fall within the range.
In Gene / Transcript, enter a gene symbol, such as BRCA1, or feature ID, such as ENSG00000012048 or ENST00000309586. Search is case insensitive.
Click Apply Filter.
You can specify up to 10 gene expression filters for each cohort. All filters use an AND relationship.
Adding a gene expression filter
After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.
Visualizing Expression Distribution
The Expression Level charts help you visualize gene expression patterns for individual transcript or gene features. You can examine how expression values are distributed across your cohort, identify outliers, and compare patterns between different patient groups.
The chart displays data for one gene or transcript at a time. You can directly enter a transcript or gene feature ID, such as an ID starting with ENST or ENSG, or search by gene symbol to see available options.
Visualizing TP53 gene expression
You can view the data as either a histogram showing frequency distribution or a box plot displaying quartiles and outliers. To customize the chart display, including applying logarithmic transformations for wide-range expression data, click ⛭ Chart Settings.
When comparing cohorts, the chart shows data from each cohort on the same axes for direct comparison.
You can also modify your charts by selecting different transcript or gene features, resizing and rearranging them on your dashboard, or adjusting display settings to focus on the most relevant analyses for your research.
Exploring Feature Correlations
The Feature Correlation charts help you understand how the expression levels of two genes or transcripts relate to each other. You can use these charts to identify genes or transcripts that are co-expressed, explore potential pathway interactions, and compare correlation patterns between different cohorts.
The chart displays a scatter plot where each point represents a sample, with the X and Y axes showing expression values for your two selected features. A best fit line shows the overall relationship trend, and you can swap which gene appears on which axis to view the data from different perspectives.
Exploring feature correlations between ERBB2 and TP53
The correlation analysis includes statistical measures to help you determine if the relationship you're seeing is meaningful. The Pearson correlation coefficient shows both the strength and direction of the linear relationship (ranging from -1 to +1), while the p-value indicates whether the correlation is statistically significant.
You can toggle these statistics on or off as needed. The chart updates when you change your feature selections or switch between viewing single cohorts versus comparing multiple cohorts. This quantitative analysis helps you assess whether observed correlations are both statistically sound and biologically relevant to your research.
Examining Detailed Gene Expression Information
The Expression Per Feature table provides gene metadata and expression statistics for all features in your dataset. Use the search bar to find specific genes by symbol or explore genes within genomic ranges.
Examining TP53 expression per feature
The table displays one row per feature ID with the following columns:
Feature ID: The unique transcript or gene identifier, such as ENST for a transcript or ENSG for a gene
Gene Symbol: The official gene name or symbol associated with the feature ID, such as TP53
Location: The genomic coordinates in "chromosome:start-end" format
Strand: The DNA strand orientation (+ or -)
Expression (Mean): The average expression value for this feature across the current cohort
Expression (SD): The standard deviation of expression values
Expression (Median): The median expression value
When comparing cohorts, the table shows separate expression statistics for each cohort, allowing direct comparison of expression patterns.
Each feature includes links to external annotation resources:
Ensembl transcript pages: Detailed transcript information and annotations
Ensembl gene pages: Comprehensive gene summaries and functional data
These links provide quick access to additional context about genes and transcripts of interest.
To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, the "Preview" and "Open in New Tab" options appear in the toolbar above.
Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.
"Preview" opens a fixed-sized box in your current tab to preview the file of interest. "Open in New Tab" enables viewing the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may produce different results.
The file type is not necessarily determined by the file extension. For example, you can preview a FASTA file reads.fa, even though the file extension is not .txt. However, you cannot preview a BAM file (a binary file) using the Preview option.
Preview Restrictions
File preview and viewer functionality are subject to project access controls. When a project has the previewViewerRestricted flag enabled, preview and viewer capabilities are disabled for all project members. This flag is automatically set to true when downloadRestricted is enabled on a project (for both new projects and when updating existing projects), though project admins can override this behavior by explicitly providing the previewViewerRestricted flag.
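As an illustration, this flag can be set through the API with the dx CLI; the project ID below is a placeholder, and changing project policies requires ADMINISTER access:

```
# Disable preview and viewer capabilities for all project members:
dx api project-xxxx update '{"previewViewerRestricted": true}'

# Explicitly re-enable them, overriding the downloadRestricted default:
dx api project-xxxx update '{"previewViewerRestricted": false}'
```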
Using File Viewers
For files not listed in the section above, the DNAnexus Platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.
A Viewer is an HTML file that you can give one or more DNAnexus URLs representing files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.
The data you select to be viewed is accessible by the Viewer, which can also access the Internet. You should only run Viewers from trusted sources.
Launching a Viewer
You can launch a viewer by clicking on the Visualize tab within a project.
This tab opens a window displaying all Viewers available to you within your project. Any Viewers you've created and saved within your current project appear in this list along with the DNAnexus-provided Viewers.
Clicking on a Viewer opens a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any other of your data.) From there, you can either create a Viewer Shortcut or launch the Viewer.
Example Viewers
Human Genome Browsers (BioDalliance, IGV.js)
The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either one of these viewers, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi for each variant track you want to add. Also, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.
BAM Header Viewer
The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used.) When launching this viewer, tick one or more BAM files (*.bam).
Jupyter Notebook Viewer
The Jupyter notebook viewer displays *.ipynb notebook files, showing notebook images, highlighted code blocks and rendered markdown blocks as shown below.
Gzipped File Viewers
This viewer allows you to decompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).
Troubleshooting Viewers
If a viewer fails to load, try temporarily disabling browser extensions such as AdBlock and Privacy Badger. Also, viewers are not supported in Incognito browser windows.
Custom Viewers
Developers comfortable with HTML and JavaScript can create custom viewers to visualize data on the platform.
Viewer Shortcuts
Viewer Shortcuts are objects that, when opened, display a data selector for choosing inputs with which to launch a specified Viewer. The Viewer Shortcut includes a Viewer and an array of inputs that are selected by default.
The Viewer Shortcut appears in your project as an object of type "Viewer Shortcut." You can modify the name of the Viewer Shortcut and move it within your folders and projects like any other object in the DNAnexus Platform.
Smart Reuse (Job Reuse)
Speed workflow development and reduce testing costs by reusing computational outputs.
A license is required to access the Smart Reuse feature. Contact DNAnexus Sales for more information.
DNAnexus allows organizations to optionally reuse outputs of jobs that share the same executable and input IDs, even if these outputs are across projects or entire organizations. This feature has two primary use cases.
Example Use Cases
Dramatically Speed Up R&D of Workflows
For example, suppose you are developing a workflow, and at each stage you end up debugging an issue. Each stage takes about one hour to develop and run. If you do not reuse outputs during development, the process takes 1 + 2 + 3 + ... + n hours because at every stage you fix something and must recompute results from previous stages. By reusing results for stages that have matured and are no longer modified, the total development time equals the time it takes to develop and run the pipeline (in this case n hours). This is an order-of-magnitude reduction in development time, and the improvement becomes more pronounced for longer workflows.
This feature also saves time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the earlier stages, the time required for R&D of the forked version equals the time for the last stages.
Dramatically Reduce Costs When Testing at Scale
In production environments, you may need to test R&D modifications to a workflow at scale. This is especially relevant for workflows used in clinical tests. For example, suppose you are testing a workflow like the forked workflow discussed earlier. This clinical workflow must be tested on thousands of samples (let that number be represented by m) before it is vetted for production. Suppose the whole workflow takes n hours but only the last k stages changed. You save (n-k)m total compute hours. This can add up to dramatic cost savings as m grows, especially if k is small.
Example Reuse with WDL
To illustrate Smart Reuse, the following example uses WDL syntax as supported by the DNAnexus SDK and dxCompiler.
The workflow above is a two-step workflow that duplicates a file and takes the first 10 lines from the duplicate.
Suppose the user has run the workflow above on some file and wants to tweak headfile to output the first 15 lines instead:
Here the only differences are that headfile and basic_reuse have been renamed (to headfile2 and basic_reuse_tweaked), and the line count changed from 10 to 15. The compilation process automatically detects that dupfile is the same but the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a different executable ID for headfile2.
When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results from the dupfile task are reused. Because there is already a job on the DNAnexus Platform that has run that specific executable with the same input file, the system can reuse its output.
When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the preserveJobOutputs option. This preserves the outputs of all jobs in the execution tree in the project and increases the potential for subsequent Smart Reuse.
Requirements for Smart Reuse
Jobs can reuse results from previous jobs if the following criteria are met:
The organization that is billed for the job has the jobReuse policy enabled.
Smart Reuse applies only to jobs completed after the org policy was enabled.
Smart Reuse is enabled at the executable level.
When a job reuses results, it includes an outputReusedFrom field pointing to the previous job ID. Reused jobs are reported as having run for 0 seconds and are billed at $0. If the reused job or workflow is in a different project or folder, output data is not cloned to the new project or destination folder (the job or workflow is not actually rerun).
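As a quick check, you can look for this field in a job's description; a sketch using the dx CLI and jq, with a placeholder job ID:

```
# Prints the ID of the job whose outputs were reused, or null otherwise.
dx describe job-xxxx --json | jq .outputReusedFrom
```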
Controlling Smart Reuse
Smart Reuse can be controlled at three levels. Runtime settings override executable defaults. Executable defaults override organization policy. If Smart Reuse is disabled at any level, reuse does not occur.
Organization policy: Set the jobReuse policy to true (default is false).
Executable default: Set ignoreReuse to true or false; the default is false.
How to Enable Smart Reuse
To enable or disable Smart Reuse for your organization:
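As a sketch, the org policy can be set through the API with the dx CLI; the org ID is a placeholder, and updating org policies requires org admin rights:

```
# Enable Smart Reuse for all work billed to the org:
dx api org-xxxx update '{"policies": {"jobReuse": true}}'

# Disable it again:
dx api org-xxxx update '{"policies": {"jobReuse": false}}'
```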
If you plan to reuse results across projects, you must modify all applet and app configurations to include "allProjects": "VIEW" in the .
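In dxapp.json, that setting belongs in the "access" field; a minimal fragment:

```json
{
  "access": {
    "allProjects": "VIEW"
  }
}
```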
If you are a licensed customer and cannot run the command above, contact DNAnexus Support. If you are interested in Smart Reuse and are not a licensed customer, reach out to DNAnexus Sales or your account executive for more information.
Filtering Objects and Jobs
You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.
The filter bar lets you specify different criteria on which to filter your data. You can combine multiple different filters for greater control over your results.
To use this feature, first choose the field you want to filter your data by, then enter your filter criteria. For example, select the "Name" filter then search for "NA12878". The filter activates when you press enter or click outside of the filter bar.
Filtering Projects
The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.
Billed to: The user or org ID that the project is billed to, for example, "user-xxxx" or "org-xxxx". When viewing a partner organization's projects, the "Billed to" field is fixed to the org ID.
Project Name: Search by case insensitive string or regex, for example, "Example" or "exam$" both match "Example Project"
ID: Search by project ID, for example, "project-xxxx"
Filtering Objects
The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.
Object name: Search by case insensitive string or regex, for example, NA1 or bam$ both match NA12878.bam
When filtering on anything other than the current folder, results appear from many different places in the project. The folders appear in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.
Filtering Jobs and Analyses
The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead
State: for example, Failed, Waiting, Done, Running, In Progress, Terminated
Name: Search by case-insensitive string or regex, for example, "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.
Multi-Word Queries in Filters and Searches
When filtering on a name, a query containing spaces also matches names with additional words between the query words. For example, filtering by "b37 build" also returns "b37 dbSNP build".
Filtering by Date
Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time in minutes, hours, days, weeks, months, or an absolute time by specifying a certain date.
For relative time, specify an amount of time before the access time. For example, selecting "Day" and typing 5 sets the datetime to 5 days before the current time.
Alternatively, you can use the calendar to represent an exact (absolute) datetime.
Setting only the beginning datetime ("From") creates a range from that time to the access time. Setting only the end datetime ("To") creates a range from the earliest records to the "To" time.
A filter with a relative time period updates each time it is accessed. For example, a filter for items created within two hours shows different results at different times: items from 9am at 11am, and items from 2pm at 4pm. For consistent results, use absolute datetimes from the calendar widget.
Filtering by Tags and Properties
Tags
To search by tag, enter or select the tags you want to find. For example, to find all objects tagged with "human", type "human" in the filter box and select the checkbox next to the tag.
Unlike other searches where you can enter partial text, tag searches require the complete tag name. However, capitalization doesn't matter. For example, searching for "HUMAN", "human", or "Human" all find objects with the "Human" tag. Partial matches like "Hum" do not return results.
Properties
Properties have two parts: a key and a value. The system prompts for both when creating a new property. Like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.
To search for all items that have a property, regardless of the value of that property, select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that have a property with a specific value, enter that property's key and value.
The keys and values must be entered in their entirety. For example, entering the key sample and the value NA does not match objects with {"sample_id": "NA12878"}.
Any vs. All Queries
Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you have a choice whether to search for objects containing any of the selected tags or containing all the selected tags.
Given the following set of objects:
Object 1 (tags: "human", "normal")
Object 2 (tags: "human", "tumor")
Object 3 (tags: "mouse", "tumor")
Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.
Clearing All Filters
Click the "Clear All Filters" button on the filter bar to reset your filters.
Saving Filters
Active filters are saved in the URL of the filtered page, so you can bookmark this URL in your browser to return to your filtered view in the future.
Bookmarking a filtered URL saves the search parameters, not the search results. The filters are applied to the data present when accessing the bookmarked link. For example, filters for items created in the last thirty days show items from the thirty days before viewing the results, not the thirty days before creating the bookmark. Results update based on when you access the saved search.
User Interface Quickstart
Learn to create a project, add members and data to the project, and run a simple workflow.
You must set up billing for your account before you can perform an analysis, or upload or egress data.
Step 1. Create Your First Project
Path Resolution
When using the command-line client, you may refer to objects either through their ID or by name.
In the DNAnexus Platform, every data object has a unique ID starting with the class of the object followed by a hyphen ('-') and 24 alphanumeric characters. Common object classes include "record", "file", and "project". An example ID would be record-9zGPKyvvbJ3Q3P8J7bx00005. A string matching this format is always interpreted as the ID of such an object and is not further resolved as a name.
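This format can be checked mechanically; a small shell sketch:

```shell
# A DNAnexus ID is a class name, a hyphen, and 24 alphanumeric characters.
echo "record-9zGPKyvvbJ3Q3P8J7bx00005" \
  | grep -Eq '^[a-z]+-[A-Za-z0-9]{24}$' && echo "looks like an ID"
```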
The command-line client, however, also accepts names and paths as input in a particular syntax.
Running Batch Jobs
To launch a DNAnexus application or workflow on many files automatically, you can write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the dx command-line client provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see these instructions.
Overview
In this tutorial, you batch process a series of sample FASTQ files (forward and reverse reads). Use the dx generate_batch_inputs command to generate a batch file -- a tab-delimited (TSV) file where each row corresponds to a single run in the batch. Then you process the batch using the dx run --batch-tsv command.
Job Lifecycle
Learn about the states through which a job or analysis may go during its lifecycle.
Example Execution Tree
The following example shows a workflow that has two stages, one of which is an applet, and the other of which is an app.
When the workflow runs, it generates an analysis with an attached workspace for storing intermediate output from its stages. Jobs are created to run the two stages. These jobs can spawn additional jobs to run other functions in the same executable or to run separate executables. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).
Relational Database Clusters
The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.
The Relational Database Service is accessible through the application programming interface (API) in AWS regions only.
Archiving Files
Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.
A license is required to use the DNAnexus Archive Service. Contact DNAnexus Sales for more information.
Archiving in DNAnexus is file-based. You can archive individual files, folders with files, or entire projects' files and save on storage costs. You can also unarchive one or more files, folders, or projects when you need to make the data available for further analyses.
The DNAnexus Archive Service is available via the API in Amazon AWS and Microsoft Azure regions.
Created date: Search by projects created before, after, or between different dates
Modified date: Search by projects modified before, after, or between different dates
Creator: The user ID who created the project, for example, "user-xxxx"
Shared with member: A user ID with whom the project is shared, for example, "user-xxxx" or "org-xxxx"
Level: The minimum permission level to the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only". For example, "Contributor+" filters projects with access CONTRIBUTOR or ADMINISTER
Tags: Search by tag. The filter bar automatically populates with tags available on projects
Properties: Search by properties. The filter bar automatically provides properties available on projects
ID: Search by object ID, for example, file-xxxx or applet-xxxx
Modified date: Search by objects modified before, after, or between different dates
Class: such as "File", "Applet", "Folder"
Types: such as "File" or custom Type
Created date: Search by objects created before, after, or between different dates
Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder
Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder
ID: Search by job or analysis ID, for example, "job-1234" or "analysis-5678"
Created date: Search by executions created before, after, or between different dates
Launched by: Search by the user ID of the user who launched the job. The filter bar shows users who have run jobs visible in the project
Tags: Search by tag. The filter bar automatically populates with tags available on the visible executions
Properties: Search by properties. The filter bar automatically provides properties available on executions visible in the project
Executable: Search by the ID of executable run by the executions in question. Examples include app-1234 or applet-5678
Class: for example, Analysis or Job
Origin Jobs: ID of origin job
Parent Jobs: ID of parent job
Parent Analysis: ID of parent analysis
Root Executions: ID of root execution
Yes
Float: Yes
Boolean: Yes
String (array): No
Integer (array): No
Float (array): No
Boolean (array): No
Hash: No
Smart Reuse is enabled at runtime (not using the --ignore-reuse flag).
A previous job used the exact same executable and input IDs (including the function called within the applet).
If an input is watermarked, both the watermark and its version match. Other settings, such as instance type, do not affect reuse.
The job being reused has all outputs available and accessible at the time of reuse.
You have at least VIEW access to the previous job's outputs.
The previous job's outputs still exist on the Platform.
For cross-project reuse, the application's dxapp.json file includes "allProjects": "VIEW" in the "access" field.
Outputs are assumed to be deterministic.
When ignoreReuse is false (the default), reuse is allowed. When ignoreReuse: true, Smart Reuse is disabled for the executable.
Runtime override: Control reuse at runtime using any of these methods:
Use the --ignore-reuse flag with dx run to disable reuse.
Use --extra-args '{"ignoreReuse": false}' or --extra-args '{"ignoreReuse": true}' to explicitly enable or disable reuse.
Set the ignoreReuse parameter in API calls to /app-xxxx/run, /applet-xxxx/run, or /workflow-xxxx/run.
For workflows, use --ignore-reuse-stage STAGE_ID to control specific stages.
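As a sketch, the runtime overrides look like this on the command line; all IDs and the input name are placeholders:

```
# Disable Smart Reuse for one execution:
dx run applet-xxxx --ignore-reuse -iinput_file=file-xxxx

# Explicitly enable it for one execution:
dx run applet-xxxx --extra-args '{"ignoreReuse": false}' -iinput_file=file-xxxx

# Disable reuse for a single stage of a workflow:
dx run workflow-xxxx --ignore-reuse-stage STAGE_ID
```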
The project My Research Project contains the following files in the project's root directory:
Batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, specify the following inputs:
reads_fastqgzs - FASTQ containing the left mates
reads2_fastqgzs - FASTQ containing the right mates
genomeindex_targz - BWA reference genome index
The BWA reference genome index from the public Reference Genome (requires platform login) project is used for all runs. However, for the forward and reverse reads, the read pairs used vary from run to run. To generate a batch file that pairs the input reads:
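A sketch of such a command; the reads2_fastqgzs pattern is an assumption, chosen by analogy with the reads_fastqgzs pattern discussed below:

```
dx generate_batch_inputs \
  -ireads_fastqgzs='RP(.*)_R1_(.*).fastq.gz' \
  -ireads2_fastqgzs='RP(.*)_R2_(.*).fastq.gz'
```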
You can optionally provide a --path argument and provide a specific file and folder to search for recursively within your project. Specifically, the value for --path must be a directory specified as:
/path/to/directory or project-xxxx:/path/to/directory
Any file present within this directory or recursively within any subdirectory of this directory is considered a candidate for a batch run.
The (.*) are regular expression groups, and you can provide arbitrary regular expressions as input. The match from the first group is the pattern used to group files into batches. These matches are called batch identifiers (batch IDs). To explain this behavior in more detail, consider the output of the dx generate_batch_inputs command:
The dx generate_batch_inputs command creates the file dx_batch.0000.tsv, which looks like this:
Recall the regular expression was RP(.*)_R1_(.*).fastq.gz. Although there are two grouped matches in this example, only the first one is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz is 10B_S1 which corresponds to the first grouped match while the second one is ignored.
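This batch ID grouping can be sketched in Python. The snippet below is an illustrative model, not the actual dx generate_batch_inputs implementation; the regular expression is the one from the example above.

```python
import re

# Group filenames by batch ID: the match from the FIRST regex group
# becomes the batch ID, mirroring the behavior described above.
def batch_ids(filenames, pattern):
    groups = {}
    for name in filenames:
        m = re.match(pattern, name)
        if m:
            groups.setdefault(m.group(1), []).append(name)
    return groups

files = [
    "RP10B_S1_R1_001.fastq.gz",
    "RP10T_S5_R1_001.fastq.gz",
    "RP10B_S1_R2_001.fastq.gz",  # R2 files do not match the R1 pattern
]
# Maps "10B_S1" and "10T_S5" to their R1 files; the second group is ignored.
left_mates = batch_ids(files, r"RP(.*)_R1_(.*).fastq.gz")
```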
Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus Platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.
If an input for the app is an array, the input file IDs within the batch.tsv file must be enclosed in square brackets. The following bash command adds brackets to the file IDs in columns 4 and 5. You may need to change the variables in the command ($4 and $5) to match the correct columns in your file. The command's output file, "new.tsv", is ready for the dx run --batch-tsv command.
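Since the exact command is shell-specific, here is an equivalent sketch in Python. The column indices are an assumption matching the example above; adjust them to your batch file, and apply the function to every data row while leaving the header unchanged.

```python
# Wrap the file IDs in columns 4 and 5 (1-based) of a batch TSV row in
# square brackets, so that array inputs parse correctly.
def bracket_fields(line, cols=(3, 4)):  # 0-based indices for columns 4 and 5
    fields = line.rstrip("\n").split("\t")
    for c in cols:
        if c < len(fields):
            fields[c] = "[" + fields[c] + "]"
    return "\t".join(fields)
```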
The example above is for a case where all files have been paired properly. dx generate_batch_inputs creates a TSV for all files that can be successfully matched for a particular batch ID. Two classes of errors may occur for batch IDs that are not successfully matched:
A particular input is missing. This could occur when reads_fastqgzs has a pattern but no corresponding match can be found for reads2_fastqgzs.
More than one file ID matches the exact same name.
For both of these cases, dx generate_batch_inputs returns a description of these errors to STDERR.
When matching more than 500 files, multiple batch files are generated in groups of 500 to limit the number of jobs in a single batch run.
Run a Batch Job
With the batch file prepared, you can execute the BWA-MEM batch process:
Here, genomeindex_targz is a parameter set at execution time that is common to all groups in the batch and --batch-tsv corresponds to the input file generated above.
To monitor a batch job, use the 'Monitor' tab like you normally would for jobs you launch.
Setting Output Folders for Batch Jobs
To direct the output of each run into a separate folder, the --batch-folders flag can be used, for example:
This command outputs the results for each sample in folders named after batch IDs, such as /10B_S1/, /10T_S5/, /15B_S4/, and /15T_S8/. If the folders do not exist, they are created.
The output folders are created under a path defined with --destination, which by default is set to the current project and the "/" folder. For example, this command outputs the result files in /run_01/10B_S1/, /run_01/10T_S5/, and other sample-specific folders:
Batching Multiple Inputs
The dx generate_batch_inputs command works well for batch processing with file inputs, but it has limitations. If you need to vary other input types (like strings, numbers, or file arrays), or want to customize run properties like job names, a for loop provides more flexibility.
Here's an example of using a loop to launch multiple jobs with different inputs:
You can also use the dx run command to refer to a workflow's stages by stage_id. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:
The \ character is needed to escape the : in the workflow name.
Inputs to the workflow can be specified using dx run <workflow> --input name=stage_id:value, where stage_id is a numeric ID starting at 0. More help can be found by running the commands dx run --help and dx run <workflow> --help.
To batch multiple inputs, do the following:
Additional Resources
For additional information and examples of how to run batch jobs, Batch Processing on the Cloud may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.
# Enable Smart Reuse
dx api org-myorg update '{"policies":{"jobReuse":true}}'
# Disable Smart Reuse
dx api org-myorg update '{"policies":{"jobReuse":false}}'
$ dx select "My Research Project"
Selected project My Research Project
$ dx ls /
RP10B_S1_R1_001.fastq.gz
RP10B_S1_R2_001.fastq.gz
RP10T_S5_R1_001.fastq.gz
RP10T_S5_R2_001.fastq.gz
RP15B_S4_R1_002.fastq.gz
RP15B_S4_R2_002.fastq.gz
RP15T_S8_R1_002.fastq.gz
RP15T_S8_R2_002.fastq.gz
$ dx generate_batch_inputs \
-i reads_fastqgzs='RP(.*)_R1_(.*).fastq.gz' \
-i reads2_fastqgzs='RP(.*)_R2_(.*).fastq.gz'
Found 4 valid batch IDs matching desired pattern.
Created batch file dx_batch.0000.tsv
CREATED 1 batch files each with at most 500 batch IDs.
for i in 1 2; do
dx run swiss-army-knife -icmd="wc * > ${i}.out" -iin="file_input_batch${i}a" -iin="file_input_batch${i}b" --name "sak_batch${i}"
done
dx login
dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am"
dx cd /path/to/inputs
for i in $(dx ls); do
dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am" --input 0.reads="$i"
done
On the DNAnexus Platform, all data is stored within projects. Before you upload, browse, or analyze any data, you must create a project to house that data.
To create a project:
In the DNAnexus Platform, select Projects > All Projects.
In the Projects page, click New Project.
In the New Project dialog:
In Project Name, enter your project's name.
(Optional) In More Info, you can enter tags or custom-defined properties. These make it easier to find the project later, and organize it among other projects.
(Optional) In More Info, you can enter a Project Summary and Project Description to help other users understand the project's purpose.
In Billing > Billed To, choose a billing account to which project charges are billed.
In Billing > Region, choose a region to use for storing project files and running analyses. Feel free to use the default region.
(Optional) In Usage Limits, available in Billed To orgs with compute and egress usage limits configured, you can set project-level limits for each.
In Access, you can specify access levels for specific users or orgs, defining who can copy, delete, and download data. Feel free to accept the defaults.
Click Create Project.
After the project is created, you can add data in the Manage page.
Once you add data to your project, this is where you can see and get info on this data, and launch analyses that use it.
Step 2. Add Project Members
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
If you don't want the user to receive an email notification on being added to the project, toggle Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5 for each user you want to add to the project.
Click Done when you're finished adding members.
Step 3. Add Data to Your Project
To add data to your project, click the Add button in the top right corner of the project's Manage screen. You see three options for adding data:
Upload Data - Use your web browser to upload data from your computer. For long upload times, you must stay logged into the Platform and keep your browser window open until the upload completes.
Add Data from Server - Specify a URL of an accessible server from which the file is uploaded.
Copy Data from Project - Copy data from another project on the Platform.
When uploading large files, consider using the Upload Agent, a command-line tool that's both faster and more reliable than uploading via the UI.
Adding Data to Use in Your First Analysis
To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:
From the project's Manage screen, click the Add button, then select Copy Data from Project.
In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.
Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.
Click the box next to the Name header, to select both files.
Click Copy to copy the files to your project.
Step 4. Install Apps
Next, install the apps you need, to analyze the data you added to the project in Step 3:
Select Tools Library from the Tools link in the main menu.
A tool detail page opens, offering a full range of information about the tool and how to use it.
Click the Install button in the upper left part of the screen, under the name of the tool.
In the Install App modal, click the Agree and Install button.
After the tool has been installed, you are returned to the tool detail page.
Use your browser's "Back" button to return to the tools list page.
Repeat Steps 3-6 to install the second tool. (The workflow in Step 5 uses both the BWA-MEM FASTQ Read Mapper and the FreeBayes Variant Caller.)
Step 5. Build a Workflow
Build a workflow using the two apps you installed, and configure it to use the data you added to your project in Step 3.
Adding Workflow Steps
A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:
Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.
Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder opens.
In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.
Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.
Repeat Step 4 for the FreeBayes Variant Caller.
Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You return to the main Workflow Builder screen.
Setting Inputs for Each Step
In the Workflow Builder, required inputs have orange placeholder text, while optional inputs have black placeholder text.
Set the required inputs for each step by doing the following:
To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file. Then click the Select button.
Since the SRR100022 exome was sequenced using paired-end sequencing, you need to provide the right-mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.
Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there is a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.
Next set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second step input labeled "Sorted mappings [array]."
Click on the second step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.
Each tool has different input and output requirements. To learn about a tool's required and optional inputs and outputs, file format restrictions, and other configuration details, refer to its detail page in the Tools Library.
Step 6. Launch the Workflow
You're ready to launch your workflow, by doing the following:
Click the Start Analysis button at the upper right corner of the Workflow Builder.
In the modal window that opens, click the Run as Analysis button.
The BWA-MEM FASTQ Read Mapper starts executing immediately. Once it finishes, the FreeBayes Variant Caller starts, using the Read Mapper's output as an input.
Step 7. Monitor Your Job
Once you've launched your workflow, you are taken to your project's Monitor screen. Here, you see a list of both current and past analyses run within the project, along with key information about each run.
As your workflow runs, its status shows as "In Progress."
Terminating Your Job
If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you see a red button labeled Terminate. Click the button to terminate the job. This process may take some time. While the job is being terminated, the job's status shows as "Terminating."
Step 8. Access the Results
When your workflow completes, output files are placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.
Running the Workflow Using the Full SRR100022 Exome
You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder, in the "Demo Data" project. Because this means working with a much larger file, running the workflow using the exome data takes longer.
Learn More
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
The DNAnexus Platform recognizes three main types of paths for referring to data objects: project paths, job-based object references (JBORs), and DNAnexus links.
Project Paths
To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" is interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:
The folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.
An exception is commands that accept arbitrary names, such as dx describe, which accepts app names, user IDs, and other identifiers. In such cases, all possible interpretations are attempted, but a name is assumed not to be a project name unless it ends in ":".
Job-Based Object References (JBORs)
To refer to the output of a particular job, you can use the syntax <job id>:<output name>.
Examples
If you have the job ID handy, you can use it directly.
Or if you know it's the last analysis you ran:
You can also automatically download a file once the job producing it is done:
If the output is an array, you can extract a single element by specifying its array index (starting from 0) as follows:
DNAnexus Links
DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link, and have as a value either
a string representing a data object ID
another hash with two keys:
project: a string representing a project or other data container ID
id: a string representing a data object ID
For example:
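Both forms, with placeholder IDs, look like this:

```json
{"$dnanexus_link": "file-xxxx"}

{"$dnanexus_link": {"project": "project-xxxx", "id": "file-xxxx"}}
```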
Special Characters
When naming data objects, certain characters require special handling because they have specific meanings in the DNAnexus Platform:
The colon (:) identifies project names
The forward slash (/) separates folder names
Asterisks (*) and question marks (?) are used for wildcard matching
To use these characters in object names, you must escape them with backslashes. Spaces may also need escaping depending on your shell environment and whether you use quotes.
For the best experience, we recommend avoiding special characters in names when possible. If you need to work with objects that have special characters, using their object IDs directly is often simpler.
The table below shows how to escape special characters when accessing objects with these characters in their names:
| Character | Without Quotes | With Quotes |
| --- | --- | --- |
| (single space) | `\ ` | `' '` |
| `:` | `\\:` | `'\:'` |
The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.
For commands where the argument supplied involves naming or renaming something, the only escaping necessary is whatever is necessary for your shell or for setting it apart from a project or folder path.
Name Conflicts
It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you are prompted to select the desired data object.
Some commands (like mv here) allow you to enter * so that all matches are used. Other commands may automatically apply the command to all matches. This includes commands like ls and describe. Some commands require that exactly one object be chosen, such as the run command.
The subjob or child job of stage 1's origin job shares the same temporary workspace as its parent job. API calls to run a new applet or app using /applet-xxxx/run or /app-xxxx/run launch a master job that has its own separate workspace, and (by default) no visibility into its parent job's workspace.
Job States
Successful Jobs
Every successful job goes through at least the following four states:
1. idle: the initial state of every new job, regardless of what API call was made to create it.
2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.
3. running: the job has been assigned to and is being run on a worker in the cloud.
4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state; no job transitions to a different state after reaching done.
Jobs may also pass through the following transitional states as part of more complicated execution patterns:
waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:
it has an unresolved job-based object reference in its input
it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state
it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job). Linked hidden objects are implicitly included in this list
waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:
it has a descendant job that has not been moved to the done state
it has an unresolved job-based object reference in its output
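The progression above can be modeled with a small sketch. This is illustrative only; the names mirror the documented states, not any platform API.

```python
# Documented order of states for a successful job; the two waiting
# states are optional transitional states.
SUCCESS_FLOW = [
    "idle",
    "waiting_on_input",   # optional, between idle and runnable
    "runnable",
    "running",
    "waiting_on_output",  # optional, between running and done
    "done",
]

def is_valid_progression(states):
    """True if the observed states appear in the documented order."""
    positions = [SUCCESS_FLOW.index(s) for s in states]
    return positions == sorted(positions)
```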
Unsuccessful Jobs
Two terminal job states exist other than the done state: terminated and failed. A job can enter either of these states from any other state except another terminal state.
Terminated Jobs
The terminated state occurs when a user requests termination of the job (or another job sharing the same origin job). For all terminated jobs, the failureReason in their describe hash contains "Terminated", and the failureMessage indicates the user responsible for termination. Only the user who launched the job or administrators of the job's project context can terminate the job.
Failed Jobs
Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job not in the same job tree has a job-based object reference or otherwise depends on a failed job, then it also fails. For more information about errors that jobs can encounter, see the Error Information page.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with a JobTimeoutExceeded error.
Restartable Jobs
Jobs can automatically restart when they encounter specific types of failures. You configure which failure types trigger restarts in the executionPolicy of an app, applet, or workflow. Common restartable failure types include:
UnresponsiveWorker
ExecutionError
AppInternalError
AppInsufficientResourceError
JobTimeoutExceeded
SpotInstanceInterruption
How job restarts work
When a job fails for a restartable reason, the system determines where to restart based on the restartableEntryPoints configuration:
master setting (default): The failure propagates to the nearest master job, which then restarts
all setting: The job restarts itself directly
The system restarts a job up to the maximum number of times specified in the executionPolicy. Once this limit is reached, the entire job tree fails.
During the restart process, jobs transition through specific states:
restartable: The job is ready to be restarted
restarted: The job attempt was restarted (a new attempt begins)
Job try tracking
For jobs in root executions launched after July 12, 2023 00:13 UTC, the platform tracks restart attempts using a try integer attribute:
First attempt: try = 0
Second attempt (first restart): try = 1
Third attempt (second restart): try = 2
Multiple API methods support job try operations and include try information in their responses:
/job-xxxx/describe
/job-xxxx/addTags
/job-xxxx/removeTags
/job-xxxx/setProperties
/system/findExecutions
/system/findJobs
/system/findAnalyses
When you provide a job ID without specifying a try argument, these methods automatically refer to the most recent attempt for that job.
Additional States
For unsuccessful jobs, additional states exist between the running state and the terminal state of terminated or failed. Unsuccessful jobs starting in other non-terminal states transition directly to the appropriate terminal state.
terminating: the transitional state when the cloud worker begins terminating the job and tearing down the execution environment. The job moves to its terminal state after the worker reports successful termination or becomes unresponsive.
debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.
Analysis States
All analyses start in the state in_progress, and, like jobs, reach one of the terminal states done, failed, or terminated. The following diagram shows the state transitions for successful analyses.
If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:
partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages are also allowed to finish successfully if they can.
terminating: an analysis may enter this state either via an API call where a user has terminated the analysis, or there is some failure condition under which the analysis is terminating any remaining stages. This may happen if the executionPolicy for the analysis (or a stage of an analysis) had the onNonRestartableFailure value set to "failAllStages".
Billing
Compute and data storage costs for jobs that fail due to user error are charged to the project running those jobs. This includes errors such as InputError and OutputError. The same applies to terminated jobs. For DNAnexus Platform internal errors, these costs are not billed.
The costs for each stage in an analysis are determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed, and the second is not.
A license is required to access the Relational Database Service. Contact DNAnexus Sales for more information.
Overview of the Relational Database Service
DNAnexus Relational DB Cluster States
When describing a DNAnexus DBCluster, the status field can be any of the following:
| DBCluster status | Details |
| --- | --- |
| creating | The database cluster is being created, but not yet available for reading/writing. |
| available | The database cluster is created and all replicas are available for reading/writing. |
| stopping | The database cluster is stopping. |
| stopped | The database cluster is stopped. |
Connecting to a DB Cluster
DB Clusters are not accessible from outside of the DNAnexus Platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page on cloud workstations for one possible way to access a DB Cluster from within a job. Executions such as app/applets can access a DB Cluster as well.
The parameters needed for connecting to the database are:
* - db_std1 instances may incur CPU Burst charges similar to AWS T3 Db instances described in AWS Instance Types. Regular hourly charges for this instance type are based on 1 core, CPU Burst charges are based on 2 cores.
Restriction on Transfers of Projects Containing DBClusters
To understand the archival life cycle as well as which operations can be performed on files and how billing works, it's helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:
| Archival states | Details |
| --- | --- |
| live | The file is in standard storage, such as AWS S3 or Azure Blob. |
| archival | Archival requested on the current file, but other copies of the same file are in the live state in multiple projects with the same billTo entity. The file is still in standard storage. |
| archived | The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE. |
| unarchiving | Unarchiving has been requested; the file is being restored from archival storage to standard storage. |
Different states of a file allow different operations to the file. See the table below, for which operations can be performed based on a file's current archival state.
| Archival states | Download | Clone | Compute | Archive | Unarchive |
| --- | --- | --- | --- | --- | --- |
| live | Yes | Yes | Yes | Yes | No |
* Clone operation would fail if the object is actively transitioning from archival to archived.
File Archival Life Cycle
When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. Only when all copies of the file in all projects with the same billTo organization are in the archival state does the platform automatically transition the file to the archived state.
Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from the archived to the unarchiving state. During the unarchiving state, the file is being restored by the third-party storage platform, such as AWS or Azure. The unarchiving process may take a while depending on the retrieval option selected for the specific platform. Finally, when unarchiving is completed, and the file becomes available on standard storage, the file is transitioned to a live state.
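The life cycle described above can be summarized as a small transition table. This is an illustrative model; the event names are invented for the sketch and are not dxpy API calls.

```python
# Archival life cycle: live -> archival -> archived -> unarchiving -> live.
# The "archival" to "archived" step happens automatically once every copy
# of the file (across projects sharing a billTo) has archival requested.
def next_state(state, event):
    transitions = {
        ("live", "archive"): "archival",
        ("archival", "all_copies_requested"): "archived",
        ("archived", "unarchive"): "unarchiving",
        ("unarchiving", "restore_complete"): "live",
    }
    # Events that do not apply to the current state leave it unchanged,
    # matching the Service's behavior of skipping files in other states.
    return transitions.get((state, event), state)
```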
Archive Service Operations
The File-based Archive Service allows users who have the CONTRIBUTE or ADMINISTER permissions to a project to archive or unarchive files that reside in the project.
Using the API, users can archive or unarchive files, folders, or entire projects, although the archiving process itself happens at the file level. The API accepts a list of up to 1000 files for archiving or unarchiving.
When archiving or unarchiving folders or projects, the API by default archives or unarchives all the files at the root level and those in the subfolders recursively. If you archive a folder or a project that includes files in different states, the Service only archives files that are in the live state and skips files that are in other states. Likewise, if you unarchive a folder or a project that includes files in different states, the Service only unarchives files that are in the archived state, transitions archival files back to the live state, and skips files in other states.
Archival Billing
The archival process incurs specific charges, all billed to the billTo organization of the project:
Standard storage charge: The monthly storage charge for files that are located in the standard storage on the platform. The files in the live and archival state incur this charge. The archival state indicates that the file is waiting to be archived or that other copies of the same file in other projects are still in the live state, so the file is in standard storage, such as AWS S3. The standard storage charge continues to get billed until all copies of the file are requested to be archived and eventually the file is moved to archival storage and transitioned into the archived state.
Archival storage charge: The monthly storage charge for files that are located in archival storage on the platform. Files in the archived state incur a monthly archival storage charge.
Retrieval fee: The retrieval fee is a one-time charge at the time of unarchiving based on the volume of data being unarchived.
Early retrieval fee: If you retrieve or delete data from archival storage before the required retention period is met, an early retrieval fee applies. The retention period is 90 days for AWS regions and 180 days for Microsoft Azure regions. You are charged a pro-rated fee equivalent to the archival storage charges for the remaining days within that period.
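As a rough illustration of the pro-ration, the sketch below assumes a simple daily rate derived from the monthly archival charge; the platform's exact billing formula is not specified here.

```python
# Hypothetical sketch: pro-rated early retrieval fee for data removed from
# archival storage before the retention period ends (90 days on AWS,
# 180 days on Azure).
def early_retrieval_fee(days_stored, monthly_archival_charge, retention_days=90):
    remaining_days = max(retention_days - days_stored, 0)
    daily_rate = monthly_archival_charge * 12 / 365  # approximate daily rate
    return remaining_days * daily_rate
```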
Best Practices
When using the Archive Service, we recommend the following best practices.
The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.
If a file is shared in multiple projects, archiving one copy in one of the projects only transitions the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you need to ensure that all copies of the file in all projects with the same billTo org are being archived. When all copies of the file reach the archival state, the Service moves the files from archival to archived state. Consider using the allCopies option of the API to archive all copies of the file. You must be the org ADMIN of the billTo org of the current project to use the allCopies option.
Refer to the following example: file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, all sharing the same billTo org (org-xxxx). You have ADMINISTER access to project-xxxx and CONTRIBUTE access to project-yyyy, but no role in project-zzzz. As the org ADMIN of org-xxxx, you can use the allCopies option to archive all copies of the file.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json's runSpec.execDepends field.
For additional information, refer to the .
Entry Points
Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, the files are split and merged on basic mem1_ssd1_x2 instances and the more intensive processing step is performed on a mem1_ssd1_x4 instance. Instance type can be set in the dxapp.json's runSpec.systemRequirements:
main
The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.
Region bins are passed to the samtoolscount_bam entry point using the function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
samtoolscount_bam
This entry point downloads and creates a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach stores the scatter results in files and uploads/downloads the information as needed, which exaggerates the scatter-gather pattern for tutorial purposes. You can also pass non-file types, such as int, directly.
combine_files
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, it doesn't do any heavy lifting or processing itself. Notice in the runSpec JSON above that the process starts on a lightweight instance, scales up for the processing entry point, then scales down for the gathering step.
Distributed by Region (py)
This applet creates a count of reads from a BAM format file.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json's runSpec.execDepends field.
For additional information, refer to the .
Entry Points
Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, the applet splits and merges files on basic mem1_ssd1_x2 instances and performs a more intensive processing step on a mem1_ssd1_x4 instance. Instance type can be set in the dxapp.json's runSpec.systemRequirements:
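A sketch of the corresponding dxapp.json fragment, matching the instance types described above:

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```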
main
The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
samtoolscount_bam
This entry point downloads its inputs and builds a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach stores the scatter results in files and uploads/downloads the information as needed, which exaggerates the scatter-gather pattern for tutorial purposes. You can also pass non-file types, such as int, directly.
combine_files
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, remember that the main entry point doesn't do any heavy lifting or processing. The .runSpec JSON shows a workflow that starts with a lightweight instance, scales up for the processing entry point, and then scales down for the gathering step.
Project Navigation
You can treat dx as an invocation command for navigating the data objects on the DNAnexus Platform. By adding dx in front of commonly used bash commands, you can manage objects in the platform directly from the command-line. Common commands include dx ls, dx cd, dx mv, and dx cp, which let you list objects, change folders, move data objects, and copy objects.
Listing Objects
Listing Objects in Your Current Project
By default, when you set your current project, you are placed in its root folder /. You can list the objects and folders in your current folder with dx ls.
Listing Object Details
To see more details, run the command with the -l option: dx ls -l.
As in bash, you can list the contents on a path.
Listing Objects in a Different Project
You can also list the contents of a different project. To specify a path that points to a different project, start with the project-ID, followed by a :, then the path within the project where / is the root folder of the project.
Enclose the path in quotes (" ") so the shell passes the spaces as part of the folder name instead of splitting the path into separate arguments.
Listing Objects That Match a Pattern
You can also list only the objects that match a pattern. In this example, an asterisk * acts as a wildcard to represent all objects with names containing .fasta. This returns only a subset of the objects from the original query.
Enclose the path in quotes for two reasons:
The shell passes the wildcard pattern to dx without expanding it against local files.
The dx command correctly interprets any spaces in the path.
For more information about using wildcards with dx commands, see .
Switching Contexts
Changing Folders
To find out your present folder location, use the dx pwd command. You can switch to a subfolder in a project using dx cd.
Moving or Renaming Data Objects
You can move and rename data objects and folders using the command dx mv.
To rename an object or a folder, "move" it to a new name in the same folder. Here, a file named ce10.fasta.gz is renamed to C.elegans10.fastq.gz.
If you want to move the renamed file into a folder, specify the path to the folder as the destination of the move command (dx mv).
Copying Objects or Folders to Another Project
You can copy data objects or folders to another project by running the command dx cp. The following example shows how to copy a human reference genome FASTA file (hs37d5.fa.gz) from a public project, "Reference Genome Files", to a project "Scratch Project" that the user has ADMINISTER permission to.
You can also copy folders between projects by running dx cp folder_name destination_path. Folders are automatically copied recursively.
The Platform prevents copying a data object within the same project, since each data object exists only once in a project. The system also prohibits using dx cp to copy data objects between projects located in different regions.
Changing Your Current Project
Changing to Another Project With a Project Prompt List
You can change to another project by running the command dx select. It brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".
Changing to a Public Project
To view and select between all public projects, projects available to all DNAnexus users, you can run the command dx select --public:
Changing to a Project With VIEW Permission
By default, dx select lists projects to which you have at least CONTRIBUTE permission. To switch to a project in which you have only VIEW permission, run dx select --level VIEW to list all projects in which you have at least VIEW permission.
Changing Directly to a Specific Project
If you know the project ID or name, you can also give it directly to switch to the project as dx select [project-ID | project-name]:
JupyterLab Reference
This page is a reference for most useful operations and features in the JupyterLab environment.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Download Files from the Project to the Local Execution Environment
Bash
You can download input data from a project using dx download in a notebook cell:
The %%bash keyword converts the whole cell to a magic cell which allows you to run bash code in that cell without exiting the Python kernel. See examples of magic commands in the . The ! prefix achieves the same result:
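As a sketch, a magic cell and its ! equivalent might look like this (the file ID is a placeholder):

```
%%bash
dx download file-xxxx

# Equivalent one-liner with the ! prefix:
# !dx download file-xxxx
```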
Alternatively, the dx command can be executed from the .
Python
To download data with Python in the notebook, you can use the dxpy.download_dxfile function:
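A minimal sketch using dxpy (the file ID and local filename are placeholders):

```python
import dxpy

# Download a platform file to the local execution environment.
# "file-xxxx" stands in for a real file ID in your project.
dxpy.download_dxfile("file-xxxx", "data.txt")
```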
Check the for details on how to download files and folders.
Upload Data from the Session to the Project
Bash
Any files in the execution environment can be uploaded to the project using dx upload:
Python
To upload data using Python in the notebook, you can use the dxpy.upload_local_file function:
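A minimal sketch using dxpy (the filename is a placeholder):

```python
import dxpy

# Upload a local file to the current project; returns a DXFile handle.
uploaded = dxpy.upload_local_file("results.csv")
print(uploaded.get_id())
```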
Check the for details on how to upload files and folders.
Download and Upload Data to Your Local Machine
By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer).
You may upload and download data to the in a similar way, that is, by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download.
Use the Terminal
It is useful to have a JupyterLab terminal at hand: it uses the bash shell by default and lets you execute shell scripts or interact with the Platform via the dx toolkit. For example, the following command confirms the current project context:
Running pwd shows you that the working directory of the execution environment is /opt/notebooks. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.
To open a terminal window, go to File > New > Terminal or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File > New Launcher.
Install Custom Packages in the Session Environment
You can install pip, conda, apt-get, and other packages in the execution environment from the notebook:
By creating a , you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.
Access Public and Private GitHub Repositories from the JupyterLab Terminal
You can clone public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private SSH key registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories with git clone and push changes back to GitHub with git push from the JupyterLab terminal.
Below is a screenshot of a JupyterLab session with a terminal displaying a script that:
sets up ssh key to access a private GitHub repository and clones it,
clones a public repository,
downloads a JSON file from the DNAnexus project.
This animation shows the first part of the script in action:
Run Notebooks Non-Interactively
A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd input and any additional input files via the in input file array. The command runs in the directory where the JupyterLab server is started and notebooks are run, that is, /opt/notebooks/. Any output files generated in this directory are uploaded to the project and returned in the out output.
The cmd input makes it possible to use a papermill tool pre-installed in the JupyterLab environment that executes notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
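A hedged sketch of such an invocation (the file ID is a placeholder):

```shell
dx run dxjupyterlab \
  -icmd="papermill notebook.ipynb output_notebook.ipynb" \
  -iin="file-xxxx"
```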
where notebook.ipynb is the input notebook to papermill, which needs to be passed in the in input, and output_notebook.ipynb is the name of the output notebook, which stores the result of the cells' execution. The output is uploaded to the project at the end of the app execution.
If the snapshot parameter is specified, execution of cmd takes place in the specified Docker container. The duration argument is ignored when running the app with cmd. The app can be run from the command line with the --extra-args flag to limit the runtime, for example, dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.
If cmd is not specified, the in parameter is ignored and the output of an app consists of an empty array.
Use newer NVIDIA GPU-accelerated software
If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA kernel-mode driver (nvidia.ko), which is installed outside the JupyterLab environment, does not support the newer CUDA version your application requires. You can install packages for the newer CUDA version by following the steps below in a JupyterLab terminal.
Session Inactivity
After 15 to 30 minutes of inactivity in the JupyterLab browser tabs, the system logs you out automatically from the JupyterLab session and displays a "Server Connection Error" message. To re-enter the JupyterLab session, reload the JupyterLab webpage and log into the platform to be redirected to the JupyterLab session.
Searching Data Objects
You can use the dx ls command to list the objects in your current project. You can determine the current project and folder you are in by using the command dx pwd. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.
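As a local illustration of these wildcard semantics (the file names below are made up for the demo and mirror the dx ls examples later on; no dx commands are needed):

```shell
# * matches zero or more characters; ? matches exactly one.
mkdir -p /tmp/glob_demo && cd /tmp/glob_demo
touch ce10.fasta.gz ce10.fasta.fai ce10.bt2-index.tar.gz
ls *.fa*                    # matches ce10.fasta.fai and ce10.fasta.gz
ls ce10.???-index.tar.gz    # ??? matches exactly three characters: bt2
```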
Searching Objects with Glob Patterns
Searching Objects in Your Current Folder
By listing objects in your current directory with the wildcard characters * and ?, you can search for objects with a filename using a glob pattern. The examples below use the folder "C. Elegans - Ce10/" in the public project (platform login required to access this link).
Printing the Current Working Directory
Listing Folders and/or Objects in a Folder
Listing Objects Named Using a Pattern
Searching Across Objects in the Current Project
To search the entire project with a filename pattern, use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data under the current project. Below, the command dx find data is used in the public project (platform login required to access this link) using the --name option to specify the filename of objects that you're searching for.
Quoting Wildcards in Shell Commands
When using wildcard characters (* and ?) with dx commands, enclose the pattern in single ' or double " quotes. Without quotes, the shell expands the wildcards against files in your local filesystem before passing the pattern to the dx command, which produces unexpected results.
Quoting the pattern ensures the shell treats it as a literal string and passes it directly to the dx command, where DNAnexus interprets the wildcards to search Platform objects.
Bash also expands other special characters like ?, [, ], {, and }. For complete details about shell expansion and quoting, see the .
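The difference is easy to demonstrate locally, without any dx commands (the directory and file names below are made up for the demo):

```shell
# An unquoted glob is expanded by the shell against local files;
# a quoted glob is passed through to the command as a literal string.
mkdir -p /tmp/quote_demo && cd /tmp/quote_demo
touch local_a.gz local_b.gz
printf '%s\n' *.gz     # shell expands: prints local_a.gz and local_b.gz
printf '%s\n' '*.gz'   # quoted: the literal pattern reaches the command
```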
Escaping Special Characters
Escape special characters in filenames with a backslash (\) when you want to search for them literally. Characters that require escaping include wildcards (* and ?) when you want to find them as literal characters in filenames. You must also escape colons (:) and slashes (/), because these have special meaning in DNAnexus paths.
Shell behavior affects escaping rules. In many shells, you need to either double-escape (\\) or use single quotes to prevent the shell from interpreting the backslash.
The following examples show proper escaping techniques:
Searching Objects with Other Criteria
dx find data also allows you to search data using metadata fields, such as when the data was created, the data tags, or the project the data exists in.
Searching Objects Created Within a Certain Period of Time
You can use the flags --created-after and --created-before to search for data objects created within a specific time period.
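A hedged sketch of such a search (verify the accepted date formats against your dx version):

```shell
# Objects created in the first half of 2023
dx find data --created-after=2023-01-01 --created-before=2023-07-01
```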
Searching Objects by Their Metadata
You can search for objects based on their metadata. An object's metadata can be set by running dx tag or dx set_properties, to tag objects or set key-value pairs respectively. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags.
To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.
Searching Objects in Another Project
You can search for an object living in a different project than your current working project by specifying a project and folder path with the flag --path. Below, the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project (platform login required to access this link) is specified as an example.
Searching Objects Across Projects with VIEW and Above Permissions
To search for data objects in all projects where you have VIEW and above permissions, use the --all-projects flag. Public projects are not shown in this search.
Scoping Within Projects
To describe small numbers of files (typically fewer than 100), scope findDataObjects to a single project.
The following is an example of scoping to a project:
See the for more information about usage.
Stata in JupyterLab
Using Stata via JupyterLab, working with project files, and creating datasets with Spark.
Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel, in Jupyter notebooks.
Before You Begin
Project License Requirement
On the DNAnexus Platform, use the JupyterLab app to create and edit Jupyter notebooks.
You can only run this app within a project that's billed to an account with a license that allows the use of both and . Contact DNAnexus Sales if you need to upgrade your license.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment. A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Stata License Requirement
To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details according to the instructions below in a plain text file with the extension .json, then upload this file to the project's root directory. You only need to do this once per project.
Creating a Stata License Details File
Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username, and <organization> is the org of which you're a member:
Save the file according to the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json
Some operating systems may not support the naming of files with a "." as the first character. If this is the case, you can rename the .json file after uploading it to your project by hovering over the name of your file and clicking the pencil icon that appears.
Uploading the Stata License Details File
Open the project in which you want to use Stata. Upload the Stata license details file to the project's root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.
Secure Indirect Format Option for Shared Projects
When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.
Create a private project. Then create and save a Stata license details file in that project's root directory, per the instructions above.
Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the name of the private project, and file-xxxx is the license details file ID, in that private project:
When working on the Research Analysis Platform, you can only create a private credentials project from the .
Launching JupyterLab
Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.
Select the app JupyterLab with Python, R, Stata, ML, Image Processing.
Click the Run Selected button. If you haven't run this app before, you are prompted to install it. Next, you are taken to the Run Analysis screen.
The app can take some time to load and start running.
Once the analysis starts, you see the notification "Running" appear under the name of the app.
Opening JupyterLab
Click the Monitor tab heading. This opens a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.
If you do not see the worker URL, click on the name of the job in the Monitor page.
Using Stata Within JupyterLab
Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.
Open a new Stata notebook by clicking the Stata tile in the Notebooks section.
Working with Project Files
You can download DNAnexus data files to the JupyterLab container from a Stata notebook with:
Data files in the current project can also be accessed via the /mnt/project folder from a Stata notebook. To load a DTA file:
To load a CSV file:
To write a DTA file to the JupyterLab container:
To write a CSV file to the JupyterLab container:
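The four operations above can be sketched in Stata as follows (paths and filenames are placeholders; the exact commands may vary with your Stata version):

```stata
* Load a DTA file from the read-only project mount
use "/mnt/project/data/example.dta"
* Load a CSV file
import delimited "/mnt/project/data/example.csv", clear
* Write a DTA file to the JupyterLab container
save "/opt/notebooks/output.dta", replace
* Write a CSV file to the JupyterLab container
export delimited using "/opt/notebooks/output.csv", replace
```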
To upload a data file from the JupyterLab container to the project, use the following command in a Stata notebook:
Alternatively, open a new Launcher tab, open Terminal, and run:
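For example (the filename is a placeholder):

```shell
dx upload /opt/notebooks/output.dta
```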
The /mnt/project directory is read-only, so trying to write to it results in an error.
Creating a Stata Dataset with Spark
can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:
The pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:
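As a sketch of both steps (spark_df stands in for a PySpark DataFrame returned by an earlier query; filenames are placeholders):

```python
# Convert the PySpark DataFrame to pandas
pdf = spark_df.toPandas()
# Export to CSV and to Stata DTA in the JupyterLab container
pdf.to_csv("data.csv", index=False)
pdf.to_stata("data.dta")
```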
To upload a data file from the JupyterLab container to the DNAnexus project in the JupyterLab Spark Cluster app, use:
Once saved to the project, data files can be used in a JupyterLab Stata session using the instructions above.
Genomes:human/hg19.fa.gz
dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads
dx describe $(dx find jobs -n 1 --brief):reads
dx download $(dx run some_exporter_app -iinput=my_input -y --brief --wait):file_output
dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads.0
$ dx ls '{"$dnanexus_link": "file-B2VBGXyK8yjzxF5Y8j40001Y"}'
file-name
$ dx ls Project\ Mouse:
name: with/special*characters?
$ dx cd Project\ Mouse:
$ dx describe name\\\\:\ with\\\\/special\\\\\\\\*characters\\\\\\\\?
ID file-9zz0xKJkf6V4yzQjgx2Q006Y
Class file
Project project-9zb014Jkf6V33pgy75j0000G
Folder /
Name name: with/special*characters?
State closed
Hidden visible
Types -
Properties -
Tags -
Outgoing links -
Created Wed Jul 11 16:39:37 2012
Created by alice
Last modified Sat Jul 21 14:19:55 2012
Media type text/plain
Size (bytes) 4
$ dx describe "name\: with\/special\\\\\\*characters\\\\\\?"
...
$ dx new record -o "must\: escape\/everything\*once\?at creation"
ID record-B13BBVK4Zg29fvVv08q00005
...
Name must: escape/everything*once? at creation
...
$ dx rename record-B13BBVK4Zg29fvVv08q00005 "no:escaping/necessary*even?wildcards"
$ dx ls
sample : file-9zbpq72y8x6F0xPzKZB00003
sample : file-9zbjZf2y8x61GP1199j00085
$ dx mv sample mouse_sample
The given path "sample" resolves to the following data objects:
0) closed 2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
1) closed 2012-06-27 15:34:00 sample (file-9zbjZf2y8x61GP1199j00085)
Pick a numbered choice or "*" for all: 1
$ dx ls -l
closed 2012-06-27 15:34:00 mouse_sample (file-9zbjZf2y8x61GP1199j00085)
closed 2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.
Add your Stata settings file as an input. This is the .json file you created, containing your Stata license details.
In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.
Click the Start Analysis button at the top right corner of the screen. This launches the JupyterLab app, and takes you to the project's Monitor tab, where you can monitor the app's status as it loads.
regions = parseSAM_header_for_region(filename)
split_regions = [regions[i:i + region_size]
for i in range(0, len(regions), region_size)]
if not index_file:
mappings_bam, index_file = create_index_file(filename, mappings_bam)
print('creating subjobs')
subjobs = [dxpy.new_dxjob(
fn_input={"region_list": split,
"mappings_bam": mappings_bam,
"index_file": index_file},
fn_name="samtoolscount_bam")
for split in split_regions]
fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
for subjob in subjobs]
def samtoolscount_bam(region_list, mappings_bam, index_file):
"""Processing function.
Arguments:
region_list (list[str]): Regions to count in BAM
mappings_bam (dict): dxlink to input BAM
index_file (dict): dxlink to input BAM
Returns:
Dictionary containing dxlinks to the uploaded read counts file
"""
#
# Download inputs
# -------------------------------------------------------------------
# dxpy.download_all_inputs will download all input files into
# the /home/dnanexus/in directory. A folder will be created for each
# input and the file(s) will be downloaded to that directory.
#
# In this example our dictionary inputs has the following key, value pairs
# Note that the values are all lists
# mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
# mappings_bam_name: [u'<bam filename>.bam']
# mappings_bam_prefix: [u'<bam filename>']
# index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
# index_file_name: [u'<bam filename>.bam.bai']
# index_file_prefix: [u'<bam filename>']
#
inputs = dxpy.download_all_inputs()
# SAMtools view command requires the bam and index file to be in the same directory
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
shutil.move(inputs['index_file_path'][0], os.getcwd())
input_bam = inputs['mappings_bam_name'][0]
#
# Per region perform SAMtools count.
# --------------------------------------------------------------
# Output count for regions and return DXLink as job output to
# allow other entry points to download job output.
#
with open('read_count_regions.txt', 'w') as f:
for region in region_list:
view_cmd = create_region_view_cmd(input_bam, region)
region_proc_result = run_cmd(view_cmd)
region_count = int(region_proc_result[0])
f.write("Region {0}: {1}\n".format(region, region_count))
readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())
return {"readcount_fileDX": readCountDXlink}
def combine_files(countDXlinks, resultfn):
"""The 'gather' subjob of the applet.
Arguments:
countDXlinks (list[dict]): list of DXlinks to process job output files.
resultfn (str): Filename to use for job output file.
Returns:
DXLink for the main function to return as the job output.
Note: Only the DXLinks are passed as parameters.
Subjobs work on a fresh instance so files must be downloaded to the machine
"""
if resultfn.endswith(".bam"):
resultfn = resultfn[:-4] + '.txt'
sum_reads = 0
with open(resultfn, 'w') as f:
for i, dxlink in enumerate(countDXlinks):
dxfile = dxpy.DXFile(dxlink)
filename = "countfile{0}".format(i)
dxpy.download_dxfile(dxfile, filename)
with open(filename, 'r') as fsub:
for line in fsub:
sum_reads += parse_line_for_readcount(line)
f.write(line)
f.write('Total Reads: {0}'.format(sum_reads))
countDXFile = dxpy.upload_local_file(resultfn)
countDXlink = dxpy.dxlink(countDXFile.get_id())
return {"countDXLink": countDXlink}
regions = parseSAM_header_for_region(filename)
split_regions = [regions[i:i + region_size]
for i in range(0, len(regions), region_size)]
if not index_file:
mappings_bam, index_file = create_index_file(filename, mappings_bam)
print('creating subjobs')
subjobs = [dxpy.new_dxjob(
fn_input={"region_list": split,
"mappings_bam": mappings_bam,
"index_file": index_file},
fn_name="samtoolscount_bam")
for split in split_regions]
fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
for subjob in subjobs]
def samtoolscount_bam(region_list, mappings_bam, index_file):
"""Processing function.
Arguments:
region_list (list[str]): Regions to count in BAM
mappings_bam (dict): dxlink to input BAM
index_file (dict): dxlink to input BAM
Returns:
Dictionary containing dxlinks to the uploaded read counts file
"""
#
# Download inputs
# -------------------------------------------------------------------
# dxpy.download_all_inputs will download all input files into
# the /home/dnanexus/in directory. A folder will be created for each
# input and the file(s) will be downloaded to that directory.
#
# In this example, the dictionary has the following key-value pairs
# Note that the values are all lists
# mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
# mappings_bam_name: [u'<bam filename>.bam']
# mappings_bam_prefix: [u'<bam filename>']
# index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
# index_file_name: [u'<bam filename>.bam.bai']
# index_file_prefix: [u'<bam filename>']
#
inputs = dxpy.download_all_inputs()
# SAMtools view command requires the bam and index file to be in the same directory
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
shutil.move(inputs['index_file_path'][0], os.getcwd())
input_bam = inputs['mappings_bam_name'][0]
#
# Per region perform SAMtools count.
# --------------------------------------------------------------
# Output count for regions and return DXLink as job output to
# allow other entry points to download job output.
#
with open('read_count_regions.txt', 'w') as f:
for region in region_list:
view_cmd = create_region_view_cmd(input_bam, region)
region_proc_result = run_cmd(view_cmd)
region_count = int(region_proc_result[0])
f.write("Region {0}: {1}\n".format(region, region_count))
readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())
return {"readcount_fileDX": readCountDXlink}
def combine_files(countDXlinks, resultfn):
"""The 'gather' subjob of the applet.
Arguments:
countDXlinks (list[dict]): list of DXlinks to process job output files.
resultfn (str): Filename to use for job output file.
Returns:
DXLink for the main function to return as the job output.
Note: Only the DXLinks are passed as parameters.
Subjobs work on a fresh instance so files must be downloaded to the machine
"""
if resultfn.endswith(".bam"):
resultfn = resultfn[:-4] + '.txt'
sum_reads = 0
with open(resultfn, 'w') as f:
for i, dxlink in enumerate(countDXlinks):
dxfile = dxpy.DXFile(dxlink)
filename = "countfile{0}".format(i)
dxpy.download_dxfile(dxfile, filename)
with open(filename, 'r') as fsub:
for line in fsub:
sum_reads += parse_line_for_readcount(line)
f.write(line)
f.write('Total Reads: {0}'.format(sum_reads))
countDXFile = dxpy.upload_local_file(resultfn)
countDXlink = dxpy.dxlink(countDXFile.get_id())
return {"countDXLink": countDXlink}
$ dx select
Note: Use "dx select --level VIEW" or "dx select --public" to select from
projects for which you only have VIEW permission to.
Available projects (CONTRIBUTE or higher):
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Mouse (ADMINISTER)
Project # [1]: 2
Setting current project to: Mouse
$ dx ls -l
Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
Folder : /
$ dx select --public
Available public projects:
0) Example 1 (VIEW)
1) Apps Data (VIEW)
2) Parliament (VIEW)
3) CNVkit Tests (VIEW)
...
m) More options not shown...
Pick a numbered choice or "m" for more options: 1
$ dx select --level VIEW
Available projects (VIEW or higher):
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Shared Applets (VIEW)
3) Mouse (ADMINISTER)
Pick a numbered choice or "m" for more options: 2
$ dx ls '*.fa*' # List objects with filenames of the pattern "*.fa*"
ce10.fasta.fai
ce10.fasta.gz
$ dx ls ce10.???-index.tar.gz # List objects with filenames of the pattern "ce10.???-index.tar.gz"
ce10.cw2-index.tar.gz
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
# Correct usage with quotes
dx ls '*.fa*' # Single quotes prevent shell expansion
dx find data --name "*.gz" # Double quotes also work
# Searching for a file with colons in the name
dx find data --name "sample\:123.txt"
# Or alternatively with single quotes
dx find data --name 'sample\:123.txt'
# Searching for a file with a literal asterisk
dx find data --name "experiment\*.fastq"
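These name patterns follow shell-style globbing. As an illustration (not how dx evaluates patterns server-side), Python's fnmatch module applies the same matching rules, which can help you preview what a pattern will select:

```python
import fnmatch

# A few filenames like those in the listings above
names = ["ce10.fasta.fai", "ce10.fasta.gz", "ce10.bwa-index.tar.gz", "reads.bam"]

# '*' matches any run of characters; '?' matches exactly one character
fa_files = [n for n in names if fnmatch.fnmatch(n, "*.fa*")]
indexes = [n for n in names if fnmatch.fnmatch(n, "ce10.???-index.tar.gz")]

print(fa_files)  # ['ce10.fasta.fai', 'ce10.fasta.gz']
print(indexes)   # ['ce10.bwa-index.tar.gz']
```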
dx api system findDataObjects '{"scope": {"project": "project-xxxx"}, "describe":{"fields":{"state":true}}}'
{
"license": {
"serialNumber": "<Serial number from Stata>",
"code": "<Code from Stata>",
"authorization": "<Authorization from Stata>",
"user": "<Registered user line 1>",
"organization": "<Registered user line 2>"
}
}
Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.
Logging In and Out via the User Interface
To log in via the user interface (UI), open the DNAnexus Platform login page and enter your username and password.
To log out via the UI, open your account menu and select Sign Out:
Logging in and out
Logging In via the Command-Line Interface
To log in via the command-line interface (CLI), make sure you've installed the DNAnexus Platform SDK, which provides the dx command-line client. From the CLI, enter the command dx login.
Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.
See below for directions on .
See the for detail on optional arguments that can be used with dx login.
Logging Out via the Command-Line Interface
When using the CLI, log out by entering the command dx logout.
If you use a token to log in, logging out invalidates that token. To log in again, you must generate and use a new token.
See the for detail on optional arguments that can be used with dx logout.
Auto Logout
Session inactivity
By default, the system logs out users after 15 minutes of inactivity. Exceptions apply to users logged in with an API token that specifies a different session duration, or users in an org with a custom autoLogoutAfter policy.
for more information on setting a custom autoLogoutAfter policy for an org.
Credentials change
The system automatically logs out users when they change their account credentials. This happens immediately after the credentials change is complete. Exceptions apply to users logged in with an API token.
The following actions are considered credentials changes:
Change a password
Reset a password
Confirm a new email address after updating account email
Enable or disable multi-factor authentication (MFA)
By default, changing your credentials does not automatically terminate any running jobs or active downloads and uploads that are authenticated on your behalf. When you change your credentials, you can choose Revoke Active Tokens to terminate these running jobs and active transfers. See details on .
If you suspect your account may be compromised, we strongly recommend that you:
Opt-in to terminate any active jobs authenticated on your behalf.
Using Tokens
You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.
Exercise caution when sharing DNAnexus Platform tokens. Anyone with a token can access the Platform and impersonate you as a user. They gain your access level to any projects accessible by the token, enabling them to run jobs and potentially incur charges to your account.
Generating a Token
To generate a token, open your account menu and select My Profile.
Next, click on the API Tokens tab. Then click the New Token button:
The New Token form opens in a modal window:
Consider the following points when filling out the form:
The token provides access to each project at the level at which you have access. See the .
If the token provides access to a project within which you have PHI data access, it enables access to that PHI data.
Tokens without a specified expiration date expire in one month.
After completing the form, click Generate Token. The system generates a 32-character token and displays it with a confirmation message.
Copy your token immediately. The token is inaccessible after dismissing the confirmation message or navigating away from the API Tokens screen.
Using a Token to Log In
To log in with a token via the CLI, enter the command dx login --token, followed by a valid 32-character token.
Token Use Cases
Tokens are useful in multiple scenarios, such as:
Logging in via the CLI with single sign-on enabled - If your organization uses single sign-on (SSO), logging in via the CLI might require a token instead of a username and password.
Logging in via a script - Scripts can use tokens to authenticate with the Platform.
When incorporating a token into a script, take care to set the token's expiration date such that the script has Platform access for only as long as necessary. Ensure as well that the script only has access to that project or those projects to which it must have access, to function properly.
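As a sketch of the above, a script might read its token from the environment and do a cheap local sanity check before logging in. The helper below is hypothetical; the 32-character length is documented above, while the alphanumeric charset is an assumption:

```python
import re

# Tokens are 32-character strings; alphanumeric charset is an assumption here
TOKEN_RE = re.compile(r"[A-Za-z0-9]{32}")

def token_looks_valid(token: str) -> bool:
    """Cheap local sanity check before handing a token to dx login."""
    return TOKEN_RE.fullmatch(token) is not None

# Hypothetical usage in a script (DX_API_TOKEN is an example variable name,
# not a Platform convention):
#   import os, subprocess
#   token = os.environ["DX_API_TOKEN"]
#   if token_looks_valid(token):
#       subprocess.run(["dx", "login", "--token", token], check=True)
```

Reading the token from the environment keeps it out of the script itself and out of version control.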
Revoking a Token
When you revoke API tokens, all running jobs and all active file uploads or downloads authenticated with those tokens are terminated immediately and fail with the . Any compute or egress charges incurred up to the point of termination remain billable to the account associated with those operations.
To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button:
In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token is revoked, and its name no longer appears in the list of tokens on the API Tokens screen.
When to Revoke a Token
Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.
Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.
Logging In Non-Interactively
Though logging in typically requires direct interaction with the Platform through the UI or CLI, non-interactive login is also possible. Scripts commonly automate both login and project selection.
Non-interactive login uses dx login with the --token argument. The command automates project selection. For manual project selection, add the argument to dx login.
Two-Factor Authentication
DNAnexus recommends adding two-factor authentication to your account, to provide an extra means of ensuring the security of all data to which you have access, on the Platform.
With two-factor authentication enabled, you must enter a two-factor authentication code to log into the Platform and access certain other services. This code is a time-based one-time password valid for a single session, generated by a third-party two-factor authenticator application, such as Google Authenticator.
Two-factor authentication protects your account by requiring both your credentials and an authentication code. This prevents unauthorized access even if your username and password are compromised.
Enabling Two-Factor Authentication
DNAnexus recommends using a time-based one-time password (TOTP)–compliant authenticator application on your mobile device. Popular options include Google Authenticator, Authy, Microsoft Authenticator, and 1Password. Google Authenticator is a free application available for both Apple iOS and Android mobile devices.
If you are unable to use a smartphone application, compatible two-factor authenticator applications that use the TOTP (time-based one-time password) algorithm exist for other platforms.
From your account menu, select Account Security.
In the Two-Factor Authentication section, click Enable 2FA.
Choose a TOTP-compatible authenticator application and click Next.
Save your backup codes before closing the setup dialog — they will not be accessible again after you close it. Store them in a secure location. Without backup codes and without access to your authenticator application, Platform login becomes impossible.
Contact support if you lose both your codes and access to your authenticator application.
Logging In with a Backup Code
If you lose access to your authenticator application, you can use a saved backup code to log in.
Navigate to the Platform login page and enter your username and password.
When prompted for your two-factor authentication code, enter one of your saved backup codes instead.
Each backup code can only be used once. Keep track of which codes you have used and store remaining codes securely.
If you lose access to both your authenticator application and your backup codes, you will not be able to log in to the Platform. Contact support for assistance.
Disabling Two-Factor Authentication
DNAnexus recommends keeping two-factor authentication enabled after activation. You can disable it manually, provided it has not been strictly enforced by your organization admin.
Turning off two-factor authentication logs you out of all active web sessions immediately and is treated as a credentials change. If you re-enable 2FA later, you will need to reconfigure your authenticator application by scanning a new QR code or entering a new secret key, and saving a new set of backup codes.
From your account menu, select Account Security.
In the Two-Factor Authentication section, click Turn Off.
In the confirmation dialog, enter your current password and either the 6-digit code from your authenticator app or a backup code.
If you used a backup code to log in to the Platform, you need a second backup code to confirm disabling 2FA in step 3, because each backup code can only be used once.
Troubleshooting Two-Factor Authentication
Code validation failures are most commonly caused by a time-desynchronization issue on your mobile device. This can occur in two ways:
Code invalid during setup — The Platform reports an invalid code immediately after scanning the QR code during initial 2FA setup.
Codes stopped working — You previously set up 2FA successfully, but newly generated codes are no longer accepted at login.
To resolve either issue, enable Automatic Date and Time in your device's system settings and restart your phone. This synchronizes your device clock with the server time required for the TOTP algorithm to generate valid codes.
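The clock dependency follows from how TOTP works: the code is an HMAC over the current 30-second time window, so client and server clocks must agree. A minimal illustrative sketch of the underlying RFC 4226/6238 algorithms (authenticator apps implement the same standards):

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    msg = struct.pack(">Q", counter)
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    # Dynamic truncation: pick 4 bytes at an offset given by the last nibble
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, unix_time: int, step: int = 30) -> str:
    """Time-based OTP (RFC 6238): the counter is the current 30-second window,
    which is why a skewed device clock produces codes the server rejects."""
    return hotp(secret, int(unix_time) // step)
```

A device whose clock drifts by more than the 30-second step computes a different counter value, and therefore a different code, than the server expects.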
Analyzing Somatic Variants
Analyze somatic variants, including cancer-specific filtering, visualization, and variant landscape exploration in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Explore and analyze datasets with somatic variant assays by opening them in the Cohort Browser and switching to the Somatic Variants tab. You can create cohorts based on somatic variants, visualize variant patterns, and examine detailed variant information.
You can analyze somatic variants across four main categories: Single Nucleotide Variants (SNVs) & Indels for small genomic changes, Copy Number Variants (CNVs) for alterations in gene copy numbers, Fusions for structural rearrangements involving gene coding sequences, and Structural Variants (SVs) for larger genomic rearrangements.
Somatic assay datasets are created using the .
Variant Classification
The somatic data model classifies all genomic variants into four main classes, defined by their size, structure, and representation in VCF files. Each variant type has specific criteria that must be met for classification.
Variant Type
Classification Criteria
Examples
Somatic Variants in Cohort Browser
CNVs and Fusions are also classified as Structural Variants in the Cohort Browser because they use symbolic allele representations (<CNV>, <DEL>, <DUP>, <BND>).
Filtering by Somatic Variants
You can to include only samples with specific somatic variants.
To apply a somatic filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Variant (Somatic), select a genomic filter.
In Edit Filter: Variant (Somatic), specify the criteria:
You can specify up to 10 somatic variant filters for each cohort.
After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.
Working with Large Structural Variants (>10Mbp)
Structural variants larger than 10 megabases lack gene-level annotations, which limits how you can filter and visualize them. Use these alternative filtering approaches:
Filter by genomic coordinates: In the Genes / Effects filter, enter genomic coordinates in the format chr:start-end, for example, 17:7661779-7687538 for the TP53 gene region. Set the variant type scope to SV or CNV and leave consequence types blank. Find gene coordinates by typing the gene symbol in the search icon next to the Variants & Events table.
Filter by variant IDs: In the Variant IDs filter, enter up to 10 variant IDs in the format chr_pos_ref_alt, for example, 17_7674257_A_<DEL>
For comprehensive structural variant analysis, combine multiple filtering approaches. Use gene symbol filters to capture annotated structural variants ≤ 10Mbp, then add coordinate-based filters to include larger structural variants in the same genomic regions.
Large structural variants are visible in the table with full details, but they do not appear in the due to missing gene-level annotations.
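As an illustration of the chr:start-end coordinate format used above, a hypothetical helper to parse a region and test whether a variant position falls inside it (the Cohort Browser does this parsing for you):

```python
def parse_region(region: str):
    """Parse a 'chr:start-end' coordinate filter, e.g. '17:7661779-7687538'."""
    chrom, span = region.split(":")
    start_s, end_s = span.split("-")
    return chrom, int(start_s), int(end_s)

def position_in_region(region: str, chrom: str, pos: int) -> bool:
    """Check whether a variant position falls inside the filter region."""
    r_chrom, start, end = parse_region(region)
    return chrom == r_chrom and start <= pos <= end

# The TP53 example region from above, tested against a variant at 17:7674257
print(position_in_region("17:7661779-7687538", "17", 7674257))  # True
```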
Comparing Variant Patterns Across Your Cohort
The Variant Frequency Matrix provides a visual overview of how often somatic variants appear throughout your cohort. The matrix helps you identify variant patterns across tumor samples and discover which variants frequently occur together. You can also measure the mutation burden in different genes and compare how mutation profiles differ between two cohorts. This makes it easier to spot trends and relationships in your data that might not be apparent when examining individual variants.
The Variant Frequency Matrix is interactive: you can filter it, view gene and sample details, and zoom in on specific genes or regions.
In the Variant Frequency Matrix, rows represent genes and columns represent samples; both are sorted by variant frequency.
Sorted gene list: Genes are ranked from most to least frequently affected by variants. A sample is considered "affected" by a gene if it is a tumor sample with at least one detected variant of high or moderate impact in that gene's canonical transcript. Matched normal samples are not included in this calculation.
Sorted sample list: Samples are ordered by the total number of genes that contain variants. This ranking is independent of how frequently each individual gene is affected.
The Variant Frequency Matrix displays up to the top 50 genes with the most variants and up to 500 samples for any given cohort. The samples shown are the 500 with the highest number of genes containing variants. If your cohort has fewer than 500 samples, the matrix shows all samples.
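The ordering described above can be sketched in a few lines. This toy example is illustrative only, not the Cohort Browser's implementation:

```python
from collections import Counter

def rank_matrix(hits, max_genes=50, max_samples=500):
    """Rank genes and samples as described above.

    hits: iterable of (sample_id, gene) pairs, one per qualifying variant
    (high/moderate impact, canonical transcript, tumor samples only).
    """
    pairs = {(s, g) for s, g in hits}              # distinct sample/gene hits
    gene_freq = Counter(g for _, g in pairs)       # genes by # affected samples
    sample_breadth = Counter(s for s, _ in pairs)  # samples by # genes hit
    genes = [g for g, _ in gene_freq.most_common(max_genes)]
    samples = [s for s, _ in sample_breadth.most_common(max_samples)]
    return genes, samples
```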
Filtering by Genes and Consequences
By default, the Variant Frequency Matrix includes all genes and samples. To narrow your view, you can filter the matrix to specific classes of somatic variants, such as SNVs & Indels, Structural Variants, CNVs, or Fusions.
Using the legend in the bottom right, you can focus on specific variants, events, or consequences. This allows you to better explore particular areas of interest, such as high-impact mutations or specific consequences relevant to your research.
When , the matrix can display the top 200 samples (columns) from both the primary and secondary cohorts. The top genes are selected and sorted by their variant frequency within the primary cohort.
Viewing Gene and Sample Details
The Variant Frequency Matrix is highly interactive, allowing you to quickly access more details and apply filters.
When you hover over a cell, the matrix shows a unique identifier for the sample, along with a breakdown of the variants detected in that gene, organized by their consequence type. You can copy the sample ID to your clipboard to apply it to a cohort filter.
When you hover over a gene ID on the left axis, the matrix shows more information about that gene. This includes a unique identifier for the gene, along with a quick breakdown of available external annotations, with direct links to the and databases (when available).
To create a filter, hover over the gene and click + Add to Filter, or copy the gene ID to your clipboard for use in a custom filter.
Color Coding and Consequences
The Variant Frequency Matrix uses color coding to represent the consequences of detected variants, providing a quick visual assessment of variant types. Only high and moderate impact consequences, as defined by , are included in this visualization.
Samples with two or more detected variants are color-coded as "Multi Hit", indicating a complex variant profile.
Exploring Gene-Level Mutation Patterns
The Lollipop Plot is a visualization tool that shows the somatic variants of a cohort on a single gene's canonical protein. With Lollipop Plot, you can identify mutation hotspots within a specific gene, understand the functional impact of variants in the context of protein domains, compare mutation patterns across different patient cohorts, and explore recurrent mutations in cancer driver genes.
Use the Go to Gene field to quickly navigate to a gene of interest, such as TP53.
When you hover over a lollipop, you can see details about the amino acid change, such as the HGVS notation and the frequency of that change in the current cohort. The plot also shows the location of each mutation along the protein sequence, with color coding to indicate the consequence type.
The Lollipop Plot displays SNV & Indel data from the same genomic region as the . When you change the genomic region in the table, the Lollipop Plot updates to reflect the change and the other way around.
Reading the Lollipop Plot
Each lollipop on the plot represents amino acid changes at a specific location.
The horizontal position (X axis) indicates the location of the change, while the height (Y axis) represents the frequency of that change within the current cohort.
Lollipops are color-coded by consequence based on the canonical transcript.
Examining Detailed Variant Information
The Variants & Events table displays details on the same genomic region as the . You can filter the table to focus on specific variant types, such as SNV & Indels, SV (Structural Variants), CNV, or Fusion.
Unlike the Variant Frequency Matrix, the Variants & Events table displays all structural variants including those larger than 10Mbp. Use this table to examine large SVs that may not appear in other visualizations.
Information displayed in the Variants & Events table includes:
Location of variant, with a link to its
Reference allele of variant
Alternate allele of variant
Exporting Variant Information
You can export the selected variants in the Variants & Events table as a list of variant IDs or a CSV file.
To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.
To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).
Accessing External Annotations and Resources
In Variants & Events > Location column, you can click on the specific location to open the locus details.
The locus details show specific SNV & Indel variants as well as up to 200 structural variants overlapping with the specific location. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.
The locus details include enhanced annotations to external resources:
Gene-level links - Direct links to gene information in external databases
Variant-level links - Links to variant-specific annotation resources
These links allow you to quickly navigate to external annotation resources for further information about genes or variants of interest.
SQL Runner
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Overview
Rotate your API tokens after changing credentials. That is, delete your existing API tokens and create new ones.
Enter your current Platform password and click Next.
Scan the provided QR code with your authenticator app. If you cannot scan it, enter the displayed text code manually into your app instead.
Enter the 6-digit code generated by your app and click Next.
Click Print or Download to save your one-time-use backup codes in a secure location.
Click Finish & Log Out to complete setup.
(Optional) Check Revoke Active Tokens to immediately terminate all running jobs and active file uploads and downloads.
Fusion
Structural rearrangements involving gene coding sequences
All must match:
• ALT field contains breakend notation with square brackets ([ or ])
• At least one breakpoint overlaps with an annotated gene or transcript
[chr2:123456[, ]chr5:789012]
Structural Variant (SV)
Large or complex structural changes
Either must match:
• Variant length > 50bp
• ALT field contains symbolic allele (<DEL>, <INV>, <CNV>, <BND>)
<DEL>, <INV>, large insertions
This dual classification ensures they are correctly distinguished from SNVs regardless of their physical length.
For optimal performance and annotation scalability, the Cohort Browser processes SVs and CNVs between 50bp and 10Mbp differently than larger variants:
SVs and CNVs ≤ 10Mbp: Fully annotated with gene symbols and consequences, appear in all visualizations including the Variant Frequency Matrix
SVs and CNVs > 10Mbp: Ingested and visible in the Variants & Events table but lack gene-level annotations. These larger variants do not appear in the Variant Frequency Matrix and cannot be filtered using gene symbols or consequence terms. Use genomic coordinates or variant IDs to filter for these variants (see Working with Large Structural Variants below).
Fusions are not affected by this size limit as they are considered two single-position events.
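The classification criteria above can be summarized as a toy classifier. This sketch is illustrative only; real VCF classification also considers breakpoint annotations and other record details:

```python
import re

SYMBOLIC = re.compile(r"^<[A-Z]+>$")  # symbolic alleles like <DEL>, <DUP>, <CNV>
BREAKEND = re.compile(r"[\[\]]")      # breakend notation like ]chr5:789012]

def classify(ref: str, alt: str, has_copy_number: bool = False) -> str:
    """Classify a variant from its REF/ALT, following the criteria above.

    has_copy_number: whether the FORMAT field carries an explicit CN value.
    """
    if BREAKEND.search(alt):
        return "Fusion"
    if SYMBOLIC.match(alt):
        if has_copy_number and alt in ("<CNV>", "<DEL>", "<DUP>"):
            return "CNV"
        return "SV"
    if max(len(ref), len(alt)) <= 50:
        return "SNV & Indel"
    return "SV"
```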
For datasets with multiple somatic variant assays, select the specific assay to filter by.
Choose whether to include patients with at least one detected variant matching the specified criteria (WITH Variant), or include only patients who have no detected variants matching the criteria (WITHOUT Variant). By default, the filter includes those with matching variants. This choice applies to all specified filtering criteria.
On the Genes / Effects tab, select variants of specific types and variant consequences within specified genes and genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.
On the HGVS tab, specify a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol. Example: KRAS p.Arg1459Ter.
On the Variant IDs tab, specify variant IDs using the standard format chr_pos_ref_alt (for example, 17_7674257_A_G). You can enter up to 10 variant IDs in a comma-separated list.
Enter multiple genes, ranges, or variants, by separating them with commas or placing each on a new line.
Click Apply Filter.
To get variant IDs, navigate to the gene region in the Variants & Events table, select variants of interest, and download the CSV file; the Location column contains the variant IDs.
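The chr_pos_ref_alt variant ID format can be parsed with a small, hypothetical helper:

```python
def parse_variant_id(variant_id: str):
    """Parse a chr_pos_ref_alt variant ID, e.g. '17_7674257_A_G'.

    Splitting into at most four fields keeps symbolic ALT alleles
    such as '<DEL>' intact.
    """
    chrom, pos, ref, alt = variant_id.split("_", 3)
    return chrom, int(pos), ref, alt

print(parse_variant_id("17_7674257_A_G"))  # ('17', 7674257, 'A', 'G')
```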
If a lollipop represents multiple consequence types, it is coded as "Multi Hit".
You can identify mutation hotspots for a given gene and see protein changes in HGVS short form notation, such as T322A, and HGVS.p notation, such as p.Thr322Ala.
Type of variant, such as SNV, Indel, or Structural Variant
Variant consequences, with entries color-coded by level of severity
HGVS cDNA
HGVS Protein
COSMIC ID
RSID, with a link to the dbSNP entry for the variant
SNV & Indel
Single base substitutions and small insertions/deletions with precise allele sequences
All must match:
• Variant size ≤ 50bp
• ALT field contains precise allele (NOT symbolic like <DEL>, <INS>, <DUP>, <CNV>)
A→G, ATCG→A, A→ATCG
Copy Number Variant (CNV)
Changes in gene copy number
All must match:
• ALT field contains symbolic allele (<CNV>, <DEL>, <DUP>)
• Explicit copy number value present in FORMAT field key CN
The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.
How to Run Spark SQL Runner
Input:
sqlfile: [Required] A SQL file which contains an ordered list of SQL queries.
substitutions: A JSON file which contains the variable substitutions.
user_config: User configuration JSON file, in case you want to set or override certain Spark configurations.
Other Options:
export: (boolean) default false. Exports output files with results for the queries in the sqlfile.
export_options: A JSON file which contains the export configurations.
collect_logs: (boolean) default false. Collects cluster logs from all nodes.
executor_memory: (string) Amount of memory to use per executor process, in MiB unless otherwise specified. Common values include 2g or 8g. This is passed as --executor-memory to Spark submit.
executor_cores: (integer) Number of cores to use per executor process. This is passed as --executor-cores to Spark submit.
driver_memory: (string) Amount of memory to use for the driver process. Common values include 2g or 8g. This is passed as --driver-memory to Spark submit.
log_level: (string) default INFO. Logging level for both driver and executors. [ALL, TRACE, DEBUG, INFO]
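The executor_memory, executor_cores, and driver_memory inputs above are passed straight through to spark-submit. A hypothetical sketch of that mapping (the app performs this internally):

```python
def spark_submit_args(executor_memory=None, executor_cores=None,
                      driver_memory=None):
    """Map the optional app inputs onto spark-submit flags."""
    args = []
    if executor_memory is not None:
        args += ["--executor-memory", executor_memory]
    if executor_cores is not None:
        args += ["--executor-cores", str(executor_cores)]
    if driver_memory is not None:
        args += ["--driver-memory", driver_memory]
    return args

print(spark_submit_args(executor_memory="8g", executor_cores=4))
# ['--executor-memory', '8g', '--executor-cores', '4']
```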
Output:
output_files: Output files include report SQL file and query export files.
Basic Run
Examples
sqlfile
How sqlfile is Processed
The SQL runner extracts each command in sqlfile and runs the commands in sequential order.
Every SQL command must be terminated with a semicolon (;).
Any command starting with -- is ignored as a comment. Any comment within a command should be written inside /*...*/. The following are examples of valid comments:
Variable Substitution
Variable substitution can be done by specifying the variables to replace in substitutions.
In the above example, each ${...} reference in sqlfile is replaced using the values from substitutions; for example, srcdb is substituted with sskrdemo1. The script adds the corresponding set commands before executing any of the SQL commands in sqlfile. As a result, with srcdb set to sskrdemo1 and patient_table set to patient, select * from ${srcdb}.${patient_table}; translates to select * from sskrdemo1.patient;
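Because substitutions use the ${var} placeholder style, the behavior can be sketched with Python's string.Template, which uses the same syntax (an illustration, not the runner's implementation):

```python
from string import Template

# Example substitution values, as in the sqlfile discussion above
substitutions = {"srcdb": "sskrdemo1", "patient_table": "patient"}

query = Template("select * from ${srcdb}.${patient_table};")
print(query.substitute(substitutions))  # select * from sskrdemo1.patient;
```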
Export
If enabled, the results of the SQL commands are exported to a CSV file. export_options defines an export configuration.
num_files: default 1. This defines the maximum number of output files to generate. The number generally depends on how many executors are running in the cluster and how many partitions of this file exist in the system. Each output file corresponds to a part file in parquet.
fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example, 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv.
header: Default is true. If true, a header is added to each exported file.
User Configuration
Values in spark-defaults.conf override or add to the default Spark configuration.
Output Files
The export folder contains two generated files:
<JobId>-export.tar: Contains all the query results.
<JobId>-outfile.sql: SQL debug file.
Export Files
After extracting the export tar file, the structure appears as follows:
In the above example, demo is the fileprefix used. The export produces one folder per query. Each folder contains a SQL file with the query executed and a .csv folder containing the result CSV.
SQL Report File
Every SQL run execution generates a SQL runner debug report file. This is a SQL file.
It lists every query executed and the status of each execution (Success or Fail), along with the name of the output file for that command and the time taken. If a query fails, the report records the failing query, and execution of subsequent commands stops.
SQL Errors
During execution of the series of SQL commands, a command may fail (for example, due to a syntax error). In that case, the app quits and uploads a SQL debug file to the project:
The output identifies the line with the SQL error and its response.
The query in the .sql file can be fixed, and this report file can be used as input for a subsequent run, allowing you to resume from where execution stopped.
SELECT * FROM ${srcdb}.${patient_table};
DROP DATABASE IF EXISTS ${dstdb} CASCADE;
CREATE DATABASE IF NOT EXISTS ${dstdb} LOCATION 'dnax://';
CREATE VIEW ${dstdb}.patient_view AS SELECT * FROM ${srcdb}.patient;
SELECT * FROM ${dstdb}.patient_view;
SHOW DATABASES;
SELECT * FROM dbname.tablename1;
SELECT * FROM
dbname.tablename2;
DESCRIBE DATABASE EXTENDED dbname;
-- SHOW DATABASES;
-- SELECT * FROM dbname.tablename1;
SHOW /* this is valid comment */ TABLES;
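Putting the rules together, the way sqlfile is split into commands can be sketched as follows (illustrative only, not the SQL Runner's actual parser):

```python
def split_sql_commands(text: str):
    """Split a sqlfile into individual commands.

    - commands are separated by ';' and may span multiple lines
    - lines starting with '--' are dropped as comments
    - /* ... */ comments inside a command are left for the SQL engine
    """
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith("--")]
    joined = "\n".join(lines)
    return [cmd.strip() + ";" for cmd in joined.split(";") if cmd.strip()]

sqlfile = """-- SHOW DATABASES;
SELECT * FROM
dbname.tablename2;
SHOW /* this is valid comment */ TABLES;
"""
for cmd in split_sql_commands(sqlfile):
    print(repr(cmd))
```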
-- [SQL Runner Report] --;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
-- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
-- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
-- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
-- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] SHOW DATABASES;
-- [SUCCESS][OutputFile: demo-1-out.csv, TimeTaken: 3.85295510292 secs] create database sskrdemo2 location 'dnax://';
-- [SUCCESS][OutputFile: demo-2-out.csv, TimeTaken: 4.8106200695 secs] use sskrdemo2;
-- [SUCCESS][OutputFile: demo-3-out.csv , TimeTaken: 1.00737595558 secs] create table patient (first_name string, last_name string, age int, glucose int, temperature int, dob string, temp_metric string) stored as parquet;
-- [SQL Runner Report] --;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
-- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
-- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
-- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
-- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] select * from ${srcdb}.${patient_table};
-- [FAIL] SQL ERROR while below command [ Reason: u"\nextraneous input '`' expecting <EOF>(line 1, pos 45)\n\n== SQL ==\ndrop database if exists sskrtest2011 cascade `\n---------------------------------------------^^^\n"];
drop database if exists ${dstdb} cascade `;
create database if not exists ${dstdb} location 'dnax://';
create view ${dstdb}.patient_view as select * from ${srcdb}.patient;
select * from ${dstdb}.patient_view;
drop database if exists ${dstdb} cascade `;
Defining and Managing Cohorts
Create, filter, and manage patient cohorts using clinical, genomic, and other data fields in the Cohort Browser.
An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.
Create comprehensive patient cohorts by filtering your datasets. You can combine, compare, and export your cohorts for further analysis.
When you start exploring a dataset, Cohort Browser automatically creates a cohort that includes all patients/samples. You can then refine this cohort by adding filters, and repeat the process to create additional cohorts.
The Cohorts panel gives you an overview of your active cohorts on the dashboard (up to 2) and the recently used cohorts (up to 8) in your current session. These can be temporary unsaved cohorts as well as saved cohorts.
To change the active cohorts on the dashboard, you need to swap them between the Dashboard and Recent sections:
In Cohorts > Dashboard, click In Dashboard to remove a cohort from the dashboard.
In Cohorts > Recent, click Add to Dashboard next to the cohort you want to add to the dashboard.
This way you can quickly explore, compare, and iterate across multiple cohorts within a single session.
Defining Cohort Criteria
Adding Clinical and Phenotypic Filters
To apply a filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Clinical, select a data field to filter by.
Click Add Cohort Filter.
After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.
Adding Assay Filters
With multi-assay datasets, you can create cohorts by applying filters from multiple assay types and instances.
When adding filters, you can find assay types under the Assays tab. This allows you to create cohorts that combine different types of data. For example, you can filter patients based on both clinical characteristics and germline variants, merge somatic mutation criteria with gene expression levels, or build cohorts that span multiple assays of the same type.
To learn more about filtering by specific assay types, see:
When working with an omics dataset that includes multiple assays, such as a germline dataset with both WES and WGS assays, you can:
Choose specific assays for filtering.
Apply different filters per assay.
Create separate cohorts for different assays of the same type and compare results.
Filter Limits by Assay Type
The maximum number of filters allowed varies by assay type and is shared across all instances of that type:
Germline variant assays: 1 filter maximum
Somatic variant assays: Up to 10 filter criteria
Gene expression assays: Up to 10 filter criteria
Creating Filter Groups
If you add multiple filters from the same category, such as Patient or Sample, they automatically form a filter group.
By default, filters within a filter group are joined by the logical operator 'AND', meaning that all filters in the group must be satisfied for a record to be included in the cohort. You can change the logical operator used within the group to 'OR' by clicking on the operator.
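The behavior described above can be sketched in Python: a hypothetical filter group whose members are combined with AND or OR (the record fields and filters are illustrative, not Cohort Browser internals).

```python
# Hypothetical filters over a patient record (illustrative fields)
filters = [
    lambda r: r["age"] > 50,        # filter 1: age over 50
    lambda r: r["smoker"] is True,  # filter 2: is a smoker
]

def in_cohort(record, filters, operator="AND"):
    results = [f(record) for f in filters]
    # AND: every filter must match; OR: any single match suffices
    return all(results) if operator == "AND" else any(results)

patient = {"age": 54, "smoker": False}
print(in_cohort(patient, filters, "AND"))  # False: only one filter matches
print(in_cohort(patient, filters, "OR"))   # True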
Joining Multiple Filters
Join filters allow you to create cohorts by combining criteria across multiple related data entities within your dataset. This is useful when working with complex datasets that contain interconnected information, such as patient records linked to visits, medications, lab tests, or other clinical data.
Understanding Data Entities
An entity is a grouping of data around a unique item, event, or concept.
In the Cohort Browser, an entity can refer either to a data model object, such as patient or visit, or to a specific input parameter in the app.
Common examples of data entities include:
Patient: Demographics, medical history, baseline characteristics
Medication: Prescriptions, dosages, administration records
Creating Join Filters
To create join filters that span multiple data entities:
Start a new join filter: On the cohort panel, click Add Filter or, on a chart tile, click Cohort Filters > Add Cohort Filter.
Select secondary entity: Choose data fields from a secondary entity (different from your primary entity) to create the join relationship.
Add criteria to existing joins: To expand an existing join filter, click Add additional criteria on the row of the chosen filter.
Working with Logical Operators
Join filters support both AND as well as OR logical operators to control how criteria are combined:
AND logic: All specified criteria must be met
OR logic: Any of the specified criteria can be met
Key rules for logical operators:
Click on the operator buttons to switch between the AND logic and OR logic.
For a specific level of join filtering, joins are either all AND or all OR.
When using OR for join filters, the existence condition applies first: "where exists, join 1 OR join 2".
Building Complex Join Structures
As your filtering needs become more sophisticated, you can create multi-layered join structures:
Add criteria to branches: Further define secondary entities by adding additional criteria to existing join branches
Create nested joins: Add more layers of join filters that derive from the current branch
Automatic field filtering: The field selector automatically hides fields that are ineligible based on the current join structure
Practical Examples
The following examples show how join filters work in practice:
First Example Cohort - Separate Conditions: This cohort identifies all patients with a "high" or "medium" risk level who meet both of these conditions:
Have a first hospital visit (visit instance = 1)
Have had a "nasal swab" lab test at any point (not necessarily during the first visit)
Saving Cohorts
You can save your cohort selection as a cohort object in your project by clicking Save Cohort in the top-right corner of the cohort panel.
Cohorts are saved with their applied filters, as well as the latest visualizations and dashboard layout. Like other dataset objects, you can find your saved cohorts under the Manage tab in your project.
To open a cohort, double-click it or click Explore Data.
Need to use your cohorts with a different dataset? If you want to apply your cohort definitions to a different Apollo Dataset, you can use a dedicated app to transfer your saved cohorts to a new target dataset.
Exporting Data from Cohorts
For each cohort, you can export a list of main entity IDs in your current cohort selection as a CSV file by clicking Export sample IDs.
Data Preview
On the Data Preview tab, you can export tabular information as record IDs or a CSV file. Select multiple table rows to see export options in the top-right corner. Exports include only the fields displayed in the Data Preview tab.
The Data Preview supports up to 30 columns per tab. Tables with 30-200 columns show column names only. In such cases, you can save cohorts but data is not queried. Tables with over 200 columns are not supported.
You can view up to 30,000 records in the Data Preview. If your cohort exceeds this size, the table may not display all data. For larger exports, use a dedicated export app instead.
If your view contains more than one table, such as a participants table and a hospital records table, exporting to CSV or TSV generates a separate file for each table.
Download Restrictions
The Cohort Browser follows your project's download restrictions. When every copy of your dataset exists in projects with restricted download policies, downloads are blocked. However, if at least one copy exists in a project that allows downloads, then downloads are permitted.
Downloads are blocked if the database storing your dataset has restricted download permissions, preventing downloads from any Cohort Browser view of that dataset regardless of which project contains the cohort or dashboard.
Downloads are also blocked if the specific cohort or dashboard you're viewing has restricted download permissions, regardless of the underlying dataset permissions.
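Taken together, these rules can be summarized in a short sketch (an assumed simplification for illustration, not the platform's actual logic):

```python
def download_allowed(copy_policies_allow, database_restricted, record_restricted):
    """copy_policies_allow: per-project download flags for each copy of the dataset."""
    # A restriction on the backing database, or on the cohort/dashboard
    # record itself, blocks downloads outright ...
    if database_restricted or record_restricted:
        return False
    # ... otherwise one permissive project copy is enough
    return any(copy_policies_allow)

print(download_allowed([False, True], False, False))  # True
print(download_allowed([True, True], True, False))    # False
```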
Combining Cohorts
You can create complex cohorts by combining existing cohorts from the same dataset.
Near the cohort name, click + > Combine Cohorts.
In the Cohorts panel, click Combine Cohorts.
You can also create a combined cohort based on the cohorts already being compared.
The Cohort Browser supports the following combination logic:
Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.
Cohorts already combined cannot be combined a second time.
Comparing Cohorts
You can compare two cohorts from the same dataset by adding both cohorts into the Cohort Browser.
To compare cohorts, click + next to the cohort name. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.
When comparing cohorts:
All visualizations are converted to show data from both cohorts.
You can continue to edit both cohorts and visualize the results dynamically.
You can compare a cohort with its complement in the dataset by selecting Compare / Combine Cohorts > Not In …. Similar to combining cohorts, you first need to save your current cohort before creating its not-in counterpart.
Cohorts created using Not In cannot be used for further creation of combined or not-in cohorts. "Not In" cohorts are linked to the cohort they are originally based on. Once a not-in cohort is created, further changes to the original cohort definition are not reflected.
Creating Cohorts via CLI
The dx create_cohort command generates a new Cohort object on the platform using an existing Dataset or Cohort object and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object.
When the input is a CohortBrowser typed record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined in a way such that the resulting record is an intersection of the IDs present in the original input and the IDs passed through CLI.
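The intersection behavior described above amounts to a set intersection on primary IDs, sketched here with plain Python sets (the IDs are illustrative):

```python
# IDs already selected by the input cohort's filters
existing_cohort_ids = {"patient_1", "patient_2", "patient_3", "patient_5"}
# IDs supplied on the command line
cli_ids = {"patient_2", "patient_3", "patient_4"}

# The resulting cohort keeps only IDs present in both sets
new_cohort_ids = existing_cohort_ids & cli_ids
print(sorted(new_cohort_ids))  # ['patient_2', 'patient_3']
```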
For additional details, see the example notebooks in the public GitHub repository.
Describing Data Objects
You can describe objects (files, app(let)s, and workflows) on the DNAnexus Platform using the command dx describe.
Describing an Object by Name
Objects can be described using their DNAnexus Platform name via the command line interface (CLI) using a path.
Describe an Object With a Relative Path
Objects can be described relative to the user's current directory on the DNAnexus Platform. In the following example, the indexed reference genome file human_g1k_v37.bwa-index.tar.gz is described.
The entire path is enclosed in quotes because the folder name Original files contains whitespace. Instead of quotes, escape special characters with \: dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.
Describe an Object in a Different Project Using an Absolute Path
Objects can be described using an absolute path. This allows you to describe objects outside the current project context. In the following example, dx select sets the current project to "My Research Project", and dx describe then describes the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.
Describe an Object Using Object ID
Objects can be described using a unique object ID.
This example describes the workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.
Because workflows can include many app(let)s, inputs/outputs, and default parameters, the dx describe output can seem overwhelming.
Manipulating Outputs
The output from a dx describe command can be used for multiple purposes. The optional argument --json converts the output from dx describe into JSON format for advanced scripting and command line use.
In this example, the publicly available workflow object "Exome Analysis Workflow" is described and the output is returned in JSON format.
Parse, process, and query the JSON output using a tool such as jq. Below, the dx describe --json output is processed to generate a list of all stages in the exome analysis pipeline.
To get the "executable" value of each stage present in the "stages" array value of the dx describe output above, use the following command:
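As a sketch, the jq filter below pulls each stage's executable out of a trimmed, illustrative stand-in for the dx describe --json output (the stage and applet names are made up for the example):

```shell
# Trimmed, illustrative stand-in for `dx describe workflow-xxxx --json` output
cat > describe.json <<'EOF'
{
  "id": "workflow-xxxx",
  "stages": [
    {"id": "stage-0", "executable": "applet-mappings"},
    {"id": "stage-1", "executable": "applet-variant-calls"}
  ]
}
EOF

# -r prints raw strings; .stages[].executable walks the stages array
jq -r '.stages[].executable' describe.json
```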
General Response Fields Overview
Field name
Objects
Description
Using Spark
Connect with Spark for database sharing, big data analytics, and rich visualizations.
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is straightforward: platform access levels map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
Spark Applications
You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs leverage the features of normal jobs. You have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You also have access to logs from workers and can monitor the job in the Spark UI.
Visualization
With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts quickly without the need to write complex SQL queries by hand.
Databases
A database is a data object on the Platform and, like other data objects, is stored in a project.
Database Sharing
Databases can be shared with other users or organizations through project sharing. The project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be cloned to another project with a different set of collaborators.
Database and Project Policies
Project policies restrict how data can be modified or copied to other projects. Databases follow the project's Delete Policy and Copy Policy: if a database is in a restricted project, it can be read only from within that project context. Databases also adhere to the project's Data Protection policy. If a database is in a project for which Data Protection is enabled (a "PHI project"), the database is subject to the following restrictions:
The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
If a non-PHI project is provided as a project context, only databases from non-PHI projects are available for retrieving data.
If a PHI project is provided as a project context, only databases from PHI projects are available to add new data.
A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.
Database Access
As with all DNAnexus data objects, database access is controlled by project access. These access levels, together with the state of the database object, translate into specific SQL abilities on the databases, tables, and data in the project.
The following tables list the supported actions on a database and database object, along with the lowest access level necessary for each action on an open and a closed database.
Spark SQL Function
Open Database
Closed Database
Data Object Action
Open Database
Closed Database
(*) If a project is protected, then ADMINISTER access is required.
Database Naming Conventions
The system handles database names in two ways:
User-provided name: Your database name is converted to lowercase and stored as the databaseName attribute.
System-generated unique name: A unique identifier is created by combining your lowercase database name with the database object ID (also converted to lowercase with hyphens changed to underscores) separated by two underscores. This is stored as the uniqueDatabaseName attribute.
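The two rules above can be sketched as a small helper (a hypothetical illustration of the described convention, not platform code; the object ID is made up):

```python
def unique_database_name(db_name: str, database_id: str) -> str:
    # Rule 1: the user-provided name is lowercased (databaseName)
    name = db_name.lower()
    # Rule 2: the object ID is lowercased, hyphens become underscores,
    # and the two parts are joined with a double underscore
    object_part = database_id.lower().replace("-", "_")
    return f"{name}__{object_part}"

print(unique_database_name("MyDB", "database-Xxxx"))
# -> mydb__database_xxxx
```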
When a database is created using the following SQL statement and a user-generated database name (referenced below as db_name):
The platform database object, database-xxxx, is created with all lowercase characters. However, when creating a database using dxpy, the Python module supported by the DNAnexus SDK, dx-toolkit, the following case-sensitive command returns a database ID based on the user-generated database name, assigned here to the variable db_name:
With that in mind, it is suggested either to use lowercase characters in your db_name assignment or to apply a normalizing function such as .lower() to the user-generated database name:
Spark Cluster-Enabled JupyterLab
Learn to use the JupyterLab Spark Cluster app.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
In Edit Filter, select operators and enter the values to filter by.
Click Apply Filter.
Lab Test: Results, procedures, sample information
Second Example Cohort - Connected Conditions: This cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed specifically during their first visit, creating a more restrictive temporal relationship between the visit and lab test.
Subtraction: Select members that are present only in the first selected cohort and not in the second. Example: the subtraction of cohorts A and B would be A - B. Supports 2 cohorts.
Unique: Select members that appear in exactly one of the selected cohorts. Example: the unique of cohorts A and B would be (A - B) ∪ (B - A). Supports 2 cohorts.
Intersection: Select members that are present in ALL selected cohorts. Example: the intersection of cohorts A, B, and C would be A ∩ B ∩ C. Supports up to 5 cohorts.
Union: Select members that are present in ANY of the selected cohorts. Example: the union of cohorts A, B, and C would be A ∪ B ∪ C. Supports up to 5 cohorts.
Not In: Select patients that are present in the dataset, but not in the current cohort. Example: in dataset U, the result of "Not In" A would be U - A.
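These combination operations map directly onto set algebra; a short Python sketch with illustrative IDs:

```python
A = {"p1", "p2", "p3"}
B = {"p2", "p3", "p4"}
dataset = {"p1", "p2", "p3", "p4", "p5"}  # all members of the dataset

intersection = A & B          # members in ALL cohorts
union = A | B                 # members in ANY cohort
subtraction = A - B           # in A but not in B
unique = (A - B) | (B - A)    # in exactly one cohort
not_in_A = dataset - A        # "Not In": the dataset minus the cohort

print(sorted(not_in_A))  # ['p4', 'p5']
```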
The JupyterLab Spark Cluster app is a Spark application that runs a fully-managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis from directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.
Besides the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:
Explore the available databases and get an overview of the available datasets
Perform analyses and visualizations directly on data available in the database
Create databases
Submit data analysis jobs to the Spark cluster
Check the general Overview for an introduction to JupyterLab.
Running and Using JupyterLab Spark Cluster
The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus Platform. The References page has additional useful tips for using the environment.
Instantiating the Spark Context
After creating your notebook in the project, populate its first cells as shown below. It is good practice to instantiate your Spark context at the beginning of your analyses.
Basic Operations on DNAnexus Databases
Exploring Existing Databases
To view the databases to which you have access in your current region and project context, run a cell with the following code:
A sample output should be:
You can inspect one of the returned databases by running:
which should return an output similar to:
To find a database in your current region that may be in a different project than your current context, run the following code:
A sample output should be:
To inspect one of the databases listed in the output, use the unique database name. If you use only the database name, results are limited to the current project. For example:
Creating Databases
Here's an example of how to create and populate your own database:
You can separate each line of code into different cells to view the outputs iteratively.
Using Hail
Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is available with JupyterLab Spark Cluster. It is included in the app and can be used when the app is run with the feature input set to HAIL (set as default).
Initialize the Hail context when you begin using Hail. It is important to pass the previously started Spark context sc as an argument:
We recommend continuing your exploration of Hail with the GWAS using Hail tutorial. For example:
Using VEP with Hail
To use VEP (Ensembl Variant Effect Predictor) with Hail, select "Feature," then "HAIL" when launching Spark Cluster-Enabled JupyterLab via the CLI.
VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequences, and regulatory regions. This includes the LOFTEE plugin, which is activated when using the configuration file below.
Add the following JSON configuration file to your DNAnexus project:
Once the vep-GRCh38.json file is in your project, you can annotate the Hail MatrixTable (mt) using the following command:
Behind the Scenes
The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.
The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.
The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:
The default user in the container is root.
The option --network host is used when starting Docker to remove the network isolation between the host and the Docker container, which allows the container to bind to the host's network and access Spark's master port directly.
Accessing AWS S3 Buckets
S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. The s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.
Public Bucket Access
To access public s3 buckets, you do not need to have s3 credentials. The example below shows how to access the public 1000Genomes bucket in a JupyterLab notebook:
When the above is run in a notebook, the following is displayed:
Private Bucket Access
To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.
JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Jupyter notebooks are a popular way to track the work performed in computational experiments, much as a lab notebook tracks the work done in a wet lab setting. JupyterLab is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. It lets users on the DNAnexus Platform collaborate on notebooks and extends the standard JupyterLab interface with options for directly accessing a DNAnexus project from the JupyterLab environment.
Why Use JupyterLab?
JupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.
JupyterLab is a versatile application that can be used to:
Collaborate on exploratory analysis of data
Reproduce and fork work performed in computational analyses
Visualize and gain insights into data generated from biological experiments
The DNAnexus Platform offers two different JupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, for use with Spark-based analyses.
Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.
The app contains all the features found in the general-purpose JupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.
Version Information
JupyterLab 2.2 is the default version on the DNAnexus Platform.
Creating Interactive Notebooks
A step-by-step guide on how to start with JupyterLab and create and edit Jupyter notebooks can be found on the Quickstart page.
JupyterLab Environments
Creating a JupyterLab session requires the use of two different environments:
The DNAnexus project (accessible through the web platform and the CLI).
The worker execution environment.
The Project on the DNAnexus Platform
You have direct access to the project in which the application is run from the JupyterLab session. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal:
The project is selected when the JupyterLab app is started and cannot be subsequently changed.
The DNAnexus file browser shows:
Up to 1,000 of your most recently modified files and folders
All Jupyter notebooks in the project
Databases (Spark-enabled app only, limited to 1,000 most recent)
The file list refreshes automatically every 10 seconds. You can also refresh manually by clicking the circular arrow icon in the top right corner.
Need to see more files? Use dx ls in the terminal or access them programmatically through the API.
Worker Execution Environment
When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have a [DX] prepended to the notebook name in the tab of all opened notebooks.
The execution environment file browser is accessible from the left sidebar (notice the folder icon at the top) or from the terminal:
To create Jupyter notebooks in the worker execution environment, use the File menu. These notebooks are stored on the local file system of the JupyterLab execution environment and must be saved to a DNAnexus project to persist after the session ends. For more information, see Uploading Data below.
Local vs. DNAnexus Notebooks
DNAnexus Notebooks
Notebooks stored in your DNAnexus project, shown under the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks," and they persist in the DNAnexus project after the JupyterLab instance is terminated. You can edit them directly in the DNAnexus project, as well as duplicate, delete, or download them to your local machine.
DNAnexus notebooks can be recognized by the [DX] prepended to their names in the tab of all opened notebooks.
DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.
Local Notebooks
To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them in the DNAnexus file viewer in the left panel.
Accessing Data
In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.
For reading the input file multiple times or for reading a large fraction of the file in random order:
Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from Jupyter notebook.
Uploading Data
Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:
dx upload in bash console.
Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This uploads the file into the selected DNAnexus folder.
Exporting DNAnexus Notebooks
Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported. However, you can dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. For exporting a local notebook to certain formats, the following commands might be needed beforehand: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
Non-Interactive Execution of Notebooks
A command can be executed in the JupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input, and any additional input files via the in input file array, to the JupyterLab app. The provided command runs in the /opt/notebooks/ directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the JupyterLab app.
The cmd input makes it possible to use the papermill command that is pre-installed in the JupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
Here, notebook.ipynb is the input notebook to the papermill command, passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the app's documentation for details.
Collaboration in the Cloud
Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.
Notebook Locking During Editing
If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.
It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.
Once the editing user closes the notebook, the lock is released and anybody else with access to the project can open it.
Notebook Versioning
Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses are not lost when the JupyterLab session ends.
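As an illustration of that scheme, a hypothetical helper that builds the archived path (the exact timestamp format is an assumption, not necessarily what the platform uses):

```python
from datetime import datetime

def archive_path(notebook_name: str, when: datetime) -> str:
    # The previous version moves to .Notebook_archive with a timestamp suffix
    stem, ext = notebook_name.rsplit(".", 1)
    return f".Notebook_archive/{stem}_{when:%Y-%m-%d_%H-%M-%S}.{ext}"

print(archive_path("analysis.ipynb", datetime(2024, 5, 1, 12, 30, 0)))
# -> .Notebook_archive/analysis_2024-05-01_12-30-00.ipynb
```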
If a notebook saved to the project exceeds 20 MB, it may no longer open in JupyterLab and could trigger a "JSON Parse Error." To recover your code, open an earlier version from the .Notebook_archive folder, or download the notebook to your local machine and clear the notebook's outputs using a local Jupyter editor before re-uploading.
Session Timeout Control
JupyterLab sessions begin with a set duration and shut down automatically at the end of this period. The timeout clock appears in the footer on the right side and can be adjusted using the Update duration button. The session terminates at the set timestamp even if the JupyterLab webpage is closed. Job lengths have an upper limit of 30 days, which cannot be extended.
A session can be terminated immediately from the top menu (DNAnexus > End Session).
Environment Snapshots
It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).
A JupyterLab session runs in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to a snapshot file, except for the directories /home/dnanexus and /mnt/, which are not included. This file is then uploaded to the project folder .Notebook_snapshots and can be passed as an input the next time the app is started.
If many large files are created locally, the resulting snapshots take a long time to save and load. In general, it is recommended not to snapshot more than 1 GB of locally saved data/packages and rely on downloading larger files as needed.
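A saved snapshot can be supplied as an app input when starting the next session; a minimal sketch, assuming the app's input is named snapshot and using a hypothetical snapshot path:

```shell
# Start a new JupyterLab session restored from a previously saved snapshot
dx run dxjupyterlab -isnapshot=.Notebook_snapshots/jupyter_snapshot.gz
```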
Snapshots Created in Older Versions of JupyterLab
Snapshots created with JupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These previous snapshots contain tool versions that may conflict with the newer environment, potentially causing problems.
Using Previous Snapshots in the Current Version of JupyterLab
To use a snapshot from a previous version in the current version of JupyterLab, recreate the snapshot as follows:
Create a tarball incorporating all the necessary data files and packages.
Save the tarball in a project.
Launch the current version of JupyterLab.
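The steps above can be sketched as follows (the file, directory, and project names are hypothetical):

```shell
# 1. Create a tarball incorporating the necessary data files and packages
tar -czf my_environment.tar.gz my_data/ my_packages/

# 2. Save the tarball in a project on the Platform
dx upload my_environment.tar.gz --path "My Research Project:/"

# 3. Launch the current version of JupyterLab, then unpack the tarball there
dx run dxjupyterlab
```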
Accessing an Older Snapshot in an Older Version of JupyterLab
If you don't want to have to recreate your older snapshot, you can run an older version of the JupyterLab app and access the snapshot there.
Viewing Other Files in the Project
Viewing other file types from your project, such as CSV, JSON, PDF files, images, or scripts, is convenient because JupyterLab renders each accordingly. For example, JSON files are collapsible and navigable, and CSV files are presented in tabular format.
However, editing and saving any open files from the project other than IPython notebooks results in an error.
Files larger than 20 MB display only their metadata in the JupyterLab file viewer. To access the full contents of a large file, download it using dx download or the DNAnexus Download Agent, or use the DNAnexus file browser on the Platform.
Permissions in the JupyterLab Session
The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.
When running the JupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to be able to read databases which may be located in different projects and cannot be cloned.
Running Jobs in the JupyterLab Session
Use dx run to start new jobs from within a notebook or the terminal. If the billTo for the project where your JupyterLab session runs does not have a license for detached executions, any started jobs run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job uses the JupyterLab session's workspace instead of the specified project. If a subjob fails or terminates on the DNAnexus Platform, the entire job tree—including your interactive JupyterLab session—terminates as well.
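For example, a job can be started from the session terminal or a notebook cell; whether --project is honored depends on the detached-executions license described above (the command input and project ID are hypothetical):

```shell
# Start a job from within the JupyterLab session; without a detached-executions
# license this runs as a subjob of the session and --project is ignored
dx run app-swiss-army-knife -icmd="echo hello" --project project-xxxx -y --brief
```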
Jobs are limited to a runtime of 30 days. The system automatically terminates jobs running longer than 30 days.
Environment and Feature Options
The JupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users see a "403 Permission Forbidden" message under the JupyterLab session's URL.
On the DNAnexus Platform, the JupyterLab server runs in a Python 3.9.16 environment, in a container running Ubuntu 20.04 as its operating system.
Feature Options
When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML.
PYTHON_R (default option): Loads the environment with Python3 and R kernel and interpreter.
ML: Loads the environment with Python3 and machine learning packages, such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype, but it does not contain R.
IMAGE_PROCESSING
The JupyterLab environment is headless and command-line only. While FSL and FreeSurfer command-line tools are available for batch processing, GUI viewers such as fsleyes and freeview cannot be launched. To visualize results interactively, download the output files to your local machine.
STATA: Requires a license to run. See the documentation for more information about running Stata in JupyterLab.
MONAI_ML: Loads the environment with Python3 and extends the ML feature. This feature is ideal for medical imaging research involving machine learning model development and testing. It includes medical imaging frameworks designed for AI-powered analysis.
For the full list of pre-installed packages, including the feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML features, see the pre-installed packages reference.
Installing Additional Packages
Additional packages can be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
JupyterLab Documentation
For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.
Next Steps
Create your first notebooks by following the instructions in the guide.
See the guide for tips and info on the most useful JupyterLab features.
Organizations
Learn about organizations, which associate users, projects, and resources with one another, enabling fluid collaboration, and simplifying the management of access, sharing, and billing.
This functionality is also available via command line interface (CLI) tools. You may find it easier to use the CLI tools to perform some actions, such as inviting multiple users or exporting information into a machine-readable format.
$ dx describe "Original files/human_g1k_v37.bwa-index.tar.gz"
Result 1:
ID file-xxxx
Class file
Project project-xxxx
Folder /Original files
Name human_g1k_v37.bwa-index.tar.gz
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created ----
Created by Amy
via the job job-xxxx
Last modified ----
archivalState "live"
Size 3.21 GB
$ dx select "My Research Project"
$ dx describe Reference\ Genome\ Files:H.\ Sapiens\ -\ GRCh37\ -\ b37\ (1000\ Genomes\ Phase\ I)/human_g1k_v37.fa.gz
Result 1:
ID file-xxxx
Class file
Project project-xxxx
Folder /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)
Name human_g1k_v37.fa.gz
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created ----
Created by Amy
via the job job-xxxx
Last modified ----
archivalState "live"
Size 810.45 MB
db = "database_xxxx__brca_pheno"
spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
# Create a database backed by DNAnexus storage, then a Parquet table in it
my_database = "my_database"
spark.sql(f"CREATE DATABASE {my_database} LOCATION 'dnax://'")
spark.sql(f"CREATE TABLE {my_database}.foo (k STRING, v STRING) USING PARQUET")
spark.sql(f"INSERT INTO TABLE {my_database}.foo VALUES ('1', '2')")
spark.sql(f"SELECT * FROM {my_database}.foo").show()
import hail as hl
hl.init(sc=sc)
# Download example data from 1k genomes project and inspect the matrix table
hl.utils.get_1kg('data/')
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')
mt.rows().select().show(5)
# Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
# as well as VEP and LoF resources that are pre-installed on every Spark node when
# HAIL-VEP feature is selected.
annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
# Read a CSV file from a public S3 bucket
df = spark.read.options(delimiter='\t', header='True', inferSchema='True').csv("s3://1000genomes/20131219.populations.tsv")
df.select(df.columns[:4]).show(10, False)
# Access private data in S3 by first unsetting the default credentials provider
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', '')
# Replace "redacted" with your keys
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'redacted')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'redacted')
df = spark.read.csv("s3a://your_private_bucket/your_path_to_csv")
df.select(df.columns[:5]).show(10, False)
Create figures and tables for scientific publications
Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows
Test and train machine/deep learning models
Interactively run commands on a terminal
For scanning the content of the input file once, or for reading only a small fraction of the file's content:
The project in which the app is running is mounted read-only at the /mnt/project folder. Reading the content of files under /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment, but uses more API calls to fetch the content.
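For example, a small slice of a project file can be inspected through the mount without downloading the whole file (the file path is hypothetical):

```shell
# Peek at the first lines of a project file; only the needed bytes are fetched
head -n 5 /mnt/project/samples/sample_sheet.csv

# Count records without keeping a local copy of the file
wc -l /mnt/project/samples/sample_sheet.csv
```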
IMAGE_PROCESSING: Loads the environment with Python3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run; details about license creation and usage can be found in the FreeSurfer documentation.
my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
What Is an Org?
An organization (or "org") is a DNAnexus entity used to manage a group of users. Use orgs to group users, projects, and other resources together, in a way that models real-world collaborative structures.
In its simplest form, an org can be thought of as referring to a group of users on the same project. An org can be used efficiently to share projects and data with multiple users - and, if necessary, to revoke access.
Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting DNAnexus Sales.
Orgs are referenced on the DNAnexus Platform by a unique org ID, such as org-dnanexus. Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.
Org Membership Levels
Users may have one of two membership levels in an org:
ADMIN
MEMBER
An ADMIN-level user is granted all possible access in the org and may perform org administrative functions. These functions include adding/removing users or modifying org policies. A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses in the org and has no administrative power in the org.
Members
A user with the MEMBER level can be configured with a subset of the following org accesses. These access levels determine which actions each user can perform in an org.
Access
Description
Options
Billable activities access
If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
[Allowed] or [Not Allowed]
Shared apps access
If allowed, the org member has access to view and run apps in which the org has been added as an "authorized user".
[Allowed] or [Not Allowed]
These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.
Admins
Org admins are granted all possible access in the org. More specifically, org admins receive the following set of accesses:
Access
Level
Billable activities access
Allowed
Shared apps access
Allowed
Shared projects access
ADMINISTER
Org admins also have the following special privileges:
Viewing Metadata for All Org Projects
Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin can see all projects billed to the org, including Project-P. Org admins can also invite themselves to Project-P at any time to get access to objects and jobs in the project.
Becoming a Developer for All Org Apps
Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A. However, any org admins may add themselves as developers at any time.
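From the CLI, an org admin could exercise the project privilege described above by inviting themselves to an org project (the user and project IDs are placeholders):

```shell
# As an org admin, grant yourself full permissions on a project billed to the org
dx invite user-admin project-xxxx ADMINISTER
```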
Examples of Using Orgs
Org Structure Diagram
In the diagram below, there are 3 examples of how organizations can be structured.
ORG-1
The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. One admin (user A) manages ORG-1.
ORG-2 and ORG-3
The second example shows ORG-2 and ORG-3, demonstrating a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.
In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.
ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).
In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.
Example 1: Creating an Org for Sharing Data
You can create a non-billable org as an alias for a group of users. For example, you have a group of users who all need access to a shared dataset. You can make an org which represents all the users who need access to the dataset, for example, an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org have at least VIEW "shared project access" and "shared app access" so that they are all given permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, and then no longer have access to any projects or apps shared with org-dataset_access.
Example 2: Only Admins can Create Projects
You can contact DNAnexus Sales to create a billable org where only one member, the org admin, can create new org projects. All other org members are not granted the "billable activities access", and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER) and share every org project with the org with ADMINISTER access. The members' permissions to the projects are restricted by their respective "shared project access."
For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group are only given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you need to add them to the org as a new member and assign them the appropriate permission levels.
This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.
Example 3: Shared Billing Account
You can contact DNAnexus Sales to create a billable org where users work independently and bill their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources. These resources might include incoming samples or reference datasets.
In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.
The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked by removing them from the org.
Other Cases
In general, it is possible to apply many different schemas to orgs as they were designed for many different real-life collaborative structures. If you have a type of collaboration you would like to support, contact DNAnexus Support for more information about how orgs can work for you.
Managing Your Orgs
If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.
The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.
Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.
Viewing and Updating Org Information
You can find the org overview on the Settings tab. From here, you can:
View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).
View the organization ID, the unique ID used to reference a particular org on the CLI. An example org ID would be org-demo_org.
View the number of org members, org projects, and org apps.
View the list of organization admins.
Managing Org Members
Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.
From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.
Inviting a New Member
To add an existing DNAnexus user to your org, use the + Invite New Member button on the org's Members tab. This opens a screen where you can enter the user's username, such as smithj, or user ID, such as user-smithj. You can then configure the user's access level in the org.
If you add a member to the org with billable activities access set to billing allowed, they have the ability to create new projects billed to the org.
However, adding the member does not change their default billing account. If the user wishes to use the org as their default billing account, they must set their own default billing account.
If the member has any pre-existing projects that are not billed to the org, the user must transfer the project to an org if they wish to have the project billed to the org.
The user receives an email notification informing them that they have been added to the organization.
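The same invitation can be issued from the CLI with dx add member; a sketch, with placeholder org and username:

```shell
# Add an existing user to the org as a MEMBER with billable activities access
dx add member org-demo_org smithj --level MEMBER --allow-billable-activities
```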
Creating New DNAnexus Accounts
Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user then receives an email with instructions to activate their account and set their password.
If this feature has already been turned on for an org you administer, you see an option to Create New User when you go to invite a new member.
Here you can specify a username, such as alice or smithj, the new user's name, and their email address. The system automatically creates a new user account for the given email address and adds them as a member in the org.
If you create a new user and set their Billable Activities Access to Billing Allowed, consider setting the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.
Editing Member Access
From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.
When you edit multiple members, you have the option of changing only one access while leaving the rest alone.
Removing Members
From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.
Removing a member revokes the user's access to all projects and apps billed to or shared with the org.
Org Projects
The org's Projects tab lets you see the list of all projects billed to the org. This list includes all projects in which you have VIEW or above permissions, as well as projects billed to the org in which you have no permissions (that is, projects of which you are not a member).
You can view all project metadata, such as the list of members, data usage, and creation date. You can also view other optional columns such as project creator. To enable the optional columns, select the column from the dropdown menu to the right of the column names.
Granting Admin Access to Org Projects
Org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you are still able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.
You can also grant yourself ADMINISTER permissions if you are a member of a project billed to your org but you only have VIEW, CONTRIBUTE, or UPLOAD permissions.
Org Billing
Accessing Org Billing Information
To access your org's billing information:
In the main menu, click Orgs > All Orgs.
Select an organization you want to view.
Select the Billing tab to view billing information.
Setting Up or Updating Billing Information for an Org
To set up or update the billing information for an org you administer, contact the DNAnexus Billing team.
Setting up billing for an organization designates someone to receive and pay DNAnexus invoices, including usage by organization members. The billing contact can be you, someone from your finance department, or another designated person.
When you click Confirm Billing, DNAnexus sends an email to the designated billing contact requesting confirmation of their responsibility for receiving and paying invoices. The organization's billing contact information does not update until DNAnexus receives this confirmation.
Setting and Modifying an Org Spending Limit
The org spending limit is the total in outstanding usage charges that can be incurred by projects linked to an org.
If you are an org admin, you can set or modify this spending limit:
In the main menu, click Orgs > All Orgs.
Select the org for which you'd like to set or modify a spending limit.
In the org details, select the Billing tab.
In Summary, click Increase Spending Limit to request increasing the limit via DNAnexus Support.
Doing this only submits your request.
Before approving your request, DNAnexus Support may follow up with you via email with questions about the change.
Viewing Estimated Charges
The Usage Charges section allows users with billable access to view total charges incurred to date. You can see how much is left of the org's spending limit. This section is only visible if your org is a billable org, which means your org has confirmed billing information.
If your org doesn't have a spending limit, your org is unlimited and shows as "N/A."
Using Monthly Project Spending and Usage Limits
You need a license to use both the Monthly Project Usage Limit for Computing and Egress, and Monthly Project Spending Limit for Storage features. Contact DNAnexus Sales for more information.
For orgs with the Monthly Project Usage Limit for Computing and Egress and/or the Monthly Project Spending Limit for Storage feature enabled, org admins can set, update, and view default limits for each spending type, and set limit enforcement actions via API calls.
Configure limits and their enforcement in org details > Billing > Usage Limits.
Monitoring Spending and Usage
Licenses are required to use the Per-Project Usage Report and Root Execution Stats Report features. Contact DNAnexus Sales for more information.
Configuration of these features, and report delivery, is handled by DNAnexus Support.
The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide detailed breakdowns of charges incurred by org members. These reports help you track and analyze spending patterns across your organization. For more information, see organization management and usage monitoring.
Org Policies
Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:
Policy
Description
Options
Membership List Visibility
Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.
[ADMIN], [MEMBER], or [PUBLIC]
Project Transfer
Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).
[ADMIN] or [MEMBER]
DNAnexus recommends, as a starting point, to restrict the "membership list visibility policy" to ADMIN and "project transfer policy" to ADMIN. This ensures that only the org admin is allowed to see the list of members and their access within the org and that org projects always remain under control of the org.
You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here, you can both change the membership list visibility and restrict project transfer policies for the org and contact DNAnexus Support to enable PHI data policies for org projects.
Glossary of Org Terms
Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.
Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the app's billing account is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.
Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.
Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Remember that ADMINISTER is a type of access level.
Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.
Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and do not have any org projects or org apps. Any user can share a project with a non-billable org.
Org access is granted to a user to determine which actions the user can perform in an org.
Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.
Org app is an app billed to an org.
Org ID is the unique ID used to reference a particular org on the DNAnexus Platform. An example is org-dnanexus.
Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.
Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.
Org project describes a project billed to an org.
Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.
Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.
Share with an org means to give the members of an org access to a project or app via giving the org access to the project or adding the org as an "authorized user" of an app.
Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."
Shared projects access is an org access level that can be granted to org members: the maximum access level a user can have in projects shared with an org.
You can run workflows from the command-line using the command dx run. The inputs to these workflows can be from any project for which you have VIEW access.
The examples here use the publicly available Exome Analysis Workflow (platform login required to access this link).
Running dx run without specifying an input launches interactive mode. The system prompts for each required input, followed by options to select from a list of optional parameters to modify. Optional parameters include all modifiable parameters for each stage of the workflow. The interface outputs a JSON file detailing the input specified and generates an analysis ID of the form analysis-xxxx unique to this particular run of the workflow.
Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.
Running in Non-Interactive Mode
You can specify each input on the command line using the -i or --input flag with the syntax -i<stage ID>.<input name>=<input value>. <input value> must be a DNAnexus object ID or the name of a file in the project you have selected. It is also possible to specify the number of a stage in place of the stage ID, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only to illustrate this point. The parentheses around the <input value> in the help string are omitted when entering input.
Possible values for the input name field can be found by running the command dx run workflow-xxxx -h, as shown below using the Exome Analysis Workflow.
This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message first lists the required inputs for that stage, specifying the requisite type in the <input-value> field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the Platform). Lastly, it lists advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link.
This link format can also be used to specify output from any prior stage in the workflow as input for the current stage.
For the Exome Analysis Workflow, one required input parameter needs to be specified manually: -ibwa_mem_fastq_read_mapper.reads_fastqgzs.
This parameter targets the first stage of the workflow. For convenience, use the stage number instead of the full stage ID. Since this is the first stage (and workflow stages are zero-indexed), replace bwa_mem_fastq_read_mapper with 0 like this: -i0.reads_fastqgzs.
The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.
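A hedged sketch of that command (the workflow ID and the reference genome index file ID are placeholders; the reads path comes from the public demo project shown in the interactive transcript):

```shell
dx run workflow-xxxx \
  -i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
  -i0.genomeindex_targz=file-xxxx \
  -y
```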
Specifying Array Input
Specify array input by repeating the -i flag for a single parameter in a stage. For example, the following flags would add files 1 through 3 to the file_inputs parameter for stage-xxxx of the workflow:
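A sketch of the repeated flags (the workflow, stage, and file IDs are placeholders); each file is appended to the array in the order given:

```shell
dx run workflow-xxxx \
  -istage-xxxx.file_inputs=file-xxx1 \
  -istage-xxxx.file_inputs=file-xxx2 \
  -istage-xxxx.file_inputs=file-xxx3
```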
If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>.
Job-Based Object References (JBORs)
The -i flag can also be used to specify job-based object references (JBORs) with the syntax -i<stage ID or number>.<input name>=<job id>:<output name>. Combined with the --brief flag (which makes dx run output only the execution's ID) and the -y flag (which skips the interactive prompts confirming the execution), you can chain two executions together in one command.
The example below calls the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to produce the sorted_bam output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h. This output is then used as input to the first stage of the Exome Analysis Workflow featured on the DNAnexus Platform (platform login required to access this link).
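A minimal sketch of that chaining (the workflow ID, the reads file, and the stage input name are placeholders for your own values):

```shell
# Launch the mapper; -y skips confirmation, --brief prints just the job ID
job=$(dx run app-bwa_mem_fastq_read_mapper -ireads_fastqgz=reads.fastq.gz -y --brief)
# Reference its sorted_bam output as input to the first stage (index 0)
dx run workflow-xxxx -i0.input_name="$job":sorted_bam
```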
Advanced Options
Quiet Output
Adding the --brief flag to a dx run command causes dx to print only the execution's analysis ID ("analysis-xxxx") instead of the full input JSON for the execution. This ID can be saved for later reference.
Rerunning Analyses With Modified Settings
To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]. The [options] parameters override anything set by the --clone flag, and take the form of options passed as input from the command line.
The --clone flag does not copy the usage of the --allow-ssh or --debug-on flags, which must be set again on the new execution. Only the applet, instance type, and input spec are copied. See the page on connecting to jobs for more information on the usage of these flags.
For example, the command below redirects the output of the analysis to the outputs/ folder and reruns all stages.
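A sketch of such a command (the analysis ID is a placeholder):

```shell
# Rerun every stage of the prior analysis, sending outputs to the outputs/ folder
dx run --clone analysis-xxxx --destination outputs/ --rerun-stage "*" -y
```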
Only the outputs of stages rerun are placed in the destination specified.
Rerunning Specific Stages
When rerunning workflows, if a stage would run identically to how it ran in a previous analysis, the stage itself is not rerun; the outputs of that stage are not copied or rewritten in a new location. To force a stage to run again, use the option --rerun-stage STAGE_ID, where STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). To rerun all stages of an analysis, use --rerun-stage "*", with the asterisk enclosed in quotes to prevent the shell from globbing it into the names of files in your current directory.
The command below reruns the third and final stage of analysis-xxxx
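A sketch of the command (the analysis ID is a placeholder):

```shell
# Stages are zero-indexed, so index 2 selects the third stage
dx run --clone analysis-xxxx --rerun-stage 2 -y
```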
Specifying Analysis Output Folders
The --destination flag allows you to specify the path of the output of a workflow. By default, every output of every stage is written to the destination specified.
Specifying Output Folders
You can use the --stage-output-folder <stage_ID> <folder> option to specify the output destination of a particular stage in the analysis being run. Here stage_ID is the stage's ID (stage-xxxx), the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). The folder argument is the project and path to which the stage should write, using the syntax project-xxxx:/PATH, where PATH is the path to the folder in project-xxxx in which to write outputs.
The following command reruns all stages of analysis-xxxx and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:
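A sketch of the command (the analysis ID is a placeholder; a relative folder resolves against the current project):

```shell
dx run --clone analysis-xxxx --rerun-stage "*" \
  --stage-output-folder 0 mappings -y
```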
Specifying Stage-Relative Output Folders
If you want to specify the output folder of a stage relative to the output folder of the entire analysis, use the flag --stage-relative-output-folder <stage_id> <folder>, where stage_id is the stage's ID (stage-xxxx), the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, specify a quoted path, relative to the output folder of the analysis, to which that stage's output is written.
The following command reruns all stages of analysis-xxxx, setting the output destination of the analysis to /exome_run, and the output destination of stage 0 to /exome_run/mappings in the current project:
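A sketch of the command (the analysis ID is a placeholder):

```shell
dx run --clone analysis-xxxx --rerun-stage "*" \
  --destination /exome_run \
  --stage-relative-output-folder 0 mappings -y
```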
Specifying a Different Instance Type
To specify the instance type of all stages in your analysis or a specific set of stages in your analysis, use the flag --instance-type. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE sets one instance type for all stages. The two options can be combined, for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16 sets all stages' instance types to mem2_ssd1_x2 except for the stage my_stage_0, for which mem3_ssd1_x16 is used.
Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).
The example below reruns all stages of analysis-xxxx and specifies that the first and second stages should be run on mem1_ssd2_x8 and mem1_ssd2_x16 instances respectively:
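A sketch using the STAGE_ID=INSTANCE_TYPE format described above (the analysis ID is a placeholder; stages are zero-indexed):

```shell
dx run --clone analysis-xxxx --rerun-stage "*" \
  --instance-type 0=mem1_ssd2_x8 \
  --instance-type 1=mem1_ssd2_x16 -y
```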
Adding Metadata to an Analysis
This is identical to adding metadata to a job. See the Adding Metadata to a Job section for details.
Monitoring an Analysis
Command-line monitoring of an analysis is not available. For information about monitoring a job from the command line, see the Watching a Job section.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs that run longer than 30 days are automatically terminated.
Providing Input JSON
This is identical to providing an input JSON to a job. For more information, see the Providing Input JSON section.
As in running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>. Here STAGE_ID is either an ID of the form stage-xxxx or the index of that stage in the workflow (starting with the first stage at index 0).
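As a sketch (the workflow and file IDs are placeholders), an input JSON file keyed by stage index can be passed with the -f flag:

```shell
cat > workflow_input.json <<'EOF'
{
  "0.reads_fastqgzs": [{"$dnanexus_link": "file-xxxx"}]
}
EOF
dx run workflow-xxxx -f workflow_input.json -y
```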
$ dx run "Exome Analysis Demo:Exome Analysis Workflow"
Entering interactive mode for input selection.
Input: Reads (bwa_mem_fastq_read_mapper.reads_fastqgzs)
Class: array:file
Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
[1] Read group information (bwa_mem_fastq_read_mapper.rg_info_csv)
.
.
.
[33] Output prefix (gatk4_genotypegvcfs.prefix)
[34] Extra command line options (gatk4_genotypegvcfs.extra_options) [default="-G StandardAnnotation --only-output-calls-starting-in-intervals"]
Optional param #: 0
Input: Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
Class: array:file
Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads2_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_2.fastq.gz"
bwa_mem_fastq_read_mapper.reads2_fastqgzs[1]:
Optional param #: <ENTER>
Using input JSON:
{
"bwa_mem_fastq_read_mapper.reads_fastqgzs": [
{
"$dnanexus_link": {
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"id": "file-B40jg7v8KfPy38kjz1vQ001y"
}
}
],
"bwa_mem_fastq_read_mapper.reads2_fastqgzs": [
{
"$dnanexus_link": {
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"id": "file-B40jgYG8KfPy38kjz1vQ0020"
}
}
]
}
Confirm running the executable with this input [Y/n]: <ENTER>
Calling workflow-xxxx with output destination project-xxxx:/
Analysis ID: analysis-xxxx
$ dx run "Exome Analysis Demo:Exome Analysis Workflow" -h
usage: dx run Exome Analysis Demo:Exome Analysis Workflow [-iINPUT_NAME=VALUE ...]
Workflow: GATK4 Exome FASTQ to VCF (hs38DH)
Runs GATK4 Best Practice for Exome on hs38DH reference genome
Inputs:
bwa_mem_fastq_read_mapper
Reads: -ibwa_mem_fastq_read_mapper.reads_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads_fastqgzs=... [...]]
An array of files, in gzipped FASTQ format, with the first read mates
to be mapped.
Reads (right mates): [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=... [...]]]
(Optional) An array of files, in gzipped FASTQ format, with the second
read mates to be mapped.
BWA reference genome index: [-ibwa_mem_fastq_read_mapper.genomeindex_targz=(file, default={"$dnanexus_link": {"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv", "id": "file-FFJPKp0034KY8f20F6V9yYkk"}})]
A file, in gzipped tar archive format, with the reference genome
sequence already indexed with BWA.
...
fastqc
Reads: [-ifastqc.reads=(file, default={"$dnanexus_link": {"stage": "bwa_mem_fastq_read_mapper", "outputField": "sorted_bam"}})]
A file containing the reads to be checked. Accepted formats are
gzipped-FASTQ and BAM.
...
gatk4_bqsr
Sorted mappings: [-igatk4_bqsr.mappings_sorted_bam=(file, default={"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}})]
A coordinate-sorted BAM or CRAM file with the base quality scores to
be recalibrated.
...
...
Outputs:
Sorted mappings: bwa_mem_fastq_read_mapper.sorted_bam (file)
A coordinate-sorted BAM file with the resulting mappings.
Sorted mappings index: bwa_mem_fastq_read_mapper.sorted_bai (file)
The associated BAM index file.
...
Variants index: gatk4_genotypegvcfs.variants_vcfgztbi (file)
The associated TBI file.
The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member has at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project.
[NONE], [VIEW], [UPLOAD], [CONTRIBUTE] or [ADMINISTER]
Project Sharing
Dictates the minimum org membership level required for a user to invite that org to a project.
[ADMIN] or [MEMBER]
Instance Upgrade on Job Restart
Controls whether the platform automatically retries a job on a larger instance type when the job fails with an AppInsufficientResourceError (out of memory or out of storage). When enabled, the platform upgrades the instance one step within the same instance family on retry. This policy corresponds to the allowInstanceUpgradeOnJobRestart org policy key.
[Enabled] or [Disabled] (default: Disabled)
Running Apps and Applets
You can run apps and applets from the command line using the command dx run, or from the UI. The inputs to these app(let)s can come from any project for which you have VIEW access.
Running in Interactive Mode
If dx run is run without specifying any inputs, interactive mode launches. When you run this command, the platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to run with the inputs you have selected.
Running in Non-interactive Mode
Naming Each Input
You can also specify each input parameter by name using the ‑i or ‑‑input flags with syntax ‑i<input name>=<input value>. Names of data objects in your project are resolved to the appropriate IDs and packaged correctly for the API method as shown below.
When specifying input parameters using the ‑i/‑‑input flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h, as shown below using the Swiss Army Knife app (platform login required to access this link).
The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, the Swiss Army Knife app has two primary inputs: one or more files, and a string to be executed on the command line, specified as -iin=file-xxxx and -icmd=<string>, respectively.
The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.
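A sketch of that command (the BAM file name is a placeholder for a file in your project):

```shell
# Sort a BAM file with samtools; new files in the working directory are output
dx run app-swiss-army-knife \
  -iin=my_sample.bam \
  -icmd='samtools sort my_sample.bam -o my_sample.sorted.bam' -y
```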
Specifying Array Input
To specify array inputs, repeat the ‑i/‑‑input flag for each element of the array; each file specified is appended to the array in the order it was entered on the command line. Below is an example of how to use the Swiss Army Knife app to index multiple BAM files (platform login required to access this link).
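A sketch of the repeated flags (the BAM file names are placeholders):

```shell
# Each -iin flag appends one file to the "in" array
dx run app-swiss-army-knife \
  -iin=sample1.bam -iin=sample2.bam -iin=sample3.bam \
  -icmd='ls *.bam | xargs -n1 samtools index' -y
```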
Job-Based Object References (JBORs)
Job-based object references (JBORs) can also be provided using the -i flag with the syntax ‑i<input name>=<job id>:<output name>. Combined with the --brief flag (which makes dx run output only the job ID) and the -y flag (which skips confirmation), you can string two jobs together in one command.
Below is an example of how to run the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), producing the output named sorted_bam as described in the help output of dx run app-bwa_mem_fastq_read_mapper -h. The sorted_bam output is then used as input for a second app (platform login required to access this link).
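A sketch of the chained one-liner (the reads file is a placeholder, and the Swiss Army Knife app is used here as an illustrative downstream consumer):

```shell
# Start the mapper and capture just its job ID
job=$(dx run app-bwa_mem_fastq_read_mapper -ireads_fastqgz=reads.fastq.gz -y --brief)
# Pass its sorted_bam output straight into a downstream app
dx run app-swiss-army-knife -iin="$job":sorted_bam -icmd='samtools flagstat *.bam' -y
```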
Advanced Options
Some examples of additional functionalities provided by dx run are listed below.
Quiet Output
Regardless of whether you run a job interactively or non-interactively, dx run prints the exact input JSON with which it calls the applet or app. To suppress this verbose output, use the --brief flag, which tells dx to print only the job ID instead. This job ID can then be saved for later reference.
To run jobs without being prompted for confirmation, use the -y or --yes option. This is especially helpful when scripting or automating job submissions.
If you want to both skip confirmation and immediately monitor the job's progress, use -y --watch. This starts the job and displays its logs in your terminal as it runs.
Rerunning a Job With the Same Settings
If you are debugging applet-xxxx and wish to rerun a job you previously ran, using the same settings (destination project and folder, inputs, instance type requests) but with a new executable applet-yyyy, you can use the --clone flag.
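A sketch of the command (both IDs are placeholders):

```shell
# Reuse job-xxxx's settings, but run the new executable applet-yyyy
dx run applet-yyyy --clone job-xxxx
```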
In the command above, specifying applet-yyyy overrides the executable used by job-xxxx, while the job's other settings are reused.
If you want to modify some but not all settings from the previous job, you can run dx run <executable> --clone job-xxxx [options]. The command-line arguments you provide in [options] override the settings reused from --clone. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.
The example shown below redirects the outputs of the job to the folder "outputs/".
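A sketch of the command (the job ID is a placeholder):

```shell
# Same executable and inputs as job-xxxx, but outputs go to outputs/
dx run --clone job-xxxx --destination outputs/ -y
```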
While the --clone job-xxxx flag copies the applet, instance type, and inputs, it does not copy usage of the --allow-ssh or --debug-on flags. These must be re-specified for each job run. For more information, see the page on connecting to jobs.
Specifying the Job Output Folder
The --destination flag allows you to specify the full project-ID:/folder/ path in which to output the results of the app(let). If this flag is unspecified, the output of the job defaults to the present working directory, which can be determined by running dx pwd.
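A sketch of the command (the applet and project IDs are placeholders):

```shell
dx run applet-xxxx --destination project-xxxx:/mappings
```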
In the above command, the flag --destination project-xxxx:/mappings instructs the job to output all results into the "mappings" folder of project-xxxx.
Specifying a Different Instance Type
The dx run --instance-type command allows you to specify the instance types to use for the job. More information is available by running the command dx run --instance-type-help.
Some apps and applets have multiple , meaning that different instance types can be specified for different functions executed by the app(let). In the example below, the (platform login required to access this link) is run while specifying the instance types for the entry points honey, ssake, ssake_insert, and main. Specifying the instance types for each entry point requires a JSON-like string, meaning that the string should be wrapped in single quotes, as explained earlier, and demonstrated below.
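A sketch using the entry points named above (the app ID and the instance types are placeholders); note the single quotes around the JSON mapping:

```shell
dx run app-xxxx -y \
  --instance-type '{"main": "mem1_ssd1_x4",
                    "honey": "mem3_ssd1_x8",
                    "ssake": "mem2_ssd1_x8",
                    "ssake_insert": "mem2_ssd1_x8"}'
```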
Adding Metadata to a Job
If you are running many jobs that have varying purposes, you can organize the jobs using metadata. Two types of metadata are available on the DNAnexus Platform: properties and tags.
Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property flag allows you to attach a property to a job, and the --tag flag allows you to tag a job.
Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs makes it easier to search for a particular job in your job history. This is useful when, for instance, you want to tag all jobs run on a particular sample.
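For example (the applet and file IDs, tag, and property values are illustrative):

```shell
# Tag the job and attach a sample property at launch time
dx run applet-xxxx -iin=file-xxxx --tag validation --property sample=NA12878 -y
# Later, find jobs carrying that tag
dx find jobs --tag validation
```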
Specifying an App Version
If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job. Append the app name with the version required, for example, app-xxxx/0.0.1 if the current version is app-xxxx/1.0.0.
Watching a Job
To monitor your job as it runs, use the --watch flag to display the job's logs in your terminal window as it progresses.
Providing Input JSON
You can also specify the input JSON in its entirety. To specify a data object, wrap it in a DNAnexus link (a key-value pair with a key of $dnanexus_link and a value of the data object's ID). Because you are providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, no confirmation is required before the job starts. The three methods for entering the full input JSON are discussed in separate sections below.
From the CLI
If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.
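A sketch of an inline input JSON (the file ID is a placeholder):

```shell
# Single quotes protect the JSON's double quotes from the shell
dx run app-swiss-army-knife -j '{
  "in": [{"$dnanexus_link": "file-xxxx"}],
  "cmd": "samtools index *.bam"
}' -y
```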
From a File
If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file followed by the name of the JSON file.
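For example, assuming job_input.json holds the same JSON shown for the -j flag:

```shell
dx run app-swiss-army-knife -f job_input.json -y
```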
From stdin
To provide the input JSON on stdin, use the -f flag as above, substituting "-" for the filename. Below is an example that echoes the input JSON to stdout and pipes it into dx run. As before, wrap the JSON in single quotes to avoid interfering with the double quotes used by the JSON itself.
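A sketch of piping the JSON in over stdin (the command string is illustrative):

```shell
echo '{"cmd": "ls -l"}' | dx run app-swiss-army-knife -f - -y
```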
Getting Additional Information on dx run
Executing the dx run --help command shows the flags available for use with dx run. The message printed by this command is identical to the dx run entry in the command-line reference.
Cost Run Limits
The --cost-limit cost_limit option sets the maximum cost of the job before termination. For workflows, the limit applies to the entire analysis; for batch runs, the limit applies per job. See dx run --help for more information.
Job Runtime Limits
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.
$ dx run app-bwa_mem_fastq_read_mapper
Entering interactive mode for input selection.
Input: Reads (reads_fastqgz)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
reads_fastqgz: reads.fastq.gz
Input: BWA reference genome index (genomeindex_targz)
Class: file
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes)
Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
genomeindex_targz: "Reference Genome Files:/H. Sapiens - hg19 (UCSC)/ucsc_hg19.bwa-index.tar.gz"
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (reads2_fastqgz)
[1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
[2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
[3] Read group platform (read_group_platform) [default="ILLUMINA"]
[4] Read group platform unit (read_group_platform_unit) [default="None"]
[5] Read group library (read_group_library) [default="1"]
[6] Read group sample (read_group_sample) [default="1"]
[7] Output all alignments for single/unpaired reads? (all_alignments)
[8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
[9] Advanced command line options (advanced_options)
Optional param #: <ENTER>
Using input JSON:
{
"reads_fastqgz": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-xxxx"
}
},
"genomeindex_targz": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-xxxx"
}
}
}
Confirm running the applet/app with this input [Y/n]: <ENTER>
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx
Watch launched job now? [Y/n] n
usage: dx run app-swiss-army-knife [-iINPUT_NAME=VALUE ...]
App: Swiss Army Knife
Version: 5.1.0 (published)
A multi-purpose tool for all your basic analysis needs
See the app page for more information:
https://platform.dnanexus.com/app/swiss-army-knife
Inputs:
Input files: [-iin=(file) [-iin=... [...]]]
(Optional) Files to download to instance temporary folder before
command is executed.
Command line: -icmd=(string)
Command to execute on instance. View the app readme for details.
Whether to use "dx-mount-all-inputs"?: [-imount_inputs=(boolean, default=false)]
(Optional) Whether to mount all files that were supplied as inputs to
the app instead of downloading them to the local storage of the
execution worker.
Public Docker image identifier: [-iimage=(string)]
(Optional) Instead of using the default Ubuntu 24.04 environment, the
input command <CMD> will be run using the specified publicly
accessible Docker image <IMAGE> as it would be when running 'docker
run <IMAGE> <CMD>'. Example image identifiers are 'ubuntu:25.04',
'quay.io/ucsc_cgl/samtools'. Cannot be specified together with
'image_file'. This input relies on access to internet and is unusable
in an internet-restricted project.
Platform file containing Docker image accepted by `docker load`: [-iimage_file=(file)]
(Optional) Instead of using the default Ubuntu 24.04 environment, the
input command <CMD> will be run using the Docker image <IMAGE> loaded
from the specified image file <IMAGE_FILE> as it would be when running
'docker load -i <IMAGE_FILE> && docker run <IMAGE> <CMD>'. Cannot be
specified together with 'image'.
Outputs:
Output files: [out (array:file)]
(Optional) New files that were created in temporary folder.
The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform, to upload, browse, and organize data, and to launch analyses.
All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.
Before You Begin
If you haven't already done so, install the DNAnexus SDK (dx-toolkit), which includes the dx command-line client, as well as a range of useful utilities.
Getting Help
As you work, use the dx command-line reference as needed.
On the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.
To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:
Step 1: Log In
The first step is to log in, using the command dx login. If you have not created a DNAnexus account, open the DNAnexus website and sign up. User signup is not supported on the command line.
Your credentials and your current project settings are saved in a local configuration file, and you can start accessing your project.
You can generate an authentication token from the online DNAnexus Platform user interface.
Step 2: Explore
Public Projects
Look inside some public projects that have already been set up. From the command line, enter the command:
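Assuming the --public flag (which lists publicly available projects rather than your own):

```shell
dx select --public
```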
By running the command and picking a project, you perform the command-line equivalent of going to the project page for Reference Genome Files (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular reference genomes for use in analyses with your own data.
For more information about the dx select command, see the page.
DNAnexus-sponsored data is free to copy from this project as many times as needed.
List the data in the top-level directory of the project you've selected by running the command dx ls. View the contents of a folder by running the command dx ls <folder_name>.
When you use wildcard characters like * or ? with dx commands, always enclose the pattern in quotes. For example, use dx ls "*.fastq". Without quotes, your shell expands the wildcards against local files before passing them to dx. This produces unexpected results. For details, see .
You can avoid typing out the full name of the folder by typing in dx ls C and then pressing <TAB>. The folder name auto-completes from there.
You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, the contents of the publicly available project "Demo Data" are listed using both its name and ID.
As shown above, you can use the -l flag with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.
Describing DNAnexus Objects
You can use the command dx describe to learn more about objects on the platform. Given a DNAnexus object ID or name, dx describe returns detailed information about the object. dx describe only returns results for data objects to which you have access.
Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.
Describing a File
Below, the C. elegans reference genome file in the publicly available "Reference Genome Files: AWS US (East)" project is described (the file should be accessible from other regions as well). Because the project name itself contains a colon, escape that colon and append the separating colon after the project name: Reference Genome Files\: AWS US (East):
Describing a Project
Below, the publicly available Reference Genome Files project itself is described.
Step 3: Create Your Own Project
Use the command dx new project to create a new project.
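For example (the project name is illustrative; --select makes the new project your current one):

```shell
dx new project "My First Project" --select
```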
The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the page.
The project is ready for uploading data and running analyses.
The dx new command can also create other new data objects, including new orgs or users. Use the command dx help new for more information.
Step 4: Upload and Manage Your Data
To upload a sample for analysis, use the command dx upload, or the Upload Agent if installed. For this tutorial, download the file small-celegans-sample.fastq, which contains the first 25,000 C. elegans reads from SRR070372. This file is used in the sample analysis below.
For uploading multiple or large files, use the Upload Agent. It compresses files, uploads them in parallel over multiple HTTP connections, and supports resumable uploads.
The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until uploading is complete before returning the prompt and describing the result.
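A sketch of the command:

```shell
dx upload --wait small-celegans-sample.fastq
```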
If you run the same command but add the flag --brief, only the file ID (in the form of file-xxxx) is printed to the terminal. Other dx commands also accept the --brief flag and report only object IDs.
Examining Data
To take a quick look at the first few lines of the file you uploaded, use the command dx head. By default, it prints the first 10 lines of the given file.
Run it on the file you uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.
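A sketch of the command:

```shell
# FASTQ records are 4 lines each, so 12 lines show the first 3 reads
dx head -n 12 small-celegans-sample.fastq
```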
Downloading Data
If you'd like to download a file from the platform, use the command dx download. This command uses the name of the file as the filename unless you specify your own with the -o or --output flag. The example below downloads the same C. elegans file that was uploaded previously.
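A sketch of the command:

```shell
dx download small-celegans-sample.fastq
```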
About Metadata
Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".
Step 5: Analyze a Sample
For the next few steps, if you would like to follow along, you need a C. elegans FASTQ file. This tutorial maps the reads against the ce10 genome. If you haven't already, you can download and use the FASTQ file mentioned above, which contains the first 25,000 reads from SRR070372.
You can also substitute your own reads file for a different species (though it may take longer to run the example). For convenience, DNAnexus has already imported a variety of reference genomes to the platform. If you have a FASTA file to use, upload it and create genome indices for BWA using the appropriate indexing app (platform login required to access these links).
The following walkthrough explains what each command does and shows which apps run. If you only want to convert a gzipped FASTQ file to a VCF via BWA and the FreeBayes Variant Caller, skip ahead to see the commands required to run the apps.
Uploading Reads
If you have not yet done so, you can upload a FASTQ file for analysis.
For more information about using the command dx upload, see its documentation page.
Mapping Reads
Next, use the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to map the uploaded reads file to a reference genome.
Finding the App Name
If you don't know the command-line name of the app to run, you have two options:
Navigate to its web page from the apps listing on the DNAnexus Platform (platform login required to access this link). The app's page shows how to run it from the command line and gives details on the app used here (platform login required).
Alternatively, search for apps from the command line by running dx find apps. The command-line name appears in parentheses in the output (underlined below).
Installing and Running the App
Install the app using dx install, and check that it has been installed. While you do not always need to install an app to run it, installation can serve as a bookmarking tool.
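A sketch of the two steps:

```shell
dx install app-bwa_mem_fastq_read_mapper
# List installed apps to confirm
dx find apps --installed
```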
You can run the app using dx run. When you run it without any arguments, it prompts you for required and then optional arguments. The reference file genomeindex_targz for this C. elegans sample is in .tar.gz format and can be found in the Reference Genome Files project for the region your project is in.
Monitoring Your Job
You can use the command dx watch to monitor jobs. It prints the job's log, including the STDOUT, STDERR, and INFO printouts.
You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, use the command dx find jobs to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.
Additional options are available to restrict your search of previous jobs, such as by their names or when they were run.
Terminating Your Job
If for some reason you need to terminate your job before it completes, use the command dx terminate job-xxxx.
After Your Job Finishes
You should see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!
Variant Calling
You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.
This time, instead of relying on interactive mode to enter inputs, you provide them directly. First, look up the app's spec to determine the input names. Run the command dx run freebayes -h.
Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. Notice that there are two required inputs that must be specified:
Sorted mappings (sorted_bams): A list of files with a .bam extension.
Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.
You can also run dx describe freebayes for a more compact view of the input and output specifications. By default, it hides the advanced input options, but you can view them using the --verbose flag.
Running the App with a One-Liner Using a Job-Based Object Reference
It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. Use the -i flag to specify inputs as suggested by the output of dx run freebayes -h:
sorted_bams: The output of the previous BWA-MEM step, referenced using the job-based object reference syntax described below.
genome_fastagz: The ce10 genome in the Reference Genomes project.
To specify new job input using the output of a previous job, use a job-based object reference via the job-xxxx:<output field> syntax used earlier.
You can use job-based object references as input even before the referenced jobs have finished. The system waits until the input is ready to begin the new job.
Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.
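A sketch of such a one-liner follows. The output field name sorted_bam is an assumption based on the BAM output of the BWA-MEM app; substitute your own job ID, and use the reference genome project for your region:

```shell
$ dx run freebayes \
    -isorted_bams=job-xxxx:sorted_bam \
    -igenome_fastagz="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz" \
    -y
```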
Automatically Running a Command After a Job Finishes
Use the dx wait command to wait for a job to finish. If you run the following command immediately after launching the FreeBayes app, it shows recent jobs only after the job has finished, as shown in the example below.
Congratulations! You have called variants on a reads sample using the command line. Next, see how to automate this process.
Automation
The CLI enables automation of these steps. The following script assumes that you are logged in. It is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.
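The script below is a sketch of that automation, not a definitive implementation: the input and output field names (reads_fastqgz, genomeindex_targz, sorted_bam, genome_fastagz) follow the examples above, and the reference project paths assume the AWS US (East) region.

```shell
#!/bin/bash
# Sketch: upload reads, map with BWA-MEM, then call variants with FreeBayes.
# Assumes you are already logged in via dx login.
set -e -x

fastq_file="$1"   # local gzipped FASTQ file, passed as the only argument

# Upload the reads and capture the new file ID (--brief prints only the ID)
reads_id=$(dx upload --brief "$fastq_file")

# Map the reads against the ce10 BWA index, capturing the job ID
bwa_job=$(dx run bwa_mem_fastq_read_mapper \
  -ireads_fastqgz="$reads_id" \
  -igenomeindex_targz="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.bwa-index.tar.gz" \
  --brief -y)

# Call variants, using a job-based object reference to the BWA output;
# the Platform waits for the BWA job to finish before starting FreeBayes
dx run freebayes \
  -isorted_bams="$bwa_job":sorted_bam \
  -igenome_fastagz="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz" \
  --brief -y
```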
Learn More
You can start scripting using dx. The --brief flag is particularly useful for scripting. A list of all dx commands and flags is available in the command-line reference documentation.
For more detailed information about running apps and applets from the command line, see the documentation on running apps and applets.
For a comprehensive guide to the DNAnexus SDK, see the SDK documentation.
Want to start writing your own apps? Check out the developer tutorials.
Monitoring Executions
Learn how to get information on current and past executions via both the UI and the CLI.
$ dx help ls
usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--brief | --summary | --verbose] [-a] [-l] [--obj]
[--folders] [--full]
[path]
List folders and/or objects in a folder
... # output truncated for brevity
$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: <your username>
Password: <your password>
No projects to choose from. You can create one with the command "dx new project".
To pick from projects for which you only have VIEW permissions, use "dx select --level VIEW" or "dx select --public".
$ dx ls
C. Elegans - Ce10/
D. melanogaster - Dm3/
H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/
H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/
H. Sapiens - GRCh38/
H. Sapiens - hg19 (Ion Torrent)/
H. Sapiens - hg19 (UCSC)/
M. musculus - mm10/
M. musculus - mm9/
$ dx ls "C. Elegans - Ce10/"
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
... # output truncated for brevity
$ dx ls "Demo Data:/SRR100022/"
SRR100022_1.filt.fastq.gz
SRR100022_2.filt.fastq.gz
$ dx ls -l "project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/"
Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
Folder : /SRR100022
State Last modified Size Name (ID)
... # output truncated for brevity
$ dx describe "Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
Result 1:
ID file-BQbY9Bj015pB7JJVX0vQ7vj5
Class file
Project project-BQpp3Y804Y0xbyG4GJPQ01xv
Folder /C. Elegans - Ce10
Name ce10.fasta.gz
State closed
Visibility visible
Types -
Properties Assembly=UCSC ce10,
Origin=https://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZips/ce10.2bit,
Species=Caenorhabditis elegans,
Taxonomy
ID=6239
Tags -
Outgoing links -
Created Tue Sep 30 18:54:35 2014
Created by bhannigan
via the job job-BQbY8y80KKgP380QVQY000qz
Last modified Thu Mar 2 12:17:27 2017
Media type application/x-gzip
archivalState "live"
Size 29.21 MB, sponsored by DNAnexus
$ dx describe "Reference Genome Files\: AWS US (East):"
Result 1:
ID project-BQpp3Y804Y0xbyG4GJPQ01xv
Class project
Name Reference Genome Files: AWS US (East)
Summary
Billed to org-dnanexus
Access level VIEW
Region aws:us-east-1
Protected true
Restricted false
Contains PHI false
Created Wed Oct 8 16:42:53 2014
Created by tnguyen
Last modified Tue Oct 23 14:15:59 2018
Data usage 0.00 GB
Sponsored data 519.77 GB
Sponsored egress 0.00 GB used of 0.00 GB total
Tags -
Properties -
downloadRestricted false
defaultInstanceType "mem2_hdd2_x2"
$ dx new project "My First Project"
Created new project called "My First Project"
(project-xxxx)
Switch to new project now? [y/N]: y
$ dx upload --wait small-celegans-sample.fastq
[===========================================================>] Uploaded (16801690 of 16801690 bytes) 100% small-celegans-sample.fastq
ID file-xxxx
Class file
Project project-xxxx
Folder /
Name small-celegans-sample.fastq
State closed
Visibility visible
Types -
Properties -
Tags -
Details {}
Outgoing links -
Created Sun Jan 1 09:00:00 2017
Created by amy
Last modified Sun Jan 1 09:00:00 2017
Media type text/plain
Size 16.02 MB
$ dx run bwa_mem_fastq_read_mapper
Entering interactive mode for input selection.
Input: Reads (reads_fastqgz)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory,'?' for help)
reads_fastqgz[0]: <small-celegans-sample.fastq.gz>
Input: BWA reference genome index (genomeindex_targz)
Class: file
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-\* (DNAnexus Reference Genomes)
Enter file ID or path (<TAB> twice for compatible files in current
directory,'?' for more options)
genomeindex_targz: <"Reference Genome Files\: <REGION_OF_PROJECT>:/C. Elegans - Ce10/ce10.bwa-index.tar.gz">
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (reads2_fastqgz)
[1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
[2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
[3] Read group platform (read_group_platform) [default="ILLUMINA"]
[4] Read group platform unit (read_group_platform_unit) [default="None"]
[5] Read group library (read_group_library) [default="1"]
[6] Read group sample (read_group_sample) [default="1"]
[7] Output all alignments for single/unpaired reads? (all_alignments)
[8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
[9] Advanced command line options (advanced_options)
Optional param #: <ENTER>
Using input JSON:
{
  "reads_fastqgz": {
    "$dnanexus_link": {
      "project": "project-B3X8bjBqqBk1y7bVPkvQ0001",
      "id": "file-B3P6v02KZbFFkQ2xj0JQ005Y"
    }
  },
  "genomeindex_targz": {
    "$dnanexus_link": {
      "project": "project-xxxx (project ID for the reference genome in your region)",
      "id": "file-BQbYJpQ09j3x9Fj30kf003JG"
    }
  }
}
Confirm running the applet/app with this input [Y/n]: <ENTER>
Calling app-BP2xVx80fVy0z92VYVXQ009j with output destination
project-xxxx:/
Job ID: job-xxxx
usage: dx run freebayes [-iINPUT_NAME=VALUE ...]
App: FreeBayes Variant Caller
Version: 3.0.1 (published)
Calls variants (SNPs, indels, and other events) using FreeBayes
See the app page for more information:
https://platform.dnanexus.com/app/freebayes
Inputs:
Sorted mappings: -isorted_bams=(file) [-isorted_bams=... [...]]
One or more coordinate-sorted BAM files containing mappings to call
variants for.
Genome: -igenome_fastagz=(file)
A file, in gzipped FASTA format, with the reference genome that the
reads were mapped against.
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes: AWS US (East))
project-F3zxk7Q4F30Xp8fG69K1Vppj://file-* (DNAnexus Reference Genomes: AWS Germany)
project-F0yyz6j9Jz8YpxQV8B8Kk7Zy://file-* (DNAnexus Reference Genomes: Azure US (West))
project-F4gXb605fKQyBq5vJBG31KGG://file-* (DNAnexus Reference Genomes: AWS Sydney)
project-FGX8gVQB9X7K5f1pKfPvz9yG://file-* (DNAnexus Reference Genomes: Azure Amsterdam)
project-GvGXBbk36347jYPxP0j755KZ://file-* (DNAnexus Reference Genomes: Bahrain)
Target regions: [-itargets_bed=(file)]
(Optional) A BED file containing the coordinates of the genomic
regions to intersect results with. Supplying this will cause 'bcftools
view -R' to be used, to limit the results to that subset. This option
does not speed up the execution of FreeBayes.
Suggestions:
project-B6JG85Z2J35vb6Z7pQ9Q02j8:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS US (East))
project-F3zqGV04fXX5j7566869fjFq:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Germany)
project-F29g0xQ90fvQf5z1BX6b5106:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure US (West))
project-F4gYG1850p1JXzjp95PBqzY5:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Sydney)
project-FGXfq9QBy7Zv5BYQ9Yvqj9Xv:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure Amsterdam)
project-GvGXBZk3f624QVfBPjB8916j:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Bahrain)
Common
Output prefix: [-ioutput_prefix=(string)]
(Optional) The prefix to use when naming the output files (they will
be called prefix.vcf.gz, prefix.vcf.gz.tbi). If not provided, the
prefix will be the same as the first BAM file given.
Apply standard filters?: [-istandard_filters=(boolean, default=true)]
Select this to use stringent input base and mapping quality filters,
which may reduce false positives. This will supply the
'--standard-filters' option to FreeBayes.
Normalize variants representation?: [-inormalize_variants=(boolean, default=true)]
Select this to use 'bcftools norm' in order to normalize the variants
representation, which may help with downstream compatibility.
Perform parallelization?: [-iparallelized=(boolean, default=true)]
Select this to parallelize FreeBayes using multiple threads. This will
use the 'freebayes-parallel' script from the FreeBayes package, with a
granularity of 3 million base pairs. WARNING: This option may be
incompatible with certain advanced command-line options.
Advanced
Report genotype qualities?: [-igenotype_qualities=(boolean, default=false)]
Select this to have FreeBayes report genotype qualities.
Add RG tags to BAM files?: [-ibam_add_rg=(boolean, default=false)]
Select this to have FreeBayes add read group tags to the input BAM
files so each file will be treated as an individual sample. WARNING:
This may increase the memory requirements for FreeBayes.
Advanced command line options: [-iadvanced_options=(string)]
(Optional) Advanced command line options that will be supplied
directly to the FreeBayes program.
Outputs:
Variants: variants_vcfgz (file)
A bgzipped VCF file with the called variants.
Variants index: variants_tbi (file)
A tabix index (TBI) file with the associated variants index.
On the Projects list page, find and click on the name of the project within which the execution was launched.
Click on the Monitor tab to open the Monitor screen.
The Monitor screen shows a list of executions launched within the project. By default, executions appear in reverse chronological order, with the most recently launched execution at the top.
Find the row displaying information on the execution.
For an analysis (the execution of a workflow), click the "+" icon to the left of the analysis name to expand the row and view information on its stages. For executions with further descendants, click the "+" icon next to the name to expand the row and show additional details.
To see additional information on an execution, click its name to open its details page.
The following shortcuts allow you to view information from the details page directly on the list page, or relaunch an execution:
To view the info panel for an execution, do either of the following:
Available Basic Information on Executions
The list on the Monitor screen displays the following information for each execution that is running or has been run within the project:
Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either via the UI, or via the CLI. The execution's name is used in Platform email alerts related to the execution. Clicking on a name in the executions list opens the execution details page, giving in-depth information on the execution.
State - This is the execution's state. State values include:
"Waiting" - The execution awaits Platform resource allocation or completion of dependent executions.
"Running" - The job is actively executing.
"In Progress" - The analysis is actively processing.
"Done" - The execution completed successfully without errors.
"Failed" - The execution encountered an error and could not complete. See "Getting Help with Failed Executions" below for troubleshooting assistance.
"Partially Failed" - One or more workflow stages did not finish successfully, and at least one stage is not yet in a terminal state (either "Done," "Failed," or "Terminated").
"Terminating" - The worker has initiated but not completed the termination process.
"Terminated" - The execution stopped before completion.
"Debug Hold" - The execution, run with debugging options, encountered an applicable failure and entered debugging hold.
Executable - The executable or executables run during the execution. If the execution is an analysis, each stage appears in a separate row, including the name of the executable run during the stage. If an informational page exists with details about the executable's configuration and use, the executable name becomes clickable, and clicking displays that page.
Tags - Tags are strings associated with objects on the platform. They are a flexible way to label and organize objects.
Launched By - The name of the user who launched the execution.
Launched On - The time at which the execution was launched. This time often precedes the time in the Started Running column due to executions waiting for available resources before starting.
Started Running - The time at which the execution started running, if it has done so. This is not always the same as the launch time, since an execution may wait for available resources before starting.
Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.
Cost - A value is displayed in this column when the user has access to billing info for the execution. For a running execution, the figure is an estimate of the charges incurred so far; for a completed execution, it is the total cost incurred.
Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, via either the UI or the CLI. This setting determines the scheduling priority of the execution relative to other executions that are waiting to be launched.
Worker URL - If the execution runs an executable with direct web URL connection capability, the URL appears here. Clicking the URL opens a connection to the executable in a new browser tab.
Output Folder - For each execution, the value shows a path relative to the project's root folder. Click the value to open the folder containing the execution's outputs.
Additional Basic Information
Additional basic information can be displayed for each execution. To do this:
Click on the "table" icon at the right edge of the table header row.
Select one or more of the entries in the list, to display an additional column or columns.
Available additional columns include:
Stopped Running - The time at which the execution stopped running.
Custom properties columns - If custom properties have been assigned to any of the listed executions, a column can be added to the table for each such property, showing the value assigned to each execution.
Customizing the Executions List Display
To remove columns from the list, click on the "table" icon at the right edge of the table header row, then de-select one or more of the entries in the list, to hide the column or columns.
Filtering the Executions List
A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.
By default, pills are available to set search criteria for filtering executions by one or more of these attributes:
Launched By - The user who launched an execution or executions
Launch Time - The time range within which executions were launched
Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.
Search Scope
By default, filters are set to display only root executions that meet the criteria defined in the filter. To include all executions, including those run during individual stages of workflows, click the button above the left edge of the executions list showing the default value "Root Executions Only," then click "All Executions."
Saving and Reusing Filters
To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.
To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.
Terminating an Execution from the Monitor Screen
If you launched an execution or have contributor access to the project in which the execution is running, you can terminate the execution from the list on the Monitor screen when it is in a non-terminal state. You can also terminate executions launched by other project members if you have project admin status.
To terminate an execution:
Find the execution in the list, then do either of the following:
Select the execution by clicking on the row. Click the red Terminate button that appears at the end of the header.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Terminate in the menu.
A modal window opens, asking you to confirm the termination. Click Terminate to confirm.
The execution's state changes to "Terminating" during termination, then to "Terminated" once complete.
Getting Detailed Information on an Execution via the UI
For additional information about an execution, click its name in the list on the Monitor screen to open its details page.
Available Detailed Information on Executions
The details page for an execution displays a range of information:
High-level details
For a standalone execution, such as a job without subjobs, the display shows a single entry with details about the execution state, start and stop times, and duration in the running state.
For an execution with descendants, such as an analysis with multiple stages, the display shows a list with each row containing details about stage executions. For executions with descendants, click the "+" icon next to the name to expand the row and view descendant information. A page displaying detailed information on a stage appears when clicking on its name in the list. To navigate back to the workflow's details page, click its name in the "breadcrumb" navigation menu in the top right corner of the screen.
Execution state
In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times compared to each other. The colors include:
Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.
Green - A green bar indicates that the execution is in the "Done" state.
Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.
Grey - A grey bar indicates that the execution is in the "Terminated" state.
Execution start and stop times
Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").
Inputs
This section lists the execution inputs. Available input files appear as hyperlinks to their project locations. For inputs from other workflow executions, the source execution name appears as a hyperlink to its details page.
Outputs
This section lists the execution's outputs. Available output files appear as hyperlinks. Click a link to open the folder containing the output file.
Log files
An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files, and, as needed, download them in .txt format:
To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.
To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage, in the Execution Tree section.
For executions reusing results from another execution, the information appears in a blue pane above the Execution Tree section. Click the source execution's name to see details about the execution that generated these results.
Getting Help with Failed Executions
For failed executions, a Cause of Failure pane appears above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:
Click the button labeled Send Failure Report to DNAnexus Support.
A form opens in a modal window, with pre-populated Subject and Message fields containing diagnostic information for DNAnexus Support.
Click the button in the Grant Access section to grant DNAnexus Support "View" access to the project, enabling faster issue diagnosis and resolution.
Click Send Report to send the report.
Launching a New Execution
To re-launch a job from the execution details screen:
Click the Launch as New Job button in the upper right corner of the screen.
A new browser tab opens, displaying the Run App / Applet form.
Configure the run, then click Start Analysis.
To re-launch an analysis from the execution details screen:
Click the Launch as New Analysis button in the upper right corner of the screen.
A new browser tab opens, displaying the Run Analysis form.
Configure the run, then click Start Analysis.
Saving a Workflow as a New Workflow
To save a copy of a workflow along with its input configurations under a new name from the execution details screen:
Click the Save as New Workflow button in the upper right corner of the screen.
In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.
Click Save.
Viewing Initial Tries for Restarted Jobs
As described in job states, jobs can be configured to restart automatically on certain types of failures.
If you want to view the execution details for the initial tries for a restarted job:
Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.
A modal window opens.
Click the name of the try for which you'd like to view execution details.
You can only send a failure report for the most recent try, not for any previous tries.
Monitoring a Job via the CLI
You can use dx watch to view the log of a running job or any past jobs, which may have finished successfully, failed, or been terminated.
Monitoring a Running Job
Use dx watch to view a job's log stream during execution. The log stream includes stdout, stderr, and additional worker output information.
Terminating a Job
To terminate a job before completion, use the command dx terminate.
Monitoring Past Jobs
Use the dx watch command to view the logs of completed jobs. The log stream includes stdout, stderr, and additional worker output information from the execution.
Finding Executions via the CLI
Use dx find executions to display the ten most recent executions in your current project. Specify a different number of executions by using dx find executions -n <specified number>. The output matches the information shown in the "Monitor" tab on the DNAnexus web UI.
Below is an example of dx find executions. In this case, only two executions have been run in the current project. An individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow, are shown. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).
The job running the DeepVariant Germline Variant Caller executable has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of two stages: FreeBayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.
Using dx find executions
The dx find executions operation searches for jobs or analyses created when a user runs an app or applet. For jobs that are part of an analysis, the results appear in a tree representation linking related jobs together.
By default, dx find executions displays up to ten of the most recent executions in your current project, ordered by creation time.
Filter executions by job type using command flags: --origin-jobs shows only original jobs, while --all-jobs includes both original jobs and subjobs.
Finding Analyses via the CLI
You can monitor analyses by using the command dx find analyses, which displays the top-level analyses, excluding contained jobs. Analyses are executions of workflows and consist of one or more app(let)s being run.
Below is an example of dx find analyses:
Finding Jobs via the CLI
Jobs are runs of an individual app(let) and compose analyses. Monitor jobs using the command dx find jobs to display a flat list of jobs. For jobs within an analysis, the command returns all jobs in that analysis.
Below is an example of dx find jobs:
Advanced CLI Monitoring Options
Searches for executions can be restricted to specific parameters.
Viewing stdout and/or stderr from a Job Log
To extract only stdout from a job, run the command dx watch job-xxxx --get-stdout.
To extract only stderr from a job, run the command dx watch job-xxxx --get-stderr.
To extract both stdout and stderr from a job, run the command dx watch job-xxxx --get-streams.
Below is an example of viewing stdout lines of a job log:
Viewing Subjobs
To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
Viewing the Most Recent n Messages of a Job Log
To view the most recent n messages from a job log, use the command dx watch job-xxxx -n 8 (here, the 8 most recent). If the job has already finished, its output is displayed as well.
In the example below, the app Sample Prints doesn't have any output.
Finding and Examining Initial Tries for Restarted Jobs
Jobs can be configured to restart automatically on certain types of failures as described in the Restartable Jobs section. To view initial tries of the restarted jobs along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.
Searching Across All Projects
By default, dx find restricts searches to your current project context. Use the --all-projects flag to search across all accessible projects.
Returning More Than Ten Results
By default, dx find returns up to ten of the most recently launched executions matching your search query. Use the -n option to change the number of executions returned.
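The two flags can be combined; for example, to list up to 25 recent executions across all projects you can access:

```shell
$ dx find executions --all-projects -n 25
```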
Searching by Executable
You can restrict the search to executions of a specific app(let) or workflow, based on its entity ID.
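For example, with a placeholder entity ID (check dx find executions -h to confirm the flag name in your dx-toolkit version):

```shell
$ dx find executions --executable app-xxxx
```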
Searching by Execution Start Time
Users can also use the --created-before and --created-after options to search based on when the execution was created.
Searching by Date
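For example, to find jobs created within a given date range (the dates below are illustrative):

```shell
$ dx find jobs --created-after=2023-06-01 --created-before=2023-06-30
```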
Searching by Time
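These options also accept relative times; for example, to find jobs created within the last two days:

```shell
$ dx find jobs --created-after=-2d
```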
Searching by Execution State
Users can also restrict the search to a specific state, for example, "done", "failed", "terminated".
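For example, to list only failed jobs in the current project:

```shell
$ dx find jobs --state failed
```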
Scripting
Delimiters
The --delim flag produces tab-delimited output, suitable for processing by other shell commands.
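Because each record is tab-separated, standard tools such as cut can pull out individual fields. The record below is a simulated sample, not real dx output, and the column layout is illustrative:

```shell
# A simulated tab-delimited record of the kind dx find jobs --delim emits
line=$'done\tBWA-MEM FASTQ Read Mapper\tamy\tjob-xxxx'

# Extract the fourth field (here, the job ID) with cut
printf '%s\n' "$line" | cut -f 4
# prints: job-xxxx
```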
Returning Only IDs
Use the --brief flag to display only the object IDs for objects returned by your search query. The --origin-jobs flag excludes subjob information.
Below is an example usage of the --brief flag:
Below is an example of using the flags --origin-jobs and --brief. In the example below, the last job run in the current default project is described.
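Bare IDs chain cleanly into other commands. In practice you would pipe the real command, e.g. dx find jobs --origin-jobs --brief | xargs -n 1 dx describe; the pipeline is simulated below with placeholder IDs:

```shell
# Simulated --brief output: bare job IDs, one per line.
# xargs -n 1 runs the command once per ID.
printf 'job-xxxx\njob-yyyy\n' | xargs -n 1 echo dx describe
# prints:
#   dx describe job-xxxx
#   dx describe job-yyyy
```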
Rerunning Time-Specific Failed Jobs With Updated Instance Types
Rerunning Failed Executions With an Updated Executable
# From within the app directory, update the app
$ dx build -a
INFO:dxpy:Archived app app-xxxx to project-xxxx:"/.App_archive/bwa_mem_fastq_read_mapper (Sun Jan 1 09:00:00 2024)"
{"id": "app-yyyy"}
# Rerun the job with the updated app
$ dx run bwa_mem_fastq_read_mapper --clone job-xxxx
$ dx find jobs --tag TAG
Click the Info icon, above the right edge of the executions list, if it's not already selected, and then select the execution by clicking on the row.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Info in the fly out menu.
To view the log for a job, do either of the following:
Select the execution by clicking on the row. When a View Log button appears in the header, click it,
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Log in the fly out menu.
To launch an execution as a new job, do either of the following:
Select the execution by clicking on the row. When a Launch as New Job button appears in the header, click it.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row, then select Launch as New Job in the menu.
To launch an execution as a new analysis, do either of the following:
Select the execution by clicking on the row. When a Launch as New Analysis button appears in the header, click it.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Launch as New Analysis in the menu.
Fetches a file from a URL onto the DNAnexus Platform
Unix shell on a Platform cloud worker in your browser. Use it for on-demand CLI operations and to launch httpsApp-enabled apps or applets on 2 extra ports
Sort alignment result based on coordinates
| App name | Tool | Use case |
| --- | --- | --- |
| rnaseqc | | QC; Transcriptomics Expression Quantification |
| fastqc | FastQC | Transcriptomics Expression Quantification |
| bwa_fasta_indexer | BWA (bwa index) | Short read alignment; building reference for BWA alignment |
| bwa_mem_fastq_read_mapper | BWA-MEM | Short read alignment |
| star_generate_genome_index | STAR (Spliced Transcripts Alignment to a Reference), --runMode genomeGenerate | RNA Seq indexing |
| star_mapping | STAR (Spliced Transcripts Alignment to a Reference) | RNA Seq mapping |
| subread_feature_counts | featureCounts | Read summarization, RNAseq |
| salmon_index_builder | Salmon | Transcriptomics Expression Quantification |
| salmon_mapping_quant | Salmon | Transcriptomics Expression Quantification |
| trimmomatic | | QC; read quality trimming, adapter trimming |
Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in spark databases and tables
This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.
A license is required to create a DNAnexus app or applet from the Nextflow script folder. Contact DNAnexus Sales for more information.
This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.
To run a Nextflow pipeline on the DNAnexus Platform:
Import the pipeline script from a remote repository or local disk.
Convert the script to an app or applet.
Run the app or applet.
You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx-toolkit.
Use the latest version of the dx-toolkit to take advantage of recent improvements and bug fixes.
As of dx-toolkit version v0.391.0, pipelines built using dx build --nextflow default to running on Ubuntu 24.04. To use Ubuntu 20.04 instead, override the default by specifying the release in --extra-args:
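A sketch of such an override; the pipeline path is a placeholder, and runSpec.release is the run-specification field that carries the Ubuntu release:

```shell
dx build --nextflow /path/to/pipeline \
  --extra-args '{"runSpec": {"release": "20.04"}}'
```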
Quickstart
Pipeline Script Folder Structure
A Nextflow pipeline is structured as a folder containing Nextflow scripts, along with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.
(Optional) A .
An nf-core flavored folder structure is encouraged but not required.
Importing a Nextflow Pipeline
Import via UI
To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project's Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.
Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, the . Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.
An example of the Import Pipeline/Workflow modal:
Click the Start Import button after providing the necessary information. This starts a pipeline import job in the project specified in the Import To field (default is the current project).
After launching the import job, a status message "External workflow import job started" appears.
Access information about the pipeline import job in the project's Monitor tab:
After the import finishes, the imported pipeline executable exists as an applet. This is the output of the pipeline import job:
The newly created Nextflow pipeline applet appears in the project, for example, hello.
Import via CLI from a Remote Repository
To import a Nextflow pipeline from a remote repository via the CLI, run the following command, specifying the repository's URL along with any optional information:
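For example, importing the public nextflow-io/hello pipeline; the destination project and path are placeholders:

```shell
dx build --nextflow \
  --repository https://github.com/nextflow-io/hello \
  --destination project-xxxx:/applets/hello
```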
Use the latest version of the dx-toolkit to take advantage of recent improvements and bug fixes.
All versions beginning with v0.338.0 support converting Nextflow pipelines to apps or applets.
This documentation covers features available in dx-toolkit versions beginning with v0.370.0.
Your destination project's billTo must have the Nextflow pipeline applet building feature enabled. Contact DNAnexus Sales for more information.
For Nextflow pipelines stored in private repositories, access requires credentials provided via the --git-credentials option with a DNAnexus file containing your authentication details. The file should be specified using either its qualified ID or path on the Platform. See the Private Nextflow Pipeline Repository section for more details on setting up and formatting these credentials.
Once the pipeline import job finishes, it generates a new Nextflow pipeline applet with an applet ID in the form applet-zzzz.
Use dx run -h to get more information about running the applet:
Building from a Local Disk
Through the CLI you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the nextflow-io/hello pipeline on your local laptop, stored in a directory named hello, which contains the following files:
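At minimum, such a directory would look like this (the hello pipeline also ships auxiliary files such as a README):

```
hello/
├── main.nf
└── nextflow.config
```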
Ensure that the folder structure is in the required format, as described above.
To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command, specifying the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:
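For example, building from a local hello folder into the destination described below (the project ID and path are placeholders):

```shell
dx build --nextflow hello --destination project-xxxx:/applets2/hello
```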
Your destination project's billTo must have the Nextflow pipeline applet building feature enabled. Contact DNAnexus Sales for more information.
This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.
The dx run -h command can be run to see information about this applet, similar to the above example.
A Nextflow pipeline applet has a type nextflow under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.
For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and find the Nextflow section for all arguments that are supported for building a Nextflow pipeline applet.
Building a Nextflow Pipeline App from a Nextflow Pipeline Applet
You can also build a Nextflow pipeline app from an existing applet by running the command: dx build --app --from applet-xxxx.
Running a Nextflow Pipeline Executable (App or Applet)
Running a Nextflow Pipeline Executable via UI
You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab is displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.
Running a Nextflow Pipeline Applet via CLI
To run the Nextflow pipeline executable, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:
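A minimal sketch; the input parameter name reads_fastqgz is illustrative, so use the input names reported by dx run -h for your executable:

```shell
dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy
```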
You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:
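One way to do this, assuming job-xxxx is the head job's ID, is to list the executions originating from it:

```shell
dx find executions --origin job-xxxx
```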
Monitoring Jobs
Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob handles a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).
Monitor the detailed logs of the head job and the subjobs through each job's DNAnexus log via the UI or the CLI.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.
Monitoring in the UI
Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. The costs (when your account has permission) and resource usage of a job are also viewable.
An example of the log of a head job:
An example of the log of a subjob:
Monitoring in the CLI
From the CLI, you can use the dx watch command to check the status and view the log of the head job or each subjob.
Monitoring the head job:
Monitoring a subjob:
Advanced Options: Running a Nextflow Pipeline Executable (App or Applet)
Nextflow Execution on DNAnexus
The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow engine, and multiple subjobs each running a single Nextflow process. Throughout the pipeline's execution, the head job remains in the "running" state and supervises the job tree's execution.
Nextflow Execution Log File
When a Nextflow head job (job-xxxx) enters its terminal state, either "done" or "failed", the system writes a log file named nextflow-<job-xxxx>.log to the output destination of the head job.
Private Docker Repository
DNAnexus supports Docker container engines for the Nextflow pipeline execution environment. The pipeline developer may refer to a public or a private Docker repository. When the pipeline references a private Docker repository, provide your Docker credential file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree.
Syntax of a private Docker credential:
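A commonly used shape for this credential file is sketched below; the registry, username, and token values are placeholders, and the authoritative schema is defined in the Platform documentation:

```json
{
  "docker_registry": {
    "registry": "quay.io",
    "username": "your-username",
    "token": "your-api-token-or-password"
  }
}
```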
Store this credential file in a separate project with restricted access permissions for security.
Nextflow Pipeline Executable Inputs and Outputs
Specifying Input Values to a Nextflow Pipeline Executable
Below are all the ways to specify an input value at build time and runtime. They are listed in order of precedence (items listed first override items listed further down the list):
Executable (app or applet) run time
DNAnexus Platform app or applet input.
CLI example:
dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy
Formats of PATH to File, Folder, or Wildcards
While you can specify a file input parameter's value in different places, as seen above, the valid PATH format referring to the same file differs. This depends on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable's input parameter. Examples are given below.
Scenarios
Valid PATH format
Specifying a Nextflow Job Tree Output Folder
When launching a DNAnexus job, you can specify a job-level output destination such as project-xxxx:/destination/ using the platform-level optional destination parameter, via the UI or the CLI. For pipelines with publishDir settings, each output file is saved to <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned by the Nextflow script's process.
Read more detail about the output folder specification and publishDir below. Find an example of how to construct output paths of an nf-core pipeline job tree at run time in the FAQ.
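The path composition described above can be sketched in Python (illustrative only; the function name is not part of the Platform API):

```python
def final_output_path(dx_run_path: str, publish_dir: str, filename: str) -> str:
    """Compose <dx_run_path>/<publishDir>/<filename> as described above."""
    parts = [dx_run_path.rstrip("/"), publish_dir.strip("/"), filename]
    return "/".join(p for p in parts if p)

# Job-level destination plus the publishDir path assigned by the process:
print(final_output_path("project-xxxx:/destination", "results/qc", "report.html"))
# project-xxxx:/destination/results/qc/report.html
```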
Using an AWS S3 Bucket as a Work Directory for Nextflow Pipeline Runs
You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.
Step 1. Configure Your AWS Account to Trust the DNAnexus Platform as an OIDC Identity Provider
First, configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to note the value entered in the "Audience" field. This value is required in a configuration file used by your pipeline to enable pipeline runs to access the S3 bucket.
Step 2. Configure an AWS IAM Role with the Proper Trust and Permissions Policies
Next, configure an AWS IAM role such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket.
Permissions Policy
The following example shows how to structure an IAM role's permission policy, to enable the role to use an S3 bucket - accessible via the S3 URI s3://my-nextflow-s3-workdir - as the work directory of Nextflow pipeline runs:
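A sketch of such a permissions policy, using the bucket from the example URI above; adapt the actions and bucket name to your needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-nextflow-s3-workdir",
        "arn:aws:s3:::my-nextflow-s3-workdir/*"
      ]
    }
  ]
}
```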
In the above example:
The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.
The two entries in the list in the "Resource" section enable the role to access both the bucket itself and all objects within it (s3://my-nextflow-s3-workdir and s3://my-nextflow-s3-workdir/*).
Trust Policy
The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:
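A heavily simplified sketch of such a trust policy. The provider ARN, audience value, and the exact subject-claim string format are placeholders; consult the Platform's OIDC documentation for the authoritative values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<platform-oidc-provider>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<platform-oidc-provider>:aud": "<your-audience-value>",
          "<platform-oidc-provider>:sub": "<subject-claim string naming project-xxxx and user-aaaa>"
        }
      }
    }
  ]
}
```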
In the above example:
To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).
To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).
Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, as accessible at
Step 3. Configure Your Nextflow Pipeline's Configuration File to Access the S3 Bucket
Next, you need to configure your pipeline so that, when it's run, it can access the S3 bucket. To do this, add a dnanexus config scope, in a configuration file, that includes the properties shown in this example:
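A sketch of such a scope in a Nextflow configuration file; the audience value is a placeholder:

```
dnanexus {
    workDir = 's3://my-nextflow-s3-workdir'
    jobTokenAudience = '<your-audience-value>'
    jobTokenSubjectClaims = 'project_id,launched_by'
}
```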
In the above example:
workDir is the path to the bucket to be used as a work directory, in S3 URI format.
jobTokenAudience is the "Audience" value you defined in Step 1 above.
jobTokenSubjectClaims is a comma-separated list of subject claims, described in the next section.
Using Subject Claims to Control Bucket Access
When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:
Having included custom subject claims in the trust policy for the role, you then need, in the Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, entered in the same order in which you entered them in the trust policy.
For example, if you configured a role's trust policy per the example above, you are requiring a job, in order to assume the role, to present custom subject claims project_id and launched_by, in that order. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, as follows:
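For the example claims above, this looks like:

```
jobTokenSubjectClaims = 'project_id,launched_by'
```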
Within the dnanexus config scope, you must also set the value of iamRoleArnToAssume to that of the appropriate role:
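For example, with a placeholder role ARN:

```
iamRoleArnToAssume = 'arn:aws:iam::<account-id>:role/<role-name>'
```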
Advanced Options: Building a Nextflow Pipeline Executable
Nextflow Pipeline Executable Permissions
By default, the Platform limits the permissions granted to apps and applets. Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:
External internet access ("network": ["*"]) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external Docker registries at runtime.
UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - This is required in order for Nextflow pipeline jobs to record the progress of executions, and preserve the run cache, to enable resume functionality.
You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building the executable, using the --extra-args flag. An example:
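A sketch using the settings discussed below; the pipeline path is a placeholder:

```shell
dx build --nextflow /path/to/pipeline \
  --extra-args '{"access": {"network": [], "allProjects": "VIEW"}}'
```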
Here are the key points:
"network": [] prevents jobs from accessing the internet.
"allProjects":"VIEW" increases jobs' access permission level to VIEW. This means that each job has "read" access to projects that can be accessed by the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs - via a , for example, - from projects other than the one in which a job is being run.
Advanced Building and Importing Pipelines
Additional options exist for dx build --nextflow:
Option
Class
Description
Use dx build --help for more information.
Private Nextflow Pipeline Repository
When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:
To safeguard this credentials file object, store it in a separate project that only you can access.
Platform File Objects as Runtime Docker Images
When building a Nextflow pipeline executable, you can replace any Docker image reference with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.
This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. Also, it fortifies data security by eliminating the need for internet access to external resources, during pipeline execution.
Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching, or manually preparing tarballs.
Built-in Docker Image Caching vs. Manually Preparing Tarballs
Built-in Docker image caching
Manually preparing tarballs
Built-in Docker Image Caching
This method initiates a building job that takes the pipeline script and identifies Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, embedding the tarballs as bundledDepends.
You can use built-in caching via the CLI by using the flag --cache-docker at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
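For example, building the public hello pipeline with Docker image caching enabled:

```shell
dx build --nextflow \
  --repository https://github.com/nextflow-io/hello \
  --cache-docker
```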
If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag --docker-secrets, a file object that contains the credentials needed to access the repository. This object must be in the following format:
When a pipeline requires specific inputs, such as file objects, sample values must be present within the project in which the building job is to execute. These values must be provided along with the flag --nextflow-pipeline-params.
It's crucial that these sample values be structured in the same way as the actual input data, so that the execution logic of the Nextflow pipeline remains intact. During the build process, use small files containing data representative of the larger dataset as sample data, to reduce file localization overhead.
Manually Preparing Tarballs
You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:
Option A: Reference each tarball by its unique Platform ID such as dx://project-xxxx:file-yyyy. Use this approach if you want deterministic execution behavior.
You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:
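A sketch of a process using a Platform file ID as its container; the process name and body are illustrative:

```
process say_hello {
    // The file ID points at a Docker tarball stored on the Platform
    container 'dx://project-xxxx:file-yyyy'

    """
    echo hello
    """
}
```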
When accessing a Platform project, a Nextflow pipeline job needs the VIEW or higher permission to the project.
Option B: Within a Nextflow pipeline script, you can also reference a Docker image by using its name. Use this name within a path in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
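A sketch, using a hypothetical image name and version:

```
container 'project-xxxx:/.cached_docker_images/ubuntu/ubuntu_20.04'
```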
File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included - latest is not an accepted tag in this context.
At Nextflow pipeline executable runtime:
If no image is found at the path provided, the Nextflow pipeline job attempts to pull the Docker image from the remote external registry, based on the image name. This pull attempt requires internet access.
When the version is referenced as latest
Here are examples of tarball file object paths and names, as constructed from image names and version tags:
Image Name
Version Tag
Tarball File Object Path and Name
Option C: You can also reference Docker image names in pipeline scripts by digest - for example, <Image_name>@sha256:XYZ123…. When referring to a tarball file on the Platform using this method, the file must have an object property image_digest assigned to it. A typical format would be "image_digest":"<IMAGE_DIGEST_HERE>".
An example:
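A sketch reusing the placeholder digest from above; the image name is hypothetical, and the tarball file object must carry the matching image_digest property (exact value format as described above):

```shell
# In the pipeline script, reference the image by digest:
#   container 'ubuntu@sha256:XYZ123...'
# Assign the matching property to the tarball file object:
dx set_properties project-xxxx:file-yyyy image_digest="sha256:XYZ123..."
```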
Nextflow Input Parameter Type Conversion to DNAnexus Executable Input Parameter Class
Based on the input parameter's type and format (when applicable) defined in the corresponding pipeline schema, each parameter is assigned the corresponding DNAnexus input class.
File Input as String or File Class
As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which is assigned to the "file" or "string" class, respectively. When running the executable, use the PATH format that matches the class (file or string) of the executable's input parameter. See the section above for an acceptable PATH format for each class.
Converting a URL path to a String
When converting a file reference from a URL format to a String, use the method toUriString(). An example of a URL format would be dx://project-xxxx:/path/to/file for a DNAnexus URI. The method toURI().toString() does not give the same result, because toURI() removes the context ID, such as project-xxxx, and toString() removes the scheme, such as dx://. More information about these Nextflow methods is available in the Nextflow documentation.
Managing intermediate files and publishing outputs
Pipeline Output Setting Using output: block and publishDir
All files generated by a Nextflow job tree are stored in its session's workDir, the path where temporary results are stored. On DNAnexus, when the Nextflow pipeline job is run with "preserve_cache=true", the workDir is set to project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project in which the job ran; you can follow this path to access all preserved temporary results. Access to these results is useful for investigating detailed pipeline progress, and for resuming job runs during pipeline development.
When the Nextflow pipeline job is run with "preserve_cache=false" (the default), temporary files are stored in the job's temporary workspace, which is deconstructed when the head job enters its terminal state - "done", "failed", or "terminated". Since many of these files are intermediate inputs and outputs passed between processes, and are expected to be cleaned up after the job completes, running with "preserve_cache=false" helps reduce project storage costs for files that are not of interest. It also saves you from having to clean up temporary files yourself.
To save the final results of interest, and to display them as the Nextflow pipeline executable's output, you can declare output files matching the declaration under the script's output: block, and use Nextflow's optional publishDir directive to publish them.
This makes the published output files available as the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, of class array:file. The files are organized under the relative folder structure assigned via publishDir. This works for both "preserve_cache=true" and "preserve_cache=false". Only the "copy" publish mode is supported on DNAnexus.
Values of publishDir
At pipeline development time, the valid value of publishDir can be:
A local path string, for example, "publishDir path: ./path/to/nf/publish_dir/",
A dynamic string value defined via a pipeline input parameter such as params.outdir, where outdir is a string-class input. This allows pipeline users to determine parameter values at runtime. For example, publishDir path: "${params.outdir}/some/dir/" or "./some/dir/${params.outdir}/"
Find an example on how to construct output paths for an nf-core pipeline job tree at run time in the .
publishDir is NOT supported on DNAnexus when assigned as an absolute path starting at root (/), such as /path/to/nf/publish_dir/. If an absolute path is defined for the publishDir, no output files are generated as the job's output parameter "published_files".
Queue Size Configuration
The queueSize option is part of Nextflow's executor configuration scope. It defines how many tasks the executor handles in parallel. On DNAnexus, this is the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration has a value assigned to queueSize, it overrides the default value. If the value exceeds the upper limit (1000) on DNAnexus, the root job errors out. See the Nextflow executor page for examples.
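For example, to allow up to 10 concurrent subjobs (the value is illustrative), set queueSize in the executor scope of a Nextflow configuration file:

```
executor {
    queueSize = 10
}
```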
Instance Type Determination
Head job instance type determination
The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs. Changing the instance type for the head job does not affect the computing resources available for subjobs, where most of the heavy computation takes place (see below where to configure instance types for Nextflow processes). Changing the instance type for the head job may be necessary only if it runs out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.
Subjob instance type determination
Each subjob's instance type is determined based on the profile information provided in the Nextflow pipeline script. Specify required instances via Nextflow's machineType directive (example below). Alternatively, use a set of system requirements such as cpus, memory, disk, and other resource parameters according to the official Nextflow documentation. The executor matches instance types to the minimal requirements described in the Nextflow pipeline profile using this logic:
Choose the cheapest instance that satisfies the system requirements.
Use only SSD type instances.
All other things being equal (price and instance specifications), it prefers a v2 instance type.
Order of precedence for subjob instance type determination:
The value assigned to machineType directive.
Values assigned to the cpus, memory, and disk directives.
An example command for specifying machineType by DNAnexus instance type name is provided below:
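A sketch of a process requesting a specific DNAnexus instance type; the process name and body are illustrative:

```
process align_reads {
    // Request a specific DNAnexus instance type by name
    machineType 'mem1_ssd1_v2_x8'

    """
    echo aligning
    """
}
```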
Values assigned to the cpus, memory, and disk directives serve two purposes: they determine the instance type, and they can be recalled by the Nextflow process as ${task.cpus}, ${task.memory}, and ${task.disk} at runtime for task resource allocation.
Nextflow Resume
Preserve Run Caches and Resuming Previous Jobs
Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. A new run can jump directly to downstream processes without needing to start from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.
Nextflow uses a scratch storage area for caching and preserving each task's temporary results. This directory is called the "working directory", and its path is defined by:
The session ID, a universally unique identifier (UUID) associated with the current execution
Each task's unique hash ID: a hash composed of each task's input values, input files, command-line strings, container ID (such as a Docker image), conda environment, environment modules, and executed scripts in the bin directory, when applicable.
You can use the Nextflow resume feature with the following Nextflow pipeline executable parameters:
preserve_cache: Boolean type. The default value is false. When set to true, the run is cached in the current project for future resumes. For example:
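For example:

```shell
dx run applet-xxxx -i preserve_cache=true
```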
This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID and each subjob's unique ID.
When preserve_cache=true, DNAnexus executor overrides the value of workDir of the job tree to be project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job tree was executed.
Below are four possible scenarios and the recommended use cases for -i resume:
Scenarios
Parameters
Use Cases
Note
Cache Preserve Limitations and Cleaning Up workDir
To save on storage costs, clean up the workDir regularly. The maximum number of sessions that can be preserved in a DNAnexus project is 20 sessions. If you exceed the limit, the job generates an error with the following message:
"The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run. "
To clean up all preserved sessions under a project, delete the entire /.nextflow_cache_db folder. To clean up a specific session's cache, delete that session's /.nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting folders. To delete a folder in the CLI, run:
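For example, with the session ID left as a placeholder:

```shell
dx rm -r "project-xxxx:/.nextflow_cache_db/<session_id>/"
```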
Be aware that deleting an object in the UI or with the CLI command dx rm cannot be undone. Once the session work directory is deleted or moved, subsequent runs cannot resume from that session.
For each session, only one job can resume the session's cached results and preserve its own progress to this session. Multiple jobs can resume and preserve different sessions without limitations, as long as each job preserves a different session. Similarly, multiple jobs can resume the same session without limitations, as long as only one or none is preserving the progress to the session.
Nextflow's errorStrategy
Nextflow's errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default, the process and other pending processes stop immediately (the default errorStrategy is terminate). This forces the entire pipeline execution to be terminated.
Four error strategy options exist for Nextflow executor: terminate, finish, ignore, and retry. Below is a table of behaviors for each strategy. The "all other subjobs" referenced in the third column have not yet entered their terminal states.
When more than one errorStrategy directive is applied to a pipeline job tree, the following rules apply, depending on the first errorStrategy triggered.
When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs result in the "failed" state immediately.
When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjobs also applies the finish behavior.
Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as ExecutionError, UnresponsiveWorker, JMInternalError, AppInternalError, AppInsufficientResourceError, or JobTimeoutExceeded, the subjob follows the executionPolicy determined for the subjob and restarts itself. It does not restart from the head job.
FAQ
My Nextflow job tree failed, how do I find where the errors are?
A: Find the errored subjob's job ID from the head job's nextflow_errored_subjob and nextflow_errorStrategy properties to investigate which subjob failed and which errorStrategy was applied. To query these errorStrategy-related properties in the CLI, run the following command:
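One way to do this, assuming the jq utility is available for filtering the JSON output:

```shell
dx describe job-xxxx --json | jq '.properties'
```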
where job-xxxx is the head job's job ID.
After finding the errored subjob, investigate the job log using the Monitor page by accessing the URL https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>. In this URL, jobID is the subjob's ID such as job-yyyy. Alternatively, watch the job log in CLI using dx watch job-yyyy.
With the preserve_cache value set to true when starting the Nextflow pipeline executable, trace the cache workDir such as project-xxxx:/.nextflow_cache_db/<session_id>/work/ to investigate the intermediate results of this run.
What is the version of Nextflow that is used?
A: Find the Nextflow version by reading the log of the head job. Each built Nextflow executable is pinned to a specific version of the Nextflow executor.
What container runtimes are supported?
A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.
My job hangs at the end of the analysis. What can I do to avoid this problem?
A: There are many possible causes for the head job becoming unresponsive. One known cause is the Nextflow trace file being written directly to a DNAnexus URI such as dx://project-xxxx:/path/to/file. To avoid this, pass -with-trace path/to/tracefile (using a local path string) to the Nextflow pipeline applet's nextflow_run_opts input parameter.
Can I have an example of how to construct an output path when I run a Nextflow pipeline with params.outdir, publishDir and job-level destination?
Taking nf-core/sarek as an example, start by reading the pipeline's logic:
The pipeline's publishDir is constructed with the params.outdir variable as a prefix, followed by each task's name for each subfolder:
publishDir = [ path: { "${params.outdir}/${...}" }, ... ]
params.outdir is an input parameter to the pipeline. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir to:
To specify a value of params.outdir for the Nextflow pipeline executable built from the nf-core/sarek pipeline script, you can use the following command:
You can also set a job tree's output destination using the --destination option of dx run:
This command constructs the final output paths as follows:
project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.
project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of all tasks/processes/subjobs of this pipeline.
project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.
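The path construction above can be sketched with a few shell assignments. All IDs and the task name fastqc below are placeholders, not values from a real project:

```shell
# Sketch of how the final output paths are assembled (placeholder values).
destination="project-xxxx:/path/to/jobtree/destination"  # job tree destination (e.g. dx run --destination)
outdir="local/to/outdir"                                 # value given to params.outdir (a relative path)
task_name="fastqc"                                       # hypothetical task/process name

shared_outdir="${destination}/${outdir}"                 # shared output folder for all tasks
task_outdir="${shared_outdir}/${task_name}"              # output folder for one specific task
echo "${task_outdir}"
```

This mirrors the three bullet points above: the destination, the shared outdir beneath it, and one subfolder per task.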
This example is based on the nf-core/sarek pipeline and its documentation.
Not all Nextflow pipelines have params.outdir as an input, nor do all use params.outdir in publishDir.
This documentation covers features available in dx-toolkit versions beginning with v0.378.0
(Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. For more information on how the exposed parameters are used at run time, see specifying input values to a Nextflow pipeline executable.
(Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).
When the input parameter is expecting a file, you need to specify the value in a certain format based on the class of the input parameter. When the input is of the "file" class, use DNAnexus qualified ID, which is the absolute path to the file object such as "project-xxxx:file-yyyy". When the input is of the "string" class, use the DNAnexus URI ("dx://project-xxxx:/path/to/file"). See table below for full descriptions of the formatting of PATHs.
You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of a file object, while fasta_fai is an input parameter of a string object. You then use DNAnexus qualifiedID format for fasta, and DNAnexus URI format for fasta_fai.
The DNAnexus object class of each input parameter is based on the type and format specified in the pipeline's nextflow_schema.json, when it exists. See additional documentation in the Nextflow Input Parameter Type Conversion section to understand how Nextflow input parameter's type and format (when applicable) converts to an app or applet's input class.
It is recommended to always specify input values via the app/applet input parameters, since the platform validates the input class and the existence of the input before the job is created.
All inputs for a Nextflow pipeline executable are set as "optional" inputs. This allows users to have flexibility to specify input via other means.
Nextflow pipeline command-line input parameter, available as nextflow_pipeline_params. This is an optional "string" class input, available on any Nextflow pipeline executable when it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced in the Nextflow Configuration documentation.
Because nextflow_pipeline_params is a string type parameter with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
Nextflow options parameter nextflow_run_opts. This is an optional "string" class input, available on any Nextflow pipeline executable when it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter that corresponds to the Nextflow run options pattern, specifying a preset input configuration.
Nextflow parameter file nextflow_params_file. This is an optional "file" class input, available for any Nextflow pipeline executable that is being built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to -params-file option of nextflow run.
Nextflow soft configuration override file nextflow_soft_confs. This is an optional "array:file" class input, available for any Nextflow pipeline executable that is being built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the files being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to the -c option of nextflow run, and the order specified for this array of file inputs is preserved when passing to the nextflow run execution.
The soft configuration file can be used for assigning default values of configuration scopes.
It is highly recommended to use nextflow_params_file instead of nextflow_soft_confs when specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines.
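As a hedged illustration of the nextflow_params_file route, a minimal params file might look like the sketch below. The parameter names outdir and fasta are hypothetical examples, not parameters of any particular pipeline; after uploading the file, its qualified ID would be passed via -i nextflow_params_file=…:

```shell
# Write a minimal, hypothetical params file in JSON format.
cat > params.json <<'EOF'
{
  "outdir": "local/to/outdir",
  "fasta": "dx://project-xxxx:/path/to/genome.fa"
}
EOF
# Then upload it and pass the resulting file ID at run time, e.g.:
#   dx upload params.json
#   dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy
cat params.json
```

Note that the file-typed parameter uses the DNAnexus URI format (dx://…), consistent with the string/file-path convention described earlier.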
Pipeline source code:
nextflow_schema.json
Pipeline developers may specify default values of inputs in the nextflow_schema.json file.
If an input parameter is of Nextflow's string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.
nextflow.config
Pipeline developers may specify default values of inputs in the nextflow.config file.
Pipeline developers may specify a default profile value using the --profile option when building the executable.
main.nf, sourcecode.nf
Pipeline developers may specify default values of inputs in the Nextflow source code file (*.nf).
Jobs authenticate via OpenID Connect tokens issued by job-oidc.dnanexus.com.
jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus job claims - for example, project_id, launched_by - that the job must present to assume the role that enables bucket access.
iamRoleArnToAssume is the Amazon Resource Name (ARN) for the role that you configured in Step 2 above, and that is assumed by jobs to access the bucket.
You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter within an aws config scope.
string
Specifies a tag for the Git repository. Can be used only with --repository.
--git-credentials GIT_CREDENTIALS
file
Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found in the Nextflow documentation.
--cache-docker
flag
Stores a container image tarball in the selected project in /.cached_dockerImages. Only Docker engine is supported. Incompatible with --remote.
Custom pipeline parameters to be referenced when collecting the Docker images.
--docker-secrets DOCKER_SECRETS
file
A DNAnexus file ID with credentials for a private Docker repository.
The job first attempts to access Docker images cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this also fails, the job tries to pull the images from the external repository via the internet.
The job attempts to find the Docker image based on the referenced Docker cache path. If this fails, the job attempts to pull from the external repository via the internet.
For pipelines featuring conditional process trees determined by input values, provide mocked input values for caching Docker containers used by processes affected by the condition.
A building job requires CONTRIBUTE or higher permission to the destination project, that is, the project where tarballs created from Docker containers are placed.
Pipeline source code is saved at /.nf_source/<pipeline_folder_name>/ in the destination project. The user is responsible for cleaning up this folder after the executable has been built.
When the latest tag is used, or when no version tag is provided, the Nextflow pipeline job attempts to look up the digest of the image's latest reference from the external Docker repository and uses it to search for the corresponding tarball on the Platform. This digest search requires internet access. If no digest is found, or if there is no internet access, the execution fails. If no corresponding tarball is found, the Nextflow pipeline job attempts to pull the image from the remote external registry.
Additional conversions (Nextflow type and format → DNAnexus class):
string (format: path) → string
string (no format) → string
integer → int
number → float
boolean → boolean
object → hash
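The conversion rules above can be expressed as a small lookup helper. This is only a sketch of the mapping described in this section, not part of any DNAnexus tooling:

```shell
# Map a Nextflow (type, format) pair to the DNAnexus input class (sketch).
nf_to_dx_class() {
  type="$1"
  format="${2:-NA}"   # rows without a format use NA
  case "${type}:${format}" in
    string:file-path)                             echo "file" ;;
    string:directory-path|string:path|string:NA)  echo "string" ;;
    integer:NA)                                   echo "int" ;;
    number:NA)                                    echo "float" ;;
    boolean:NA)                                   echo "boolean" ;;
    object:NA)                                    echo "hash" ;;
    *)                                            echo "unknown" ;;
  esac
}

nf_to_dx_class string file-path   # file
nf_to_dx_class number             # float
```

Only a string parameter with the file-path format becomes a "file"-class input; every other string variant stays a "string"-class input.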
publishDir may also be defined using a relative path, for example './some/dir/${params.outdir}/some/dir/'. When publishDir is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing publishDir to be a valid relative path.
The actual selected instance type's resources (CPUs, memory, disk capacity) may differ from what is allocated by the task. Instance type selection follows the precedence rules described above, while task allocation uses the values assigned in the configuration file.
When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the Docker run command. For example, when task.memory is specified, this becomes the maximum amount of memory allowed for the container: docker run --memory ${task.memory}
The session's cache directory containing information on the location of the workDir, the session progress, job status, and configuration data is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.
Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is technically the task's unique ID, and project-xxxx is the project where the job tree is executed.
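For orientation, the cache locations described above can be assembled like this (the project and session IDs are placeholders):

```shell
# Sketch of the preserve_cache layout on the platform (placeholder IDs).
project="project-xxxx"
session_id="12345678-1234-1234-1234-123456789012"

# Session cache archive (workDir location, progress, job status, configuration):
cache_tar="${project}:/.nextflow_cache_db/${session_id}/cache.tar"
# Root of the per-task working directories (<2digit>/<30characters>/ below it):
workdir_root="${project}:/.nextflow_cache_db/${session_id}/work/"

echo "${cache_tar}"
echo "${workdir_root}"
```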
resume: String type. The default value is an empty string, and the run begins without any cached data. When assigned a session ID, the run resumes from what is cached for that session ID in the project. When assigned "true" or "last", the run determines the session ID corresponding to the latest valid execution in the current project and resumes from it. For example: dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"
When a new job is launched and resumes a cached session (where session_id has a format like 12345678-1234-1234-1234-123456789012), the new job not only resumes from where the cache left off, but also shares the same session_id with the cached session it resumes. When a new job makes progress in a session and the job is being cached, it writes intermediate results to the same session's workDir. This generates a new cache directory (cache.tar) with the latest cache information.
Many Nextflow job trees can share the same session ID, write to the same workDir path, and each create its own cache.tar; only the cache.tar from the latest job that ends in the "done" or "failed" state is preserved in the project.
When the head job enters its terminal state such as "failed" or "terminated" that is not caused by the executor, no cache directory is preserved, even when the job was run with preserve_cache=true. Subsequent new jobs cannot resume from this job run. This can happen when a job tree fails due to exceeding a cost limit or a user terminating a job of the job tree.
Only up to 20 Nextflow sessions can be preserved per project.
Scenario 3: resume=<session_ID> | "true" | "last" and preserve_cache=false. Use case: pipeline development; pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh.
Scenario 4: resume=<session_ID> | "true" | "last" and preserve_cache=true. Use case: pipeline development; typically only for the first few tests. Only one job with the same <session_ID> can run at any given time.
finish
- Job properties are set with "nextflow_errorStrategy":"finish" and "nextflow_errored_subjob":"job-xxxx, job-2xxx", where job-xxxx and job-2xxx are errored subjobs.
- No new subjobs are created after the error.
- The head job eventually ends in the "failed" state, after other subjobs enter terminal states, with the error message: "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure."
- All other subjobs keep running until they reach a terminal state; if an error occurs in any of them, the finish errorStrategy is applied (ignoring other error strategies), per Nextflow default behavior.
retry
- The errored subjob's properties are set with "nextflow_errorStrategy":"retry" and "nextflow_errored_subjob":"self".
- The errored subjob ends in the "done" state immediately and spins off a new subjob to retry the errored task, named <name> (retry: <RetryCount>).
- The head job ends in a terminal state depending on the other subjobs (can be "done", "failed", or "terminated").
- All other subjobs keep running until they reach a terminal state; if an error occurs, their own errorStrategy is applied.
ignore
- The errored subjob's properties are set with "nextflow_errorStrategy":"ignore" and "nextflow_errored_subjob":"self"; it ends in the "done" state immediately.
- The head job's properties are set with "nextflow_errorStrategy":"ignore" and "nextflow_errored_subjob":"job-1xxx, job-2xxx"; the end of the job log notes that subjobs job-1xxx and job-2xxx ran into Nextflow process errors and the ignore errorStrategy was applied. The head job ends in a terminal state depending on the other subjobs (can be "done", "failed", or "terminated").
- All other subjobs keep running until they reach a terminal state; if an error occurs, their own errorStrategy is applied.
When retry is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, those errorStrategy directives are applied to the corresponding subjobs.
When ignore is the first errorStrategy directive to be triggered in a subjob, and any terminate, finish, or retry errorStrategy directive applies to the remaining subjobs, that errorStrategy is applied to the corresponding subjob.
Meet the input requirement for executing the pipeline.
Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.
Read the source script of the Nextflow pipeline for the actual context of usage and requirements for params.outdir and publishDir.
• App or applet input parameter class as file object
• CLI/API level, such as dx run --destination PATH
DNAnexus qualified ID (absolute path to the file object).
• Example (file):
project-xxxx:file-yyyy or project-xxxx:/path/to/file
• Example (folder):
project-xxxx:/path/to/folder/
• App or applet input parameter class as string
• Nextflow configuration and source code files, such as nextflow_schema.json, nextflow.config, main.nf, and sourcecode.nf
DNAnexus URI.
• Example (file):
dx://project-xxxx:/path/to/file
• Example (folder):
dx://project-xxxx:/path/to/folder/
• Example (wildcard):
dx://project-xxxx:/path/to/wildcard_files
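The two formats can be contrasted in a short sketch (the project ID, file ID, and path are placeholders):

```shell
# DNAnexus qualified ID - used when the input class is "file".
qualified_id="project-xxxx:file-yyyy"
qualified_path="project-xxxx:/path/to/file"

# DNAnexus URI - used when the input class is "string", and in Nextflow
# configuration and source code files.
dnanexus_uri="dx://project-xxxx:/path/to/file"

echo "$qualified_id"
echo "$dnanexus_uri"
```

The only syntactic difference for a path-style reference is the dx:// prefix, but the two forms are accepted in different places, as the table above describes.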
Values of the StringEquals condition on job-oidc.dnanexus.com:sub, and which jobs can assume the role that enables bucket access:
- project_id;project-xxxx — any Nextflow pipeline jobs running in project-xxxx
- launched_by;user-aaaa — any Nextflow pipeline jobs launched by user-aaaa
- project_id;project-xxxx;launched_by;user-aaaa — any Nextflow pipeline jobs launched by user-aaaa in project-xxxx
- bill_to;org-zzzz — any Nextflow pipeline jobs billed to org-zzzz
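Based on the examples above, the job's sub claim appears to be assembled by alternating claim names and values, separated by semicolons. The sketch below illustrates this with placeholder values; the exact claim semantics are an assumption drawn from the table, not an official specification:

```shell
# Assemble the sub claim for jobTokenSubjectClaims = 'project_id,launched_by'.
# Placeholder project and user IDs; claim format is an assumption.
project_id="project-xxxx"
launched_by="user-aaaa"

sub="project_id;${project_id};launched_by;${launched_by}"
echo "$sub"
```

The order of the claims in jobTokenSubjectClaims matters, since the assembled string must match the StringEquals condition configured on the IAM role.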
--profile PROFILE
string
Sets the default profile for the Nextflow pipeline executable.
--repository REPOSITORY
string
Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.
Requires running a "building job" with external internet access?
Yes, if building an applet for the first time or if any image is going to be updated.
No internet access required on rebuild.
No
Docker images packaged as bundledDepends?
Yes. Docker images used in the execution are cached and bundled at build time.
Conversion from a Nextflow input parameter's type and format (defined in nextflow_schema.json) to a DNAnexus input parameter class:
string (format: file-path) → file
string (format: directory-path) → string
Scenario 1 (default): resume="" (empty string) and preserve_cache=false. Use case: production data processing; most high-volume use cases.
Scenario 2: resume="" (empty string) and preserve_cache=true.
For each errorStrategy, the behavior upon a subjob error is described below: the job properties set, the head job's behavior, and the behavior of all other subjobs.
terminate
- The errored subjob's properties are set with "nextflow_errorStrategy":"terminate" and "nextflow_errored_subjob":"self"; it ends in the "failed" state immediately.
- The head job's properties are set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"job-xxxx", and "nextflow_terminated_subjob":"job-yyyy, job-zzzz", where job-xxxx is the errored subjob, and job-yyyy and job-zzzz are other subjobs terminated due to this error. The head job ends in the "failed" state immediately, with the error message: "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure."
$ dx build --nextflow \
--repository https://github.com/nextflow-io/hello \
--destination project-xxxx:/applets/hello
Started builder job job-aaaa
Created Nextflow pipeline applet-zzzz
$ dx run project-xxxx:/applets/hello -h
usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]
Applet: hello
hello
Inputs:
Nextflow options
Nextflow Run Options: [-inextflow_run_opts=(string)]
Additional run arguments for Nextflow (e.g. -profile docker).
Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
Additional top-level options for Nextflow (e.g. -quiet).
Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
(Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
configuration set
Script Parameters File: [-inextflow_params_file=(file)]
(Optional) A file, in YAML or JSON format, for specifying input parameter values
Advanced Executable Development Options
Debug Mode: [-idebug=(boolean, default=false)]
Shows additional information in the job log. If true, the execution log messages from
Nextflow are also included.
Resume: [-iresume=(string)]
Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
the sessionID, resumes the latest resumable session run by an applet with the same name
in the current project in the last 6 months.
Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
Enable storing pipeline cache and local working files to the current project. If true, local
working files and cache files are uploaded to the platform, so the current session could
be resumed in the future
Outputs:
Published files of Nextflow pipeline: [published_files (array:file)]
Output files published by current Nextflow pipeline and uploaded to the job output
destination.
$ pwd
/path/to/hello
$ ls
LICENSE README.md main.nf nextflow.config
# Query for the class of each input parameter
$ dx run project-yyyy:applet-xxxx --help
usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]
Applet: example_applet
example_applet
Inputs:
…
fasta: [-ifasta=(file)]
…
fasta_fai: [-ifasta_fai=(string)]
…
# Assign values of the parameter based on the class of the parameter
$ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"
# In a nextflow configuration file:
aws { region = '<aws region>'}
dnanexus {
workDir = '<S3 URI path>'
jobTokenAudience = '<OIDC_audience_name>'
jobTokenSubjectClaims = '<list of claims separated by commas>'
iamRoleArnToAssume = '<arn of the role who is set with permission>'
}
# In a nextflow configuration file:
dnanexus {
...
jobTokenSubjectClaims = 'project_id,launched_by'
...
}
# In a nextflow configuration file:
dnanexus {
...
iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
...
}
# In a Nextflow pipeline script:
process foo {
container 'dx://project-xxxx:file-yyyy'
'''
do this
'''
}
# In nextflow.config // at root folder of the nextflow pipeline:
process {
withName:foo {
container = 'dx://project-xxxx:file-yyyy'
}
}
# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'
# In the Nextflow pipeline script:
process bar {
container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'
'''
do this
'''
}
# In nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'
# In the Nextflow pipeline script:
process bar {
container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
'''
do this
'''
}
dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true
dx rm -r project-xxxx:/.nextflow_cache_db/ # clean up ALL session caches
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session's cache