DNAnexus Documentation

Key Concepts

By understanding projects, organizations, apps, and workflows, you'll improve your understanding of the DNAnexus Platform.

Python

Bash

Overview

Using the DNAnexus Platform

Getting Started

Get to know commonly used features in a series of short, task-oriented tutorials.

User

Learn to access and use the Platform via both its command-line interface and its user interface.

Developer

Learn to manage data, users, and work on the Platform, via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.

Administrator

This section is targeted towards organizational leads who have the permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the platform.

Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.

Get details on new features, changes, and bug fixes for each Platform and toolkit release.

DNAnexus Essentials

Learn to upload data, create a project, run an analysis, and visualize results.

Learn More

See these Key Concepts pages to learn more about how the DNAnexus Platform works, and how to get the most from it:

  • Projects

Get up and running quickly using the Platform via both its user interface (UI) and its command-line interface (CLI):

Learn the basics of developing for the Platform:

Getting Started

Get to know features you'll use every day, in these short, task-oriented tutorials.


You must set up billing for your account before you can perform an analysis, or upload or egress data. Follow these instructions to set up billing.

Uploading and Sharing Data

Running a Single App

Creating and Running a Workflow

Monitoring Jobs and Viewing Results

Visualizing Data

Learn More

See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:

For a step-by-step written tutorial on using the Platform via its UI, see the User Interface Quickstart.

For a step-by-step written tutorial on using the Platform via its CLI, see the Command Line Quickstart.

For a more in-depth video intro to the Platform, watch the DNAnexus Platform Essentials video.

Additional Tutorials

As a developer, you may be interested in the following:

As a bioinformatician, see the walkthrough and other content in the Science Corner.

Downloads
Release Notes

Parallel

Running Apps and Workflows

Organizations
Apps and Workflows
User Interface Quickstart
Command Line Quickstart
Developer Quickstart
Developer Tutorials
Projects
Organizations
Apps and Workflows
User Interface Quickstart
Command Line Quickstart
DNAnexus Platform Essentials video
dxCompiler
JupyterLab
Spark on JupyterLab
SAIGE GWAS
Science Corner

R Shiny Example Web App

This is an example web applet that demonstrates how to build and run an R Shiny application on DNAnexus.

View full source code on GitHub

Creating the web application

Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.

R Shiny needs two scripts, server.R and ui.R, which should be under resources/home/dnanexus/my_app/. When a job starts based on this applet, the resources directory is copied onto the worker, and since the ~/ path on the worker is /home/dnanexus, that means you have ~/my_app with those two scripts inside.

From the main applet script code.sh, start shiny pointing to ~/my_app, serving its mini-application on port 443.

For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running indefinitely. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

Modifying this example for your own applet

To make your own applet with R Shiny, copy the source code from this example and modify server.R and ui.R inside resources/home/dnanexus/my_app.

How to rebuild the shiny asset

To build the asset, run the dx build_asset command and pass shiny-asset, which is the name of the directory holding dxasset.json:

This outputs a record ID record-xxxx that you can then put into the applet's dxapp.json in place of the existing one:

Build the applet

Build and run the applet itself:

Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.

SAMtools count

View full source code on GitHub

This applet performs a basic samtools view -c {bam} command, referred to as "SAMtools count", on the DNAnexus Platform.

Download BAM Files

For bash scripts, inputs to a job execution become environment variables. The inputs from the dxapp.json file are formatted as shown below:

The object mappings_bam, a DNAnexus link containing the file ID of that file, is available as an environment variable in the applet's execution. Use the command dx download to download the BAM file. By default, downloading a file preserves the filename of the object on the platform.

SAMtools Count

Use the bash helper variable mappings_bam_name for file inputs. For these inputs, the DNAnexus Platform creates a bash variable [VARIABLE]_name that holds the platform filename. Because the file was downloaded with default parameters, the worker filename matches the platform filename. The helper variable [VARIABLE]_prefix contains the filename minus any suffixes specified in the input field patterns (for example, the platform removes the trailing .bam to create [VARIABLE]_prefix).

Upload Result

Use the dx upload command to upload data to the platform. This uploads the file into the job container, a temporary project that holds onto files associated with the job. When running the command dx upload with the flag --brief, the command returns only the file ID.

Job containers are an integral part of the execution process. To learn more, see Containers for Execution.

Associate With Output

The output of an applet must be declared before the applet is even built. Looking back to the dxapp.json file, you see the following:

The applet declares a file type output named counts_txt. In the applet script, specify which file should be associated with the output counts_txt. On job completion, this file is copied from the temporary job container to the project that launched the job.

Precompiled Binary

This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).

View full source code on GitHub

Precompiling a Binary

In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can use the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary runs on future workers.

See Cloud Workstation in the App library for more information.

Resources Directory

The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory are packaged, uploaded to the Platform, and then extracted in the root directory / of the worker. In this case, the resources/ directory is structured as follows:

When this applet is run on a worker, the resources/ directory is placed in the worker's root directory /:

The SAMtools command is available because the respective binary is visible from the default $PATH variable. The directory /usr/bin/ is part of $PATH, so the script can reference the samtools command directly:

Bash Helpers

Learn to build an applet that performs a basic SAMtools count with the aid of bash helper variables.

Source Code


Concurrent Computing Tutorials

Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.

Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud (here's a particularly helpful Stack Exchange post on the subject). To help make the documentation easier to understand, when discussing concurrent computing paradigms this guide refers to:

  • Parallel: Using multiple threads or logical cores to concurrently process a workload.

  • Distributed: Using multiple machines (in this case instances in the cloud) that communicate to concurrently process a workload.

User

In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).

To use the CLI, you need to download and install the dx command-line client.

If you're not familiar with the dx client, check the Command-Line Quickstart.

This section provides detailed instructions on using the dx client to perform such common actions as logging in, selecting projects, listing, copying, moving, and deleting objects, and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.

Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.

particularly helpful Stack Exchange post
Key Platform Capabilities

The DNAnexus Platform provides powerful tools for data management, analysis, and collaboration. You can organize your data and analyses in secure, shareable projects with robust tools for uploading, downloading, and managing files.

Using the Platform, you can run apps, workflows, and custom analyses at scale. You can also share your projects and results with team members while maintaining access controls.

download and install the dx command-line client
Command-Line Quickstart

Example Applications

Objects

Projects

A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the Key Concepts section.

Uploading and Downloading Files

View dxasset.json file
dx upload
Containers for Execution
Cloud Workstation
Step 1. Download BAM Files

Download input files using the dx-download-all-inputs command. The dx-download-all-inputs command goes through all inputs and downloads into folders with the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].

Step 2. Create an Output Directory

Create an output directory in preparation for the dx-upload-all-outputs DNAnexus command in the Upload Results section.

Step 3. Run SAMtools View

After executing the dx-download-all-inputs command, three helper variables are created to aid in scripting. For this applet, the input variable name mappings_bam with platform filename my_mappings.bam has the following helper variables:

Use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.
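
The relationship between the three helper variables can be sketched in plain bash. This only emulates the values for illustration and is not how the Platform sets them; the path assumes the default download location and the my_mappings.bam file name used in this example.

```shell
# Emulate the helper variables for an input named mappings_bam.
# Assumption: dx-download-all-inputs placed the file in its default location.
mappings_bam_path="/home/dnanexus/in/mappings_bam/my_mappings.bam"

# [VARIABLE]_name is the file name portion of the path.
mappings_bam_name="${mappings_bam_path##*/}"

# [VARIABLE]_prefix drops the suffix matched by the input's "patterns" entry (here "*.bam").
mappings_bam_prefix="${mappings_bam_name%.bam}"

echo "name:   ${mappings_bam_name}"
echo "prefix: ${mappings_bam_prefix}"
```

On a real worker these variables are already set, so the script only needs to read them.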

Step 4. Upload Result

Use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The dx-upload-all-outputs command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It uploads matching files and then associates them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. After creating the folders, place the outputs there.
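
As a rough sketch of the expected layout, the snippet below builds an out/[VARIABLE]/ tree in a temporary directory (a stand-in for /home/dnanexus) and prints which output name each file would map to. It does not call dx-upload-all-outputs, and the file content is a placeholder.

```shell
# Stand-in for /home/dnanexus so the sketch runs outside a worker.
home_dir=$(mktemp -d)

# One subfolder per declared output; the folder name is the output name.
mkdir -p "$home_dir/out/counts_txt"
echo "placeholder result" > "$home_dir/out/counts_txt/my_mappings.txt"

# Walk the out/[VARIABLE]/* pattern and show the output each file belongs to.
for path in "$home_dir"/out/*/*; do
  output_name=$(basename "$(dirname "$path")")
  file_name=$(basename "$path")
  echo "${output_name} <- ${file_name}"
done
```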

View full source code on GitHub
main() {
  R -e "shiny::runApp('~/my_app', host='0.0.0.0', port=443)"
}
dx build_asset shiny-asset
"runSpec": {
    ...
    "assetDepends": [
    {
      "id": "record-xxxx"
    }
  ]
    ...
}
dx build -f dash-web-app
dx run dash-web-app
{
  "inputSpec": [
    {
      "name": "mappings_bam",
      "label": "Mapping",
      "class": "file",
      "patterns": ["*.bam"],
      "help": "BAM format file."
    }
  ]
}
dx download "${mappings_bam}"
readcount=$(samtools view -c "${mappings_bam_name}")
echo "Total reads: ${readcount}" > "${mappings_bam_prefix}.txt"
counts_txt_id=$(dx upload "${mappings_bam_prefix}.txt" --brief)
{
  "name": "counts_txt",
  "class": "file",
  "label": "Read count file",
  "patterns": [
    "*.txt"
  ],
  "help": "Output file with Total reads as the first line."
}
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
├── Applet dir
│   ├── src
│   ├── dxapp.json
│   ├── resources
│       ├── usr
│           ├── bin
│               ├── < samtools binary >
/
├── usr
│   ├── bin
│       ├── < samtools binary >
├── home
│   ├── dnanexus
│       ├── applet script
samtools view -c "${mappings_bam_name}" > "${mappings_bam_prefix}.txt"
dx-download-all-inputs
mkdir -p out/counts_txt
# [VARIABLE]_path the absolute string path to the file.
$ echo $mappings_bam_path
/home/dnanexus/in/mappings_bam/my_mappings.bam
# [VARIABLE]_prefix the file name minus the longest matching pattern in the dxapp.json file
$ echo $mappings_bam_prefix
my_mappings
# [VARIABLE]_name the file name from the platform
$ echo $mappings_bam_name
my_mappings.bam
samtools view -c "${mappings_bam_path}" \
  > out/counts_txt/"${mappings_bam_prefix}.txt"
dx-upload-all-outputs

Git Dependency

View full source code on GitHub

What does this applet do?

This applet performs a basic SAMtools count of alignments present in an input BAM.

Prerequisites

The app must have network access to the hostname where the git repository is located. In this example, access.network is set to:

To learn more about access and network fields see .

How is the SAMtools dependency added?

SAMtools is cloned and built from the repository. The following is a closer look at the dxapp.json file's runSpec.execDepends property:

The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, the git fetch dependencies for htslib and SAMtools are specified. Dependencies resolve in the order listed. Specify htslib first, before the SAMtools build_commands, because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:

  • package_manager - Details the type of dependency and how to resolve it.

  • url - Must point to the server containing the repository. In this case, a GitHub URL.

The build_commands are executed from the destdir. Use cd when appropriate.

How is SAMtools called in the src script?

Because "destdir": "/home/dnanexus" is set in dxapp.json, the git repository is cloned to the same directory from which the script executes. The example directory's structure:

The SAMtools command in the app script is samtools/samtools.

Applet Script

You can build SAMtools in a directory that is on the $PATH or add the binary directory to $PATH. Keep this in mind for your app(let) development.
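
A minimal sketch of the second option, adding the binary directory to $PATH. The build directory is a temporary stand-in, and the samtools file is a stub script created only so the sketch runs without SAMtools installed.

```shell
# Hypothetical stand-in for the directory where SAMtools was built.
build_dir=$(mktemp -d)

# Stub executable named "samtools"; a real build would place the binary here.
printf '#!/bin/sh\necho "stub samtools called with: $*"\n' > "$build_dir/samtools"
chmod +x "$build_dir/samtools"

# Prepend the build directory so "samtools" resolves without a path prefix.
export PATH="$build_dir:$PATH"

resolved=$(command -v samtools)
echo "samtools resolves to: ${resolved}"
samtools view -c example.bam
```

With the directory on $PATH, the script can call samtools directly instead of samtools/samtools.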

Dash Example Web App

This is an example web app made with Dash, which in turn uses Flask underneath.

View full source code on GitHub

Creating the web application

After configuring an app with Dash, start the server on port 443.

Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.

For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

The rest of these instructions apply to building any applet with dependencies stored in an asset.

Creating an applet on DNAnexus

Install the and , then run dx-app-wizard with default options.

Creating the asset

dash-asset specifies all the packages and versions needed. These come from the .

Add these into dash-asset/dxasset.json:

Build the asset:

Use the asset from the applet

Add this asset to the applet's dxapp.json:

Build the applet

Build and run the applet itself:

You can always use dx ssh job-xxxx to SSH into the worker and inspect what's going on or experiment with quick changes. Then go to that job's special URL, https://job-xxxx.dnanexus.cloud/, to see the result.

Optional local testing

The main code is in dash-web-app/resources/home/dnanexus/my_app.py with a local launcher script called local_test.py in the same folder. This allows you to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.

Install locally the same libraries listed above.

To launch the web app locally:

Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.

Parallel by Region (sh)

This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).

View full source code on GitHub

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

Debugging

The command set -e -x -o pipefail assists you in debugging this applet:

  • -e causes the shell to immediately exit if a command returns a non-zero exit code.

  • -x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.

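  • -o pipefail causes a pipeline to return a non-zero exit code if any command in it fails, rather than reporting only the exit code of the last command.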
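
The effect of each option can be checked with a small experiment. This is a sketch, not part of the applet: each case runs in its own bash process so that a deliberate failure does not stop the demonstration.

```shell
# Default behavior: the pipeline's status comes from the last command (true),
# so the failure of the first command is hidden.
status_default=0
bash -c 'false | true' || status_default=$?

# With pipefail, the failing first command makes the whole pipeline fail.
status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

# With -e, the script exits at the failing command, so the echo never runs.
status_errexit=0
output_errexit=$(bash -c 'set -e; false; echo "not reached"') || status_errexit=$?

# With -x, each command is written to stderr (prefixed with "+") before it runs.
trace=$(bash -c 'set -x; echo hello' 2>&1)

echo "default=${status_default} pipefail=${status_pipefail} errexit=${status_errexit}"
echo "${trace}"
```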

The *.bai file is an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. You can then download or create a *.bai index as needed.
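
A minimal sketch of that check, which requires bash rather than plain sh. The variable name mappings_bai is hypothetical, and the two branches only record which action the applet would take instead of calling dx or samtools.

```shell
# Hypothetical optional input; left empty to simulate a job launched without an index.
mappings_bai=""

# -z is true when the string is empty; ":-" makes an unset variable behave like an empty one.
if [[ -z "${mappings_bai:-}" ]]; then
  index_action="create"     # in the applet: build the index from the BAM
else
  index_action="download"   # in the applet: download the provided index
fi

echo "Index action: ${index_action}"
```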

Parallel Run

Bash's job control system allows for convenient management of multiple processes. In this example, bash commands are run in the background while the maximum number of concurrent executions is controlled in the foreground. You can place a process in the background by adding the character & after a command.
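
One simple way to cap concurrency is to launch commands in batches and wait for each batch to finish. The sketch below assumes a limit of two and uses a stand-in function instead of SAMtools; the applet's own throttling logic may differ.

```shell
max_jobs=2
results=$(mktemp)

# Stand-in for the per-region work (for example, counting reads in one slice).
count_region() {
  echo "$1 counted" >> "$results"
}

running=0
for region in chr1 chr2 chr3 chr4 chr5; do
  count_region "$region" &       # & sends the command to the background
  running=$((running + 1))
  if [ "$running" -ge "$max_jobs" ]; then
    wait                         # the foreground blocks until the batch finishes
    running=0
  fi
done
wait                             # collect any remaining background processes

finished=$(wc -l < "$results" | tr -d ' ')
echo "Regions processed: ${finished}"
```

Batching is easy to follow, but a slow region holds up its whole batch.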

Job Output

Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the dx-upload-all-outputs command. The directory structure required by dx-upload-all-outputs is shown below:

In your applet, upload all outputs by creating the output directory and then using dx-upload-all-outputs to upload the output files.

Parallel xargs by Chr

This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.

View full source code on GitHub

How is the SAMtools dependency provided?

The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:

When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

/usr/bin is part of the $PATH variable, so the script can reference the samtools command directly, for example, samtools view -c ....

Parallel Run

Slice BAM

First, download the BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.

To split a BAM by regions, you need a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, you generate the *.bai in the applet, sorting the BAM if necessary.

Xargs SAMtools view

In the previous section, you recorded the name of each sliced BAM file into a record file. Next, perform a samtools view -c on each slice using the record file as input.
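
The pattern can be sketched with placeholder files. Here wc -l stands in for samtools view -c, the small text files stand in for BAM slices, and the -P value of 2 is an arbitrary choice for the number of parallel processes.

```shell
workdir=$(mktemp -d)

# Placeholder "slices": each line stands in for one read.
printf 'r1\nr2\n' > "$workdir/chr1.txt"
printf 'r3\n' > "$workdir/chr2.txt"
printf 'r4\nr5\nr6\n' > "$workdir/chr3.txt"

# Record file listing one slice per line, as in the previous step.
ls "$workdir"/chr*.txt > "$workdir/slices.list"

# Run one counting command per slice, at most two at a time, then sum the counts.
total=$(xargs -P 2 -I {} sh -c 'wc -l < "$1"' _ {} < "$workdir/slices.list" \
  | awk '{ sum += $1 } END { print sum }')

echo "Total: ${total}"
```

The per-slice results can arrive in any order, which is fine here because they are only summed.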

Upload results

The results file is uploaded using the standard bash process:

  1. Upload a file to the job execution's container.

  2. Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>.

TensorBoard Example Web App

View full source code on GitHub

This example demonstrates how to run TensorBoard inside a DNAnexus applet.

TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, your training script in TensorFlow must include code that saves specific data to a log directory where TensorBoard can then find the data to display it.

This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website.

Creating the web application

The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).

Run the training script in the background to start TensorBoard immediately, which lets you see the results while training is still running. This is particularly important for long-running training scripts.

For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.

Creating an applet on DNAnexus

Build the asset with the libraries first:

Take the record ID it outputs and add it to the dxapp.json for the applet.

Then build the applet:

Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.

Apps and Workflows Glossary

Learn key terms used to describe apps and workflows.

On the DNAnexus Platform, the following terms are used when discussing apps and workflows:

  • Execution: An analysis or job.

    • Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via an /executable-xxxx/run API call with the detach flag set to true are also root executions.

    • Execution tree: The set of all jobs and/or analyses that are created because of running a root execution.

  • Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).

    • Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.

  • Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.

    • Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.

    • Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.

  • Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.

FreeSurfer in JupyterLab

Learn how to use FreeSurfer in JupyterLab.


JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

About FreeSurfer

FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies.

The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING feature of JupyterLab.

FreeSurfer License Registration

To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license at the FreeSurfer registration page.

Using the FreeSurfer License on DNAnexus

To use the FreeSurfer license, complete the following steps:

  1. Upload the license text file to your project on the DNAnexus Platform.

  2. Launch the JupyterLab app and specify the IMAGE_PROCESSING feature.

  3. Once JupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory.

The commands to download the license file are as follows:

  • Python kernel: !dx download license.txt -o $FREESURFER_HOME

  • Bash kernel: dx download license.txt -o $FREESURFER_HOME

Apollo Apps

Spark Applications


A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

The Spark application is an extension of the current app(let) framework. App(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow for an additional optional cluster specification with type=dxspark.

  • Calling /app(let)-xxxx/run for Spark apps creates a Spark cluster (+ master VM).

  • The master VM (where the app shell code runs) acts as the driver node for Spark.

  • Code in the master VM leverages the Spark infrastructure.

Spark apps can be launched over a distributed Spark cluster.


Job Notifications

Learn how to set job notification thresholds on the DNAnexus Platform.


A license is required to use the functionality described on this page. Contact DNAnexus Sales for more information.

Being notified of when a job may be stuck can help users to troubleshoot problems. On DNAnexus, users can set to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time via or .

When the threshold is reached for a , the system sends an email notification to both the user who launched the executable and the org admin.

Executions and Time Limits

Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.

Types of Time Limits

On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.


Kaplan-Meier Survival Curve

Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.


An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Building a Kaplan-Meier Survival Curve Chart

Scatter Plot

Learn to build and use scatter plots in the Cohort Browser.


An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

When to Use Scatter Plots

Stacked Row Chart

Learn to build and use stacked row charts in the Cohort Browser.


An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

When to Use Stacked Row Charts

JupyterLab Quickstart

In this tutorial, you will learn how to create and run a notebook in JupyterLab on the platform, download data from the notebook, and upload results to the platform.


JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

MONAI in JupyterLab

Using MONAI Core, MONAI Label/3D Slicer (SlicerJupyter) via JupyterLab

Medical Open Network for AI (MONAI) is a framework built for deep learning in healthcare imaging. To use MONAI on the DNAnexus Platform, launch JupyterLab with the MONAI_ML feature, which includes:

  • MONAI Core: PyTorch-based framework for deep learning in healthcare imaging.

  • MONAI Label: An intelligent image labeling and learning tool designed to create training datasets and build AI annotation models. It provides a server-client framework that integrates with imaging viewers.

app.run_server(host='0.0.0.0', port=443)
Job Timeouts

Each job has a timeout setting. This setting denotes the maximum amount of "wall clock time" that the job can spend in the "running" state, that is, running on the DNAnexus Platform.

If the job is still running when this limit is reached, the job is terminated.

The default job timeout setting is 30 days, though individual apps may have different timeout settings, as specified by the app's creator. A job may be given a custom timeout setting.

How Job Timeouts Work

As noted above, job timeouts only apply to the time a job spends in the "running" state.

Job timeouts do not apply to any time a job spends waiting to begin running - as, for example, when a job is waiting for inputs to become available.

Job timeouts also do not apply to the time a job may spend between exiting the "running" state, and entering the "done" state - as, for example, when it is waiting for subjobs to finish.
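
The arithmetic below illustrates which time counts toward the limit. All durations, including the 48-hour custom timeout, are hypothetical values chosen for the example; the Platform performs this accounting itself.

```shell
# Hypothetical custom job timeout, in hours.
timeout_hours=48

# Hypothetical time (hours) the job spent in each phase.
waiting_hours=24      # waiting for inputs: not counted
running_hours=36      # in the "running" state: counted
post_run_hours=24     # waiting for subjobs after running: not counted

total_elapsed=$((waiting_hours + running_hours + post_run_hours))

# Only the time spent in the "running" state is compared with the timeout.
if [ "$running_hours" -gt "$timeout_hours" ]; then
  verdict="timed out"
else
  verdict="within timeout"
fi

echo "Elapsed ${total_elapsed}h, running ${running_hours}h: ${verdict}"
```

Although 84 hours pass in total, the job stays within its 48-hour timeout because only 36 of them were spent running.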

To learn more about timeouts, see job lifecycle and job states.

Errors

If a job fails to complete running before reaching its timeout limit, it is terminated, with the Platform returning JobTimeoutExceeded as the job's failure reason.

Execution Tree Expiration

Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree's root execution.

After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.

If an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree's root execution.
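
A short sketch of the deadline arithmetic, using made-up epoch timestamps. The deadline is anchored to the first launch, so a restart 10 days later leaves 20 days of Platform access, not 30.

```shell
# 30-day execution tree limit, in seconds.
tree_limit_seconds=$((30 * 24 * 60 * 60))

# Hypothetical epoch timestamps (seconds).
first_launch=1700000000                        # initial launch of the root execution
restart_time=$((first_launch + 10 * 86400))    # tree restarted 10 days later

# The restart does not move the deadline.
deadline=$((first_launch + tree_limit_seconds))
remaining_after_restart=$(( (deadline - restart_time) / 86400 ))

echo "Days of Platform access left after the restart: ${remaining_after_restart}"
```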

Errors

If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform does it fail. Depending on the access pattern, the Platform returns AppInternalError, AppError, or AuthError as the job's failure reason.

Monitoring Time Limits

To see information on time limits for executions and execution trees:

  1. Navigate to the project in which the execution or execution tree is being run.

  2. Click the Monitor tab.

  3. Click the name of the execution or execution tree to open a page showing detailed information on it.

If a time limit is approaching, a warning message provides information on when the limit is reached.

If a job is waiting for subjobs to finish, it is shown as running, but job timeout information is not displayed. Execution tree information continues to be displayed.

  • Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.

  • Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.

  • Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.

  • Job tree: A set of all jobs that share the same origin job.

  • Job mechanisms (monitoring, termination, and management) are the same for Spark apps as for any other regular app(let)s on the Platform.
  • Spark apps use the same platform dx communication between the master VM and DNAnexus API servers.

  • There's a new log collection mechanism to collect logs from all nodes.

  • You can use the Spark UI to monitor a running job using SSH tunneling.

  • 3D Slicer: Open-source software designed for the visualization, processing, and analysis of medical, biomedical, and other 3D images. In a Jupyter environment, 3D Slicer is accessible through the SlicerJupyter kernel and acts as a client for the MONAI Label server.

  • MONAI Core, MONAI Label, and 3D Slicer (SlicerJupyter) come pre-installed with the JupyterLab MONAI_ML feature option.

    For the full list of pre-installed packages, see the JupyterLab in-product documentation.

    Using MONAI Core

    For sample Jupyter notebooks and tutorials, see the official Project MONAI tutorials.

    For technical details, see the MONAI Core documentation.

    Using MONAI Label with 3D Slicer

    For examples showing how to use 3D Slicer with MONAI Label, see the following sample Jupyter notebooks in the DNAnexus OpenBio repository:

    • Radiology Auto-Segmentation and Training with MONAI Label and 3D Slicer (NIfTI/CT): Demonstrates auto-segmentation and model training on NIfTI CT spleen data using MONAI Label and 3D Slicer (SlicerJupyter).

    • Whole Brain Segmentation with MONAI Label and 3D Slicer (DICOM/MRI): Shows auto-segmentation and model training on DICOM MRI brain data, including DICOM-to-NIfTI conversion and interactive annotation in 3D Slicer.

    For general examples and tutorials on using MONAI Label and 3D Slicer (SlicerJupyter), explore the following GitHub repositories:

    • MONAI Label tutorials: Project-MONAI/tutorials/monailabel

    • 3D Slicer (SlicerJupyter) example notebooks: Slicer/SlicerNotebooks

  • tag/branch - Git tag/branch to fetch.
  • destdir - Directory on worker to which the git repository is cloned.

  • build_commands - Commands to build the dependency, run from the repository destdir. In this example, htslib is built when SAMtools is built, so only the SAMtools entry includes build_commands.

    The pipefail option makes a pipeline's return code the exit code of the last (rightmost) command that exits with a non-zero status. (Without it, the return code of a pipeline is the exit code of its last command, which can hide earlier failures and create problems that are difficult to debug.)
    Creating the web application

    The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).

    The training script runs in the background to start TensorBoard immediately, which allows you to see the results while training is still running. This is particularly important for long-running training scripts.

    For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

    As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
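A quick local check for this setting can be sketched as follows. The helper exposes_https_port and the app name example-web-app are hypothetical. The httpsApp object itself is copied from the requirement above.

```python
import json


def exposes_https_port(dxapp, port=443):
    """Return True if the parsed dxapp.json asks the worker to expose `port`."""
    https_app = dxapp.get("httpsApp", {})
    return port in https_app.get("ports", [])


# Hypothetical, trimmed-down dxapp.json content for a web applet.
dxapp = json.loads(
    '{"name": "example-web-app",'
    ' "httpsApp": {"ports": [443], "shared_access": "VIEW"}}'
)
```

Running exposes_https_port(dxapp) before dx build is one way to catch a missing httpsApp entry early.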

    Creating an applet on DNAnexus

    Build the asset with the libraries first:

    Take the record ID it outputs and add it to the dxapp.json for the applet.

    Then build and run the applet:

    Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud, to see the result.

    View full source code on GitHub
    Setting Thresholds From the Command Line

    For a root execution, the turnaround time is the time between its creation and the time it reaches a terminal state (or the current time, if it has not yet reached one). The terminal states of an execution are done, terminated, and failed.

    The job tree turnaround time threshold can be set in the dxapp.json app metadata file using the treeTurnaroundTimeThreshold field, with the threshold given in seconds. When a user runs an executable that has a threshold, the threshold applies only to the resulting root execution. See the API documentation for more details on treeTurnaroundTimeThreshold.

    Example of including the treeTurnaroundTimeThreshold field in dxapp.json:

    In the command-line interface (CLI), the dx build and dx build --app commands can accept the treeTurnaroundTimeThreshold field from dxapp.json, and the resulting app is built with the job tree turnaround time threshold from the JSON file.

    To check the treeTurnaroundTimeThreshold value of an executable, use the dx describe {app, applet, workflow or global workflow id} --json command.

    Using the dx describe {execution_id} --json command displays the selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, and treeTurnaroundTime values of root executions.
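The describe output can be inspected programmatically. The sketch below parses JSON such as that printed by dx describe {execution_id} --json and compares the two values. The helper threshold_status and the sample values (including the "executable" source string) are invented for illustration. The sketch also assumes that treeTurnaroundTime is reported in seconds, like the threshold.

```python
import json


def threshold_status(describe_json):
    """Classify a root execution against its turnaround time threshold."""
    desc = json.loads(describe_json)
    threshold = desc.get("selectedTreeTurnaroundTimeThreshold")
    elapsed = desc.get("treeTurnaroundTime")
    if threshold is None or elapsed is None:
        return "no threshold"
    return "exceeded" if elapsed > threshold else "within threshold"


# Made-up sample of the relevant fields from a describe call.
sample = json.dumps(
    {
        "id": "job-xxxx",
        "selectedTreeTurnaroundTimeThreshold": 3600,
        "selectedTreeTurnaroundTimeThresholdFrom": "executable",
        "treeTurnaroundTime": 4200,
    }
)
```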

    WDL Workflows

    For WDL workflows and tasks, dxCompiler enables tree turnaround time specification using the extras JSON file. dxCompiler reads the treeTurnaroundTimeThreshold field from the perWorkflowDxAttributes and defaultWorkflowDxAttributes sections in extras and applies this threshold to the generated workflow. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold field to the perTaskDxAttributes and defaultTaskDxAttributes sections in the extras JSON file.

    Example of including the treeTurnaroundTimeThreshold field in perWorkflowDxAttributes:

    Run a JupyterLab Session and Create Notebooks

    1. Launch JupyterLab and View the Project

    First, launch JupyterLab in the project of your choice, as described in the Running JupyterLab guide.

    After starting your JupyterLab session, click on the DNAnexus tab on the left sidebar to see all the files and folders in the project.

    2. Create an Empty Notebook

    To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.

    This creates an untitled ipynb file, viewable in the DNAnexus project browser, which refreshes every few seconds.

    To rename your file, right-click on its name and select Rename.

    3. Edit and Save the Notebook in the Project

    You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, press Ctrl+S (or Command+S on macOS), or click on the save icon in the Toolbar (an area below the tab bar at the top). A new notebook version lands in the project, and you should see in the "Last modified" column that the file was created recently.

    Since DNAnexus files are immutable, each notebook save creates a new version in the project, replacing the file of the same name. The previous version moves to the .Notebook_archive with a timestamp suffix added to its name. Saving notebooks directly in the project as new files preserves your analyses beyond the JupyterLab session's end.

    4. Download the Data to the Execution Environment

    To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).

    You can download input data from a project for your notebook using dx download in a notebook cell:

    You can also use the terminal to execute the dx command.

    5. Upload Data to the Project

    For any data generated by your notebook that needs to be preserved, upload it to the project before the session ends and the JupyterLab worker terminates. Upload data directly in the notebook by running dx upload from a notebook cell or from the terminal:


    If you create a notebook from the Launcher or from the top menu (File > New > Notebook), the notebook is not created in the project but in the local execution environment. To move it to the project, you must upload it to the project manually. Make sure you upload your local notebooks to the project before the session expires, or work on your notebooks directly from the project, so as not to lose your work.

    Next Steps

    • Check the References guide for tips on the most useful operations and features in JupyterLab.

    Contact DNAnexus Sales
    "access": {
      "network": ["github.com"]
    }
      "runSpec": {
     ...
        "execDepends": [
            {
              "name": "htslib",
              "package_manager": "git",
              "url": "https://github.com/samtools/htslib.git",
              "tag": "1.3.1",
              "destdir": "/home/dnanexus"
            },
            {
              "name": "samtools",
              "package_manager": "git",
              "url": "https://github.com/samtools/samtools.git",
              "tag": "1.3.1",
              "destdir": "/home/dnanexus",
              "build_commands": "make samtools"
            }
        ],
    ...
      }
    ├── home
    │   ├── dnanexus
    │       ├── < app script >
    │       ├── htslib
    │       ├── samtools
    │           ├── < samtools binary >
    main() {
      set -e -x -o pipefail
    
      dx download "$mappings_bam"
    
      count_filename="${mappings_bam_prefix}.txt"
      readcount=$(samtools/samtools view -c "${mappings_bam_name}")
      echo "Total reads: ${readcount}" > "${count_filename}"
    
      counts_txt=$(dx upload "${count_filename}" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt}" --class=file
    }
    pip install dash==0.39.0  # The core dash backend
    pip install dash-html-components==0.14.0  # HTML components
    pip install dash-core-components==0.44.0  # Supercharged components
    pip install dash-table==3.6.0  # Interactive DataTable component
    pip install dash-daq==0.1.0  # DAQ components
    {
      ...
      "execDepends": [
        {"name": "dash", "version": "0.39.0", "package_manager": "pip"},
        {"name": "dash-html-components", "version": "0.14.0", "package_manager": "pip"},
        {"name": "dash-core-components", "version": "0.44.0", "package_manager": "pip"},
        {"name": "dash-table", "version": "3.6.0", "package_manager": "pip"},
        {"name": "dash-daq", "version": "0.1.0", "package_manager": "pip"}
      ],
      ...
    }
    dx build_asset dash-asset
    "runSpec": {
        ...
        "assetDepends": [
        {
          "id": "record-xxxx"
        }
      ]
        ...
    }
    dx build -f dash-web-app
    dx run dash-web-app
    cd dash-web-app/resources/home/dnanexus/
    python3 local_test.py
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    set -e -x -o pipefail
    echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
    echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
    
    mkdir workspace
    cd workspace
    dx download "${mappings_sorted_bam}"
    
    if [ -z "$mappings_sorted_bai" ]; then
      samtools index "$mappings_sorted_bam_name"
    else
      dx download "${mappings_sorted_bai}"
    fi
    # Extract valid chromosome names from BAM header
    chromosomes=$(
      samtools view -H "${mappings_sorted_bam_name}" | \
      grep "@SQ" | \
      awk -F '\t' '{print $2}' | \
      awk -F ':' '{
        if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
          print $2
        }
      }'
    )
    
    # Split BAM by chromosome and record output file names
    for chr in $chromosomes; do
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    
    # Parallel counting of reads per chromosome BAM
    busyproc=0
    
    while read -r b_file; do
      echo "${b_file}"
    
      # If busy processes hit the limit, wait for all of them to finish
      if [[ "${busyproc}" -ge "$(nproc)" ]]; then
        echo "Processes hit max"
        while [[ "${busyproc}" -gt 0 ]]; do
          wait -n
          busyproc=$((busyproc - 1))
        done
      fi
    
      # Count reads in background
      samtools view -c "${b_file}" > "count_${b_file%.bam}" &
      busyproc=$((busyproc + 1))
    
    done < bamfiles.txt
    # Wait for any remaining background jobs to finish
    while [[ "${busyproc}" -gt 0 ]]; do
      wait -n
      busyproc=$((busyproc - 1))
    done
    ├── $HOME
    │   ├── out
    │       ├── < output name in dxapp.json >
    │           ├── output file
    outputdir="${HOME}/out/counts_txt"
    mkdir -p "${outputdir}"
    cat count* \
      | awk '{sum+=$1} \
      END{print "Total reads = ",sum}' \
      > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
    
    dx-upload-all-outputs
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── <samtools binary>
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    # Download BAM from DNAnexus
    dx download "${mappings_bam}"
    
    # Attempt to index the BAM file
    indexsuccess=true
    bam_filename="${mappings_bam_name}"
    samtools index "${mappings_bam_name}" || indexsuccess=false
    
    # If indexing fails, sort into a new file, then index the sorted file
    if [[ $indexsuccess == false ]]; then
      bam_filename="${mappings_bam_prefix}.sorted.bam"
      samtools sort -o "${bam_filename}" "${mappings_bam_name}"
      samtools index "${bam_filename}"
    fi
    
    # Extract chromosome names from header
    chromosomes=$(
      samtools view -H "${bam_filename}" | \
      grep "@SQ" | \
      awk -F '\t' '{print $2}' | \
      awk -F ':' '{
        if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
          print $2
        }
      }'
    )
    
    # Split BAM by chromosome and record filenames
    for chr in $chromosomes; do
      samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    counts_txt_name="${mappings_bam_prefix}_count.txt"
    
    # Sum all read counts across split BAM files
    sum_reads=$(
      < bamfiles.txt xargs -I {} \
      samtools view -c $view_options '{}' | \
      awk '{s += $1} END {print s}'
    )
    
    # Write the total read count to a file
    echo "Total Count: ${sum_reads}" > "${counts_txt_name}"
    counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
    dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    # Start the training script and put it into the background,
    # so the next line of code runs immediately
    python mnist_tensorboard_example.py --log_dir LOGS_FOR_TENSORBOARD &
    
    # Run TensorBoard
    tensorboard  --logdir LOGS_FOR_TENSORBOARD --host 0.0.0.0 --port 443
    dx build_asset tensorflow_asset
    "runSpec": {
     ...
     "assetDepends": [
        {
          "id": "record-xxxx"
        }
      ]
     ...
    }
    dx build -f tensorboard-web-app
    dx run tensorboard-web-app
    {
      "treeTurnaroundTimeThreshold": {threshold},
      ...
    }
    {
      "perWorkflowDxAttributes": {
        {workflow_name}: {
          "treeTurnaroundTimeThreshold": {threshold},
          ...
        },
        ...
      }
    }
    %%bash
    dx download input_data/reads.fastq
    %%bash
    dx upload results.csv
    To generate a survival chart, select one numerical field representing time, and one categorical field, which is transformed into the individual's status.

    The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": living, alive, diseasefree, disease-free.

    For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.

    Calculating Survival Percentage

    To calculate survival percent at the current event the system evaluates the following formula:

    $$S_T = \frac{L_{T0} - D}{L_{T0}}$$

    • $S_T$: Survival at the current event

    • $L_{T0}$: Number of subjects living at the start of the period or event

    • $D$: Number of subjects that died

    For each time period the following values are generated:

    • Status: Each individual is considered Dead unless they qualify as Living.

    • Number of Subjects Living at the Start ($L_{T0}$)

      • For the initial value this is the total number of records returned by the backend from survival data with Living or Dead Status.

      • For followup events this is the number of subjects at the start of the previous event minus the number of subjects that died in the previous event and the subjects that dropped out or were censored in the previous event.

    • Number of Subjects Who Died ($D$): 1 for each individual who, at the event, does not have a status of Living.

    • Number of Subjects Dropped or Censored: 1 for each individual who, at the event, has a status of Living.

    • Survival Percent at the Current Event ($S_T$): $S_T = \frac{L_{T0} - D}{L_{T0}}$

    • Cumulative Survival: $S_T \times S_{T-1}$, where $S_{T-1}$ is the survival percent at the previous event.

    This is the actual point drawn on the survival plot.
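The per-event bookkeeping described above can be sketched in a few lines. This is an illustrative calculation, not the Platform's implementation. The function name survival_table and the sample counts are made up, and the cumulative value is computed as a running product of per-event survival (the standard Kaplan-Meier form), which is an assumption about how events are combined.

```python
def survival_table(events, initial_living):
    """Compute per-event and cumulative survival.

    events: list of (died, censored) counts, one tuple per event, in time order.
    initial_living: number of subjects living before the first event.
    """
    rows = []
    living = initial_living
    cumulative = 1.0
    for died, censored in events:
        if living == 0:
            break  # nobody left at risk
        survival = (living - died) / living  # S_T = (L_T0 - D) / L_T0
        cumulative *= survival  # assumption: running product across events
        rows.append(
            {
                "living_at_start": living,
                "died": died,
                "survival": survival,
                "cumulative": cumulative,
            }
        )
        # Subjects who died or were censored are not at risk at the next event.
        living = living - died - censored
    return rows


# Made-up example: 20 subjects, three events.
rows = survival_table([(2, 1), (1, 0), (3, 2)], initial_living=20)
```

In this example, the first event has a survival of 18/20, and the second event starts with 17 subjects because two died and one was censored.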

    Learn More

    • Survival Curve (Wikipedia)

    • Kaplan-Meier (Wikipedia)

    Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.

    Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.

    Supported Data Types

    Primary Field | Secondary Field
    Numerical (Integer) or Numerical (Float) | Numerical (Integer) or Numerical (Float)

    Using Scatter Plots in the Cohort Browser

    In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that more records share a particular combination.

    Scatter Plot: Insurance Billed x Cost

    Non-Numeric Data in Scatter Plots

    Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a scatter plot. The message "This field contains non-numeric values" appears below the scatter plot, as in this sample chart:

    Scatter Plot Based on Field or Fields Containing Non-Numeric Values

    Clicking the "non-numeric values" link displays details on those values, and the number of records in which each appears.

    Detail on Non-Numeric Values

    Limit on Number of Data Points

    In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require that more data points be shown, you see this message above the chart:

    Scatter Plot with Warning Message about Data Point Limit

    In this scenario, add a cohort filter to reduce the size of the cohort, so that the scatter plot can show data for all of the cohort's members.
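To estimate ahead of time whether a pair of fields stays under the limit, you can count distinct value combinations in exported data. The sketch below is a local approximation only. The field names billed and cost, the helper names, and the rule of skipping non-numeric values are assumptions for this example.

```python
MAX_POINTS = 30_000  # limit on distinct data points stated above


def distinct_points(records, x_field, y_field):
    """Collect the distinct (x, y) combinations with numeric values."""
    points = set()
    for record in records:
        x, y = record.get(x_field), record.get(y_field)
        # Non-numeric values cannot be plotted, so they are skipped here.
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            points.add((x, y))
    return points


def fits_in_scatter_plot(records, x_field, y_field):
    """True if the number of distinct points is within the limit."""
    return len(distinct_points(records, x_field, y_field)) <= MAX_POINTS


# Made-up records: two share a combination, one has a non-numeric value.
records = [
    {"billed": 100, "cost": 80},
    {"billed": 100, "cost": 80},
    {"billed": 250, "cost": "n/a"},
    {"billed": 300, "cost": 120},
]
```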

    Cohort Compare

    Scatter plots are not supported in Cohort Compare.

    Preparing Data for Visualization in Scatter Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in scatter plots:

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

    Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.

    When creating a stacked row chart:

    • Both the primary and secondary fields must contain categorical data

    • Both the primary and secondary fields must contain no more than 20 distinct category values

    Supported Data Types

    Primary Field | Secondary Field
    Categorical (<=20 distinct category values) | Categorical (<=20 distinct category values)


    Categorical multiple and categorical hierarchical data are not supported in stacked row charts.

    Using Stacked Row Charts in the Cohort Browser

    In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."

    The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, you can see that 3,179 records contain the value "Out-patient" in the VisitType field.

    Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, you can see that 87 records in the first group share the value "specialist" in the DoctorType field.

    Stacked Row Chart: VisitType x DoctorType
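The counts behind such a chart amount to a cross-tabulation of the two fields. The sketch below reproduces that grouping on made-up records. The helper stacked_rows and the "generalist" value are hypothetical, and the 20-category check mirrors the requirement stated earlier in this section.

```python
from collections import Counter, defaultdict

MAX_CATEGORIES = 20  # limit on distinct category values per field


def stacked_rows(records, primary, secondary):
    """Count records per primary value, broken down by secondary value."""
    groups = defaultdict(Counter)
    for record in records:
        groups[record[primary]][record[secondary]] += 1
    secondary_values = {value for counts in groups.values() for value in counts}
    if len(groups) > MAX_CATEGORIES or len(secondary_values) > MAX_CATEGORIES:
        raise ValueError("each field may have at most 20 distinct category values")
    return {value: dict(counts) for value, counts in groups.items()}


# Made-up records using the field names from the chart above.
records = [
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
    {"VisitType": "Out-patient", "DoctorType": "generalist"},
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
    {"VisitType": "In-patient", "DoctorType": "specialist"},
]
chart = stacked_rows(records, "VisitType", "DoctorType")
```

Each key of chart corresponds to one bar, the sum of its counts to the number shown beside the bar, and each inner count to one color-coded section.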

    Cohort Compare

    Stacked row charts are not supported in Cohort Compare. Use a list view instead.

    Preparing Data for Visualization in Stacked Row Charts

    When ingesting data using Data Model Loader, the following data types can be visualized in stacked row charts:

    • String Categorical

    • String Categorical Sparse

    • Integer Categorical


    Developer Tutorials

    Access developer tutorials and examples.

    Developers new to the DNAnexus Platform may find it helpful to learn by doing. This page contains a collection of tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus Platform.

    By following the tutorials and examples, you learn to develop app(let)s that:

    • Run efficiently using cloud computing methodologies

    • Are straightforward to debug and use

    • Take advantage of the DNAnexus Platform's flexibility and scale

    • Reduce support burden while enabling collaboration

    If it's your first time developing an app(let), read the series. This series introduces terms and concepts that tutorials and examples build on.

    These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. These tutorials showcase varied implementations of the SAMtools view command on the DNAnexus Platform.

    Bash App(let) Tutorials

    Bash app(let)s use dx-toolkit, the platform SDK, and the command-line interface along with common Bash practices to create bioinformatic pipelines in the cloud.

    Python App(let) Tutorials

    Python app(let)s make use of dx-toolkit along with common Python modules to create bioinformatic pipelines in the cloud.

    Web App(let) Tutorials

    To create a web applet, you need access to Titan or Apollo features. Web applets can be made as either Python or Bash applets. The only difference is that they launch a web server and expose port 443 (for HTTPS) to allow a user to interact with that web application through a web browser.

    Concurrent Computing Tutorials

    A bit of terminology before starting the discussion of parallel and distributed computing paradigms on the DNAnexus Platform.

    Many definitions and approaches exist for parallelizing and distributing workloads in the cloud. To make the documentation easier to understand when discussing concurrent computing paradigms, this guide refers to:

    • Parallel: Using multiple threads or logical cores to concurrently process a workload.

    • Distributed: Using multiple machines (in this case, cloud instances) that communicate to concurrently process a workload.

    Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
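As a small illustration of the "parallel" definition, the sketch below spreads a workload across threads on a single machine. The function names and sample chunks are made up, and counting list items stands in for real work such as counting reads per chromosome. A distributed version would instead hand each chunk to a separate machine, which is not shown here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_reads(chunk):
    """Stand-in for real per-chunk work, such as counting reads in one BAM."""
    return len(chunk)


def parallel_total(chunks, workers=None):
    """Process chunks concurrently on one machine and sum the results."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_reads, chunks))


# Made-up workload split into three chunks.
chunks = [["read1", "read2"], ["read3"], []]
total = parallel_total(chunks)
```

Threads suit I/O-bound steps. For CPU-bound Python code, concurrent.futures.ProcessPoolExecutor offers the same interface while using multiple cores.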

    Parallel

    Distributed

    Executions and Cost and Spending Limits

    Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.

    Types of Cost and Spending Limits

    A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.

    Execution Cost Limits

    An execution cost limit is set when a root execution is launched. Once this limit is reached, the DNAnexus Platform terminates running executions in the affected execution tree.

    Errors

    When an execution is terminated in this fashion, the Platform sets a failure code for it. This failure code is displayed on the UI, on the relevant project's Monitor page.

    Billing Account Spending Limits

    Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.

    Billing account spending limits apply to cumulative charges incurred by projects billed to the account.

    If cumulative charges reach this limit, the Platform terminates running jobs in projects billed to the account, and prevents new executions from being launched.

    Errors

    When a job is terminated in this fashion, the Platform sets a failure reason for it. This failure reason is displayed on the UI, on the relevant project's Monitor page.

    Project-Level Compute and Egress Spending Limits

    A license is required to use the Enforce Monthly Spending Limit for Computing and Egress feature. Contact DNAnexus Sales for more information.

    Project admins can set a monthly project-level compute spending limit, which can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.

    If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.

    For more information on these limits, see the , and the .

    Compute Charges Incurred by Using Relational Database Clusters

    Monthly project compute limits do not apply to compute charges incurred by using .

    Compute Charges for Using Public IPv4 Addresses for Workers

    Using public IPv4 addresses for workers incurs additional charges. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any .

    For information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions, see the .

    Getting Info on Cost and Spending Limits

    Execution Costs and Cost Limits

    The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.

    While an execution or execution tree is running, information is displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.

    Spending Limits

    Org spending limit information is available from the .

    Project-Level Monthly Spending Limits

    If project-level monthly spending limits have been set for a project, detailed information is available via the CLI, using the command .

    Cohort Browser

    Visualize your data and browse your multi-omics datasets.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Cohort Browser is a visualization tool for exploring and filtering structured datasets. It provides an intuitive interface for creating visualizations, defining patient cohorts, and analyzing complex data.

    Cohort Browser supports multiple types of datasets:

    • Clinical and phenotypic data - Patient demographics, clinical measurements, and outcomes

    • Germline variants - Inherited genetic variations

    • Somatic variants - Cancer-related genetic changes

    • Gene expressions - Molecular expression measurements

    • Multi-assay datasets - Datasets combining multiple assay types or instances of the same assay type


    If you need to perform custom statistical analysis, you can also use JupyterLab environments with Spark clusters to query your data programmatically.

    Prerequisites

    You need to ingest your data before you can access it through a dataset in the Cohort Browser.

    Opening Datasets Using the Cohort Browser

    1. In Projects, select the project where your dataset is located.

    2. Go to the Manage tab.

    3. Select your dataset.

    You can also use the Info Panel to view information about the selected dataset, such as its creator or sponsorship.

    Getting Familiar with Cohort Browser

    Depending on your dataset, the Cohort Browser shows the following tabs:

    • Overview - Clinical data using interactive charts and dashboards

    • Data Preview - Clinical data in tabular format

    • Assay-specific tabs - Additional tabs appear based on your dataset content:

    Exploring Data in a Dataset

    In the Cohort Browser's Overview tab, you can visualize your data using charts. These visualizations provide an introduction to the dataset and insights on the clinical data it contains.

    When you open a dataset, Cohort Browser automatically creates an empty cohort that includes all records in the dataset. From here, you can add filters to create specific cohorts, build visualizations to explore your data, and export filtered data for further analysis outside the Platform.

    Next Steps

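    • Creating Charts and Dashboards - Build visualizations and manage dashboard layouts

    • Defining and Managing Patient Cohorts - Filter data and create patient groups

    • Analyzing Germline Genomic Variants - Work with inherited genetic variations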

    Row Chart

    Learn to build and use row charts in the Cohort Browser.


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use Row Charts

    Row charts can be used to visualize categorical data.

    When creating a row chart:

    • The data must be from a field that contains either categorical or categorical multi-select data

    • This field must contain no more than 20 distinct category values

    • The values cannot be organized in a hierarchy

    Supported Field Types

    See When to Use List Views for Categorical Data if you need to visualize hierarchical categorical data.

    When to Use List Views for Categorical Data

    Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a list view.

    Row charts aren't supported in Cohort Compare mode. In Cohort Compare mode, row charts are converted to list views.

    Using Stacked Row Charts for Multivariate Visualizations

    Row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a Stacked Row Chart.

    Using Row Charts in the Cohort Browser

    In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."

    Below is a sample row chart showing the distribution of values in a field Salt added to food. In the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.


    When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See Chart Totals and Missing Data for more information on how missing data affects chart calculations.
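The "count" and "freq." arithmetic described above can be reproduced with a few lines of shell. This is only an illustrative calculation, not the Cohort Browser's own implementation; the function name `row_chart_freq` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of cohort records that hold one category value.
# Arguments: <count for the category> <total cohort size>
row_chart_freq() {
  local count="$1" cohort_size="$2"
  awk -v c="${count}" -v n="${cohort_size}" \
    'BEGIN { printf "%.2f%%\n", (c / n) * 100 }'
}

# Figures from the sample chart: 27,979 of 100,000 records contain "Sometimes".
row_chart_freq 27979 100000
```

Because the denominator is the total cohort size, records with no value for the field lower the sum of the "freq." figures below 100%.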

    Preparing Data for Visualization in Row Charts

    When ingesting data using Data Model Loader, the following data types can be visualized in row charts, if category values are specified as such in the coding file used at ingestion:

    • String Categorical

    • String Categorical Sparse

    • String Categorical Multi-select


    While sparse serial data can be visualized using row charts, non-encoded values are not supported. These values do not appear as rows.

    VCF Preprocessing

    Learn about preprocessing VCF data before using it in an analysis.

    Overview

    It may be necessary to preprocess, or harmonize, the data before you load them.

    Harmonizing Data

    • The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.

    • GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.

    Basic Run

    Advanced Run


    To learn more about GLnexus, see GLnexus or Getting started with GLnexus.

    Annotating Variants

    VCF files can include variant annotations. SnpEff annotations provided as INFO/ANN tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.

    The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.

    VCF annotation flows. In (a) the annotation step is external to the VCF Loader, whereas in (b) the annotation step is internal. In any case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF Loader.

    CSV Loader


    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    The CSV Loader ingests CSV files into a database. The input CSV files are loaded into a Parquet-format database and tables that can be queried using Spark SQL.

    You can load a single CSV file or many CSV files. When loading multiple files, all files must share the same syntax.

    For example:

    • All files must have the same separator. This can be a comma, tab, or another consistent delimiter.

    • All files must include a header line, or all files must exclude it.


    Each CSV file is loaded into its own table within the specified database.

    How to Run CSV Loader

    Input:

    • CSV (array of CSV files to load into the database)

    Required Parameters:

    • database_name -> name of the database to load the CSV files into.

    • create_mode -> strict mode creates the database and tables from scratch; optimistic mode creates the database and tables only if they do not already exist.

    Other Options:

    • spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.

    • spark_read_csv_sep -> default , -- the separator character used by each CSV.
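Because each CSV file is paired with a table name by array index, it can help to assemble the repeated inputs from two parallel arrays. The following is a minimal sketch that only prints the argument pairs for review; the helper name `build_csv_loader_args` is hypothetical, and the file IDs and table names are placeholders.

```shell
#!/usr/bin/env bash
# Placeholder inputs: the Nth file is loaded into the Nth table.
csv_ids=(file-xxxx file-yyyy)
table_names=(sample_metadata gwas_result)

# Hypothetical helper: print one "-i csv=... -i table_name=..." pair per file.
build_csv_loader_args() {
  local i
  if (( ${#csv_ids[@]} != ${#table_names[@]} )); then
    echo "csv and table_name arrays must be the same length" >&2
    return 1
  fi
  for i in "${!csv_ids[@]}"; do
    printf '%s\n' "-i csv=${csv_ids[$i]} -i table_name=${table_names[$i]}"
  done
}

build_csv_loader_args
```

Checking the array lengths first catches a missing table name before any job is launched.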

    Basic Run

    The following case creates a brand new database and loads data into two new tables:

    Running Older Versions of JupyterLab

    Learn how to run an older version of JupyterLab via the user interface or command-line interface.

    Why Run an Older Version of JupyterLab?

    The primary reason to run an older version of JupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.

    Launching an Older Version via the User Interface (UI)

    1. From the main Platform menu, select Tools, then Tools Library.

    2. Find and select, from the list of tools, either JupyterLab with Python, R, Stata, ML, Image Processing or JupyterLab with Spark Cluster.

    3. From the tool detail page, click on the Versions tab.

    Launching an Older Version via the Command-Line Interface (CLI)

    1. Select the project in which you want to run JupyterLab.

    2. Launch the version of JupyterLab you want to run, substituting the version number for x.y.z in the following commands:

      • For JupyterLab without the Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.


    Running JupyterLab at "high" priority is not required. However, doing so ensures that your interactive session is not interrupted by spot instance termination.

    Accessing JupyterLab

    After launching JupyterLab, access the JupyterLab environment using your browser. To do this:

    1. Get the job ID for the job created when you launched JupyterLab. See the Monitoring Executions page for details on how to get the job ID, via either the UI or the CLI.

    2. Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.
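The substitution in step 2 can be scripted. This is a small sketch, assuming only the URL pattern given above; the helper name `jupyterlab_url` is hypothetical, and `job-xxxx` is a placeholder ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JupyterLab URL for a given job ID.
jupyterlab_url() {
  local job_id="$1"
  # Reject anything that does not look like a job ID.
  if [[ "${job_id}" != job-* ]]; then
    echo "expected an ID of the form job-xxxx, got: ${job_id}" >&2
    return 1
  fi
  printf 'https://%s.dnanexus.cloud\n' "${job_id}"
}

jupyterlab_url "job-xxxx"
```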

    Mkfifo and dx cat

    This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipes) special files, run the command man fifo in your shell.


    Named pipes require BOTH a stdin and stdout.

    Distributed by Region (sh)

    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:

    Parallel xargs by Chr

    This applet slices a BAM file by canonical chromosome and performs a parallelized SAMtools view.

    How is the SAMtools dependency provided?

    The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:

    Creating Charts and Dashboards

    Create charts, manage dashboards, and build visualizations to explore your datasets in the Cohort Browser.


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Create interactive visualizations and manage dashboard layouts in the Cohort Browser.


    Grouped Box Plot

    Learn to build and use grouped box plots in the Cohort Browser.


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use Grouped Box Plots

    Histogram

    Learn to build and use histograms in the Cohort Browser.


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use Histograms

    Running JupyterLab

    Learn to launch a JupyterLab session on the DNAnexus Platform, via the JupyterLab app.


    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    For DNAnexus Platform users, a license is required to access JupyterLab. Contact DNAnexus Sales for more information.

    Environment Variables

    The command-line client and the client bindings use a set of environment variables to communicate with the API server and to store state on the current default project and directory. These settings are set when you run dx login and can be changed through other dx commands. To display the active settings in human-readable format, use the dx env command:

    To print the bash commands for setting the environment variables to match what dx is using, you can run the same command with the --bash flag.

    Running a dx command from the command line does not (and cannot) overwrite your shell's environment variables. The environment variables are stored in the ~/.dnanexus_config/environment.json file.

    $D$
    $S_T$
    $S_T = \frac{L_{T0} - D}{L_{T0}}$
    $S$
    $S = S_{T-1} \cdot S_T$
    $S_{T-1}$
    Git Dependency
  • Mkfifo and dx cat

  • Parallel by Region (sh)

  • Parallel xargs by Chr

  • Precompiled Binary

  • R Shiny Example Web App

  • SAMtools count

  • TensorBoard Example Web App

  • Parallel by Region (py)
  • Pysam

  • Parallel xargs by Chr
    Getting started
    Bash Helpers
    Distributed by Chr (sh)
    Distributed by Region (sh)
    Python implementation
    subprocess
    Dash Example Web App
    Distributed by Region (py)
    Parallel by Chr (py)
    R Shiny Example Web App
    TensorBoard Example Web App
    particularly helpful Stack Exchange post
    Parallel by Chr (py)
    Parallel by Region (py)
    Parallel by Region (sh)
    Distributed by Chr (sh)
    Distributed by Region (py)
    Distributed by Region (sh)
    An execution cost limit is an optional limit on the usage charges an execution tree can incur
    CostLimitExceeded as the failure reason
    SpendingLimitExceeded
    Contact DNAnexus Sales
    Monthly project compute spending limits can be set by project admins
    billing and account management overview
    detailed explanation of setting org spending limit policies
    relational database clusters
    compute spending limit that applies to the project in which the job is running
    org-xxxx/describe method
    Billing page for each org
    dx describe project-id

    Select the version you'd like to run. Click the Run button.

    dx run app-dxjupyterlab/x.y.z --priority high
    .
  • For JupyterLab with the Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high

  • You may see an error message "502 Bad Gateway" if JupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.
    Monitoring Executions page
    Click Explore Data.

    Germline Variants - For datasets containing germline genomic variants

  • Somatic Variants - For datasets containing somatic variants and mutations

  • Gene Expression - For datasets containing molecular expression data

  • Analyzing Somatic Variants and Mutations - Explore cancer-related genetic changes
  • Analyzing Gene Expression Data - Examine molecular expression patterns

  • JupyterLab
    ingest your data
    sponsorship
    visualize your data using charts
    create specific cohorts
    build visualizations
    Creating Charts and Dashboards
    Defining and Managing Patient Cohorts
    Analyzing Germline Genomic Variants
    Selecting a dataset to explore
    Sample clinical dashboard with multiple tiles and cohorts
    Integer Categorical
  • Integer Categorical Multi-select

    | Supported Data Types | Limitations |
    | --- | --- |
    | Categorical | ≤20 distinct category values |
    | Categorical Multi-Select | ≤20 distinct category values |

    When to Use List Views for Categorical Data
    list view
    list views
    Stacked Row Chart
    Chart Totals and Missing Data
    ingesting data using Data Model Loader
    Row Chart in the Cohort Browser
    GLnexus
    GLnexus
    Getting started with GLnexus
    Apollo GLnexus
    Annotation flow

    insert_mode -> append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.

  • table_name -> array of table names, one for each corresponding CSV file by array index.

  • type -> the cluster type, "spark" for Spark apps

  • spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.

    The following examples run incomplete named pipes in background processes so the foreground script does not block.

    To approach this use case, outline the desired steps for the applet:

    1. Stream the BAM file from the platform to a worker.

    2. While the BAM streams, count the number of reads present.

    3. Write the result to a file.

    4. Stream the result file to the platform.

    Stream BAM file from the platform to a worker

    First, establish a named pipe on the worker. Then, stream to stdin of the named pipe and download the file as a stream from the platform using dx cat.

    | FIFO | stdin | stdout |
    | --- | --- | --- |
    | BAM file | YES | NO |

    Output BAM file read count

    Having created the FIFO special file representing the streamed BAM, you can call the samtools command as you normally would. The samtools command reading the BAM provides the BAM FIFO file with a stdout. However, remember that you want to stream the output back to the Platform. You must create a named pipe representing the output file too.

    | FIFO | stdin | stdout |
    | --- | --- | --- |
    | BAM file | YES | YES |
    | output file | YES | NO |

    The directory structure created here (~/out/counts_txt) is required to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> are uploaded to the corresponding <output name> specified in the dxapp.json.

    Stream the result file to the platform

    A stream from the platform has been established, piped into a samtools command, and the results are output to another named pipe. However, the background process remains blocked without a stdout for the output file. Creating an upload stream to the platform resolves this.

    Upload as a stream to the platform using the commands dx-upload-all-outputs or dx upload -. Specify --buffer-size when needed.

    | FIFO | stdin | stdout |
    | --- | --- | --- |
    | BAM file | YES | YES |
    | output file | YES | YES |

    Alternatively, dx upload - can upload directly from stdin, eliminating the need for the directory structure required for dx-upload-all-outputs. Warning: When uploading a file that exists on disk, dx upload is aware of the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not automatically known and dx upload uses default parameters. While these parameters are fine for most use cases, you may need to specify upload part size with the --buffer-size option.

    Wait for background processes

    With background processes running, wait in the foreground for those processes to finish.

    Without waiting, the app script running in the foreground would finish and terminate the job prematurely.
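The full pattern (a producer writing into a FIFO, a consumer running in the background, and a final wait) can be tried locally without the Platform. In this sketch, `printf` stands in for `dx cat`, `wc -l` for `samtools view -c`, and `cat` for the upload step; these stand-ins and the temporary paths are assumptions made for illustration. Waiting on each recorded PID is one alternative to `wait -n`.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
in_fifo="${workdir}/input.fifo"
out_fifo="${workdir}/output.fifo"
result="${workdir}/result.txt"
mkfifo "${in_fifo}" "${out_fifo}"

# Stand-in for `dx cat`: stream three records into the input FIFO.
printf 'read1\nread2\nread3\n' > "${in_fifo}" &
input_pid="$!"

# Stand-in for `samtools view -c`: count records, write to the output FIFO.
wc -l < "${in_fifo}" > "${out_fifo}" &
process_pid="$!"

# Stand-in for the upload: drain the output FIFO into a regular file.
cat "${out_fifo}" > "${result}" &
upload_pid="$!"

# Block until every background process has finished.
wait "${input_pid}" "${process_pid}" "${upload_pid}"

count=$(tr -d '[:space:]' < "${result}")
echo "records counted: ${count}"
rm -rf "${workdir}"
```

Each stage blocks until the other end of its FIFO is opened, which is why all three run in the background and the script waits only at the end.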

    How is the SAMtools dependency provided?

    The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the worker's root directory. In this case:

    When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

    /usr/bin is part of the $PATH variable, so the samtools command can be referenced directly in the script as samtools view -c ....

    View full source code on GitHub
    main
  • count_func

  • sum_reads

    main

    The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Regions are sent in batches of 10, as input, to the count_func entry point using the dx-jobutil-new-job command.

    Job outputs from the count_func entry point are referenced as job-based object references (JBORs) and used as inputs for the sum_reads entry point.

    The job output of the sum_reads entry point is used as the output of the main entry point, via a JBOR in the dx-jobutil-add-output command.
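The batching step can be exercised locally by printing each batch instead of launching a subjob. This is a sketch of the grouping logic only; the helper name `batch_regions` is hypothetical, and the printed line stands in for the arguments that would be passed to dx-jobutil-new-job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: group region names into batches and print one line
# of -iregions= arguments per batch.
# Usage: batch_regions <batch size> <region>...
batch_regions() {
  local batch_size="$1"; shift
  local batch=() r
  for r in "$@"; do
    batch+=("-iregions=${r}")
    if (( ${#batch[@]} == batch_size )); then
      printf '%s\n' "${batch[*]}"
      batch=()
    fi
  done
  # Emit the final, smaller batch if any regions remain.
  if (( ${#batch[@]} > 0 )); then
    printf '%s\n' "${batch[*]}"
  fi
}

regions=(chr1 chr2 chr3 chr4 chr5)
batch_regions 2 "${regions[@]}"
```

With a batch size of 2 and five regions, the last line holds the single leftover region, mirroring the "remaining regions" case in the applet.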

    count_func

    This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.

    Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command dx-jobutil-add-output.

    sum_reads

    The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcounts.txt files generated by the count_func jobs and sums the totals.

    This entry point returns read_sum as a JBOR, which is then referenced as job output.

    In the main function, the output is referenced:

    View full source code on GitHub

    When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

    /usr/bin is part of the $PATH variable, so in the script, you can reference the samtools command directly, as in samtools view -c ....

    Parallel Run

    Slice BAM

    First, download the BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.

    To split a BAM by regions, you need to have a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, the *.bai is generated in the applet, sorting the BAM if necessary.
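To see how a canonical-chromosome filter behaves, the selection can be tested on sample header lines without a BAM file. This is a sketch under stated assumptions: the helper name `canonical_chromosomes` is hypothetical, the pattern is deliberately stricter than the one used in the applet, and it assumes SN: is the second field of each @SQ line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read SAM header lines on stdin and print only
# canonical chromosome names (1-22, X, Y, M/MT, with or without "chr").
canonical_chromosomes() {
  awk -F '\t' '$1 == "@SQ" {
    name = $2
    sub(/^SN:/, "", name)
    if (name ~ /^(chr)?([0-9]+|X|Y|M|MT)$/) print name
  }'
}

# Sample header lines; in the applet these come from `samtools view -H`.
printf '@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chrUn_KI270302v1\tLN:2274\n@SQ\tSN:chrX\tLN:156040895\n' \
  | canonical_chromosomes
```

Unplaced contigs such as chrUn_KI270302v1 are dropped, so only chr1 and chrX are printed for this sample header.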

    Xargs SAMtools view

    In the previous section, the name of each sliced BAM file was recorded into a record file. Next, perform a samtools view -c on each slice using the record file as input.
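The record-file pattern can be tried locally with ordinary text files. In this sketch, `wc -l` stands in for `samtools view -c`, the file names are placeholders, and the `-P 2` option (supported by GNU and BSD xargs) is an addition that runs up to two counts at a time.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# Two stand-in "slices" and a record file listing them, one path per line.
printf 'a\nb\n' > "${workdir}/slice_1.txt"
printf 'c\nd\ne\n' > "${workdir}/slice_2.txt"
printf '%s\n' "${workdir}/slice_1.txt" "${workdir}/slice_2.txt" > "${workdir}/record.txt"

# Count each slice (up to 2 at a time), then sum the per-slice counts.
total=$(
  < "${workdir}/record.txt" xargs -P 2 -I {} sh -c 'wc -l < "$1"' _ {} \
    | awk '{ s += $1 } END { print s }'
)

echo "Total Count: ${total}"
rm -rf "${workdir}"
```

Summing with awk makes the result independent of the order in which the parallel counts finish.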

    Upload results

    The results file is uploaded using the standard bash process:

    1. Upload a file to the job execution's container.

    2. Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>.

    View full source code on GitHub
    ~/.dnanexus_config/environment.json
    file.

    Configuration File Prioritization

    DNAnexus utilities load values from configuration sources in the following order of priority:

    1. Command line options (if available)

    2. Environment variables already set in the shell

    3. ~/.dnanexus_config/environment.json (dx configuration file)

    4. Hardcoded defaults
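The priority order above amounts to "the first source that provides a value wins". The following sketch illustrates that rule only; it is not how dx itself is implemented, and the helper name `resolve_setting` and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical resolver: print the first non-empty candidate.
# Usage: resolve_setting <cli value> <env value> <config-file value> <default>
resolve_setting() {
  local candidate
  for candidate in "$@"; do
    if [[ -n "${candidate}" ]]; then
      printf '%s\n' "${candidate}"
      return 0
    fi
  done
  return 1
}

# No command-line option is given, so the shell's environment variable wins
# over the value stored in the configuration file.
resolve_setting "" "project-from-env" "project-from-config" "project-default"
```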

    Overriding the dx Configuration File

    The dx command always prioritizes the environment variables that are set in the shell. This means that if you have set your environment variable for DX_SECURITY_CONTEXT and then use dx login to log in as a different user, it still uses the original environment variable. When not run in a script, it prints a warning to stderr whenever the environment variables and its stored state have a mismatch. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables is generally used within a shell script or as part of a job environment in the cloud.

    In the interaction below, environment variables have already been set. The user then runs dx login to log in, but the new login is still overridden by the shell's environment variables.

    Clearing dx-set Variables

    If you instead want to discard the values which dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.

    Command Line Options

    Most dx commands have the following additional flags to temporarily override the values of the respective variables.

    For example, you can temporarily override the current default project used:

    dx run app-glnexus \
        -i common.gvcf_manifest=<manifest_file_id> \
        -i common.config=gatk_unfiltered \
        -i common.targets_bed=<bed_target_ranges>
    dx run workflow-glnexus \
        -i common.gvcf_manifest=<manifest_file_id> \
        -i common.config=gatk_unfiltered \
        -i common.targets_bed=<bed_target_ranges> \
        -i unify.shards_bed=<bed_genomic_partition_ranges> \
        -i etl.shards=<num_sample_partitions>
    dx run app-csv-loader \
       -i database_name=pheno_db \
       -i create_mode=strict \
       -i insert_mode=append \
       -i spark_read_csv_header=true \
       -i spark_read_csv_sep=, \
       -i spark_read_csv_infer_schema=true \
       -i csv=file-xxxx \
       -i table_name=sample_metadata \
       -i csv=file-yyyy \
       -i table_name=gwas_result
    mkdir workspace
    mappings_fifo_path="workspace/${mappings_bam_name}"
    mkfifo "${mappings_fifo_path}" # FIFO file is created
    dx cat "${mappings_bam}" > "${mappings_fifo_path}" &
    input_pid="$!"
    mkdir -p ./out/counts_txt/
    
    counts_fifo_path="./out/counts_txt/${mappings_bam_prefix}_counts.txt"
    
    mkfifo "${counts_fifo_path}" # FIFO file is created, readcount.txt
    samtools view -c "${mappings_fifo_path}" > "${counts_fifo_path}" &
    process_pid="$!"
    wait -n  # "$input_pid"
    wait -n  # "$process_pid"
    wait -n  # "$upload_pid"
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── < samtools binary >
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    # Extract list of reference regions from BAM header
    regions=$(
      samtools view -H "${mappings_sorted_bam_name}" | \
      grep "@SQ" | \
      sed 's/.*SN:\(\S*\)\s.*/\1/'
    )
    
    echo "Segmenting into regions"
    
    count_jobs=()
    counter=0
    temparray=()
    
    # Loop through each region
    for r in $regions; do
      if [[ "${counter}" -ge 10 ]]; then
        echo "${temparray[@]}"
        count_jobs+=($(
          dx-jobutil-new-job \
            -ibam_file="${mappings_sorted_bam}" \
            -ibambai_file="${mappings_sorted_bai}" \
            "${temparray[@]}" \
            count_func
        ))
        temparray=()
        counter=0
      fi
      # Add region to temp array of -i<parameter>s
      temparray+=("-iregions=${r}")
      counter=$((counter + 1))
    done
    
    # Handle remaining regions (less than 10)
    if [[ $counter -gt 0 ]]; then
      echo "${temparray[@]}"
      count_jobs+=($(
        dx-jobutil-new-job \
          -ibam_file="${mappings_sorted_bam}" \
          -ibambai_file="${mappings_sorted_bai}" \
          "${temparray[@]}" \
          count_func
      ))
    fi
    echo "Merge count files, jobs:"
    echo "${count_jobs[@]}"
    readfiles=()
    for count_job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${count_job}:counts_txt")
    done
    echo "file name: ${sorted_bamfile_name}"
    echo "Set file, readfile variables:"
    echo "${readfiles[@]}"
    countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    count_func() {
    
      set -e -x -o pipefail
    
      echo "Value of bam_file: '${bam_file}'"
      echo "Value of bambai_file: '${bambai_file}'"
      echo "Regions being counted '${regions[@]}'"
    
    
      dx-download-all-inputs
    
    
      mkdir workspace
      cd workspace || exit
      mv "${bam_file_path}" .
      mv "${bambai_file_path}" .
      outputdir="./out/samtool/count"
      mkdir -p "${outputdir}"
      samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
    
    
      counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    }
    sum_reads() {
    
      set -e -x -o pipefail
      echo "$filename"
    
      echo "Value of read file array '${readfiles[@]}'"
      dx-download-all-inputs
      echo "Value of read file path array '${readfiles_path[@]}'"
    
      echo "Summing values in files"
      readsum=0
      for read_f in "${readfiles_path[@]}"; do
        temp=$(cat "$read_f")
        readsum=$((readsum + temp))
      done
    
      echo "Total reads: ${readsum}" > "${filename}_counts.txt"
    
      read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
      dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
    }
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
      counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── < samtools binary >
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    dx download "${mappings_bam}"
    
    indexsuccess=true
    bam_filename="${mappings_bam_name}"
    samtools index "${mappings_bam_name}" || indexsuccess=false
    if [[ $indexsuccess == false ]]; then
      samtools sort -o "${mappings_bam_name}" "${mappings_bam_name}"
      samtools index "${mappings_bam_name}"
      bam_filename="${mappings_bam_name}"
    fi
    
    chromosomes=$( \
      samtools view -H "${bam_filename}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    
    for chr in $chromosomes; do
      samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    counts_txt_name="${mappings_bam_prefix}_count.txt"
    
    sum_reads=$( \
      <bamfiles.txt xargs -I {} samtools view -c $view_options '{}' \
      | awk '{s+=$1} END {print s}')
    echo "Total Count: ${sum_reads}" > "${counts_txt_name}"
    $ dx env
    Auth token used         adLTkSNkjxoAerREqbB1dVkspQzCOuug
    API server protocol     https
    API server host         api.dnanexus.com
    API server port         443
    Current workspace       project-9zVpbQf4Zg2641v5BGY00001
    Current workspace name  "Scratch Project"
    Current folder          /
    Current user            alice
    $ dx env --bash
    export DX_SECURITY_CONTEXT='{"auth_token_type": "bearer", "auth_token": "adLTkSNkjxoAerREqbB1dVkspQzCOuug"}'
    export DX_APISERVER_PROTOCOL=https
    export DX_APISERVER_HOST=api.dnanexus.com
    export DX_APISERVER_PORT=443
    export DX_PROJECT_CONTEXT_ID=project-9zVpbQf4Zg2641v5BGY00001
    $ dx ls -l
    Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
    Folder : /
    <Contents of Sample Project>
    $ dx login
    Acquiring credentials from https://auth.dnanexus.com
    Username: alice
    Password:
    
    Note: Use "dx select --level VIEW" or "dx select --public" to select from
    projects for which you only have VIEW permissions.
    
    Available projects:
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Mouse (ADMINISTER)
    
    Pick a numbered choice [1]: 2
    Setting current project to: Mouse
    $ dx ls
    WARNING: The following environment variables were found to be different than the
    values last stored by dx: DX_SECURITY_CONTEXT, DX_PROJECT_CONTEXT_ID
    To use the values stored by dx, unset the environment variables in your shell by
    running "source ~/.dnanexus_config/unsetenv". To clear the dx-stored values,
    run "dx clearenv".
    Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
    Folder : /
    <Contents of Sample Project>
    $ source ~/.dnanexus_config/unsetenv
    $ dx ls -l
    Project: Mouse (project-9zVpbQf4Zg2641v5BGY00002)
    Folder : /
    <Contents of Mouse>
    $ dx --env-help
    usage: dx command ... [--apiserver-host APISERVER_HOST]
                          [--apiserver-port APISERVER_PORT]
                          [--apiserver-protocol APISERVER_PROTOCOL]
                          [--project-context-id PROJECT_CONTEXT_ID]
                          [--workspace-id WORKSPACE_ID]
                          [--security-context SECURITY_CONTEXT]
                          [--auth-token AUTH_TOKEN]
    
    optional arguments:
      --apiserver-host APISERVER_HOST
                            API server host
      --apiserver-port APISERVER_PORT
                            API server port
      --apiserver-protocol APISERVER_PROTOCOL
                            API server protocol (http or https)
      --project-context-id PROJECT_CONTEXT_ID
                            Default project or project context ID
      --workspace-id WORKSPACE_ID
                            Workspace ID (for jobs only)
      --security-context SECURITY_CONTEXT
                            JSON string of security context
      --auth-token AUTH_TOKEN
                            Authentication token
    $ dx env --project-context-id project-B0VK6F6gpqG6z7JGkbqQ000Q
    Auth token used         R54BN6Ws6Zl3Y0VqBA9o1qweUswYW5o4
    API server protocol     https
    API server host         api.dnanexus.com
    API server port         443
    Current workspace       project-B0VK6F6gpqG6z7JGkbqQ000Q
    Current folder          /
    If you'd like to filter your dataset to specific samples, see Defining and Managing Cohorts.

    Managing Dashboards

    Dashboards contain your charts and define their layout. Each such configuration is called a dashboard view. Dashboard views can be specific to a saved cohort or standalone (custom dashboard view). You can create multiple dashboard views, allowing you to switch between different visualizations and analyses.

    By using Dashboard Actions, you can save or load your own dashboard views. This lets you quickly switch between different visualizations without having to set them up each time.

    • Save Dashboard View - Saves the current dashboard configuration as a record of the DashboardView type, including all tiles and their settings.

    • Load Dashboard View - Loads a custom dashboard view, restoring the tiles and their configurations.

    Using Dashboard Actions

    After loading a dashboard view once, you can access it again from Dashboard Actions > Custom Dashboard Views.

    Moving dashboards between datasets? If you want to use your dashboard views with a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your custom dashboard configurations to a new target dataset.

    Visualizing Data

    Add charts to your dashboards to visualize the clinical and phenotypical data in your dataset. For example, you can add charts to display patient demographics or clinical measurements.

    Working with Multi-Assay Visualizations

    For omics datasets, such as those for germline variants, somatic variants, or gene expression, you have additional predefined visualization options available:

    • Germline and somatic variants are visualized using lollipop plots and variant frequency matrices. For details, see Analyzing Germline Variants and Analyzing Somatic Variants.

    • Gene expression data is visualized using expression level and feature correlation charts. For details, see .

    Adding Tiles to Visualize Data

    Each chart is represented as a tile on the dashboard. You can add multiple tiles to visualize different aspects of your data.

    1. In the Overview tab, click + Add Tile on the top-right.

    2. In the hierarchical list of the dataset fields, select the field you want to visualize.

    3. In Data Field Details, choose your preferred chart type.

      • The available chart types depend on the field's value type.

    4. Click Add Tile.

    The tile appears on the dashboard with the current cohort data. You can add up to 15 tiles.

    Creating Multi-Variable Charts

    When selecting data fields to visualize, you can add a secondary data field to create a multi-variable chart. This allows you to visualize relationships between two data fields in the same chart.

    First, select your primary data field from the hierarchical list. This opens a Data Field Details panel, showing the field's information and a preview of a basic chart.

    To add a secondary field, keep the primary field selected and search for the desired field. When you find it, click the Add as Secondary Field icon (+) next to its name rather than selecting it directly. This adds the new field to the visualization. The Data Field Details panel updates to show the combined information for both fields.

    You can click the + icon only when at least one chart type is supported for the specified combination.

    For certain chart types, such as Stacked Row Chart and Scatter Plot, you can re-order the primary and secondary data fields by dragging the data field in Data Field Details.

    Adding grouped box plot by combining two data fields

    For more details on multi-variable charts, including how to build a survival curve, see Multi-Variable Charts.

    Chart Optimization

    When working with large datasets, keep these tips in mind:

    • Limit dashboard tiles: To ensure fast loading times and a clear overview, it's best to limit the number of charts on a single dashboard. Typically, 8-10 tiles is a good number for human comprehension and optimal performance.

    • Filter data first: Reduce the volume of data by applying filters before you create complex visualizations. This improves chart loading speed.

    Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.

    When creating a grouped box plot:

    • The primary field must contain categorical or categorical multiple data

    • The primary field must contain no more than 15 distinct category values

    • The secondary field must contain numerical data

    Supported Data Types

    Primary Field

    Secondary Field

    Categorical or Categorical Multiple (<=15 categories)

    Numerical (Integer) or Numerical (Float)

    Using Grouped Box Plots in the Cohort Browser

    The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:

    Grouped Box Plot

    Non-Numeric Values in Grouped Box Plots

    A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart above for an example of the informational message that shows below the chart, in this scenario.

    Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
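
    As a rough illustration of what a grouped box plot summarizes, the Python sketch below groups records by a categorical field, sets non-numeric values aside with a count of how often each appears, and computes a five-number summary per group. The record layout, the function name grouped_box_stats, and the quartile method are assumptions made for this example; the Cohort Browser's own calculation may differ.

```python
from collections import Counter, defaultdict
from statistics import quantiles


def grouped_box_stats(records, group_field, value_field):
    """Group numeric values by category and summarize each group.

    Returns (stats, non_numeric), where non_numeric counts each
    non-numeric value that was left out of the summaries.
    """
    groups = defaultdict(list)
    non_numeric = Counter()
    for record in records:
        value = record.get(value_field)
        if value is None:
            continue  # null values are not plotted
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            non_numeric[value] += 1
            continue
        groups[record[group_field]].append(value)

    stats = {}
    for name, values in groups.items():
        values.sort()
        if len(values) >= 2:
            # Assumed quartile method; the product may use another one.
            q1, median, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = median = q3 = values[0]
        stats[name] = {
            "min": values[0], "q1": q1, "median": median,
            "q3": q3, "max": values[-1],
        }
    return stats, non_numeric


# Hypothetical records using the field names from the example above.
records = [
    {"Doctor": "Dr. A", "Visit Feeling": 1},
    {"Doctor": "Dr. A", "Visit Feeling": 2},
    {"Doctor": "Dr. A", "Visit Feeling": 3},
    {"Doctor": "Dr. A", "Visit Feeling": 4},
    {"Doctor": "Dr. A", "Visit Feeling": 5},
    {"Doctor": "Dr. A", "Visit Feeling": "unknown"},
    {"Doctor": "Dr. B", "Visit Feeling": 10},
    {"Doctor": "Dr. B", "Visit Feeling": None},
]
stats, skipped = grouped_box_stats(records, "Doctor", "Visit Feeling")
```

    Here stats holds one summary per doctor, and skipped records that the value "unknown" appeared once and was excluded.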

    Grouped Box Plot: Detail on Non-Numeric Values

    Outliers

    Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:

    Outlier Value in a Grouped Box Plot

    This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. In multiple groups, one member was recorded as consuming far more cups of coffee per day than others in the group.

    Grouped Box Plots in Cohort Compare

    In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.

    In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.

    Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group.

    Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.

    Grouped Box Plot in cohort compare mode

    Preparing Data for Visualization in Grouped Box Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in grouped box plots:

    Primary Field

    • String Categorical

    • String Categorical Multi-Select

    • String Categorical Sparse

    • Integer Categorical

    • Integer Categorical Multi-Select

    Secondary Field

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

    Histograms can be used to visualize numerical, date, and datetime data.
    Supported Data Types

    Numerical (Integer)

    Numerical (Float)

    Date

    Datetime

    Using Histograms in the Cohort Browser

    In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or similar values, in a particular field.

    The Cohort Browser automatically groups records into bins, based on the distribution of values in the dataset, for the field. Values are distributed in a linear fashion, on the x axis.
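
    The binning idea can be sketched in Python as follows. This is only an illustration of equal-width bins on a linear axis; the function name equal_width_bins and the fixed bin count are assumptions, and the rule the Cohort Browser uses to choose bins is not described here.

```python
def equal_width_bins(values, bin_count):
    """Count values in bin_count equal-width bins spanning min..max.

    Returns a list of (lower_edge, upper_edge, count) tuples.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for value in values:
        # The maximum value falls into the last bin rather than past it.
        index = min(int((value - lo) / width), bin_count - 1)
        counts[index] += 1
    return [
        (lo + i * width, lo + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]


# Hypothetical values for a numeric field.
bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bin_count=5)
```

    Each tuple corresponds to one vertical bar: its edges give the bar's position on the x axis and its count gives the bar's height.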

    Below is a sample histogram showing the distribution of values in a field Critical care total days. The label under the chart title indicates the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.

    Histogram in the Cohort Browser

    Customizing Chart Display

    You can customize how histogram data is displayed by clicking ⛭ Chart Settings in the chart toolbar.

    Histogram chart settings showing Display Statistics, Transform Data, and Chart Type options

    For data with wide value ranges or skewed distributions, you can apply logarithmic scaling to either or both axes:

    • log₂ - Values transformed using $f(x) = \text{sign}(x) \cdot \log_2(|x| + 1)$

    • log₁₀ - Values transformed using $f(x) = \text{sign}(x) \cdot \log_{10}(|x| + 1)$

    When you apply logarithmic transformation, the axis label updates to show the transformation type (log₂ or log₁₀).
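
    The signed logarithmic transformation can be reproduced in a few lines of Python. This is a minimal sketch of the formula sign(x) · log(|x| + 1) only; the function names are chosen for this example and are not part of the Platform.

```python
import math


def signed_log2(x: float) -> float:
    """Return sign(x) * log2(|x| + 1)."""
    return math.copysign(math.log2(abs(x) + 1), x)


def signed_log10(x: float) -> float:
    """Return sign(x) * log10(|x| + 1)."""
    return math.copysign(math.log10(abs(x) + 1), x)


# Adding 1 keeps zero at zero, and the sign factor keeps negative
# values negative instead of leaving them undefined.
samples = [-7, 0, 1, 7, 999]
log2_values = [signed_log2(v) for v in samples]
log10_values = [signed_log10(v) for v in samples]
```

    The transformation is symmetric around zero, so a value and its negative are mapped to the same distance from the origin.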

    Non-Numeric Data in Histograms

    A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you see the following informational message below the chart:

    Histogram Displaying Data for a Field Containing Non-Numeric Values

    Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:

    Detail on Non-Numeric Values Omitted from a Histogram

    In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.

    Histogram in Cohort Compare Mode

    See Comparing Cohorts for more on using Cohort Compare mode.

    Preparing Data for Visualization in Histograms

    When ingesting data using Data Model Loader, the following data types can be visualized in histograms:

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

    • Date

    • Date Sparse

    • Datetime

    • Datetime Sparse

    Running from the UI
    1. In the main menu, navigate to Tools > JupyterLab. If you have used JupyterLab before, the page shows your previous sessions across different projects.

    2. Click New JupyterLab.

    3. Configure your JupyterLab session:

      • Specify the session name and select an instance type.

      • Choose the project where JupyterLab should run.

      • Set the session duration after which the environment automatically shuts down.

      • Optionally, provide a snapshot file to load a previously saved environment.

      • If needed, enable Spark Cluster and set the number of nodes.

    4. Select a feature option based on your analysis needs:

      • PYTHON_R (default): Python3 and R kernel and interpreter

      • ML: Python3 with machine learning packages (TensorFlow, PyTorch, CNTK) and image processing (Nipype), but no R

    5. Review the pricing estimate (if you have billing access) based on your selected duration and instance type.

    6. Click Start Environment to launch your session. The JupyterLab session shows an "Initializing" state while the worker spins up and the server starts.

    7. Open your JupyterLab environment by clicking the session name link once the state changes to "Ready". You can also access it directly via https://job-xxxx.dnanexus.cloud, where job-xxxx is your job ID.

    Snapshots created using older versions of JupyterLab are incompatible with the current version. If you need to use an older JupyterLab snapshot, see environment snapshot guidelines.

    For a detailed list of libraries included in each feature option, see the in-product documentation.

    Running JupyterLab from the CLI

    You can start the JupyterLab environment directly from the command line by running the app:

    Once the app starts, you can check whether the JupyterLab server is ready to serve connections. Readiness is indicated by the job's httpsAppState property being set to running. Once it is running, open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
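
    The readiness check can be expressed as a small Python helper. This sketch assumes you already have the job's description as a dictionary with a "properties" mapping (for example, parsed from JSON describe output); that layout and both function names are assumptions made for illustration.

```python
def jupyterlab_is_ready(job_description: dict) -> bool:
    """Return True once the job's httpsAppState property is 'running'."""
    properties = job_description.get("properties") or {}
    return properties.get("httpsAppState") == "running"


def jupyterlab_url(job_id: str) -> str:
    """Build the browser URL for the job running JupyterLab."""
    return f"https://{job_id}.dnanexus.cloud"


# Hypothetical job descriptions; job-xxxx is a placeholder ID.
starting = {"id": "job-xxxx", "properties": {}}
running = {"id": "job-xxxx", "properties": {"httpsAppState": "running"}}

if jupyterlab_is_ready(running):
    print(jupyterlab_url(running["id"]))
```

    In practice you would call the check repeatedly, with a pause between attempts, until it returns True.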

    To run the Spark version of the app, use the command:

    You can check the optional input parameters for the apps on the DNAnexus Platform (platform login required to access the links):

    • JupyterLab App

    • JupyterLab Spark Cluster App

    From the CLI, you can learn more about dx run with the following command:

    where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.

    Next Steps

    See the Quickstart and References pages for more details on how to use JupyterLab.

    Projects

    Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.

    About Projects

    Within the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.

    Projects have a series of features designed for collaboration, helping project members coordinate and organize their work, and ensuring appropriate control over both data and tools.

    See the for details on how to create a project, share it with other users, and run an analysis.

    Managing Project Content

    A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.

    Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.

    The following are four common actions you can perform on objects from within the Manage screen.

    Downloading Files

    You can directly download file objects from the system.

    1. Select the file's row.

    2. Click More Actions (⋮).

    3. From the list of available actions, select Download.

    Getting More Information on Objects

    To learn more about an object:

    1. Select its row, then click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen.

    2. Select the row showing the name of the object about which you want to know more. An info panel opens on the right, displaying a range of information about the object. This includes its unique ID, as well as metadata about its owner, time of creation, size, tags, properties, and more.

    Deleting Objects

    Deletion is permanent and cannot be undone.

    To delete an object:

    1. Select its row.

    2. Click More Actions (⋮).

    3. From the list of available actions, select Delete.

    Copying Data to Another Project

    To copy a data object or objects to another project, you must have CONTRIBUTE or ADMINISTER access to that project.

    1. Select the object or objects you want to copy to a new project, by clicking the box to the left of the name of each object in the objects list.

    2. Click the Copy button in the upper right corner of the Manage screen. A modal window opens.

    3. Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.

    Access and Sharing

    Adding Project Members

    You can collaborate on the platform by . On sharing a project with a user, or group of users in an , they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.

    Removing Project Members

    To remove a user or org from a project to which you have ADMINISTER access:

    1. On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window opens, showing a list of project members.

    2. Find the row showing the user you want to remove from the project.

    3. Move your mouse over that row, then click the Remove from Members button at the right end of the row.

    Project Access Levels

    Access Level
    Description

    Project Access Levels: Two Examples

    Suppose you have a set of samples sequenced at your lab, and you have a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.

    Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project, then upload your sequenced data to the project. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You both are then able to use each other's data to perform downstream analyses.

    Restricting Access to Executables

    A project admin can configure a project to allow project members to run only specific executables as root executions. The list of allowed executables is set by entering the following command, via the CLI:

    This command overwrites any existing list of allowed executables.

    To discard the allowed executables list, that is, let project members run all available executables as root executions, enter the following command:

    Executables that are called by a permitted executable can run even if they are not included in the list.

    Project Data Access Controls

    Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false and you can view and modify them via the CLI and the Platform API. In the project's Settings web screen, you can view and modify the protected, restricted, downloadRestricted, previewViewerRestricted, externalUploadRestricted, and containsPHI settings as described below.

    • protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.

    • restricted: If set to true,

    PHI Data Protection

    Only projects billed to org billing accounts can have PHI Data Protection enabled.

    A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. for more information.

    Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

    When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:

    • Data in this project cannot be cloned to other projects that do not have containsPHI set to true

    • Any jobs that run in non-PHI projects cannot access any data that can only be found in PHI projects

    • Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging onto the Platform and opening the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.

    Billing and Charges

    On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account to which invoices are sent, covering all billable activities carried out within the project.

    For information on configuring your billing account, see .

    You link a project to a billing account, that is an organization that the expenses are billed to, when you .

    Monthly Project Spending and Usage Limits

    Licenses are required for both the Monthly Project Compute and Egress Usage Limit and Monthly Project Storage Spending Limit features. for more information.

    The Monthly Project Usage Limit for Compute and Egress and Monthly Project Storage Spending Limit features can help project admins monitor and keep project costs under control. For more information, see .

    In the project's Settings tab under the Usage Limits section, project admins can view the project's compute and egress usage limits.

    For details on how to set and retrieve project-specific compute and egress usage limits, and storage spending limits, see the .

    Transferring Project Billing Responsibility

    Transferring Billing Responsibility to Another User

    If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:

    1. On the project's Settings screen, scroll down to the Administration section.

    2. Click the Transfer Billing button. A modal window opens.

    3. Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.

    The user receives an email notification of your request. To finalize the transfer, they need to log onto the Platform and formally accept it.

    Transferring Billing Responsibility to an Org

    If you have billable activities access in the org to which you wish to transfer the project, you can change the billing account of the project to the org. To do this, navigate to the project settings page by clicking on the gear icon in the project header. On the project settings page, you can then select the billing account to which the project should be billed.

    If you do not have billable activities access in the org you wish to transfer the project to, you need to transfer the project to a user who does have this access. The recipient is then able to follow the instructions below to accept a project transfer on behalf of an org.

    Cancelling a Transfer of Billing Responsibility

    You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the recipient. To do this:

    1. Select All Projects from the Projects link in the main menu. Open the project. You see a Pending Project Ownership Transfer notification at the top of the screen.

    2. Click the Cancel Transfer button to cancel the transfer.

    Accepting a Transfer Request

    When another user initiates a project transfer to you, you receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.

    If you did not already have access to the project being transferred, you receive VIEW access and the project appears in the list on the Projects screen.

    To accept the transfer:

    1. Open the project. You see a Pending Project Ownership Transfer notification in the project header.

    2. Click the Accept Transfer button.

    3. Select a new billing account for the project from the dropdown of eligible accounts.

    Projects with PHI Data Protection Enabled

    If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.

    Sponsored Projects

    Ownership of may not be transferred without the sponsorship first being terminated.

    Project Sponsorship

    A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.

    On setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.

    Billing responsibility for sponsored projects may not be transferred.

    Sponsored projects may not be deleted, without the project sponsor first ending the sponsorship, by changing its end date to a date in the past.

    For more information about sponsorship, contact .

    Learn More

    for detailed information on projects that are billed to an org.

    Learn about accessing and working with projects via the CLI:

    Learn about working with projects as a developer:

    Pysam

    This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.

    View full source code on GitHub

    How is Pysam provided?

    Pysam is installed with the pip3 package manager, declared in the dxapp.json's runSpec.execDepends property:

    The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, pip3 is specified as the package manager and pysam version 0.15.4 as the dependency to resolve.
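
    For orientation, the dependency entry described above can be written out as a Python dictionary and rendered as JSON. This is a sketch of the relevant fragment only, based on the package manager and version named in the text; the applet's actual dxapp.json contains further runSpec fields that are not shown.

```python
import json

# Fragment of a dxapp.json runSpec: one dependency, resolved with pip3
# before the applet's source code runs.
run_spec_fragment = {
    "execDepends": [
        {
            "name": "pysam",
            "package_manager": "pip3",
            "version": "0.15.4",
        }
    ]
}

print(json.dumps({"runSpec": run_spec_fragment}, indent=2))
```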

    Downloading Input

    The fields mappings_sorted_bam and mappings_sorted_bai are passed to the main function as parameters for the job. These parameters are dictionary objects with key-value pair {"$dnanexus_link": "<file>-<xxxx>"}. File objects from the platform are handled through handles. If an index file is not supplied, then a *.bai index is created.
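
    To make the shape of these parameters concrete, the sketch below extracts the file ID from a link dictionary using plain Python. The helper name file_id_from_link and the sample IDs are hypothetical; in the applet itself the dx-toolkit Python SDK resolves such links, and this example does not reproduce that code.

```python
def file_id_from_link(link: dict) -> str:
    """Return the object ID referenced by a {"$dnanexus_link": ...} dict.

    Handles the simple form, where the value is the ID string, and the
    extended form, where the value is a dict carrying an "id" key.
    """
    target = link["$dnanexus_link"]
    if isinstance(target, dict):
        return target["id"]
    return target


# Hypothetical job inputs; file-xxxx is a placeholder ID.
mappings_sorted_bam = {"$dnanexus_link": "file-xxxx"}
mappings_sorted_bai = None  # index not supplied, so one must be created

bam_id = file_id_from_link(mappings_sorted_bam)
needs_index = mappings_sorted_bai is None
```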

    Working with Pysam

    Pysam provides key methods that mimic SAMtools commands. In this applet example, the focus is only on canonical chromosomes. The Pysam object representation of a BAM file is pysam.AlignmentFile.

    The helper function get_chr

    Once a list of canonical chromosomes is established, you can iterate over them and perform the Pysam version of samtools view -c, pysam.AlignmentFile.count.
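
    The two steps, selecting canonical chromosomes and counting reads on each, might look like the following sketch. The regular expression defining "canonical" (chromosomes 1 to 22, X, and Y, with or without a chr prefix) and both function names are assumptions for illustration, not the applet's own helper. The alignment argument is expected to behave like a pysam.AlignmentFile opened on an indexed BAM, whose count method accepts a contig name.

```python
import re

# Assumed definition of "canonical": 1-22, X, Y, optional "chr" prefix.
CANONICAL = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$")


def canonical_references(names):
    """Keep only reference names that look like canonical chromosomes."""
    return [name for name in names if CANONICAL.match(name)]


def count_by_reference(alignment, names):
    """Count reads per reference, the analogue of `samtools view -c`."""
    return {name: alignment.count(contig=name) for name in names}


# Hypothetical reference names as they might appear in a BAM header.
header_names = ["chr1", "chr2", "chr22", "chrX", "chrM", "chr1_KI270706v1_random"]
selected = canonical_references(header_names)
```

    With pysam installed, the names would typically come from the references attribute of the opened alignment file, and count_by_reference would be called with that file object.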

    Uploading Outputs

    The summarized counts are returned as the job output. The dx-toolkit Python SDK function uploads and generates a DXFile corresponding to the tabulated result file.

    Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The function generates the appropriate DXLink value.

    Distributed by Region (sh)

    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    main

    The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Every 10 regions are sent, as input, to the count_func entry point using command.
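
    The batching step, splitting the list of regions into groups of 10, can be illustrated in Python. The app itself is written in bash, so this is only a sketch of the logic; the function name chunk and the sample region names are made up for the example.

```python
def chunk(items, size=10):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical region names; a real list would come from the BAM header.
regions = [f"region_{n}" for n in range(1, 26)]
batches = chunk(regions, size=10)

for batch in batches:
    # Each batch would become the input of one count_func subjob.
    print(len(batch), batch[0], batch[-1])
```

    With 25 regions this yields three batches, so three subjobs would be launched, the last one handling the remaining 5 regions.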

    Job outputs from the count_func entry point are referenced as Job Based Object References and used as inputs for the sum_reads entry point.

    The job output of the sum_reads entry point is used as the output of the main entry point via JBOR reference in the command.

    count_func

    This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.

    Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command .

    sum_reads

    The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
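
    The gathering step amounts to reading every count file and adding up the numbers. The Python sketch below shows that logic under the assumption that each file holds one integer per line; the real file layout and the app's bash implementation may differ, and the function name is hypothetical.

```python
def sum_read_counts(file_contents):
    """Sum every integer found, one per line, across all count files."""
    total = 0
    for text in file_contents:
        for line in text.splitlines():
            line = line.strip()
            if line:  # skip blank lines
                total += int(line)
    return total


# Hypothetical contents of three readcount.txt files.
contents = ["120\n340\n", "75\n", "\n15\n"]
read_sum = sum_read_counts(contents)
```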

    This entry point returns read_sum as a JBOR, which is then referenced as job output.

    In the main function, the output is referenced

    Distributed by Chr (sh)

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.

    For additional information, see execDepends.

    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:
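
    As a sketch of what such a setting can look like, the fragment below maps each of this app's entry points to an instance type and renders it as JSON. The instance type names are placeholders chosen for illustration, not the values used by the app.

```python
import json

# Fragment of a dxapp.json runSpec: one instance type per entry point.
# The instance type names below are placeholder assumptions.
system_requirements = {
    "main": {"instanceType": "mem1_ssd1_x4"},
    "count_func": {"instanceType": "mem1_ssd1_x2"},
    "sum_reads": {"instanceType": "mem1_ssd1_x4"},
}

print(json.dumps({"runSpec": {"systemRequirements": system_requirements}}, indent=2))
```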

    main

    The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.

    Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the command.

    Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.

    The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.

    count_func

    This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.

    sum_reads

    The main entry point triggers this subjob, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.

    This function returns read_sum_file as the entry point output.
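
    The gather step amounts to reading one integer per count file and adding them up. A minimal Python sketch of that arithmetic, assuming each file holds a single count as text (the sample values are made up):

```python
def sum_read_counts(count_texts):
    """Sum integer read counts, one count per text payload."""
    return sum(int(text.strip()) for text in count_texts)

# Made-up contents of three per-chromosome count files.
total = sum_read_counts(["1200\n", "350\n", "0\n"])
print(total)  # 1550
```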

    Parallel by Region (sh)

    This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait.

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

    Debugging

    The command set -e -x -o pipefail assists you in debugging this applet:

    • -e causes the shell to immediately exit if a command returns a non-zero exit code.

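    • -x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.

    • -o pipefail makes a pipeline return a non-zero exit code if any command in it fails, not only the last one.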

    Parallel Run

    Bash's job control system allows for convenient management of multiple processes. In this example, you can run bash commands in the background while you control the maximum number of concurrent jobs in the foreground. Place processes in the background using the character & after a command.
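
    The same bounded-parallelism idea can be sketched outside Bash. The Python analogue below is not the applet's code: it launches placeholder commands as child processes and uses a thread pool to cap how many run at once, which plays the role of backgrounding with & and waiting in the shell.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_all(commands, max_parallel=2):
    """Run each command as a child process, at most max_parallel at a time."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))

# Placeholder commands; in the applet these would be per-region samtools calls.
commands = [[sys.executable, "-c", f"print({i})"] for i in range(4)]
exit_codes = run_all(commands)
print(exit_codes)
```

    Collecting the exit codes mirrors checking the status of each background process after waiting for it.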

    Job Output

    Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The directory structure required for dx-upload-all-outputs is shown below:
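
    dx-upload-all-outputs looks for one subdirectory per output field under $HOME/out. The Python sketch below builds that layout in a temporary directory standing in for $HOME. The file name is a placeholder, and on a real worker the applet would create these paths in Bash.

```python
import tempfile
from pathlib import Path

home = Path(tempfile.mkdtemp())          # stand-in for $HOME on the worker
out_dir = home / "out" / "counts_txt"    # one subdirectory per output field
out_dir.mkdir(parents=True)

# Placeholder result file; its contents would be the summed read count.
result = out_dir / "example_count.txt"
result.write_text("1550\n")

print(sorted(p.relative_to(home).as_posix() for p in home.rglob("*")))
```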

    In your applet, upload all outputs by:

    Analyzing Germline Variants

    Analyze germline genomic variants, including filtering, visualization, and detailed variant annotation in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Explore and analyze datasets with germline data by opening them in the Cohort Browser and switching to the Germline Variants tab. You can create cohorts based on germline variants, visualize variant patterns, and examine detailed variant information.

    Filtering by Germline Variants

    You can define your cohort to include only samples with specific germline variants.

    To apply a germline filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Assays > Genomic Sequencing, select a genomic filter.

    3. In Edit Filter: Variant (Germline), specify your filtering criteria:

    After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.

    Exploring Variant Patterns in Your Cohort

    The Germline Variants tab includes a lollipop plot displaying allele frequencies for variants in a specified genomic region. This visualization helps you identify patterns in germline variants across your cohort and understand the distribution of allelic frequencies.

    If your dataset contains multiple germline variant assays, such as WES and WGS assays, you can choose the assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. When you switch between assays, your charts and their display settings are preserved.

    Examining Variant Annotations

    The allele table, located below the lollipop plot, shows the same variants in a tabular format with comprehensive annotation information. It allows you to examine specific variant characteristics and compare allele frequencies within your selected cohort, the entire dataset, and from annotation databases, including gnomAD.

    The annotation information includes:

    • Type: Whether the variant is an SNP, deletion, insertion, or mixed.

    • Consequences: The impact of the variant according to SnpEff. For variants with multiple gene annotations, this column displays the most severe consequence per gene.

    • Population Allele Frequency: Allele frequency calculated across the entire dataset from which the cohort is created.

    If canonical transcript information is available, the following three columns with additional annotation information appear in the table:

    • Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.

    • HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.

    • HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.

    Exporting Variant Metadata

    You can export the selected variants in the table as a list of variant IDs or a CSV file.

    • To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.

    • To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).

    For large datasets, you can use the SQL Runner app to download data more efficiently.

    Accessing Detailed Variant Information

    In Allele table > Location column, you can click on the specific location to open the locus details. The locus details page provides in-depth annotations and population genetics data for the selected genomic position.

    When genomic information is ingested and made available in the Cohort Browser, variants are annotated using NCBI dbSNP and gnomAD. The specific versions of each are provided during the ingestion process, which creates a set of tables optimized for cohort creation through the Cohort Browser.

    The locus details page displays three main sections of pre-calculated information from dataset ingestion: Location Info, Genotypes, and Alleles. These sections provide a comprehensive view starting with a locus summary, including genotype frequencies, followed by detailed annotations for each allele.

    Location Info

    The Location Info section provides a quick overview of the genomic locus in your dataset, including the chromosome and starting position, the frequency of both the reference allele and no-calls, and the total number of alleles available.

    Genotypes

    The Genotypes section shows a detailed breakdown of genotypes in the dataset at the specific location. Since allele order is not preserved, genotypes like C/A and A/C are counted in the same category, which is why only half of the comparison table is populated. These genotype frequencies represent the entire dataset at this location, not only your selected cohort.
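
    The effect of ignoring allele order can be sketched in a few lines of Python. This is an illustration of the counting rule only, with made-up genotypes, not the Cohort Browser's implementation.

```python
from collections import Counter

def count_genotypes(genotypes):
    """Count genotypes, treating allele order as irrelevant (C/A == A/C)."""
    return Counter("/".join(sorted(g.split("/"))) for g in genotypes)

# Made-up genotype calls at one locus.
counts = count_genotypes(["C/A", "A/C", "A/A", "C/C", "A/C"])
print(counts)
```

    Sorting the alleles before counting folds C/A and A/C into one key, which is why only one half of the comparison table carries values.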

    Alleles

    The Alleles section displays detailed information for each allele, collected from dbSNP and gnomAD during data ingestion. When available, the rsID or AffyID appears with a direct link to the corresponding page. The section provides allele type, affected samples (dataset), and gnomAD frequency for quick reference, with additional details sorted by transcript ID in the Genes / Transcripts table. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.

    Integrating with Advanced Analysis Tools

    For more sophisticated genomic analysis beyond the Cohort Browser's visualization capabilities, you can connect your variant data with other DNAnexus tools. Export variant lists for detailed analysis in JupyterLab, leverage Spark clusters for large-scale genomic computations, or connect to SQL Runner for complex queries across your dataset.

    Box Plot

    Learn to build and use box plots in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use Box Plots

    Box plots can be used to visualize numerical data.

    Supported Data Types

    Numerical data can also be visualized using histograms.

    Using Box Plots in the Cohort Browser

    Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:

    • Max - The maximum, or highest value

    • Med - The median value

    • Min - The minimum, or lowest value

    The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.

    Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values. "Q3" is the highest value in the third quartile.

    Also shown in this window is the total count of values covered by the box plot, along with the name of the entity to which the data relates.
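
    The values reported in the hover window can be reproduced approximately with Python's standard library. This is a sketch on made-up data: quartile conventions differ between tools, and the method used by the Cohort Browser is not stated here, so the "inclusive" method below is an assumption.

```python
import statistics

def box_plot_summary(values):
    """Return the five values a box plot hover window reports, plus the count."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": med, "q3": q3,
            "max": data[-1], "count": len(data)}

summary = box_plot_summary([1, 2, 2, 3, 4, 5, 9])
print(summary)
```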

    Customizing Chart Display

    You can customize how box plot data is displayed by clicking ⛭ Chart Settings in the chart toolbar.

    For data with wide value ranges or skewed distributions, you can apply logarithmic scaling to the Y-axis:

    • log₂ - Values transformed using f(y) = sign(y) · log₂(|y| + 1)

    • log₁₀ - Values transformed using f(y) = sign(y) · log₁₀(|y| + 1)

    When you apply logarithmic transformation, the Y-axis label updates to show the transformation type (log₂ or log₁₀).
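
    The transformation f(y) = sign(y) · log(|y| + 1) keeps zero at zero and preserves the sign of negative values, which a plain logarithm cannot do. A minimal Python sketch follows; the function name signed_log is hypothetical.

```python
import math

def signed_log(y, base=2):
    """Apply f(y) = sign(y) * log_base(|y| + 1)."""
    return math.copysign(math.log(abs(y) + 1, base), y)

for value in (0, 3, -9):
    print(value, signed_log(value), signed_log(value, base=10))
```

    Adding 1 before taking the logarithm is what makes the function defined at y = 0.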

    Non-Numeric Data in Box Plots

    Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a box plot. See the chart above for an example of the informational message that shows below the chart when non-numeric values are present.

    Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:

    In this scenario, a discrepancy exists between the "count" figure shown in the chart label and the one shown in the informational window that opens when hovering over the middle of a box plot. The latter figure is smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.

    Outliers

    Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:

    This box plot displays data on the number of cups of coffee consumed per day by patients in a particular cohort. One cohort patient was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.

    Box Plots in Cohort Compare Mode

    In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.

    Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort.

    Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.

    Preparing Data for Visualization in Box Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in box plots:

    • Integer

    • Integer Sparse

    • Float

    Omics Data Assistant

    Explore and analyze datasets using natural language queries with Omics Data Assistant, a GenAI-powered interface integrated into Cohort Browser.

    A license is required to use Omics Data Assistant on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Omics Data Assistant (the assistant) is a GenAI-powered conversational interface that helps you explore and analyze complex biomedical and clinical datasets using natural language. The assistant is integrated directly into Cohort Browser. This means you can combine conversational queries with powerful visualization tools for comprehensive data analysis.

    Whether you're new to a dataset or an experienced bioinformatician, the assistant saves you time by understanding your questions in plain English. New users can quickly discover what data their datasets contain without browsing through fields and schemas. Experienced users can define cohorts in seconds by describing criteria in a few sentences, eliminating the need to manually configure multiple filters.

    Omics Data Assistant uses generative AI to accelerate your analysis. While powerful, AI models can occasionally produce inaccurate or incomplete results. Always verify generated cohorts and insights against your underlying data. The assistant alone should not be used for clinical diagnosis or treatment decisions.

    How It Works

    Omics Data Assistant uses the latest Anthropic Claude model for natural language understanding and response generation. This model supports large context windows, allowing the assistant to handle complex queries and maintain context throughout conversations.

    The assistant accesses only your Apollo dataset and does not connect to the internet or external data sources. Omics Data Assistant is deployed regionally to meet data residency requirements, keeping your data in your region throughout all operations. Conversations are stored securely and remain private to you.

    Getting Started

    Prerequisites

    • Your organization has active Omics Data Assistant and Cohort Browser licenses.

    • You have access to an Apollo dataset and its associated databases in a project.

    Opening Omics Data Assistant

    Omics Data Assistant works with datasets in Cohort Browser.

    1. In the DNAnexus Platform, open a dataset in Cohort Browser.

    2. Click ✨ Omics Data Assistant in the lower right corner.

    By default, the assistant opens in a panel on the right side. You can enlarge the assistant panel by clicking Enter Full Screen.

    In the assistant's input field, you can explore the data by asking questions.

    Below the input, you can use two controls to get started quickly:

    • Dataset Overview: Opens an AI-generated overview of the opened dataset. Use this to learn what the dataset contains without writing a prompt. You can still ask follow-up questions for specific details.

    • Help: Opens a guide about Omics Data Assistant, its capabilities, and example prompts to try.

    First-Time Dataset Indexing

    The first time Omics Data Assistant is used with a dataset, the dataset must be indexed. This one-time process enables natural language queries by creating vector representations of the dataset structure. Only one person needs to start this indexing process. Subsequent users can query the dataset immediately once indexing is complete.

    After indexing starts, it runs in the background. Most datasets complete indexing within 15 minutes. Large datasets like UK Biobank may take over an hour. You cannot ask questions until indexing completes. You can monitor indexing progress through the assistant interface.

    Index data is stored securely in the same AWS region as your data to maintain data residency requirements.

    Using Omics Data Assistant

    Asking Questions

    Omics Data Assistant responds to your questions in plain English and translates them into structured database queries.

    Example prompts:

    • "Find all patients diagnosed with IBD within 6 months of a diabetes diagnosis"

    • "Get patients with lower hemoglobin values than the laboratory's recommended value"

    • "Create cohort of all patients with exon loss variants in KIAA1109"

    Understanding Responses

    Each response includes the assistant's thinking process, which shows how it interpreted your question and the SQL queries it generated. You can verify the assistant understood your question correctly, review the SQL queries for accuracy, and learn how natural language translates to database queries.

    If your question is unclear, the assistant asks clarifying questions to ensure accurate results.

    For each response, you can:

    • Copy responses: Click Copy at the end of any response to copy the markdown-formatted text to your clipboard for use in documents or notes.

    • Provide feedback: Click the thumbs up or thumbs down buttons on responses to help improve the assistant's accuracy. When giving negative feedback, you can describe the problem in your own words. Your feedback helps the DNAnexus team enhance the assistant for all users.

    Creating Cohorts

    The assistant excels at creating cohorts and generating demographic summaries. For complex visualizations, create your cohorts through the assistant, then use Cohort Browser's native visualization tools for detailed analysis.

    To create a cohort directly from Omics Data Assistant, phrase your question to specify patient criteria. For example, "Create a cohort of patients with amplifications in HER2".

    When the assistant returns the cohort results, click + Add to Dashboard to filter the dataset by the new cohort in Cohort Browser. You can add multiple cohorts from the assistant to your dashboard and switch between cohorts.

    From Cohort Browser, you can save cohorts to your project as CohortBrowser records.

    Managing Conversations

    Your conversation history is stored separately for each dataset. Your conversations remain private to you. Other users cannot access your conversation history.

    • To manage your conversations, click the three dots next to a conversation name to either rename or delete the conversation.

    • To view and search through your past conversations, click See All at the bottom of the panel.

    Exploring and Querying Datasets

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Extracting Data From a Dataset With Spark

    The dx commands extract_dataset and extract_assay germline let you either retrieve the data dictionary of a dataset or extract the underlying data described by that dictionary. You can also use these commands to get dataset metadata, such as the names and titles of entities and fields, or to list all relevant assays in a dataset.

    Often, you can retrieve data without using Spark, and extra compute resources are not required (see the ). However, if you need more compute power, such as when working with complex data models, large datasets, or extracting large volumes of data, you can use a private Spark resource. Using private compute resources helps avoid timeouts by scaling resources as needed.

    If you use the --sql flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.

    Initiating a Spark Session

    The most common way to use Spark on the DNAnexus Platform is via a .

    After creating a Jupyter notebook within a project, enter the commands shown below to start a Spark session.

    Python:

    R:

    Executing SQL Queries

    Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:

    Python:

    R:

    Query to Extract Data From Database Using extract_dataset

    Python:

    Where dataset is the record-id or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset".

    R:

    Where dataset is the record-id or the path to the dataset or cohort.

    Query to Filter and Extract Data from Database Using extract_assay germline

    Python:

    R:

    In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele option. For more information, refer to the notebooks in the .
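
    As a hedged sketch of how the pieces named above fit together, the Python helper below only assembles a command line from the dataset, the filter file, and the --sql flag. The helper name is hypothetical, and combining --retrieve-allele with --sql in this way is an assumption based on the description in this section. Running the command requires the dx toolkit and a logged-in session, so this sketch stops short of executing it.

```python
import shlex

def build_extract_assay_command(dataset, filter_path):
    """Build a dx command line intended to return a SQL string for an allele filter."""
    return ["dx", "extract_assay", "germline", dataset,
            "--retrieve-allele", filter_path, "--sql"]

cmd = build_extract_assay_command("record-abc123", "allele_filter.json")
print(shlex.join(cmd))
```

    In a notebook, a list like this could be handed to subprocess.check_output, and the returned SQL string could then be run in the Spark session.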

    Run SQL Query to Extract Data

    Python:

    R:

    Best Practices

    • When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately with multiple nodes to parallelize across.

    • Ensure that your Spark session is initialized only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (for example, by running notebook 1 and then notebook 2, or by running a notebook from start to finish multiple times), the Spark session becomes corrupted and you need to restart the specific notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.

    VCF Loader

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.

    The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many files case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In any case, every input VCF file must be a syntactically correct, sorted VCF file.

    VCF Preprocessing

    Although VCF data can be loaded into Apollo databases after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, preprocessing and harmonizing the data before loading is recommended. To learn more, see .

    How to Run VCF Loader

    Input:

    • vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned and every specified partition must be a valid VCF file. After the partition-merge step in preprocessing, the complete VCF file must be valid.
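
    The two naming rules for manifest entries can be checked before a run. The Python sketch below works on file names that you have already looked up for the file IDs in your manifest; the function name and the sample names are hypothetical, and it does not validate VCF content.

```python
def check_vcf_names(names):
    """Return a list of problems with the file names a manifest refers to."""
    problems = []
    for name in names:
        if not name.endswith(".vcf.gz"):
            problems.append(f"{name}: does not end in .vcf.gz")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    problems.extend(f"{n}: name is not distinct" for n in duplicates)
    return problems

print(check_vcf_names(["chr1.vcf.gz", "chr2.vcf.gz"]))
print(check_vcf_names(["a.vcf", "b.vcf.gz", "b.vcf.gz"]))
```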

    Required Parameters:

    • database_name: (string) name of the database into which to load the VCF files.

    • create_mode: (string) strict mode creates the database and tables from scratch; optimistic mode creates the database and tables only if they do not already exist.

    Other Options:

    • snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.

    • snpeff_human_genome: (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.

    Basic Run

    Chart Types

    Get an overview of the range of different charts you can build and use in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    While working in the Cohort Browser, you can visualize data using a variety of different types of charts.

    To visualize data stored in a particular field, follow these directions to browse through the fields in a dataset, select one, then create a chart based on the values in the field. When you select a field, the Cohort Browser suggests a chart type suited to the type of data it contains. You can also create multi-variable charts, displaying data from two fields, to help clarify the relationship between the data stored in each.

    Single-Variable Charts

    The following single-variable chart types are available in the Cohort Browser:

    Multi-Variable Charts

    The following multi-variable chart types are available in the Cohort Browser:

    When creating multi-variable charts using datasets that include data related to multiple entities, the entity relationship between the selected data fields affects chart type availability. Often, data fields related to the same entity, or data fields related to entities that in turn relate to one another in 1:1, N:1, or 1:N fashion, can be used together in a multi-variable chart.

    Interpreting Chart Data

    Chart Totals and Missing Data

    In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.

    This figure is not always the same as the number of records in the cohort.

    In a single-variable chart, if a field in a record is empty or contains a null value, that record is not included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon appears next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.

    The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record isn't included in the chart total count, as its data can't be visualized.
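
    The counting rule for chart totals can be sketched in Python. The records and field names below are made up, and the function illustrates the rule described above rather than the Cohort Browser's code.

```python
def chart_total(records, fields):
    """Count records that have a non-null value in every selected field."""
    return sum(all(r.get(f) is not None for f in fields) for r in records)

records = [
    {"age": 54, "bmi": 27.1},
    {"age": None, "bmi": 30.4},   # null age: excluded from charts using age
    {"age": 61},                  # bmi missing: excluded from charts using bmi
]
print(chart_total(records, ["age"]), chart_total(records, ["age", "bmi"]))
```

    With three records in the cohort, the single-variable total is 2 and the two-variable total is 1, which is why the chart total can be smaller than the cohort size.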

    Apps and Workflows

    Every analysis in DNAnexus is run using apps. Apps can be linked together to create workflows. Learn the basics of using both.

    You must set up billing for your account before you can perform an analysis, or upload or egress data.

    Finding the Right App or Workflow

    Developer Quickstart

    Learn to build an app that you can run on the Platform.

    This tutorial provides a quick intro to the DNAnexus developer experience, and progresses to building a fully functional, useful app on the Platform. For a more in-depth discussion of the Platform, see .

    The steps below require the . You must download and install it if you have not done so already.

    Besides this Quickstart, there are Developer Tutorials located in the sidebar that go over helpful tips for new users as well. A few of them include:

    Parallel by Chr (py)

    This applet tutorial performs a SAMtools count using parallel threads.

    To take full advantage of the scalability that cloud computing offers, your scripts have to implement the correct methodologies. This applet tutorial shows you how to:

    1. Install SAMtools

    2. Download BAM file

    Parallel by Region (py)

    This applet tutorial performs a SAMtools count using parallel threads.

    To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial shows you how to:

    1. Install SAMtools

    2. Download BAM file

    List View

    Learn to build and use list views in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use List Views

    Distributed by Chr (sh)

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.

    Analyzing Gene Expression Data

    Analyze gene expression data, including expression-based filtering, visualization, and molecular profiling in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Explore and analyze datasets with gene expression assays by opening them in the Cohort Browser and switching to the Gene Expression tab. You can create cohorts based on expression levels, visualize expression patterns, and examine detailed gene information.

    Visualizing Data

    The DNAnexus Platform offers multiple methods for viewing your files and data.

    Previewing Files

    DNAnexus allows users to preview and open the following file types directly on the platform:

    dx run app-dxjupyterlab
    dx run app-dxjupyterlab_spark_cluster
    dx run -h APP_NAME
    {
    
      ...
      "runSpec": {
        ...
        "execDepends": [
          {
            "name": "samtools"
          }
        ]
      }
      ...
    }

    IMAGE_PROCESSING: Python3 with image processing packages (Nipype, FreeSurfer, FSL), but no R. FreeSurfer requires a license. GUI viewers such as fsleyes and freeview cannot be launched in the headless environment.

  • STATA: Stata requires a license to run

  • MONAI_ML: Extends the ML feature with specialized medical imaging frameworks, such as MONAI Core, MONAI Label, and 3D Slicer.

  • For datasets with multiple germline variant assays, select the specific assay to filter by.

  • On the Genes / Effects tab, select variants of specific types and variant consequences within the specified genes and/or genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.

  • On the Variant IDs tab, specify a list of variant IDs, with a maximum of 100 variants.

  • To enter multiple genes, genomic ranges, or variants, separate them with commas or place each on a new line.

  • Click Apply Filter.

  • Cohort Allele Frequency: Allele frequency calculated across current cohort selection.

  • GnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset gnomAD.

    Float Sparse

    Numerical (Integer)

    Numerical (Float)

    f(y) = \text{sign}(y) \cdot \log_2(|y| + 1)
    f(y) = \text{sign}(y) \cdot \log_{10}(|y| + 1)
    Row Chart
    Stacked Row Chart
    Box Plot
    Histogram
    List View
    Grouped Box Plot
    Kaplan-Meier Survival Curve
    List View
    Detail on "missing" records
    Analyzing Gene Expression Data
    chart types
    Follow the instructions in the modal window that opens.

    Click the Copy Selected button.

    data in this project cannot be cloned to another project

  • data in this project cannot be used as input to a job or an analysis in another project

  • any running app or applet that reads from this project cannot write results to any other project

  • a job running in the project has the singleContext flag set to true, irrespective of the singleContext value supplied to /job/new and /executable-xxxx/run. Such a job is only allowed to use its own DNAnexus authentication token when issuing requests to the proxied DNAnexus API endpoint within the job. Use of any other authentication token results in an error.

    This flag corresponds to the Copy Access policy in the project's Settings web interface screen.

  • downloadRestricted: If set to true, data in this project cannot be downloaded outside of the platform. For database objects, users cannot access the data in the project from outside DNAnexus. When set to true, previewViewerRestricted defaults to true unless explicitly overridden. This flag corresponds to the Download Access policy in the project's Settings web interface screen.

  • previewViewerRestricted: If set to true, file preview and viewer are disabled for the project. This flag defaults to true when downloadRestricted is set to true. In the project's Settings screen, the File Preview setting is adjustable when Download Access is set to restrict downloads, but is disabled and set to allow preview when Download Access allows all members to download. You can override this by explicitly setting previewViewerRestricted to false using the /project-xxxx/update API method.

  • databaseUIViewOnly: If set to true, project members with VIEW access have their access to project databases restricted to the Cohort Browser only. This feature is only available to customers with an Apollo license. Contact DNAnexus Sales for more information.

  • containsPHI: If set to true, data in this project is treated as Protected Health Information (PHI), that is, identifiable health information that can be linked to a specific person. PHI data protection safeguards the confidentiality and integrity of the project data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) by imposing additional restrictions documented in the PHI Data Protection section. This flag corresponds to the PHI Data Protection setting in the Administration section of a project's Settings web interface screen.

  • displayDataProtectionNotice: If set to true, ADMIN users can turn on or off the display of a Data Protection Notice to any users accessing the selected project. If the Data Protection Notice feature is enabled for a project, all users, when first accessing the project, are required to review and confirm their acceptance of a requirement not to egress data from the project. A license is required to use this feature. Contact DNAnexus Sales for more information.

  • externalUploadRestricted: If set to true, external file uploads to this project (from outside the job context) are rejected. This flag corresponds to the External Upload Access policy in the project's Settings web interface screen. A license is required to use this feature. Contact DNAnexus Sales for more information.

  • httpsAppIsolatedBrowsing: If set to true, httpsApp access to jobs launched in this project is wrapped in Isolated Browsing, which restricts data transfers through the httpsApp job interface. A license is required to use this limited-access feature. Contact DNAnexus Sales for more information.

  • Apollo database access is subject to additional restrictions.

  • Once PHI Data Protection is activated for a project, it cannot be disabled.

  • Click Send Transfer Request.

    VIEW

    Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.

    UPLOAD

    Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.

    CONTRIBUTE

    Gives users UPLOAD access, plus the ability to run executions directly in the project.

    ADMINISTER


    Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.

    If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically this looks something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) as opposed to the database name (metabric_data in this case).

    insert_mode: (string) append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.

  • run_mode: (string) site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data and all mode processes both types of data.

  • etl_spec_id: (string) Only the genomics-phenotype schema choice is supported.

  • is_sample_partitioned: (boolean) whether the raw VCF data is partitioned.

  • snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.

  • snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.

  • calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). This option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'

  • calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.

  • snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.

  • num_init_partitions: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.

  • VCF Preprocessing
    -o pipefail makes a pipeline return the exit code of the last (rightmost) command that exited with a non-zero status, or zero if every command succeeded. (By default, the return code of a pipeline is the exit code of its last command, which can create difficult-to-debug problems.)

    The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. Then, you can download or create a *.bai index as needed.

    job controlarrow-up-right
    dx-upload-all-outputs

    Distributed by Chr (sh)

  • Parallel by Chr (py)

  • R Shiny Example Web App

    Step 1. Build an App

    Every DNAnexus app starts with 2 files:

    • dxapp.json: a file containing the app's metadata: its inputs and outputs, how the app is run, and execution requirements

    • a script that is executed in the cloud when the app is run

    Start by creating a file called dxapp.json with the following text:

    The example specifies the app name (coolapp), the interpreter (python3) to run the script, and the path (code.py) to the script created next. ("version":"0") refers to the Ubuntu 24.04 application execution environment version that supports the python3 interpreter.
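As a rough sketch of what such a file might contain, the Python snippet below builds the metadata described here and prints it as JSON. The values coolapp, python3, code.py, version "0", and Ubuntu 24.04 come from the description. The helper name render_dxapp is hypothetical, and the exact layout of the file is an assumption.

```python
import json

# Metadata described in the text: app name, interpreter, script path, and the
# Ubuntu 24.04 execution environment at version "0".
app_spec = {
    "name": "coolapp",
    "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
    },
}


def render_dxapp(spec):
    """Serialize the app metadata as pretty-printed JSON text."""
    return json.dumps(spec, indent=2) + "\n"


if __name__ == "__main__":
    # Redirect this output to dxapp.json, or paste it into the file by hand.
    print(render_dxapp(app_spec), end="")
```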

    Next, create the script in a file called code.py with the following text:

    That's all you need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:

    Next, run the app and watch the output:

    That's it! You have made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.

    The app is available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.

    Step 2. Run BLAST

    Next, make the app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.

    In the cloud, your app runs on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json like this:
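One way to picture the change is as an edit to the runSpec section of the metadata. The sketch below is a hedged illustration in Python: the helper add_apt_dependency is hypothetical, and it relies only on the {"name": ...} entry form for execDepends, which for an entry without a package_manager is assumed to mean an APT package.

```python
import copy
import json


def add_apt_dependency(app_spec, package):
    """Return a copy of app_spec with `package` listed in runSpec.execDepends."""
    updated = copy.deepcopy(app_spec)
    depends = updated.setdefault("runSpec", {}).setdefault("execDepends", [])
    # Avoid listing the same package twice if the helper is called again.
    if not any(dep.get("name") == package for dep in depends):
        depends.append({"name": package})
    return updated


if __name__ == "__main__":
    base = {"name": "coolapp", "runSpec": {"interpreter": "python3", "file": "code.py"}}
    print(json.dumps(add_apt_dependency(base, "ncbi-blast+"), indent=2))
```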

    Next, update code.py to run BLAST:

    Rebuild the app and test it on some real data. You can use demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is in the same region as the Demo Data project.

    Rebuild the app with dx build -a, and run it like this:

    Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.

    Step 3. Provide an Input/Output Spec

    Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add the app to a workflow and connect its inputs and outputs to other apps, specify both input and output specifications. Update the dxapp.json as follows:

    Rebuild the app with dx build -a. Run it as before, and add the applet to a workflow by clicking "New Workflow" while viewing your project on the website, then click coolapp once to add it to the workflow. Inputs and outputs appear on the workflow stage and can be connected to other stages.

    If you run dx run coolapp with no input arguments from the command line, the command prompts for the input values for seq1 and seq2.
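The sketch below illustrates, in Python, what an input and output specification of this kind could look like and why the command prompts for seq1 and seq2. The input names seq1 and seq2 come from the text. The output name blast_result and the helper missing_inputs are hypothetical, and the rule used here (an input is required unless it is marked optional or has a default) is stated as an assumption.

```python
import json

io_spec = {
    "inputSpec": [
        {"name": "seq1", "class": "file"},
        {"name": "seq2", "class": "file"},
    ],
    "outputSpec": [
        {"name": "blast_result", "class": "file"},  # hypothetical output name
    ],
}


def missing_inputs(spec, provided):
    """List required inputs that the caller has not supplied."""
    missing = []
    for field in spec["inputSpec"]:
        required = not field.get("optional", False) and "default" not in field
        if required and field["name"] not in provided:
            missing.append(field["name"])
    return missing


if __name__ == "__main__":
    print(json.dumps(io_spec, indent=2))
    print(missing_inputs(io_spec, {}))  # both seq1 and seq2 still need values
```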

    Step 4. Configure App Settings

    Besides specifying input files, the I/O specification can also configure settings the app uses. For example, configure the E-value setting and other BLAST settings with this code and dxapp.json:

    Rebuild the app again and add it in the workflow builder. You should see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.

    Step 5. Use SDK Tools

    One of the utilities provided in the SDK is dx-app-wizard. This tool prompts you with a series of questions with which it creates the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Run dx-app-wizard to try it out.

    Learn More

    For additional information and examples of how to run jobs using the CLI, Working with files using dx run may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.

    Intro to Building Apps
    DNAnexus SDK

    Count regions in parallel

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.

    For additional information, refer to the execDepends documentation.

    Download BAM file

    The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:

    • <var>_path (string): full absolute path to where the file was downloaded.

    • <var>_name (string): name of the file, including extension.

    • <var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.

    The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, the dictionary has the following key-value pairs:
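To make the key naming concrete, the following stdlib-only Python sketch builds the three entries for a single file input. It does not call dxpy. The helper describe_input and the file name SRR504516.bam are hypothetical, the input name mappings_sorted_bam is taken from this tutorial's code, and patterns are assumed to have the simple form "*<suffix>".

```python
import os


def describe_input(var, local_path, patterns=()):
    """Build the <var>_path, <var>_name, and <var>_prefix entries for one file."""
    name = os.path.basename(local_path)
    # Assumption: each pattern looks like "*.bam", so the text after "*" is a suffix.
    suffixes = [pattern.lstrip("*") for pattern in patterns]
    matching = [suffix for suffix in suffixes if suffix and name.endswith(suffix)]
    # Strip the longest matching suffix; leave the name unchanged if none match.
    prefix = name[: -len(max(matching, key=len))] if matching else name
    return {
        var + "_path": local_path,
        var + "_name": name,
        var + "_prefix": prefix,
    }


if __name__ == "__main__":
    info = describe_input(
        "mappings_sorted_bam",
        "/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam",
        patterns=["*.bam"],
    )
    for key, value in info.items():
        print(key, "=", value)
```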

    Count Regions in Parallel

    Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to 10 and the workload per thread is set to 1 chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use multiprocessing.dummy, a wrapper around Python's threading module.

    Each worker creates a string to be called in a subprocess.Popen call. The multiprocessing.dummy.Pool.map(<func>, <iterable>) function is used to call the helper function run_cmd for each string in the iterable of view commands. Because the commands are run through subprocess.Popen, a failed command does not raise an exception on its own. The results returned by the closed workers are therefore verified in the verify_pool_status helper function.

    Important: In this example, you use subprocess.Popen to process and verify results in verify_pool_status. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
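The pattern described in this section can be sketched as follows. This is a minimal illustration, not the applet's actual source: the helper names run_cmd and verify_pool_status come from the text, but their bodies, the wrapper count_in_parallel, and the stand-in commands (small Python one-liners instead of samtools view -c) are assumptions made so the example runs anywhere.

```python
import shlex
import subprocess
import sys
from multiprocessing.dummy import Pool  # thread-based Pool with the same API


def run_cmd(cmd):
    """Run one command string and return (stdout, stderr, exit_code)."""
    proc = subprocess.Popen(
        shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode


def verify_pool_status(results):
    """Raise if any worker's command exited with a non-zero code."""
    failures = [stderr for _, stderr, code in results if code != 0]
    if failures:
        raise RuntimeError("{} command(s) failed: {}".format(len(failures), failures))


def count_in_parallel(commands, workers=10):
    """Run every command in a thread pool and return the counts they print."""
    pool = Pool(workers)
    try:
        results = pool.map(run_cmd, commands)  # results keep the input order
    finally:
        pool.close()
        pool.join()
    verify_pool_status(results)
    return [int(stdout.strip()) for stdout, _, _ in results]


if __name__ == "__main__":
    python = shlex.quote(sys.executable)
    # Stand-ins for per-chromosome "samtools view -c" commands.
    commands = ['{} -c "print({})"'.format(python, n) for n in (3, 4, 5)]
    print(sum(count_in_parallel(commands)))
```

Threads are sufficient here because each worker spends its time waiting on an external process rather than running Python code.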

    Gather Results

    Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file is used to upload and generate a DXFile corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function is used to generate the appropriate DXLink value.

    View full source code on GitHub

    Split workload

  • Count regions in parallel

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.

    Download Inputs

    This applet downloads all inputs at once using dxpy.download_all_inputs:

    Split workload

    Using the Python multiprocessing module, you can split the workload into multiple processes for parallel execution:

    With this pattern, you can quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.

    Specific helpers are created in the applet script to manage the workload. One helper you may have seen before is run_cmd. This function manages the subprocess calls:

    Before the workload can be split, you need to identify the regions present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
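A header-parsing step of this kind might look like the sketch below. The function name parse_sam_header_for_region comes from the text, but the signature, the canonical-chromosome pattern, and the sample header are assumptions for illustration. The input is the text that samtools view -H prints.

```python
import re

# Assumed definition of "canonical": 1-22 style numbers, X, Y, M or MT,
# with or without a "chr" prefix.
CANONICAL_CHR = re.compile(r"^(chr)?([0-9]+|X|Y|M|MT)$")


def parse_sam_header_for_region(header_text, canonical_only=True):
    """Return the sequence names (SN) listed on the @SQ lines of a SAM header."""
    regions = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each @SQ line is tab-separated TAG:VALUE pairs, for example SN:chr1.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        name = fields.get("SN")
        if name is None:
            continue
        if canonical_only and not CANONICAL_CHR.match(name):
            continue
        regions.append(name)
    return regions


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.6\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:248956422\n"
        "@SQ\tSN:chrUn_KI270302v1\tLN:2274\n"
    )
    print(parse_sam_header_for_region(header))
```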

    Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.

    The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs from the workers are parsed to determine whether the run failed or passed.

    View full source code on GitHub

    For additional information, see execDepends

    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:

    main

    The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.

    Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.

    Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.

    The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.
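A job-based object reference is a small JSON value that names a job and one of its output fields, so it can be passed along before the job has finished. The Python sketch below shows the shape of such references and how the inputs for sum_reads could be assembled from them. The helper names job_output_ref and sum_reads_input are hypothetical, the job IDs are placeholders, and the field names counts_txt, readfiles, and filename are taken from this app's code.

```python
import json


def job_output_ref(job_id, field):
    """Build a job-based object reference to one output field of a job."""
    return {"$dnanexus_link": {"job": job_id, "field": field}}


def sum_reads_input(count_job_ids, filename):
    """Assemble sum_reads inputs that point at each count_func job's output."""
    return {
        "filename": filename,
        "readfiles": [job_output_ref(job_id, "counts_txt") for job_id in count_job_ids],
    }


if __name__ == "__main__":
    # Placeholder job IDs; real IDs are returned when the subjobs are created.
    payload = sum_reads_input(["job-xxxx", "job-yyyy"], "sample")
    print(json.dumps(payload, indent=2))
```

The command-line form job-xxxx:counts_txt used by dx-jobutil-new-job and dx-jobutil-add-output expresses the same reference.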

    count_func

    This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.

    sum_reads

    The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.

    This function returns read_sum_file as the entry point output.

    View full source code on GitHub
    dx update project project-xxxx --allowed-executables applet-yyyy --allowed-executables workflow-zzzz [...]
    dx update project project-xxxx --unset-allowed-executables
    {
      "runSpec": {
        ...
        "execDepends": [
          {"name": "pysam",
           "package_manager": "pip3",
           "version": "0.15.4"
          }
        ]
        ...
      }
    }
    print(mappings_sorted_bai)
    print(mappings_sorted_bam)
    
    mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
    sorted_bam_name = mappings_sorted_bam.name
    dxpy.download_dxfile(mappings_sorted_bam.get_id(),
                            sorted_bam_name)
    ascii_bam_name = unicodedata.normalize(  # Drop non-ASCII characters, then decode so the name stays a str.
        'NFKD', sorted_bam_name).encode('ascii', 'ignore').decode('ascii')
    
    if mappings_sorted_bai is not None:
        mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
        dxpy.download_dxfile(mappings_sorted_bai.get_id(),
                                mappings_sorted_bai.name)
    else:
        pysam.index(ascii_bam_name)
    mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
    regions = get_chr(mappings_obj, canonical_chr)
    def get_chr(bam_alignment, canonical=False):
        """Helper function to return canonical chromosomes from SAM/BAM header
    
        Arguments:
            bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
            canonical (boolean): Return only canonical chromosomes
        Returns:
            regions (list[str]): Region strings
        """
        regions = []
        headers = bam_alignment.header
        seq_dict = headers['SQ']
    
        if canonical:
            re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
            for seq_elem in seq_dict:
                if re_canonical_chr.match(seq_elem['SN']):
                    regions.append(seq_elem['SN'])
        else:
            regions = [''] * len(seq_dict)
            for i, seq_elem in enumerate(seq_dict):
                regions[i] = seq_elem['SN']
    
        return regions
    total_count = 0
    count_filename = "{bam_prefix}_counts.txt".format(
        bam_prefix=ascii_bam_name[:-4])
    
    with open(count_filename, "w") as f:
        for region in regions:
            temp_count = mappings_obj.count(region=region)
            f.write("{region_name}: {counts}\n".format(
                region_name=region, counts=temp_count))
            total_count += temp_count
    
        f.write("Total reads: {sum_counts}".format(sum_counts=total_count))
    counts_txt = dxpy.upload_local_file(count_filename)
    output = {}
    output["counts_txt"] = dxpy.dxlink(counts_txt)
    
    return output
    regions=$(samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')
    
    echo "Segmenting into regions"
    count_jobs=()
    counter=0
    temparray=()
    for r in $(echo $regions); do
      if [[ "${counter}" -ge 10 ]]; then
        echo "${temparray[@]}"
        count_jobs+=( \
          $(dx-jobutil-new-job \
          -ibam_file="${mappings_sorted_bam}" \
          -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
        temparray=()
        counter=0
      fi
      temparray+=("-iregions=${r}") # Here we add to an array of -i<parameter>'s
      counter=$((counter+1))
    done
    
    if [[ counter -gt 0 ]]; then # Previous loop misses last iteration if it's < 10
      echo "${temparray[@]}"
      count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
    fi
    echo "Merge count files, jobs:"
    echo "${count_jobs[@]}"
    readfiles=()
    for count_job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${count_job}:counts_txt")
    done
    echo "file name: ${sorted_bamfile_name}"
    echo "Set file, readfile variables:"
    echo "${readfiles[@]}"
    countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    count_func() {
      set -e -x -o pipefail
    
      echo "Value of bam_file: '${bam_file}'"
      echo "Value of bambai_file: '${bambai_file}'"
      echo "Regions being counted '${regions[@]}'"
    
      dx-download-all-inputs
    
      mkdir workspace
      cd workspace || exit
      mv "${bam_file_path}" .
      mv "${bambai_file_path}" .
      outputdir="./out/samtool/count"
      mkdir -p "${outputdir}"
      samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
    
      counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    }
    sum_reads() {
    
      set -e -x -o pipefail
      echo "$filename"
    
      echo "Value of read file array '${readfiles[@]}'"
      dx-download-all-inputs
      echo "Value of read file path array '${readfiles_path[@]}'"
    
      echo "Summing values in files"
      readsum=0
      for read_f in "${readfiles_path[@]}"; do
        temp=$(cat "$read_f")
        readsum=$((readsum + temp))
      done
    
      echo "Total reads: ${readsum}" > "${filename}_counts.txt"
    
      read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
      dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
    }
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    {
      "runSpec": {
        ...
        "systemRequirements": {
          "main": {
            "instanceType": "mem1_ssd1_x4"
          },
          "count_func": {
            "instanceType": "mem1_ssd1_x2"
          },
          "sum_reads": {
            "instanceType": "mem1_ssd1_x4"
          }
        },
        ...
      }
    }
    dx download "${mappings_sorted_bam}"
    chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    if [ -z "${mappings_sorted_bai}" ]; then
        samtools index "${mappings_sorted_bam_name}"
    else
        dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}.bai"
    fi
    
    count_jobs=()
    
    for chr in $chromosomes; do
        seg_name="${mappings_sorted_bam_prefix}_${chr}.bam"
        samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
        bam_seg_file=$(dx upload "${seg_name}" --brief)
        count_jobs+=($(dx-jobutil-new-job \
            -isegmentedbam_file="${bam_seg_file}" \
            -ichr="${chr}" \
            count_func))
    done
    for job in "${count_jobs[@]}"; do
        readfiles+=("-ireadfiles=${job}:counts_txt")
    done
    
    sum_reads_job=$(
        dx-jobutil-new-job \
            "${readfiles[@]}" \
            -ifilename="${mappings_sorted_bam_prefix}" \
            sum_reads
    )
    count_func () {
        echo "Value of segmentedbam_file: '${segmentedbam_file}'"
        echo "Chromosome being counted '${chr}'"
    
        dx download "${segmentedbam_file}"
    
        readcount=$(samtools view -c "${segmentedbam_file_name}")
        printf "${chr}:\t%s\n" "${readcount}" > "${segmentedbam_file_prefix}.txt"
    
        readcount_file=$(dx upload "${segmentedbam_file_prefix}.txt" --brief)
        dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
    }
    sum_reads () {
        set -e -x -o pipefail
    
        printf "Value of read file array %s" "${readfiles[@]}"
        echo "Filename: ${filename}"
        echo "Summing values in files and creating output read file"
    
        for read_f in "${readfiles[@]}"; do
            echo "${read_f}"
            dx download "${read_f}" -o - >> chromosome_result.txt
        done
    
        count_file="${filename}_chromosome_count.txt"
        total=$(awk '{s+=$2} END {print s}' chromosome_result.txt)
        echo "Total reads: ${total}" >> "${count_file}"
    
        readfile_name=$(dx upload "${count_file}" --brief)
        dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
    }
    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    install.packages("sparklyr")
    library(sparklyr)
    port <- Sys.getenv("SPARK_MASTER_PORT")
    master <- paste("spark://master:", port, sep = '')
    sc = spark_connect(master)
    retrieve_sql = 'select .... from .... '
    df = spark.sql(retrieve_sql)
    library(DBI)
    retrieve_sql <- 'select .... from .... '
    df = dbGetQuery(sc, retrieve_sql)
    import subprocess
    cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o", "extracted_data.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
    system(cmd)
    import subprocess
    cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o", "extract_allele.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o extracted_allele.sql")
    system(cmd)
    with open("extracted_data.sql", "r") as file:
        retrieve_sql=""
        for line in file:
            retrieve_sql += line.strip()
    df = spark.sql(retrieve_sql.strip(";"))
    install.packages("tidyverse")
    library(readr)
    retrieve_sql <- read_file("extracted_data.sql")
    retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
    df <- dbGetQuery(sc, retrieve_sql)
    dx run vcf-loader \
       -i vcf_manifest=file-xxxx \
       -i is_sample_partitioned=false \
       -i database_name=<my_favorite_db> \
       -i etl_spec_id=genomics-phenotype \
       -i create_mode=strict \
       -i insert_mode=append \
       -i run_mode=genotype
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    
    for chr in $chromosomes; do
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    
    busyproc=0
    while read -r b_file; do
      echo "${b_file}"
      if [[ "${busyproc}" -ge "$(nproc)" ]]; then
        echo Processes hit max
        while [[ "${busyproc}" -gt  0 ]]; do
          wait -n # p_id
          busyproc=$((busyproc-1))
        done
      fi
      samtools view -c "${b_file}" > "count_${b_file%.bam}" &
      busyproc=$((busyproc+1))
    done <bamfiles.txt
    while [[ "${busyproc}" -gt  0 ]]; do
      wait -n # p_id
      busyproc=$((busyproc-1))
    done
    ├── $HOME
    │   ├── out
    │       ├── < output name in dxapp.json >
    │           ├── output file
    outputdir="${HOME}/out/counts_txt"
    mkdir -p "${outputdir}"
    cat count* \
      | awk '{sum+=$1} END{print "Total reads = ",sum}' \
      > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
    
    dx-upload-all-outputs
    set -e -x -o pipefail
    echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
    echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
    
    mkdir workspace
    cd workspace
    dx download "${mappings_sorted_bam}"
    
    if [ -z "$mappings_sorted_bai" ]; then
      samtools index "$mappings_sorted_bam_name"
    else
      dx download "${mappings_sorted_bai}"
    fi
    dxapp.json
    { "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py"
      }
    }
    code.py
    import dxpy
    
    @dxpy.entry_point('main')
    def main(**kwargs):
        print("Hello, DNAnexus!")
        return {}
    dx login
    dx build -a
    dx run coolapp --watch
    dxapp.json
    { "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      }
    }
    code.py
    import dxpy, subprocess
    
    @dxpy.entry_point('main')
    def main(seq1, seq2):
        dxpy.download_dxfile(seq1, "seq1.fasta")
        dxpy.download_dxfile(seq2, "seq2.fasta")
    
        subprocess.call("blastn -query seq1.fasta -subject seq2.fasta > report.txt", shell=True)
    
        report = dxpy.upload_local_file("report.txt")
        return {"blast_result": report}
    dx run coolapp \
      -i seq1="Demo Data:/Developer Quickstart/NC_000868.fasta" \
      -i seq2="Demo Data:/Developer Quickstart/NC_001422.fasta" \
      --watch
    dxapp.json
    {
      "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      },
      "inputSpec": [
        {"name": "seq1", "class": "file"},
        {"name": "seq2", "class": "file"}
      ],
      "outputSpec": [
        {"name": "blast_result", "class": "file"}
      ]
    }
    code.py
    import dxpy, subprocess
    
    @dxpy.entry_point('main')
    def main(seq1, seq2, evalue, blast_args):
        dxpy.download_dxfile(seq1, "seq1.fasta")
        dxpy.download_dxfile(seq2, "seq2.fasta")
    
        command = "blastn -query seq1.fasta -subject seq2.fasta -evalue {e} {args} > report.txt".format(e=evalue, args=blast_args)
        subprocess.call(command, shell=True)
    
        report = dxpy.upload_local_file("report.txt")
        return {"blast_result": report}
    dxapp.json
    {
      "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      },
      "inputSpec": [
        {"name": "seq1", "class": "file"},
        {"name": "seq2", "class": "file"},
        {"name": "evalue", "class": "float", "default": 0.01},
        {"name": "blast_args", "class": "string", "default": ""}
      ],
      "outputSpec": [
        {"name": "blast_result", "class": "file"}
      ]
    }
    "runSpec": {
      ...
      "execDepends": [
        {"name": "samtools"}
      ]
    }
    {
    mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/SRR504516.bam']
    mappings_bam_name: [u'SRR504516.bam']
    mappings_bam_prefix: [u'SRR504516']
    index_file_path: [u'/home/dnanexus/in/index_file/SRR504516.bam.bai']
    index_file_name: [u'SRR504516.bam.bai']
    index_file_prefix: [u'SRR504516']
    }
    inputs = dxpy.download_all_inputs()
    shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
    input_bam = inputs['mappings_bam_name'][0]
    
    bam_to_use = create_index_file(input_bam)
    print("Dir info:")
    print(os.listdir(os.getcwd()))
    
    regions = parseSAM_header_for_region(bam_to_use)
    
    view_cmds = [
        create_region_view_cmd(bam_to_use, region)
        for region
        in regions]
    
    print('Parallel counts')
    t_pools = ThreadPool(10)
    results = t_pools.map(run_cmd, view_cmds)
    t_pools.close()
    t_pools.join()
    
    verify_pool_status(results)
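    def verify_pool_status(proc_tuples):
        # Each tuple is (stdout, stderr, exit_code), as returned by run_cmd
        err_msgs = []
        for proc in proc_tuples:
            if proc[2] != 0:
                err_msgs.append(proc[1])
        if err_msgs:
            # stderr is captured as bytes; decode it for a readable error message
            raise dxpy.exceptions.AppInternalError(
                b"\n".join(err_msgs).decode("utf-8", errors="replace"))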
    resultfn = bam_to_use[:-4] + '_count.txt'
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    
    return output
    {
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    }
    inputs = dxpy.download_all_inputs()
    # download_all_inputs returns a dictionary that maps each input to its downloaded file location.
    # Additionally, helper key-value pairs are added to the dictionary, similar to the Bash helper variables.
    inputs
    #     mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
    #     mappings_sorted_bam_name: [u'SRR504516.bam']
    #     mappings_sorted_bam_prefix: [u'SRR504516']
    #     mappings_sorted_bai_path: [u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai']
    #     mappings_sorted_bai_name: [u'SRR504516.bam.bai']
    #     mappings_sorted_bai_prefix: [u'SRR504516']
    print("Number of cpus: {0}".format(cpu_count()))  # Get cpu count from multiprocessing
    worker_pool = Pool(processes=cpu_count())         # Create a pool of workers, 1 for each core
    results = worker_pool.map(run_cmd, collection)    # map run_cmds to a collection
                                                      # Pool.map handles orchestrating the job
    worker_pool.close()
    worker_pool.join()  # Make sure to close and join workers when done
    def run_cmd(cmd_arr):
        """Run shell command.
        Helper function to simplify the pool.map() call in our parallelization.
        Raises OSError if command specified (index 0 in cmd_arr) isn't valid
        """
        proc = subprocess.Popen(
            cmd_arr,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        exit_code = proc.returncode
        proc_tuple = (stdout, stderr, exit_code)
        return proc_tuple
    def parse_sam_header_for_region(bamfile_path):
        """Helper function to match SN regions contained in SAM header
    
        Returns:
            regions (list[string]): list of regions in bam header
        """
        header_cmd = ['samtools', 'view', '-H', bamfile_path]
        print('parsing SAM headers:', " ".join(header_cmd))
        headers_str = subprocess.check_output(header_cmd).decode("utf-8")
        rgx = re.compile(r'SN:(\S+)\s')
        regions = rgx.findall(headers_str)
        return regions
    # Write results to file
    resultfn = inputs['mappings_sorted_bam_name'][0]
    resultfn = (
        resultfn[:-4] + '_count.txt'
        if resultfn.endswith(".bam")
        else resultfn + '_count.txt')
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    return output
    def verify_pool_status(proc_tuples):
        """
        Helper to verify worker succeeded.
    
        As failed commands are detected, the `stderr` from that command is written
        to the job_error.json file. This file is printed to the Platform
        job log on App failure.
        """
        err_msgs = []
        for proc in proc_tuples:
            # Each tuple is (stdout, stderr, exit_code), as returned by run_cmd
            if proc[2] != 0:
                err_msgs.append(proc[1])
        if err_msgs:
            # stderr is captured as bytes; decode it for a readable error message
            raise dxpy.exceptions.AppInternalError(
                b"\n".join(err_msgs).decode("utf-8", errors="replace"))
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    {
    ...
        "runSpec": {
       ...
          "execDepends": [
            {"name": "samtools"}
          ]
        }
    ...
    }
    {
      "runSpec": {
        ...
        "systemRequirements": {
          "main": {
            "instanceType": "mem1_ssd1_x4"
          },
          "count_func": {
            "instanceType": "mem1_ssd1_x2"
          },
          "sum_reads": {
            "instanceType": "mem1_ssd1_x4"
          }
        },
        ...
      }
    }
    dx download "${mappings_sorted_bam}"
    chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    if [ -z "${mappings_sorted_bai}" ]; then
      samtools index "${mappings_sorted_bam_name}"
    else
      dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}".bai
    fi
    
    count_jobs=()
    for chr in $chromosomes; do
      seg_name="${mappings_sorted_bam_prefix}_${chr}".bam
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
      bam_seg_file=$(dx upload "${seg_name}" --brief)
      count_jobs+=($(dx-jobutil-new-job -isegmentedbam_file="${bam_seg_file}" -ichr="${chr}" count_func))
    done
    for job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${job}:counts_txt")
    done
    
    sum_reads_job=$(dx-jobutil-new-job "${readfiles[@]}" -ifilename="${mappings_sorted_bam_prefix}" sum_reads)
    count_func ()
    {
        echo "Value of segmentedbam_file: '${segmentedbam_file}'";
        echo "Chromosome being counted '${chr}'";
        dx download "${segmentedbam_file}";
        readcount=$(samtools view -c "${segmentedbam_file_name}");
        printf "%s:\t%s\n" "${chr}" "${readcount}" > "${segmentedbam_file_prefix}.txt";
        readcount_file=$(dx upload "${segmentedbam_file_prefix}".txt --brief);
        dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
    }
    sum_reads ()
    {
        set -e -x -o pipefail;
        printf "Value of read file array: %s\n" "${readfiles[@]}";
        echo "Filename: ${filename}";
        echo "Summing values in files and creating output read file";
        for read_f in "${readfiles[@]}";
        do
            echo "${read_f}";
            dx download "${read_f}" -o - >> chromosome_result.txt;
        done;
        count_file="${filename}_chromosome_count.txt";
        total=$(awk '{s+=$2} END {print s}' chromosome_result.txt);
        echo "Total reads: ${total}" >> "${count_file}";
        readfile_name=$(dx upload "${count_file}" --brief);
        dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
    }
    The Tools Library provides a list of available apps and workflows. To see this list, select Tools Library from the Tools entry in the main Platform menu.
    circle-info

    On the DNAnexus Platform, apps and workflows are generically referred to as "tools."

    To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:

    Find all tools with 'assay' in their name.

    To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row is highlighted in blue. The tool's inputs and outputs are displayed in a pane to the right of the list:

    Check a tool's list of inputs and outputs.

    To make sure you can find a tool later, you can pin it to the top of the list. Click the More actions (⋮) icon at the far right end of the row showing the tool's name and key details about it. Then click Add Pin.

    Pin your favorite apps to the top of the list.

    To learn more about a tool, click its name in the list. The tool's detail page opens, showing a wide range of information, including guidance on how to use it, version history, pricing, and more:

    View app details with usage instructions.

    hashtag
    Running Apps and Workflows

    hashtag
    Launching a Tool

    hashtag
    Launching from the Tools Library

    You can quickly launch the latest version of any given tool from the Tools Library page. Alternatively, navigate to the app's details page and click Run.

    By default, you run the latest app version.

    hashtag
    Launching from a Project

    From within a project, navigate to the Manage pane, then click the Start Analysis button.

    A dialog window opens, showing a list of tools. These include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:

    Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in its folder location, and click Run.

    hashtag
    Launch Configuration

    Confirm the details of the tool you are about to run. You must select a project location before any tool can be run, and you need at least Contributor access to that project.

    Provide name and output location before the launch
    circle-info

    Specialized tools, such as JupyterLab and Spark Apps, require special licenses to run.

    hashtag
    Configure Inputs and Outputs

    The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.

    Fill in the required inputs before starting the run

    You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.

    Help information for each field and the tool overall

    To configure instance type settings for a given tool or stage, click the Instance Type icon in the top-right corner of the stage.

    Show / Hide instance type settings

    To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.

    Configure output locations for each stage of the workflow

    The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.

    The workflow's I/O graph visualization

    Once all required inputs have been configured, the page indicates that the run is ready to start. Click on Start Analysis to proceed to the final step.

    The tool has been fully configured and is ready to start the run

    hashtag
    Configure Runtime Settings

    As the last step before launching the tool, you can review and confirm specific runtime settings, including execution name, output location, priority, job rank, spending limit, and resource allocation. You can also review and modify instance type settings before starting the run.

    Once you have confirmed final details, click Launch Analysis to start the run.

    Review and confirm runtime settings before starting the run
    Configure advanced runtime settings
    circle-info

    A license is required to use the Job Ranking feature. Contact DNAnexus Support for more information.

    hashtag
    Batch Run

    Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.

    hashtag
    Specify Batch Inputs

    To enable batch run, start from any input that you want to vary between runs, and open its I/O Options menu on the right-hand side. From the list of available options, select Enable Batch Run.

    Input fields with batch run enabled are highlighted with a Batch label. Click any of the batch-enabled input fields to enter the batch run configuration page.

    circle-info

    Not all input classes are supported for batch run configuration. See the table below.

    Input Class | Batch Run Support
    Files and other data objects | Yes
    Files and other data objects (array) | Partially supported. Can accept entry of a single-value array
    String | Yes

    Integer

    hashtag
    Configure Batch Inputs

    The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.
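
    The table on the configuration page can be thought of as one row per run, with batch inputs varying by row and non-batch inputs shared by every run. The following is a minimal sketch of that model, not Platform code: the function `build_batch_inputs` is hypothetical, the input names reuse the `coolapp` example from this documentation, and the file IDs are placeholders.

```python
def build_batch_inputs(shared_inputs, batch_rows):
    """Combine shared (non-batch) inputs with per-run batch inputs.

    `shared_inputs` is a dict applied to every run. `batch_rows` is a list
    of dicts, one per run, holding the inputs that vary between runs.
    Returns one complete input dict per run.
    """
    runs = []
    for row in batch_rows:
        # An input is either batch-enabled or shared, never both.
        overlap = set(shared_inputs) & set(row)
        if overlap:
            raise ValueError(
                f"inputs set as both batch and non-batch: {sorted(overlap)}")
        runs.append({**shared_inputs, **row})
    return runs


# Hypothetical values, for illustration only.
shared = {"evalue": 0.01}
rows = [
    {"seq1": "file-aaaa", "seq2": "file-bbbb"},
    {"seq1": "file-cccc", "seq2": "file-dddd"},
]
for run_input in build_batch_inputs(shared, rows):
    print(run_input)
```

    Each printed dict corresponds to one row of the batch configuration table.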

    As with inputs for non-batch runs, you need to fill in all required input fields before proceeding to the next steps. Optional inputs, and required inputs with a predefined default value, can be left empty.
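
    The rule about which fields may be left empty can be sketched as a small check against an input specification. This is an illustrative helper, not part of the Platform or `dxpy`: the function name `missing_required_inputs` is hypothetical, and the specification reuses the `inputSpec` from the `coolapp` example in this documentation.

```python
def missing_required_inputs(input_spec, provided):
    """List required inputs that still need a value.

    `input_spec` follows the dxapp.json inputSpec shape: a list of dicts
    with "name", plus optional "optional" and "default" keys.
    `provided` maps input names to the values entered so far.
    """
    missing = []
    for field in input_spec:
        # Optional inputs and inputs with a default may be left empty.
        if field.get("optional", False) or "default" in field:
            continue
        if provided.get(field["name"]) is None:
            missing.append(field["name"])
    return missing


input_spec = [
    {"name": "seq1", "class": "file"},
    {"name": "seq2", "class": "file"},
    {"name": "evalue", "class": "float", "default": 0.01},
    {"name": "blast_args", "class": "string", "default": ""},
]
print(missing_required_inputs(input_spec, {"seq1": "file-xxxx"}))  # ['seq2']
```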

    Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.

    All 10 batch runs have been fully configured and are ready to launch

    hashtag
    Starting and Monitoring Your Analysis

    Once you've finished setting up your tool, start your analysis by clicking the Start Analysis button. Follow these instructions to monitor the job as it runs.

    hashtag
    Learn More

    Learn in depth about running apps and workflows, leveraging advanced techniques like Smart Reuse.

    Learn how to build an app.

    Learn more about building apps using Bash or Python.

    Learn in depth about building and deploying apps, including Spark apps.

    Learn in depth about importing, building, and running workflows.

    Follow these instructions to set up billing.
    List views can be used to visualize categorical data.

    When creating a list view:

    • The data must be from a field that contains either categorical or categorical multi-select data

    • This field must contain no more than 20 distinct category values

    • The values can be organized in a hierarchy

    Supported Data Types

    Categorical (<=20 distinct category values)

    Categorical Multiple (<=20 distinct category values)

    Categorical Hierarchical (<=20 distinct category values)

    Categorical Hierarchical Multiple (<=20 distinct category values)
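
    Before choosing a list view, it can help to check a field against the limit listed above. The following is a minimal sketch under stated assumptions, not Platform code: the function `can_use_list_view` is hypothetical, records are modeled as plain Python values, and the value "General episode" is made up for illustration.

```python
MAX_LIST_VIEW_CATEGORIES = 20  # limit stated in the list above


def can_use_list_view(values):
    """Return True if a categorical field fits the list view limit.

    `values` holds one entry per record. A multi-select record is a list
    or set of categories, and None marks a record with no value.
    """
    distinct = set()
    for value in values:
        if value is None:
            continue  # missing values are not categories
        if isinstance(value, (list, set)):
            distinct.update(value)  # multi-select: count each category
        else:
            distinct.add(value)
    return len(distinct) <= MAX_LIST_VIEW_CATEGORIES


episode_types = ["Delivery episode", "General episode", None, "Delivery episode"]
print(can_use_list_view(episode_types))
```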

    hashtag
    Using List Views to Visualize Hierarchically Organized Data

    List views, unlike row charts, can be used to visualize categorical data with values that are organized in a hierarchical fashion.

    hashtag
    Using List Views to Visualize Data from Two Different Fields

    List views can be used to visualize categorical data from two different fields. Each of these fields is subject to the same restrictions that apply when creating a basic list view.

    hashtag
    Using List Views in the Cohort Browser

    hashtag
    Visualizing Data from a Single Field

    In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort that contain this value (the "count"). Also shown is a figure labeled "freq.", which is the percentage of all cohort records that contain the value.

    Below is a sample list view showing the distribution of values in the field Episode type. In the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.

    List View in the Cohort Browser
    circle-info

    When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See Chart Totals and Missing Data for more information on how missing data affects chart calculations.
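
    The arithmetic behind "count" and "freq." can be sketched in a few lines. This is an illustration of the calculation described above, not the Cohort Browser's implementation: the function `list_view_rows` is hypothetical, and apart from the 13 "Delivery episode" records out of 80, the sample data is made up.

```python
from collections import Counter


def list_view_rows(values):
    """Return {value: (count, freq_percent)} for one field.

    `values` has one entry per record in the cohort; None marks a record
    with no value. freq is a percentage of the whole cohort, so records
    with missing values still count toward the denominator.
    """
    cohort_size = len(values)
    if cohort_size == 0:
        return {}
    counts = Counter(v for v in values if v is not None)
    return {
        value: (count, 100.0 * count / cohort_size)
        for value, count in counts.items()
    }


# 80 participants, 13 of them with "Delivery episode".
# "Other episode" and the 7 missing records are made up for illustration.
cohort = ["Delivery episode"] * 13 + ["Other episode"] * 60 + [None] * 7
rows = list_view_rows(cohort)
print(rows["Delivery episode"])                  # (13, 16.25)
print(sum(count for count, _ in rows.values()))  # 73, smaller than 80
```

    Because seven records have no value, the counts sum to 73 instead of 80, and the frequencies sum to less than 100%.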

    hashtag
    Visualizing Data from Two Fields

    To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.

    Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:

    Primary Field Values in a List View Visualizing Data from Two Fields

    Critical care record origin is the primary field; Critical care record format is the secondary field.

    Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:

    Seeing Combinations of Field Values

    Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.

    In these additional rows, "count" and "freq." figures refer to records having a particular combination of values in the two fields.
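
    Counting by combination can be sketched as follows. This is an illustrative example, not the Cohort Browser's implementation: the function `combination_rows` and the short field keys are hypothetical, the format values are made up, and it assumes "freq." is taken over the whole cohort.

```python
from collections import Counter


def combination_rows(records, primary, secondary):
    """Count records for each (primary value, secondary value) pair.

    Returns {(primary_value, secondary_value): (count, freq_percent)}.
    Records missing either field are skipped when counting, but still
    count toward the cohort size used for freq (an assumption).
    """
    cohort_size = len(records)
    pairs = Counter(
        (r[primary], r[secondary])
        for r in records
        if r.get(primary) is not None and r.get(secondary) is not None
    )
    return {pair: (n, 100.0 * n / cohort_size) for pair, n in pairs.items()}


# Made-up records; only "Originating from Scotland" comes from the example above.
records = [
    {"origin": "Originating from Scotland", "format": "Format A"},
    {"origin": "Originating from Scotland", "format": "Format A"},
    {"origin": "Originating from Scotland", "format": "Format B"},
    {"origin": "Originating elsewhere", "format": "Format A"},
]
rows = combination_rows(records, "origin", "format")
print(rows[("Originating from Scotland", "Format A")])  # (2, 50.0)
```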

    hashtag
    Visualizing Complex Categorical Data

    Below is an example of a list view used to visualize data in the categorical hierarchical field Home State/Province:

    List View of Hierarchical Categorical Data

    By default, only values in the category at the top level of the hierarchy are displayed.

    Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:

    Seeing Combinations of Values in a Field Containing Hierarchical Categorical Data

    In these additional rows, "count" and "freq." figures refer to records having a particular combination of values across levels of the hierarchy. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category and "British Columbia" for the second-level category.

    The following example shows how "count" and "freq." are calculated, for list views based on fields containing categorical data organized into multiple levels of hierarchy:

    Multiple Levels of Hierarchy

    For the bottommost row, "count" and "freq." refer to records having the following values:

    • "Yes" for the category at the top of the hierarchy

    • "9" for the category at the second level of the hierarchy

    • "8" for the category at the third level of the hierarchy

    • "7" for the category at the fourth level of the hierarchy

    • "3" for the category at the bottom level of the hierarchy
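
    Put differently, a row at any depth counts the records that match every category on the path leading down to it. The sketch below illustrates that rule under stated assumptions: the function `hierarchy_row` is hypothetical, each record's value is modeled as a tuple of categories from the top level down, and all sample values other than "Canada" and "British Columbia" are made up.

```python
def hierarchy_row(paths, prefix):
    """Return (count, freq_percent) for one row of a hierarchical list view.

    `paths` holds one tuple per record, listing that record's category at
    each level from the top of the hierarchy down (None if missing).
    `prefix` identifies a row by the categories leading to it.
    """
    prefix = tuple(prefix)
    depth = len(prefix)
    count = sum(
        1 for path in paths
        if path is not None and tuple(path[:depth]) == prefix
    )
    freq = 100.0 * count / len(paths) if paths else 0.0
    return count, freq


# A made-up cohort of 10 records; one has the value British Columbia.
cohort = (
    [("Canada", "British Columbia")]
    + [("Canada", "Ontario")] * 3
    + [("USA", "Ohio")] * 6
)
print(hierarchy_row(cohort, ["Canada", "British Columbia"]))  # (1, 10.0)
print(hierarchy_row(cohort, ["Canada"]))                      # (4, 40.0)
```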

    hashtag
    Locating Values in a List View

    In cases where the field has categories at multiple levels and this makes it difficult to find a particular value, use the search box at the bottom of the list view to home in on the row or rows containing that value:

    Using the Search Function in a List View

    hashtag
    List Views in Cohort Compare

    In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:

    List view: Treatment/Medication Code in compare mode

    In each column, "count" and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.
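
    The per-cohort calculation can be sketched as follows. This is an illustration only, not the Cohort Browser's implementation: the function `compare_rows` is hypothetical and the treatment codes are made up. Each cohort's frequencies use that cohort's own size as the denominator.

```python
from collections import Counter


def compare_rows(cohort_a, cohort_b):
    """Return {value: ((count_a, freq_a), (count_b, freq_b))}.

    Each cohort is a list with one field value per record (None if
    missing). Frequencies are computed within each cohort separately.
    """
    def summarize(values):
        counts = Counter(v for v in values if v is not None)
        size = len(values)
        return {v: (n, 100.0 * n / size) for v, n in counts.items()}

    a, b = summarize(cohort_a), summarize(cohort_b)
    return {
        value: (a.get(value, (0, 0.0)), b.get(value, (0, 0.0)))
        for value in sorted(set(a) | set(b))
    }


# Made-up treatment codes, for illustration only.
cohort_a = ["A01"] * 2 + ["B02"] * 2   # 4 records
cohort_b = ["A01"] * 1 + ["B02"] * 9   # 10 records
print(compare_rows(cohort_a, cohort_b)["A01"])  # ((2, 50.0), (1, 10.0))
```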

    Preparing Data for Visualization in List Views

    When ingesting data using Data Model Loader, the following data types can be visualized in list views:

    • String Categorical

    • String Categorical Hierarchical

    • String Categorical Multi-Select

    • String Categorical Multi-Select Hierarchical

    • String Categorical Sparse

    • String Categorical Sparse Hierarchical

    • Integer Categorical

    • Integer Categorical Hierarchical

    • Integer Categorical Multi-Select

    • Integer Categorical Multi-Select Hierarchical

    Gene expression datasets are created using the Molecular Expression Assay Loader.

    Customizing the Gene Expression Dashboard

    You can customize your Gene Expression dashboard to focus on the most relevant analyses for your research:

    • Create new Expression Distribution or Feature Correlation charts.

    • Remove charts you no longer need.

    • Resize and reposition charts to optimize your workspace.

    • Save your dashboard customizations along with your cohort.

    Visualizing gene expression data in Cohort Browser

    The Gene Expression dashboard supports up to 15 charts, allowing you to create comprehensive expression analysis workspaces.

    For datasets with multiple gene expression assays, you can choose the specific assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. Switching between assays preserves your charts and their display settings.

    Filtering by Gene Expression

    You can define your cohort by gene expression to include only patients with specific expression characteristics.

    To apply a gene expression filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Assays > Gene Expression, select a genomic filter.

    3. In Edit Filter: Gene Expression, specify the criteria:

      • For datasets with multiple gene expression assays, select the specific assay to filter by.

      • In Expression Level, specify inclusive minimum and maximum values. For an individual to be included, all their expression values across all samples for the feature must fall within the range.

      • In Gene / Transcript, enter a gene symbol, such as BRCA1, or feature ID, such as ENSG00000012048 or ENST00000309586. Search is case insensitive.

    4. Click Apply Filter.

    You can specify up to 10 gene expression filters for each cohort. All filters use an AND relationship.
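The inclusion rule described above (inclusive bounds, every sample value in range, filters combined with AND) can be sketched as follows. This is an illustration only: the data layout and the helper names `passes_filter` and `in_cohort` are hypothetical, and treating an individual with no values for a feature as not matching is an assumption, not documented Platform behavior.

```python
def passes_filter(sample_values, minimum, maximum):
    """True if there is at least one value and every value lies in [minimum, maximum]."""
    # Assumption: an individual without values for the feature does not match.
    return bool(sample_values) and all(minimum <= v <= maximum for v in sample_values)

def in_cohort(individual, filters):
    """Apply every filter with AND semantics.

    individual: {feature_id: [expression value per sample]} (hypothetical layout).
    filters:    list of (feature_id, minimum, maximum) tuples.
    """
    return all(
        passes_filter(individual.get(feature, []), low, high)
        for feature, low, high in filters
    )

# Made-up individual with two samples for one feature.
patient = {"ENSG00000012048": [2.0, 5.0]}
```

With these values, a filter of 2.0 to 5.0 keeps the individual because both bounds are inclusive, while 2.0 to 4.9 excludes them because one sample falls outside the range.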

    Adding a gene expression filter

    After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.

    Visualizing Expression Distribution

    The Expression Level charts help you visualize gene expression patterns for individual transcript or gene features. You can examine how expression values are distributed across your cohort, identify outliers, and compare patterns between different patient groups.

    The chart displays data for one gene or transcript at a time. You can directly enter a transcript or gene feature ID, such as an ID starting with ENST or ENSG, or search by gene symbol to see available options.

    Visualizing TP53 gene expression

    You can view the data as either a histogram showing frequency distribution or a box plot displaying quartiles and outliers. To customize the chart display, including applying logarithmic transformations for wide-range expression data, click ⛭ Chart Settings.

    When comparing cohorts, the chart shows data from each cohort on the same axes for direct comparison.

    You can also modify your charts by selecting different transcript or gene features, resizing and rearranging them on your dashboard, or adjusting display settings to focus on the most relevant analyses for your research.

    Exploring Feature Correlations

    The Feature Correlation charts help you understand how the expression levels of two genes or transcripts relate to each other. You can use these charts to identify genes or transcripts that are co-expressed, explore potential pathway interactions, and compare correlation patterns between different cohorts.

    The chart displays a scatter plot where each point represents a sample, with the X and Y axes showing expression values for your two selected features. A best fit line shows the overall relationship trend, and you can swap which gene appears on which axis to view the data from different perspectives.

    Exploring feature correlations between ERBB2 and TP53

    The correlation analysis includes statistical measures to help you determine if the relationship you're seeing is meaningful. The Pearson correlation coefficient shows both the strength and direction of the linear relationship (ranging from -1 to +1), while the p-value indicates whether the correlation is statistically significant.

    You can toggle these statistics on or off as needed. The chart updates when you change your feature selections or switch between viewing single cohorts versus comparing multiple cohorts. This quantitative analysis helps you assess whether observed correlations are both statistically sound and biologically relevant to your research.
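For reference, the Pearson correlation coefficient mentioned above can be computed from paired expression values as in the sketch below. It uses the standard textbook formula and made-up inputs; it is not the chart's implementation and does not compute the p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 indicates that the two features rise together, a value near -1 indicates that one falls as the other rises, and a value near 0 indicates no linear relationship.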

    Examining Detailed Gene Expression Information

    The Expression Per Feature table provides gene metadata and expression statistics for all features in your dataset. Use the search bar to find specific genes by symbol or explore genes within genomic ranges.

    Examining TP53 expression per feature

    The table displays one row per feature ID with the following columns:

    • Feature ID: The unique transcript or gene identifier, such as ENST for a transcript or ENSG for a gene

    • Gene Symbol: The official gene name or symbol associated with the feature ID, such as TP53

    • Location: The genomic coordinates in "chromosome:start-end" format

    • Strand: The DNA strand orientation (+ or -)

    • Expression (Mean): The average expression value for this feature across the current cohort

    • Expression (SD): The standard deviation of expression values

    • Expression (Median): The median expression value

    When comparing cohorts, the table shows separate expression statistics for each cohort, allowing direct comparison of expression patterns.

    Each feature includes links to external annotation resources:

    • Ensembl transcript pages: Detailed transcript information and annotations

    • Ensembl gene pages: Comprehensive gene summaries and functional data

    These links provide quick access to additional context about genes and transcripts of interest.

    • TXT

    • PNG

    • PDF

    • HTML

    To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, the "Preview" and "Open in New Tab" options appear in the toolbar above.

    Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.

    "Preview" opens a fixed-sized box in your current tab to preview the file of interest. "Open in New Tab" enables viewing the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may produce different results.


    The file type is not necessarily determined by the file extension. For example, you can preview a FASTA file reads.fa, even though the file extension is not .txt. However, you cannot preview a BAM file (a binary file) using the Preview option.

    Preview Restrictions

    File preview and viewer functionality are subject to project access controls. When a project has the previewViewerRestricted flag enabled, preview and viewer capabilities are disabled for all project members. This flag is automatically set to true when downloadRestricted is enabled on a project (for both new projects and when updating existing projects), though project admins can override this behavior by explicitly providing the previewViewerRestricted flag.
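The relationship between the two flags can be summarized in a short sketch. The function below is hypothetical and only restates the rule in this paragraph; it assumes that previewViewerRestricted is false when neither flag is set, which the text does not state explicitly.

```python
def effective_preview_viewer_restricted(download_restricted, preview_viewer_restricted=None):
    """Resolve the previewViewerRestricted value for a project.

    preview_viewer_restricted=None means the flag was not explicitly provided.
    """
    if preview_viewer_restricted is not None:
        # An explicitly provided value overrides the automatic behavior.
        return preview_viewer_restricted
    # Otherwise the flag follows downloadRestricted (assumed false if that is off).
    return bool(download_restricted)
```

For example, a project with downloadRestricted enabled and no explicit value resolves to restricted, while the same project with previewViewerRestricted explicitly set to false keeps previews available.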

    Using File Viewers

    For files not listed in the section above, the DNAnexus Platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.

    A Viewer is an HTML file that you can give one or more DNAnexus URLs representing files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.


    The data you select to be viewed is accessible by the Viewer, which can also access the Internet. You should only run Viewers from trusted sources.

    Launching a Viewer

    You can launch a viewer by clicking on the Visualize tab within a project.

    This tab opens a window displaying all Viewers available to you within your project. Any Viewers you've created and saved within your current project appear in this list along with the DNAnexus-provided Viewers.

    Clicking on a Viewer opens a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any of your other data.) From there, you can either create a Viewer Shortcut or launch the Viewer.

    Example Viewers

    Human Genome Browsers (BioDalliance, IGV.js)

    The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either one of these viewers, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi for each variant track you want to add. Also, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.

    For more information about BioDalliance, consult BioDalliance's Getting Started documentation. For IGV.js, see the IGV website.

    BAM Header Viewer

    The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used). When launching this viewer, tick one or more BAM files (*.bam).

    Jupyter Notebook Viewer

    The Jupyter notebook viewer displays *.ipynb notebook files, showing notebook images, highlighted code blocks and rendered markdown blocks as shown below.

    Gzipped File Viewers

    This viewer allows you to decompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).
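If you want the same kind of peek outside the browser, for example on a file you have already downloaded, the idea can be sketched in a few lines of Python. This is an analogy to zcat <file> | head, not the viewer's code; the `peek_gzip` helper and the sample data are hypothetical.

```python
import gzip
import io

def peek_gzip(stream, max_bytes=4096):
    """Return up to max_bytes of decompressed data from a gzip byte stream."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as handle:
        return handle.read(max_bytes)

# Self-contained demo: compress a tiny made-up FASTQ record in memory, then peek at it.
compressed = gzip.compress(b"@read1\nACGT\n+\nFFFF\n")
preview = peek_gzip(io.BytesIO(compressed), max_bytes=11)
```

For a local file, pass a handle opened in binary mode, such as `open("reads.fastq.gz", "rb")`. Only the requested number of decompressed bytes is returned.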

    Troubleshooting Viewers

    If a viewer fails to load, try temporarily disabling browser extensions such as AdBlock and Privacy Badger. Also, viewers are not supported in Incognito browser windows.

    Custom Viewers

    Developers comfortable with HTML and JavaScript can create custom viewers to visualize data on the platform.

    Viewer Shortcuts

    Viewer Shortcuts are objects which, when opened, open a data selector to select inputs for launching a specified Viewer. The Viewer Shortcut includes a Viewer and an array of inputs that are selected by default.

    The Viewer Shortcut appears in your project as an object of type "Viewer Shortcut." You can modify the name of the Viewer Shortcut and move it within your folders and projects like any other object in the DNAnexus Platform.

    Smart Reuse (Job Reuse)

    Speed workflow development and reduce testing costs by reusing computational outputs.


    A license is required to access the Smart Reuse feature. Contact DNAnexus Sales for more information.

    DNAnexus allows organizations to optionally reuse outputs of jobs that share the same executable and input IDs, even if these outputs are across projects or entire organizations. This feature has two primary use cases.

    Example Use Cases

    Dramatically Speed Up R&D of Workflows

    For example, suppose you are developing a workflow, and at each stage you end up debugging an issue. Each stage takes about one hour to develop and run. If you do not reuse outputs during development, the process takes 1 + 2 + 3 + ... + n = n(n+1)/2 hours, because at every stage you fix something and must recompute results from previous stages. By reusing results for stages that have matured and are no longer modified, the total development time equals the time it takes to develop and run the pipeline (in this case n hours). This reduces development time by a factor of (n+1)/2, so the improvement becomes more pronounced for longer workflows.

    This feature also saves time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the earlier stages, the time required for R&D of the forked version equals the time for the last stages.

    Dramatically Reduce Costs When Testing at Scale

    In production environments, you often need to test R&D modifications to a workflow at scale. This is especially relevant for workflows used in clinical tests. For example, suppose you are testing a workflow like the forked workflow discussed earlier. This clinical workflow must be tested on thousands of samples (call that number m) before it is vetted for production. Suppose the whole workflow takes n hours, at about one hour per stage, but only the last k stages changed. You save (n-k)m total compute hours. This adds up to substantial cost savings when m is large and k is small.
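The arithmetic in the two use cases above can be checked with a short sketch. The function names are hypothetical, and the figures assume, as in the examples, that each stage takes about one hour.

```python
def dev_hours_without_reuse(n):
    """1 + 2 + ... + n hours: every new stage forces a rerun of all earlier stages."""
    return n * (n + 1) // 2

def dev_hours_with_reuse(n):
    """n hours: each stage is computed once and then reused."""
    return n

def compute_hours_saved(n, k, m):
    """(n - k) * m hours saved when only the last k of n stages change, over m samples."""
    return (n - k) * m
```

For a 10-stage workflow, development drops from 55 hours to 10. If only the last 2 stages change and the workflow is tested on 1,000 samples, reuse avoids 8,000 compute hours.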

    Example Reuse with WDL

    To illustrate Smart Reuse, the following example uses WDL syntax as supported by the DNAnexus SDK and dxCompiler.

    The workflow above is a two-step workflow that duplicates a file and takes the first 10 lines from the duplicate.

    Suppose the user has run the workflow above on some file and wants to tweak headfile to output the first 15 lines instead:

    Here the only differences are the renamed headfile and basic_reuse, and the change from 10 to 15. The compilation process automatically detects that dupfile is the same but the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a different executable ID for headfile2.

    When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results from the dupfile task are reused. Because a job on the DNAnexus Platform has already run that specific executable with the same input file, the system can reuse its output.
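Conceptually, this behaves like a lookup keyed on the executable ID and the input IDs. The sketch below is a toy model of that idea, not the Platform's implementation: the `ReuseCache` class, the job numbering, and the IDs such as "applet-dupfile" and "file-A" are hypothetical placeholders. Only the outputReusedFrom field name comes from this page.

```python
def reuse_key(executable_id, inputs):
    """Build a hashable key from the executable ID and the input IDs."""
    return (executable_id, tuple(sorted(inputs.items())))

class ReuseCache:
    """Toy model: remembers which completed job ran a given executable and inputs."""

    def __init__(self):
        self._finished = {}  # reuse key -> ID of the job that produced the outputs
        self._counter = 0

    def _new_job_id(self):
        self._counter += 1
        return f"job-{self._counter}"

    def run(self, executable_id, inputs):
        key = reuse_key(executable_id, inputs)
        job_id = self._new_job_id()
        if key in self._finished:
            # Same executable and inputs: point at the earlier job instead of recomputing.
            return {"id": job_id, "outputReusedFrom": self._finished[key]}
        self._finished[key] = job_id
        return {"id": job_id, "outputReusedFrom": None}

cache = ReuseCache()
first = cache.run("applet-dupfile", {"infile": "file-A"})      # original workflow
again = cache.run("applet-dupfile", {"infile": "file-A"})      # tweaked workflow, same stage
changed = cache.run("applet-headfile2", {"infile": "file-A"})  # modified stage
```

The second dupfile run points back at the first job, while headfile2 has a different executable ID and therefore runs normally.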

    When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the --preserve-job-outputs option. This preserves the outputs of all jobs in the execution tree in the project and increases the potential for subsequent Smart Reuse.

    Requirements for Smart Reuse

    Jobs can reuse results from previous jobs if the following criteria are met:

    • The organization that is billed for the job has Smart Reuse enabled.

    • Smart Reuse applies only to jobs completed after the org policy was enabled.

    • Smart Reuse is enabled at the executable level (ignoreReuse in dxapp.json).

    When a job reuses results, it includes an outputReusedFrom field pointing to the previous job ID. Reused jobs are reported as having run for 0 seconds and are billed at $0. If the reused job or workflow is in a different project or folder, output data is not cloned to the new project or destination folder (the job or workflow is not actually rerun).

    Controlling Smart Reuse

    Smart Reuse can be controlled at three levels. Runtime settings override executable defaults. Executable defaults override organization policy. If Smart Reuse is disabled at any level, reuse does not occur.

    1. Organization policy: Set the jobReuse policy to true (default is false). See How to Enable Smart Reuse.

    2. Executable default: Set ignoreReuse in dxapp.json to true or false. The default is false.

    How to Enable Smart Reuse

    To enable or disable Smart Reuse for your organization:


    If you plan to reuse results across projects, you must modify all applet and app configurations to include "allProjects": "VIEW" in the "access" field.

    If you are a licensed customer and cannot run the command above, contact DNAnexus Support. If you are interested in Smart Reuse and are not a licensed customer, reach out to DNAnexus Sales or your account executive for more information.

    Filtering Objects and Jobs

    You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.

    The filter bar lets you specify different criteria on which to filter your data. You can combine multiple filters for greater control over your results.

    To use this feature, first choose the field you want to filter your data by, then enter your filter criteria. For example, select the "Name" filter then search for "NA12878". The filter activates when you press enter or click outside of the filter bar.

    Filtering Projects

    The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.

    • Billed to: The user or org ID that the project is billed to, for example, "user-xxxx" or "org-xxxx". When viewing a partner organization's projects, the "Billed to" field is fixed to the org ID.

    • Project Name: Search by case-insensitive string or regex, for example, "Example" or "project$" both match "Example Project"

    • ID: Search by project ID, for example, "project-xxxx"

    Filtering Objects

    The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.

    • Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.

    • Object name: Search by case insensitive string or regex, for example, NA1 or bam$ both match NA12878.bam

    When filtering on anything other than the current folder, results appear from many different places in the project. The folders appear in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.

    Filtering Jobs and Analyses

    The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.

    • Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead

    • State: for example, Failed, Waiting, Done, Running, In Progress, Terminated

    • Name: Search by case-insensitive string or regex, for example, "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.
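The name-matching examples in this section behave like a case-insensitive regular expression search. The sketch below reproduces them with Python's re module; it assumes the Platform's regex dialect agrees with Python's for these simple patterns, and `name_matches` is a hypothetical helper.

```python
import re

def name_matches(query, name):
    """True if the query, treated as a case-insensitive regex, is found anywhere in name."""
    return re.search(query, name, flags=re.IGNORECASE) is not None
```

A plain string such as "NA1" matches anywhere in the name, while a pattern ending in $, such as "bam$", matches only at the end of the name.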

    Multi-Word Queries in Filters and Searches

    When filtering on a name, any spaces expand to include intermediate words. For example, filtering by "b37 build" also returns "b37 dbSNP build".
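One way to picture this expansion is to treat each space in the query as "anything may appear here". The sketch below models that reading with a regular expression. It is an assumption made for illustration, not the Platform's documented matching algorithm, and `multiword_pattern` is a hypothetical helper.

```python
import re

def multiword_pattern(query):
    """Join the words of a query so that any text may appear between them."""
    words = query.split()
    return re.compile(".*".join(re.escape(word) for word in words), re.IGNORECASE)

pattern = multiword_pattern("b37 build")
```

Under this model, "b37 build" matches both "b37 build" and "b37 dbSNP build", but not "build b37", because the word order is preserved.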

    Filtering by Date

    Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time in minutes, hours, days, weeks, months, or an absolute time by specifying a certain date.

    For relative time, specify an amount of time before the access time. For example, selecting "Day" and typing 5 sets the datetime to 5 days before the current time.

    Alternatively, you can use the calendar to represent an exact (absolute) datetime.

    Setting only the beginning datetime ("From") creates a range from that time to the access time. Setting only the end datetime ("To") creates a range from the earliest records to the "To" time.

    A filter with a relative time period updates each time it is accessed. For example, a filter for items created within two hours shows different results at different times: items from 9am at 11am, and items from 2pm at 4pm. For consistent results, use absolute datetimes from the calendar widget.
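The difference between relative and absolute ranges can be sketched with Python's datetime module. The `resolve_range` helper is hypothetical; it only models the rules stated above, in which relative offsets are resolved against the access time and a missing "From" means "from the earliest records".

```python
from datetime import datetime, timedelta

def resolve_range(access_time, from_ago=None, to_ago=None):
    """Turn relative offsets into an absolute (start, end) range.

    from_ago / to_ago: timedelta before the access time, or None if not set.
    A start of None stands for "from the earliest records".
    """
    start = access_time - from_ago if from_ago is not None else None
    end = access_time - to_ago if to_ago is not None else access_time
    return start, end

two_hours = timedelta(hours=2)
morning = resolve_range(datetime(2024, 1, 1, 11, 0), from_ago=two_hours)
afternoon = resolve_range(datetime(2024, 1, 1, 16, 0), from_ago=two_hours)
```

The same "within two hours" filter resolves to 9:00 through 11:00 when accessed at 11am and to 14:00 through 16:00 when accessed at 4pm, which is why absolute datetimes give consistent results.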

    Filtering by Tags and Properties

    Tags

    To search by tag, enter or select the tags you want to find. For example, to find all objects tagged with "human", type "human" in the filter box and select the checkbox next to the tag.

    Unlike other searches where you can enter partial text, tag searches require the complete tag name. However, capitalization doesn't matter. For example, searching for "HUMAN", "human", or "Human" all find objects with the "Human" tag. Partial matches like "Hum" do not return results.

    Properties

    Properties have two parts: a key and a value. The system prompts for both when creating a new property. Like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.

    To search for all items that have a property, regardless of the value of that property, select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that have a property with a specific value, enter that property's key and value.

    The keys and values must be entered in their entirety. For example, entering the key sample and the value NA does not match objects with {"sample_id": "NA12878"}.
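The exact-match rule for properties can be sketched as a dictionary lookup. The `has_property` helper is hypothetical and only models the behavior described above.

```python
def has_property(properties, key, value=None):
    """Match on the complete key, and on the complete value when one is given."""
    if key not in properties:
        return False
    return value is None or properties[key] == value

obj = {"sample_id": "NA12878"}
```

Searching for the key sample_id alone matches this object, as does sample_id with the value NA12878. The partial key sample or the partial value NA does not.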

    Any vs. All Queries

    Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you have a choice whether to search for objects containing any of the selected tags or containing all the selected tags.

    Given the following set of objects:

    • Object 1 (tags: "human", "normal")

    • Object 2 (tags: "human", "tumor")

    • Object 3 (tags: "mouse", "tumor")

    Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.
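The any/all distinction can be reproduced with a short sketch that uses the three objects above. The `filter_by_tags` helper is hypothetical. It compares complete tag names case-insensitively, following the tag search rules described earlier.

```python
def filter_by_tags(objects, selected, mode="any"):
    """Return names of objects carrying any (or all) of the selected tags."""
    wanted = {tag.lower() for tag in selected}
    combine = any if mode == "any" else all
    return [
        name
        for name, tags in objects.items()
        if combine(tag in {t.lower() for t in tags} for tag in wanted)
    ]

objects = {
    "Object 1": ["human", "normal"],
    "Object 2": ["human", "tumor"],
    "Object 3": ["mouse", "tumor"],
}
any_match = filter_by_tags(objects, ["human", "tumor"], mode="any")
all_match = filter_by_tags(objects, ["human", "tumor"], mode="all")
```

Here `any_match` lists all three objects and `all_match` lists only Object 2.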

    Clearing All Filters

    Click the "Clear All Filters" button on the filter bar to reset your filters.

    Saving Filters

    Active filters are saved in the URL of the filtered page. To save your filters, bookmark this URL in your browser so that you can return to your filtered view in the future.

    Bookmarking a filtered URL saves the search parameters, not the search results. The filters are applied to the data present when accessing the bookmarked link. For example, filters for items created in the last thirty days show items from the thirty days before viewing the results, not the thirty days before creating the bookmark. Results update based on when you access the saved search.

    User Interface Quickstart

    Learn to create a project, add members and data to the project, and run a simple workflow.


    You must set up billing for your account before you can perform an analysis, or upload or egress data.

    Step 1. Create Your First Project

    Path Resolution

    When using the command-line client, you may refer to objects either through their ID or by name.

    In the DNAnexus Platform, every data object has a unique ID starting with the class of the object followed by a hyphen ('-') and 24 alphanumeric characters. Common object classes include "record", "file", and "project". An example ID would be record-9zGPKyvvbJ3Q3P8J7bx00005. A string matching this format is always interpreted as the ID of such an object and is not further resolved as a name.
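The ID format described above can be checked with a regular expression. This is a sketch for illustration: `looks_like_object_id` is a hypothetical helper, not part of the command-line client, and it assumes that class names consist of lowercase letters only.

```python
import re

# Class name, a hyphen, then exactly 24 alphanumeric characters.
ID_PATTERN = re.compile(r"[a-z]+-[0-9A-Za-z]{24}")

def looks_like_object_id(text):
    """True if the whole string has the <class>-<24 alphanumerics> shape."""
    return ID_PATTERN.fullmatch(text) is not None
```

The example ID record-9zGPKyvvbJ3Q3P8J7bx00005 passes this check, while an ordinary name such as my-reads.fastq does not and would be resolved as a name or path.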

    The command-line client, however, also accepts names and paths as input in a particular syntax.


    Running Batch Jobs

    To launch a DNAnexus application or workflow on many files automatically, one may write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see these .

    Overview

    In this tutorial, you batch process a series of sample FASTQ files (forward and reverse reads). Use the dx generate_batch_inputs command to generate a batch file -- a tab-delimited (TSV) file where each row corresponds to a single run in the batch. Then you process the batch using the

    Job Lifecycle

    Learn about the states through which a job or analysis may go, during its lifecycle.

    Example Execution Tree

    The following example shows a workflow that has two stages, one of which is an applet, and the other of which is an app.

    When the workflow runs, it generates an analysis with an attached workspace for storing intermediate output from its stages. Jobs are created to run the two stages. These jobs can spawn additional jobs to run other functions in the same executable or to run separate executables. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).

    Relational Database Clusters

    Relational Database Clusters

    The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.

    The Relational Database Service is accessible through the application program interface (API) in AWS regions only. See for details.

    Archiving Files

    Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.


    A license is required to use the DNAnexus Archive Service. Contact DNAnexus Sales for more information.

    Archiving in DNAnexus is file-based. You can archive individual files, folders with files, or entire projects' files and save on storage costs. You can also unarchive one or more files, folders, or projects when you need to make the data available for further analyses.

    The DNAnexus Archive Service is available via the API in Amazon AWS and Microsoft Azure regions.

  • Created date: Search by projects created before, after, or between different dates

  • Modified date: Search by projects modified before, after, or between different dates

  • Creator: The user ID who created the project, for example, "user-xxxx"

  • Shared with member: A user ID with whom the project is shared, for example, "user-xxxx" or "org-xxxx"

  • Level: The minimum permission level to the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only". For example, "Contributor+" filters projects with access CONTRIBUTOR or ADMINISTER

  • Tags: Search by tag. The filter bar automatically populates with tags available on projects

  • Properties: Search by properties. The filter bar automatically provides properties available on projects

  • ID: Search by object ID, for example, file-xxxx or applet-xxxx
  • Modified date: Search by objects modified before, after, or between different dates

  • Class: such as "File", "Applet", "Folder"

  • Types: such as "File" or custom Type

  • Created date: Search by objects created before, after, or between different dates

  • Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder

  • Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder

  • ID: Search by job or analysis ID, for example, "job-1234" or "analysis-5678"

  • Created date: Search by executions created before, after, or between different dates

  • Launched by: Search by the user ID of the user who launched the job. The filter bar shows users who have run jobs visible in the project

  • Tags: Search by tag. The filter bar automatically populates with tags available on the visible executions

  • Properties: Search by properties. The filter bar automatically provides properties available on executions visible in the project

  • Executable: Search by the ID of executable run by the executions in question. Examples include app-1234 or applet-5678

  • Class: for example, Analysis or Job

  • Origin Jobs: ID of origin job

  • Parent Jobs: ID of parent job

  • Parent Analysis: ID of parent analysis

  • Root Executions: ID of root execution

  • Yes

    Float | Yes
    Boolean | Yes
    String (array) | No
    Integer (array) | No
    Float (array) | No
    Boolean (array) | No
    Hash | No

  • Smart Reuse is enabled at runtime (not using the --ignore-reuse flag).

  • A previous job used the exact same executable and input IDs (including the function called within the applet).

  • If an input is watermarked, both the watermark and its version match. Other settings, such as instance type, do not affect reuse.

  • The job being reused has all outputs available and accessible at the time of reuse.

  • You have at least VIEW access to the previous job's outputs.

  • The previous job's outputs still exist on the Platform.

  • For cross-project reuse, the application's dxapp.json file includes "allProjects": "VIEW" in the "access" field.

  • Outputs are assumed to be deterministic.

  • When ignoreReuse is false (the default), reuse is allowed. When ignoreReuse is true, Smart Reuse is disabled for the executable.
  • Runtime override: Control reuse at runtime using any of these methods:

    • Use the --ignore-reuse flag with dx run to disable reuse.

    • Use --extra-args '{"ignoreReuse": false}' or --extra-args '{"ignoreReuse": true}' to explicitly enable or disable reuse.

    • Set the ignoreReuse parameter in API calls to /app-xxxx/run, /applet-xxxx/run, or /workflow-xxxx/run.

    • For workflows, use --ignore-reuse-stage STAGE_ID to control specific stages.

    Process the batch using the dx run command with the --batch-tsv option.

    Generate Batch File

    The project My Research Project contains the following files in the project's root directory:

    Batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, specify the following inputs:

    • reads_fastqgzs - FASTQ containing the left mates

    • reads2_fastqgzs - FASTQ containing the right mates

    • genomeindex_targz - BWA reference genome index

    The BWA reference genome index from the public Reference Genome project (requires platform login) is used for all runs, while the forward and reverse reads vary from run to run. To generate a batch file that pairs the input reads:

    You can optionally provide a --path argument to restrict the search to a specific folder in your project, which is searched recursively. The value for --path must be a directory specified as:

    /path/to/directory or project-xxxx:/path/to/directory

    Any file present within this directory or recursively within any subdirectory of this directory is considered a candidate for a batch run.

    The (.*) are regular expression groups. You can provide arbitrary regular expressions as input. The text matched by the first group is used to pair files in the batch. These matches are called batch identifiers (batch IDs). To explain this behavior in more detail, consider the output of the dx generate_batch_inputs command:

    The dx generate_batch_inputs command creates the file dx_batch.0000.tsv, which looks like:

    Recall the regular expression was RP(.*)_R1_(.*).fastq.gz. Although there are two grouped matches in this example, only the first one is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz is 10B_S1 which corresponds to the first grouped match while the second one is ignored.
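The grouping rule can be tried out locally with Python's re module. This is a minimal sketch, not the implementation used by dx generate_batch_inputs: the helper name batch_id is hypothetical, and anchoring the pattern at the start of the file name is an assumption.

```python
import re

# Pattern from the example; only the first group is used as the batch ID.
PATTERN = re.compile(r"RP(.*)_R1_(.*).fastq.gz")


def batch_id(filename):
    """Return the text captured by the first group, or None if there is no match."""
    match = PATTERN.match(filename)
    return match.group(1) if match else None


print(batch_id("RP10B_S1_R1_001.fastq.gz"))
```

The second group still has to match for the file to be selected, but its captured text plays no part in the batch ID.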

    Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus Platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.

    If an input for the app is an array, the input file IDs within the batch.tsv file need to be in square brackets to work. The following bash command adds brackets to the file IDs in columns 4 and 5. You may need to change the variables in the command ($4 and $5) to match the correct columns in your file. The command's output file, new.tsv, is ready for the dx run --batch-tsv command.
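The same bracketing step can also be sketched in Python, which splits strictly on tabs and so tolerates file names containing spaces. The function name bracket_file_ids and the zero-based column defaults are illustrative assumptions; adjust the columns to match your file.

```python
import csv
import io


def bracket_file_ids(tsv_text, columns=(3, 4)):
    """Wrap the values in the given zero-based columns in square brackets.

    The header row is copied unchanged and empty rows are skipped.
    """
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for index, row in enumerate(rows):
        if index > 0:  # leave the header untouched
            for col in columns:
                row[col] = f"[{row[col]}]"
        writer.writerow(row)
    return out.getvalue()


sample = (
    "batch ID\treads_fastqgzs\treads2_fastqgzs\treads_fastqgzs ID\treads2_fastqgzs ID\n"
    "10B_S1\tRP10B_S1_R1_001.fastq.gz\tRP10B_S1_R2_001.fastq.gz\tfile-aaa\tfile-bbb\n"
)
print(bracket_file_ids(sample))
```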

    The example above is for a case where all files have been paired properly. dx generate_batch_inputs creates a TSV for all files that can be successfully matched for a particular batch ID. Two classes of errors may occur for batch IDs that are not successfully matched:

    • A particular input is missing. This could occur when reads_fastqgzs has a pattern but no corresponding match can be found for reads2_fastqgzs.

    • More than one file ID matches the exact same name.

    For both of these cases, dx generate_batch_inputs returns a description of these errors to STDERR.
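The two error classes can be illustrated with a small pairing routine. This is a simplified model under stated assumptions, not the logic inside dx generate_batch_inputs: the function name pair_inputs and the input layout are hypothetical, and a duplicate is modeled as more than one file for the same input and batch ID.

```python
from collections import defaultdict


def pair_inputs(matches):
    """Group matched files by batch ID and separate valid batches from errors.

    `matches` maps an input name to a list of (batch_id, file_id) tuples.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for input_name, found in matches.items():
        for batch_id, file_id in found:
            grouped[batch_id][input_name].append(file_id)

    valid, errors = {}, {}
    for batch_id, by_input in grouped.items():
        missing = [name for name in matches if name not in by_input]
        duplicated = [name for name, ids in by_input.items() if len(ids) > 1]
        if missing:
            errors[batch_id] = "missing input(s): " + ", ".join(missing)
        elif duplicated:
            errors[batch_id] = "multiple files for: " + ", ".join(duplicated)
        else:
            valid[batch_id] = {name: ids[0] for name, ids in by_input.items()}
    return valid, errors


valid, errors = pair_inputs({
    "reads_fastqgzs": [("10B_S1", "file-aaa"), ("10T_S5", "file-ccc")],
    "reads2_fastqgzs": [("10B_S1", "file-bbb")],
})
print(valid)
print(errors)
```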


    When matching more than 500 files, multiple batch files are generated in groups of 500 to limit the number of jobs in a single batch run.
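The splitting behavior can be pictured as simple chunking. In this sketch the limit of 500 and the name dx_batch.0000.tsv come from the text; the numbering of later files and the helper name batch_file_names are assumptions.

```python
def batch_file_names(batch_ids, chunk_size=500):
    """Split batch IDs into chunks and assign one TSV file name per chunk."""
    chunks = [batch_ids[i:i + chunk_size] for i in range(0, len(batch_ids), chunk_size)]
    return {f"dx_batch.{index:04d}.tsv": chunk for index, chunk in enumerate(chunks)}


files = batch_file_names([f"sample_{n}" for n in range(1200)])
for name, ids in files.items():
    print(name, len(ids))
```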

    Run a Batch Job

    With the batch file prepared, you can execute the BWA-MEM batch process:

    Here, genomeindex_targz is a parameter set at execution time that is common to all groups in the batch and --batch-tsv corresponds to the input file generated above.

    To monitor a batch job, use the Monitor tab, as you would for any other job you launch.

    Setting Output Folders for Batch Jobs

    To direct the output of each run into a separate folder, the --batch-folders flag can be used, for example:

    This command outputs the results for each sample in folders named after batch IDs, such as /10B_S1/, /10T_S5/, /15B_S4/, and /15T_S8/. If the folders do not exist, they are created.

    The output folders are created under a path defined with --destination, which by default is set to the current project and the "/" folder. For example, this command outputs the result files in /run_01/10B_S1/, /run_01/10T_S5/, and other sample-specific folders:

    Batching Multiple Inputs

    The dx generate_batch_inputs command works well for batch processing with file inputs, but it has limitations. If you need to vary other input types (like strings, numbers, or file arrays), or want to customize run properties like job names, a for loop provides more flexibility.

    Here's an example of using a loop to launch multiple jobs with different inputs:

    You can also use the dx run command to run a workflow and set the inputs of its stages by stage_id. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:

    The \ character is needed to escape the : in the workflow name.

    Inputs to the workflow can be specified using dx run <workflow> --input stage_id.name=value, where stage_id is a numeric ID starting at 0. More help can be found by running the commands dx run --help and dx run <workflow> --help.

    To batch multiple inputs then, do the following:
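If you prefer to drive the loop from Python, the commands can be assembled first and inspected before anything is launched. This sketch only builds and prints the command lines; the helper name build_run_commands and the sample file names are hypothetical, and the stage input 0.reads is an example that must match your workflow.

```python
import shlex

# Workflow name from the example, with the colon escaped for dx.
WORKFLOW = "Trio Exome Workflow - Jan 1st 2020 9\\:00am"


def build_run_commands(workflow, stage_input, values):
    """Return one `dx run` argument list per input value."""
    return [
        ["dx", "run", workflow, "--input", f"{stage_input}={value}"]
        for value in values
    ]


commands = build_run_commands(WORKFLOW, "0.reads", ["sample_a.fastq.gz", "sample_b.fastq.gz"])
for command in commands:
    # Printing keeps this a dry run; pass each list to subprocess.run to launch.
    print(shlex.join(command))
```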

    Additional Resources

    For additional information and examples of how to run batch jobs, see Batch Processing on the Cloud. This material is not part of the official DNAnexus documentation and is for reference only.

    task dupfile {
        File infile
    
        command { cat ${infile} ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    task headfile {
        File infile
    
        command { head -10 ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    workflow basic_reuse {
        File infile
        call dupfile { input: infile=infile }
        call headfile { input: infile=dupfile.outfile }
    }
    task dupfile {
        File infile
    
        command { cat ${infile} ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    task headfile2 {
        File infile
    
        command { head -15 ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    workflow basic_reuse_tweaked {
        File infile
        call dupfile { input: infile=infile }
        call headfile2 { input: infile=dupfile.outfile }
    }
    # Enable Smart Reuse
    dx api org-myorg update '{"policies":{"jobReuse":true}}'
    
    # Disable Smart Reuse
    dx api org-myorg update '{"policies":{"jobReuse":false}}'
    $ dx select "My Research Project"
    Selected project My Research Project
    $ dx ls /
    RP10B_S1_R1_001.fastq.gz
    RP10B_S1_R2_001.fastq.gz
    RP10T_S5_R1_001.fastq.gz
    RP10T_S5_R2_001.fastq.gz
    RP15B_S4_R1_002.fastq.gz
    RP15B_S4_R2_002.fastq.gz
    RP15T_S8_R1_002.fastq.gz
    RP15T_S8_R2_002.fastq.gz
    $ dx generate_batch_inputs \
        -i reads_fastqgzs='RP(.*)_R1_(.*).fastq.gz' \
        -i reads2_fastqgzs='RP(.*)_R2_(.*).fastq.gz'
    Found 4 valid batch IDs matching desired pattern.
    Created batch file dx_batch.0000.tsv
    
    CREATED 1 batch files each with at most 500 batch IDs.
    $ cat dx_batch.0000.tsv
    batch ID	reads_fastqgzs	reads2_fastqgzs	reads_fastqgzs ID	reads2_fastqgzs ID
    10B_S1	RP10B_S1_R1_001.fastq.gz	RP10B_S1_R2_001.fastq.gz	file-aaa	file-bbb
    10T_S5	RP10T_S5_R1_001.fastq.gz	RP10T_S5_R2_001.fastq.gz	file-ccc	file-ddd
    15B_S4	RP15B_S4_R1_002.fastq.gz	RP15B_S4_R2_002.fastq.gz	file-eee	file-fff
    15T_S8	RP15T_S8_R1_002.fastq.gz	RP15T_S8_R2_002.fastq.gz	file-ggg	file-hhh
    head -n 1 dx_batch.0000.tsv > temp.tsv && \
    tail -n +2 dx_batch.0000.tsv | \
    awk '{sub($4, "[&]"); print}' | \
    awk '{sub($5, "[&]"); print}' >> temp.tsv && \
    tr -d '\r' < temp.tsv > new.tsv && \
    rm temp.tsv
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="Reference Genome Files":\
    "/H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.bwa-index.tar.gz" \
      --batch-tsv dx_batch.0000.tsv
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
    file-BFBy4G805pXZKqV1ZVGQ0FG8" \
      --batch-tsv dx_batch.0000.tsv \
      --batch-folders
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
    file-BFBy4G805pXZKqV1ZVGQ0FG8" \
      --batch-tsv dx_batch.0000.tsv \
      --batch-folders \
      --destination=My_project:/run_01
    for i in 1 2; do
        dx run swiss-army-knife -icmd="wc *>${i}.out" -iin="file_input_batch${i}a" -iin="file_input_batch${i}b" --name "sak_batch${i}"
    done
    dx login
    dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am"
    dx cd /path/to/inputs
    for i in $(dx ls); do
        dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am" --input 0.reads="$i"
    done
    On the DNAnexus Platform, all data is stored within projects. Before you upload, browse, or analyze any data, you must create a project to house that data.

    To create a project:

    1. In the DNAnexus Platform, select Projects > All Projects.

    2. In the Projects page, click New Project.

    3. In the New Project dialog:

      1. In Project Name, enter your project's name.

      2. (Optional) In More Info, you can enter tags or custom-defined properties. These make it easier to find the project later, and organize it among other projects.

      3. (Optional) In More Info, you can enter a Project Summary and Project Description to help other users understand the project's purpose.

      4. In Billing > Billed To, choose a billing account to which project charges are billed.

      5. In Billing > Billed To, choose a region to use for storing project files and running analyses. Feel free to use the default region.

      6. (Optional) In Usage Limits, available in Billed To orgs with compute and egress usage limits configured, you can set project-level limits for each.

      7. In Access, you can specify for specific , defining who can copy, delete, and download data. Feel free to accept the defaults.

      8. Click Create Project.

    After the project is created, you can add data in the Manage page.

    Once you add data to your project, the Manage page is where you can view the data, get information about it, and launch analyses that use it.

    Step 2. Add Project Members

    Once you've created a project, you can add members by doing the following:

    1. From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.

    2. Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.

    3. In Access, choose the type of access the user or org has to the project. For more on this, see the detailed explanation of project access levels.

    4. If you don't want the user to receive an email notification on being added to the project, set Email Notification to "Off."

    5. Click the Add User button.

    6. Repeat Steps 2-5 for each user you want to add to the project.

    7. Click Done when you're finished adding members.

    Step 3. Add Data to Your Project

    To add data to your project, click the Add button in the top right corner of the project's Manage screen. You see three options for adding data:

    • Upload Data - Use your web browser to upload data from your computer. For long upload times, you must stay logged into the Platform and keep your browser window open until the upload completes.

    • Add Data from Server - Specify the URL of an accessible server from which the file is uploaded.

    • Copy Data from Project - Copy data from another project on the Platform.


    When uploading large files, consider using the Upload Agent, a command-line tool that's both faster and more reliable than uploading via the UI.

    Adding Data to Use in Your First Analysis

    To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:

    1. From the project's Manage screen, click the Add button, then select Copy Data from Project.

    2. In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.

    3. Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.

    4. Click the box next to the Name header, to select both files.

    5. Click Copy to copy the files to your project.

    Step 4. Install Apps

    Next, install the apps you need, to analyze the data you added to the project in Step 3:

    1. Select Tools Library from the Tools link in the main menu.

    2. A list of available tools opens.

    3. Find the BWA-MEM FASTQ Read Mapper in the list and click on its name.

    4. A tool detail page opens, showing a full range of information about the tool and how to use it.

    5. Click the Install button in the upper left part of the screen, under the name of the tool.

    6. In the Install App modal, click the Agree and Install button.

    7. After the tool has been installed, you are returned to the tool detail page.

    8. Use your browser's "Back" button to return to the tools list page.

    9. Repeat Steps 3-6 to install the FreeBayes Variant Caller.

    Step 5. Build a Workflow

    Build a workflow using the two apps you installed, and configure it to use the data you added to your project in Step 3.

    Adding Workflow Steps

    A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:

    1. Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.

    2. Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder opens.

    3. In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.

    4. Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.

    5. Repeat Step 4 for the FreeBayes Variant Caller.

    6. Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You return to the main Workflow Builder screen.

    Setting Inputs for Each Step


    In the Workflow Builder, required inputs have orange placeholder text, while optional inputs have black placeholder text.

    Set the required inputs for each step by doing the following:

    1. To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file. Then click the Select button.

    2. Since the SRR100022 exome was sequenced using paired-end sequencing, you need to provide the right-mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.

    3. Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there is a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.

    4. Next set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second step input labeled "Sorted mappings [array]."

    5. Click on the second step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.


    Each tool has different input and output requirements. To learn about a tool's required and optional inputs and outputs, file format restrictions, and other configuration details, refer to its detail page in the Tools Library.

    Step 6. Launch the Workflow

    You're ready to launch your workflow, by doing the following:

    1. Click the Start Analysis button at the upper right corner of the Workflow Builder.

    2. In the modal window that opens, click the Run as Analysis button.

    The BWA-MEM FASTQ Read Mapper starts executing immediately. Once it finishes, the FreeBayes Variant Caller starts, using the Read Mapper's output as an input.

    Step 7. Monitor Your Job

    Once you've launched your workflow, you are taken to your project's Monitor screen. Here, you see a list of both current and past analyses run within the project, along with key information about each run.

    As your workflow runs, its status shows as "In Progress."

    Terminating Your Job

    If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you see a red button labeled Terminate. Click the button to terminate the job. This process may take some time. While the job is being terminated, the job's status shows as "Terminating."

    Step 8. Access the Results

    When your workflow completes, output files are placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.

    Running the Workflow Using the Full SRR100022 Exome

    You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder, in the "Demo Data" project. Because this means working with a much larger file, running the workflow using the exome data takes longer.

    Learn More

    See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:

    • Projects

    • Apps and Workflows

    For a video intro to the Platform, watch the series of short, task-oriented tutorials.

    For a more in-depth video intro to the Platform, watch the DNAnexus Platform Essentials video.

    Follow these instructions to set up billing.
    Path Syntax

    The DNAnexus Platform recognizes three main types of paths for referring to data objects: project paths, job-based object references (JBORs), and DNAnexus links.

    Project Paths

    To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" is interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:

    The folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.
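As an illustration, a path such as Genomes:human/hg19.fq.gz (assembled here from the names in the example) can be split into its parts as follows. This is a simplified sketch: the function name split_project_path is hypothetical, and escaped colons or slashes are not handled.

```python
def split_project_path(path):
    """Split 'Project:folder/name' into (project, folder, name).

    Without a colon, project is None and the folder is returned as written.
    """
    project, sep, rest = path.partition(":")
    if not sep:
        project, rest = None, path
    folder, _, name = rest.rpartition("/")
    if project is not None:
        # After a project name, the folder is relative to the root folder "/".
        folder = "/" + folder.strip("/")
    return project, folder, name


print(split_project_path("Genomes:human/hg19.fq.gz"))
```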

    The exception is commands that accept arbitrary names, such as dx describe, which accepts app names, user IDs, and other identifiers. In this case, all possible interpretations are attempted. However, a name is never interpreted as a project name unless it ends in ":".

    Job-Based Object References (JBORs)

    To refer to the output of a particular job, you can use the syntax <job id>:<output name>.

    Examples

    If you have the job ID handy, you can use it directly.

    Or if you know it's the last analysis you ran:

    You can also automatically download a file once the job producing it is done:

    If the output is an array, you can extract a single element by specifying its array index (starting from 0) as follows:

    DNAnexus Links

    DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link, and have as a value either:

    • a string representing a data object ID

    • another hash with two keys:

      • project: a string representing a project or other data container ID

      • id: a string representing a data object ID

    For example:
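Both forms can be built with a few lines of Python. The helper name dnanexus_link is hypothetical, and file-xxxx and project-xxxx are placeholder IDs.

```python
import json


def dnanexus_link(object_id, project=None):
    """Build a DNAnexus link hash for a data object."""
    if project is None:
        # Short form: the value is the data object ID itself.
        return {"$dnanexus_link": object_id}
    # Long form: the value is a hash naming the container and the object.
    return {"$dnanexus_link": {"project": project, "id": object_id}}


print(json.dumps(dnanexus_link("file-xxxx")))
print(json.dumps(dnanexus_link("file-xxxx", project="project-xxxx")))
```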

    Special Characters

    When naming data objects, certain characters require special handling because they have specific meanings in the DNAnexus Platform:

    • The colon (:) identifies project names

    • The forward slash (/) separates folder names

    • Asterisks (*) and question marks (?) are used for wildcard matching

    To use these characters in object names, you must escape them with backslashes. Spaces may also need escaping depending on your shell environment and whether you use quotes.

    For the best experience, we recommend avoiding special characters in names when possible. If you need to work with objects that have special characters, using their object IDs directly is often simpler.

    The table below shows how to escape special characters when accessing objects with these characters in their names:

    | Character | Without Quotes | With Quotes |
    | --- | --- | --- |
    | (single space) | | ' ' |
    | : | \\\\: | '\\:' |

    The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.

    For commands where the argument supplied involves naming or renaming something, the only escaping necessary is whatever is necessary for your shell or for setting it apart from a project or folder path.

    Name Conflicts

    It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you are prompted to select the desired data object.

    Some commands (like mv here) allow you to enter * so that all matches are used. Other commands may automatically apply the command to all matches. This includes commands like ls and describe. Some commands require that exactly one object be chosen, such as the run command.


    The subjob or child job of stage 1's origin job shares the same temporary workspace as its parent job. API calls to run a new applet or app using /applet-xxxx/run or /app-xxxx/run launch a master job that has its own separate workspace, and (by default) no visibility into its parent job's workspace.

    Job States

    Successful Jobs

    Every successful job goes through at least the following four states:

    1. idle: the initial state of every new job, regardless of which API call was made to create it.

    2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.

    3. running: the job has been assigned to and is being run on a worker in the cloud.

    4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state, so a job never changes state after transitioning to done.

    Jobs may also pass through the following transitional states as part of more complicated execution patterns:

    • waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:

      • it has an unresolved job-based object reference in its input

      • it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state

      • it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job). Linked hidden objects are implicitly included in this list

    • waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:

      • it has a descendant job that has not been moved to the done state

      • it has an unresolved job-based object reference in its output

    Unsuccessful Jobs

    Two terminal job states exist other than the done state: terminated and failed. A job can enter either of these states from any other state except another terminal state.
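The states described so far can be summarized as a small transition table. This is a simplified model for illustration, not Platform code: it covers only the states named above and leaves out states such as terminating, debug_hold, and the restart-related states. The function names are hypothetical.

```python
TERMINAL = {"done", "failed", "terminated"}

# Transitions on the successful path, as described above.
SUCCESS_TRANSITIONS = {
    "idle": {"waiting_on_input", "runnable"},
    "waiting_on_input": {"runnable"},
    "runnable": {"running"},
    "running": {"waiting_on_output", "done"},
    "waiting_on_output": {"done"},
}


def can_transition(current, target):
    """Return True if this simplified model allows current -> target."""
    if current in TERMINAL:
        return False  # terminal states are never left
    if target in {"failed", "terminated"}:
        return True  # reachable from any non-terminal state
    return target in SUCCESS_TRANSITIONS.get(current, set())


def is_valid_history(states):
    """Check that a history starts at idle and that every step is allowed."""
    return bool(states) and states[0] == "idle" and all(
        can_transition(a, b) for a, b in zip(states, states[1:])
    )


print(is_valid_history(["idle", "runnable", "running", "done"]))
```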

    Terminated Jobs

    The terminated state occurs when a user requests termination of the job (or another job sharing the same origin job). For all terminated jobs, the failureReason in their describe hash contains "Terminated", and the failureMessage indicates the user responsible for termination. Only the user who launched the job or administrators of the job's project context can terminate the job.

    Failed Jobs

    Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job not in the same job tree has a job-based object reference or otherwise depends on a failed job, then it also fails. For more information about errors that jobs can encounter, see the Error Information page.

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with a JobTimeoutExceeded error.

    Restartable Jobs

    Jobs can automatically restart when they encounter specific types of failures. You configure which failure types trigger restarts in the executionPolicy of an app, applet, or workflow. Common restartable failure types include:

    • UnresponsiveWorker

    • ExecutionError

    • AppInternalError

    • AppInsufficientResourceError

    • JobTimeoutExceeded

    • SpotInstanceInterruption

    How job restarts work

    When a job fails for a restartable reason, the system determines where to restart based on the restartableEntryPoints configuration:

    • master setting (default): The failure propagates to the nearest master job, which then restarts

    • all setting: The job restarts itself directly

    The system restarts a job up to the maximum number of times specified in the executionPolicy. Once this limit is reached, the entire job tree fails.
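The restart decision can be sketched as a single function. The policy layout used here, a restartOn mapping from failure type to a maximum restart count, is an assumption about how executionPolicy is written, so check it against the API reference; the function name restart_target and the returned labels are hypothetical.

```python
def restart_target(policy, failure_reason, restarts_so_far, entry_points="master"):
    """Decide which job restarts after a failure, or None if the job tree fails.

    policy: {"restartOn": {failure_type: max_restarts}} (assumed layout)
    entry_points: "master" (default) or "all", as in restartableEntryPoints
    """
    limit = policy.get("restartOn", {}).get(failure_reason, 0)
    if restarts_so_far >= limit:
        return None  # not a restartable reason, or the limit has been reached
    if entry_points == "master":
        return "nearest master job"
    return "failed job itself"


policy = {"restartOn": {"UnresponsiveWorker": 2}}
print(restart_target(policy, "UnresponsiveWorker", restarts_so_far=0))
print(restart_target(policy, "UnresponsiveWorker", restarts_so_far=2))
```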

    During the restart process, jobs transition through specific states:

    • restartable: The job is ready to be restarted

    • restarted: The job attempt was restarted (a new attempt begins)

    Job try tracking

    For jobs in root executions launched after July 12, 2023 00:13 UTC, the platform tracks restart attempts using a try integer attribute:

    • First attempt: try = 0

    • Second attempt (first restart): try = 1

    • Third attempt (second restart): try = 2

    Multiple API methods support job try operations and include try information in their responses:

    • /job-xxxx/describe

    • /job-xxxx/addTags

    • /job-xxxx/removeTags

    • /job-xxxx/setProperties

    • /system/findExecutions

    • /system/findJobs

    • /system/findAnalyses

    When you provide a job ID without specifying a try argument, these methods automatically refer to the most recent attempt for that job.
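The numbering and the default can be modeled in a few lines. This is an illustrative sketch only: the function name resolve_try and the list-of-attempts layout are hypothetical, not part of the API.

```python
def resolve_try(attempts, requested_try=None):
    """Return the attempt for `requested_try`, or the most recent one if omitted.

    `attempts` is ordered by attempt, so index 0 corresponds to try = 0.
    """
    if not attempts:
        raise ValueError("job has no attempts")
    if requested_try is None:
        return attempts[-1]
    if not 0 <= requested_try < len(attempts):
        raise IndexError(f"try {requested_try} does not exist")
    return attempts[requested_try]


attempts = [
    {"try": 0, "state": "restarted"},
    {"try": 1, "state": "restarted"},
    {"try": 2, "state": "done"},
]
print(resolve_try(attempts))
```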

    Additional States

    For unsuccessful jobs, additional states exist between the running state and the terminal state of terminated or failed. Unsuccessful jobs starting in other non-terminal states transition directly to the appropriate terminal state.

    • terminating: the transitional state when the cloud worker begins terminating the job and tearing down the execution environment. The job moves to its terminal state after the worker reports successful termination or becomes unresponsive.

    • debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.

    Analysis States

    All analyses start in the state in_progress, and, like jobs, reach one of the terminal states done, failed, or terminated. The following diagram shows the state transitions for successful analyses.

    If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:

    • partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages are also allowed to finish successfully if they can.

    • terminating: an analysis may enter this state either via an API call where a user has terminated the analysis, or there is some failure condition under which the analysis is terminating any remaining stages. This may happen if the executionPolicy for the analysis (or a stage of an analysis) had the onNonRestartableFailure value set to "failAllStages".

    Billing

    Compute and data storage costs for jobs that fail due to user error are charged to the project running those jobs. This includes errors such as InputError and OutputError. The same applies to terminated jobs. For DNAnexus Platform internal errors, these costs are not billed.

    The cost of each stage in an analysis is determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed, and the second is not.
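The billing rule can be expressed per stage. In this sketch the caller states whether a failure was a Platform internal error, because the full list of internal error types is not given here; the function names and the stage layout are hypothetical.

```python
def is_billed(state, internal_error=False):
    """Return True if a stage in the given terminal state is billed."""
    if state in {"done", "terminated"}:
        return True
    if state == "failed":
        # User errors such as InputError are billed; internal errors are not.
        return not internal_error
    raise ValueError(f"not a terminal state: {state}")


def billed_stages(stages):
    """Evaluate each stage independently and return the IDs that are billed."""
    return [
        stage["id"]
        for stage in stages
        if is_billed(stage["state"], stage.get("internal_error", False))
    ]


stages = [
    {"id": "stage-0", "state": "done"},
    {"id": "stage-1", "state": "failed", "internal_error": True},
]
print(billed_stages(stages))
```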

    A license is required to access the Relational Database Service. Contact DNAnexus Sales for more information.

    Overview of the Relational Database Service

    DNAnexus Relational DB Cluster States

    When describing a DNAnexus DBCluster, the status field can be any of the following:

    | DBCluster status | Details |
    | --- | --- |
    | creating | The database cluster is being created, but not yet available for reading/writing. |
    | available | The database cluster is created and all replicas are available for reading/writing. |
    | stopping | The database cluster is stopping. |
    | stopped | |

    Connecting to a DB Cluster

    DB Clusters are not accessible from outside of the DNAnexus Platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page on cloud workstations for one possible way to access a DB Cluster from within a job. Executions such as apps and applets can also access a DB Cluster.

    The parameters needed for connecting to the database are:

    • host: Use the endpoint returned from dbcluster-xxxx/describe

    • port: 3306 for MySQL engines or 5432 for PostgreSQL engines

    • user: root

    • password: Use the adminPassword specified when creating the database

    • For MySQL: ssl-mode 'required'

    • For PostgreSQL: sslmode 'require' Note: For connecting and verifying certs, see
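The parameters above can be gathered into one dictionary before being handed to a database client. This is a sketch under stated assumptions: the describe output is assumed to carry an engine field holding aurora-mysql or aurora-postgresql alongside endpoint, and the function name connection_params is hypothetical.

```python
PORTS = {"aurora-mysql": 3306, "aurora-postgresql": 5432}


def connection_params(describe, admin_password):
    """Collect connection settings from a dbcluster-xxxx/describe result."""
    engine = describe["engine"]  # assumed field name
    if engine not in PORTS:
        raise ValueError(f"unsupported engine: {engine}")
    params = {
        "host": describe["endpoint"],
        "port": PORTS[engine],
        "user": "root",
        "password": admin_password,
    }
    # The SSL option name and value differ between the two engines.
    if engine == "aurora-mysql":
        params["ssl-mode"] = "required"
    else:
        params["sslmode"] = "require"
    return params


example = {"engine": "aurora-postgresql", "endpoint": "dbcluster-xxxx.example.invalid"}
print(connection_params(example, admin_password="placeholder"))
```

The key names follow the list above; a real client library may expect different argument names, so map them as needed.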

    DBCluster Instance Types

    The table below provides the valid configurations of dxInstanceClass, database engine, and engine version.

    | DxInstanceClass | Engine + Version Supported | Memory (GB) | # Cores |
    | --- | --- | --- | --- |
    | db_std1_x2 (*) | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 4 | 2 |
    | db_mem1_x2 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | | |

    * - db_std1 instances may incur CPU Burst charges similar to the AWS T3 DB instances described in AWS Instance Types. Regular hourly charges for this instance type are based on 1 core, while CPU Burst charges are based on 2 cores.

    Restriction on Transfers of Projects Containing DBClusters

    If a project contains a DBCluster, its ownership cannot be changed. A PermissionDenied error occurs when attempting to change the billTo of such a project.

    DBClusters API page
    Overview

    File Archival States

    To understand the archival life cycle as well as which operations can be performed on files and how billing works, it's helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:

    Archival states | Details
    --- | ---
    live | The file is in standard storage, such as AWS S3 or Azure Blob.
    archival | Archival requested on the current file, but other copies of the same file are in the live state in multiple projects with the same billTo entity. The file is still in standard storage.
    archived | The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE.
    unarchiving |

    The operations you can perform on a file depend on its archival state. The table below shows which operations are allowed in each state.

    Archival states | Download | Clone | Compute | Archive | Unarchive
    --- | --- | --- | --- | --- | ---
    live | Yes | Yes | Yes | Yes |

    * The clone operation fails if the object is actively transitioning from archival to archived.

    File Archival Life Cycle

    When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. The Platform automatically transitions the file to the archived state only when all copies of the file, in all projects with the same billTo organization, are in the archival state.

    Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from the archived state to the unarchiving state. While the file is unarchiving, it is being restored by the third-party storage platform, such as AWS or Azure. How long unarchiving takes depends on the retrieval option selected for the specific platform. When unarchiving completes and the file becomes available in standard storage, the file transitions to the live state.
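
    The archive half of this life cycle can be modeled locally to make the "all copies" rule concrete. The sketch below is a toy model under stated assumptions, not Platform code: the function archive_copy and the dictionary layout (project ID mapped to state, for copies of one file under a single billTo org) are hypothetical, and nothing here calls the DNAnexus API.

```python
# Toy model of the archive transition; it does not contact the Platform.
def archive_copy(copies, project):
    """Request archival of the copy in `project` and return the new states.

    `copies` maps project ID -> archival state for every copy of one file
    that is billed to the same billTo organization.
    """
    updated = dict(copies)
    if updated[project] == "live":
        updated[project] = "archival"
    # The file reaches "archived" only once every copy is in "archival".
    if all(state == "archival" for state in updated.values()):
        updated = {proj: "archived" for proj in updated}
    return updated


copies = {"project-xxxx": "live", "project-yyyy": "live"}
after_first = archive_copy(copies, "project-xxxx")
after_second = archive_copy(after_first, "project-yyyy")
print(after_first)   # one copy is still live, so nothing is archived yet
print(after_second)  # every copy was requested, so all become archived
```

    In this model, archiving only one copy leaves the file in standard storage, which is why it continues to incur standard storage charges.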

    Archive Service Operations

    The File-based Archive Service allows users who have the CONTRIBUTE or ADMINISTER permissions to a project to archive or unarchive files that reside in the project.

    Using the API, users can archive or unarchive files, folders, or entire projects, although the archiving process itself happens at the file level. The API can accept a list of up to 1000 files for archiving and unarchiving.

    When archiving or unarchiving folders or projects, the API by default archives or unarchives all the files at the root level and those in the subfolders recursively. If you archive a folder or a project that includes files in different states, the Service only archives files that are in the live state and skips files that are in other states. Likewise, if you unarchive a folder or a project that includes files in different states, the Service only unarchives files that are in the archived state, transitions archival files back to the live state, and skips files in other states.
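
    The selection rules for folder-level and project-level requests can be summarized in a few lines. This is a minimal local sketch, assuming a hypothetical mapping of file IDs to states. The function names plan_archive and plan_unarchive are illustrative and do not exist in the API.

```python
# Illustrative only: which files a bulk request acts on, per the rules above.
def plan_archive(file_states):
    """Return the files an archive request acts on (only live files)."""
    return [file_id for file_id, state in file_states.items() if state == "live"]


def plan_unarchive(file_states):
    """Map each affected file to its next state; other files are skipped."""
    transitions = {"archived": "unarchiving", "archival": "live"}
    return {
        file_id: transitions[state]
        for file_id, state in file_states.items()
        if state in transitions
    }


# Placeholder file IDs in mixed states.
folder = {
    "file-aaaa": "live",
    "file-bbbb": "archival",
    "file-cccc": "archived",
    "file-dddd": "unarchiving",
}
print(plan_archive(folder))
print(plan_unarchive(folder))
```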

    Archival Billing

    The archival process incurs specific charges, all billed to the billTo organization of the project:

    • Standard storage charge: The monthly storage charge for files that are located in the standard storage on the platform. The files in the live and archival state incur this charge. The archival state indicates that the file is waiting to be archived or that other copies of the same file in other projects are still in the live state, so the file is in standard storage, such as AWS S3. The standard storage charge continues to get billed until all copies of the file are requested to be archived and eventually the file is moved to archival storage and transitioned into the archived state.

    • Archival storage charge: The monthly storage charge for files that are located in archival storage on the platform. Files in the archived state incur a monthly archival storage charge.

    • Retrieval fee: The retrieval fee is a one-time charge at the time of unarchiving based on the volume of data being unarchived.

    • Early retrieval fee: If you retrieve or delete data from archival storage before the required retention period is met, an early retrieval fee applies. The retention period is 90 days for AWS regions and 180 days for Microsoft Azure regions. You are charged a pro-rated fee equivalent to the archival storage charges for any remaining days within that period.
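
    The pro-rating in the last item can be sketched as simple arithmetic. The retention periods below come from the text. Everything else is an assumption for illustration: the function names, the per-GB daily rate (a made-up number, not a DNAnexus price), and the idea that the charge scales linearly with file size.

```python
# Retention periods stated above, in days.
RETENTION_DAYS = {"aws": 90, "azure": 180}


def remaining_retention_days(provider, days_in_archive):
    """Days left in the required retention period (never negative)."""
    return max(0, RETENTION_DAYS[provider] - days_in_archive)


def early_retrieval_fee(provider, days_in_archive, size_gb, daily_rate_per_gb):
    """Pro-rated fee: archival storage charge for the remaining days.

    `daily_rate_per_gb` is a hypothetical rate supplied by the caller.
    """
    remaining = remaining_retention_days(provider, days_in_archive)
    return remaining * size_gb * daily_rate_per_gb


# A 100 GB file retrieved after 30 days in an AWS region, at a made-up rate.
print(remaining_retention_days("aws", 30))
print(early_retrieval_fee("aws", 30, 100, 0.001))
```

    Once the retention period has passed, the remaining days drop to zero and so does the fee in this model.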

    Best Practices

    When using the Archive Service, we recommend the following best practices.

    • The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.

    • If a file is shared in multiple projects, archiving one copy in one of the projects only transitions the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you need to ensure that all copies of the file in all projects with the same billTo org are being archived. When all copies of the file reach the archival state, the Service moves the files from archival to archived state. Consider using the allCopies option of the API to archive all copies of the file. You must be the org ADMIN of the billTo org of the current project to use the allCopies option.

      Refer to the following example: The file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, which share the same billTo org (org-xxxx). You have ADMINISTER permission on project-xxxx and CONTRIBUTE permission on project-yyyy, but no role in project-zzzz. You are the org ADMIN of the project

      1. List all the copies of the file in org-xxxx.

      2. Force archiving all the copies of file-xxxx.

      3. All copies of file-xxxx transition into the


    Distributed by Region (py)

    This applet creates a count of reads from a BAM format file.

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

    For additional information, refer to the .

    Entry Points

    Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:

    • main

    • samtoolscount_bam

    • combine_files

    Entry points are executed on a new worker with their own system requirements. In this example, the applet splits and merges files on basic mem1_ssd1_x2 instances and performs the more intensive processing step on a mem1_ssd1_x4 instance. Instance types can be set in the dxapp.json runSpec.systemRequirements.
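
    A runSpec.systemRequirements section matching this description might look like the structure built below. This is a reconstruction from the prose, not the applet's actual dxapp.json. The assignment of instance types to entry points follows the text (lightweight for main and combine_files, larger for samtoolscount_bam), and Python is used here only to build and print the JSON.

```python
import json

# Reconstructed from the description above; the real dxapp.json may differ.
run_spec = {
    "systemRequirements": {
        "main": {"instanceType": "mem1_ssd1_x2"},
        "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
        "combine_files": {"instanceType": "mem1_ssd1_x2"},
    }
}

# Print the fragment as it would appear inside dxapp.json.
print(json.dumps({"runSpec": run_spec}, indent=2))
```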

    main

    The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.

    Region bins are passed to the samtoolscount_bam entry point using the function.

    Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.

    samtoolscount_bam

    This entry point downloads its inputs and builds a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.

    This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads and downloads the information as needed, which exaggerates a scatter-gather for tutorial purposes. You can also pass types other than file, such as int.

    combine_files

    The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.

    Important: While the main entry point triggers the processing and gathering entry points, it doesn't do any heavy lifting or processing itself. The runSpec.systemRequirements settings start the process on a lightweight instance, scale up for the processing entry point, and then scale down for the gathering step.
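
    The scatter-gather shape of the three entry points can be imitated locally with plain functions. The sketch below is an analogy under stated assumptions, not the applet's source: it runs in one process, whereas the applet runs each step as a separate job, and the per-region read counts are made-up numbers standing in for samtools view -c results.

```python
# Local stand-in for the scatter-gather pattern; no jobs are launched.
def scatter_regions(regions, bin_size):
    """Split the region list into bins (the role of the main entry point)."""
    return [regions[i:i + bin_size] for i in range(0, len(regions), bin_size)]


def count_bin(region_bin, reads_per_region):
    """Count reads for one bin (the role of samtoolscount_bam)."""
    return sum(reads_per_region[region] for region in region_bin)


def combine(counts):
    """Sum the per-bin counts (the role of combine_files)."""
    return sum(counts)


# Made-up counts per region, used only to exercise the functions.
reads_per_region = {"chr1": 10, "chr2": 5, "chr3": 7}
bins = scatter_regions(list(reads_per_region), bin_size=2)
per_bin = [count_bin(region_bin, reads_per_region) for region_bin in bins]
total = combine(per_bin)
print(bins, per_bin, total)
```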

    Project Navigation

    You can treat dx as an invocation command for navigating the data objects on the DNAnexus Platform. By adding dx in front of commonly used bash commands, you can manage objects in the platform directly from the command line. Common commands include dx ls, dx cd, dx mv, and dx cp, which let you list objects, change folders, move data objects, and copy objects.

    Listing Objects

    Listing Objects in Your Current Project

    When you set your current project, you are placed by default in the root folder / of the project. You can list the objects and folders in your current folder with dx ls.

    Listing Object Details

    To see more details, run the command with the -l option: dx ls -l.

    As in bash, you can list the contents of a path.

    Listing Objects in a Different Project

    You can also list the contents of a different project. To specify a path that points to a different project, start with the project ID, followed by a colon (:), then the path within the project, where / is the root folder of the project.

    Enclose the path in quotes (" ") so dx interprets the spaces as part of the folder name, not as a new command.

    Listing Objects That Match a Pattern

    You can also list only the objects that match a pattern. In this example, an asterisk * acts as a wildcard to represent all objects with names containing .fasta. This returns only a subset of the objects from the original query.

    Enclose the path in quotes for two reasons:

    • The shell passes the wildcard pattern to dx without expanding it against local files.

    • The dx command correctly interprets any spaces in the path.

    For more information about using wildcards with dx commands, see .
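
    When dx commands are assembled by a script rather than typed by hand, the same quoting concerns apply. The sketch below uses Python's standard shlex module to show how a project-qualified path with spaces and a wildcard ends up quoted. The helper name dx_ls_command is hypothetical, the project ID is a placeholder, and the folder and pattern are illustrative.

```python
import shlex


def dx_ls_command(project, path):
    """Build a shell-safe `dx ls` command line for a project-qualified path."""
    # shlex.join quotes any argument containing spaces or wildcard
    # characters, so the shell passes it to dx unchanged.
    return shlex.join(["dx", "ls", f"{project}:{path}"])


command = dx_ls_command("project-xxxx", "/C. Elegans - Ce10/*.fasta*")
print(command)
```

    The printed command wraps the whole path in single quotes, which covers both reasons given above: the wildcard is not expanded locally, and the spaces stay part of the path.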

    Switching Contexts

    Changing Folders

    To find out your present folder location, use the dx pwd command. You can switch contexts to a subfolder in a project using dx cd.

    Moving or Renaming Data Objects

    You can move and rename data objects and folders using the command dx mv.

    To rename an object or a folder, "move" it to a new name in the same folder. Here, a file named ce10.fasta.gz is renamed to C.elegans10.fastq.gz.

    If you want to move the renamed file into a folder, specify the path to the folder as the destination of the move command (dx mv).

    Copying Objects or Folders to Another Project

    You can copy data objects or folders to another project by running the command dx cp. The following example shows how to copy a human reference genome FASTA file (hs37d5.fa.gz) from a public project, "Reference Genome Files", to a project "Scratch Project" that the user has ADMINISTER permission to.

    You can also copy folders between projects by running dx cp folder_name destination_path. Folders are automatically copied recursively.

    The Platform prevents copying a data object within the same project, since each specific data object exists only once in a project. The system also prohibits using dx cp to copy any data object between projects that are located in different cloud regions.

    Changing Your Current Project

    Changing to Another Project With a Project Prompt List

    To change to another project, run the command dx select. It brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".

    Changing to a Public Project

    To view and select from all public projects (projects available to all DNAnexus users), run the command dx select --public:

    Changing to a Project With VIEW Permission

    By default, dx select lists the projects to which you have at least CONTRIBUTE permission. To switch to a project in which you have only VIEW permission, run dx select --level VIEW to list all the projects in which you have at least VIEW permission.

    Changing Directly to a Specific Project

    If you know the project ID or name, you can also give it directly to switch to the project as dx select [project-ID | project-name]:

    JupyterLab Reference

    This page is a reference for the most useful operations and features in the JupyterLab environment.

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Download Files from the Project to the Local Execution Environment

    Bash

    You can download input data from a project using dx download in a notebook cell:

    The %%bash keyword converts the whole cell to a magic cell which allows you to run bash code in that cell without exiting the Python kernel. See examples of magic commands in the . The ! prefix achieves the same result:

    Alternatively, the dx command can be executed from the .

    Python

    To download data with Python in the notebook, you can use the function:

    Check the for details on how to download files and folders.

    Upload Data from the Session to the Project

    Bash

    Any files from the execution environment can be uploaded to the project using dx upload:

    Python

    To upload data using Python in the notebook, you can use the function:

    Check the for details on how to upload files and folders.

    Download and Upload Data to Your Local Machine

    By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer).

    You may upload and download data to the in a similar way, that is, by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download.

    Use the Terminal

    It is useful to have the terminal provided by JupyterLab at hand. It uses the bash shell by default and lets you execute shell scripts or interact with the platform via the dx toolkit. For example, the following command confirms what the current project context is:

    Running pwd shows you that the working directory of the execution environment is /opt/notebooks. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.

    To open a terminal window, go to File > New > Terminal or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File > New Launcher.

    Install Custom Packages in the Session Environment

    You can install pip, conda, apt-get, and other packages in the execution environment from the notebook:

    By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.

    Access Public and Private GitHub Repositories from the JupyterLab Terminal

    You can access public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private SSH key that's registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories using git clone and push any changes back to GitHub using git push from the JupyterLab terminal.

    Below is a screenshot of a JupyterLab session with a terminal displaying a script that:

    • sets up ssh key to access a private GitHub repository and clones it,

    • clones a public repository,

    • downloads a JSON file from the DNAnexus project,

    This animation shows the first part of the script in action:

    Run Notebooks Non-Interactively

    A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files using the in input file array. The command runs in the directory where the JupyterLab server is started and notebooks are run, that is, /opt/notebooks/. Any output files generated in this directory are uploaded to the project and returned in the out output.

    The cmd input makes it possible to use a papermill tool pre-installed in the JupyterLab environment that executes notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

    where notebook.ipynb is the input notebook to papermill, which needs to be passed in the in input, and output_notebook.ipynb is the name of the output notebook, which stores the result of the cells' execution. The output is uploaded to the project at the end of the app execution.

    If the snapshot parameter is specified, execution of cmd takes place in the specified Docker container. The duration argument is ignored when running the app with cmd. To limit the runtime, run the app from the command line with the --extra-args flag, for example, dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.
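
    Nested JSON on a command line is easy to mistype, so one option is to generate the --extra-args value programmatically. The sketch below builds the same timeout policy shown above with Python's json and shlex modules. The helper name timeout_extra_args is hypothetical, and app-xxxx remains a placeholder for the real app ID.

```python
import json
import shlex


def timeout_extra_args(executable_id, hours):
    """Return the JSON string for a timeoutPolicyByExecutable policy."""
    policy = {"timeoutPolicyByExecutable": {executable_id: {"*": {"hours": hours}}}}
    return json.dumps(policy)


# Assemble the full command line; shlex.join adds the shell quoting.
extra_args = timeout_extra_args("app-xxxx", 1)
command = shlex.join(["dx", "run", "dxjupyterlab", "--extra-args", extra_args])
print(command)
```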

    If cmd is not specified, the in parameter is ignored and the output of an app consists of an empty array.

    Use newer NVIDIA GPU-accelerated software

    If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA GPU Driver kernel-mode driver NVIDIA.ko that is installed outside of the JupyterLab environment does not support the newer CUDA version required by your application. You can install packages to use the newer CUDA version required by your application by following the steps below in a JupyterLab terminal.

    Session Inactivity

    After 15 to 30 minutes of inactivity in the JupyterLab browser tabs, the system logs you out automatically from the JupyterLab session and displays a "Server Connection Error" message. To re-enter the JupyterLab session, reload the JupyterLab webpage and log into the platform to be redirected to the JupyterLab session.

    Searching Data Objects

    You can use the dx ls command to list the objects in your current project. You can determine the current project and folder you are in by using the command dx pwd. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.
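
    The wildcard semantics described above can be tried out locally with Python's standard fnmatch module, which gives * and ? the same meaning. This is only a local illustration: the Platform performs its own matching, fnmatch also supports bracket expressions that are not discussed here, and the filenames below are examples.

```python
from fnmatch import fnmatchcase

# Example filenames; only the matching behavior matters here.
names = ["ce10.fasta.gz", "ce10.fasta.fai", "ce10.2bit", "ce100.fa"]

# "*" matches zero or more characters.
fasta_like = [name for name in names if fnmatchcase(name, "*.fasta*")]

# "?" matches exactly one character, so "ce1?.*" excludes "ce100.fa".
one_char = [name for name in names if fnmatchcase(name, "ce1?.*")]

print(fasta_like)
print(one_char)
```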

    Searching Objects with Glob Patterns

    Searching Objects in Your Current Folder

    By listing objects in your current directory with the wildcard characters * and ?, you can search for objects with a filename using a glob pattern. The examples below use the folder "C. Elegans - Ce10/" in the public project (platform login required to access this link).

    Printing the Current Working Directory

    Listing Folders and/or Objects in a Folder

    Listing Objects Named Using a Pattern

    Searching Across Objects in the Current Project

    To search the entire project with a filename pattern, use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data under the current project. Below, the command dx find data is used in the public project (platform login required to access this link) using the --name option to specify the filename of objects that you're searching for.

    Quoting Wildcards in Shell Commands

    When using wildcard characters (* and ?) with dx commands, enclose the pattern in single ' or double " quotes. Without quotes, the shell expands the wildcards against files in your local filesystem before passing the pattern to the dx command, which produces unexpected results.

    Quoting the pattern ensures the shell treats it as a literal string and passes it directly to the dx command, where DNAnexus interprets the wildcards to search Platform objects.


    Bash also expands other special characters like ?, [, ], {, and }. For complete details about shell expansion and quoting, see the .

    Escaping Special Characters

    Escape special characters in filenames with a backslash (\) when you want to search for them literally. Characters that require escaping include wildcards (* and ?) when you want to find them as literal characters in filenames. You must also escape colons (:) and slashes (/), because these have special meaning in DNAnexus paths.

    Shell behavior affects escaping rules. In many shells, you need to either double-escape (\\) or use single quotes to prevent the shell from interpreting the backslash.

    The following examples show proper escaping techniques:

    Searching Objects with Other Criteria

    dx find data also allows you to search data using metadata fields, such as when the data was created, the data tags, or the project the data exists in.

    Searching Objects Created Within a Certain Period of Time

    You can use the flags --created-after and --created-before to search for data objects created within a specific time period.

    Searching Objects by Their Metadata

    You can search for objects based on their metadata. To set an object's metadata, use dx tag to add tags or dx set_properties to set key-value pairs that describe the data object. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags.

    To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.

    Searching Objects in Another Project

    To search for an object in a project other than your current working project, specify a project and folder path with the flag --path. Below, the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project (platform login required to access this link) is specified as an example.

    Searching Objects Across Projects with VIEW and Above Permissions

    To search for data objects in all projects where you have VIEW and above permissions, use the --all-projects flag. Public projects are not shown in this search.

    Scoping Within Projects

    To describe a small number of files (typically fewer than 100), scope findDataObjects to the project level.

    The following is an example of code used to scope a search to a project:

    See the for more information about usage.
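
    As a hedged sketch of project-level scoping, the snippet below builds a findDataObjects input restricted to a single project. The helper name project_scoped_query is hypothetical and the project ID is a placeholder. The scope and describe field names are assumptions to verify against the API reference. With the dxpy library installed and a logged-in session, a payload like this could presumably be sent through the library's findDataObjects binding, but that step is left out so the example runs on its own.

```python
import json


def project_scoped_query(project_id, folder="/", recurse=True):
    """Build a findDataObjects input that searches a single project."""
    return {
        # Restrict the search to one project (and optionally one folder).
        "scope": {"project": project_id, "folder": folder, "recurse": recurse},
        # Ask for describe output alongside each result.
        "describe": True,
    }


payload = project_scoped_query("project-xxxx")
print(json.dumps(payload, indent=2))
```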

    Stata in JupyterLab

    Using Stata via JupyterLab, working with project files, and creating datasets with Spark.

    Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel, in Jupyter notebooks.

    Before You Begin

    Project License Requirement

    On the DNAnexus Platform, use the to create and edit Jupyter notebooks.


    You can only run this app within a project that's billed to an account with a license that allows the use of both and . if you need to upgrade your license.

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment. A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Stata License Requirement

    To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details according to the instructions below in a plain text file with the extension .json, then upload this file to the project's root directory. You only need to do this once per project.

    Creating a Stata License Details File

    Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username, and <organization> is the org of which you're a member:

    Save the file according to the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json


    Some operating systems may not support the naming of files with a "." as the first character. If this is the case, you can rename the .json file after uploading it to your project by hovering over the name of your file and clicking the pencil icon that appears.

    Uploading the Stata License Details File

    Open the project in which you want to use Stata. Upload the Stata license details file to the project's root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.

    Secure Indirect Format Option for Shared Projects

    When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.

    Create a private project. Then create and save a Stata license details file in that project's root directory, per the instructions above.

    Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the name of the private project, and file-xxxx is the license details file ID, in that private project:


    When working on the Research Analysis Platform, you can only create a private credentials project from the .

    Launching JupyterLab

    1. Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.

    2. Select the app JupyterLab with Python, R, Stata, ML, Image Processing.

    3. Click the Run Selected button. If you haven't run this app before, you are prompted to install it. Next, you are taken to the Run Analysis screen.


    The app can take some time to load and start running.

    Once the analysis starts, you see the notification "Running" appear under the name of the app.

    Opening JupyterLab

    Click the Monitor tab heading. This opens a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.


    If you do not see the worker URL, click on the name of the job in the Monitor page.

    Using Stata Within JupyterLab

    Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.

    Open a new Stata notebook by clicking the Stata tile in the Notebooks section.

    Working with Project Files

    You can download DNAnexus data files to the JupyterLab container from a Stata notebook with:

    Data files in the current project can also be accessed from a Stata notebook through the /mnt/project folder, as follows.

    To load a DTA file:

    To load a CSV file:

    To write a DTA file to the JupyterLab container:

    To write a CSV file to the JupyterLab container:

    To upload a data file from the JupyterLab container to the project, use the following command in a Stata notebook:

    Alternatively, open a new Launcher tab, open Terminal, and run:

    The /mnt/project directory is read-only, so trying to write to it results in an error.

    Creating a Stata Dataset with Spark

    can be used to query and filter DNAnexus returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:

    A pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:

    To upload a data file from the JupyterLab container to the DNAnexus project in the JupyterLab Spark Cluster app, use

    Once saved to the project, data files can be used in a JupyterLab Stata session using the instructions above.

    Genomes:human/hg19.fa.gz
    dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads
    dx describe $(dx find jobs -n 1 --brief):reads
    dx download $(dx run some_exporter_app -iinput=my_input -y --brief --wait):file_output
    dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads.0
    $ dx ls '{"$dnanexus_link": "file-B2VBGXyK8yjzxF5Y8j40001Y"}'
    file-name
    $ dx ls Project\ Mouse:
    name: with/special*characters?
    $ dx cd Project\ Mouse:
    $ dx describe name\\\\:\ with\\\\/special\\\\\\\\*characters\\\\\\\\?
    ID              file-9zz0xKJkf6V4yzQjgx2Q006Y
    Class           file
    Project         project-9zb014Jkf6V33pgy75j0000G
    Folder          /
    Name            name: with/special*characters?
    State           closed
    Hidden          visible
    Types           -
    Properties      -
    Tags            -
    Outgoing links  -
    Created         Wed Jul 11 16:39:37 2012
    Created by      alice
    Last modified   Sat Jul 21 14:19:55 2012
    Media type      text/plain
    Size (bytes)    4
    $ dx describe "name\: with\/special\\\\\\*characters\\\\\\?"
    ...
    $ dx new record -o "must\: escape\/everything\*once\? at creation"
    ID              record-B13BBVK4Zg29fvVv08q00005
    ...
    Name            must: escape/everything*once? at creation
    ...
    $ dx rename record-B13BBVK4Zg29fvVv08q00005 "no:escaping/necessary*even?wildcards"
    $ dx ls
    sample : file-9zbpq72y8x6F0xPzKZB00003
    sample : file-9zbjZf2y8x61GP1199j00085
    $ dx mv sample mouse_sample
    The given path "sample" resolves to the following data objects:
    0) closed  2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
    1) closed  2012-06-27 15:34:00 sample (file-9zbjZf2y8x61GP1199j00085)
    
    Pick a numbered choice or "*" for all: 1
    $ dx ls -l
    closed  2012-06-27 15:34:00 mouse_sample (file-9zbjZf2y8x61GP1199j00085)
    closed  2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
    "runSpec": {
      ...
      "execDepends": [
        {"name": "samtools"}
      ]
    }

    /

    \\\\/

    '\/'

    *

    \\\\\\\\*

    '\\\\\\*'

    ?

    \\\\\\\\?

    '\\\\\\?'


    The database cluster is stopped.

    starting | The database cluster is restarting from a stopped state, transitioning to available when ready.
    terminating | The database cluster is being terminated.
    terminated | The database cluster has been terminated and all data deleted.

    16

    2

    db_mem1_x4 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 32 | 4
    db_mem1_x8 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 64 | 8
    db_mem1_x16 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 128 | 16
    db_mem1_x32 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 244 | 32
    db_mem1_x48 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 384 | 48
    db_mem1_x64 | aurora-mysql: 8.0.mysql_aurora.3.04.1, aurora-postgresql: 12.9, 13.9, 14.6 | 488 | 64
    db_mem1_x96 | aurora-postgresql: 12.9, 13.9, 14.6 | 768 | 96

    dbcluster/new
    Using SSL/TLS to encrypt a connection to a DB instance or clusterarrow-up-right

    it is an origin or master job which has a data object (or linked hidden data object) output in the closing state

    billTo
    org, and try to archive all copies of files in all projects with the same
    billTo
    org using
    :
    archived
    state.

    Restore requested on the current file. The file is in transition from archival storage to standard storage.

    No

    archival | No | Yes* | No | No | Yes (Cancel archive)
    archived | No | Yes | No | No | Yes
    unarchiving | No | No | No | No | No

    /project-xxxx/archive
    modifies an open-source notebook to convert the JSON file to CSV format,
  • saves the modified notebook to the private GitHub repository,

  • and uploads the results of JSON to CSV conversion back to the DNAnexus project.

  • On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.

  • Add your Stata settings file as an input. This is the .json file you created, containing your Stata license details.

  • In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.

  • Click the Start Analysis button at the top right corner of the screen. This launches the JupyterLab app, and takes you to the project's Monitor tab, where you can monitor the app's status as it loads.

    The location of the DNAnexus tab within the JupyterLab interface.
    dx api file-xxxx listProjects '{"archivalInfoForOrg":"org-xxxx"}'
    {
    "project-xxxx": "ADMINISTER",
    "project-yyyy": "CONTRIBUTE",
    "liveProjects": [
     "project-xxxx",
     "project-yyyy",
     "project-zzzz"
    ]
    }
    dx api project-xxxx archive '{"files": ["file-xxxx"], "allCopies": true}'
    {
    "id": "project-xxxx",
    "count": 1
    }
    "runSpec": {
      ...
      "systemRequirements": {
        "main": {
          "instanceType": "mem1_ssd1_x2"
        },
        "samtoolscount_bam": {
          "instanceType": "mem1_ssd1_x4"
        },
        "combine_files": {
          "instanceType": "mem1_ssd1_x2"
        }
      },
      ...
    }
    regions = parseSAM_header_for_region(filename)
    split_regions = [regions[i:i + region_size]
                      for i in range(0, len(regions), region_size)]
    
    if not index_file:
        mappings_bam, index_file = create_index_file(filename, mappings_bam)
    print('creating subjobs')
    subjobs = [dxpy.new_dxjob(
                fn_input={"region_list": split,
                          "mappings_bam": mappings_bam,
                          "index_file": index_file},
                fn_name="samtoolscount_bam")
                for split in split_regions]
    
    fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
                    for subjob in subjobs]
    print('combining outputs')
    postprocess_job = dxpy.new_dxjob(
        fn_input={"countDXlinks": fileDXLinks, "resultfn": filename},
        fn_name="combine_files")
    
    countDXLink = postprocess_job.get_output_ref("countDXLink")
    
    output = {}
    output["count_file"] = countDXLink
    
    return output
    def samtoolscount_bam(region_list, mappings_bam, index_file):
        """Processing function.
    
        Arguments:
            region_list (list[str]): Regions to count in BAM
            mappings_bam (dict): dxlink to input BAM
            index_file (dict): dxlink to input BAM index
    
        Returns:
            Dictionary containing dxlinks to the uploaded read counts file
        """
        #
        # Download inputs
        # -------------------------------------------------------------------
        # dxpy.download_all_inputs will download all input files into
        # the /home/dnanexus/in directory.  A folder will be created for each
        # input and the file(s) will be downloaded to that directory.
        #
        # In this example, the inputs dictionary has the following key-value pairs.
        # Note that the values are all lists.
        #     mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
        #     mappings_bam_name: [u'<bam filename>.bam']
        #     mappings_bam_prefix: [u'<bam filename>']
        #     index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
        #     index_file_name: [u'<bam filename>.bam.bai']
        #     index_file_prefix: [u'<bam filename>']
        #
    
        inputs = dxpy.download_all_inputs()
    
        # SAMtools view command requires the bam and index file to be in the same
        # directory.
        shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
        shutil.move(inputs['index_file_path'][0], os.getcwd())
        input_bam = inputs['mappings_bam_name'][0]
    
        #
        # Per region perform SAMtools count.
        # --------------------------------------------------------------
        # Output count for regions and return DXLink as job output to
        # allow other entry points to download job output.
        #
    
        with open('read_count_regions.txt', 'w') as f:
            for region in region_list:
                view_cmd = create_region_view_cmd(input_bam, region)
                region_proc_result = run_cmd(view_cmd)
                region_count = int(region_proc_result[0])
                f.write("Region {0}: {1}\n".format(region, region_count))
        readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
        readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())
    
        return {"readcount_fileDX": readCountDXlink}
    def combine_files(countDXlinks, resultfn):
        """The 'gather' subjob of the applet.
    
        Arguments:
            countDXlinks (list[dict]): list of DXlinks to process job output files.
            resultfn (str): Filename to use for job output file.
    
        Returns:
            DXLink for the main function to return as the job output.
    
        Note: Only the DXLinks are passed as parameters.
        Subjobs work on a fresh instance so files must be downloaded to the machine
        """
        if resultfn.endswith(".bam"):
            resultfn = resultfn[:-4] + '.txt'
    
        sum_reads = 0
        with open(resultfn, 'w') as f:
            for i, dxlink in enumerate(countDXlinks):
                dxfile = dxpy.DXFile(dxlink)
                filename = "countfile{0}".format(i)
                dxpy.download_dxfile(dxfile, filename)
                with open(filename, 'r') as fsub:
                    for line in fsub:
                        sum_reads += parse_line_for_readcount(line)
                        f.write(line)
            f.write('Total Reads: {0}'.format(sum_reads))
    
        countDXFile = dxpy.upload_local_file(resultfn)
        countDXlink = dxpy.dxlink(countDXFile.get_id())
    
        return {"countDXLink": countDXlink}
    $ dx ls
    Developer Quickstart/
    Developer Tutorials/
    Quickstart/
    RNA-seq Workflow Example/
    SRR100022/
    _README.1st.txt
    $ dx ls -l
    Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
    Folder : /
    Developer Quickstart/
    Developer Tutorials/
    Quickstart/
    RNA-seq Workflow Example/
    SRR100022/
    State   Last modified       Size     Name (ID)
    closed  2015-09-01 17:55:33 712 bytes _README.1st.txt (file-BgY4VzQ0bvyg22pfZQpXfzgK)
    $ dx ls SRR100022/
    SRR100022_1.filt.fastq.gz
    SRR100022_2.filt.fastq.gz
    $ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/"
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ce10.cw2-index.tar.gz
    ce10.fasta.fai
    ce10.fasta.gz
    ce10.tmap-index.tar.gz
    $ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/*.fasta*"
    ce10.fasta.fai
    ce10.fasta.gz
    $ dx pwd
    Demo Data:/
    $ dx cd Quickstart/
    $ dx ls
    SRR100022_20_1.fq.gz
    SRR100022_20_2.fq.gz
    $ dx ls
    some_folder/
    an_applet
    ce10.fasta.gz
    Variation Calling Workflow
    $ dx mv ce10.fasta.gz C.elegans10.fasta.gz
    $ dx ls
    some_folder/
    an_applet
    C.elegans10.fasta.gz
    Variation Calling Workflow
    $ dx mv C.elegans10.fasta.gz some_folder/
    $ dx ls some_folder/
    Hg19
    C.elegans10.fasta.gz
    ...
    $ dx select project-BQpp3Y804Y0xbyG4GJPQ01xv
    Selected project project-BQpp3Y804Y0xbyG4GJPQ01xv
    $ dx cd H.\ Sapiens\ -\ GRCh37\ -\ hs37d5\ (1000\ Genomes\ Phase\ II)/
    $ dx ls
    hs37d5.2bit
    hs37d5.bt2-index.tar.gz
    hs37d5.bwa-index.tar.gz
    hs37d5.cw2-index.tar.gz
    hs37d5.fa.fai
    hs37d5.fa.gz
    hs37d5.fa.sa
    hs37d5.tmap-index.tar.gz
    $ dx cp hs37d5.fa.gz project-9z94ZPZvbJ3qP0pyK1P0000p:/
    $ dx select project-9z94ZPZvbJ3qP0pyK1P0000p
    $ dx ls
    some_folder/
    an_applet
    C.elegans10.fasta.gz
    hs37d5.fa.gz
    Variation Calling Workflow
    $ dx select
    
    Note: Use "dx select --level VIEW" or "dx select --public" to select from
    projects for which you only have VIEW permission to.
    
    Available projects (CONTRIBUTE or higher):
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Mouse (ADMINISTER)
    
    Project # [1]: 2
    Setting current project to: Mouse
    $ dx ls -l
    Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
    Folder : /
    $ dx select --public
    
    Available public projects:
    0) Example 1 (VIEW)
    1) Apps Data (VIEW)
    2) Parliament (VIEW)
    3) CNVkit Tests (VIEW)
    ...
    m) More options not shown...
    
    Pick a numbered choice or "m" for more options: 1
    $ dx select --level VIEW
    
    Available projects (VIEW or higher):
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Shared Applets (VIEW)
    3) Mouse (ADMINISTER)
    
    Pick a numbered choice or "m" for more options: 2
    $ dx select project-9zVfbG2y8x65kxKY7x20005G
    Selected project project-9zVfbG2y8x65kxKY7x20005G
    $ dx ls -l
    Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
    Folder : /
    %%bash
    dx download input_data/reads.fastq
    ! dx download input_data/reads.fastq
    import dxpy
    dxpy.download_dxfile(dxid='file-xxxx',
                         filename='unique_name.txt')
    %%bash
    dx upload Readme.ipynb
    import dxpy
    dxpy.upload_local_file('variants.vcf')
    $ dx pwd
    MyProject:/
    %%bash
    pip install torch
    pip install torchvision
    conda install -c conda-forge opencv
    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
    # nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    // Let's upgrade CUDA 11.4 to 12.5
    # apt-get update
    # apt-get -y install cuda-toolkit-12-5 cuda-compat-12-5
    # echo /usr/local/cuda/compat > /etc/ld.so.conf.d/NVIDIA-compat.conf
    # ldconfig
    # nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 12.5     |
    |-------------------------------+----------------------+----------------------+
    // CUDA 12.5 is now usable from terminal and notebooks
    $ dx select "Reference Genome Files"
    $ dx cd "C. Elegans - Ce10/"
    $ dx pwd # Print current working directory
    Reference Genome Files:/C. Elegans - Ce10
    $ dx ls
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ce10.cw2-index.tar.gz
    ce10.fasta.fai
    ce10.fasta.gz
    ce10.hisat2-index.tar.gz
    ce10.star-index.tar.gz
    ce10.tmap-index.tar.gz
    $ dx ls '*.fa*' # List objects with filenames of the pattern "*.fa*"
    ce10.fasta.fai
    ce10.fasta.gz
    $ dx ls ce10.???-index.tar.gz # List objects with filenames of the pattern "ce10.???-index.tar.gz"
    ce10.cw2-index.tar.gz
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    $ dx find data --name "*.fa*.gz"
    closed  2014-10-09 09:50:51 776.72 MB /M. musculus - mm10/mm10.fasta.gz (file-BQbYQPj0Z05ZzPpb1xf000Xy)
    closed  2014-10-09 09:50:30 767.47 MB /M. musculus - mm9/mm9.fasta.gz (file-BQbYK6801fFJ9Fj30kf003PB)
    closed  2014-10-09 09:49:27 49.04 MB /D. melanogaster - Dm3/dm3.fasta.gz (file-BQbYVf80yf3J9Fj30kf00PPk)
    closed  2014-10-09 09:48:55 29.21 MB /C. Elegans - Ce10/ce10.fasta.gz (file-BQbY9Bj015pB7JJVX0vQ7vj5)
    closed  2014-10-08 13:52:26 818.96 MB /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.fa.gz (file-B6ZY7VG2J35Vfvpkj8y0KZ01)
    closed  2014-10-08 13:51:31 876.79 MB /H. Sapiens - hg19 (UCSC)/ucsc_hg19.fa.gz (file-B6qq93v2J35fB53gZ5G0007K)
    closed  2014-10-08 13:50:53 827.95 MB /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.fa.gz (file-B6ZYPQv2J35xX095VZyQBq2j)
    closed  2014-10-08 13:50:17 818.88 MB /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.fa.gz (file-BFBv6J80634gkvZ6z100VGpp)
    closed  2014-10-08 13:49:53 810.45 MB /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.fa.gz (file-B6ZXxfG2J35Vfvpkj8y0KXF5)
    # Correct usage with quotes
    dx ls '*.fa*'              # Single quotes prevent shell expansion
    dx find data --name "*.gz" # Double quotes also work
    # Searching for a file with colons in the name
    dx find data --name "sample\:123.txt"
    # Or alternatively with single quotes
    dx find data --name 'sample\:123.txt'
    
    # Searching for a file with a literal asterisk
    dx find data --name "experiment\*.fastq"
    $ dx find data --created-after 2017-02-22 --created-before 2017-02-25
    closed  2017-02-27 19:14:51 3.90 GB  /H. Sapiens - hg19 (UCSC)/ucsc_hg19.hisat2-index.tar.gz (file-F2pJvF80Vzx54f69K4J8K5xy)
    closed  2017-02-27 19:14:21 3.55 GB  /M. musculus - mm10/mm10.hisat2-index.tar.gz (file-F2pJqk00Vq161bzq44Vjvpf5)
    closed  2017-02-27 19:13:57 3.51 GB  /M. musculus - mm9/mm9.hisat2-index.tar.gz (file-F2pJpKj0G0JxZxBZ4KJq0Q6B)
    closed  2017-02-27 19:13:41 3.85 GB  /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.hisat2-index.tar.gz (file-F2pJkp00BjBk99xz4Jk74V0y)
    closed  2017-02-27 19:13:28 3.85 GB  /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.hisat2-index.tar.gz (file-F2pJpy007bGBzj7X446PzxJJ)
    closed  2017-02-27 19:13:02 3.90 GB  /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.hisat2-index.tar.gz (file-F2pJpb000vFpzj7X446PzxF0)
    closed  2017-02-27 19:12:31 3.91 GB  /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.hisat2-index.tar.gz (file-F2pK5y00F8Bp9BYk4KX7Qb4P)
    closed  2017-02-27 19:12:18 224.54 MB /D. melanogaster - Dm3/dm3.hisat2-index.tar.gz (file-F2pJP7j0QkbQ3ZqG269589pj)
    closed  2017-02-27 19:11:56 139.76 MB /C. Elegans - Ce10/ce10.hisat2-index.tar.gz (file-F2pJK300KKz8bx1126Ky5b3P)
    $ dx find data --tag sampleABC --tag batch123
    closed  2017-01-01 09:00:00 6.08 GB  /Input/SRR504516_1.fastq.gz (file-xxxx)
    closed  2017-01-01 09:00:00 5.82 GB  /Input/SRR504516_2.fastq.gz (file-wwww)
    $ dx find data --property sequencing_provider=CRO_XYZ
    closed  2017-01-01 09:00:00 8.06 GB  /Input/SRR504555_1.fastq.gz (file-qqqq)
    closed  2017-01-01 09:00:00 8.52 GB  /Input/SRR504555_2.fastq.gz (file-rrrr)
    $ dx find data --name "*.fastq.gz" --path project-BQfgzV80bZ46kf6pBGy00J38:/Input
    closed  2014-10-03 12:04:16 6.08 GB  /Input/SRR504516_1.fastq.gz (file-B40jg7v8KfPy38kjz1vQ001y)
    closed  2014-10-03 12:04:16 5.82 GB  /Input/SRR504516_2.fastq.gz (file-B40jgYG8KfPy38kjz1vQ0020)
    $ dx find data --name "SRR*_1.fastq.gz" --all-projects
    closed  2017-01-01 09:00:00 6.08 GB  /Exome Analysis Demo/Input/SRR504516_1.fastq.gz (project-xxxx:file-xxxx)
    closed  2017-07-01 10:00:00 343.58 MB /input/SRR064287_1.fastq.gz (project-yyyy:file-yyyy)
    closed  2017-01-01 09:00:00 6.08 GB  /data/exome_analysis_demo/SRR504516_1.fastq.gz (project-zzzz:file-xxxx)
    dx api system findDataObjects '{"scope": {"project": "project-xxxx"}, "describe":{"fields":{"state":true}}}'
    {
      "license": {
        "serialNumber": "<Serial number from Stata>",
        "code": "<Code from Stata>",
        "authorization": "<Authorization from Stata>",
        "user": "<Registered user line 1>",
        "organization": "<Registered user line 2>"
      }
    }
    {
      "licenseFile": {
        "$dnanexus_link": {
          "id": "file-xxxx",
          "project": "project-yyyy"
        }
      }
    }
    !dx download project-xxxx:file-yyy
    use /mnt/project/<path>/data_in.dta
    import delimited /mnt/project/<path>/data_in.csv
    save data_out
    export delimited data_out.csv
    !dx upload <file> --destination=<destination>
    dx upload <file> --destination=<destination>
    pandas_df = spark_df.toPandas()
    pandas_df.to_stata("data_out.dta")
    pandas_df.to_csv("data_out.csv")
    %%bash
    dx upload <file>

    Login and Logout

    Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.

    Logging In and Out via the User Interface

    To log in via the user interface (UI), open the DNAnexus Platform login page and enter your username and password.

    To log out via the UI, open your account menu and select Sign Out.

    Logging In via the Command-Line Interface

    To log in via the command-line interface (CLI), make sure you've installed the dx command-line client. From the CLI, enter the command dx login.

    Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.

    See below for directions on using a token to log in.

    See the dx login reference for details on the optional arguments that can be used with dx login.

    Logging Out via the Command-Line Interface

    When using the CLI, log out by entering the command dx logout.

    If you use a token to log in, logging out invalidates that token. To log in again, you must generate a new token.

    See the dx logout reference for details on the optional arguments that can be used with dx logout.

    Auto Logout

    Session inactivity

    By default, the system logs out users after 15 minutes of inactivity. Exceptions apply to users logged in with an API token that specifies a different session duration, or users in an org with a custom autoLogoutAfter policy.

    Contact DNAnexus Support for more information on setting a custom autoLogoutAfter policy for an org.

    Credentials change

    The system automatically logs out users when they change their account credentials. This happens immediately after the credentials change is complete. Exceptions apply to users logged in with an API token.

    The following actions are considered credentials changes:

    • Change a password

    • Reset a password

    • Confirm a new email address after updating account email

    • Enable or disable multi-factor authentication (MFA)

    By default, changing your credentials does not automatically terminate any running jobs or active downloads and uploads that are authenticated on your behalf. When you change your credentials, you can choose Revoke Active Tokens to terminate these running jobs and active transfers. See details on revoking a token.

    If you suspect your account may be compromised, we strongly recommend that you:

    • Opt in to terminate any active jobs authenticated on your behalf.

    Using Tokens

    You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.


    Exercise caution when sharing DNAnexus Platform tokens. Anyone with a token can access the Platform and impersonate you as a user. They gain your access level to any projects accessible by the token, enabling them to run jobs and potentially incur charges to your account.

    Generating a Token

    To generate a token, open your account menu and select My Profile.

    Next, click on the API Tokens tab. Then click the New Token button.

    The New Token form opens in a modal window.

    Consider the following points when filling out the form:

    • The token provides access to each project at the level at which you have access. See the documentation on project access levels.

    • If the token provides access to a project within which you have PHI data access, it enables access to that PHI data.

    • Tokens without a specified expiration date expire in one month.

    After completing the form, click Generate Token. The system generates a 32-character token and displays it with a confirmation message.


    Copy your token immediately. The token is inaccessible after dismissing the confirmation message or navigating away from the API Tokens screen.

    Using a Token to Log In

    To log in with a token via the CLI, enter the command dx login --token, followed by a valid 32-character token.

    Token Use Cases

    Tokens are useful in multiple scenarios, such as:

    • Logging in via the CLI with single sign-on enabled - If your organization uses single sign-on, logging in via the CLI might require a token instead of a username and password.

    • Logging in via a script - Scripts can use tokens to authenticate with the Platform.

    When incorporating a token into a script, set the token's expiration date so that the script has Platform access only for as long as necessary. Also ensure that the token grants access only to the project or projects the script needs in order to function.
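One way to keep a token out of a script's source is to read it from the environment at run time. The sketch below assumes a hypothetical environment variable named `DX_API_TOKEN` (a name chosen for this example) and builds the dictionary that the dxpy library's `set_security_context` function accepts. Treat it as an illustration under those assumptions rather than a prescribed pattern.

```python
import os


def security_context_from_env(var_name="DX_API_TOKEN", environ=None):
    """Build a dxpy-style security context from an environment variable.

    `DX_API_TOKEN` is a hypothetical variable name used only in this sketch.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(var_name, "").strip()
    if not token:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError(f"{var_name} is not set; refusing to continue")
    return {"auth_token_type": "Bearer", "auth_token": token}


# Demonstration with a placeholder value instead of a real token.
ctx = security_context_from_env(environ={"DX_API_TOKEN": "x" * 32})
print(ctx["auth_token_type"])  # Bearer

# With dxpy installed, the context can then be applied:
#   import dxpy
#   dxpy.set_security_context(ctx)
```

Reading the token at run time also makes it easier to rotate: a revoked or expired token is replaced in the environment without editing the script.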

    Revoking a Token

    When you revoke API tokens, all running jobs and all active file uploads or downloads authenticated with those tokens are terminated immediately and fail with an authentication-related error. Any compute or egress charges incurred up to the point of termination remain billable to the account associated with those operations.

    To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button.

    In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token is revoked, and its name no longer appears in the list of tokens on the API Tokens screen.

    When to Revoke a Token

    • Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.

    • Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.

    Logging In Non-Interactively

    Though logging in typically requires direct interaction with the Platform through the UI or CLI, non-interactive login is also possible. Scripts commonly automate both login and project selection.

    Non-interactive login uses dx login with the --token argument. The dx select command automates project selection. For manual project selection, add the --noprojects argument to dx login.
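As a rough illustration of a scripted login, the following Python sketch builds the two dx command lines and runs them with subprocess. The helper names are hypothetical and chosen only for this example. It assumes the dx command-line client is installed and on the PATH.

```python
import subprocess


def build_login_commands(token, project_id=None):
    """Return the dx command lines for a non-interactive login.

    --noprojects skips the interactive project prompt; the project is
    then chosen explicitly with dx select (if a project ID is given).
    """
    commands = [["dx", "login", "--token", token, "--noprojects"]]
    if project_id:
        commands.append(["dx", "select", project_id])
    return commands


def login(token, project_id=None):
    """Run the login commands; raises CalledProcessError if one fails."""
    for argv in build_login_commands(token, project_id):
        subprocess.run(argv, check=True)
```

A script would typically read the token from an environment variable or a secrets store rather than hard-coding it, then call login(token, project_id).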

    Two-Factor Authentication

    DNAnexus recommends adding two-factor authentication to your account as an extra layer of security for all data you can access on the Platform.

    With two-factor authentication enabled, you must enter a two-factor authentication code to log into the Platform and access certain other services. This code is a time-based one-time password valid for a single session, generated by a third-party two-factor authenticator application, such as Google Authenticator.

    Two-factor authentication protects your account by requiring both your credentials and an authentication code. This prevents unauthorized access even if your username and password are compromised.

    Enabling Two-Factor Authentication

    DNAnexus recommends using a time-based one-time password (TOTP)–compliant authenticator application on your mobile device. Popular options include Google Authenticator, Authy, Microsoft Authenticator, and 1Password. Google Authenticator is a free application that's available for both Apple iOS and Android mobile devices. Get it on Google Play or from the Apple App Store.

    If you are unable to use a smartphone application, compatible two-factor authenticator applications that use the TOTP (time-based one-time password) algorithm exist for other platforms.

    1. From your account menu, select Account Security.

    2. In the Two-Factor Authentication section, click Enable 2FA.

    3. Choose a TOTP-compatible authenticator application and click Next.

    Save your backup codes before closing the setup dialog. They are not accessible again after you close it. Store them in a secure location. Without backup codes and without access to your authenticator application, Platform login becomes impossible.

    Contact DNAnexus Support if you lose both your codes and access to your authenticator application.

    Logging In with a Backup Code

    If you lose access to your authenticator application, you can use a saved backup code to log in.

    1. Navigate to the Platform login page and enter your username and password.

    2. When prompted for your two-factor authentication code, enter one of your saved backup codes instead.

    Each backup code can only be used once. Keep track of which codes you have used and store remaining codes securely.

    If you lose access to both your authenticator application and your backup codes, you will not be able to log in to the Platform. Contact DNAnexus Support for assistance.

    Disabling Two-Factor Authentication

    DNAnexus recommends keeping two-factor authentication enabled after activation. You can disable it manually, provided it has not been strictly enforced by your organization admin.

    Turning off two-factor authentication logs you out of all active web sessions immediately and is treated as a credentials change. If you re-enable 2FA later, you need to reconfigure your authenticator application by scanning a new QR code or entering a new secret key, and save a new set of backup codes.

    1. From your account menu, select Account Security.

    2. In the Two-Factor Authentication section, click Turn Off.

    3. In the confirmation dialog, enter your current password and either the 6-digit code from your authenticator app or a backup code.

    If you used a backup code to log in to the Platform, you need a second backup code to confirm disabling 2FA in step 3, because each backup code can only be used once.

    Troubleshooting Two-Factor Authentication

    Code validation failures are most commonly caused by a time-desynchronization issue on your mobile device. This can occur in two ways:

    • Code invalid during setup — The Platform reports an invalid code immediately after scanning the QR code during initial 2FA setup.

    • Codes stopped working — You previously set up 2FA successfully, but newly generated codes are no longer accepted at login.

    To resolve either issue, enable Automatic Date and Time in your device's system settings and restart your phone. This synchronizes your device clock with the server time required for the TOTP algorithm to generate valid codes.
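To see why clock drift invalidates codes, here is a minimal sketch of the standard TOTP computation (RFC 6238, built on RFC 4226). The 30-second step, SHA-1, and 6 digits are the common defaults and are assumed here. The Platform's exact parameters and any tolerance window are not stated in this documentation. The secret is the published RFC test value, not a real key. Real authenticator secrets are usually Base32-encoded.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password for the given clock reading."""
    counter = unix_time // step                      # index of the time window
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation offset
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


secret = b"12345678901234567890"    # RFC test secret (assumption for the demo)
server_code = totp(secret, 59)      # code expected for the server's clock
in_sync = totp(secret, 45)          # device clock in the same 30 s window
drifted = totp(secret, 59 + 120)    # device clock two minutes fast

print(server_code, in_sync == server_code, drifted == server_code)
```

A device whose clock falls in the same time window produces the same code as the server, while a device that is minutes off computes a code for a different window, which is why enabling automatic date and time resolves the issue.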

    Analyzing Somatic Variants

    Analyze somatic variants, including cancer-specific filtering, visualization, and variant landscape exploration in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Explore and analyze datasets with somatic variant assays by opening them in the Cohort Browser and switching to the Somatic Variants tab. You can create cohorts based on somatic variants, visualize variant patterns, and examine detailed variant information.

    You can analyze somatic variants across four main categories: Single Nucleotide Variants (SNVs) & Indels for small genomic changes, Copy Number Variants (CNVs) for alterations in gene copy numbers, Fusions for structural rearrangements involving gene coding sequences, and Structural Variants (SVs) for larger genomic rearrangements.

    Somatic assay datasets are created using the Somatic Variant Assay Loader.

    Variant Classification

    The somatic data model classifies all genomic variants into four main classes, defined by their size, structure, and representation in VCF files. Each variant type has specific criteria that must be met for classification.

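    Variant Type | Classification Criteria | Examples
    SNV & Indel: Single base substitutions and small insertions/deletions with precise allele sequences | All must match: variant size ≤ 50bp; ALT field contains precise allele (NOT symbolic like <DEL>, <INS>, <DUP>, <CNV>) | A→G, ATCG→A, A→ATCG
    Copy Number Variant (CNV): Changes in gene copy number | All must match: ALT field contains symbolic allele (<CNV>, <DEL>, <DUP>); explicit copy number value present in FORMAT field key CN | <CNV>, <DEL>, <DUP>
    Fusion: Structural rearrangements involving gene coding sequences | All must match: ALT field contains breakend notation with square brackets ([ or ]); at least one breakpoint overlaps with annotated gene or transcript | [chr2:123456[, ]chr5:789012]
    Structural Variant (SV): Large or complex structural changes | Either must match: variant length > 50bp; ALT field contains symbolic allele (<DEL>, <INV>, <CNV>, <BND>) | <DEL>, <INV>, large insertions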
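The criteria in the classification table can be read as a set of independent checks. The sketch below is one possible interpretation, not the Platform's implementation. The function name and its parameters are hypothetical. It assumes the size of a precise allele is the longer of REF and ALT, and that breakend notation counts as a symbolic representation for the Structural Variant class.

```python
import re

SYMBOLIC_ALT = re.compile(r"^<[^<>]+>$")     # e.g. <DEL>, <CNV>, <BND>
CNV_ALTS = {"<CNV>", "<DEL>", "<DUP>"}


def classify_variant(ref, alt, has_copy_number=False, overlaps_gene=False):
    """Return the set of variant classes a single REF/ALT pair falls into."""
    symbolic = bool(SYMBOLIC_ALT.match(alt))
    breakend = "[" in alt or "]" in alt
    # Assumption: size is only defined for precise (non-symbolic) alleles.
    size = None if (symbolic or breakend) else max(len(ref), len(alt))

    classes = set()
    if size is not None and size <= 50:
        classes.add("SNV & Indel")           # precise allele, <= 50bp
    if alt in CNV_ALTS and has_copy_number:
        classes.add("CNV")                   # symbolic allele plus CN value
    if breakend and overlaps_gene:
        classes.add("Fusion")                # breakend touching a gene
    if symbolic or breakend or (size is not None and size > 50):
        classes.add("SV")                    # either criterion is enough
    return classes
```

Returning a set rather than a single label reflects the dual classification: a <DEL> record with a copy number value is both a CNV and a Structural Variant.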
    Somatic Variants in Cohort Browser

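    • CNVs and Fusions are also classified as Structural Variants in the Cohort Browser because they use symbolic allele representations (<CNV>, <DEL>, <DUP>, <BND>). This dual classification ensures they are correctly distinguished from SNVs regardless of their physical length.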

    Filtering by Somatic Variants

    You can define your cohort to include only samples with specific somatic variants.

    To apply a somatic filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Assays > Variant (Somatic), select a genomic filter.

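    3. In Edit Filter: Variant (Somatic), specify the criteria:

        • For datasets with multiple somatic variant assays, select the specific assay to filter by.

        • Choose whether to include patients with at least one detected variant matching the specified criteria (WITH Variant), or include only patients who have no detected variants matching the criteria (WITHOUT Variant). By default, the filter includes those with matching variants. This choice applies to all specified filtering criteria.

        • On the Genes / Effects tab, select variants of specific types and variant consequences within specified genes and genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.

        • On the HGVS tab, specify a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol. Example: KRAS p.Arg1459Ter.

        • On the Variant IDs tab, specify variant IDs using the standard format chr_pos_ref_alt (for example, 17_7674257_A_G). You can enter up to 10 variant IDs in a comma-separated list.

        • Enter multiple genes, ranges, or variants by separating them with commas or placing each on a new line.

        • Click Apply Filter.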

    You can specify up to 10 somatic variant filters for each cohort.

    After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.

    Working with Large Structural Variants (>10Mbp)

    Structural variants larger than 10 megabases lack gene-level annotations, which limits how you can filter and visualize them. Use these alternative filtering approaches:

    • Filter by genomic coordinates: In the Genes / Effects filter, enter genomic coordinates in the format chr:start-end, for example, 17:7661779-7687538 for the TP53 gene region. Set the variant type scope to SV or CNV and leave consequence types blank. Find gene coordinates by typing the gene symbol in the search icon next to the Variants & Events table.

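    • Filter by variant IDs: In the Variant IDs filter, enter up to 10 variant IDs in the format chr_pos_ref_alt, for example, 17_7674257_A_<DEL>. To get variant IDs, navigate to the gene region in the Variants & Events table, select variants of interest, and download the CSV file. The Location column contains the variant IDs.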
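Before pasting values into these filters, it can help to check them locally. The following sketch parses the two formats described above (chr:start-end and chr_pos_ref_alt). The function names are hypothetical, and the 10-ID limit is taken from the filter description.

```python
def parse_region(text):
    """Parse 'chr:start-end' into (chromosome, start, end)."""
    chrom, span = text.strip().split(":", 1)
    start, end = (int(part) for part in span.split("-", 1))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def parse_variant_ids(text, limit=10):
    """Parse a comma-separated list of 'chr_pos_ref_alt' variant IDs."""
    ids = [item.strip() for item in text.split(",") if item.strip()]
    if len(ids) > limit:
        raise ValueError(f"at most {limit} variant IDs are allowed")
    parsed = []
    for variant_id in ids:
        # Split on the first three underscores only, so ALT stays intact.
        chrom, pos, ref, alt = variant_id.split("_", 3)
        parsed.append((chrom, int(pos), ref, alt))
    return parsed


print(parse_region("17:7661779-7687538"))
print(parse_variant_ids("17_7674257_A_<DEL>, 17_7674257_A_G"))
```

Malformed input raises ValueError, which makes it easy to catch typos before applying a filter.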

    For comprehensive structural variant analysis, combine multiple filtering approaches. Use gene symbol filters to capture annotated structural variants ≤ 10Mbp, then add coordinate-based filters to include larger structural variants in the same genomic regions.

    Large structural variants are visible in the Variants & Events table with full details, but they do not appear in the Variant Frequency Matrix due to missing gene-level annotations.

    Comparing Variant Patterns Across Your Cohort

    The Variant Frequency Matrix provides a visual overview of how often somatic variants appear throughout your cohort. The matrix helps you identify variant patterns across tumor samples and discover which variants frequently occur together. You can also measure the mutation burden in different genes and compare how mutation profiles differ between two cohorts. This makes it easier to spot trends and relationships in your data that might not be apparent when examining individual variants.

    The Variant Frequency Matrix is interactive. You can filter by genes and consequences, view details of specific genes and samples, and zoom in on specific genes or regions.

    In the Variant Frequency Matrix, rows represent genes and columns represent samples. Both are sorted by variant frequency.

    • Sorted gene list: Genes are ranked from most to least frequently affected by variants. A sample is considered "affected" by a gene if it is a tumor sample with at least one detected variant of high or moderate impact in that gene's canonical transcript. Matched normal samples are not included in this calculation.

    • Sorted sample list: Samples are ordered by the total number of genes that contain variants. This ranking is independent of how frequently each individual gene is affected.

    The Variant Frequency Matrix displays up to the top 50 genes with the most variants and up to 500 samples for any given cohort. The samples shown are the 500 with the highest number of genes containing variants. If your cohort has fewer than 500 samples, the matrix shows all samples.
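The sorting and truncation rules above can be sketched in a few lines. This is an illustration only. The input structure (a mapping from tumor sample ID to the set of genes with at least one high or moderate impact variant) and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter


def rank_matrix(affected, max_genes=50, max_samples=500):
    """Return (genes, samples) in the order the matrix would list them.

    affected: dict mapping sample ID -> set of affected gene symbols.
    """
    # Genes: ranked by how many samples they are affected in.
    gene_counts = Counter(g for genes in affected.values() for g in genes)
    genes = sorted(gene_counts, key=lambda g: (-gene_counts[g], g))[:max_genes]
    # Samples: ranked by how many genes contain variants.
    samples = sorted(affected, key=lambda s: (-len(affected[s]), s))[:max_samples]
    return genes, samples


cohort = {
    "S1": {"TP53", "KRAS"},
    "S2": {"TP53"},
    "S3": {"TP53", "KRAS", "EGFR"},
}
print(rank_matrix(cohort))
```

The two rankings are computed independently, which matches the note that sample order does not depend on how frequently each individual gene is affected.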

    Filtering by Genes and Consequences

    By default, the Variant Frequency Matrix includes all genes and samples. To narrow your view, you can filter the matrix to specific classes of somatic variants, such as SNVs & Indels, Structural Variants, CNVs, or Fusions.

    Using the legend in the bottom right, you can focus on specific variants, events, or consequences. This allows you to better explore particular areas of interest, such as high-impact mutations or specific consequences relevant to your research.

    When comparing cohorts, the matrix can display the top 200 samples (columns) from both the primary and secondary cohorts. The top genes are selected and sorted by their variant frequency within the primary cohort.

    Viewing Gene and Sample Details

    The Variant Frequency Matrix is highly interactive, allowing you to quickly access more details and apply filters.

    When you hover over a cell, the matrix shows a unique identifier for the sample, along with a breakdown of the variants detected in that gene, organized by their consequence type. You can copy the sample ID to your clipboard to apply it to a cohort filter.

    When you hover over a gene ID on the left axis, the matrix shows more information about that gene. This includes a unique identifier for the gene, along with a quick breakdown of available external annotations, with direct links to the CIViC and OncoKB databases (when available).

    To create a filter, hover over the gene and click + Add to Filter, or copy the gene ID to your clipboard for use in a custom filter.

    Color Coding and Consequences

    The Variant Frequency Matrix uses color coding to represent the consequences of detected variants, providing a quick visual assessment of variant types. Only high and moderate impact consequences, as defined by Ensembl VEP version 109, are included in this visualization.

    Samples with two or more detected variants are color-coded as "Multi Hit", indicating a complex variant profile.

    Exploring Gene-Level Mutation Patterns

    The Lollipop Plot is a visualization tool that shows the somatic variants of a cohort on a single gene's canonical protein. With Lollipop Plot, you can identify mutation hotspots within a specific gene, understand the functional impact of variants in the context of protein domains, compare mutation patterns across different patient cohorts, and explore recurrent mutations in cancer driver genes.

    Use the Go to Gene field to quickly navigate to a gene of interest, such as TP53.

    When you hover over a lollipop, you can see details about the amino acid change, such as the HGVS notation and the frequency of that change in the current cohort. The plot also shows the location of each mutation along the protein sequence, with color coding to indicate the consequence type.

    The Lollipop Plot displays SNV & Indel data from the same genomic region as the Variants & Events table. When you change the genomic region in the table, the Lollipop Plot updates to reflect the change, and vice versa.

    Reading the Lollipop Plot

    • Each lollipop on the plot represents amino acid changes at a specific location.

    • The horizontal position (X axis) indicates the location of the change, while the height (Y axis) represents the frequency of that change within the current cohort.

    • Lollipops are color-coded by consequence based on the canonical transcript.
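To make the relationship between amino acid changes and lollipop height concrete, the sketch below converts HGVS.p notation (such as p.Thr322Ala) to the short form (T322A) and computes the fraction of a cohort carrying each change. The function names and input structure are hypothetical. Treating frequency as a fraction of samples is an assumption, since the exact definition used by the plot is not stated here. Only simple substitutions are handled.

```python
import re
from collections import Counter

AA = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q",
    "Glu": "E", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V", "Ter": "*",
}
HGVS_P = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def short_form(hgvs_p):
    """Convert 'p.Thr322Ala' to 'T322A'."""
    match = HGVS_P.match(hgvs_p)
    if not match:
        raise ValueError(f"unsupported HGVS.p notation: {hgvs_p!r}")
    ref, position, alt = match.groups()
    return f"{AA[ref]}{position}{AA[alt]}"


def lollipop_heights(changes_by_sample, cohort_size):
    """Map each short-form change to the fraction of samples carrying it."""
    counts = Counter()
    for changes in changes_by_sample.values():
        for change in set(changes):          # count each sample once
            counts[change] += 1
    return {short_form(c): n / cohort_size for c, n in counts.items()}
```

For a cohort of four samples where two carry p.Thr322Ala, the T322A lollipop would have a height of 0.5 under this definition.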

    Examining Detailed Variant Information

    The Variants & Events table displays details on the same genomic region as the Lollipop Plot. You can filter the table to focus on specific variant types, such as SNV & Indels, SV (Structural Variants), CNV, or Fusion.

    Unlike the Variant Frequency Matrix, the Variants & Events table displays all structural variants including those larger than 10Mbp. Use this table to examine large SVs that may not appear in other visualizations.

    Information displayed in the Variants & Events table includes:

    • Location of variant, with a link to its locus details

    • Reference allele of variant

    • Alternate allele of variant

    Exporting Variant Information

    You can export the selected variants in the Variants & Events table as a list of variant IDs or a CSV file.

    • To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.

    • To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).

    Accessing External Annotations and Resources

    In Variants & Events > Location column, you can click on the specific location to open the locus details.

    The locus details show specific SNV & Indel variants as well as up to 200 structural variants overlapping with the specific location. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.

    The locus details include enhanced annotations to external resources:

    • Gene-level links - Direct links to gene information in external databases

    • Variant-level links - Links to variant-specific annotation resources

    These links allow you to quickly navigate to external annotation resources for further information about genes or variants of interest.

    SQL Runner

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    Rotate your API tokens after changing credentials. That is, delete your existing API tokens and create new ones.
    Enter your current Platform password and click Next.
  • Scan the provided QR code with your authenticator app. If you cannot scan it, enter the displayed text code manually into your app instead.

  • Enter the 6-digit code generated by your app and click Next.

  • Click Print or Download to save your one-time-use backup codes in a secure location.

  • Click Finish & Log Out to complete setup.

  • (Optional) Check Revoke Active Tokens to immediately terminate all running jobs and active file uploads and downloads.
  • Click Turn Off.


    All must match: • ALT field contains breakend notation with square brackets ([ or ]) • At least one breakpoint overlaps with annotated gene or transcript

    [chr2:123456[, ]chr5:789012]

    Structural Variant (SV) Large or complex structural changes

    Either must match: • Variant length > 50bp • ALT field contains symbolic allele (<DEL>, <INV>, <CNV>, <BND>)

    <DEL>, <INV>, large insertions

    ). This dual classification ensures they are correctly distinguished from SNVs regardless of their physical length.
  • For optimal performance and annotation scalability, the Cohort Browser processes SVs and CNVs between 50bp and 10Mbp differently than larger variants:

    • SVs and CNVs ≤ 10Mbp: Fully annotated with gene symbols and consequences, appear in all visualizations including the Variant Frequency Matrix

    • SVs and CNVs > 10Mbp: Ingested and visible in the Variants & Events table but lack gene-level annotations. These larger variants do not appear in the Variant Frequency Matrix and cannot be filtered using gene symbols or consequence terms. Use genomic coordinates or variant IDs to filter for these variants (see Working with Large Structural Variants below).

    • Fusions are not affected by this size limit as they are considered two single-position events.

  • For datasets with multiple somatic variant assays, select the specific assay to filter by.

  • Choose whether to include patients with at least one detected variant matching the specified criteria (WITH Variant), or include only patients who have no detected variants matching the criteria (WITHOUT Variant). By default, the filter includes those with matching variants. This choice applies to all specified filtering criteria.

  • On the Genes / Effects tab, select variants of specific types and variant consequences within specified genes and genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.

  • On the HGVS tab, specify a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol. Example: KRAS p.Arg1459Ter.

  • On the Variant IDs tab, specify variant IDs using the standard format chr_pos_ref_alt (for example, 17_7674257_A_G). You can enter up to 10 variant IDs in a comma-separated list.

  • Enter multiple genes, ranges, or variants, by separating them with commas or placing each on a new line.

  • Click Apply Filter.

  • . To get variant IDs, navigate to the gene region in the Variants & Events table, select variants of interest, and download the CSV file - the Location column contains the variant IDs.
    If a lollipop represents multiple consequence types, it is coded as "Multi Hit".
  • You can identify mutation hotspots for a given gene and see protein changes in HGVS short form notation, such as T322A, and HGVS.p notation, such as p.Thr322Ala.

  • Type of variant, such as SNV, Indel, or Structural Variant
  • Variant consequences, with entries color-coded by level of severity

  • HGVS cDNA

  • HGVS Protein

  • COSMIC ID

  • RSID, with a link to the dbSNP entry for the variant

  • SNV & Indel Single base substitutions and small insertions/deletions with precise allele sequences

    All must match: • Variant size ≤ 50bp • ALT field contains precise allele (NOT symbolic like <DEL>, <INS>, <DUP>, <CNV>)

    A→G, ATCG→A, A→ATCG

    Copy Number Variant (CNV) Changes in gene copy number

    All must match: • ALT field contains symbolic allele (<CNV>, <DEL>, <DUP>) • Explicit copy number value present in FORMAT field key CN

    <CNV>, <DEL>, <DUP>


    Fusion Structural rearrangements involving gene coding sequences

    The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.

    How to Run Spark SQL Runner

    Input:

    • sqlfile: [Required] A SQL file which contains an ordered list of SQL queries.

    • substitutions: A JSON file which contains the variable substitutions.

    • user_config: User configuration JSON file, in case you want to set or override certain Spark configurations.

    Other Options:

    • export: (boolean) default false. Exports output files with results for the queries in the sqlfile.

    • export_options: A JSON file which contains the export configurations.

    • collect_logs: (boolean) default false. Collects cluster logs from all nodes.

    • executor_memory: (string) Amount of memory to use per executor process, in MiB unless otherwise specified. Common values include 2g or 8g. This is passed as --executor-memory to Spark submit.

    • executor_cores: (integer) Number of cores to use per executor process. This is passed as --executor-cores to Spark submit.

    • driver_memory: (string) Amount of memory to use for the driver process. Common values include 2g or 8g. This is passed as --driver-memory to Spark submit.

    • log_level: (string) default INFO. Logging level for both driver and executors. [ALL, TRACE, DEBUG, INFO]

    Output:

    • output_files: Output files include report SQL file and query export files.

    Basic Run

    Examples

    sqlfile

    How sqlfile is Processed

    1. The SQL runner extracts each command in sqlfile and runs the commands in sequential order.

    2. Every SQL command must be separated from the next with a semicolon (;).

    3. Any command starting with -- is ignored as a comment. A comment within a command must be enclosed in /*...*/. For example, -- SHOW DATABASES; is ignored entirely, while SHOW /* this is valid comment */ TABLES; runs with the inline comment in place.
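The processing rules above can be approximated with a short function, which is handy for previewing how a sqlfile would be split before submitting it. This is a simplified sketch, not the app's actual parser. In particular, it does not handle semicolons inside string literals, and collapsing whitespace is a choice made here for readability.

```python
def split_sql(text):
    """Split SQL text into commands, skipping lines that start with --."""
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("--")
    ]
    commands = "\n".join(kept).split(";")
    # Collapse internal whitespace so multi-line commands print on one line.
    return [" ".join(command.split()) for command in commands if command.strip()]


example = """SHOW DATABASES;
-- SELECT * FROM dbname.tablename1;
SELECT * FROM
dbname.tablename2;
SHOW /* this is valid comment */ TABLES;"""

for number, command in enumerate(split_sql(example)):
    print(number, command)
```

The commented-out SELECT is dropped, the two-line SELECT is treated as one command, and the inline /*...*/ comment stays inside its command.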

    Variable Substitution

    Variable substitution can be done by specifying the variables to replace in substitutions.

    For example, if substitutions maps srcdb to sskrdemo1 and patient_table to patient, each reference to srcdb in sqlfile within ${...} is substituted with sskrdemo1. The script adds a set command for each variable before executing any of the SQL commands in sqlfile. As a result, select * from ${srcdb}.${patient_table}; is preceded by set srcdb=sskrdemo1; and set patient_table=patient; and therefore runs against sskrdemo1.patient.
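To preview the effect of a substitutions file locally, the following sketch generates the set commands and renders a query with Python's string.Template, which happens to use the same ${name} syntax. This is only a local preview under that assumption. In the app itself, the substitution is performed by Spark after the set commands run. The function names are hypothetical.

```python
from string import Template


def set_commands(substitutions):
    """Build the set commands that would precede the SQL commands."""
    return [f"set {name}={value};" for name, value in substitutions.items()]


def preview(sql, substitutions):
    """Render ${name} references locally to see the resulting query."""
    return Template(sql).substitute(substitutions)


substitutions = {"srcdb": "sskrdemo1", "patient_table": "patient"}
print(set_commands(substitutions))
print(preview("select * from ${srcdb}.${patient_table};", substitutions))
```

Template.substitute raises KeyError for a variable that is missing from the mapping, which helps catch references that the substitutions file does not define.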

    Export

    If enabled, the results of the SQL commands are exported to a CSV file. export_options defines an export configuration.

    1. num_files: default 1. This defines the maximum number of output files to generate. The number generally depends on how many executors are running in the cluster and how many partitions of this file exist in the system. Each output file corresponds to a part file in parquet.

    2. fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example, 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv.

    3. header: Default is true. If true, a header is added to each exported file.

    User Configuration

    Values in spark-defaults.conf override or add to the default Spark configuration.

    Output Files

    The export folder contains two generated files:

    • <JobId>-export.tar: Contains all the query results.

    • <JobId>-outfile.sql: SQL debug file.

    Export Files

    After extracting the export tar file, the structure appears as follows:

    In the above example, demo is the fileprefix used. The export produces one folder per query. Each folder contains a SQL file with the query executed and a .csv folder containing the result CSV.

    SQL Report File

    Every SQL run execution generates a SQL runner debug report file. This is a SQL file.

    It lists all the queries executed and status of the execution (Success or Fail). It also lists the name of the output file for that command and the time taken. If there are any failures, it reports the query and stops executing subsequent commands.
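Because the report is itself a SQL file with one comment line per executed command, it can be summarized with a small parser. The sketch below assumes the line layout seen in this page's sample reports ([SUCCESS] or [FAIL], an optional OutputFile, and TimeTaken in seconds). The function name and the returned field names are hypothetical.

```python
import re

REPORT_LINE = re.compile(
    r"^-- \[(?P<status>SUCCESS|FAIL)\]"
    r"(?:\[(?:OutputFile: (?P<output_file>[^,\]]+?)\s*, )?"
    r"TimeTaken: (?P<seconds>[0-9.eE+-]+) secs\s*\])?"
    r"\s*(?P<detail>.*)$"
)


def parse_report(text):
    """Return one dict per reported command; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = REPORT_LINE.match(line.strip())
        if not match:
            continue                      # header line or a pending query
        entry = match.groupdict()
        if entry["seconds"] is not None:
            entry["seconds"] = float(entry["seconds"])
        entries.append(entry)
    return entries


def first_failure(entries):
    """Return the first failed entry, or None if every command succeeded."""
    return next((e for e in entries if e["status"] == "FAIL"), None)
```

Checking first_failure on a downloaded report gives a quick way to find the command to fix before resubmitting the file.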

    SQL Errors

    During execution of the series of SQL commands, a command may fail, for example because of a syntax error. In that case, the app quits and uploads a SQL debug file to the project.

    The output identifies the line with the SQL error and its response.

    The query in the .sql file can be fixed, and this report file can be used as input for a subsequent run, allowing you to resume from where execution stopped.

    dx run spark-sql-runner \
       -i sqlfile=file-FQ4by2Q0Yy3pGp21F7vp8XGK \
       -i paramfile=file-FK7Qpj00GQ8Q7ybZ0pqYJj6G \
       -i export=true
    SELECT * FROM ${srcdb}.${patient_table};
    DROP DATABASE IF EXISTS ${dstdb} CASCADE;
    CREATE DATABASE IF NOT EXISTS ${dstdb} LOCATION 'dnax://';
    CREATE VIEW ${dstdb}.patient_view AS SELECT * FROM ${srcdb}.patient;
    SELECT * FROM ${dstdb}.patient_view;
    SHOW DATABASES;
    SELECT * FROM dbname.tablename1;
    SELECT * FROM
    dbname.tablename2;
    DESCRIBE DATABASE EXTENDED dbname;
    -- SHOW DATABASES;
    -- SELECT * FROM dbname.tablename1;
    SHOW /* this is valid comment */ TABLES;
    {
        "srcdb": "sskrdemo1",
        "dstdb": "sskrtest201",
        "patient": "patient_new",
        "f2c":"patient_f2c",
        "derived":"patient_derived",
        "composed":"patient_composed",
        "complex":"patient_complex",
        "patient_view": "patient_newview",
        "brca": "brca_new",
        "patient_table":"patient",
        "cna": "cna_new"
    }
    set srcdb=sskrdemo1;
    set patient_table=patient;
    select * from ${srcdb}.${patient_table};
    {
       "num_files" : 2,
       "fileprefix":"demo",
       "header": true
    }
    {
      "spark-defaults.conf": [
        {
          "name": "spark.app.name",
          "value": "SparkAppName"
        },
        {
          "name": "spark.test.conf",
          "value": true
        }
      ]
    }
    $ dx tree export
    export
    ├── job-FFp7K2j0xppVXZ791fFxp2Bg-export.tar
    ├── job-FFp7K2j0xppVXZ791fFxp2Bg-debug.sql
    ├── demo-0
    │   ├── demo-0-out.csv
    │   │   ├── _SUCCESS
    │   │   ├── part-00000-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
    │   │   └── part-00001-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
    │   └── demo-0.sql
    ├── demo-1
    │   ├── demo-1-out.csv
    │   │   ├── _SUCCESS
    │   │   └── part-00000-b21522da-0e5f-42ba-8197-e475841ba9c3-c000.csv
    │   └── demo-1.sql
    ├── demo-2
    │   ├── demo-2-out.csv
    │   │   ├── _SUCCESS
    │   │   ├── part-00000-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
    │   │   └── part-00001-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
    │   └── demo-2.sql
    ├── demo-3
    │   ├── demo-3-out.csv
    │   │   ├── _SUCCESS
    │   │   └── part-00000-5a48ba0f-d761-4aa5-bdfa-b184ca7948b5-c000.csv
    │   └── demo-3.sql
    -- [SQL Runner Report] --;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
    -- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
    -- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
    -- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
    -- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] SHOW DATABASES;
    -- [SUCCESS][OutputFile: demo-1-out.csv, TimeTaken: 3.85295510292 secs] create database sskrdemo2 location 'dnax://';
    -- [SUCCESS][OutputFile: demo-2-out.csv, TimeTaken: 4.8106200695 secs] use sskrdemo2;
    -- [SUCCESS][OutputFile: demo-3-out.csv , TimeTaken: 1.00737595558 secs] create table patient (first_name string, last_name string, age int, glucose int, temperature int, dob string, temp_metric string) stored as parquet;
    -- [SQL Runner Report] --;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
    -- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
    -- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
    -- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
    -- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] select * from ${srcdb}.${patient_table};
    -- [FAIL] SQL ERROR while below command [ Reason: u"\nextraneous input '`' expecting <EOF>(line 1, pos 45)\n\n== SQL ==\ndrop database if exists sskrtest2011 cascade `\n---------------------------------------------^^^\n"];
    drop database if exists ${dstdb} cascade `;
    create database if not exists ${dstdb} location 'dnax://';
    create view ${dstdb}.patient_view as select * from ${srcdb}.patient;
    select * from ${dstdb}.patient_view;
    drop database if exists ${dstdb} cascade `;

    Defining and Managing Cohorts

    Create, filter, and manage patient cohorts using clinical, genomic, and other data fields in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Create comprehensive patient cohorts by filtering your datasets. You can combine, compare, and export your cohorts for further analysis.

    If you'd like to visualize data in your cohorts, see Creating Charts and Dashboards.

    Managing Cohorts

    When you start exploring a dataset, Cohort Browser automatically creates a cohort that includes all patients/samples. You can then by adding filters, and repeat multiple times to create additional cohorts.

    The Cohorts panel gives you an overview of your active cohorts on the dashboard (up to 2) and the recently used cohorts (up to 8) in your current session. These can be temporary unsaved cohorts as well as saved cohorts.

    To change the active cohorts on the dashboard, you need to swap them between the Dashboard and Recent sections:

    1. In Cohorts > Dashboard, click In Dashboard to remove a cohort from the dashboard.

    2. In Cohorts > Recent, click Add to Dashboard next to the cohort you want to add to the dashboard.

    This way you can quickly explore, compare, and iterate across multiple cohorts within a single session.

    hashtag
    Defining Cohort Criteria

    hashtag
    Adding Clinical and Phenotypic Filters

    To apply a filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Clinical, select a data field to filter by.

    3. Click Add Cohort Filter.

    circle-info

    After you apply or edit filters, the participant count updates immediately. However, visualization tiles do not automatically refresh. Click Refresh Visualizations at the top of the dashboard to update all tiles. Click Refresh on individual tiles to update specific charts.

    hashtag
    Adding Assay Filters

    With multi-assay datasets, you can create cohorts by applying filters from multiple assay types and instances.

    When adding filters, you can find assay types under the Assays tab. This allows you to create cohorts that combine different types of data. For example, you can filter patients based on both clinical characteristics and germline variants, merge somatic mutation criteria with gene expression levels, or build cohorts that span multiple assays of the same type.

    To learn more about filtering by specific assay types, see Analyzing Germline Variants, Analyzing Somatic Variants, and Analyzing Gene Expression Data.

    When working with an omics dataset that includes multiple assays, such as a germline dataset with both WES and WGS assays, you can:

    • Choose specific assays for filtering.

    • Apply different filters per assay.

    • Create separate cohorts for different assays of the same type and compare results.

    hashtag
    Filter Limits by Assay Type

    The maximum number of filters allowed varies by assay type and is shared across all instances of that type:

    • Germline variant assays: 1 filter maximum

    • Somatic variant assays: Up to 10 filter criteria

    • Gene expression assays: Up to 10 filter criteria

    hashtag
    Creating Filter Groups

    If you add multiple filters from the same category, such as Patient or Sample, they automatically form a filter group.

    By default, filters within a filter group are joined by the logical operator 'AND', meaning that all filters in the group must be satisfied for a record to be included in the cohort. You can change the logical operator used within the group to 'OR' by clicking on the operator.
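    The AND/OR behavior of a filter group can be pictured with a small sketch. It only illustrates the logic described above on made-up records; the function name `apply_filter_group` and the sample fields are hypothetical and are not part of the Cohort Browser.

```python
def apply_filter_group(records, filters, operator="AND"):
    """Keep records that satisfy the group's filters under AND (all) or OR (any)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    return [r for r in records if combine(f(r) for f in filters)]


# Hypothetical patient records, for illustration only.
patients = [
    {"id": "P1", "age": 67, "sex": "F"},
    {"id": "P2", "age": 45, "sex": "F"},
    {"id": "P3", "age": 71, "sex": "M"},
]


def over_60(record):
    return record["age"] > 60


def female(record):
    return record["sex"] == "F"


and_ids = [r["id"] for r in apply_filter_group(patients, [over_60, female], "AND")]
or_ids = [r["id"] for r in apply_filter_group(patients, [over_60, female], "OR")]
print(and_ids)  # ['P1']
print(or_ids)   # ['P1', 'P2', 'P3']
```

    Switching the operator from 'AND' to 'OR' widens the cohort, because a record then needs to satisfy only one filter in the group.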

    hashtag
    Joining Multiple Filters

    Join filters allow you to create cohorts by combining criteria across multiple related data entities within your dataset. This is useful when working with complex datasets that contain interconnected information, such as patient records linked to visits, medications, lab tests, or other clinical data.

    hashtag
    Understanding Data Entities

    An entity is a grouping of data around a unique item, event, or concept.

    In the Cohort Browser, an entity can refer either to a data model object, such as patient or visit, or to a specific input parameter in the app.

    Common examples of data entities include:

    • Patient: Demographics, medical history, baseline characteristics

    • Visit: Hospital admissions, appointments, encounters

    • Medication: Prescriptions, dosages, administration records

    hashtag
    Creating Join Filters

    To create join filters that span multiple data entities:

    1. Start a new join filter: On the cohort panel, click Add Filter or, on a chart tile, click Cohort Filters > Add Cohort Filter.

    2. Select secondary entity: Choose data fields from a secondary entity (different from your primary entity) to create the join relationship.

    3. Add criteria to existing joins: To expand an existing join filter, click Add additional criteria on the row of the chosen filter.

    hashtag
    Working with Logical Operators

    Join filters support both AND and OR logical operators to control how criteria are combined:

    • AND logic: All specified criteria must be met

    • OR logic: Any of the specified criteria can be met

    Key rules for logical operators:

    • Click on the operator buttons to switch between the AND logic and OR logic.

    • For a specific level of join filtering, joins are either all AND or all OR.

    • When using OR for join filters, the existence condition applies first: "where exists, join 1 OR join 2".

    hashtag
    Building Complex Join Structures

    As your filtering needs become more sophisticated, you can create multi-layered join structures:

    • Add criteria to branches: Further define secondary entities by adding additional criteria to existing join branches

    • Create nested joins: Add more layers of join filters that derive from the current branch

    • Automatic field filtering: The field selector automatically hides fields that are ineligible based on the current join structure

    hashtag
    Practical Examples

    The following examples show how join filters work in practice:

    1. First Example Cohort - Separate Conditions: This cohort identifies all patients with a "high" or "medium" risk level who meet both of these conditions:

      • Have a first hospital visit (visit instance = 1)

      • Have had a "nasal swab" lab test at any point (not necessarily during the first visit)
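    The difference between separate and connected join conditions can be sketched on toy data. The first result mirrors the example above: a first visit exists and a nasal swab exists at any point. The second is a stricter variant in which the nasal swab must belong to the first visit itself. All identifiers and values are invented for illustration.

```python
# Hypothetical toy data: risk level per patient, visits, and lab tests.
patients = {"P1": "high", "P2": "medium", "P3": "low", "P4": "high"}
visits = [  # (patient_id, visit_instance)
    ("P1", 1), ("P1", 2), ("P2", 1), ("P3", 1), ("P4", 1),
]
lab_tests = [  # (patient_id, visit_instance, test_name)
    ("P1", 2, "nasal swab"),
    ("P2", 1, "nasal swab"),
    ("P3", 1, "nasal swab"),
    ("P4", 1, "blood panel"),
]

at_risk = {p for p, risk in patients.items() if risk in ("high", "medium")}

# Separate conditions: a first visit exists AND a nasal swab exists at any visit.
has_first_visit = {p for p, instance in visits if instance == 1}
has_swab_anytime = {p for p, _, test in lab_tests if test == "nasal swab"}
separate = at_risk & has_first_visit & has_swab_anytime

# Connected conditions: the nasal swab must have been taken during the first visit.
swab_at_first_visit = {
    p for p, instance, test in lab_tests
    if instance == 1 and test == "nasal swab"
}
connected = at_risk & swab_at_first_visit

print(sorted(separate))   # ['P1', 'P2']
print(sorted(connected))  # ['P2']
```

    Patient P1 qualifies only under the separate conditions, because the nasal swab was taken at the second visit.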

    hashtag
    Saving Cohorts

    You can save your cohort selection to a project as a cohort record by clicking Save Cohort in the top-right corner of the cohort panel.

    Cohorts are saved with their applied filters, as well as the latest visualizations and dashboard layout. Like other dataset objects, you can find your saved cohorts under the Manage tab in your project.

    To open a cohort, double-click it or click Explore Data.

    circle-check

    Need to use your cohorts with a different dataset? If you want to apply your cohort definitions to a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your saved cohorts to a new target dataset.

    hashtag
    Exporting Data from Cohorts

    For each cohort, you can export a list of main entity IDs in your current cohort selection as a CSV file by clicking Export sample IDs.

    hashtag
    Data Preview

    On the Data Preview tab, you can export tabular information as record IDs or a CSV file. Select multiple table rows to see export options in the top-right corner. Exports include only the fields displayed in the Data Preview tab.

    The Data Preview supports up to 30 columns per tab. Tables with 30-200 columns show column names only. In such cases, you can save cohorts but data is not queried. Tables with over 200 columns are not supported.

    You can view up to 30,000 records in the Data Preview. If your cohort exceeds this size, the table may not display all data. For larger exports, use the Table Exporter app.

    circle-info

    If your view contains more than one table, such as a participants table and a hospital records table, exporting to CSV or TSV generates a separate file for each table.

    hashtag
    Download Restrictions

    The Cohort Browser follows your project's download policy restrictions. When every copy of your dataset exists in projects with restricted download policies, downloads are blocked. However, if at least one copy exists in a project that allows downloads, then downloads are permitted.

    Downloads are blocked if the database storing your dataset has restricted download permissions, preventing downloads from any Cohort Browser view of that dataset regardless of which project contains the cohort or dashboard.

    Downloads are also blocked if the specific cohort or dashboard you're viewing has restricted download permissions, regardless of the underlying dataset permissions.
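    One way to read the three rules above is as a single decision function. This is an interpretation rather than the Platform's implementation: it assumes any one of the three restrictions is enough to block a download, and the function and parameter names are hypothetical.

```python
def downloads_allowed(project_copies_restricted, database_restricted, view_restricted):
    """
    Decide whether a download is permitted under the rules described above.

    project_copies_restricted: one bool per project holding a copy of the
        dataset; True means that project restricts downloads.
    database_restricted: True if the database storing the dataset restricts downloads.
    view_restricted: True if the cohort or dashboard being viewed restricts downloads.
    """
    if database_restricted or view_restricted:
        return False
    # Project policy blocks downloads only when every copy is in a restricted project.
    return not all(project_copies_restricted)


print(downloads_allowed([True, True], False, False))   # False
print(downloads_allowed([True, False], False, False))  # True
print(downloads_allowed([False], True, False))         # False
print(downloads_allowed([False], False, True))         # False
```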

    hashtag
    Combining Cohorts

    You can create complex cohorts by combining existing cohorts from the same dataset.

    • Near the cohort name, click + > Combine Cohorts.

    • In the Cohorts panel, click Combine Cohorts.

    You can also create a combined cohort based on the cohorts already being compared.

    The Cohort Browser supports the following combination logic:

    Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.

    circle-info

    Cohorts already combined cannot be combined a second time.

    hashtag
    Comparing Cohorts

    You can compare two cohorts from the same dataset by adding both cohorts into the Cohort Browser.

    To compare cohorts, click + next to the cohort name. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.

    When comparing cohorts:

    • All visualizations are converted to show data from both cohorts.

    • You can continue to edit both cohorts and visualize the results dynamically.

    You can compare a cohort with its complement in the dataset by selecting Compare / Combine Cohorts > Not In …. Similar to combining cohorts, you first need to save your current cohort before creating its not-in counterpart.

    Cohorts created using Not In cannot be used for further creation of combined or not-in cohorts. "Not In" cohorts are linked to the cohort they are originally based on. Once a not-in cohort is created, further changes to the original cohort definition are not reflected.
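    The combination logic and the Not In comparison correspond to ordinary set operations on participant IDs. The sketch below uses Python sets with made-up IDs to show what each operation selects; it does not call the Cohort Browser.

```python
# Hypothetical participant IDs, for illustration only.
dataset = {"P1", "P2", "P3", "P4", "P5", "P6"}
a = {"P1", "P2", "P3"}
b = {"P2", "P3", "P4"}
c = {"P3", "P5"}

intersection = a & b & c   # members present in ALL selected cohorts
union = a | b | c          # members present in ANY selected cohort
subtraction = a - b        # members only in the first cohort, not the second
unique = a ^ b             # members in exactly one of two cohorts: (A - B) ∪ (B - A)
not_in_a = dataset - a     # members of the dataset that are not in cohort A

print(sorted(intersection))  # ['P3']
print(sorted(union))         # ['P1', 'P2', 'P3', 'P4', 'P5']
print(sorted(subtraction))   # ['P1']
print(sorted(unique))        # ['P1', 'P4']
print(sorted(not_in_a))      # ['P4', 'P5', 'P6']
```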

    hashtag
    Creating Cohorts via CLI

    The dx create_cohort command generates a new Cohort object on the Platform using an existing Dataset or Cohort object and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object.

    When the input is a CohortBrowser typed record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined in a way such that the resulting record is an intersection of the IDs present in the original input and the IDs passed through CLI.
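    The intersection behavior for a CohortBrowser input can be pictured as follows. This is only a model of the resulting ID set, using invented sample IDs and a hypothetical helper name; it does not invoke the CLI, and how the command treats IDs that are absent from the dataset is not covered here.

```python
def resulting_ids(existing_cohort_ids, supplied_ids):
    """Model the output cohort as the intersection of the two ID collections."""
    return set(existing_cohort_ids) & set(supplied_ids)


# Hypothetical IDs: the input cohort already filters down to three samples.
existing = {"sample_1", "sample_2", "sample_3"}
supplied = ["sample_2", "sample_3", "sample_9"]

result = resulting_ids(existing, supplied)
print(sorted(result))  # ['sample_2', 'sample_3']
```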

    For additional details, see the create_cohort command reference and the example notebooks in the public GitHub repository, DNAnexus/OpenBio.

    Describing Data Objects

    You can describe objects (files, app(let)s, and workflows) on the DNAnexus Platform using the command dx describe.

    hashtag
    Describing an Object by Name

    Objects can be described using their DNAnexus Platform name via the command line interface (CLI) using a path.

    hashtag
    Describe an Object With a Relative Path

    Objects can be described relative to the user's current directory on the DNAnexus Platform. In the following example, the indexed reference genome file human_g1k_v37.bwa-index.tar.gz is described.

    The entire path is enclosed in quotes because the folder name Original files contains whitespace. Instead of using quotes, you can escape special characters with a backslash (\): dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.
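    When a dx describe command is assembled from a script, the same quoting concern applies. The sketch below uses Python's standard shlex module to build a shell-safe command string for the path discussed above; it only constructs the string and does not run dx.

```python
import shlex

# Path from the example above; the folder name contains a space.
path = "Original files/human_g1k_v37.bwa-index.tar.gz"

# shlex.join quotes each argument that needs it (Python 3.8+).
command = shlex.join(["dx", "describe", path])
print(command)  # dx describe 'Original files/human_g1k_v37.bwa-index.tar.gz'
```

    If the command is run through subprocess with an argument list instead of a shell string, no quoting is needed because each list element is passed as one argument.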

    hashtag
    Describe an Object in a Different Project Using an Absolute Path

    Objects can be described using an absolute path. This allows you to describe objects outside the current project context. In the following example, dx select selects the project "My Research Project" and dx describe describes the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.

    hashtag
    Describe an Object Using Object ID

    Objects can be described using a unique object ID.

    This example describes the workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.

    Because workflows can include many app(let)s, inputs/outputs, and default parameters, the dx describe output can seem overwhelming.

    hashtag
    Manipulating Outputs

    The output from a dx describe command can be used for multiple purposes. The optional argument --json converts the output from dx describe into JSON format for advanced scripting and command line use.

    In this example, the publicly available workflow object "Exome Analysis Workflow" is described and the output is returned in JSON format.

    Parse, process, and query the JSON output using jq. Below, the dx describe --json output is processed to generate a list of all stages in the exome analysis pipeline.

    To get the "executable" value of each stage present in the "stages" array value of the dx describe output above, use the following command:
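    The same extraction can be done in Python once the JSON has been captured. The sketch below parses a heavily trimmed, hypothetical stand-in for the dx describe --json output of a workflow and collects the "executable" value of each entry in "stages". The IDs are placeholders, not real Platform objects.

```python
import json

# Trimmed, hypothetical stand-in for `dx describe <workflow> --json` output.
describe_json = """
{
  "id": "workflow-xxxx",
  "name": "Exome Analysis Workflow",
  "stages": [
    {"id": "stage-0", "executable": "app-aaaa"},
    {"id": "stage-1", "executable": "applet-bbbb"}
  ]
}
"""

description = json.loads(describe_json)

# Collect the executable behind every stage, in pipeline order.
executables = [stage["executable"] for stage in description["stages"]]
print(executables)  # ['app-aaaa', 'applet-bbbb']
```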

    hashtag
    General Response Fields Overview

    Field name | Objects | Description

    Using Spark

    Connect with Spark for database sharing, big data analytics, and rich visualizations.

    circle-info

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is straightforward: platform access levels map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.

    hashtag
    Spark Applications

    You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs leverage the features of normal jobs. You have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You also have access to logs from workers and can monitor the job in the Spark UI.

    hashtag
    Visualization

    With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts quickly without the need to write complex SQL queries by hand.

    hashtag
    Databases

    A database is a data object on the Platform. A database object is stored in a project.

    hashtag
    Database Sharing

    Databases can be shared with other users or organizations through project sharing. The project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is impossible, the database can be relocated to another project with a different set of collaborators.

    hashtag
    Database and Project Policies

    Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:

    • The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").

    • If a non-PHI project is provided as a project context, only databases from non-PHI projects are available for retrieving data.

    • If a PHI project is provided as a project context, only databases from PHI projects are available to add new data.

    circle-info

    A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.

    hashtag
    Database Access

    As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, data, and database object in the project.

    The following tables list the supported actions on a database and database object, with the lowest access level necessary for an open and a closed database.

    Spark SQL Function | Open Database | Closed Database
    Data Object Action | Open Database | Closed Database

    (*) If a project is protected, then ADMINISTER access is required.

    hashtag
    Database Naming Conventions

    The system handles database names in two ways:

    • User-provided name: Your database name is converted to lowercase and stored as the databaseName attribute.

    • System-generated unique name: A unique identifier is created by combining your lowercase database name with the database object ID (also converted to lowercase with hyphens changed to underscores) separated by two underscores. This is stored as the uniqueDatabaseName attribute.
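    The two naming rules above can be written as a short function. This is a sketch of the rule as described, not Platform code; the function name is hypothetical and the database ID in the example is a made-up placeholder.

```python
def unique_database_name(database_name, database_id):
    """
    Build the unique name as described above: the lowercase database name,
    two underscores, then the database object ID in lowercase with hyphens
    replaced by underscores.
    """
    normalized_id = database_id.lower().replace("-", "_")
    return f"{database_name.lower()}__{normalized_id}"


# Hypothetical values for illustration.
name = unique_database_name("My_Variants", "database-ABC123xyz")
print(name)  # my_variants__database_abc123xyz
```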

    When a database is created using the following SQL statement and a user-generated database name (referenced below as db_name):

    The platform database object, database-xxxx, is created with all lowercase characters. However, when looking up a database using dxpy, the Python module supported by the DNAnexus SDK, dx-toolkit, the following case-sensitive command returns a database ID based on the user-generated database name, assigned here to the variable db_name:

    With that in mind, either use lowercase characters in your db_name assignment or normalize the user-generated database name with a call such as .lower():

    Spark Cluster-Enabled JupyterLab

    Learn to use the JupyterLab Spark Cluster app.

    circle-info

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    In Edit Filter, select operators and enter the values to filter by.
  • Click Apply Filter.

  • Lab Test: Results, procedures, sample information

    Second Example Cohort - Connected Conditions: This cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed specifically during their first visit, creating a more restrictive temporal relationship between the visit and lab test.

    | Logic | Description | Number of Cohorts Supported |
    | --- | --- | --- |
    | Intersection | Select members that are present in ALL selected cohorts. Example: the intersection of cohorts A, B, and C would be A ∩ B ∩ C. | Up to 5 cohorts |
    | Union | Select members that are present in ANY of the selected cohorts. Example: the union of cohorts A, B, and C would be A ∪ B ∪ C. | Up to 5 cohorts |
    | Subtraction | Select members that are present only in the first selected cohort and not in the second. Example: the subtraction of cohorts A and B would be A - B. | 2 cohorts |
    | Unique | Select members that appear in exactly one of the selected cohorts. Example: the unique of cohorts A and B would be (A - B) ∪ (B - A). | 2 cohorts |

    | Logic | Description |
    | --- | --- |
    | Not In | Select patients that are present in the dataset, but not in the current cohort. Example: In dataset U, the result of "Not In" A would be U - A. |

    define your cohort criteria
    saved cohorts
    compare
    Analyzing Germline Variants
    Analyzing Somatic Variants
    Analyzing Gene Expression Data
    Table Exporter
    project
    cohort record
    Rebase Cohorts And Dashboards
    Table Exporter app
    download policy restrictions
    compared
    create_cohort
    create_cohort command reference
    DNAnexus/OpenBio
    You can toggle the Cohorts panel by clicking Manage Cohorts
    Cohort size updates after filters are applied
    Adding a filter and toggling 'AND' and 'OR' functionality
    Example of 'OR' and 'AND' filtering.
    Join filters used with two example cohorts
    Save cohort action
    CohortBrowser object in a project
    Exporting a list of sample IDs
    Depending on the combination logic, you can combine up to 5 cohorts
    Combine cohorts from a comparison view
    Inspect combination logic on a combined cohort
    Creating Comparison Using the "Not In" logic

    | Spark SQL Function | Open Database | Closed Database |
    | --- | --- | --- |
    | ALTER DATABASE SET DBPROPERTIES | CONTRIBUTE | N/A |
    | ALTER TABLE RENAME | CONTRIBUTE | N/A |
    | ALTER TABLE DROP PARTITION | CONTRIBUTE (*) | N/A |
    | ALTER TABLE RENAME PARTITION | CONTRIBUTE | N/A |
    | ANALYZE TABLE COMPUTE STATISTICS | UPLOAD | N/A |
    | CACHE TABLE, CLEAR CACHE | N/A | N/A |
    | CREATE DATABASE | UPLOAD | UPLOAD |
    | CREATE FUNCTION | N/A | N/A |
    | CREATE TABLE | UPLOAD | N/A |
    | CREATE VIEW | UPLOAD | UPLOAD |
    | DESCRIBE DATABASE, TABLE, FUNCTION | VIEW | VIEW |
    | DROP DATABASE | CONTRIBUTE (*) | ADMINISTER |
    | DROP FUNCTION | N/A | N/A |
    | DROP TABLE | CONTRIBUTE (*) | N/A |
    | EXPLAIN | VIEW | VIEW |
    | INSERT | UPLOAD | N/A |
    | REFRESH TABLE | VIEW | VIEW |
    | RESET | VIEW | VIEW |
    | SELECT | VIEW | VIEW |
    | SET | VIEW | VIEW |
    | SHOW COLUMNS | VIEW | VIEW |
    | SHOW DATABASES | VIEW | VIEW |
    | SHOW FUNCTIONS | VIEW | VIEW |
    | SHOW PARTITIONS | VIEW | VIEW |
    | SHOW TABLES | VIEW | VIEW |
    | TRUNCATE TABLE | UPLOAD | N/A |
    | UNCACHE TABLE | N/A | N/A |

    | Data Object Action | Open Database | Closed Database |
    | --- | --- | --- |
    | Add Tags | UPLOAD | CONTRIBUTE |
    | Add Types | UPLOAD | N/A |
    | Close | UPLOAD | N/A |
    | Get Details | VIEW | VIEW |
    | Remove | CONTRIBUTE (*) | ADMINISTER |
    | Remove Tags | UPLOAD | CONTRIBUTE |
    | Remove Types | UPLOAD | N/A |
    | Rename | UPLOAD | CONTRIBUTE |
    | Set Details | UPLOAD | N/A |
    | Set Properties | UPLOAD | CONTRIBUTE |
    | Set Visibility | UPLOAD | N/A |

    data object
    database
    relocated
    PHI Data Protection
    Contact DNAnexus Sales
    states
    dxpy

    CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'
    db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']
    db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']

    | Field name | Objects | Description |
    | --- | --- | --- |
    | ID | All | Unique ID assigned to a DNAnexus object. |
    | Class | All | DNAnexus object type. |
    | Project | All | Container where the object is stored. |
    | Folder | All | Objects inside a container (project) can be organized into folders. Objects can only exist in one path within a project. |
    | Name | All | Object name on the platform. |
    |  | All | Status of the object on the platform. |
    | Visibility | All | Whether the file is visible to the user through the platform web interface. |
    | Tags | All | Set of tags associated with an object. Tags are strings used to organize or annotate objects. |
    | Properties | All | Key/value pairs attached to object. |
    |  | All | JSON reference to another object on the platform. Linked objects are copied along with the object if the object is cloned to another project. |
    | Created | All | Date and time object was created. |
    | Created by | All | DNAnexus user who created the object. Contains subfield "via the job" if the object was created by an app or applet. |
    | Last modified | All | Date and time the object was last modified. |
    | Input Spec | App(let)s and Workflows | App(let) or workflow input names and classes. With workflows, the corresponding applet stage ID is also provided. |
    | Output Spec | App(let)s and Workflows | App(let) or workflow output names and classes. With workflows, the corresponding applet stage ID is also provided. |

    dx select
    jq

    hashtag
    Overview

    The JupyterLab Spark Cluster app is a Spark application that runs a fully-managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis from directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.

    Besides the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:

    • Explore the available databases and get an overview of the available datasets

    • Perform analyses and visualizations directly on data available in the database

    • Create databases

    • Submit data analysis jobs to the Spark cluster

    Check the general Overview for an introduction to JupyterLab.

    hashtag
    Running and Using JupyterLab Spark Cluster

    The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus Platform. The References page has additional useful tips for using the environment.

    hashtag
    Instantiating the Spark Context

    After creating your notebook in the project, populate the first cells as shown below. It is good practice to instantiate your Spark context at the beginning of your analyses.
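    A minimal sketch of instantiating the Spark context, assuming PySpark is available in the notebook kernel, as it would be in a Spark cluster-enabled session. The helper name `start_spark` is hypothetical; the body uses standard PySpark calls.

```python
def start_spark():
    """Create (or reuse) the Spark context and a session bound to it."""
    # Imported lazily so this cell can be defined even where PySpark is absent.
    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)
    return sc, spark
```

    In the first cell of a notebook, call `sc, spark = start_spark()` and reuse both objects in later cells.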

    hashtag
    Basic Operations on DNAnexus Databases

    hashtag
    Exploring Existing Databases

    To view any databases to which you have access in your current region and project context, run a cell with the following code:

    A sample output should be:

    You can inspect one of the returned databases by running:

    which should return an output similar to:

    To find a database in your current region that may be in a different project than your current context, run the following code:

    A sample output should be:

    To inspect one of the databases listed in the output, use the unique database name. If you use only the database name, results are limited to the current project. For example:
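    One possible set of queries for the exploration steps above, written as helpers that take an existing Spark session. The helper names are hypothetical, the statements are standard Spark SQL, and the pattern syntax needed to match databases in other projects is not shown because it is not specified here.

```python
def list_databases(spark, pattern=None):
    """List visible databases, optionally restricted by a LIKE pattern."""
    query = "SHOW DATABASES" if pattern is None else f"SHOW DATABASES LIKE '{pattern}'"
    return spark.sql(query)


def list_tables(spark, database):
    """List the tables of one database; pass the unique database name to
    reach a database outside the current project."""
    return spark.sql(f"SHOW TABLES IN {database}")
```

    For example, `list_databases(spark).show(truncate=False)` prints every database name in full.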

    hashtag
    Creating Databases

    Here's an example of how to create and populate your own database:

    You can separate each line of code into different cells to view the outputs iteratively.
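    A sketch of creating and populating a database from a notebook, assuming an existing Spark session. The database name, table name, and columns are invented for illustration; the LOCATION 'dnax://' clause follows the CREATE DATABASE statement used elsewhere in this documentation, and the name is lowercased to match how the Platform stores it.

```python
def create_demo_database(spark, db_name="demo_db"):
    """Create a database, add a small table, and return a query over it."""
    db = db_name.lower()  # database names are stored in lowercase
    statements = [
        f"CREATE DATABASE IF NOT EXISTS {db} LOCATION 'dnax://'",
        # Hypothetical table and columns, for illustration only.
        f"CREATE TABLE IF NOT EXISTS {db}.patients (patient_id STRING, age INT) USING PARQUET",
        f"INSERT INTO {db}.patients VALUES ('P1', 67), ('P2', 45)",
    ]
    for statement in statements:
        spark.sql(statement)
    return spark.sql(f"SELECT * FROM {db}.patients")
```

    Calling `create_demo_database(spark).show()` runs the three statements in order and displays the inserted rows.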

    hashtag
    Using Hail

    Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is included in the JupyterLab Spark Cluster app. It can be used when the app is run with the feature input set to HAIL (the default).

    Initialize the context when beginning to use Hail. It's important to pass the previously started Spark context sc as an argument:
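    A minimal sketch of that initialization, assuming Hail is installed in the session and `sc` is the Spark context created earlier. The wrapper name `start_hail` is hypothetical; `hl.init(sc=...)` is Hail's own entry point.

```python
def start_hail(sc):
    """Initialize Hail on top of an already running Spark context."""
    # Imported lazily so the cell can be defined even where Hail is absent.
    import hail as hl

    hl.init(sc=sc)
    return hl
```

    After `hl = start_hail(sc)`, the returned module can be used for the rest of the notebook.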

    We recommend continuing your exploration of Hail with the GWAS using Hail tutorial. For example:

    hashtag
    Using VEP with Hail

    To use VEP (Ensembl Variant Effect Predictor) with Hail, select "Feature," then "HAIL" when launching Spark Cluster-Enabled JupyterLab via the CLI.

    VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequences, and regulatory regions. This includes the LOFTEE plugin, which is activated when using the configuration file below.

    Add the following JSON configuration file to your DNAnexus project:

    Once the vep-GRCh38.json file is in your project, you can annotate the Hail MatrixTable (mt) using the following command:

    hashtag
    Behind the Scenes

    The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.

    The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.

    The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:

    The default user in the container is root.

    The option --network host is used when starting Docker to remove the network isolation between the host and the Docker container, which allows the container to bind to the host's network and access Spark's master port directly.

    hashtag
    Accessing AWS S3 Buckets

    S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. The s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.

    hashtag
    Public Bucket Access

    To access public S3 buckets, you do not need S3 credentials. The example below shows how to access the public 1000Genomes bucket in a JupyterLab notebook:
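    A sketch of reading a tab-separated file from a public bucket, assuming an existing Spark session. The helper name is hypothetical, and the object URI in the usage note is an assumed example of a file in the 1000Genomes bucket rather than a path confirmed by this page.

```python
def read_public_tsv(spark, uri):
    """Read a tab-separated file with a header row from a public bucket."""
    # The s3 scheme is aliased to s3a on these clusters, so either works in the URI.
    return spark.read.options(delimiter="\t", header="True").csv(uri)
```

    For example, `read_public_tsv(spark, "s3://1000genomes/20131219.populations.tsv").show(10, False)` would display the first ten rows without truncating columns.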

    When the above is run in a notebook, the following is displayed:

    hashtag
    Private Bucket Access

    To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
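    One common way to supply credentials is through the Hadoop S3A configuration of the running session. This sketch assumes an existing Spark session and uses the standard Hadoop property names fs.s3a.access.key and fs.s3a.secret.key; the helper name is hypothetical, `_jsc` is a private PySpark attribute, and the documented approach for this app may differ.

```python
def configure_private_bucket(spark, access_key, secret_key):
    """Set S3A credentials on the session's Hadoop configuration."""
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)
    return hadoop_conf
```

    Read the keys from a secure source, such as environment variables, instead of writing them into the notebook. After configuration, private objects can be read with the same spark.read calls used for public buckets.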

    Contact DNAnexus Sales

    Using JupyterLab

    Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.

    circle-info

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Jupyter notebooks are a popular way to track the work performed in computational experiments, much as a lab notebook tracks the work done in a wet lab setting. JupyterLab is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. It allows users on the DNAnexus Platform to collaborate on notebooks and extends the standard JupyterLab interface with options for directly accessing a DNAnexus project from the JupyterLab environment.

    hashtag
    Why Use JupyterLab?

    JupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.

    JupyterLab is a versatile application that can be used to:

    • Collaborate on exploratory analysis of data

    • Reproduce and fork work performed in computational analyses

    • Visualize and gain insights into data generated from biological experiments

    The DNAnexus Platform offers two different JupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the framework.

    Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.

    The Spark cluster-enabled app contains all the features found in the general-purpose JupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.

    hashtag
    Version Information

    JupyterLab 2.2 is the default version on the DNAnexus Platform.

    hashtag
    Creating Interactive Notebooks

    A step-by-step guide on how to start with JupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.

    hashtag
    JupyterLab Environments

    Creating a JupyterLab session requires the use of two different environments:

    1. The DNAnexus project (accessible through the web platform and the CLI).

    2. The worker execution environment.

    hashtag
    The Project on the DNAnexus Platform

    You have direct access to the project in which the application is run from the JupyterLab session. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal:

    The project is selected when the JupyterLab app is started and cannot be subsequently changed.

    The DNAnexus file browser shows:

    • Up to 1,000 of your most recently modified files and folders

    • All Jupyter notebooks in the project

    • Databases (Spark-enabled app only, limited to 1,000 most recent)

    The file list refreshes automatically every 10 seconds. You can also refresh manually by clicking the circular arrow icon in the top right corner.

    circle-check

    Need to see more files? Use dx ls in the terminal or access them programmatically through the API.

    hashtag
    Worker Execution Environment

    When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have [DX] prepended to the notebook name in the tab of all opened notebooks.

    The execution environment file browser is accessible from the left sidebar (notice the folder icon at the top) or from the terminal.

    To create Jupyter notebooks in the worker execution environment, use the File menu. These notebooks are stored on the local file system of the JupyterLab execution environment and must be saved to a DNAnexus project to persist. More information about saving appears in the following section.

    hashtag
    Local vs. DNAnexus Notebooks

    hashtag
    DNAnexus Notebooks

    You can create, edit, and save notebooks directly in the DNAnexus project, as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are listed in the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks", and they persist in the DNAnexus project after the JupyterLab instance is terminated.

    DNAnexus notebooks can be recognized by the [DX] that is prepended to their names in the tabs of all opened notebooks.

    DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.

    hashtag
    Local Notebooks

    To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them in the DNAnexus file viewer in the left panel.

    hashtag
    Accessing Data

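    In JupyterLab, you can access input data located in a DNAnexus project in one of the following ways:

    • For reading the input file multiple times or for reading a large fraction of the file in random order:

      • Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from the Jupyter notebook.

    • For scanning the content of the input file once or for reading only a small fraction of the file's content:

      • The project in which the app is running is mounted read-only at the /mnt/project folder. Reading files in /mnt/project fetches their content dynamically from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment but makes more API calls.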
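
    For a single pass over a file, reading through the read-only project mount at /mnt/project avoids a local copy. The sketch below shows that pattern. The helper names mounted_path and count_lines are hypothetical, and the example path is a placeholder.

```python
import posixpath

# Read-only mount of the project the JupyterLab app runs in.
PROJECT_MOUNT = "/mnt/project"


def mounted_path(project_path: str) -> str:
    """Map a project path such as '/data/calls.tsv' to its mounted location."""
    return posixpath.join(PROJECT_MOUNT, project_path.lstrip("/"))


def count_lines(path: str) -> int:
    """Stream a file once, line by line, without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)
```

    In a session, count_lines(mounted_path("/data/calls.tsv")) scans the file once through the mount. For repeated or random-order reads, download the file with dx download first and pass the local path to count_lines instead.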

    hashtag
    Uploading Data

    Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:

    • Run dx upload in a Bash console.

    • Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This uploads the file into the selected DNAnexus folder.
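
    Uploading can also be scripted from a notebook cell. The following is a sketch under the assumption that the dxpy client library is available in the session. persist_to_project and its uploader parameter are hypothetical names, and the project ID is passed explicitly rather than relying on session defaults.

```python
from typing import Callable, Optional


def persist_to_project(
    local_path: str,
    project_id: str,
    folder: str = "/",
    uploader: Optional[Callable[..., object]] = None,
):
    """Upload a file from the execution environment into a project folder."""
    if uploader is None:
        # Assumption: dxpy is installed in the JupyterLab session.
        import dxpy
        uploader = dxpy.upload_local_file
    # wait_on_close=True blocks until the platform has closed the new file.
    return uploader(
        local_path,
        project=project_id,
        folder=folder,
        wait_on_close=True,
    )
```

    With dxpy, the returned object is a file handler, so persist_to_project("analysis.ipynb", "project-xxxx").get_id() gives the ID of the uploaded file.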

    hashtag
    Exporting DNAnexus Notebooks

    Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported. However, you can use dx download to copy the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and then export the downloaded notebook. Exporting a local notebook to certain formats might first require installing additional packages: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
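
    The download-then-export sequence can be sketched as two shell commands. The source does not name an export tool, so the use of jupyter nbconvert here is an assumption, and export_commands is a hypothetical helper that only builds the command strings.

```python
import shlex
from typing import List


def export_commands(notebook_name: str, fmt: str = "html") -> List[str]:
    """Return shell commands that download a notebook and then export it."""
    quoted = shlex.quote(notebook_name)
    return [
        # Copy the DNAnexus notebook into the execution environment.
        f"dx download {quoted}",
        # Assumption: nbconvert performs the export of the local copy.
        f"jupyter nbconvert --to {shlex.quote(fmt)} {quoted}",
    ]
```

    Run the returned commands in the session terminal. Quoting keeps notebook names that contain spaces intact.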

    hashtag
    Non-Interactive Execution of Notebooks

    A command can be executed in the JupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input to the JupyterLab app, along with any additional input files in the in input file array. The provided command runs in the /opt/notebooks/ directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the JupyterLab app.

    The cmd input makes it possible to use the papermill command, which is pre-installed in the JupyterLab environment, to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"

    Here, notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the JupyterLab app page for details.
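
    When many notebooks need to run non-interactively, the dx run invocation can be assembled in Python. This sketch builds the argument list from the cmd and in inputs described above. papermill_job_command is a hypothetical helper, and nothing is submitted until the list is passed to subprocess.

```python
import shlex
from typing import List


def papermill_job_command(notebook: str, output_notebook: str) -> List[str]:
    """Build the dx run arguments for executing a notebook with papermill."""
    cmd = f"papermill {shlex.quote(notebook)} {shlex.quote(output_notebook)}"
    return [
        "dx", "run", "dxjupyterlab",
        f"-icmd={cmd}",       # command executed in /opt/notebooks/
        f"-iin={notebook}",   # input notebook passed through the in array
    ]
```

    To submit the job, call subprocess.run(papermill_job_command("notebook.ipynb", "output_notebook.ipynb"), check=True) from an environment where the dx client is logged in.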

    hashtag
    Collaboration in the Cloud

    Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.

    hashtag
    Notebook Locking During Editing

    If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.

    It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.

    Once the editing user closes the notebook, the lock is released and anybody else with access to the project can open it.

    hashtag
    Notebook Versioning

    Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses are not lost when the JupyterLab session ends.

    circle-exclamation

    If a notebook saved to the project exceeds 20 MB, it may no longer open in JupyterLab and could trigger a "JSON Parse Error." To recover your code, open an earlier version from the .Notebook_archive folder, or download the notebook to your local machine and clear the notebook's outputs using a local Jupyter editor before re-uploading.
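
    If no local Jupyter editor is at hand, outputs can also be cleared with a short script. This is a minimal sketch that assumes the downloaded notebook is valid JSON in the nbformat 4 layout (a top-level "cells" list). clear_outputs is a hypothetical helper name.

```python
import json


def clear_outputs(src_path: str, dst_path: str) -> int:
    """Write a copy of the notebook without outputs; return cells cleared."""
    with open(src_path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    cleared = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no outputs
        if cell.get("outputs"):
            cleared += 1
        cell["outputs"] = []
        cell["execution_count"] = None

    with open(dst_path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
        handle.write("\n")
    return cleared
```

    The code cells themselves are left untouched, so the cleaned copy can be re-uploaded and re-run to regenerate its outputs.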

    hashtag
    Session Timeout Control

    JupyterLab sessions begin with a set duration and shut down automatically at the end of this period. The timeout clock appears in the footer on the right side and can be adjusted using the Update duration button. The session terminates at the set timestamp even if the JupyterLab webpage is closed. Job lengths have an upper limit of 30 days, which cannot be extended.

    A session can be terminated immediately from the top menu (DNAnexus > End Session).

    hashtag
    Environment Snapshots

    It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).

    A JupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to the snapshot file, except for the directories /home/dnanexus and /mnt/, which are not included. This file is then uploaded to the .Notebook_snapshots folder in the project and can be passed as input the next time the app is started.

    circle-info

    If many large files are created locally, the resulting snapshots take a long time to save and load. In general, it is recommended not to snapshot more than 1 GB of locally saved data/packages and rely on downloading larger files as needed.
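
    Before creating a snapshot, it can help to check how much data a working directory holds. The sketch below sums file sizes under a directory and compares the total with the 1 GB guideline. Treating 1 GB as 10**9 bytes is an assumption, and both helper names are hypothetical.

```python
import os

ONE_GB = 10**9  # assumption: decimal gigabyte


def directory_size(root: str) -> int:
    """Return the total size in bytes of regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total


def exceeds_snapshot_guideline(root: str, limit: int = ONE_GB) -> bool:
    """Return True when the data under `root` is larger than `limit` bytes."""
    return directory_size(root) > limit
```

    This measures only the directory you point it at, not the full container state that a snapshot captures, so treat the result as a rough indicator.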

    hashtag
    Snapshots Created in Older Versions of JupyterLab

    Snapshots created with JupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These previous snapshots contain tool versions that may conflict with the newer environment, potentially causing problems.

    hashtag
    Using Previous Snapshots in the Current Version of JupyterLab

    To use a snapshot from a previous version in the current version of JupyterLab, recreate the snapshot as follows:

    1. Create a tarball incorporating all the necessary data files and packages.

    2. Save the tarball in a project.

    3. Launch the current version of JupyterLab.

    4. Import and unpack the tarball file.

    5. Create a snapshot of the JupyterLab environment.
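
    Creating the tarball in the old session and unpacking it in the new one can be done with Python's standard tarfile module. This is a sketch only: pack and unpack are hypothetical helper names, and moving the tarball through the project (for example with dx upload and dx download) happens between the two calls.

```python
import os
import tarfile
from typing import Iterable


def pack(paths: Iterable[str], tarball_path: str) -> str:
    """Bundle files or directories into a gzip-compressed tarball."""
    with tarfile.open(tarball_path, "w:gz") as archive:
        for path in paths:
            # Store each entry under its own name rather than its full path.
            archive.add(path, arcname=os.path.basename(os.path.normpath(path)))
    return tarball_path


def unpack(tarball_path: str, destination: str) -> None:
    """Extract a tarball created by pack() into `destination`."""
    with tarfile.open(tarball_path, "r:gz") as archive:
        archive.extractall(destination)
```

    Only unpack tarballs you created yourself, because extracting an archive from an untrusted source can overwrite files outside the destination on older Python versions.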

    hashtag
    Accessing an Older Snapshot in an Older Version of JupyterLab

    If you don't want to recreate your older snapshot, you can run an older version of JupyterLab and access the snapshot there.

    hashtag
    Viewing Other Files in the Project

    JupyterLab can also display other file types from your project, such as CSV, JSON, and PDF files, images, and scripts, each in a suitable viewer. For example, JSON files are collapsible and navigable, and CSV files are presented in tabular format.

    However, editing and saving any open files from the project other than IPython notebooks results in an error.

    circle-exclamation

    Files larger than 20 MB display only their metadata in the JupyterLab file viewer. To access the full contents of a large file, download it using dx download or the dxpy client library, or use the DNAnexus file browser on the platform.

    hashtag
    Permissions in the JupyterLab Session

    The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.

    When running the JupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to be able to read databases which may be located in different projects and cannot be cloned.

    hashtag
    Running Jobs in the JupyterLab Session

    Use dx run to start new jobs from within a notebook or the terminal. If the billTo for the project where your JupyterLab session runs does not have a license for detached executions, any started jobs run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job uses the JupyterLab session's workspace instead of the specified project. If a subjob fails or terminates on the DNAnexus Platform, the entire job tree—including your interactive JupyterLab session—terminates as well.

    circle-exclamation

    Jobs are limited to a runtime of 30 days. The system automatically terminates jobs running longer than 30 days.

    hashtag
    Environment and Feature Options

    The JupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users see a "403 Permission Forbidden" message under the JupyterLab session's URL.
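
    The session URL follows directly from the job ID. A small sketch, assuming only the URL pattern stated above: session_url is a hypothetical helper, and the job ID check is deliberately loose rather than a statement of the exact ID format.

```python
import re

# Loose sanity check (an assumption): "job-" followed by letters and digits.
JOB_ID_PATTERN = re.compile(r"job-[0-9A-Za-z]+")


def session_url(job_id: str) -> str:
    """Return the browser URL of the JupyterLab session run by `job_id`."""
    if not JOB_ID_PATTERN.fullmatch(job_id):
        raise ValueError(f"not a job ID: {job_id!r}")
    return f"https://{job_id}.dnanexus.cloud"
```

    The URL opens only for the user who launched the job, as described above.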

    circle-info

    On the DNAnexus Platform, the JupyterLab server runs in a Python 3.9.16 environment, in a container running Ubuntu 20.04 as its operating system.

    hashtag
    Feature Options

    When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML.

    • PYTHON_R (default option): Loads the environment with Python3 and R kernel and interpreter.

    • ML: Loads the environment with Python3 and machine learning packages, such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype. It does not contain R.

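    • IMAGE_PROCESSING: Loads the environment with Python3 and image processing packages such as Nipype, FreeSurfer, and FSL. It does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the FreeSurfer in JupyterLab guide.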

    circle-info

    The JupyterLab environment is headless and command-line only. While FSL and FreeSurfer command-line tools are available for batch processing, GUI viewers such as fsleyes and freeview cannot be launched. To visualize results interactively, download the output files to your local machine.

    • STATA: Requires a license to run. See Stata in JupyterLab for more information about running Stata in JupyterLab.

    • MONAI_ML: Loads the environment with Python3 and extends the ML feature. This feature is ideal for medical imaging research involving machine learning model development and testing. It includes medical imaging frameworks designed for AI-powered analysis. For details, see MONAI in JupyterLab.

    circle-check

    For the full list of pre-installed packages, see the JupyterLab in-product documentation. This list includes details on feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML features.

    hashtag
    Installing Additional Packages

    Additional packages can be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.

    hashtag
    JupyterLab Documentation

    For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.

    hashtag
    Next Steps

    • Create your first notebooks by following the instructions in the Quickstart guide.

    • See the JupyterLab Reference guide for tips and info on the most useful JupyterLab features.

    Organizations

    Learn about organizations, which associate users, projects, and resources with one another, enabling fluid collaboration, and simplifying the management of access, sharing, and billing.

    circle-info

    This functionality is also available via command line interface (CLI) tools. You may find it easier to use the CLI tools to perform some actions, such as inviting multiple users or exporting information into a machine-readable format.
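
    As an illustration of scripting a bulk action, the sketch below builds one dx add member command per user. It assumes the dx add member subcommand and its --level option behave as in recent versions of the dx client, so check dx add member --help before relying on it. add_member_commands is a hypothetical helper that only constructs argument lists.

```python
from typing import Iterable, List


def add_member_commands(
    org_id: str, usernames: Iterable[str], level: str = "MEMBER"
) -> List[List[str]]:
    """Return one `dx add member` argument list per user."""
    if level not in ("ADMIN", "MEMBER"):
        raise ValueError("level must be ADMIN or MEMBER")
    return [
        ["dx", "add", "member", org_id, name, "--level", level]
        for name in usernames
    ]
```

    Each list can be passed to subprocess.run(..., check=True) from an environment where an org admin is logged in to the dx client.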

    hashtag

    $ dx describe "Original files/human_g1k_v37.bwa-index.tar.gz"
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder            /Original files
    Name              human_g1k_v37.bwa-index.tar.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              3.21 GB
    $ dx select "My Research Project"
    $ dx describe Reference\ Genome\ Files:H.\ Sapiens\ -\ GRCh37\ -\ b37\ \(1000\ Genomes\ Phase\ I\)/human_g1k_v37.fa.gz
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder            /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)
    Name              human_g1k_v37.fa.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              810.45 MB
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Result 1:
    ID                  workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Class               workflow
    Project             project-BQfgzV80bZ46kf6pBGy00J38
    Folder              /
    Name                Exome Analysis Workflow
    ....
    Stage 0             bwa_mem_fastq_read_mapper
      Executable        app-bwa_mem_fastq_read_mapper/2.0.1
    Stage 1             fastqc
      Executable        app-fastqc/3.0.1
    Stage 2             gatk4_bqsr
      Executable        app-gatk4_bqsr_parallel/2.0.1
    Stage 3             gatk4_haplotypecaller
      Executable        app-gatk4_haplotypecaller_parallel/2.0.1
    Stage 4             gatk4_genotypegvcfs
      Executable        app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json
      {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "name": "Exome Analysis Workflow",
        "inputSpec": [
          {
            "name": "bwa_mem_fastq_read_mapper.reads_fastqgzs",
            "class": "array:file",
            "help": "An array of files, in gzipped FASTQ format, with the first read mates to be mapped.",
            "patterns": [ "*.fq.gz", "*.fastq.gz" ],
            ...
          },
          ...
        ],
        "stages": [
          {
            "id": "bwa_mem_fastq_read_mapper",
            "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
            "input": {
              "genomeindex_targz": {
                "$dnanexus_link": {
                  "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
                  "id": "file-FFJPKp0034KY8f20F6V9yYkk"
                }
              }
            },
            ...
          },
          {
            "id": "fastqc",
            "executable": "app-fastqc/3.0.1",
            ...
          }
          ...
        ]
      }
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq .stages
    [{
        "id": "bwa_mem_fastq_read_mapper",
        "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
      ...
      }, {
        "id": "fastqc",
        "executable": "app-fastqc/3.0.1",
      ...
      }, {
        "id": "gatk4_bqsr",
        "executable": "app-gatk4_bqsr_parallel/2.0.1",
      ...
      }
      ...
    }]
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq '.stages | map(.executable) | .[]'
      "app-bwa_mem_fastq_read_mapper/2.0.1"
      "app-fastqc/3.0.1"
      "app-gatk4_bqsr_parallel/2.0.1"
      "app-gatk4_haplotypecaller_parallel/2.0.1"
      "app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0"
    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    spark.sql("show databases").show(truncate=False)
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    +------------------------------------+-----------+-----------+
    |namespace                           |tableName  |isTemporary|
    +------------------------------------+-----------+-----------+
    |database_xxxx__brca_pheno           |cna        |false      |
    |database_xxxx__brca_pheno           |methylation|false      |
    |database_xxxx__brca_pheno           |mrna       |false      |
    |database_xxxx__brca_pheno           |mutations  |false      |
    |database_xxxx__brca_pheno           |patient    |false      |
    |database_xxxx__brca_pheno           |sample     |false      |
    +------------------------------------+-----------+-----------+
    show databases like "<project_id_pattern>:<database_name_pattern>";
    show databases like "project-*:<database_name>";
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    # Create a database
    my_database = "my_database"
    spark.sql("create database " + my_database + " location 'dnax://'")
    spark.sql("create table " + my_database + ".foo (k string, v string) using parquet")
    spark.sql("insert into table " + my_database + ".foo values ('1', '2')")
    spark.sql("select * from " + my_database + ".foo")
    import hail as hl
    hl.init(sc=sc)
    # Download example data from 1k genomes project and inspect the matrix table
    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
    mt = hl.read_matrix_table('data/1kg.mt')
    mt.rows().select().show(5)
    vep-GRCh38.json
    {
      "command": [
        "docker",
        "run",
        "-i",
        "-v",
        "/cluster/vep:/root/.vep",
        "dnanexus/dxjupyterlab-vep",
        "./vep",
        "--format",
        "vcf",
        "__OUTPUT_FORMAT_FLAG__",
        "--everything",
        "--allele_number",
        "--no_stats",
        "--cache",
        "--offline",
        "--minimal",
        "--assembly",
        "GRCh38",
        "-o",
        "STDOUT",
        "--check_existing",
        "--dir_cache",
        "/root/.vep/",
        "--dir_plugins",
        "/root/.vep/Plugins/loftee",
        "--fasta",
        "/root/.vep/homo_sapiens/109_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
        "--plugin",
        "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"
      ],
      "env": {
        "PERL5LIB": "/root/.vep/Plugins"
      },
      "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
    }
    # Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
    # as well as VEP and LoF resources that are pre-installed on every Spark node when
    # HAIL-VEP feature is selected.
    annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
    source /home/dnanexus/environment
    source /cluster/dx-cluster.environment
    #read csv from public bucket
    df = spark.read.options(delimiter='\t', header='True', inferSchema='True').csv("s3://1000genomes/20131219.populations.tsv")
    df.select(df.columns[:4]).show(10, False)
    #access private data in S3 by first unsetting the default credentials provider
    sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', '')
    
    # replace "redacted" with your keys
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'redacted')
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'redacted')
    df=spark.read.csv("s3a://your_private_bucket/your_path_to_csv")
    df.select(df.columns[:5]).show(10, False)
    Create figures and tables for scientific publications
  • Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows

  • Test and train machine/deep learning models

  • Interactively run commands on a terminal

  • For scanning the content of the input file once or for reading only a small fraction of file's content:
    • A project in which the app is running is mounted in a read-only fashion at /mnt/project folder. Reading the content of the files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment, but uses more API calls to fetch the content.

    Import and unpack the tarball file.
  • Create a snapshot of the JupyterLab environment.

  • : Loads the environment with Python3 and Image Processing packages such as Nipype, FreeSurfer and FSL but it does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the
    .
    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
    What Is an Org?

    An organization (or "org") is a DNAnexus entity used to manage a group of users. Use orgs to group users, projects, and other resources together, in a way that models real-world collaborative structures.

    In its simplest form, an org can be thought of as a group of users on the same project. An org is an efficient way to share projects and data with multiple users and, if necessary, to revoke access.

    Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting DNAnexus Sales.

    Orgs are referenced on the DNAnexus Platform by a unique org ID, such as org-dnanexus. Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.

    hashtag
    Org Membership Levels

    Users may have one of two membership levels in an org:

    • ADMIN

    • MEMBER

    An ADMIN-level user is granted all possible access in the org and may perform org administrative functions, such as adding or removing users and modifying org policies. A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses and has no administrative power in the org.

    hashtag
    Members

    A MEMBER-level user can be granted any subset of the following org accesses. These accesses determine which actions the user can perform in the org.

    Access | Description | Options
    Billable activities access | If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org. | [Allowed] or [Not Allowed]
    Shared apps access | If allowed, the org member can view and run apps in which the org has been added as an "authorized user". | [Allowed] or [Not Allowed]

    These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.
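
    The relationship between membership level and these accesses can be modeled in a few lines. This is an illustrative sketch, not the Platform's data model: the class and attribute names are hypothetical, and it encodes only the rules stated in this section (admins hold every access, members hold what they are granted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgMembership:
    """One user's membership in an org (hypothetical illustration)."""
    level: str                         # "ADMIN" or "MEMBER"
    billable_activities: bool = False  # "Billable activities access"
    shared_apps: bool = False          # "Shared apps access"

    def can_create_billed_project(self) -> bool:
        # Admins are granted all possible access in the org.
        return self.level == "ADMIN" or self.billable_activities

    def can_run_shared_apps(self) -> bool:
        return self.level == "ADMIN" or self.shared_apps
```

    For example, OrgMembership("MEMBER", shared_apps=True) can run apps shared with the org but cannot create projects billed to it.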

    hashtag
    Admins

    Org admins are granted all possible access in the org. Specifically, org admins receive the following accesses:

    Access | Level
    Billable activities access | Allowed
    Shared apps access | Allowed
    Shared projects access | ADMINISTER

    Org admins also have the following special privileges:

    hashtag
    Viewing Metadata for All Org Projects

    Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin can see all projects billed to the org, including Project-P. Org admins can also invite themselves to Project-P at any time to get access to objects and jobs in the project.

    hashtag
    Becoming a Developer for All Org Apps

    Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A. However, any org admins may add themselves as developers at any time.

    hashtag
    Examples of Using Orgs

    hashtag
    Org Structure Diagram

    In the diagram below, there are 3 examples of how organizations can be structured.

    hashtag
    ORG-1

    The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. One admin (user A) manages ORG-1.

    hashtag
    ORG-2 and ORG-3

    The second example, ORG-2 and ORG-3, demonstrates a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.

    In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.

    ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).

    In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.

    hashtag
    Example 1: Creating an Org for Sharing Data

    You can create a non-billable org as an alias for a group of users. For example, suppose you have a group of users who all need access to a shared dataset. You can create an org that represents all the users who need access to the dataset, for example, an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org have at least VIEW "shared projects access" and "shared apps access", so they all have permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, after which they no longer have access to any projects or apps shared with org-dataset_access.

    hashtag
    Example 2: Only Admins can Create Projects

    You can contact DNAnexus Sales to create a billable org where only one member, the org admin, can create new org projects. All other org members are not granted "billable activities access" and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" level (VIEW, UPLOAD, CONTRIBUTE, or ADMINISTER) and share every org project with the org with ADMINISTER access. Each member's permissions in those projects are then capped by their own "shared projects access" level.

    For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group are only given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you need to add them to the org as a new member and assign them the appropriate permission levels.

    This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.
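
    The capping rule in this example can be sketched in a few lines of Python. This is only an illustration, not Platform code: the effective_access helper and the role names are hypothetical, and the ordering of levels is taken from the list above (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER).

```python
# Access levels in increasing order, as listed in the text above.
LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]


def effective_access(org_access_to_project, member_shared_projects_access):
    """Return the lower of the two levels (hypothetical helper, for illustration)."""
    return min(org_access_to_project, member_shared_projects_access, key=LEVELS.index)


# Every org project is shared with the org at ADMINISTER, so each member's
# own "shared projects access" level decides what they can do.
roles = {"bioinformatician": "CONTRIBUTE", "technician": "UPLOAD", "analyst": "VIEW"}
effective = {role: effective_access("ADMINISTER", level) for role, level in roles.items()}
print(effective)
```

    Because the org itself holds ADMINISTER on each project, changing a member's level in the org is enough to change what they can do in every shared project.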

    hashtag
    Example 3: Shared Billing Account

    You can contact DNAnexus Sales to create a billable org where users work independently and bill their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources, such as incoming samples or reference datasets.

    In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.

    The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked by removing them from the org.

    hashtag
    Other Cases

    In general, many different schemas can be applied to orgs, because they were designed to support many real-life collaborative structures. If you have a type of collaboration you would like to support, contact DNAnexus Support for more information about how orgs can work for you.

    hashtag
    Managing Your Orgs

    If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.

    The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.

    Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.

    hashtag
    Viewing and Updating Org Information

    You can find the org overview on the Settings tab. From here, you can:

    • View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).

    • View the organization ID, the unique ID used to reference a particular org on the CLI. An example org ID would be org-demo_org.

    • View the number of org members, org projects, and org apps.

    • View the list of organization admins.

    hashtag
    Managing Org Members

    Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.

    From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.

    hashtag
    Inviting a New Member

    To add an existing DNAnexus user to your org, use the + Invite New Member button on the org's Members tab. This opens a screen where you can enter the user's username, such as smithj, or user ID, such as user-smithj. Then you can configure the user's access level in the org.

    If you add a member to the org with billable activities access set to billing allowed, they have the ability to create new projects billed to the org.

    However, adding the member does not change their default billing account. If the user wishes to use the org as their default billing account, they must set their own default billing account.

    If the member has any pre-existing projects that are not billed to the org, they must transfer each such project to the org if they wish to have it billed to the org.

    The user receives an email notification informing them that they have been added to the organization.

    hashtag
    Creating New DNAnexus Accounts

    Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user then receives an email with instructions to activate their account and set their password.

    circle-info

    For information on a license that enables account provisioning, contact DNAnexus Sales.

    If this feature has already been turned on for an org you administer, you see an option to Create New User when you go to invite a new member.

    Here you can specify a username, such as alice or smithj, the new user's name, and their email address. The system automatically creates a new user account for the given email address and adds them as a member in the org.

    If you create a new user and set their Billable Activities Access to Billing Allowed, consider setting the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.

    hashtag
    Editing Member Access

    From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.

    When you edit multiple members, you have the option of changing only one access setting while leaving the others unchanged.

    hashtag
    Removing Members

    From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.

    Removing a member revokes the user's access to all projects and apps billed to or shared with the org.

    hashtag
    Org Projects

    The org's Projects tab shows the list of all projects billed to the org. This list includes all projects in which you have VIEW or higher permissions, as well as projects that are billed to the org in which you do not have permissions (not a Member of).

    You can view all project metadata, such as the list of members, data usage, and creation date. You can also view other optional columns such as project creator. To enable the optional columns, select the column from the dropdown menu to the right of the column names.

    hashtag
    Granting Admin Access to Org Projects

    Org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you are still able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.

    You can also grant yourself ADMINISTER permissions if you are a member of a project billed to your org but you only have VIEW, UPLOAD, or CONTRIBUTE permissions.

    hashtag
    Org Billing

    hashtag
    Accessing Org Billing Information

    To access your org's billing information:

    1. In the main menu, click Orgs > All Orgs.

    2. Select an organization you want to view.

    3. Select the Billing tab to view billing information.

    hashtag
    Setting Up or Updating Billing Information for an Org

    To set up or update the billing information for an org you administer, contact the DNAnexus Billing team.

    Setting up billing for an organization designates someone to receive and pay DNAnexus invoices, including usage by organization members. The billing contact can be you, someone from your finance department, or another designated person.

    When you click Confirm Billing, DNAnexus sends an email to the designated billing contact requesting confirmation of their responsibility for receiving and paying invoices. The organization's billing contact information does not update until DNAnexus receives this confirmation.

    hashtag
    Setting and Modifying an Org Spending Limit

    The org spending limit is the total in outstanding usage charges that can be incurred by projects linked to an org.

    If you are an org admin, you can set or modify this spending limit:

    1. In the main menu, click Orgs > All Orgs.

    2. Select the org for which you'd like to set or modify a spending limit.

    3. In the org details, select the Billing tab.

    4. In Summary, click Increase Spending Limit to request increasing the limit via DNAnexus Support.

    Doing this only submits your request.

    Before approving your request, DNAnexus Support may follow up with you via email with questions about the change.

    hashtag
    Viewing Estimated Charges

    The Usage Charges section allows users with billable access to view total charges incurred to date. You can see how much is left of the org's spending limit. This section is only visible if your org is a billable org, which means your org has confirmed billing information.

    circle-info

    If your org doesn't have a spending limit, its spending is unlimited and the limit shows as "N/A."

    hashtag
    Using Monthly Project Spending and Usage Limits

    circle-info

    You need a license to use both the Monthly Project Usage Limit for Computing and Egress, and Monthly Project Spending Limit for Storage features. Contact DNAnexus Sales for more information.

    For orgs with the Monthly Project Usage Limit for Computing and Egress and/or the Monthly Project Spending Limit for Storage feature enabled, org admins can set, update, and view default limits for each spending type, and set limit enforcement actions via API calls.

    • To set the org policies for compute, egress, and storage spending limits and related enforcement actions, use the API methods org/new, org-xxxx/update, or org-xxxx/bulkUpdateProjectLimit.

    • To retrieve and view your org's policies configuration for spending and usage limits, use the API method org-xxxx/describe.

    In orgs with the Monthly Project Usage Limit for Computing and Egress feature enabled, org admins can set default limits and enforcement actions:

    1. In the main menu, click Orgs > All Orgs.

    2. In the Organizations list, click the organization name you want to configure.

    3. In the Usage Limits section:

      1. Set the default compute and egress spending limits for linked projects.

      2. Configure the enforcement action when limits are reached.

      3. Choose whether to prevent new executions and terminate ongoing ones, or send alerts while allowing executions to continue.

    For details on these limits, see How Spending and Usage Limits Work.

    Configuring limits and their enforcement in org details > Billing > Usage Limits.

    hashtag
    Monitoring Spending and Usage

    circle-info

    Licenses are required to use the Per-Project Usage Report and Root Execution Stats Report features. Contact DNAnexus Sales for more information.

    Configuration of these features, and report delivery, is handled by DNAnexus Support.

    The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide detailed breakdowns of charges incurred by org members. These reports help you track and analyze spending patterns across your organization. For more information, see organization management and usage monitoring.

    hashtag
    Org Policies

    Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:

    Policy: Membership List Visibility
    Description: Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.
    Options: [ADMIN], [MEMBER], or [PUBLIC]

    Policy: Project Transfer
    Description: Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).
    Options: [ADMIN] or [MEMBER]

    As a starting point, DNAnexus recommends restricting both the "membership list visibility policy" and the "project transfer policy" to ADMIN. This ensures that only org admins can see the list of members and their access within the org, and that org projects always remain under the control of the org.

    You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here you can change the membership list visibility and project transfer policies for the org, or contact DNAnexus Support to enable PHI data policies for org projects.

    hashtag
    Glossary of Org Terms

    • Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.

    • Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.

    • Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the billing account of the app is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.

    • Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.

    • Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Note that ADMINISTER is a type of access level, not a membership level.

    • Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.

    • Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and do not have any org projects or org apps. Any user can share a project with a non-billable org.

    • Org access is granted to a user to determine which actions the user can perform in an org.

    • Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.

    • Org app is an app billed to an org.

    • Org ID is the unique ID used to reference a particular org on the DNAnexus Platform. An example is org-dnanexus.

    • Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.

    • Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.

    • Org project describes a project billed to an org.

    • Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.

    • Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.

    • Share with an org means to give the members of an org access to a project or app, either by giving the org access to the project or by adding the org as an "authorized user" of an app.

    • Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."

    • Shared projects access is an org access level that can be granted to org members. It sets the maximum access level a user can have in projects shared with an org.

    hashtag
    Learn More

    Learn in depth about setting up and managing orgs as an administrator.

    Learn about what you can do as an org member.

    Learn about creating and managing orgs as a developer, via the DNAnexus API.

    Running Workflows

    You can run workflows from the command line using the command dx run. The inputs to these workflows can come from any project for which you have VIEW access.

    The examples here use the publicly available Exome Analysis Workflow (platform login required to access this link).

    For information on how to run a Nextflow pipeline, see Running Nextflow Pipelines.

    hashtag
    Running in Interactive Mode

    Running dx run without specifying an input launches interactive mode. The system prompts for each required input, followed by options to select from a list of optional parameters to modify. Optional parameters include all modifiable parameters for each stage of the workflow. The interface then prints the input JSON detailing the inputs specified and generates an analysis ID of the form analysis-xxxx, unique to this particular run of the workflow.

    Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.

    hashtag
    Running in Non-Interactive Mode

    You can specify each input on the command line with the -i or --input flag, using the syntax -i<stage ID>.<input name>=<input value>. The <input value> must be a DNAnexus object ID or the name of a file in the project you have selected. You can also give the number of a stage in place of its stage ID, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only, to illustrate this point. The parentheses shown around <input value> in the help string are omitted when entering input.

    Possible values for the input name field can be found by running the command dx run workflow-xxxx -h, as shown below using the Exome Analysis Workflow.

    This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message first lists the required inputs for that stage, specifying the requisite type in the <input-value> field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it lists advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form

    {"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}.

    This link format can also be used to specify output from any prior stage in the workflow as input for the current stage.
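
    As a minimal sketch of this link format, the Python snippet below builds the same structure that the help message shows as a default value. The stage_output_link helper is hypothetical; the stage and field names are the ones the Exome Analysis Workflow help reports for the fastqc stage.

```python
import json


def stage_output_link(stage, output_field):
    """Build a link to the output of a prior stage (hypothetical helper)."""
    return {"$dnanexus_link": {"stage": stage, "outputField": output_field}}


# Feed the sorted_bam output of the read mapper stage into the fastqc stage.
workflow_input = {
    "fastqc.reads": stage_output_link("bwa_mem_fastq_read_mapper", "sorted_bam"),
}
print(json.dumps(workflow_input, indent=2))
```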

    For the Exome Analysis Workflow, one required input parameter needs to be specified manually: -ibwa_mem_fastq_read_mapper.reads_fastqgzs.

    This parameter targets the first stage of the workflow. For convenience, use the stage number instead of the full stage ID. Since this is the first stage (and workflow stages are zero-indexed), replace bwa_mem_fastq_read_mapper with 0 like this: -i0.reads_fastqgzs.
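
    The two spellings of the same input can be compared with a small Python sketch. It only formats the arguments and does not call dx; the input_flag helper and the placeholder IDs (workflow-xxxx, file-xxxx) are assumptions made for illustration.

```python
import shlex


def input_flag(stage, input_name, value):
    """Format one -i<stage>.<input name>=<input value> argument (hypothetical helper)."""
    return f"-i{stage}.{input_name}={value}"


# The same input, addressed by stage ID and by zero-based stage number.
by_id = input_flag("bwa_mem_fastq_read_mapper", "reads_fastqgzs", "file-xxxx")
by_index = input_flag(0, "reads_fastqgzs", "file-xxxx")

cmd = ["dx", "run", "workflow-xxxx", by_index, "-y"]
print(shlex.join(cmd))
```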

    The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.

    hashtag
    Specifying Array Input

    To specify array input, repeat the same parameter of a stage once for each value. For example, the following flags would add files 1 through 3 to the file_inputs parameter for stage-xxxx of the workflow:

    If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>.
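
    A short Python sketch of how such repeated flags could be generated, with an optional project qualifier. The array_input_flags helper is hypothetical and the IDs are placeholders; the snippet only builds strings and does not run dx.

```python
def array_input_flags(stage, input_name, file_ids, project=None):
    """Return one -i flag per file; repeating a parameter builds an array input."""
    flags = []
    for file_id in file_ids:
        # Qualify with <project id>:<file id> when the file is in another project.
        value = f"{project}:{file_id}" if project else file_id
        flags.append(f"-i{stage}.{input_name}={value}")
    return flags


flags = array_input_flags(
    "stage-xxxx", "file_inputs",
    ["file-1xxxx", "file-2xxxx", "file-3xxxx"],
    project="project-xxxx",
)
for flag in flags:
    print(flag)
```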

    hashtag
    Job-Based Object References (JBORs)

    The -i flag can also be used to specify job-based object references (JBORs) with the syntax -i<stage ID or number>.<input name>=<job id>:<output name>. The --brief flag, when used with dx run, outputs only the ID of the execution, and the -y flag skips the interactive prompt that confirms the execution. Combined, the two flags let one dx run call supply a job ID directly to another.

    The example below calls the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to produce the sorted_bam output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h. This output is then used as input to the first stage of the Parliament Workflow featured on the DNAnexus Platform (platform login required to access this link).
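
    The relationship between the command-line form of a JBOR and the link that ends up in the input JSON can be sketched as follows. Both helpers are hypothetical, and job-xxxx stands in for the ID that a dx run call with -y --brief would print.

```python
def jbor_flag(stage, input_name, job_id, output_name):
    """Format -i<stage>.<input name>=<job id>:<output name> (hypothetical helper)."""
    return f"-i{stage}.{input_name}={job_id}:{output_name}"


def jbor_link(job_id, output_name):
    """The same reference in the form used inside the input JSON."""
    return {"$dnanexus_link": {"job": job_id, "field": output_name}}


flag = jbor_flag(0, "illumina_bam", "job-xxxx", "sorted_bam")
link = jbor_link("job-xxxx", "sorted_bam")
print(flag)
print(link)
```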

    hashtag
    Advanced Options

    hashtag
    Quiet Output

    Using the --brief flag at the end of a dx run command causes the command line to print the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.

    hashtag
    Rerunning Analyses With Modified Settings

    To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]. The [options] parameters override anything set by the --clone flag, and take the form of options passed as input from the command line.

    The --clone flag does not copy the usage of the --allow-ssh or --debug-on flags, which must be set with the new execution. Only the applet, instance type, and input spec are copied. See the Connecting to Jobs page for more information on the usage of these flags.

    For example, the command below redirects the output of the analysis to the outputs/ folder and reruns all stages.

    circle-info

    Only the outputs of stages rerun are placed in the destination specified.
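
    One way to assemble such a rerun from Python is sketched below. It is an assumption-laden illustration: analysis-xxxx and outputs/ are placeholders, and the snippet only prints the command rather than running it. Passing the arguments as a list means no shell is involved, so the asterisk needs no quoting until the command is rendered as a shell string.

```python
import shlex

cmd = [
    "dx", "run",
    "--clone", "analysis-xxxx",   # reuse the settings of the earlier analysis
    "--rerun-stage", "*",         # rerun every stage
    "--destination", "outputs/",  # send the rerun stages' outputs here
    "-y",
]

# shlex.join adds the quotes a shell would need around the asterisk.
rendered = shlex.join(cmd)
print(rendered)
```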

    hashtag
    Rerunning Specific Stages

    When rerunning workflows, if a stage runs identically to how it ran in a previous analysis, the stage itself is not rerun, and its outputs are not copied or rewritten in a new location. To force a specific stage to run again, use the option --rerun-stage STAGE_ID, where STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). To rerun all stages of an analysis, use --rerun-stage "*". The asterisk is enclosed in quotes to prevent the shell from expanding it, via globbing, into the names of the files and folders in your current directory.

    The command below reruns the third and final stage of analysis-xxxx:
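
    A hedged Python sketch of selecting stages by index: because stages are zero-indexed, the third stage is index 2. The rerun_args helper is hypothetical, and repeating --rerun-stage once per stage is an assumption about the CLI that should be checked against dx run -h.

```python
def rerun_args(analysis_id, stages):
    """Build dx run arguments that clone an analysis and force the given stages to rerun."""
    args = ["dx", "run", "--clone", analysis_id]
    for stage in stages:
        # Each stage may be given as an ID, a name, or a zero-based index.
        args += ["--rerun-stage", str(stage)]
    return args


third_stage_index = 3 - 1  # stages are indexed starting at 0
cmd = rerun_args("analysis-xxxx", [third_stage_index]) + ["-y"]
print(" ".join(cmd))
```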

    hashtag
    Specifying Analysis Output Folders

    The --destination flag allows you to specify the path of the output of a workflow. By default, every output of every stage is written to the destination specified.

    hashtag
    Specifying Output Folders

    You can use the --stage-output-folder <stage_ID> <folder> option to specify the output destination of a particular stage in the analysis being run. Here, stage_ID is the stage's name or the index of that stage (where the first stage of a workflow is indexed at 0). The folder is the project and path to which the stage should write, using the syntax project-xxxx:/PATH, where PATH is the path to the folder in project-xxxx where you wish to write outputs.

    The following command reruns all stages of analysis-xxxx and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:

    hashtag
    Specifying Stage-Relative Output Folders

    If you want to specify the output folder of a stage within the current output folder of the entire analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>, where stage_id is the stage's ID (stage-xxxx), its name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, specify a quoted path, relative to the output folder of the analysis, to which that stage writes its output.

    The following command reruns all stages of analysis-xxxx, setting the output destination of the analysis to /exome_run, and the output destination of stage 0 to /exome_run/mappings in the current project:
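
    The way a stage-relative folder combines with the analysis destination can be sketched in Python. This is an illustration under stated assumptions: the IDs are placeholders, the command is only printed, and the path join reflects the behavior described above rather than a call to the Platform.

```python
import posixpath
import shlex

destination = "/exome_run"  # output folder of the whole analysis
stage_folder = "mappings"   # folder for stage 0, relative to the destination

cmd = [
    "dx", "run", "--clone", "analysis-xxxx",
    "--rerun-stage", "*",
    "--destination", destination,
    "--stage-relative-output-folder", "0", stage_folder,
    "-y",
]

# Where stage 0 is expected to write, per the description above.
stage_path = posixpath.join(destination, stage_folder)
print(shlex.join(cmd))
print(stage_path)
```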

    hashtag
    Specifying a Different Instance Type

    To specify the instance type of all stages in your analysis or a specific set of stages in your analysis, use the flag --instance-type. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE sets one instance type for all stages. The two options can be combined, for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16 sets all stages' instance types to mem2_ssd1_x2 except for the stage my_stage_0, for which mem3_ssd1_x16 is used.

    Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).

    The example below reruns all stages of analysis-xxxx and specifies that the first and second stages should be run on mem1_ssd2_x8 and mem1_ssd2_x16 instances respectively:
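
    The two forms of --instance-type can be generated from a simple mapping, as in this Python sketch. The instance_type_args helper is hypothetical; the instance type names are the ones used in the text above.

```python
def instance_type_args(default=None, per_stage=None):
    """Build --instance-type arguments: an optional default plus per-stage overrides."""
    args = []
    if default:
        args += ["--instance-type", default]
    for stage, instance_type in (per_stage or {}).items():
        # STAGE_ID=INSTANCE_TYPE applies to that stage only.
        args += ["--instance-type", f"{stage}={instance_type}"]
    return args


# First and second stages (indexes 0 and 1) on different instance types.
per_stage_only = instance_type_args(per_stage={0: "mem1_ssd2_x8", 1: "mem1_ssd2_x16"})

# One default for all stages, overridden for a single named stage.
combined = instance_type_args("mem2_ssd1_x2", {"my_stage_0": "mem3_ssd1_x16"})
print(per_stage_only)
print(combined)
```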

    hashtag
    Adding Metadata to an Analysis

    This is identical to adding metadata to a job. See Adding metadata to a job for details.

    hashtag
    Monitoring an Analysis

    Command line monitoring of an analysis is not available. For information about monitoring a job from the command line, see Monitoring Executions.

    circle-exclamation

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs that run longer than 30 days are automatically terminated.

    hashtag
    Providing Input JSON

    This is identical to providing an input JSON to a job. For more information, see Providing input JSON.

    As in running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>. Here STAGE_ID is either an ID of the form stage-xxxx or the index of that stage in the workflow (starting with the first stage at index 0).
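
    As a minimal sketch of the key format, the snippet below writes a workflow input JSON whose keys follow the STAGE_ID.<input> pattern and builds a dx run call that would load it. The file name, the placeholder IDs, and the use of the -f (input JSON file) option are assumptions to verify against dx run -h; the command is printed, not executed.

```python
import json
import os
import tempfile

# Keys take the form STAGE_ID.<input>; here the stage is given by its index.
workflow_input = {
    "0.reads_fastqgzs": [
        {"$dnanexus_link": {"project": "project-xxxx", "id": "file-xxxx"}}
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.json")
    with open(path, "w") as handle:
        json.dump(workflow_input, handle, indent=2)

    # Read the file back to confirm it holds valid JSON with the same content.
    with open(path) as handle:
        reloaded = json.load(handle)

    cmd = ["dx", "run", "workflow-xxxx", "-f", path, "-y"]
    print(" ".join(cmd))
```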

    job-based object references
    BWA-MEM FASTQ Read Mapper
    Parliament Workflow
    Connecting to Jobs
    Adding metadata to a job
    Monitoring Executions
    Providing input JSON
    $ dx run "Exome Analysis Demo:Exome Analysis Workflow"
    Entering interactive mode for input selection.
    
    Input:   Reads (bwa_mem_fastq_read_mapper.reads_fastqgzs)
    Class:   array:file
    
    Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
        current directory, '?' for more options)
    bwa_mem_fastq_read_mapper.reads_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
    
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
     [0] Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
     [1] Read group information (bwa_mem_fastq_read_mapper.rg_info_csv)
    .
    .
    .
     [33] Output prefix (gatk4_genotypegvcfs.prefix)
     [34] Extra command line options (gatk4_genotypegvcfs.extra_options) [default="-G StandardAnnotation --only-output-calls-starting-in-intervals"]
    
    Optional param #: 0
    
    Input:   Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
    Class:   array:file
    
    Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
       current directory, '?' for more options)
    bwa_mem_fastq_read_mapper.reads2_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_2.fastq.gz"
    bwa_mem_fastq_read_mapper.reads2_fastqgzs[1]:
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
      "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jg7v8KfPy38kjz1vQ001y"
          }
        }
      ],
      "bwa_mem_fastq_read_mapper.reads2_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jgYG8KfPy38kjz1vQ0020"
          }
        }
      ]
    }
    
    Confirm running the executable with this input [Y/n]: <ENTER>
    Calling workflow-xxxx with output destination project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run "Exome Analysis Demo:Exome Analysis Workflow" -h
    usage: dx run Exome Analysis Demo:Exome Analysis Workflow [-iINPUT_NAME=VALUE ...]
    
    Workflow: GATK4 Exome FASTQ to VCF (hs38DH)
    
    Runs GATK4 Best Practice for Exome on hs38DH reference genome
    
    Inputs:
     bwa_mem_fastq_read_mapper
      Reads: -ibwa_mem_fastq_read_mapper.reads_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads_fastqgzs=... [...]]
            An array of files, in gzipped FASTQ format, with the first read mates
            to be mapped.
    
      Reads (right mates): [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=... [...]]]
            (Optional) An array of files, in gzipped FASTQ format, with the second
            read mates to be mapped.
      BWA reference genome index: [-ibwa_mem_fastq_read_mapper.genomeindex_targz=(file, default={"$dnanexus_link": {"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv", "id": "file-FFJPKp0034KY8f20F6V9yYkk"}})]
            A file, in gzipped tar archive format, with the reference genome
            sequence already indexed with BWA.
      ...
     fastqc
      Reads: [-ifastqc.reads=(file, default={"$dnanexus_link": {"stage": "bwa_mem_fastq_read_mapper", "outputField": "sorted_bam"}})]
            A file containing the reads to be checked. Accepted formats are
            gzipped-FASTQ and BAM.
      ...
     gatk4_bqsr
      Sorted mappings: [-igatk4_bqsr.mappings_sorted_bam=(file, default={"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}})]
            A coordinate-sorted BAM or CRAM file with the base quality scores to
            be recalibrated.
       ...
     ...
    
    Outputs:
      Sorted mappings: bwa_mem_fastq_read_mapper.sorted_bam (file)
            A coordinate-sorted BAM file with the resulting mappings.
    
      Sorted mappings index: bwa_mem_fastq_read_mapper.sorted_bai (file)
            The associated BAM index file.
      ...
      Variants index: gatk4_genotypegvcfs.variants_vcfgztbi (file)
            The associated TBI file.
    $ dx run "Exome Analysis Demo:Exome Analysis Workflow" \
     -i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
     -ibwa_mem_fastq_read_mapper.genomeindex_targz='Reference Genome Files\: AWS US (East):/H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.bwa-index.tar.gz' -y
    Using input JSON:
    {
      "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jg7v8KfPy38kjz1vQ001y"
          }
        }
      ],
      "bwa_mem_fastq_read_mapper.genomeindex_targz": {
        "$dnanexus_link": {
          "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
          "id": "file-B6ZY4942J35xX095VZyQBk0v"
        }
      }
    }
    
    Calling workflow-xxxx with output destination
      project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run workflow \
    -istage-xxxx.file_inputs=project-xxxx:file-1xxxx \
    -istage-xxxx.file_inputs=project-xxxx:file-2xxxx \
    -istage-xxxx.file_inputs=project-xxxx:file-3xxxx
    
    Using input JSON:
    {
      "stage-xxxx.file_inputs": [
        {
          "$dnanexus_link": {
            "project": "project-xxxx",
            "id": "file-1xxxx"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-xxxx",
            "id": "file-2xxxx"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-xxxx",
            "id": "file-3xxxx"
          }
        }
      ]
    }
    $ dx run Parliament \
      -i0.illumina_bam=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=file-xxxx -ireads2_fastqgzs=file-xxxx -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y --brief):sorted_bam \
      -i0.ref_fasta=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V \
      -y
    
    Using input JSON:
    {
        "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.illumina_bam": {
            "$dnanexus_link": {
                "field": "sorted_bam",
                "job": "job-xxxx"
            }
        },
        "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.ref_fasta": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-B6qq53v2J35Qyg04XxG0000V"
            }
        }
    }
    
    Calling workflow-xxxx with output destination project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run workflow-xxxx -i0.input_file=Input/SRR504516_1.fastq.gz -y --brief
    analysis-xxxx
    dx run --clone analysis-xxxx \
      --rerun-stage "*" \
      --destination project-xxxx:/output -y
    dx run --clone analysis-xxxx --rerun-stage 2 --brief -y
    dx run --clone analysis-xxxx --rerun-stage "*" \
      --stage-output-folder 0 "mappings" --brief -y
    dx run --clone analysis-xxxx --rerun-stage "*" \
      --destination "exome_run" \
      --stage-relative-output-folder 0 "mappings" --brief -y
    dx run --clone analysis-xxxx \
      --rerun-stage "*" \
      --instance-type '{"0": "mem1_hdd2_x8", "1": "mem1_ssd2_x4"}' \
      --brief -y

    Shared projects access

    The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member has at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project.

    [NONE], [VIEW], [UPLOAD], [CONTRIBUTE] or [ADMINISTER]

    Project Sharing

    Dictates the minimum org membership level a user must have to invite that org to a project.

    [ADMIN] or [MEMBER]

    Instance Upgrade on Job Restart

    Controls whether the platform automatically retries a job on a larger instance type when the job fails with an AppInsufficientResourceError (out of memory or out of storage). When enabled, the platform upgrades the instance one step within the same instance family on retry. This policy corresponds to the allowInstanceUpgradeOnJobRestart org policy key.

    [Enabled] or [Disabled] (default: Disabled)
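
    The "Shared projects access" policy above acts as a cap: a member's effective access is the lower of the access granted to the org and the member's own limit. The following is a minimal sketch of that rule in Python. The function name effective_access is hypothetical and is not part of any DNAnexus API; the level ordering is taken from the option list above.

```python
# Access levels in increasing order of privilege, as listed in the policy options.
ACCESS_LEVELS = ["NONE", "VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]


def effective_access(org_project_access: str, member_cap: str) -> str:
    """Return the lower of the org's project access and the member's cap.

    Hypothetical helper for illustration only.
    """
    return min(org_project_access, member_cap, key=ACCESS_LEVELS.index)


# The org has ADMINISTER on the project, but the member is capped at UPLOAD.
print(effective_access("ADMINISTER", "UPLOAD"))  # UPLOAD
# The cap never raises access above what the org itself was given.
print(effective_access("VIEW", "CONTRIBUTE"))    # VIEW
```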

    Running Apps and Applets

    You can run apps and applets from the command line using the command dx run. The inputs to these app(let)s can come from any project to which you have VIEW access. You can also run apps and applets from the UI.

    Running in Interactive Mode

    If dx run is run without specifying any inputs, interactive mode launches. The platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to run with the inputs you have selected.

    Running in Non-interactive Mode

    Naming Each Input

    You can also specify each input parameter by name using the ‑i or ‑‑input flags with syntax ‑i<input name>=<input value>. Names of data objects in your project are resolved to the appropriate IDs and packaged correctly for the API method as shown below.

    When specifying input parameters using the ‑i/‑‑input flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h, as shown below using the Swiss Army Knife app (platform login required to access this link).

    The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, the Swiss Army Knife app has two primary inputs: one or more files and a string to be executed on the command line, to be specified as -iin=file-xxxx and -icmd=<string>, respectively.

    The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.

    Specifying Array Input

    To specify array inputs, reuse the ‑i/‑‑input flag for each input in the array. Each file specified is appended to an array in the same order as it was entered on the command line. Below is an example of how to use the Swiss Army Knife app to index multiple BAM files (platform login required to access this link).
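
    When job submissions are scripted, the one-flag-per-element rule can be applied mechanically. The sketch below builds a dx run argument list in Python without executing it. The helper build_dx_run_args is hypothetical (not part of dx-toolkit), and the file IDs are placeholders.

```python
import shlex


def build_dx_run_args(executable, inputs, extra_flags=()):
    """Build an argument list for `dx run` from input field names and values.

    A list or tuple value becomes one -i flag per element, in order, which is
    how array inputs are passed on the command line. Hypothetical helper.
    """
    args = ["dx", "run", executable]
    for name, value in inputs.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            args.append(f"-i{name}={item}")
    args.extend(extra_flags)
    return args


args = build_dx_run_args(
    "app-swiss-army-knife",
    {
        "in": ["file-1xxxx", "file-2xxxx"],  # placeholder file IDs
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
    },
    extra_flags=["-y"],
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(args))
```

    Passing the list directly to subprocess.run(args) would avoid shell quoting altogether; the printed form is only for inspection.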

    Job-Based Object References (JBORs)

    Job-based object references (JBORs) can also be provided using the -i flag with syntax ‑i<input name>=<job id>:<output name>. Combined with the --brief flag (which makes dx run output only the job ID) and the -y flag (to skip confirmation), you can string together two jobs using one command.

    Below is an example of how to run the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), producing the output named sorted_bam as described in the help output shown by the command dx run app-bwa_mem_fastq_read_mapper -h. The sorted_bam output is then used as input for the Swiss Army Knife app (platform login required to access this link).
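
    The input JSON that dx run prints for a JBOR has a small, regular shape: a $dnanexus_link whose value names a job and one of its output fields. As a minimal sketch, the same structure can be built in Python. The helper name jbor is an assumption made for illustration, and job-xxxx is a placeholder.

```python
import json


def jbor(job_id, output_field):
    """Return a job-based object reference to one output field of a job.

    Hypothetical helper; it only builds the JSON structure.
    """
    return {"$dnanexus_link": {"job": job_id, "field": output_field}}


# Input for the Swiss Army Knife app: "in" is an array, so the JBOR is
# wrapped in a list.
app_input = {
    "in": [jbor("job-xxxx", "sorted_bam")],
    "cmd": "samtools index *.bam",
}

print(json.dumps(app_input, indent=4))
```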

    Advanced Options

    Some examples of additional functionalities provided by dx run are listed below.

    Quiet Output

    Regardless of whether you run a job interactively or non-interactively, the command dx run always prints the exact input JSON with which it is calling the applet or app. If you don't want to print this verbose output, you can use the --brief flag which tells dx to print out only the job ID instead. This job ID can then be saved.

    To run jobs without being prompted for confirmation, use the -y or --yes option. This is especially helpful when scripting or automating job submissions.

    If you want to both skip confirmation and immediately monitor the job's progress, use -y --watch. This starts the job and displays its logs in your terminal as it runs.

    Rerunning a Job With the Same Settings

    If you are debugging applet-xxxx and wish to rerun a job you previously ran, using the same settings (destination project and folder, inputs, instance type requests), but use a new executable applet-yyyy, you can use the --clone flag.

    In the above command, the executable given on the command line, the Swiss Army Knife app (platform login required to access this link), is used in place of the executable that was run by the job passed to --clone job-xxxx.

    If you want to modify some but not all settings from the previous job, you can run dx run <executable> --clone job-xxxx [options]. The command-line arguments you provide in [options] override the settings reused from --clone. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.

    The example shown below redirects the outputs of the job to the folder "/output" of project-xxxx.

    While the --clone job-xxxx flag copies the applet, instance type, and inputs, it does not copy usage of the --allow-ssh or --debug-on flags. These must be re-specified for each job run. For more information, see the Connecting to Jobs page.

    Specifying the Job Output Folder

    The --destination flag allows you to specify the full project-ID:/folder/ path in which to output the results of the app(let). If this flag is unspecified, the output of the job defaults to the present working directory, which can be determined by running dx pwd.

    In the above command, the flag --destination project-xxxx:/mappings instructs the job to output all results into the "mappings" folder of project-xxxx.

    Specifying a Different Instance Type

    The dx run --instance-type command allows you to specify the instance types to use for the job. More information is available by running the command dx run --instance-type-help.

    Some apps and applets have multiple entry points, meaning that different instance types can be specified for different functions executed by the app(let). In the example below, the Parliament app (platform login required to access this link) is run while specifying the instance types for the entry points honey, ssake, ssake_insert, and main. Specifying the instance types for each entry point requires a JSON-like string, so the string should be wrapped in single quotes, as demonstrated below.
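
    When the per-entry-point mapping is generated by a script, serializing it with a JSON library avoids hand-quoting mistakes. The following is a minimal sketch, assuming the entry point names from the Parliament example. The helper instance_type_flag is hypothetical and only builds the arguments; it does not call dx.

```python
import json
import shlex


def instance_type_flag(entry_point_types):
    """Return the --instance-type flag and its JSON value as two arguments.

    Hypothetical helper for illustration.
    """
    return ["--instance-type", json.dumps(entry_point_types)]


flag = instance_type_flag({"honey": "mem1_ssd1_x32", "main": "mem1_ssd1_x16"})

# shlex.join wraps the JSON in single quotes, so the double quotes inside it
# reach dx unchanged.
print(shlex.join(flag))
# --instance-type '{"honey": "mem1_ssd1_x32", "main": "mem1_ssd1_x16"}'
```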

    Adding Metadata to a Job

    If you are running many jobs that have varying purposes, you can organize the jobs using metadata. Two types of metadata are available on the DNAnexus Platform: properties and tags.

    Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property flag allows you to attach a property to a job, and the --tag flag allows you to tag a job.

    Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs make it easier for you to search for a particular job in your job history. This is useful when you want to tag all jobs run with a particular sample, for instance.

    Specifying an App Version

    If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job. Append the required version to the app name, for example, app-xxxx/0.0.1 if the current version is app-xxxx/1.0.0.

    Watching a Job

    To monitor your job as it runs, use the --watch flag to display the job's logs in your terminal window as it progresses.

    Providing Input JSON

    You can also specify the input JSON in its entirety. To specify a data object, you must wrap it in DNAnexus link form (a key-value pair with a key of $dnanexus_link and value of the data object's ID). Because you are already providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, confirmation before the job starts is not required. Three methods exist for entering the full input JSON, discussed in separate sections below.

    From the CLI

    If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.

    From a File

    If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file followed by the name of the JSON file.

    From stdin

    Entering the input JSON file using stdin is done the same way as entering the file using the -f flag with the substitution of using "-" as the filename. Below is an example that demonstrates how to echo the input JSON to stdin and pipe the output to the input of dx run. As before, use single quotes to wrap the JSON input to avoid interfering with the double quotes used by the JSON itself.
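
    All three methods consume the same JSON document, so it can be convenient to generate that document once. The sketch below builds the input JSON in Python using DNAnexus link form for each file. The helper file_link and the placeholder IDs are assumptions for illustration. The resulting text could be passed with -j, saved to a file for -f, or piped to dx run -f -.

```python
import json


def file_link(project_id, file_id):
    """Wrap a file ID in DNAnexus link form, qualified by its project.

    Hypothetical helper; it only builds the JSON structure.
    """
    return {"$dnanexus_link": {"project": project_id, "id": file_id}}


app_input = {
    "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
    "in": [
        file_link("project-xxxx", file_id)  # placeholder IDs
        for file_id in ("file-1xxxx", "file-2xxxx", "file-3xxxx")
    ],
}

# Serialize once; json.dumps emits double quotes, which is why the shell
# examples wrap the JSON in single quotes.
input_json_text = json.dumps(app_input, indent=2)
print(input_json_text)
```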

    Getting Additional Information on dx run

    Executing the dx run --help command shows the flags available to use with dx run. The message printed by this command is identical to the one displayed in the brief description of dx run.

    Cost Run Limits

    The --cost-limit cost_limit option sets the maximum cost of the job before termination. For workflows, it is the cost of the entire analysis. For batch runs, this limit applies per job. See the dx run --help command for more information.

    Job Runtime Limits

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.

    $ dx run app-bwa_mem_fastq_read_mapper
    Entering interactive mode for input selection.
    
    Input:   Reads (reads_fastqgz)
    Class:   file
    
    Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
    reads_fastqgz: reads.fastq.gz
    
    Input:   BWA reference genome index (genomeindex_targz)
    Class:   file
    Suggestions:
        project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes)
    
    Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
    genomeindex_targz: "Reference Genome Files:/H. Sapiens - hg19 (UCSC)/ucsc_hg19.bwa-index.tar.gz"
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
     [0] Reads (right mates) (reads2_fastqgz)
     [1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
     [2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
     [3] Read group platform (read_group_platform) [default="ILLUMINA"]
     [4] Read group platform unit (read_group_platform_unit) [default="None"]
     [5] Read group library (read_group_library) [default="1"]
     [6] Read group sample (read_group_sample) [default="1"]
     [7] Output all alignments for single/unpaired reads? (all_alignments)
     [8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
     [9] Advanced command line options (advanced_options)
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
        "reads_fastqgz": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-xxxx"
            }
        },
        "genomeindex_targz": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-xxxx"
            }
        }
    }
    
    Confirm running the applet/app with this input [Y/n]: <ENTER>
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    Watch launched job now? [Y/n] n
    usage: dx run app-swiss-army-knife [-iINPUT_NAME=VALUE ...]
    
    App: Swiss Army Knife
    
    Version: 5.1.0 (published)
    
    A multi-purpose tool for all your basic analysis needs
    
    See the app page for more information:
      https://platform.dnanexus.com/app/swiss-army-knife
    
    Inputs:
      Input files: [-iin=(file) [-iin=... [...]]]
            (Optional) Files to download to instance temporary folder before
            command is executed.
    
      Command line: -icmd=(string)
            Command to execute on instance. View the app readme for details.
    
      Whether to use "dx-mount-all-inputs"?: [-imount_inputs=(boolean, default=false)]
            (Optional) Whether to mount all files that were supplied as inputs to
            the app instead of downloading them to the local storage of the
            execution worker.
    
      Public Docker image identifier: [-iimage=(string)]
            (Optional) Instead of using the default Ubuntu 24.04 environment, the
            input command <CMD> will be run using the specified publicly
            accessible Docker image <IMAGE> as it would be when running 'docker
            run <IMAGE> <CMD>'. Example image identifiers are 'ubuntu:25.04',
            'quay.io/ucsc_cgl/samtools'. Cannot be specified together with
            'image_file'. This input relies on access to internet and is unusable
            in an internet-restricted project.
    
      Platform file containing Docker image accepted by `docker load`: [-iimage_file=(file)]
            (Optional) Instead of using the default Ubuntu 24.04 environment, the
            input command <CMD> will be run using the Docker image <IMAGE> loaded
            from the specified image file <IMAGE_FILE> as it would be when running
            'docker load -i <IMAGE_FILE> && docker run <IMAGE> <CMD>'. Cannot be
            specified together with 'image'.
    
    Outputs:
      Output files: [out (array:file)]
            (Optional) New files that were created in temporary folder.
    $ dx run app-swiss-army-knife \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
    -icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam \
    SRR100022_chrom20_mapped_to_b37.bam" -y
    
    Using input JSON:
    {
        "cmd": "samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGpj0x05xKxZ42QPqZkJY \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGzj0x05b66kqQv51011q \
    -icmd="ls *.bam | xargs -n1 -P5 samtools index" -y
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
        -iin=$(dx run app-bwa_mem_fastq_read_mapper \
        -ireads_fastqgz=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXKk80fPFj4Jbfpxb6Ffv2 \
        -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y \
        --brief):sorted_bam \
        -icmd="samtools index *.bam" -y
    
    Using input JSON:
    {
        "in": [
            {
                "$dnanexus_link": {
                    "field": "sorted_bam",
                    "job": "job-xxxx"
                }
            }
        ],
        "cmd": "samtools index *.bam"
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-bwa_mem_fastq_read_mapper \
        -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
        -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
        -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
        --destination "mappings" -y --brief
    $ dx run app-swiss-army-knife --clone job-xxxx -y
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
    --clone job-xxxx --destination project-xxxx:/output -y
    $ dx run app-bwa_mem_fastq_read_mapper \
    -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
    -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
    -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
    --destination "mappings" -y --brief
    dx run parliament \
      -iillumina_bam=illumina.bam \
      -iref_fasta=ref.fa.gz \
      --instance-type '{
        "honey": "mem1_ssd1_x32",
        "ssake": "mem1_ssd1_x8",
        "ssake_insert": "mem1_ssd1_x32",
        "main": "mem1_ssd1_x16"
      }' \
      -y \
      --brief
    $ dx run app-swiss-army-knife \
        -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
        -icmd="samtools sort -T /tmp/aln.sorted -o \
        SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
        --property foo=bar --tag dna -y
    $ dx run app-swiss-army-knife/2.0.1 \
        -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
        -icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
        -y --brief
    $ dx run app-swiss-army-knife \
      -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
      -icmd="samtools sort -T /tmp/aln.sorted \
        -o SRR100022_chrom20_mapped_to_b37.sorted.bam \
        SRR100022_chrom20_mapped_to_b37.bam" \
      --watch \
      -y \
      --brief
    
    job-xxxx
    
    Job Log
    -------
    Watching job job-xxxx. Press Ctrl+C to stop.
    $ dx run app-swiss-army-knife -j '{
      "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
      "in": [
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BQbXVY0093Jk1KVY1J082y7v"
          }
        },
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
          }
        },
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGzj0x05b66kqQv51011q"
          }
        }
      ]
    }' -y
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife -f input.json
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ echo '{
      "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
      "in": [
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BQbXVY0093Jk1KVY1J082y7v"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGzj0x05b66kqQv51011q"
          }
        }
      ]
    }' | dx run app-swiss-army-knife -f - -y
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx

    Command Line Quickstart

    Learn to use the dx client for command-line access to the full range of DNAnexus Platform features.

    You must set up billing for your account before you can perform an analysis, or upload or egress data. Follow these instructions to set up billing.

    The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform, to upload, browse, and organize data, and to launch analyses.

    All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.

    Before You Begin

    If you haven't already done so, install the DNAnexus SDK (dx-toolkit), which includes the dx command-line client, as well as a range of useful utilities.

    Getting Help

    As you work, use the as a reference.

    On the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.

    To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:

    Step 1: Log In

    The first step is to log in. If you have not created a DNAnexus account, open the DNAnexus Platform and sign up. User signup is not supported on the command line.

    Your authentication token and your current project settings are saved in a local configuration file, and you can start accessing your project.

    You can generate an authentication token from the online DNAnexus Platform.

    Step 2: Explore

    Public Projects

    Look inside some public projects that have already been set up. From the command line, enter the command:

    By running the command and picking a project, you perform the command-line equivalent of going to the project page for (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular genomes for use in analyses with your own data.

    For more information about the dx select command, see the page.

    DNAnexus-sponsored data is free to copy from this project as many times as needed.

    List the data in the top-level directory of the project you've selected by running the command dx ls. View the contents of a folder by running the command dx ls <folder_name>.

    When you use wildcard characters like * or ? with dx commands, always enclose the pattern in quotes. For example, use dx ls "*.fastq". Without quotes, your shell expands the wildcards against local files before passing them to dx. This produces unexpected results. For details, see .
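
    The same precaution applies when a script assembles dx commands as strings. As a minimal sketch, Python's standard shlex.quote can add the quoting. It uses single quotes rather than the double quotes shown above, and either form stops the shell from expanding the pattern.

```python
import shlex

pattern = "*.fastq"

# Quote the pattern so the shell passes it to dx literally instead of
# expanding it against local files.
command = f"dx ls {shlex.quote(pattern)}"
print(command)  # dx ls '*.fastq'
```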

    You can avoid typing out the full name of the folder by typing in dx ls C and then pressing <TAB>. The folder name auto-completes from there.

    You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, the contents of the publicly available project "Demo Data" are listed using both its name and ID.

    As shown above, you can use the -l flag with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.

    Describing DNAnexus Objects

    You can use the dx describe command to learn more about objects on the platform. Given a DNAnexus object ID or name, dx describe returns detailed information about the object. dx describe only returns results for data objects to which you have access.

    Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.

    Describing a File

    The example below describes the reference genome file for C. elegans located in the "Reference Genome Files: AWS US (East)" project (which should be accessible from other regions as well). You need to add a colon (:) after the project name. Because this project name itself contains a colon, that colon is escaped with a backslash, giving Reference Genome Files\: AWS US (East):.
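
    When project names come from a script, the escaping can be applied programmatically. The sketch below is a hypothetical helper (dx_path is not part of dx-toolkit) that escapes only colons in the project name, which is the one case shown on this page.

```python
def dx_path(project_name, path=""):
    """Join a project name and a path, escaping colons in the project name.

    Hypothetical helper; it handles only the colon case described above.
    """
    escaped = project_name.replace(":", "\\:")
    return f"{escaped}:{path}"


print(dx_path("Reference Genome Files: AWS US (East)"))
# Reference Genome Files\: AWS US (East):
print(dx_path("Demo Data", "/SRR100022"))
# Demo Data:/SRR100022
```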

    Describing a Project

    The example below describes the publicly available Reference Genome Files project used above.

    Step 3: Create Your Own Project

    Use the dx new project command to create a new project.

    The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the page.

    The project is ready for uploading data and running analyses.

    The dx new command can also create other objects, including new orgs or users. Use the command dx help new to see additional information. For more information, see the .

    Step 4: Upload and Manage Your Data

    To analyze a sample, first upload it using the dx upload command or the Upload Agent, if installed. For this tutorial, download the file small-celegans-sample.fastq, which contains the first 25,000 C. elegans reads from SRR070372. This file is used in the sample analysis below.

    For uploading multiple or large files, use the Upload Agent. It compresses files and uploads them in parallel over multiple HTTP connections and supports resumable uploads.

    The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until uploading is complete before returning the prompt and describing the result.

    If you run the same command but add the flag --brief, only the file ID (in the form of file-xxxx) is printed to the terminal. Other dx commands also accept the --brief flag and report only object IDs.

    Examining Data

    To take a quick look at the first few lines of the file you uploaded, use the dx head command. By default, it prints the first 10 lines of the given file.

    Run it on the file you uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.
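
    The figure of 12 lines for 3 reads follows from the usual FASTQ layout of four lines per read: header, sequence, separator, and quality string. The sketch below shows the arithmetic and groups lines into records. The sample reads are invented for illustration and are not taken from SRR070372.

```python
LINES_PER_READ = 4  # header, sequence, "+" separator, quality string


def lines_for_reads(num_reads):
    """Number of lines to request in order to see num_reads complete reads."""
    return num_reads * LINES_PER_READ


def split_records(lines):
    """Group FASTQ lines into one four-line record per read."""
    return [lines[i:i + LINES_PER_READ] for i in range(0, len(lines), LINES_PER_READ)]


# Invented example data: three short reads.
sample_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCA", "+", "IIII",
]

print(lines_for_reads(3))                 # 12
print(len(split_records(sample_lines)))   # 3
```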

    Downloading Data

    If you'd like to download a file from the platform, use the dx download command. This command uses the name of the file for the filename unless you specify your own with the -o or --output flag. The example below downloads the same C. elegans file that was uploaded previously.

    About Metadata

    Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".

    Step 5: Analyze a Sample

    For the next few steps, if you would like to follow along, you need a C. elegans FASTQ file. This tutorial maps the reads against the ce10 genome. If you haven't already, you can download and use the following FASTQ file, which contains the first 25,000 reads from SRR070372: small-celegans-sample.fastq.

    You can also substitute your own reads file for a different species (though it may take longer to run the example). For convenience, DNAnexus has already imported a variety of reference genomes to the platform. If you have a FASTA file to use, upload it and create genome indices for BWA using the BWA FASTA Indexer app (platform login required to access these links).

    The following walkthrough explains what each command does and shows which apps run. If you only want to convert a gzipped FASTQ file to a VCF via BWA and the FreeBayes Variant Caller, skip ahead to the Automation section to see the commands required to run the apps.

    Uploading Reads

    If you have not yet done so, you can upload a FASTQ file for analysis.

    For more information about using the command dx upload, see the dx upload page.

    Mapping Reads

    Next, use the BWA-MEM app (platform login required to access this link) to map the uploaded reads file to a reference genome.

    Finding the App Name

    If you don't know the command-line name of the app to run, you have two options:

    1. Navigate to its web page from the Apps page (platform login required to access this link). The app's page shows how to run it from the command line. See the BWA-MEM FASTQ Read Mapper page for details on the app used here (platform login required).

    2. Alternatively, search for apps from the command line by running dx find apps. The command-line name appears in parentheses in the output.

    Installing and Running the App

    Install the app using dx install and check that it has been installed. While you do not always need to install an app to run it, you may find it useful as a bookmarking tool.

    You can run the app using dx run. When you run it without any arguments, it prompts you for required and then optional arguments. The reference file genomeindex_targz for this C. elegans sample is in a .tar.gz format and can be found in the Reference Genome folder of the region your project is in.

    Monitoring Your Job

    You can use the dx watch command to monitor jobs. The command prints out the log file of the job, including the STDOUT, STDERR, and INFO printouts.

    You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, you can use the dx find jobs command to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.

    Additional options are available to restrict your search of previous jobs, such as by their names or when they were run.
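
For example, the sketch below restricts the search by job name and creation time. It assumes the --name and --created-after options of dx find jobs (run dx find jobs -h to confirm what your version supports); the helper name recent_jobs_by_name is hypothetical.

```shell
# recent_jobs_by_name is a hypothetical helper name, not part of dx.
recent_jobs_by_name() {
  local name_pattern="$1" window="$2"
  # --name filters by job name; --created-after takes a date or a
  # relative offset such as -2d (assumed here to mean "two days ago").
  dx find jobs --name "$name_pattern" --created-after="$window"
}

# Example usage (requires a logged-in dx session):
#   recent_jobs_by_name "BWA-MEM*" -2d
```

The offset is passed in the --created-after=VALUE form so that a leading minus sign is not mistaken for a new option.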

    Terminating Your Job

    If you need to terminate your job before it completes, use the command dx terminate.

    After Your Job Finishes

    You should see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!
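
The reference is the job ID and the output field name joined by a colon. A minimal sketch, with a hypothetical helper name, that builds such a reference and rejects IDs that do not start with job-:

```shell
# job_output_ref is a hypothetical helper that builds a job-based object reference.
job_output_ref() {
  local job_id="$1" field="$2"
  # Reject anything that does not look like a job ID.
  case "$job_id" in
    job-*) ;;
    *) echo "not a job ID: $job_id" >&2; return 1 ;;
  esac
  echo "${job_id}:${field}"
}

# Example usage (requires a logged-in dx session):
#   dx describe "$(job_output_ref job-xxxx sorted_bam)"
```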

    Variant Calling

    You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.

    This time, instead of relying on interactive mode to enter inputs, you provide them directly. First, look up the app's spec to determine the input names. Run the command dx run freebayes -h.

    Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. Notice that there are two required inputs that must be specified:

    1. Sorted mappings (sorted_bams): A list of files with a .bam extension.

    2. Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.

    You can also run dx describe freebayes for a more compact view of the input and output specifications. By default, it hides the advanced input options, but you can view them using the --verbose flag.

    Running the App with a One-Liner Using a Job-Based Object Reference

    It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. Use the -i flag to specify inputs as suggested by the output of dx run freebayes -h:

    • sorted_bams: The output of the previous BWA step (see the Mapping Reads section for more information).

    • genome_fastagz: The ce10 genome in the Reference Genomes project.

    To specify new job input using the output of a previous job, use a job-based object reference via the job-xxxx:<output field> syntax used earlier.

    You can use job-based object references as input even before the referenced jobs have finished. The system waits until the input is ready to begin the new job.

    Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.

    Automatically Running a Command After a Job Finishes

    Use the dx wait command to wait for a job to finish. If you run the following command immediately after launching the FreeBayes app, it shows recent jobs only after the job has finished, as shown in the example below.

    Congratulations! You have called variants on a reads sample using the command line. Next, see how to automate this process.

    Automation

    The CLI enables automation of these steps. The following script assumes that you are logged in. It is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.

    Learn More

    You can start scripting using dx. The --brief flag is useful for scripting. A list of all dx commands and flags is on the Index of dx Commands page.

    For more detailed information about running apps and applets from the command line, see the Running Apps and Applets page.

    For a comprehensive guide to the DNAnexus SDK, see the SDK documentation.

    Want to start writing your own apps? Check out the Developer Portal for some useful tutorials.

    Monitoring Executions

    Learn how to get information on current and past executions via both the UI and the CLI.

    Monitoring an Execution via the UI

    Getting Basic Information on an Execution

    $ dx help ls
    usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
    [--env-help] [--brief | --summary | --verbose] [-a] [-l] [--obj]
    [--folders] [--full]
    [path]
    
    List folders and/or objects in a folder
    ... # output truncated for brevity
    $ dx login
    Acquiring credentials from https://auth.dnanexus.com
    Username: <your username>
    Password: <your password>
    
    No projects to choose from. You can create one with the command "dx new project".
    To pick from projects for which you only have VIEW permissions, use "dx select --level VIEW" or "dx select --public".
    dx select --public --name "Reference Genome Files*"
    $ dx ls
    C. Elegans - Ce10/
    D. melanogaster - Dm3/
    H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/
    H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/
    H. Sapiens - GRCh38/
    H. Sapiens - hg19 (Ion Torrent)/
    H. Sapiens - hg19 (UCSC)/
    M. musculus - mm10/
    M. musculus - mm9/
    $ dx ls "C. Elegans - Ce10/"
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ... # output truncated for brevity
    $ dx ls "Demo Data:/SRR100022/"
    SRR100022_1.filt.fastq.gz
    SRR100022_2.filt.fastq.gz
    $ dx ls -l "project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/"
    Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
    Folder : /SRR100022
    State   Last modified       Size     Name (ID)
    ... # output truncated for brevity
    $ dx describe "Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
    Result 1:
    ID                  file-BQbY9Bj015pB7JJVX0vQ7vj5
    Class               file
    Project             project-BQpp3Y804Y0xbyG4GJPQ01xv
    Folder              /C. Elegans - Ce10
    Name                ce10.fasta.gz
    State               closed
    Visibility          visible
    Types               -
    Properties          Assembly=UCSC ce10,
                        Origin=https://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZips/ce10.2bit,
                        Species=Caenorhabditis elegans,
                        Taxonomy
                        ID=6239
    Tags                -
    Outgoing links      -
    Created             Tue Sep 30 18:54:35 2014
    Created by          bhannigan
     via the job        job-BQbY8y80KKgP380QVQY000qz
    Last modified       Thu Mar  2 12:17:27 2017
    Media type          application/x-gzip
    archivalState       "live"
    Size                29.21 MB, sponsored by DNAnexus
    $ dx describe "Reference Genome Files\: AWS US (East):"
    Result 1:
    ID                  project-BQpp3Y804Y0xbyG4GJPQ01xv
    Class               project
    Name                Reference Genome Files: AWS US (East)
    Summary             
    Billed to           org-dnanexus
    Access level        VIEW
    Region              aws:us-east-1
    Protected           true
    Restricted          false
    Contains PHI        false
    Created             Wed Oct  8 16:42:53 2014
    Created by          tnguyen
    Last modified       Tue Oct 23 14:15:59 2018
    Data usage          0.00 GB
    Sponsored data      519.77 GB
    Sponsored egress    0.00 GB used of 0.00 GB total
    Tags                -
    Properties          -
    downloadRestricted  false
    defaultInstanceType "mem2_hdd2_x2"
    $ dx new project "My First Project"
    Created new project called "My First Project"
    (project-xxxx)
    Switch to new project now? [y/N]: y
    $ dx upload --wait small-celegans-sample.fastq
    [===========================================================>] Uploaded (16801690 of 16801690 bytes) 100% small-celegans-sample.fastq
    ID              file-xxxx
    Class           file
    Project         project-xxxx
    Folder          /
    Name            small-celegans-sample.fastq
    State           closed
    Visibility      visible
    Types           -
    Properties      -
    Tags            -
    Details         {}
    Outgoing links  -
    Created         Sun Jan  1 09:00:00 2017
    Created by      amy
    Last modified   Sun Jan  1 09:00:00 2017
    Media type      text/plain
    Size            16.02 MB
    $ dx head -n 12 small-celegans-sample.fastq
    @SRR070372.1 FV5358E02GLGSF length=78
    TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATATATATA
    +SRR070372.1 FV5358E02GLGSF length=78
    ...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=<96632
    @SRR070372.2 FV5358E02FQJUJ length=177
    TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
    +SRR070372.2 FV5358E02FQJUJ length=177
    222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
    @SRR070372.3 FV5358E02GYL4S length=70
    TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCATAAGGG
    +SRR070372.3 FV5358E02GYL4S length=70
    @@@@@DFFFFFHHHHHHHFBB@FDDBBBB=?::5555BBBBD??@?DFFHHFDDDDFFFDDBBBB<<410
    $ dx download small-celegans-sample.fastq
    [                                                            ] Downloaded 0 byte
    [===========================================================>] Downloaded 16.02 of
    [===========================================================>] Completed 16.02 of 16.02 bytes (100%) small-celegans-sample.fastq
    dx upload small-celegans-sample.fastq --wait
    $ dx find apps
    ...
    x BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
    ...
    $ dx install bwa_mem_fastq_read_mapper
    Installed the bwa_mem_fastq_read_mapper app
    $ dx find apps --installed
    BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
    $ dx run bwa_mem_fastq_read_mapper
    Entering interactive mode for input selection.
    
    Input:   Reads (reads_fastqgz)
    Class:   file
    Enter file ID or path (<TAB> twice for compatible files in current directory,'?' for help)
    reads_fastqgz[0]: <small-celegans-sample.fastq.gz>
    
    Input:   BWA reference genome index (genomeindex_targz)
    Class:   file
    
    Suggestions:
    project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes)
    Enter file ID or path (<TAB> twice for compatible files in current
    directory,'?' for more options)
    genomeindex_targz: <"Reference Genome Files\: <REGION_OF_PROJECT>:/C. Elegans - Ce10/ce10.bwa-index.tar.gz">
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
    [0] Reads (right mates) (reads2_fastqgz)
    [1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
    [2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
    [3] Read group platform (read_group_platform) [default="ILLUMINA"]
    [4] Read group platform unit (read_group_platform_unit) [default="None"]
    [5] Read group library (read_group_library) [default="1"]
    [6] Read group sample (read_group_sample) [default="1"]
    [7] Output all alignments for single/unpaired reads? (all_alignments)
    [8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
    [9] Advanced command line options (advanced_options)
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
        "reads_fastqgz": {
            "$dnanexus_link": {
                "project": "project-B3X8bjBqqBk1y7bVPkvQ0001",
                "id": "file-B3P6v02KZbFFkQ2xj0JQ005Y"
            }
        },
        "genomeindex_targz": {
            "$dnanexus_link": {
                "project": "project-xxxx(project ID for the reference genome in your region)",
                "id": "file-BQbYJpQ09j3x9Fj30kf003JG"
            }
        }
    }
    
    Confirm running the applet/app with this input [Y/n]: <ENTER>
    Calling app-BP2xVx80fVy0z92VYVXQ009j with output destination
         project-xxxx:/
    
    Job ID: job-xxxx
    $ dx find jobs
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main)(done) job-xxxx
    user-amy 20xx-xx-xx 0x:00:00 (runtime 0:00:xx)
    $ dx describe job-xxxx
    ...
    $ dx ls
    small-celegans-sample.bam
    small-celegans-sample.bam.bai
    small-celegans-sample.fastq
    $ dx describe small-celegans-sample.bam
    ...
    $ dx describe job-xxxx:sorted_bam
    ...
    usage: dx run freebayes [-iINPUT_NAME=VALUE ...]
    
    App: FreeBayes Variant Caller
    
    Version: 3.0.1 (published)
    
    Calls variants (SNPs, indels, and other events) using FreeBayes
    
    See the app page for more information:
      https://platform.dnanexus.com/app/freebayes
    
    Inputs:
      Sorted mappings: -isorted_bams=(file) [-isorted_bams=... [...]]
            One or more coordinate-sorted BAM files containing mappings to call
            variants for.
    
      Genome: -igenome_fastagz=(file)
            A file, in gzipped FASTA format, with the reference genome that the
            reads were mapped against.
    
            Suggestions:
              project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes: AWS US (East))
              project-F3zxk7Q4F30Xp8fG69K1Vppj://file-* (DNAnexus Reference Genomes: AWS Germany)
              project-F0yyz6j9Jz8YpxQV8B8Kk7Zy://file-* (DNAnexus Reference Genomes: Azure US (West))
              project-F4gXb605fKQyBq5vJBG31KGG://file-* (DNAnexus Reference Genomes: AWS Sydney)
              project-FGX8gVQB9X7K5f1pKfPvz9yG://file-* (DNAnexus Reference Genomes: Azure Amsterdam)
              project-GvGXBbk36347jYPxP0j755KZ://file-* (DNAnexus Reference Genomes: Bahrain)
    
      Target regions: [-itargets_bed=(file)]
            (Optional) A BED file containing the coordinates of the genomic
            regions to intersect results with. Supplying this will cause 'bcftools
            view -R' to be used, to limit the results to that subset. This option
            does not speed up the execution of FreeBayes.
    
            Suggestions:
              project-B6JG85Z2J35vb6Z7pQ9Q02j8:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS US (East))
              project-F3zqGV04fXX5j7566869fjFq:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Germany)
              project-F29g0xQ90fvQf5z1BX6b5106:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure US (West))
              project-F4gYG1850p1JXzjp95PBqzY5:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Sydney)
              project-FGXfq9QBy7Zv5BYQ9Yvqj9Xv:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure Amsterdam)
              project-GvGXBZk3f624QVfBPjB8916j:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Bahrain)
    
     Common
      Output prefix: [-ioutput_prefix=(string)]
            (Optional) The prefix to use when naming the output files (they will
            be called prefix.vcf.gz, prefix.vcf.gz.tbi). If not provided, the
            prefix will be the same as the first BAM file given.
    
      Apply standard filters?: [-istandard_filters=(boolean, default=true)]
            Select this to use stringent input base and mapping quality filters,
            which may reduce false positives. This will supply the
            '--standard-filters' option to FreeBayes.
    
      Normalize variants representation?: [-inormalize_variants=(boolean, default=true)]
            Select this to use 'bcftools norm' in order to normalize the variants
            representation, which may help with downstream compatibility.
    
      Perform parallelization?: [-iparallelized=(boolean, default=true)]
            Select this to parallelize FreeBayes using multiple threads. This will
            use the 'freebayes-parallel' script from the FreeBayes package, with a
            granularity of 3 million base pairs. WARNING: This option may be
            incompatible with certain advanced command-line options.
    
     Advanced
      Report genotype qualities?: [-igenotype_qualities=(boolean, default=false)]
            Select this to have FreeBayes report genotype qualities.
    
      Add RG tags to BAM files?: [-ibam_add_rg=(boolean, default=false)]
            Select this to have FreeBayes add read group tags to the input BAM
            files so each file will be treated as an individual sample. WARNING:
            This may increase the memory requirements for FreeBayes.
    
      Advanced command line options: [-iadvanced_options=(string)]
            (Optional) Advanced command line options that will be supplied
            directly to the FreeBayes program.
    
    Outputs:
      Variants: variants_vcfgz (file)
            A bgzipped VCF file with the called variants.
    
      Variants index: variants_tbi (file)
            A tabix index (TBI) file with the associated variants index.
    $ dx run freebayes -y \
     -igenome_fastagz=Reference\ Genome\ Files:/C.\ Elegans\ -\ Ce10/ce10.fasta.gz \
     -isorted_bams=job-xxxx:sorted_bam
    
    Using input JSON:
    {
      "genome_fastagz": {
        "$dnanexus_link": {
          "project": "project-xxxx",
          "id": "file-xxxx"
        }
      },
      "sorted_bams": {
        "field": "sorted_bam",
        "job": "job-xxxx"
      }
    }
    
    Calling app-BFG5k2009PxyvYXBBJY00BK1 with output destination
    project-xxxx:/
    
    Job ID: job-xxxx
    $ dx wait job-xxxx && dx find jobs
    Waiting for job-xxxx to finish running...
    Done
    * FreeBayes Variant Caller (done) job-xxxx
    user-amy 2017-01-01 09:00:00 (runtime 0:05:24)
    ...
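    #!/usr/bin/env bash
    # Usage: <script_name.sh> local_fastq_filename.fastq.gz
    # Assumes you are already logged in and have a project selected.
    set -euo pipefail
    
    reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
    bwa_indexed_reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.bwa-index.tar.gz"
    local_reads_file="${1:?usage: $0 local_fastq_filename.fastq.gz}"
    
    # Upload the reads; --brief prints only the new file ID.
    reads_file_id=$(dx upload "$local_reads_file" --brief)
    # Map the reads with BWA-MEM; --brief prints only the job ID.
    bwa_job=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs="$reads_file_id" -igenomeindex_targz="$bwa_indexed_reference" -y --brief)
    # Call variants using a job-based object reference to the BWA output.
    freebayes_job=$(dx run freebayes -isorted_bams="$bwa_job":sorted_bam -igenome_fastagz="$reference" -y --brief)
    
    # Block until variant calling finishes.
    dx wait "$freebayes_job"
    
    # Download and decompress the resulting VCF.
    dx download "$freebayes_job":variants_vcfgz -o "$local_reads_file".vcf.gz
    gunzip "$local_reads_file".vcf.gz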
    To get basic information on a job (the execution of an app or applet) or an analysis (the execution of a workflow):

    1. Click on Projects in the main Platform menu.

    2. On the Projects list page, find and click on the name of the project within which the execution was launched.

    3. Click on the Monitor tab to open the Monitor screen.

    4. The Monitor screen shows a list of executions launched within the project. By default, executions appear in reverse chronological order, with the most recently launched execution at the top.

    5. Find the row displaying information on the execution.

      • For an analysis (the execution of a workflow), click the "+" icon to the left of the analysis name to expand the row and view information on its stages. For executions with further descendants, click the "+" icon next to the name to expand the row and show additional details.

    6. To see additional information on an execution, click its name to open its details page.

      • The following shortcuts allow you to view information from the details page directly on the list page, or relaunch an execution:

        • To view the :

    Available Basic Information on Executions

    The list on the Monitor screen displays the following information for each execution that is running or has been run within the project:

    • Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either via the UI, or via the CLI. The execution's name is used in Platform email alerts related to the execution. Clicking on a name in the executions list opens the execution details page, giving in-depth information on the execution.

    • State - This is the execution's state. State values include:

      • "Waiting" - The execution awaits Platform resource allocation or completion of dependent executions.

      • "Running" - The job is actively executing.

      • "In Progress" - The analysis is actively processing.

      • "Done" - The execution completed successfully without errors.

      • "Failed" - The execution encountered an error and could not complete. See Getting Help with Failed Executions for troubleshooting assistance.

      • "Partially Failed" - One or more workflow stages did not finish successfully, and at least one stage is not yet in a terminal state ("Done," "Failed," or "Terminated").

      • "Terminating" - The worker has initiated but not completed the termination process.

      • "Terminated" - The execution stopped before completion.

      • "Debug Hold" - The execution, run with debugging options, encountered an applicable failure and entered debugging hold.

    • Executable - The executable or executables run during the execution. If the execution is an analysis, each stage appears in a separate row, including the name of the executable run during the stage. If an informational page exists with details about the executable's configuration and use, the executable name becomes clickable, and clicking displays that page.

    • Tags - Tags are strings associated with objects on the platform. They are a type of metadata.

    • Launched By - The name of the user who launched the execution.

    • Launched On - The time at which the execution was launched. This time often precedes the time in the Started Running column due to executions waiting for available resources before starting.

    • Started Running - The time at which the execution started running, if it has done so. This can be later than its launch time if the execution had to wait for available resources before starting.

    • Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.

    • Cost - A value appears in this column when the user has access to billing info for the execution. For a running execution, the figure is an estimate of the charges incurred so far. For a completed execution, it is the total cost incurred.

    • Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, either via the UI or via the CLI. This setting determines the scheduling priority of the execution relative to other executions that are waiting to be launched.

    • Worker URL - If the execution runs an executable with direct web URL connection capability, the URL appears here. Clicking the URL opens a connection to the executable in a new browser tab.

    • Output Folder - For each execution, the value shows a path relative to the project's root folder. Click the value to open the folder containing the execution's outputs.

    Additional Basic Information

    Additional basic information can be displayed for each execution. To do this:

    1. Click on the "table" icon at the right edge of the table header row.

    2. Select one or more of the entries in the list, to display an additional column or columns.

    Available additional columns include:

    • Stopped Running - The time at which the execution stopped running.

    • Custom properties columns - If custom properties have been assigned to any of the listed executions, you can add a column for each property, showing the value assigned to each execution.

    Customizing the Executions List Display

    To remove columns from the list, click on the "table" icon at the right edge of the table header row, then deselect the entries for the columns you want to hide.

    Filtering the Executions List

    A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.

    By default, pills are available to set search criteria for filtering executions by one or more of these attributes:

    • Name - Execution name

    • State - Execution state

    • ID - An execution's job ID or analysis ID

    • Executable - A specific executable

    • Launched By - The user who launched an execution or executions

    • Launch Time - The time range within which executions were launched

    Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.

    Search Scope

    By default, filters are set to display only root executions that meet the criteria defined in the filter. To include all executions, including those run during individual stages of workflows, click the button above the left edge of the executions list showing the default value "Root Executions Only," then click "All Executions."

    Saving and Reusing Filters

    To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.

    To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.

    Terminating an Execution from the Monitor Screen

    If you launched an execution or have contributor access to the project in which the execution is running, you can terminate the execution from the list on the Monitor screen when it is in a non-terminal state. You can also terminate executions launched by other project members if you have project admin status.

    To terminate an execution:

    1. Find the execution in the list, then do one of the following:

      • Click the row to select the execution, then click the red Terminate button that appears at the end of the header.

      • Hover over the row, click the "More Actions" button (three vertical dots) at the end of the row, then select Terminate in the menu.

    2. A modal window opens, asking you to confirm the termination. Click Terminate to confirm.

    3. The execution's state changes to "Terminating" during termination, then to "Terminated" once complete.

    Getting Detailed Information on an Execution via the UI

    For additional information about an execution, click its name in the list on the Monitor screen to open its details page.

    Available Detailed Information on Executions

    The details page for an execution displays a range of information:

    High-level details

    For a standalone execution, such as a job without subjobs, the display shows a single entry with details about the execution state, start and stop times, and duration in the running state.

    For an execution with descendants, such as an analysis with multiple stages, the display shows a list in which each row contains details about a stage's execution. Click the "+" icon next to a name to expand the row and view descendant information. Click a stage's name in the list to open a page with detailed information on that stage. To navigate back to the workflow's details page, click its name in the "breadcrumb" navigation menu in the top right corner of the screen.

    Execution state

    In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times compared to each other. The colors include:

    • Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.

    • Green - A green bar indicates that the execution is in the "Done" state.

    • Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.

    • Grey - A grey bar indicates that the execution is in the "Terminated" state.

    Execution start and stop times

    Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").
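
When scripting around execution states, it can help to encode which states are terminal. The sketch below uses the UI labels named above; is_terminal_state is a hypothetical helper, and state strings returned by the CLI or API may be spelled differently, so treat the labels as an assumption.

```shell
# is_terminal_state is a hypothetical helper, not part of dx.
# It succeeds (exit status 0) only for the terminal UI state labels.
is_terminal_state() {
  case "$1" in
    "Done"|"Failed"|"Terminated") return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
#   if is_terminal_state "Running"; then echo "finished"; else echo "still active"; fi
```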

    hashtag
    Inputs

    This section lists the execution inputs. Available input files appear as hyperlinks to their project locations. For inputs from other workflow executions, the source execution name appears as a hyperlink to its details page.

    hashtag
    Outputs

    This section lists the execution's outputs. Available output files appear as hyperlinks. Click a link to open the folder containing the output file.

    hashtag
    Log files

    An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files, and, as needed, download them in .txt format:

    • To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.

    • To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage, in the Execution Tree section.

    hashtag
    Basic info

    The Info pane, on the right side of the screen, displays a range of basic information on the execution, along with additional detail such as the execution's unique ID, and custom properties and tags assigned to it.

    hashtag
    Reused results

    For executions reusing results from another execution, the information appears in a blue pane above the Execution Tree section. Click the source execution's name to see details about the execution that generated these results.

    hashtag
    Getting Help with Failed Executions

    For failed executions, a Cause of Failure pane appears above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:

    1. Click the button labeled Send Failure Report to DNAnexus Support.

    2. A form opens in a modal window, with pre-populated Subject and Message fields containing diagnostic information for DNAnexus Support.

    3. Click the button in the Grant Access section to grant DNAnexus Support "View" access to the project, enabling faster issue diagnosis and resolution.

    4. Click Send Report to send the report.

    hashtag
    Launching a New Execution

    To re-launch a job from the execution details screen:

    1. Click the Launch as New Job button in the upper right corner of the screen.

    2. A new browser tab opens, displaying the Run App / Applet form.

    3. Configure the run, then click Start Analysis.

    To re-launch an analysis from the execution details screen:

    1. Click the Launch as New Analysis button in the upper right corner of the screen.

    2. A new browser tab opens, displaying the Run Analysis form.

    3. Configure the run, then click Start Analysis.

    hashtag
    Saving a Workflow as a New Workflow

    To save a copy of a workflow along with its input configurations under a new name from the execution details screen:

    1. Click the Save as New Workflow button in the upper right corner of the screen.

    2. In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.

    3. Click Save.

    hashtag
    Viewing Initial Tries for Restarted Jobs

    As described in job states, jobs can be configured to restart automatically on certain types of failures.

    To view the execution details for the initial tries of a restarted job:

    1. Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.

    2. A modal window opens.

    3. Click the name of the try for which you'd like to view execution details.

    You can only send a failure report for the most recent try, not for any previous tries.

    hashtag
    Monitoring a Job via the CLI

    You can use dx watch to view the log of a running job or of any past job, whether it finished successfully, failed, or was terminated.

    hashtag
    Monitoring a Running Job

    Use dx watch to view a job's log stream during execution. The log stream includes stdout, stderr, and additional worker output information.

    hashtag
    Terminating a Job

    To terminate a job before completion, use the command dx terminate.

    hashtag
    Monitoring Past Jobs

    Use the dx watch command to view the logs of completed jobs. The log stream includes stdout, stderr, and additional worker output information from the execution.

    hashtag
    Finding Executions via the CLI

    Use dx find executions to display the ten most recent executions in your current project. Specify a different number of executions by using dx find executions -n <specified number>. The output matches the information shown in the "Monitor" tab on the DNAnexus web UI.

    Below is an example of dx find executions. In this case, only two executions have been run in the current project. An individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow, are shown. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).

    The job running the DeepVariant Germline Variant Caller executable has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of two stages: FreeBayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.
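If you need the listing in a script, the entry lines can be picked apart with a regular expression. The following is a rough sketch that works on the sample listing from this page; the pattern `ENTRY` and the function `parse_listing` are hypothetical helpers, and the pattern is an assumption fitted to this sample rather than a description of a guaranteed output format.

```python
import re

# Sample `dx find executions` listing, copied from the example on this page.
LISTING = """\
* DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
  amy 2024-01-01 09:00:18 (running for 0:10:28)
* Variant Calling Workflow (in_progress) analysis-xxxx
│ amy 2024-01-01 09:00:18
├── * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
│     amy 2024-01-01 09:00:18
└── * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
      amy 2024-01-01 09:00:18 (running for 0:10:18)
"""

# Assumed line shape: "* <name> [(<executable>:<function>)] (<state>) <id>".
ENTRY = re.compile(
    r"\* (?P<name>.+?)"
    r"(?: \((?P<entry_point>[\w.-]+:\w+)\))?"
    r" \((?P<state>\w+)\)"
    r" (?P<id>(?:job|analysis)-\w+)"
)


def parse_listing(text: str) -> list:
    """Return one dict per execution line; the user/timestamp lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ENTRY.search(line)
        if match:
            rows.append(match.groupdict())
    return rows


for row in parse_listing(LISTING):
    print(row["id"], row["state"], row["name"])
```

For scripting against real output, the `--brief` and `--delim` flags described later on this page are likely a sturdier starting point than parsing the tree drawing.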

    hashtag
    Using dx find executions

    The dx find executions operation searches for jobs or analyses created when a user runs an app or applet. For jobs that are part of an analysis, the results appear in a tree representation linking related jobs together.

    By default, dx find executions displays up to ten of the most recent executions in your current project, ordered by creation time.

    Filter executions by job type using command flags: --origin-jobs shows only original jobs, while --all-jobs includes both original jobs and subjobs.

    hashtag
    Finding Analyses via the CLI

    You can monitor analyses by using the command dx find analyses, which displays the top-level analyses, excluding contained jobs. Analyses are executions of workflows and consist of one or more app(let)s being run.

    Below is an example of dx find analyses:

    hashtag
    Finding Jobs via the CLI

    Jobs are runs of an individual app(let) and compose analyses. Monitor jobs using the command dx find jobs to display a flat list of jobs. For jobs within an analysis, the command returns all jobs in that analysis.

    Below is an example of dx find jobs:

    hashtag
    Advanced CLI Monitoring Options

    Searches for executions can be restricted to specific parameters.

    hashtag
    Viewing stdout and/or stderr from a Job Log

    • To extract stdout only from this job, run the command dx watch job-xxxx --get-stdout.

    • To extract stderr only from this job, run the command dx watch job-xxxx --get-stderr.

    • To extract both stdout and stderr from this job, run the command dx watch job-xxxx --get-streams.

    Below is an example of viewing stdout lines of a job log:

    hashtag
    Viewing Subjobs

    To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
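The `dx watch` options described on this page can be assembled programmatically before being handed to a process runner. The sketch below only builds the argument list and does not run `dx`; the function `build_watch_command` and its parameter names are hypothetical, and whether the CLI accepts every combination of these options together is not checked here.

```python
import shlex
from typing import List, Optional

# Stream-selection flags named in the section above.
STREAM_FLAGS = {
    "stdout": "--get-stdout",
    "stderr": "--get-stderr",
    "both": "--get-streams",
}


def build_watch_command(
    job_id: str,
    stream: Optional[str] = None,      # "stdout", "stderr", or "both"
    tree: bool = False,                # include subjobs (--tree)
    last_n: Optional[int] = None,      # most recent n messages (-n)
    try_number: Optional[int] = None,  # a specific try of a restarted job (--try)
) -> List[str]:
    """Build a `dx watch` argument list from the options documented on this page."""
    cmd = ["dx", "watch", job_id]
    if stream is not None:
        cmd.append(STREAM_FLAGS[stream])
    if tree:
        cmd.append("--tree")
    if last_n is not None:
        cmd += ["-n", str(last_n)]
    if try_number is not None:
        cmd += ["--try", str(try_number)]
    return cmd


print(shlex.join(build_watch_command("job-xxxx", stream="stdout")))
print(shlex.join(build_watch_command("job-xxxx", tree=True)))
```

To execute a command built this way, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `dx` client is installed and logged in.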

    hashtag
    Viewing the Most Recent n Messages of a Job Log

    To view the most recent n messages from a job log, use the command dx watch job-xxxx -n <n>. For example, dx watch job-xxxx -n 8 shows the eight most recent messages. If the job has already finished, its output is displayed as well.

    In the example below, the app Sample Prints doesn't have any output.

    hashtag
    Finding and Examining Initial Tries for Restarted Jobs

    Jobs can be configured to restart automatically on certain types of failures as described in the Restartable Jobs section. To view initial tries of the restarted jobs along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.
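As a complement to the CLI commands, the Python sketch below builds the `--extra-args` value used in this page's `dx run` example and extracts the tries from a sample `dx find executions --include-restarted` listing. The helper `list_tries` and the pattern `TRY_LINE` are hypothetical, and the reading of `{"*": 2}` as "restart up to two times on any failure type" is an interpretation of the example rather than a full description of the execution policy.

```python
import json
import re

# JSON passed to `dx run --extra-args` in this page's example.
# Read here as: restart up to 2 times on any failure type ("*").
extra_args = json.dumps({"executionPolicy": {"restartOn": {"*": 2}}})

# Sample `dx find executions --include-restarted` output, copied from this page.
SAMPLE = """\
* Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx tries
├── * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx try 2
│     amy 2023-08-02 16:33:40 (runtime 0:01:45)
├── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 1
│     amy 2023-08-02 16:33:40
└── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
      amy 2023-08-02 16:33:40
"""

# Assumed shape of a per-try line: "... (<state>) <job id> try <number>".
TRY_LINE = re.compile(r"\((\w+)\) (job-\w+) try (\d+)$")


def list_tries(listing: str) -> list:
    """Return (try_number, state, job_id) tuples, ordered from the first try."""
    tries = []
    for line in listing.splitlines():
        match = TRY_LINE.search(line)
        if match:
            state, job_id, number = match.groups()
            tries.append((int(number), state, job_id))
    return sorted(tries)


print(extra_args)
for number, state, job_id in list_tries(SAMPLE):
    # Each try's log can then be inspected with: dx watch <job_id> --try <number>
    print(f"dx watch {job_id} --try {number}  # {state}")
```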

    hashtag
    Searching Across All Projects

    By default, dx find restricts searches to your current project context. Use the --all-projects flag to search across all accessible projects.

    hashtag
    Returning More Than Ten Results

    By default, dx find returns up to ten of the most recently launched executions matching your search query. Use the -n option to change the number of executions returned.

    hashtag
    Searching by Executable

    Use the --executable option to search only for executions of a specific app(let) or workflow, identified by its entity ID.

    hashtag
    Searching by Execution Start Time

    Users can also use the --created-before and --created-after options to search based on when the execution began.
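The examples on this page pass either calendar dates (such as `2024-01-01`) or negative relative values (such as `-2h` or `-5d`) to these options. To make the relative form concrete, the sketch below computes the cutoff time that such a value corresponds to. It is an illustration only: `relative_cutoff` is a hypothetical helper, it handles just the `h` and `d` suffixes that appear on this page, and it does not reproduce how the `dx` client itself parses these values.

```python
import re
from datetime import datetime, timedelta

# Only the suffixes used in this page's examples are handled here.
UNITS = {"h": "hours", "d": "days"}


def relative_cutoff(spec: str, now: datetime) -> datetime:
    """Translate a value like '-2h' or '-5d' into an absolute cutoff before `now`."""
    match = re.fullmatch(r"-(\d+)([hd])", spec)
    if match is None:
        raise ValueError(f"unsupported relative time: {spec!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{UNITS[unit]: amount})


now = datetime(2024, 1, 1, 9, 0, 0)  # hypothetical current time
print(relative_cutoff("-2h", now))   # executions created after this time match -2h
print(relative_cutoff("-5d", now))
```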

    hashtag
    Searching by Date

    hashtag
    Searching by Time

    hashtag
    Searching by Execution State

    Use the --state option to restrict the search to executions in a specific state, for example, "done", "failed", or "terminated".
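The search filters in this section can be combined in a single command. The following sketch assembles a `dx find` argument list from the options shown on this page; it does not run `dx`, and `build_find_command` with its parameter names is a hypothetical helper rather than part of the dx toolkit.

```python
import shlex
from typing import List, Optional


def build_find_command(
    kind: str = "executions",          # "executions", "analyses", or "jobs"
    state: Optional[str] = None,       # e.g. "done", "failed", "terminated"
    executable: Optional[str] = None,  # e.g. "app-deepvariant_germline"
    created_after: Optional[str] = None,
    created_before: Optional[str] = None,
    limit: Optional[int] = None,       # number of results (-n)
    all_projects: bool = False,
) -> List[str]:
    """Build a `dx find ...` argument list using the filters described on this page."""
    if kind not in ("executions", "analyses", "jobs"):
        raise ValueError(f"unexpected search kind: {kind!r}")
    cmd = ["dx", "find", kind]
    if state is not None:
        cmd += ["--state", state]
    if executable is not None:
        cmd += ["--executable", executable]
    if created_after is not None:
        cmd.append(f"--created-after={created_after}")
    if created_before is not None:
        cmd.append(f"--created-before={created_before}")
    if limit is not None:
        cmd += ["-n", str(limit)]
    if all_projects:
        cmd.append("--all-projects")
    return cmd


cmd = build_find_command(
    "jobs", state="failed", created_after="2024-01-01", created_before="2024-02-01"
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when it is later passed to `subprocess.run`.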

    hashtag
    Scripting

    hashtag
    Delimiters

    The --delim flag produces tab-delimited output, suitable for processing by other shell commands.
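Tab-delimited output is straightforward to split in a script. The sketch below shows the idea on a single sample line; the position of the tabs and the order of the fields in the sample are assumptions based on this page's example, and `split_delim_line` is a hypothetical helper.

```python
def split_delim_line(line: str) -> list:
    """Split one tab-delimited output line into its fields."""
    return [field.strip() for field in line.rstrip("\n").split("\t")]


# Sample line modeled on this page's `dx find jobs --delim` example.
# Where the tabs fall is an assumption made for illustration.
sample = (
    "* Cloud Workstation (cloud_workstation:main)\tdone\tjob-xxxx\tamy\t"
    "2024-01-07 09:00:00 (runtime 1:00:00)"
)

fields = split_delim_line(sample)
print(fields[1], fields[2])
```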

    hashtag
    Returning Only IDs

    Use the --brief flag to display only the object IDs for objects returned by your search query. The --origin-jobs flag excludes subjob information.
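Because `--brief` prints one ID per line, its output is easy to feed into follow-up commands. The sketch below turns such output into a list of IDs and prepares one `dx run ... --clone` command per job, the rerun form used elsewhere on this page. Both helper names are hypothetical, and nothing is executed.

```python
from typing import List


def parse_brief(output: str) -> List[str]:
    """Return the IDs from `--brief` output, one per non-empty line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def clone_commands(executable: str, job_ids: List[str]) -> List[List[str]]:
    """Build one `dx run <executable> --clone <job id>` argument list per job."""
    return [["dx", "run", executable, "--clone", job_id] for job_id in job_ids]


# Output shaped like this page's `dx find jobs -n 3 --brief` example.
brief_output = "job-xxxx\njob-yyyy\njob-zzzz\n"

for cmd in clone_commands("bwa_mem_fastq_read_mapper", parse_brief(brief_output)):
    print(" ".join(cmd))
```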

    Below is an example usage of the --brief flag:

    Below is an example that combines the --origin-jobs and --brief flags. It describes the last job run in the current default project.

    hashtag
    Rerunning Time-Specific Failed Jobs With Updated Instance Types

    hashtag
    Rerunning Failed Executions With an Updated Executable

    hashtag
    Getting Information on Jobs That Share a Tag

    See more on using dx find jobs.

    hashtag
    Forwarding Job Logs to Splunk for Analysis

    circle-info

    A license is required to use this feature. Contact DNAnexus Sales for more information.

    Job logs can be automatically forwarded to a customer's Splunk instance for analysis.

    $ dx watch job-xxxx
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (running) job-xxxx
      amy 2024-01-01 09:00:00 (running for 0:00:37)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
    2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
    2024-01-01 09:06:37 Sample Prints STDOUT 0
    ...
    $ dx watch job-xxxx
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (running) job-xxxx
      amy 2024-01-01 09:00:00 (running for 0:00:37)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
    2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
    2024-01-01 09:06:37 Sample Prints STDOUT 0
    2024-01-01 09:06:37 Sample Prints STDOUT 1
    2024-01-01 09:06:37 Sample Prints STDOUT 2
    2024-01-01 09:06:37 Sample Prints STDOUT 3
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:08:11 (runtime 0:02:11)
      Output: -
    $ dx find executions
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:00:18 (running for 0:10:28)
    * Variant Calling Workflow (in_progress) analysis-xxxx
    │ amy 2024-01-01 09:00:18
    ├── * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
    │     amy 2024-01-01 09:00:18
    └── * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
          amy 2024-01-01 09:00:18 (running for 0:10:18)
    $ dx find analyses
    * Variant Calling Workflow (in_progress) analysis-xxxx
      amy 2024-01-01 09:00:18
    $ dx find jobs
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:10:00 (running for 0:00:28)
    * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
      amy 2024-01-01 09:00:18
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
      amy 2024-01-01 09:00:18 (running for 0:10:18)
    $ dx watch job-xxxx --get-streams
    Watching job job-xxxx. Press Ctrl+C to stop.
    dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    Invoking main with {}
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    $ dx watch job-F5vPQg807yxPJ3KP16Ff1zyG -n 8
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:00:00 (runtime 0:02:11)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:08:11 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:08:11 Sample Prints INFO Setting SSH public key
    2024-01-01 09:08:11 Sample Prints dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    * Sample Prints (sample_prints:main) (done) job-F5vPQg807yxPJ3KP16Ff1zyG
      amy 2024-01-01 09:00:00 (runtime 0:02:11)
      Output: -
    $ dx run swiss-army-knife -icmd="exit 1" \
        --extra-args '{"executionPolicy": { "restartOn":{"*":2}}}'
    
    $ dx find executions --include-restarted
    * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx tries
    ├── * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx try 2
    │     amy 2023-08-02 16:33:40 (runtime 0:01:45)
    ├── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 1
    │     amy 2023-08-02 16:33:40
    └── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
          amy 2023-08-02 16:33:40
    
    $ dx watch job-xxxx --try 0
    Watching job job-xxxx try 0. Press Ctrl+C to stop watching.
    * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
      amy 2023-08-02 16:33:40
    2023-08-02 16:35:26 Swiss Army Knife INFO Logging initialized (priority)
    $ dx find executions -n 3 --all-projects
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:15:00 (runtime 0:02:11)
    * Sample Applet (sample_applet:main) (done) job-yyyy
      ben 2024-01-01 09:10:00 (runtime 0:00:28)
    * Sample Applet (sample_applet:main) (failed) job-zzzz
      amy 2024-01-01 09:00:00 (runtime 0:19:02)
    # Find the 100 most recently launched executions in your project
    $ dx find executions -n 100
    # Find most recent executions running app-deepvariant_germline in the current project
    $ dx find executions --executable app-deepvariant_germline
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:00:18 (running for 0:10:18)
    # Find executions run on January 2, 2024
    $ dx find executions --created-after=2024-01-01 --created-before=2024-01-03
    # Find executions created in the last 2 hours
    $ dx find executions --created-after=-2h
    # Find analyses created in the last 5 days
    $ dx find analyses --created-after=-5d
    # Find failed jobs in the current project
    $ dx find jobs --state failed
    $ dx find jobs --delim
    * Cloud Workstation (cloud_workstation:main) done  job-xxxx    amy   2024-01-07 09:00:00 (runtime 1:00:00)
    * GATK3 Human Exome Pipeline(gatk3_human_exome_pipeline:main)    done  job-yyyy amy 2024-01-07  09:00:00 (runtime 0:21:16)
    $ dx find jobs -n 3 --brief
    job-xxxx
    job-yyyy
    job-zzzz
    $ dx describe $(dx find jobs -n 1 --origin-jobs --brief)
    Result 1:
    ID                  job-xxxx
    Class               job
    Job name            BWA-MEM FASTQ Read Mapper
    Executable name     bwa_mem_fastq_read_mapper
    Project context     project-xxxx
    Billed to           amy
    Workspace           container-xxxx
    Cache workspace     container-yyyy
    Resources           container-zzzz
    App                 app-xxxx
    Instance Type       mem1_ssd1_x8
    Priority            high
    State               done
    Root execution      job-zzzz
    Origin job          job-zzzz
    Parent job          -
    Function            main
    Input               genomeindex_targz = file-xxxx
                    reads_fastqgz = file-xxxx
                    [read_group_library = "1"]
                    [mark_as_secondary = true]
                    [read_group_platform = "ILLUMINA"]
                    [read_group_sample = "1"]
                    [add_read_group = true]
                    [read_group_id = {"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
                    [read_group_platform_unit = "None"]
    Output              -
    Output folder       /
    Launched by         amy
    Created             Sun Jan  1 09:00:17 2024
    Started running     Sun Jan  1 09:00:10 2024
    Stopped running     Sun Jan  1 09:00:27 2024 (Runtime: 0:00:16)
    Last modified       Sun Jan  1 09:00:28 2024
    Depends on          -
    Sys Requirements    {"main": {"instanceType": "mem1_ssd1_x8"}}
    Tags                -
    Properties          -
    # Find failed jobs in the current project from a time period
    $ dx find jobs --state failed --created-after=2024-01-01 --created-before=2024-02-01
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
      amy 2024-01-22 09:00:00 (runtime 0:02:12)
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (done) job-yyyy
      amy 2024-01-07 06:00:00 (runtime 0:11:22)
    # Find all failed executions of specified executable
    $ dx find executions --state failed --executable app-bwa_mem_fastq_read_mapper
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
      amy 2024-01-01 09:00:00 (runtime 0:02:12)
    # Navigate to the app directory, then rebuild and update the app
    $ dx build -a
    INFO:dxpy:Archived app app-xxxx to project-xxxx:"/.App_archive/bwa_mem_fastq_read_mapper (Sun Jan  1 09:00:00 2024)"
    {"id": "app-yyyy"}
    # Rerun job with updated app
    $ dx run bwa_mem_fastq_read_mapper --clone job-xxxx
    $ dx find jobs --tag TAG

    Click the Info icon, above the right edge of the executions list, if it's not already selected, and then select the execution by clicking on the row.

  • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Info in the fly out menu.

  • To view the log file for a job, do either of the following:

    • Select the execution by clicking on the row. When a View Log button appears in the header, click it.

    • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Log in the fly out menu.

  • To re-launch a job, do either of the following:

    • Select the execution by clicking on the row. When a Launch as New Job button appears in the header, click it.

    • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row, then select Launch as New Job in the menu.

  • To re-launch an analysis, do either of the following:

    • Select the execution by clicking on the row. When a Launch as New Analysis button appears in the header, click it.

    • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Launch as New Analysis in the menu.

  • click on its name to be taken to its details page

    Tools List

    hashtag
    Public Tools

    hashtag
    Annotation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Data Transfer Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    DNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    GWAS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    File Transfer Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Imaging Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Import Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Interactive Analysis Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Joint Genotyping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Mapping Manipulation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    PheWAS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    PRS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    QC Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Quantification Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Read Mapping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Read Manipulation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    RNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    RNAseq Notebooks

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Utility Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Variant Calling Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Visualization Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Titan Tools

    circle-info

    A Titan license is required to access and use these tools. Contact DNAnexus Sales for more information.

    hashtag
    Statistics Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Apollo Tools

    circle-info

    An Apollo license is required to access and use these tools. Contact DNAnexus Sales for more information.

    hashtag
    Dataset Administration Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Dataset Management Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Data Science Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Third Party Tools

    Tools in this section are created and maintained by their respective vendors and may require separate licenses.

    hashtag
    DNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Joint Cohort Genotyping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    hashtag
    Read Mapping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    sra_fastq_importer

    Retrieve reads in FASTQ format from SRA

    Variant calling

    bwa_mem_fastq_read_mapper

    BWA-MEM

    Short read alignment

    gatk4_haplotypecaller_parallel

    Variant calling, post-alignment QC

    gatk4_genotypegvcfs_single_sample_parallel

    Variant calling

    picard_mark_duplicates

    Variant Calling- remove duplicates, post-alignment

    raremetal2

    Raremetalworker, Raremetal

    Meta-analysis of rare variants

    saige_gwas_gbat

    Gene and region-based association tests

    saige_gwas_svat

    Single variant association tests

    saige_gwas_grm

    Null model fitting using a GRM for SAIGE GWAS

    saige_gwas_sparse_grm

    Sparse GRM construction for SAIGE GWAS

    plink_pipeline

    Plink2

    plato_pipeline

    Plato, Plink2

    locuszoom

    LocusZoom

    GWAS, visualization

    Visualization and image analysis

    3d_slicer_monai

    ,

    Interactive radiology image analysis

    Fetches a file from a URL onto the DNAnexus Platform

    Unix shell on a Platform cloud worker in your browser. Use it for on-demand CLI operations and to launch httpsApp-enabled apps or applets on 2 extra ports

    Sort alignment result based on coordinates

    QC

    rnaseqc

    Transcriptomics Expression Quantification

    fastqc

    FastQC

    Transcriptomics Expression Quantification

    Short read alignment

    bwa_fasta_indexer

    BWA- bwa index

    Building reference for BWA alignment

    bwa_mem_fastq_read_mapper

    BWA-MEM

    Short read alignment

    star_generate_genome_index

    STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    RNA Seq- indexing

    star_mapping

    STAR (Spliced Transcripts Alignment to a Reference)

    RNA Seq- mapping

    subread_feature_counts

    featureCounts

    Read summarization, RNAseq

    salmon_index_builder

    Salmon

    Transcriptomics Expression Quantification

    salmon_mapping_quant

    Salmon

    Transcriptomics Expression Quantification

    QC

    trimmomatic

    Read quality trimming, adapter trimming

    RNA Seq- indexing

    star_mapping

    STAR (Spliced Transcripts Alignment to a Reference)

    RNA Seq- mapping

    subread_feature_counts

    featureCounts

    Read summarization, RNAseq

    star_generate_genome_index

    STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    Transcriptomics Expression Quantification

    star_mapping

    STAR (Spliced Transcripts Alignment to a Reference)

    Transcriptomics Expression Quantification

    salmon_index_builder

    Salmon

    Transcriptomics Expression Quantification

    salmon_mapping_quant

    Salmon

    Transcriptomics Expression Quantification

    salmon_quant

    Salmon

    Transcriptomics Expression Quantification

    Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb

    WGCNA, topGO

    Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb

    GENIE3

    swiss-army-knife

    bcftools, bedtools, bgzip, plink, sambamba, SAMtools, seqtk, tabix, vcflib, Plato, QCTool, vcftools

    Data processing tools

    ttyd

    N/A

    Unix shell on a platform cloud worker in your browser. Use it for on-demand CLI operations and to launch httpsApp-enabled apps or applets on 2 extra ports

    Variant calling, post-alignment QC

    gatk4_genotypegvcfs_single_sample_parallel

    Variant calling

    picard_mark_duplicates

    Variant Calling- remove duplicates, post-alignment

    freebayes

    Use for short variant calls

    gatk4_mutect2_variant_caller_and_filter

    Somatic variant calling and post calling filtering

    gatk4_somatic_panel_of_normals_builder

    Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2.

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in spark databases and tables

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, nipype, freesurfer, FSL

    Running imaging processing related analysis

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, keras, scikit-learn, TensorFlow, torch

    Running image processing related analysis, building and testing models and algorithms in an interactive way

    Dataset Extension

    Dynamic SQL Execution

    table-exporter

    N/A

    Data Extraction

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    dxjupyterlab_spark_cluster

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    WGS, WES, accelerated analysis

    pbgermline

    BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration

    WGS, WES, accelerated analysis

    sentieon-tnbam

    Sentieon's BAM to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    pbdeepvariant

    Deepvariant

    Variant calling, accelerated analysis

    sentieon-umi

    Sentieon's pre-processing and alignment pipeline for next-generation sequence

    WGS, WES, accelerated analysis

    sentieon-dnabam

    Sentieon's BAM to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    sentieon-joint_genotyping

    Sentieon GVCFtyper

    WGS, WES, accelerated analysis

    sentieon-ccdg

    Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline

    WGS, WES, accelerated analysis

    sentieon-dnaseq

    Sentieon's FASTQ to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    WGS, WES, accelerated analysis

    pbgermline

    BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration

    WGS, WES, accelerated analysis

    pbdeepvariant

    Deepvariant

    Variant calling, accelerated analysis

    sentieon-umi

    Sentieon's pre-processing and alignment pipeline for next-generation sequence

    WGS, WES, accelerated analysis

    sentieon-ccdg

    Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline

    WGS, WES, accelerated analysis

    sentieon-dnaseq

    Sentieon's FASTQ to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    SnpEff Annotate

    snpeff_annotate

    SnpEff

    Annotation

    SnpSift Annotate

    snpsift_annotate

    SnpSift

    Annotation

    AWS S3 Exporter

    aws_platform_to_s3_file_transfer

    AWS S3

    AWS S3 Importer

    aws_s3_to_platform_files

    Original Quality Functionally Equivalent

    oqfe

    A revision of Functionally Equivalent

    WGS, WES- alignment and duplicate marking

    GATK4 Base Quality Score Recalibrator (Parallel Per-Chrom)

    gatk4_bqsr_parallel

    REGENIE

    regenie

    REGENIE

    GWAS

    PLINK GWAS

    plink_gwas

    URL Fetcher

    url_fetcher

    N/A

    Fetches a file from a URL onto the DNAnexus Platform

    Imaging Multitool - MONAI

    imaging_multitool_monai

    MONAI Core

    Image processing

    3D Slicer

    3d_slicer

    SRA FASTQ Importer

    sra_fastq_importer

    SRA tools: fasterq-dump

    Retrieve reads in FASTQ format from SRA

    URL Fetcher

    url_fetcher

    Cloud Workstation

    cloud_workstation

    N/A

    SSH-accessible Unix shell on a Platform cloud worker. Use it for on-demand analysis of Platform data.

    ttyd

    ttyd

    GLnexusarrow-up-right

    glnexus

    GLnexus

    This app can also be used to create pVCF without running joint genotyping

    SAMtools Mappings Indexerarrow-up-right

    samtools_index

    SAMtoolsarrow-up-right - SAMtools index

    Building BAM index file

    SAMtools Mappings Sorterarrow-up-right

    samtools_sort

    PHESANTarrow-up-right

    phesant

    PHESANT

    PheWAS

    PRSice 2arrow-up-right

    prsice2

    PRSice-2

    Polygenic risk scores

    MultiQCarrow-up-right

    multiqc

    MultiQC

    QC reporting

    Qualimap2 Analysisarrow-up-right

    qualimap2_anlys

    Salmon Quantificationarrow-up-right

    salmon_quant

    Salmon

    Transcriptomics Expression Quantification

    Salmon Mapping and Quantificationarrow-up-right

    salmon_mapping_quant

    Bowtie2 FASTA Indexerarrow-up-right

    bowtie2_fasta_indexer

    bowtie2: bowtie2-build

    Building reference for Bowtie2 alignment

    Bowtie2 FASTQ Read Mapperarrow-up-right

    bowtie2_fastq_read_mapper

    GATK4 Base Quality Score Recalibrator (Parallel Per-Chrom)arrow-up-right

    gatk4_bqsr_parallel

    GATK4 - BaseRecalibratorarrow-up-right and ApplyBQSRarrow-up-right

    Variant calling

    Flexbar FASTQ Read Trimmerarrow-up-right

    flexbar_fastq_read_trimmer

    RNASeQCarrow-up-right

    rnaseqc

    RNASeQC 2arrow-up-right

    Transcriptomics Expression Quantification

    STAR Generate Genome Indexarrow-up-right

    star_generate_genome_index

    Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynbarrow-up-right

    Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb

    DESeq2

    Transcript_Expression_Part-03_Analysis-GSEA_R.ipynbarrow-up-right

    Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb

    File Concatenatorarrow-up-right

    file_concatenator

    N/A

    Gzip File Compressorarrow-up-right

    gzip

    CNVkitarrow-up-right

    cnvkit_batch

    CNVkitarrow-up-right

    Copy Number Variant

    GATK4 HaplotypeCaller (Parallel Per-IntervalByNs)arrow-up-right

    gatk4_haplotypecaller_parallel

    LocusZoomarrow-up-right

    locuszoom

    LocusZoom

    GWAS, visualization

    JupyterLab with MLarrow-up-right

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, cntk, keras, scikit-learn, TensorFlow, torch

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Python_Rarrow-up-right

    dxjupyterlab

    Data Model Loaderarrow-up-right

    data_model_loader_v2

    Dataset Creation

    Dataset Creation

    Dataset Extenderarrow-up-right

    dataset-extender

    CSV Loaderarrow-up-right

    csv-loader

    N/A

    Data Loading

    Spark SQL Runnerarrow-up-right

    spark-sql-runner

    JupyterLab with Spark Cluster with Glowarrow-up-right

    dxjupyterlab_spark_cluster

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, Glow

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Spark Cluster with Hailarrow-up-right

    dxjupyterlab_spark_cluster

    Sentieon somatic FASTQ to VCFarrow-up-right

    sentieon-tnseq

    Sentieon's FASTQ to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    Sentieon BWA-MEM and Sentieon De-duplicationarrow-up-right

    sentieon-bwa

    Sentieon distributed Joint Genotypingarrow-up-right

    sentieon-joint_genotyping

    Sentieon GVCFtyper

    WGS, WES, accelerated analysis

    Sentieon somatic FASTQ to VCFarrow-up-right

    sentieon-tnseq

    Sentieon's FASTQ to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    Sentieon BWA-MEM and Sentieon De-duplicationarrow-up-right

    sentieon-bwa

    DNAnexus Salesenvelope
    DNAnexus Salesenvelope

    AWS S3

    GATK4 - and

    PLINK2

    N/A

    N/A

    - SAMtools sort

    Qualimap2

    Salmon

    bowtie2, SAMtools view, SAMtools sort, SAMtools index

    (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    WebGestaltR

    gzip

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc

    Dataset Extension

    Spark SQL

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc

    Sentieon's FASTQ to BAM/CRAM pipeline

    Sentieon's FASTQ to BAM/CRAM pipeline

    ,
    plink2
    ,
    Picard
    ,
    REGENIE
    ,
    BOLT-LMM
    ,
    BGEN
    ,
    incidence
    ,
    MendelianRandomization
    ,
    outbreaks
    ,
    prevalence
    ,
    incidence
    ,
    MendelianRandomization
    ,
    outbreaks
    ,
    prevalence
    ,
    Stata
    ,
    stata_kernel
    ,
    monai
    ,
    monailabel
    ,
    3D Slicer
    ,
    koalas
    ,
    pyarrow
    ,
    bokeh
    ,
    vep
    ,
    BiocManager
    ,
    coloc
    ,
    epiR
    ,
    yprcoloc
    ,
    incidence
    ,
    MendelianRandomization
    ,
    outbreaks
    ,
    prevalence
    ,
    sparklyr
    ,
    HAIL
    ,
    koalas
    ,
    pyarrow
    ,
    bokeh
    ,
    vep
    ,
    BiocManager
    ,
    coloc
    ,
    epiR
    ,
    yprcoloc
    ,
    incidence
    ,
    MendelianRandomization
    ,
    outbreaks
    ,
    prevalence
    ,
    sparklyr
    ,
    HAIL
    , Ensembl Variant Effect Predictor
    SRA FASTQ Importerarrow-up-right
    SRA toolkit fasterq-dumparrow-up-right
    BaseRecalibratorarrow-up-right
    ApplyBQSRarrow-up-right
    BWA-MEM FASTQ Read Mapperarrow-up-right
    GATK4 HaplotypeCaller (Parallel Per-IntervalByNs)arrow-up-right
    GATK4- HaplotypeCaller modulearrow-up-right
    GATK4 Single-Sample GenotypeGVCFs (Parallel Per-Chrom)arrow-up-right
    GATK4- GenotypeGVCFs modulearrow-up-right
    Picard MarkDuplicates Mappings Deduplicatorarrow-up-right
    MarkDuplicates from the Picard suite of toolsarrow-up-right
    Raremetal2 - Raremetal and Raremetalworkerarrow-up-right
    SAIGE GWAS - Gene and region based association testsarrow-up-right
    SAIGEarrow-up-right
    SAIGE GWAS - Single variant association testsarrow-up-right
    SAIGEarrow-up-right
    SAIGE GWAS GRMarrow-up-right
    SAIGEarrow-up-right
    SAIGE-GWAS Sparse GRMarrow-up-right
    SAIGEarrow-up-right
    PLINK GWAS Pipelinearrow-up-right
    PLATO GWAS Pipelinearrow-up-right
    LocusZoomarrow-up-right
    3D Slicerarrow-up-right
    3D Slicer with MONAI Labelarrow-up-right
    3D Slicerarrow-up-right
    MONAI Labelarrow-up-right
    SAMtoolsarrow-up-right
    RNASeQCarrow-up-right
    RNASeQC 2arrow-up-right
    FastQC Reads Quality Controlarrow-up-right
    BWA FASTA Indexerarrow-up-right
    BWA-MEM FASTQ Read Mapperarrow-up-right
    STAR Generate Genome Indexarrow-up-right
    STARarrow-up-right
    STAR Mappingarrow-up-right
    STARarrow-up-right
    Subread featureCountsarrow-up-right
    Salmon Index Builderarrow-up-right
    Salmon Mapping and Quantificationarrow-up-right
    flexbararrow-up-right
    Trimmomaticarrow-up-right
    trimmomaticarrow-up-right
    STARarrow-up-right
    STAR Mappingarrow-up-right
    STARarrow-up-right
    Subread featureCountsarrow-up-right
    STAR Generate Genome Indexarrow-up-right
    STARarrow-up-right
    STAR Mappingarrow-up-right
    STARarrow-up-right
    Salmon Index Builderarrow-up-right
    Salmon Mapping and Quantificationarrow-up-right
    Salmon Quantificationarrow-up-right
    Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynbarrow-up-right
    Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynbarrow-up-right
    Swiss Army Knifearrow-up-right
    ttydarrow-up-right
    GATK4- HaplotypeCaller modulearrow-up-right
    GATK4 Single-Sample GenotypeGVCFs (Parallel Per-Chrom)arrow-up-right
    GATK4- GenotypeGVCFs modulearrow-up-right
    Picard MarkDuplicates Mappings Deduplicatorarrow-up-right
    MarkDuplicates from the Picard suite of toolsarrow-up-right
    FreeBayes Variant Callerarrow-up-right
    FreeBayesarrow-up-right
    GATK4 Mutect2 Variant Caller and Filterarrow-up-right
    GATK Mutect2arrow-up-right
    GATK4 Somatic Panel Of Normals Builderarrow-up-right
    GATK CreateSomaticPanelOfNormalsarrow-up-right
    JupyterLab with Stataarrow-up-right
    JupyterLab with IMAGE_PROCESSINGarrow-up-right
    JupyterLab with MONAI_MLarrow-up-right
    Table Exporterarrow-up-right
    JupyterLab with Spark Cluster with Hail and VEParrow-up-right
    Germline Pipeline (NVIDIA Clara Parabricks accelerated)arrow-up-right
    Sentieon somatic BAM/CRAM to VCFarrow-up-right
    DeepVariant Pipeline (Parabricks accelerated)arrow-up-right
    Sentieon UMIarrow-up-right
    Sentieon germline BAM/CRAM to VCFarrow-up-right
    Sentieon distributed Joint Genotypingarrow-up-right
    Sentieon Functional equivalent protocolarrow-up-right
    Sentieon germline FASTQ to VCFarrow-up-right
    Germline Pipeline (NVIDIA Clara Parabricks accelerated)arrow-up-right
    DeepVariant Pipeline (Parabricks accelerated)arrow-up-right
    Sentieon UMIarrow-up-right
    Sentieon Functional equivalent protocolarrow-up-right
    Sentieon germline FASTQ to VCFarrow-up-right

    Running Nextflow Pipelines

    This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.

    circle-info

    A license is required to create a DNAnexus app or applet from the Nextflow script folder. Contact DNAnexus Sales for more information.

    This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.

    To run a Nextflow pipeline on the DNAnexus Platform:

    1. Import the pipeline script from a remote repository or local disk.

    2. Convert the script to an app or applet.

    3. Run the app or applet.

    You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx command-line client.

    circle-info

    Use the latest version of dx-toolkit to take advantage of recent improvements and bug fixes.

    As of dx-toolkit version v0.391.0, pipelines built using dx build --nextflow default to running on Ubuntu 24.04. To use Ubuntu 20.04 instead, override the default by specifying the release in --extra-args:
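    The original command example did not survive in this export. As a minimal sketch, the override might be assembled as follows. The "runSpec" -> "release" key is an assumption based on the applet run specification and should be checked against dx build --help. The helper name build_nextflow_command and the ./hello folder are hypothetical.

```python
import json
import shlex


def build_nextflow_command(pipeline_dir, ubuntu_release="20.04"):
    """Return an argv list for `dx build --nextflow` pinning the Ubuntu release.

    Assumption: the release is overridden through the "runSpec" -> "release"
    key of the JSON document passed to --extra-args.
    """
    extra_args = {"runSpec": {"release": ubuntu_release}}
    return ["dx", "build", "--nextflow", pipeline_dir,
            "--extra-args", json.dumps(extra_args)]


argv = build_nextflow_command("./hello")
# Print a shell-quoted command line that can be reviewed before running it.
print(shlex.join(argv))
```

    Building the JSON with json.dumps rather than by hand avoids quoting mistakes when the command is later passed to a shell.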

    hashtag
    Quickstart

    hashtag
    Pipeline Script Folder Structure

    A Nextflow pipeline script is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

    • (Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.

    • (Optional) A nextflow.config file.

    An nf-core flavored folder structure is encouraged but not required.

    hashtag
    Importing a Nextflow Pipeline

    hashtag
    Import via UI

    To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project's Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.

    Once the Import Pipeline/Workflow modal appears, enter the URL of the repository where the Nextflow pipeline source code resides, for example, the nextflow-io/hello repository on GitHub. Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.

    An example of the Import Pipeline/Workflow modal:

    Click the Start Import button after providing the necessary information. This starts a pipeline import job in the project specified in the Import To field (default is the current project).

    After launching the import job, a status message "External workflow import job started" appears.

    Access information about the pipeline import job in the project's Monitor tab:

    After the import finishes, the imported pipeline executable exists as an applet. This is the output of the pipeline import job:

    The newly created Nextflow pipeline applet appears in the project, for example, hello.

    hashtag
    Import via CLI from a Remote Repository

    To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository's URL. You can also provide optional information, such as a repository tag and an import destination:

    circle-info

    Use the latest version of dx-toolkit to take advantage of recent improvements and bug fixes.

    All versions beginning with v0.338.0 support converting Nextflow pipelines to apps or applets.

    This documentation covers features available in dx-toolkit versions beginning with v0.370.0.

    circle-info

    Your destination project's billTo feature needs to be enabled for Nextflow pipeline applet building. Contact DNAnexus Sales for more information.

    For Nextflow pipelines stored in private repositories, access requires credentials provided via the --git-credentials option with a DNAnexus file containing your authentication details. Specify the file using either its qualified ID or its path on the Platform. See the Private Nextflow Pipeline Repository section for more details on setting up and formatting these credentials.

    Once the pipeline import job finishes, it generates a new Nextflow pipeline applet with an applet ID in the form applet-zzzz.

    Use dx run -h to get more information about running the applet:

    hashtag
    Building from a Local Disk

    Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the nextflow-io/hello pipeline from the Nextflow GitHub repository on your local laptop, stored in a directory named hello, which contains the following files:

    Ensure that the folder structure is in the required format, as described in Pipeline Script Folder Structure.

    To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:

    circle-info

    Your destination project's billTo feature needs to be enabled for Nextflow pipeline applet building. Contact DNAnexus Sales for more information.

    This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.

    Run dx run applet-yyyy -h to see information about this applet, similar to the example above.

    A Nextflow pipeline applet has a type nextflow under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.

    For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and find the Nextflow section, which lists all arguments supported for building a Nextflow pipeline applet.

    hashtag
    Building a Nextflow Pipeline App from a Nextflow Pipeline Applet

    You can also build a Nextflow pipeline app from an existing Nextflow pipeline applet by running the command: dx build --app --from applet-xxxx.

    hashtag
    Running a Nextflow Pipeline Executable (App or Applet)

    hashtag
    Running a Nextflow Pipeline Executable via UI

    You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab is displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.

    hashtag
    Running a Nextflow Pipeline Applet via CLI

    To run the Nextflow pipeline applet or app, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:

    You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:

    hashtag
    Monitoring Jobs

    Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob handles a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).

    Monitor the detailed log of the head job and the subjobs through each job's DNAnexus log via the UI or the CLI.

    circle-exclamation

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.

    hashtag
    Monitoring in the UI

    Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. The costs (when your account has permission) and resource usage of a job are also viewable.

    An example of the log of a head job:

    An example of the log of a subjob:

    hashtag
    Monitoring in the CLI

    From the CLI, you can use the dx watch command to check the status and view the log of the head job or each subjob.

    Monitoring the head job:

    Monitoring a subjob:

    hashtag
    Advanced Options: Running a Nextflow Pipeline Executable (App or Applet)

    hashtag
    Nextflow Execution on DNAnexus

    The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor, and multiple subjobs each running a single process. Throughout the pipeline's execution, the head job remains in the "running" state and supervises the job tree's execution.

    hashtag
    Nextflow Execution Log File

    When a Nextflow head job (job-xxxx) enters its terminal state, either "done" or "failed", the system writes a Nextflow log file named nextflow-<job-xxxx>.log to the output destination of the head job.

    hashtag
    Private Docker Repository

    DNAnexus supports Docker container engines for the Nextflow pipeline execution environment. The pipeline developer may refer to a public Docker repository or a private one. When the pipeline is referencing a private Docker repository, you should provide your Docker credential file as a file input of docker_creds to the Nextflow pipeline executable when launching the job tree.

    Syntax of a private Docker credential:

    Store this credential file in a separate project with restricted access permissions for security.

    hashtag
    Nextflow Pipeline Executable Inputs and Outputs

    hashtag
    Specifying Input Values to a Nextflow Pipeline Executable

    Below are all the ways you can specify an input value at build time and run time. They are listed in order of precedence (items listed first have greater precedence and override items listed further down the list):

    1. Executable (app or applet) run time

      1. DNAnexus Platform app or applet input.

        • CLI example: dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy

    hashtag
    Formats of PATH to File, Folder, or Wildcards

    While you can specify a file input parameter's value in different places, as seen above, the valid PATH format referring to the same file differs. It depends on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable's input parameter. Examples of this are given below.

    Scenarios
    Valid PATH format

    hashtag
    Specifying a Nextflow Job Tree Output Folder

    When launching a DNAnexus job, you can specify a job-level output destination such as project-xxxx:/destination/ using the platform-level optional destination parameter in the UI or on the CLI. For pipelines with publishDir settings, each output file is saved to <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned by the Nextflow script's process.

    Read more about the output folder specification and publishDir in the sections that follow. An example of constructing the output paths of an nf-core pipeline job tree at run time is documented separately.
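    The <dx_run_path>/<publishDir>/ rule can be illustrated with a small path-joining sketch. It assumes publishDir is treated as a path relative to the job-level destination. The function name published_path and the example file name are hypothetical and not part of any DNAnexus API.

```python
import posixpath


def published_path(dx_run_path, publish_dir, filename):
    """Join <dx_run_path>/<publishDir>/<filename> into one Platform path.

    Assumption: publishDir is interpreted relative to the job-level
    output destination, so leading and trailing slashes are dropped.
    """
    project, _, folder = dx_run_path.partition(":")
    full = posixpath.join(folder or "/", publish_dir.strip("/"), filename)
    return f"{project}:{posixpath.normpath(full)}"


example = published_path(
    "project-xxxx:/destination/", "results/fastqc", "sample_fastqc.html"
)
print(example)
```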

    hashtag
    Using an AWS S3 Bucket as a Work Directory for Nextflow Pipeline Runs

    You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.

    hashtag
    Step 1. Configure Your AWS Account to Trust the DNAnexus Platform as an OIDC Identity Provider

    Follow the documented steps to configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to note the value entered in the "Audience" field. This value is required in a configuration file used by your pipeline to enable pipeline runs to access the S3 bucket.

    hashtag
    Step 2. Configure an AWS IAM Role with the Proper Trust and Permissions Policies

    Next, configure an IAM role such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket.

    hashtag
    Permissions Policy

    The following example shows how to structure an IAM role's permission policy, to enable the role to use an S3 bucket - accessible via the S3 URI s3://my-nextflow-s3-workdir - as the work directory of Nextflow pipeline runs:

    In the above example:

    • The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.

    • The two entries in the list in the "Resource" section enable the role to access all resources in the bucket accessible via the S3 URI s3://my-nextflow-s3-workdir.
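    The policy example itself is missing from this export. The following sketch builds a policy document with the shape described in the two points above. The exact action list is an assumption: these are the standard S3 action names for deleting, getting, listing, and putting objects, and the original example may include others. The function name s3_workdir_policy is hypothetical.

```python
import json


def s3_workdir_policy(bucket):
    """Build a permissions policy granting object access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed action list: delete, get, list, and put.
                "Action": [
                    "s3:DeleteObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                ],
                # Two entries: the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


policy = s3_workdir_policy("my-nextflow-s3-workdir")
print(json.dumps(policy, indent=2))
```

    The bucket-level ARN is needed for listing, while the /* entry covers the objects themselves, which is why the "Resource" list has two entries.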

    hashtag
    Trust Policy

    The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:

    In the above example:

    • To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).

    • To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).

    • Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, as accessible at

    hashtag
    Step 3. Configure Your Nextflow Pipeline's Configuration File to Access the S3 Bucket

    Next, configure your pipeline so that it can access the S3 bucket when it runs. To do this, add a dnanexus config scope to a configuration file, including the properties shown in this example:

    In the above example:

    • workDir is the path to the bucket to be used as a work directory, in S3 URI format.

    • jobTokenAudience is the value of "Audience" you defined in Step 1 above.

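    • jobTokenSubjectClaims is a comma-separated list of the custom subject claims a job must present, in the same order in which they appear in the role's trust policy.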

    hashtag
    Using Subject Claims to Control Bucket Access

    When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:

    Having included custom subject claims in the trust policy for the role, you then need to set the value of jobTokenSubjectClaims in the Nextflow configuration file to a comma-separated list of claims, entered in the same order in which you entered them in the trust policy.

    For example, if you configured a role's trust policy per the example above, a job must present the custom subject claims project_id and launched_by, in that order, to assume the role. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, as follows:

    Within the dnanexus config scope, you must also set the value of iamRoleArnToAssume to that of the appropriate role:
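    The configuration snippets are missing from this export. As a sketch, the four properties named in this section can be rendered into a dnanexus config scope as shown here. The audience value and the role ARN are placeholders, and the helper name render_dnanexus_scope is hypothetical.

```python
def render_dnanexus_scope(work_dir, audience, subject_claims, role_arn):
    """Render a Nextflow `dnanexus` config scope as text."""
    # Claims are joined in the given order, which must match the trust policy.
    claims = ",".join(subject_claims)
    lines = [
        "dnanexus {",
        f"    workDir = '{work_dir}'",
        f"    jobTokenAudience = '{audience}'",
        f"    jobTokenSubjectClaims = '{claims}'",
        f"    iamRoleArnToAssume = '{role_arn}'",
        "}",
    ]
    return "\n".join(lines)


config_text = render_dnanexus_scope(
    work_dir="s3://my-nextflow-s3-workdir",
    audience="my-audience",  # placeholder for the "Audience" value from Step 1
    subject_claims=["project_id", "launched_by"],
    role_arn="arn:aws:iam::123456789012:role/my-nextflow-role",  # placeholder
)
print(config_text)
```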

    hashtag
    Advanced Options: Building a Nextflow Pipeline Executable

    hashtag
    Nextflow Pipeline Executable Permissions

    By default, the Platform restricts what apps and applets can access at run time. Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:

    • External internet access ("network": ["*"]) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external Docker registries at runtime.

    • UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - This is required in order for Nextflow pipeline jobs to record the progress of executions, and preserve the run cache, to enable resume functionality.

    You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building the executable, using the --extra-args flag with dx build. An example:

    Here are the key points:

    • "network": [] prevents jobs from accessing the internet.

    • "allProjects":"VIEW" increases jobs' access permission level to VIEW. This means that each job has "read" access to projects that can be accessed by the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs (via a samplesheet, for example) from projects other than the one in which a job is being run.
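    The two overrides above might be combined into a single --extra-args value as sketched below. Placing the settings under a top-level "access" key is an assumption, based on how permissions are declared in a dxapp.json file. The helper name access_extra_args is hypothetical.

```python
import json

# Defaults described above for Nextflow pipeline executables.
DEFAULT_ACCESS = {"network": ["*"], "project": "UPLOAD"}


def access_extra_args(overrides):
    """Merge overrides into the default access settings and return JSON text.

    Assumption: permissions sit under the top-level "access" key, as in a
    dxapp.json file.
    """
    access = {**DEFAULT_ACCESS, **overrides}
    return json.dumps({"access": access})


extra_args = access_extra_args({"network": [], "allProjects": "VIEW"})
print(extra_args)
```

    The resulting string is intended as the value of the --extra-args flag. Spelling out the unchanged "project" default keeps the full permission set visible in one place.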

    hashtag
    Advanced Building and Importing Pipelines

    Additional options exist for dx build --nextflow:

    Option
    Class
    Description

    Use dx build --help for more information.

    hashtag
    Private Nextflow Pipeline Repository

    When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:

    circle-info

    To safeguard this credentials file object, store it in a separate project that only you can access.

    hashtag
    Platform File Objects as Runtime Docker Images

    When building a Nextflow pipeline executable, you can replace any Docker container reference with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.

    This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. It also strengthens data security by eliminating the need for internet access to external resources during pipeline execution.

    Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching or manually preparing tarballs.

    hashtag
    Built-in Docker Image Caching vs. Manually Preparing Tarballs

    Built-in Docker image caching
    Manually preparing tarballs

    hashtag
    Built-in Docker Image Caching

    This method initiates a building job that takes the pipeline script and identifies Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling in the tarballs as bundledDepends.

    You can use built-in caching via the CLI by using the flag --cache-docker at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.

    An example:

    If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag --docker-secrets, a file object that contains the credentials needed to access the repository. This object must be in the following format:

    circle-info
    • When a pipeline requires specific inputs, such as file objects, sample values must be present within the project in which the building job is to execute. These values must be provided along with the flag --nextflow-pipeline-params.

      • It's crucial that these sample values be structured in the same way as actual input data is structured. This ensures that the execution logic of the Nextflow pipeline remains intact. During the build process, use small files, containing data representative of the larger dataset, as sample data, to reduce file localization overhead.

    hashtag
    Manually Preparing Tarballs

    You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:

    Option A: Reference each tarball by its unique Platform ID such as dx://project-xxxx:file-yyyy. Use this approach if you want deterministic execution behavior.

    You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:

    circle-info

    When accessing a Platform project, a Nextflow pipeline job needs the VIEW or higher permission to the project.

    Option B: Within a Nextflow pipeline script, you can also reference a Docker image by using its image name. Use this name within a path that's in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.

    An example:

    File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included - latest is not an accepted tag in this context.
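    The path format can be sketched as a small helper. It assumes a simple image name with no registry or namespace prefix; how prefixed names map to paths is not covered here. The function name cached_image_path and the example image and version are hypothetical.

```python
def cached_image_path(project_id, image_name, version,
                      cache_folder=".cached_docker_images"):
    """Build <project>:/<cache_folder>/<image_name>/<image_name>_<version>.

    Assumption: image_name is a simple name with no registry prefix.
    """
    if version == "latest":
        # An exact version is required in this context.
        raise ValueError("an exact version is required; 'latest' is not accepted")
    return f"{project_id}:/{cache_folder}/{image_name}/{image_name}_{version}"


tarball_path = cached_image_path("project-xxxx", "samtools", "1.16.1")
print(tarball_path)
```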

    circle-info

    At Nextflow pipeline executable runtime:

    1. If no image is found at the path provided, the Nextflow pipeline job attempts to pull the Docker image from the remote external registry, based on the image name. This pull attempt requires internet access.

    2. When the version is referenced as latest

    Here are examples of tarball file object paths and names, as constructed from image names and version tags:

    Image Name
    Version Tag
    Tarball File Object Path and Name

    Option C: You can also reference Docker image names in pipeline scripts by digest (for example, <Image_name>@sha256:XYZ123…). File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included - latest is not an accepted tag in this context. When referring to a tarball file on the Platform using this method, the file must have an object property image_digest assigned to it. A typical format would be "image_digest":"<IMAGE_DIGEST_HERE>".

    An example:

    hashtag
    Nextflow Input Parameter Type Conversion to DNAnexus Executable Input Parameter Class

    Based on the input parameter's type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter is assigned to the corresponding DNAnexus input parameter class.

    hashtag
    File Input as String or File Class

    As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which is assigned to the "file" or "string" class, respectively. When running the executable, you use a specific PATH format to specify the value, based on the class (file or string) of the executable's input parameter. See Formats of PATH to File, Folder, or Wildcards for an acceptable PATH format for each class.
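    The two cases above can be expressed as a small lookup. This sketch covers only the two formats named in this section and makes no claim about how other types or formats are converted. The function name dx_input_class is hypothetical.

```python
def dx_input_class(schema_entry):
    """Map a schema entry to "file" or "string" for the two formats above."""
    if schema_entry.get("type") != "string":
        raise ValueError("only string-typed file inputs are covered here")
    fmt = schema_entry.get("format")
    if fmt == "file-path":
        return "file"
    if fmt == "path":
        return "string"
    raise ValueError(f"format not covered in this sketch: {fmt!r}")


print(dx_input_class({"type": "string", "format": "file-path"}))
print(dx_input_class({"type": "string", "format": "path"}))
```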

    hashtag
    Converting a URL path to a String

    When converting a file reference from a URL format to a String, use the method toUriString(). An example of a URL format is dx://project-xxxx:/path/to/file for a DNAnexus URI. The method toURI().toString() does not give the same result because toURI() removes the context ID, such as project-xxxx, and toString() removes the scheme, such as dx://. More information about these methods is available in the Nextflow documentation.

    Managing intermediate files and publishing outputs

    Pipeline Output Setting Using output: block and publishDir

    All files generated by a Nextflow job tree are stored in its session's corresponding workDir, which is the path where the temporary results are stored. On DNAnexus, when the Nextflow pipeline job is run with "preserve_cache=true", the workDir is set at the path: project-xxxx:/.nextflow_cache_db/<session_id>/work/. The project-xxxx is the project where the job took place, and you can follow the path to access all preserved temporary results. It is useful to be able to access these results for investigating the detailed pipeline progress, and use them for resuming job runs for pipeline development purposes.

    When the Nextflow pipeline job is run with "preserve_cache=false" (default), temporary files are stored in the job's temporary workspace, which is destroyed when the head job enters a terminal state: "done", "failed", or "terminated". Many of these files are intermediate inputs and outputs passed between processes and are expected to be cleaned up after the job completes, so running with "preserve_cache=false" reduces project storage costs for files that are not of interest. It also saves you from having to remember to clean up temporary files.

    To save the final results of interest, and to display them as the Nextflow pipeline executable's output, declare output files under the script's output: block, and use Nextflow's optional publishDir directive to publish them.

    This makes the published output files available as the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, of class array:file. The files are then organized under the relative folder structure assigned via publishDir. This works for both "preserve_cache=true" and "preserve_cache=false". Only the "copy" publish mode is supported on DNAnexus.

    Values of publishDir

    At pipeline development time, the valid value of publishDir can be:

    • A local path string, for example, "publishDir path: ./path/to/nf/publish_dir/".

    • A dynamic string value defined as a pipeline input parameter such as "params.outdir", where "outdir" is a string-class input. This allows pipeline users to determine parameter values at runtime. For example, "publishDir path: '${params.outdir}/some/dir/'", './some/dir/${params.outdir}/', or './some/dir/${params.outdir}/some/dir/'.

    Find an example of how to construct output paths for an nf-core pipeline job tree at run time in the FAQ.


    publishDir is NOT supported on DNAnexus when assigned as an absolute path starting at root (/), such as /path/to/nf/publish_dir/. If an absolute path is defined for the publishDir, no output files are generated as the job's output parameter "published_files".
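Because a dynamic publishDir is only valid when the resolved value is a relative path, it can help to check the value before launching a job. The following Python sketch is illustrative only: `resolve_publish_dir` is a hypothetical helper that performs a plain string substitution of `${params.<name>}` placeholders, which is an assumption and not how Nextflow itself evaluates Groovy strings.

```python
def resolve_publish_dir(template, params):
    """Substitute ${params.<name>} placeholders and reject absolute paths."""
    path = template
    for name, value in params.items():
        path = path.replace("${params." + name + "}", value)
    if path.startswith("/"):
        # Absolute publishDir paths produce no published_files output on DNAnexus.
        raise ValueError(f"publishDir must be a relative path, got {path!r}")
    return path


print(resolve_publish_dir("${params.outdir}/some/dir/", {"outdir": "./results"}))
```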

    Queue Size Configuration

    The queueSize option is part of Nextflow's executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this represents the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration assigns a value to queueSize, that value overrides the default. If the value exceeds the upper limit (1000) on DNAnexus, the root job errors out. See the Nextflow executor page for examples.
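The queueSize rules above can be summarized in a few lines of Python. This is a sketch of the documented behavior only; the function name is hypothetical and the real check happens inside the DNAnexus executor.

```python
DEFAULT_QUEUE_SIZE = 5    # subjobs created at a time when queueSize is not set
MAX_QUEUE_SIZE = 1000     # documented upper limit on DNAnexus


def effective_queue_size(configured=None):
    """Return the queue size that would apply, or raise if it exceeds the limit."""
    if configured is None:
        return DEFAULT_QUEUE_SIZE
    if configured > MAX_QUEUE_SIZE:
        # On DNAnexus the root job errors out in this case.
        raise ValueError(f"queueSize {configured} exceeds the limit of {MAX_QUEUE_SIZE}")
    return configured


print(effective_queue_size())
print(effective_queue_size(50))
```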

    Instance Type Determination

    Head job instance type determination

    The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs. Changing the instance type for the head job does not affect the computing resources available for subjobs, where most of the heavy computation takes place (see below where to configure instance types for Nextflow processes). Changing the instance type for the head job may be necessary only if it runs out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.

    Subjob instance type determination

    Each subjob's instance type is determined based on the profile information provided in the Nextflow pipeline script. Specify the required instance by instance type name via Nextflow's machineType directive. Alternatively, use a set of system requirements such as cpus, memory, disk, and other resource parameters according to the official Nextflow documentation. The executor matches instance types to the minimal requirements described in the Nextflow pipeline profile using this logic:

    1. Choose the cheapest instance that satisfies the system requirements.

    2. Use only SSD type instances.

    3. When all else is equal (price and instance specifications), prefer a version2 (v2) instance type.

    Order of precedence for subjob instance type determination:

    1. The value assigned to the machineType directive.

    2. Values assigned to cpus, memory, and disk directives in their configuration.
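The selection logic and the order of precedence can be sketched as follows. This Python example is an illustration under stated assumptions: the catalogue entries, their names, specifications, and prices are all made up, and `pick_instance` is a hypothetical function, not the executor's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpus: int
    memory_gb: float
    disk_gb: float
    hourly_price: float
    ssd: bool
    v2: bool


# Hypothetical catalogue; names, specs, and prices are invented for illustration.
CATALOGUE = [
    InstanceType("hypothetical_ssd_x2", 2, 4, 80, 0.10, True, False),
    InstanceType("hypothetical_ssd_v2_x2", 2, 4, 80, 0.10, True, True),
    InstanceType("hypothetical_ssd_v2_x4", 4, 8, 160, 0.20, True, True),
    InstanceType("hypothetical_hdd_x4", 4, 8, 500, 0.05, False, False),
]


def pick_instance(catalogue, cpus=1, memory_gb=0, disk_gb=0, machine_type=None):
    """Pick an instance type name following the documented precedence."""
    if machine_type is not None:
        # 1. machineType takes precedence over resource directives.
        return machine_type
    candidates = [
        i for i in catalogue
        if i.ssd and i.cpus >= cpus and i.memory_gb >= memory_gb and i.disk_gb >= disk_gb
    ]
    if not candidates:
        raise ValueError("no SSD instance type satisfies the requirements")
    # 2. Cheapest first; on a price tie, prefer v2 (False sorts before True).
    return min(candidates, key=lambda i: (i.hourly_price, not i.v2)).name


print(pick_instance(CATALOGUE, cpus=2, memory_gb=4))
```

Note that the non-SSD entry is never selected even though it is the cheapest, and that the v2 entry wins the price tie between the two 2-core instances.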

    An example command for specifying machineType by DNAnexus instance type name is provided below:

    Values assigned to cpus, memory, and disk directives serve two purposes: they determine the instance type, and they can be referenced through Nextflow's implicit variables of the task object, such as ${task.cpus}, ${task.memory}, and ${task.disk}, at runtime for task allocation.

    Nextflow Resume

    Preserve Run Caches and Resuming Previous Jobs

    Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without starting from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.

    Nextflow uses a scratch storage area for caching and preserving each task's temporary results. The directory is called the "working directory", and its path is defined by:

    • The session id, a universally unique identifier (UUID) associated with current execution

    • Each task's unique hash ID: a hash number composed of each task's input values, input files, command line strings, container ID such as Docker image, conda environment, environment modules, and executed scripts in the bin directory, when applicable.
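To make the path layout concrete, the sketch below builds a task's preserved working directory from a session ID and a task hash. It assumes the layout documented in this section for preserve_cache=true (`project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/`) and assumes that the task hash is a 32-character string split after its first two characters. The function `task_work_dir` is hypothetical.

```python
import uuid


def task_work_dir(project_id, session_id, task_hash):
    """Build the preserved workDir path of one task (illustrative only)."""
    uuid.UUID(session_id)  # raises ValueError when session_id is not a UUID
    if len(task_hash) != 32:
        raise ValueError("expected a 32-character task hash")
    prefix, rest = task_hash[:2], task_hash[2:]
    return f"{project_id}:/.nextflow_cache_db/{session_id}/work/{prefix}/{rest}/"


print(task_work_dir(
    "project-xxxx",
    "12345678-1234-1234-1234-123456789012",
    "ab0123456789abcdef0123456789abcd",
))
```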

    You can use the Nextflow resume feature with the following Nextflow pipeline executable parameters:

    • preserve_cache Boolean type. Default value is false. When set to true, the run is cached in the current project for future resumes. For example:

      • This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID and each subjob's unique ID.

    • When preserve_cache=true, DNAnexus executor overrides the value of workDir of the job tree to be project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job tree was executed.

    Below are four possible scenarios and the recommended use cases for -i resume:

    | Scenarios | Parameters | Use Cases | Note |
    | --- | --- | --- | --- |

    Cache Preserve Limitations and Cleaning Up workDir

    To save on storage costs, clean up the workDir regularly. The maximum number of sessions that can be preserved in a DNAnexus project is 20 sessions. If you exceed the limit, the job generates an error with the following message:

    "The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run. "

    To clean up all preserved sessions under a project, delete the entire .nextflow_cache_db folder. To clean up a specific session's cached folder, delete the specific .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:

    Be aware that deleting an object in the UI or with the CLI command dx rm cannot be undone. Once the session work directory is deleted or moved, subsequent runs cannot resume from the session.

    For each session, only one job can resume the session's cached results and preserve its own progress to this session. Multiple jobs can resume and preserve different sessions without limitations, as long as each job preserves a different session. Similarly, multiple jobs can resume the same session without limitations, as long as only one or none is preserving the progress to the session.

    Nextflow's errorStrategy

    Nextflow's errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. By default, when an error status is returned, the process and other pending processes stop immediately (errorStrategy terminate), which forces the entire pipeline execution to terminate.

    Four error strategy options exist for the Nextflow executor: terminate, finish, ignore, and retry. Below is a table of behaviors for each strategy. The "all other subjobs" referenced in the third column are subjobs that have not yet entered their terminal states.

    When more than one errorStrategy directive applies to a pipeline job tree, the following rules apply, depending on which errorStrategy is triggered first.

    • When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs enter the "failed" state immediately.

    • When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjobs also applies the finish errorStrategy, ignoring any other error strategies set in the pipeline's source code or configuration.
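The precedence rules for mixed errorStrategy directives, including the retry and ignore cases listed elsewhere in this section, can be condensed into one function. This Python sketch only restates the documented rules; `applied_strategy` is a hypothetical name.

```python
STRATEGIES = {"terminate", "finish", "retry", "ignore"}


def applied_strategy(first_triggered, own_strategy):
    """Return the strategy applied to a subjob that errors later in the job tree.

    first_triggered: the errorStrategy triggered first in the job tree.
    own_strategy: the errorStrategy configured for the later subjob.
    Returns None when the later subjob never gets to apply a strategy.
    """
    if first_triggered not in STRATEGIES or own_strategy not in STRATEGIES:
        raise ValueError("unknown errorStrategy")
    if first_triggered == "terminate":
        # All other ongoing subjobs enter the "failed" state immediately.
        return None
    if first_triggered == "finish":
        # finish overrides whatever the pipeline configured.
        return "finish"
    # After retry or ignore, each subjob keeps its own strategy.
    return own_strategy


print(applied_strategy("finish", "retry"))
print(applied_strategy("retry", "ignore"))
```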

    Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters platform-related restartable errors, such as ExecutionError, UnresponsiveWorker, JMInternalError, AppInternalError, AppInsufficientResourceError, or JobTimeoutExceeded, the subjob follows the executionPolicy determined for the subjob and restarts itself. It does not restart from the head job.

    FAQ

    My Nextflow job tree failed, how do I find where the errors are?

    A: Find the errored subjob's job ID from the head job's nextflow_errored_subjob and nextflow_errorStrategy properties to investigate which subjob failed and which errorStrategy was applied. To query these errorStrategy related properties in CLI, run the following command:

    where job-xxxx is the head job's job ID.

    After finding the errored subjob, investigate the job log using the Monitor page by accessing the URL https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>. In this URL, jobID is the subjob's ID such as job-yyyy. Alternatively, watch the job log in CLI using dx watch job-yyyy.
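Once the head job's description is available as JSON, the two properties can be read programmatically. The sketch below is an assumption-laden illustration: it assumes the description is a dictionary with a "properties" key, that multiple errored subjobs are stored as a comma-separated string (as shown in the errorStrategy table), and that the project and job IDs are placed in the Monitor URL exactly as written above. Both helper names are hypothetical.

```python
def errored_subjobs(job_describe):
    """Extract (errorStrategy, [errored subjob IDs]) from a head job description."""
    props = job_describe.get("properties", {})
    raw = props.get("nextflow_errored_subjob", "")
    job_ids = [part.strip() for part in raw.split(",") if part.strip()]
    return props.get("nextflow_errorStrategy"), job_ids


def monitor_url(project_id, job_id):
    """Build the Monitor page URL for a job, following the pattern above."""
    return f"https://platform.dnanexus.com/projects/{project_id}/monitor/job/{job_id}"


# Hypothetical description of a failed head job.
describe = {
    "id": "job-xxxx",
    "properties": {
        "nextflow_errorStrategy": "finish",
        "nextflow_errored_subjob": "job-yyyy, job-zzzz",
    },
}
strategy, subjobs = errored_subjobs(describe)
for job_id in subjobs:
    print(strategy, monitor_url("<projectID>", job_id))
```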

    With the preserve_cache value set to true when starting the Nextflow pipeline executable, trace the cache workDir such as project-xxxx:/.nextflow_cache_db/<session_id>/work/ to investigate the intermediate results of this run.

    What is the version of Nextflow that is used?

    A: Find the Nextflow version by reading the log of the head job. Each built Nextflow executable is locked down to the specific version of Nextflow executor.

    What container runtimes are supported?

    A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.

    My job hangs at the end of the analysis. What can I do to avoid this problem?

    A: Many conditions can cause the head job to become unresponsive. One known cause is the trace report file being written directly to a DNAnexus URI such as dx://project-xxxx:/path/to/file. To avoid this, specify -with-trace path/to/tracefile (using a local path string) in the Nextflow pipeline applet's nextflow_run_opts input parameter.

    Can I have an example of how to construct an output path when I run a Nextflow pipeline with params.outdir, publishDir and job-level destination?

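    Taking nf-core/sarek (3.3.1) as an example, start by reading the pipeline's logic:

    1. The pipeline's publishDir is constructed with a prefix of the params.outdir variable followed by each task's name for each subfolder: publishDir = [ path: { "${params.outdir}/${...}" }, ... ]

    2. params.outdir is a required input parameter to the pipeline, and the default value of params.outdir is null. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir to meet the input requirement for executing the pipeline and to resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.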

    To specify a value of params.outdir for the Nextflow pipeline executable built from the nf-core/sarek pipeline script, you can use the following command:

    You can also set a job tree's output destination using --destination:

    This command constructs the final output paths as follows:

    1. project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.

    2. project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of all tasks/processes/subjobs of this pipeline.

    3. project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.
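The way these three paths combine can be sketched in Python. The example assumes a job-level destination of `project-xxxx:/path/to/jobtree/destination/` and an outdir value of `./local/to/outdir`, which is what the paths listed above imply; the task name `EXAMPLE_TASK` and the function `output_folders` are hypothetical.

```python
import posixpath


def output_folders(destination, outdir, task_name):
    """Return (shared output folder, task output folder) for a job tree."""
    if outdir.startswith("/"):
        raise ValueError("outdir must be a relative path")
    project, _, folder = destination.partition(":")
    # Join the relative outdir onto the destination folder and drop "./" parts.
    shared = posixpath.normpath(posixpath.join(folder, outdir))
    return f"{project}:{shared}", f"{project}:{shared}/{task_name}"


shared, per_task = output_folders(
    "project-xxxx:/path/to/jobtree/destination/",
    "./local/to/outdir",
    "EXAMPLE_TASK",
)
print(shared)
print(per_task)
```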

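    1. This example is built based on Specifying A Nextflow Job Tree Output Folder and Managing intermediate files and publishing outputs.

    2. Not all Nextflow pipelines have params.outdir as input, nor do all use params.outdir in publishDir. Read the source script of the Nextflow pipeline for the actual context of usage and requirements for params.outdir and publishDir.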

    This documentation covers features available in dx-toolkit versions beginning with v0.378.0
    (Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. For more information on how the exposed parameters are used at run time, see specifying input values to a Nextflow pipeline executable.
  • (Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.

  • reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).

  • When the input parameter is expecting a file, you need to specify the value in a certain format based on the class of the input parameter. When the input is of the "file" class, use DNAnexus qualified ID, which is the absolute path to the file object such as "project-xxxx:file-yyyy". When the input is of the "string" class, use the DNAnexus URI ("dx://project-xxxx:/path/to/file"). See table below for full descriptions of the formatting of PATHs.

  • You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of the file class, while fasta_fai is an input parameter of the string class. You then use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.

  • The DNAnexus object class of each input parameter is based on the type and format specified in the pipeline's nextflow_schema.json, when it exists. See additional documentation in the Nextflow Input Parameter Type Conversion section to understand how Nextflow input parameter's type and format (when applicable) converts to an app or applet's input class.

  • It is recommended to always use the app/applet means for specifying input values. The platform validates the input class and existence before the job is created.

  • All inputs for a Nextflow pipeline executable are set as "optional" inputs. This allows users to have flexibility to specify input via other means.

  • Nextflow pipeline command line input parameter, available as nextflow_pipeline_params. This is an optional "string" class input, available for any Nextflow pipeline executable once it has been built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced in the Nextflow Configuration documentation.

    • Because nextflow_pipeline_params is a string type parameter with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

  • Nextflow options parameter nextflow_run_opts. This is an optional "string" class input, available for any Nextflow pipeline executable once it has been built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter that corresponds to the Nextflow run options pattern, specifying a preset input configuration.

  • Nextflow parameter file nextflow_params_file. This is an optional "file" class input, available for any Nextflow pipeline executable that is being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to the -params-file option of nextflow run.

  • Nextflow soft configuration override file nextflow_soft_confs. This is an optional "array:file" class input, available for any Nextflow pipeline executable that is being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the files being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to the -c option of nextflow run, and the order specified for this array of file inputs is preserved when passing them to the nextflow run execution.

    • The soft configuration file can be used for assigning default values of configuration scopes.

    • It is highly recommended to use nextflow_params_file as a replacement to using nextflow_soft_confs for the use case of specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this at .

  • Pipeline source code:

    1. nextflow_schema.json

      • Pipeline developers may specify default values of inputs in the nextflow_schema.json file.

      • If an input parameter is of Nextflow's string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.

    2. nextflow.config

      • Pipeline developers may specify default values of inputs in the nextflow.config file.

      • Pipeline developers may specify a default profile value using

    3. main.nf, sourcecode.nf

      • Pipeline developers may specify default values of inputs in the Nextflow source code file (*.nf).
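The run-time inputs described in this list (nextflow_pipeline_params, nextflow_run_opts, nextflow_params_file, and nextflow_soft_confs) can be combined in a single dx run call. The Python sketch below only assembles the argument list, using the `-i name=value` form shown in the CLI examples; `build_dx_run_argv` is a hypothetical helper and does not run anything.

```python
import shlex


def build_dx_run_argv(executable, inputs=None, pipeline_params=None,
                      run_opts=None, params_file=None, soft_confs=()):
    """Assemble a `dx run` argument list for a Nextflow pipeline executable."""
    argv = ["dx", "run", executable]
    for name, value in (inputs or {}).items():
        argv += ["-i", f"{name}={value}"]
    if pipeline_params:
        argv += ["-i", f"nextflow_pipeline_params={pipeline_params}"]
    if run_opts:
        argv += ["-i", f"nextflow_run_opts={run_opts}"]
    if params_file:
        argv += ["-i", f"nextflow_params_file={params_file}"]
    # The order of soft configuration files is preserved.
    for conf in soft_confs:
        argv += ["-i", f"nextflow_soft_confs={conf}"]
    return argv


argv = build_dx_run_argv(
    "project-xxxx:applet-xxxx",
    run_opts="-profile test",
    soft_confs=["project-xxxx:file-1111", "project-xxxx:file-2222"],
)
print(shlex.join(argv))  # shell-quoted command line for copy and paste
```

Building a list rather than a string keeps values that contain spaces, such as "-profile test", as a single argument when the list is passed to a process launcher.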

  • job-oidc.dnanexus.com
    .
    is an ordered, comma-separated list of DNAnexus job identity token custom claims - for example, project_id, launched_by - that the job must present to assume the role that enables bucket access.
  • iamRoleArnToAssume is the Amazon Resource Name (ARN) for the role that you configured in Step 2 above, and that is assumed by jobs to access the bucket.

  • You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter, within an aws config scope.

  • string

    Specifies tag for Git repository. Can be used only with --repository.

    | --git-credentials GIT_CREDENTIALS | file | Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found in the . |
    | --cache-docker | flag | Stores a container image tarball in the selected project in /.cached_dockerImages. Only Docker engine is supported. Incompatible with --remote. |
    | --nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS | string | Custom pipeline parameters to be referenced when collecting the Docker images. |
    | --docker-secrets DOCKER_SECRETS | file | A DNAnexus file ID with credentials for a private Docker repository. |

    Job first attempts to access Docker images cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the images from the external repository, via the internet.

    Job attempts to find the Docker image based on the Docker cache path referenced. If this fails, the job attempts to pull from the external repository, via the internet.

  • For pipelines featuring conditional process trees determined by input values, provide mocked input values for caching Docker containers used by processes affected by the condition.

  • A building job requires CONTRIBUTE or higher permission to the destination project, that is the project for placing tarballs created from Docker containers.

  • Pipeline source code is saved at /.nf_source/<pipeline_folder_name>/ in the destination project. The user handles cleaning up this folder after the executable has been built.

  • , or when no version tag is provided, the Nextflow pipeline job attempts to search the digest of the image's latest reference from the external Docker repository and uses it to search for the corresponding tarball on the Platform. This digest search requires internet access. If no digest is found, or if there is no internet access, the execution fails.

    latest

    Nextflow pipeline job attempts to pull from remote external registry

    | string | path | string |
    | string | NA | string |
    | integer | NA | int |
    | number | NA | float |
    | boolean | NA | boolean |
    | object | NA | hash |

    ' or
    './some/dir/${params.outdir}/some/dir/'
    .
    • When publishDir is defined this way, the user who launches the Nextflow pipeline executable handles constructing the publishDir to be a valid relative path.

    The actual selected instance type's resources (CPUs, memory, disk capacity) may differ from what is allocated by the task. Instance type selection follows the precedence rules described above, while task allocation uses the values assigned in the configuration file.

  • When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the Docker run command. For example, when task.memory is specified, this becomes the maximum amount of memory allowed for the container: docker run --memory ${task.memory}

  • The session's cache directory containing information on the location of the workDir, the session progress, job status, and configuration data is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.

  • Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is technically the task's unique ID, and project-xxxx is the project where the job tree is executed.

  • resume String type. Default value is an empty string, and the run begins without any cached data. When assigned a session ID, the run resumes from what is cached for that session ID in the project. When assigned "true" or "last", the run determines the session ID that corresponds to the latest valid execution in the current project and resumes the run from it. For example, dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"

  • When a new job is launched and resumes a cached session (where session_id has a format like 12345678-1234-1234-1234-123456789012), the new job not only resumes from where the cache left off, but also shares the same session_id with the cached session it resumes. When a new job makes progress in a session and the job is being cached, it writes temporary results to the same session's workDir. This generates a new cache directory (cache.tar) with the latest cache information.

  • You can have many Nextflow job trees sharing the same session ID, writing to the same workDir path, and each creating its own cache.tar, but only the cache of the latest job that ends in the "done" or "failed" state is preserved in the project.

  • When the head job enters its terminal state such as "failed" or "terminated" that is not caused by the executor, no cache directory is preserved, even when the job was run with preserve_cache=true. Subsequent new jobs cannot resume from this job run. This can happen when a job tree fails due to exceeding a cost limit or a user terminating a job of the job tree.

  • Only up to 20 Nextflow sessions can be preserved per project.

    | 3 | resume=<session_ID> \| "true" \| "last" and preserve_cache=false | Pipeline development. Pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh | |
    | 4 | resume=<session_ID> \| "true" \| "last" and preserve_cache=true | Pipeline development. Only happens for the first few tests. | Only 1 job with the same <session_ID> can run at each time point. |

    | finish | - Job properties set with: "nextflow_errorStrategy":"finish" "nextflow_errored_subjob":"self" - Ends in "done" state immediately | - Job properties set with: "nextflow_errorStrategy":"finish" "nextflow_errored_subjob":"job-xxxx, job-2xxx" where job-xxxx and job-2xxx are errored subjobs. - No new subjobs created after error. - Ends in "failed" state eventually, after other subjobs enter terminal states, with error message: "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure." | - Keep running until terminal state. - If error occurs in any, finish errorStrategy is applied (ignoring other error strategies), per Nextflow default behavior. |
    | retry | - Job properties set with: "nextflow_errorStrategy":"retry" "nextflow_errored_subjob":"self" - Ends in "done" state immediately | - Spins off a new subjob to retry the errored job, named <name> (retry: <RetryCount>). - Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated"). | - Keep running until terminal state. - If error occurs, their own errorStrategy is applied. |
    | ignore | - Job properties set with: "nextflow_errorStrategy":"ignore" "nextflow_errored_subjob":"self" - Ends in "done" state immediately | - Job properties set with: "nextflow_errorStrategy":"ignore" "nextflow_errored_subjob":"job-1xxx, job-2xxx" - Shows "subjobs <job-1xxx>, <job-2xxx> runs into Nextflow process errors' ignore errorStrategy were applied" at end of job log. - Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated"). | - Keep running until terminal state. - If error occurs, their own errorStrategy is applied. |

    finish
    errorStrategy
    , ignoring any other error strategies set in the pipeline's source code or configuration.
  • When retry is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, those errorStrategy directives are applied to the corresponding subjobs.

  • When ignore is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or retry errorStrategy, those errorStrategy directives are applied to the corresponding subjobs.

  • Meet the input requirement for executing the pipeline.

  • Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.

  • in
    publishDir
    . Read the source script of the Nextflow pipeline for the actual context of usage and requirements for
    params.outdir
    and
    publishDir
    .

    | • App or applet input parameter class as file object • CLI/API level, such as dx run --destination PATH | DNAnexus qualified ID (absolute path to the file object). • Example (file): project-xxxx:file-yyyy, project-xxxx:/path/to/file • Example (folder): project-xxxx:/path/to/folder/ |
    | • App or applet input parameter class as string • Nextflow configuration and source code files, such as nextflow_schema.json, nextflow.config, main.nf, and sourcecode.nf | DNAnexus URI. • Example (file): dx://project-xxxx:/path/to/file • Example (folder): dx://project-xxxx:/path/to/folder/ • Example (wildcard): dx://project-xxxx:/path/to/wildcard_files |
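The two PATH formats can be generated from the input class, as in the sketch below. It covers only the file examples given in this section (a qualified ID built from a file ID, and a DNAnexus URI built from a folder path); `file_value_for_class` is a hypothetical helper.

```python
def file_value_for_class(input_class, project_id, file_id=None, path=None):
    """Format a file reference according to the class of the input parameter."""
    if input_class == "file":
        if not file_id:
            raise ValueError("a file ID is required for file-class inputs")
        # DNAnexus qualified ID, for example project-xxxx:file-yyyy
        return f"{project_id}:{file_id}"
    if input_class == "string":
        if not path or not path.startswith("/"):
            raise ValueError("an absolute Platform path is required for string-class inputs")
        # DNAnexus URI, for example dx://project-xxxx:/path/to/file
        return f"dx://{project_id}:{path}"
    raise ValueError(f"unsupported input class: {input_class}")


print(file_value_for_class("file", "project-xxxx", file_id="file-yyyy"))
print(file_value_for_class("string", "project-xxxx", path="/path/to/file"))
```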

    | Values of StringEquals:job-oidc.dnanexus.com/:sub | Which jobs can assume the role that enables bucket access? |
    | --- | --- |
    | project_id;project-xxxx | Any Nextflow pipeline jobs that are running in project-xxxx |
    | launched_by;user-aaaa | Any Nextflow pipeline jobs that are launched by user-aaaa |
    | project_id;project-xxxx;launched_by;user-aaaa | Any Nextflow pipeline jobs that are launched by user-aaaa in project-xxxx |
    | bill_to;org-zzzz | Any Nextflow pipeline jobs that are billed to org-zzzz |
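The subject values in the table follow a simple pattern: claim names and values are joined with semicolons, in the configured order. A minimal Python sketch, assuming that pattern holds for any combination of claims (the function name is hypothetical):

```python
def subject_claim_value(claims):
    """Join ordered (claim name, claim value) pairs into a subject string."""
    return ";".join(f"{name};{value}" for name, value in claims)


print(subject_claim_value([("project_id", "project-xxxx"), ("launched_by", "user-aaaa")]))
```

Because the list of claims is ordered, the same claims in a different order produce a different string, so the value in the role's trust policy must use the same order as the pipeline configuration.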

    | --profile PROFILE | string | Set default profile for the Nextflow pipeline executable. |
    | --repository REPOSITORY | string | Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote. |

    | Requires running a "building job" with external internet access? | Yes, if building an applet for the first time or if any image is going to be updated. No internet access required on rebuild. | No |
    | Docker images packaged as bundledDepends? | Yes. For Docker images that are used in the execution, they are cached and bundled at build time. | No. Docker tarballs are resolved at runtime. |

    | quay.io/biocontainers/tabix | 1.11--hdfd78af_0 | project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0 |
    | python | 3.9-slim | project-xxxx:/.cached_docker_images/python/python_3.9-slim |

    | From: Nextflow Input Parameter (defined at nextflow_schema.json) Type | Format | To: DNAnexus Input Parameter Class |
    | --- | --- | --- |
    | string | file-path | file |
    | string | directory-path | string |

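    | 1 (default) | resume="" (empty string) and preserve_cache=false | Production data processing. Most high volume use cases | |
    | 2 | resume="" (empty string) and preserve_cache=true | Pipeline development. Only happens for the first few pipeline tests. During development, it is useful to see all intermediate results in workDir. | |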

    | errorStrategy | Subjob Error | Head Job | All Other Subjobs |
    | --- | --- | --- | --- |
    | terminate | - Job properties set with: "nextflow_errorStrategy":"terminate" "nextflow_errored_subjob":"self" - Ends in "failed" state immediately | - Job properties set with: "nextflow_errorStrategy":"terminate" "nextflow_errored_subjob":"job-xxxx" "nextflow_terminated_subjob":"job-yyyy, job-zzzz" where job-xxxx is the errored subjob, and job-yyyy, job-zzzz are other subjobs terminated due to this error. - Ends in "failed" state immediately, with error message: "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure." | End in "failed" state immediately. |

    finish

    The "Estimated Price" value shown here is only an example. The actual price depends on the pricing model and runtime of the import job.

    Any Nextflow pipeline jobs that are billed to org-zzzz

    --repository-tag TAG

    At runtime

    python

    string

    Pipeline development. Only happens for the first few pipeline tests. During development, it is useful to see all intermediate results in workDir.

    - Job properties set with: "nextflow_errorStrategy":"finish" "nextflow_errored_subjob":"self" - Ends in "done" state immediately

    job identity token custom claims
    dx build --nextflow --extra-args='{"runSpec": {"release": "20.04"}}'
    $ dx build --nextflow \
      --repository https://github.com/nextflow-io/hello \
      --destination project-xxxx:/applets/hello
    
    Started builder job job-aaaa
    Created Nextflow pipeline applet-zzzz
    $ dx run project-xxxx:/applets/hello -h
    usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]
    
    Applet: hello
    
    hello
    
    Inputs:
     Nextflow options
      Nextflow Run Options: [-inextflow_run_opts=(string)]
            Additional run arguments for Nextflow (e.g. -profile docker).
    
      Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
            Additional top-level options for Nextflow (e.g. -quiet).
    
      Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
            (Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
            configuration set
    
      Script Parameters File: [-inextflow_params_file=(file)]
            (Optional) A file, in YAML or JSON format, for specifying input parameter values
    
     Advanced Executable Development Options
      Debug Mode: [-idebug=(boolean, default=false)]
            Shows additional information in the job log. If true, the execution log messages from
            Nextflow are also included.
    
      Resume: [-iresume=(string)]
            Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
            the sessionID, resumes the latest resumable session run by an applet with the same name
            in the current project in the last 6 months.
    
      Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
            Enable storing pipeline cache and local working files to the current project. If true, local
            working files and cache files are uploaded to the platform, so the current session could
            be resumed in the future
    
    Outputs:
      Published files of Nextflow pipeline: [published_files (array:file)]
            Output files published by current Nextflow pipeline and uploaded to the job output
            destination.
    $ pwd
    /path/to/hello
    $ ls
    LICENSE         README.md       main.nf         nextflow.config
    $ dx build --nextflow /path/to/hello \
      --destination project-xxxx:/applets2/hello
    {"id": "applet-yyyy"}
    $ dx run project-yyyy:applet-xxxx \
      -i debug=false \
      --destination project-xxxx:/path/to/destination/ \
      --brief -y
    
    job-bbbb
    # See subjobs in progress
    $ dx find jobs --origin job-bbbb
    * hello (done) job-bbbb
    │ amy 2023-09-20 14:57:58 (runtime 0:02:03)
    ├── sayHello (3) (hello:nf_task_entry) (done) job-1111
    │   amy 2023-09-20 14:58:57 (runtime 0:00:45)
    ├── sayHello (1) (hello:nf_task_entry) (done) job-2222
    │   amy 2023-09-20 14:58:52 (runtime 0:00:52)
    ├── sayHello (2) (hello:nf_task_entry) (done) job-3333
    │   amy 2023-09-20 14:58:48 (runtime 0:00:53)
    └── sayHello (4) (hello:nf_task_entry) (done) job-4444
        amy 2023-09-20 14:58:43 (runtime 0:00:50)
    # Monitor job in progress
    $ dx watch job-bbbb
    Watching job job-bbbb. Press Ctrl+C to stop watching.
    * hello (done) job-bbbb
      amy 2023-09-20 14:57:58 (runtime 0:02:03)
    ... [deleted]
    2023-09-20 14:58:29 hello STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
    2023-09-20 14:58:30 hello STDOUT bash running (job ID job-bbbb)
    2023-09-20 14:58:31 hello STDOUT =============================================================
    2023-09-20 14:58:31 hello STDOUT === NF projectDir   : /home/dnanexus/hello
    2023-09-20 14:58:31 hello STDOUT === NF session ID   : 0eac8f92-1216-4fce-99cf-dee6e6b04bc2
    2023-09-20 14:58:31 hello STDOUT === NF log file     : dx://project-xxxx:/applets/nextflow-job-bbbb.log
    2023-09-20 14:58:31 hello STDOUT === NF command      : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
    2023-09-20 14:58:31 hello STDOUT === Built with dxpy : 0.358.0
    2023-09-20 14:58:31 hello STDOUT =============================================================
    2023-09-20 14:58:34 hello STDOUT N E X T F L O W  ~  version 22.10.7
    2023-09-20 14:58:35 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
    2023-09-20 14:58:43 hello STDOUT [0a/6a81ca] Submitted process > sayHello (4)
    2023-09-20 14:58:48 hello STDOUT [f5/87df8b] Submitted process > sayHello (2)
    2023-09-20 14:58:53 hello STDOUT [4b/21374a] Submitted process > sayHello (1)
    2023-09-20 14:58:57 hello STDOUT [f6/8c44f5] Submitted process > sayHello (3)
    2023-09-20 14:59:51 hello STDOUT Hola world!
    2023-09-20 14:59:51 hello STDOUT 
    2023-09-20 14:59:51 hello STDOUT Ciao world!
    2023-09-20 14:59:51 hello STDOUT 
    2023-09-20 15:00:06 hello STDOUT Bonjour world!
    2023-09-20 15:00:06 hello STDOUT 
    2023-09-20 15:00:06 hello STDOUT Hello world!
    2023-09-20 15:00:06 hello STDOUT 
    2023-09-20 15:00:07 hello STDOUT === Execution completed — cache and working files will not be resumable
    2023-09-20 15:00:07 hello STDOUT === Execution completed — upload nextflow log to job output destination project-xxxx:/applets/
    2023-09-20 15:00:09 hello STDOUT Upload nextflow log as file: file-GZ5ffkj071zqZ9Qj22qv097J
    2023-09-20 15:00:09 hello STDOUT === Execution succeeded — upload published files to job output destination project-xxxx:/applets/
    * hello (done) job-bbbb
      amy 2023-09-20 14:57:58 (runtime 0:02:03)
      Output: -
    # Monitor job in progress
    $ dx watch job-cccc
    Watching job job-cccc. Press Ctrl+C to stop watching.
    sayHello (1) (hello:nf_task_entry) (done) job-cccc
    amy 2023-09-20 14:58:52 (runtime 0:00:52)
    ... [deleted]
    2023-09-20 14:59:28 sayHello (1) STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
    2023-09-20 14:59:30 sayHello (1) STDOUT bash running (job ID job-cccc)
    2023-09-20 14:59:33 sayHello (1) STDOUT file-GZ5ffQj047j3Vq7QX220Q5vQ
    2023-09-20 14:59:34 sayHello (1) STDOUT Bonjour world!
    2023-09-20 14:59:36 sayHello (1) STDOUT file-GZ5ffVQ047j2QXZ2ZkFx4YxG
    2023-09-20 14:59:38 sayHello (1) STDOUT file-GZ5ffX0047j2QXZ2ZkFx4YxK
    2023-09-20 14:59:41 sayHello (1) STDOUT file-GZ5ffXQ047jGYZ91x6KG32Jp
    2023-09-20 14:59:43 sayHello (1) STDOUT file-GZ5ffY8047jF2PY3609JPBKB
    sayHello (1) (hello:nf_task_entry) (done) job-cccc
    amy 2023-09-20 14:58:52 (runtime 0:00:52)
    Output: exit_code = 0
    {
      "docker_registry": {
        "registry": "url-to-registry",
        "username": "name123",
        "token": "12345678"
      }
    }
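The credentials document above can also be generated programmatically. The following is a minimal sketch, not part of the DNAnexus tooling: the helper name `make_docker_credentials` is hypothetical, and the only assumption about the format is the three keys shown in the example (`registry`, `username`, `token`) nested under `docker_registry`.

```python
import json

# Keys taken from the example credentials document; nothing else is assumed.
REQUIRED_KEYS = ("registry", "username", "token")


def make_docker_credentials(registry: str, username: str, token: str) -> str:
    """Return a credentials document as a JSON string (hypothetical helper)."""
    fields = {"registry": registry, "username": username, "token": token}
    missing = [key for key in REQUIRED_KEYS if not fields[key]]
    if missing:
        raise ValueError(f"missing credential fields: {', '.join(missing)}")
    return json.dumps({"docker_registry": fields}, indent=2)


if __name__ == "__main__":
    # Placeholder values copied from the example above.
    print(make_docker_credentials("url-to-registry", "name123", "12345678"))
```

Rejecting empty fields early keeps an incomplete credentials file from reaching a pipeline run, where the failure would be harder to trace.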
    # Query for the class of each input parameter
    $ dx run project-yyyy:applet-xxxx --help
    usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]
    
    Applet: example_applet
    
    example_applet
    
    Inputs:
    …
      fasta: [-ifasta=(file)]
    …
    
      fasta_fai: [-ifasta_fai=(string)]
    …
    
    
    # Assign values of the parameter based on the class of the parameter
    $ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"
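The rule illustrated above (a `file`-class input takes a `project:file-id` reference, while a `string`-class input that points at a file takes a `dx://` URI) can be sketched in Python. The helper names below are hypothetical and only assemble an argument list; they do not call the Platform. The `-iNAME=VALUE` form is taken from the usage line printed by `dx run --help`.

```python
from typing import Dict, List


def file_object_ref(project_id: str, file_id: str) -> str:
    """Value for an input of class `file`, for example project-xxxx:file-yyyy."""
    return f"{project_id}:{file_id}"


def dx_uri(project_id: str, path: str) -> str:
    """Value for a `string` input that refers to a file or folder on the Platform."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute within the project")
    return f"dx://{project_id}:{path}"


def dx_run_args(executable: str, inputs: Dict[str, str]) -> List[str]:
    """Assemble a `dx run` argument list from already-formatted input values."""
    args = ["dx", "run", executable]
    for name, value in inputs.items():
        args.append(f"-i{name}={value}")
    return args


if __name__ == "__main__":
    print(dx_run_args(
        "project-yyyy:applet-xxxx",
        {
            "fasta": file_object_ref("project-xxxx", "file-yyyy"),
            "fasta_fai": dx_uri("project-xxxx", "/path/to/file"),
        },
    ))
```

Returning a list rather than a single string means the result can be passed to `subprocess.run` without shell quoting concerns.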
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:DeleteObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
          ],
          "Resource": [
            "arn:aws:s3:::my-nextflow-s3-workdir",
            "arn:aws:s3:::my-nextflow-s3-workdir/*"
          ]
        }
      ]
    }
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Principal": {
            "Federated": "arn:aws:iam::123456789012:oidc-provider/job-oidc.dnanexus.com/"
          },
          "Condition": {
            "StringEquals": {
              "job-oidc.dnanexus.com/:aud": "dx_nextflow_s3_scratch_token_aud",
              "job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
            }
          }
        }
      ]
    }
    // In a Nextflow configuration file:
    
    aws { region = '<aws region>'}
    
    dnanexus {
     workDir = '<S3 URI path>'
     jobTokenAudience = '<OIDC_audience_name>'
     jobTokenSubjectClaims = '<list of claims separated by commas>'
     iamRoleArnToAssume = '<ARN of the IAM role that has been granted the required permissions>'
    }
    // In a Nextflow configuration file:
    dnanexus {
     ...
     jobTokenSubjectClaims = 'project_id,launched_by'
     ...
    }
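The `sub` value in the trust policy condition has to match what the job identity token will carry. Judging only from the example values in this section (`jobTokenSubjectClaims = 'project_id,launched_by'` alongside `project_id;project-xxxx;launched_by;user-aaaa`), the subject appears to be the claim names and their values joined with semicolons, in the configured order. The sketch below encodes that inferred rule; the function name is hypothetical and the format is an assumption to verify against the job identity token documentation.

```python
from typing import Dict


def expected_subject(claim_names: str, claim_values: Dict[str, str]) -> str:
    """Build the `sub` string expected in the IAM trust policy.

    Assumption (inferred from the documentation example, not confirmed):
    the subject is `name;value` pairs joined by ';' in the configured order.
    """
    names = [name.strip() for name in claim_names.split(",") if name.strip()]
    parts = []
    for name in names:
        if name not in claim_values:
            raise KeyError(f"no value supplied for claim '{name}'")
        parts.extend([name, claim_values[name]])
    return ";".join(parts)


if __name__ == "__main__":
    print(expected_subject(
        "project_id,launched_by",
        {"project_id": "project-xxxx", "launched_by": "user-aaaa"},
    ))
```

Generating the string from the same claim list used in the Nextflow configuration helps keep the trust policy and `jobTokenSubjectClaims` in sync.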
    // In a Nextflow configuration file:
    dnanexus {
     ...
     iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
     ...
    }
    $ dx build --nextflow /path/to/hello --extra-args \
        '{"access":{"network": [], "allProjects":"VIEW"}}'
    ...
    {"id": "applet-yyyy"}
    providers {
      github {
        user = 'username'
        password = 'ghp_xxxx'
      }
    }
    # Pass --nextflow-pipeline-params only when the pipeline requires parameters
    $ dx build --nextflow /path/to/hello \
    --cache-docker \
    --nextflow-pipeline-params "--alpha=1 --beta=foo" \
    --destination project-xxxx:/applets2/hello
    ...
    {"id": "applet-yyyy"}
    
    $ dx tree /.cached_docker_images/
    /.cached_docker_images/
    ├── samtools
    │   └── samtools_1.16.1--h6899075_1
    ├── multiqc
    │   └── multiqc_1.18--pyhdfd78af_0
    └── fastqc
        └── fastqc_0.11.9--0
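The cached image locations shown in this section follow a visible pattern: `python` with tag `3.9-slim` is stored at `/.cached_docker_images/python/python_3.9-slim`, and the tabix image with tag `1.11--hdfd78af_0` at `/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0`. The sketch below reproduces that pattern for a tagged image name. It is inferred from these examples only; the function name is hypothetical, and digest-pinned images are deliberately rejected because the examples do not show how they are stored.

```python
def cached_image_path(image: str, project_id: str) -> str:
    """Guess where a tagged Docker image is cached in the project.

    Assumption (inferred from the examples): the folder is the last component
    of the repository name, and the file is `<name>_<tag>`.
    """
    if "@" in image:
        raise ValueError("digest-pinned images are not covered by this sketch")
    repository, separator, tag = image.rpartition(":")
    # A ':' followed by a '/' belongs to a registry host:port, not to a tag.
    if not separator or "/" in tag:
        raise ValueError("image name must include a tag, e.g. python:3.9-slim")
    name = repository.rsplit("/", 1)[-1]
    return f"{project_id}:/.cached_docker_images/{name}/{name}_{tag}"


if __name__ == "__main__":
    print(cached_image_path("quay.io/biocontainers/tabix:1.11--hdfd78af_0", "project-xxxx"))
    print(cached_image_path("python:3.9-slim", "project-xxxx"))
```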
    "docker_registry": {
      "registry": "url-to-registry",
      "username": "name123",
      "token": "12345678"
    }
    // In a Nextflow pipeline script:
    process foo {
      container 'dx://project-xxxx:file-yyyy'
    
      '''
      do this
      '''
    }
    // In nextflow.config, at the root folder of the Nextflow pipeline:
    process {
        withName:foo {
            container = 'dx://project-xxxx:file-yyyy'
        }
    }
    // In a Nextflow configuration file:
    docker.enabled = true
    docker.registry = 'quay.io'
    
    // In the Nextflow pipeline script:
    process bar {
      container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'
    
      '''
      do this
      '''
    }
    // In a Nextflow configuration file:
    docker.enabled = true
    docker.registry = 'quay.io'
    
    // In the Nextflow pipeline script:
    process bar {
      container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
      '''
      do this
      '''
    }
    process foo {
      machineType 'mem1_ssd1_v2_x36'
    
      """
      <your script here>
      """
    }
    dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true
    dx rm -r project-xxxx:/.nextflow_cache_db/              # clean up the caches of ALL sessions
    dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session's cache
    $ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
    job-yyyy
    $ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
    terminate
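The same properties can be read in a script instead of with `jq`. The following is a minimal sketch, assuming the JSON printed by `dx describe job-xxxx --json` has a top-level `properties` object, as the `jq` filters above imply. The function name and the sample document are hypothetical; the property names come from the errorStrategy table.

```python
import json
from typing import Any, Dict


def nextflow_error_info(describe_json: str) -> Dict[str, Any]:
    """Extract the Nextflow error properties from `dx describe --json` output."""
    properties = json.loads(describe_json).get("properties") or {}
    terminated = properties.get("nextflow_terminated_subjob", "")
    return {
        "errorStrategy": properties.get("nextflow_errorStrategy"),
        "errored_subjob": properties.get("nextflow_errored_subjob"),
        # Stored as a comma-separated string, e.g. "job-yyyy, job-zzzz".
        "terminated_subjobs": [job.strip() for job in terminated.split(",") if job.strip()],
    }


if __name__ == "__main__":
    # Hypothetical head-job description, using the placeholder IDs from the table.
    sample = json.dumps({
        "id": "job-xxxx",
        "properties": {
            "nextflow_errorStrategy": "terminate",
            "nextflow_errored_subjob": "job-yyyy",
            "nextflow_terminated_subjob": "job-aaaa, job-bbbb",
        },
    })
    print(nextflow_error_info(sample))
```

In practice the input string would come from capturing the output of `dx describe job-xxxx --json`, for example with `subprocess.run(..., capture_output=True, text=True)`.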
    # Assign "./local/to/outdir" to params.outdir
    dx run project-xxxx:applet-zzzz \
    -i outdir=./local/to/outdir \
    --brief -y
    # Assign "./local/to/outdir" to params.outdir and set the job tree output folder
    dx run project-xxxx:applet-zzzz \
    -i outdir=./local/to/outdir \
    --destination project-xxxx:/path/to/jobtree/destination/ \
    --brief -y
    --profile <value> when building the executable, for example, dx build --nextflow --profile test.
    If an input parameter is of Nextflow's string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

    Distributed