
DNAnexus Documentation



Key Concepts

Understanding projects, organizations, apps, and workflows will help you get the most out of the DNAnexus Platform.

Running Apps and Workflows

Example Applications

Objects

Overview

Using the DNAnexus Platform

Getting Started

Get to know commonly used features in a series of short, task-oriented tutorials.

User

Learn to access and use the Platform via both its command-line interface and its user interface.

Developer

Learn to manage data, users, and work on the Platform, via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.

Administrator

This section is targeted at organizational leads who have permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the Platform.

Downloads

Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.

Release Notes

Get details on new features, changes, and bug fixes for each Platform and toolkit release.

DNAnexus Essentials

Learn to upload data, create a project, run an analysis, and visualize results.

Learn More

See these Key Concepts pages to learn more about how the DNAnexus Platform works, and how to get the most from it:

User

In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).

To use the CLI, you need to download and install the dx command-line client.

If you're not familiar with the dx client, check the Command-Line Quickstart.

This section provides detailed instructions on using the dx client to perform such common actions as logging in, selecting projects, listing, copying, moving, and deleting objects, and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.
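For example, a typical session might look like the following sketch; the project, file, and app names here are illustrative placeholders:

    dx login                                # authenticate and choose a default project
    dx select "My Research Project"         # switch to another project
    dx ls                                   # list objects in the current folder
    dx mv reads.fastq raw_data/             # move an object into a folder
    dx rm old_results.txt                   # delete an object
    dx run app-cloud_workstation --ssh      # launch an app and connect to the job
    dx watch job-xxxx                       # stream a running job's logs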


Projects

A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the Key Concepts section.
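As a rough CLI sketch of those operations (the project name and username are hypothetical):

    dx new project "Collaboration Project"                      # create a project
    dx select "Collaboration Project"                           # make it the current project
    dx upload sample.vcf --path "Collaboration Project:/data/"  # store a file in it
    dx invite user-alice "Collaboration Project" VIEW           # share it with another user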


Concurrent Computing Tutorials

Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.

Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud (Here's a particularly helpful Stack Exchange post on the subject). To help make the documentation easier to understand, when discussing concurrent computing paradigms this guide refers to:

  • Parallel: Using multiple threads or logical cores to concurrently process a workload.

  • Distributed: Using multiple machines (in this case instances in the cloud) that communicate to concurrently process a workload.

Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
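As a toy illustration of the first definition, the following compresses several files concurrently on a single worker by using multiple cores; the file names are placeholders. Distributed computing, by contrast, spreads work across multiple workers, as shown in the tutorials that follow.

    # Parallel: run up to 4 gzip processes at once on one machine's cores
    ls *.fastq | xargs -P 4 -n 1 gzip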

Organizations

  • Apps and Workflows

  • Get up and running quickly using the Platform via both its user interface (UI) and its command-line interface (CLI):

    • User Interface Quickstart

    • Command Line Quickstart

    Learn the basics of developing for the Platform:

    • Developer Quickstart

    • Developer Tutorials

    Projects

    Getting Started

    Get to know features you'll use every day, in these short, task-oriented tutorials.

    You must set up billing for your account before you can perform an analysis, or upload or egress data. Follow these instructions to set up billing.

    Uploading and Sharing Data

    Running a Single App

    Creating and Running a Workflow

    Monitoring Jobs and Viewing Results

    Visualizing Data

    Learn More

See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here: Projects, Organizations, and Apps and Workflows.

For a step-by-step written tutorial to using the Platform via its UI, see the User Interface Quickstart.

For a step-by-step written tutorial to using the Platform via its CLI, see the Command Line Quickstart.

For a more in-depth video intro to the Platform, watch the DNAnexus Platform Essentials video.

    Additional Tutorials

As a developer, you may be interested in dxCompiler, JupyterLab, and Spark on JupyterLab.

As a bioinformatician, see the SAIGE GWAS walkthrough and other content in the Science Corner.

    TensorBoard Example Web App

    This example demonstrates how to run TensorBoard inside a DNAnexus applet.

    View full source code on GitHub

    TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, the training script in TensorFlow needs to include code that saves specific data to a log directory where TensorBoard can then find the data to display it.

    This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website (external link).

    Creating the web application

    The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).

    The training script runs in the background to start TensorBoard immediately, which allows you to see the results while training is still running. This is particularly important for long-running training scripts.

    For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

    As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.

    Creating an applet on DNAnexus

    Build the asset with the libraries first:

    Take the record ID it outputs and add it to the dxapp.json for the applet.

    Then build the applet

    Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud, to see the result.

    Precompiled Binary

    This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).

    View full source code on GitHub

    Precompiling a Binary

    In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can use the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary runs on future workers.

See Cloud Workstation in the App Library for more information.

    Resources Directory

The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory are packaged, uploaded to the Platform, and then extracted into the root directory / of the worker. In this case, the resources/ dir is structured as follows:

    When this applet is run on a worker, the resources/ directory is placed in the worker's root directory /:

    The SAMtools command is available because the respective binary is visible from the default $PATH variable. The directory /usr/bin/ is part of $PATH, so the script can reference the samtools command directly:

    Running Older Versions of DXJupyterLab

    Learn how to run an older version of DXJupyterLab via the user interface or command-line interface.

    Why Run an Older Version of DXJupyterLab?

    The primary reason to run an older version of DXJupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.

    Launching an Older Version via the User Interface (UI)

    1. From the main Platform menu, select Tools, then Tools Library.

    2. Find and select, from the list of tools, either DXJupyterLab with Python, R, Stata, ML, Image Processing or DXJupyterLab with Spark Cluster.

3. From the tool detail page, click on the Versions tab.

4. Select the version you'd like to run, then click the Run button.

    Launching an Older Version via the Command-Line Interface (CLI)

    1. Select the project in which you want to run DXJupyterLab.

    2. Launch the version of DXJupyterLab you want to run, substituting the version number for x.y.z in the following commands:

• For DXJupyterLab without Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.

  • For DXJupyterLab with Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high.

    Running DXJupyterLab at "high" priority is not required. However, doing so ensures that your interactive session is not interrupted by spot instance termination.
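Putting the two steps together, a session might look like this sketch (the project name is hypothetical, and x.y.z stands for the version you choose):

    dx select "My Research Project"
    dx run app-dxjupyterlab/x.y.z --priority high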

    Accessing DXJupyterLab

    After launching DXJupyterLab, access the DXJupyterLab environment using your browser. To do this:

1. Get the job ID for the job created when you launched DXJupyterLab. See the Monitoring Executions page for details on how to get the job ID, via either the UI or the CLI.

    2. Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.

    3. You may see an error message "502 Bad Gateway" if DXJupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.

    Apollo Apps

    Spark Applications

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    The Spark application is an extension of the current app(let) framework. App(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow for an additional optional cluster specification with type=dxspark.

    • Calling /app(let)-xxxx/run for Spark apps creates a Spark cluster (+ master VM).

    • The master VM (where the app shell code runs) acts as the driver node for Spark.

    • Code in the master VM leverages the Spark infrastructure.

    • Job mechanisms (monitoring, termination, and management) are the same for Spark apps as for any other regular app(let)s on the Platform.

Spark apps can be launched over a distributed Spark cluster.

    • Spark apps use the same dx-based communication between the master VM and the DNAnexus API servers as other apps.

    • A log collection mechanism gathers logs from all cluster nodes.

    • You can use the Spark UI to monitor a running job, via SSH tunneling.

    Apps and Workflows Glossary

    Learn key terms used to describe apps and workflows.

    On the DNAnexus Platform, the following terms are used when discussing apps and workflows:

    • Execution: An analysis or job.

      • Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via /executable-xxxx/run API call with detach flag set to true are also root executions.

      • Execution tree: The set of all jobs and/or analyses that are created because of running a root execution.

    • Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).

      • Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.

    • Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.

• Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.

• Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.

      • Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.

      • Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.

      • Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.

      • Job tree: A set of all jobs that share the same origin job.

    • Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.
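As an illustrative sketch, a job-based object reference supplied as an input could look like the hash below; the job ID and field name are placeholders:

    {
      "job": "job-xxxx",
      "field": "output_field_name"
    }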


    Bash Helpers

    Learn to build an applet that performs a basic SAMtools count with the aid of bash helper variables.

    Source Code

    Step 1. Download BAM Files

    R Shiny Example Web App

    This is an example web applet that demonstrates how to build and run an R Shiny application on DNAnexus.

    Creating the web application

    Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.

R Shiny needs two scripts, server.R and ui.R.

    Job Notifications

    Learn how to set job notification thresholds on the DNAnexus Platform.

A license is required to use the functionality described on this page. Contact DNAnexus Sales for more information.

Being notified when a job may be stuck can help users troubleshoot problems. On DNAnexus, users can set timeouts to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time via dx or dxCompiler.

When the threshold is reached for a job tree, the system sends an email notification to both the user who launched the executable and the org admin.

    Histogram

    Learn to build and use histograms in the Cohort Browser.

    When to Use Histograms

    Histograms can be used to visualize numerical, date, and datetime data.

    Supported Data Types

    Scatter Plot

    Learn to build and use scatter plots in the Cohort Browser.

    When to Use Scatter Plots

    Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.

    Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.

    Supported Data Types

    CSV Loader

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    The CSV Loader ingests CSV files into a database. The input CSV files are loaded into a Parquet-format database and tables that can be queried using Spark SQL.

    DXJupyterLab Quickstart

In this tutorial, you will learn how to create and run a notebook in JupyterLab on the Platform, download data from within the notebook, and upload results to the Platform.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    FreeSurfer in DXJupyterLab

    Learn how to use FreeSurfer in DXJupyterLab.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    VCF Preprocessing

    Learn about preprocessing VCF data before using it in an analysis.

    Overview

It may be necessary to preprocess, or harmonize, the data before you load it.

    Harmonizing Data

    About FreeSurfer

FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies.

    The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING feature of DXJupyterLab.

    FreeSurfer License Registration

    To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license at the FreeSurfer registration page.

    Using the FreeSurfer License on DNAnexus

    To use the FreeSurfer license, complete the following steps:

    1. Upload the license text file to your project on the DNAnexus Platform.

    2. Launch the DXJupyterLab app and specify the IMAGE_PROCESSING feature.

    3. Once DXJupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory.

    The commands to download the license file are as follows:

    • Python kernel: !dx download license.txt -o $FREESURFER_HOME

    • Bash kernel: dx download license.txt -o $FREESURFER_HOME
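As a quick, optional sanity check, you can confirm from a notebook cell that the license file is now where FreeSurfer expects it:

    %%bash
    ls "$FREESURFER_HOME/license.txt"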



    # Start the training script and put it into the background,
    # so the next line of code will run immediately
    python3 mnist_tensorboard_example.py --log_dir LOGS_FOR_TENSORBOARD &
    
    # Run TensorBoard
    tensorboard  --logdir LOGS_FOR_TENSORBOARD --host 0.0.0.0 --port 443
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── < samtools binary >

Download input files using the dx-download-all-inputs command, which iterates over all job inputs and downloads each into a folder following the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].

    Step 2. Create an Output Directory

    Create an output directory in preparation for the dx-upload-all-outputs DNAnexus command in the Upload Results section.

    Step 3. Run SAMtools View

After executing the dx-download-all-inputs command, three helper variables are created for each file input to aid in scripting. For this applet, the input variable mappings_bam, with platform filename my_mappings.bam, has the following helper variables:

    Use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.

    Step 4. Upload Result

    Use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The dx-upload-all-outputs command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It uploads matching files and then associates them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. After creating the folders, place the outputs there.

    View full source code on GitHub
The server.R and ui.R scripts should be placed under resources/home/dnanexus/my_app/. When a job starts based on this applet, the resources directory is copied onto the worker, and since the ~/ path on the worker is /home/dnanexus, you end up with ~/my_app containing those two scripts.

    From the main applet script code.sh, start shiny pointing to ~/my_app, serving its mini-application on port 443.

    For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running indefinitely. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.

    Modifying this example for your own applet

    To make your own applet with R Shiny, copy the source code from this example and modify server.R and ui.R inside resources/home/dnanexus/my_app.

    How to rebuild the shiny asset

    View dxasset.json file

To build the asset, run the dx build_asset command, passing shiny-asset, the name of the directory holding dxasset.json:

    This outputs a record ID record-xxxx that you can then put into the applet's dxapp.json in place of the existing one:

    Build the applet

    Build and run the applet itself:

    Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.

    View full source code on GitHub
    Setting Thresholds From the Command Line

For a root execution, the turnaround time is the time between its creation and the time it reaches a terminal state (or the current time, if it is not yet in a terminal state). The terminal states of an execution are done, terminated, and failed. The job tree turnaround time threshold can be set in the dxapp.json app metadata file using the supported field treeTurnaroundTimeThreshold, where the threshold is specified in seconds. When a user runs an executable that has a threshold, the threshold applies only to the resulting root execution. See here for more details on the treeTurnaroundTimeThreshold API.

    Example of including the treeTurnaroundTimeThreshold field in dxapp.json:

    In the command-line interface (CLI), the dx build and dx build --app commands can accept the treeTurnaroundTimeThreshold field from dxapp.json, and the resulting app is built with the job tree turnaround time threshold from the JSON file.

To check the treeTurnaroundTimeThreshold value of an executable, use the dx describe {app, applet, workflow, or global workflow ID} --json command.

    Using the dx describe {execution_id} --json command displays the selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, and treeTurnaroundTime values of root executions.
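For example, both checks can be sketched from the CLI as follows, using jq (assumed to be installed on your system) to extract the relevant fields:

    # Threshold configured on an executable
    dx describe app-xxxx --json | jq '.treeTurnaroundTimeThreshold'
    # Turnaround values on a root execution
    dx describe job-xxxx --json | jq '{selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, treeTurnaroundTime}'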

    WDL Workflows

    For WDL workflows and tasks, dxCompiler enables tree turnaround time specification using the extras JSON file. dxCompiler reads the treeTurnaroundTimeThreshold field from the perWorkflowDxAttributes and defaultWorkflowDxAttributes sections in extras and applies this threshold to the generated workflow. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold field to the perTaskDxAttributes and defaultTaskDxAttributes sections in the extras JSON file.

Example of including the treeTurnaroundTimeThreshold field in perWorkflowDxAttributes:

You can load a single CSV file or multiple CSV files. When loading multiple files, all files must be syntactically consistent.

    For example:

    • All files must have the same separator. This can be a comma, tab, or another consistent delimiter.

• All files must include a header line, or all files must exclude it.

    Each CSV file is loaded into its own table within the specified database.

    How to Run CSV Loader

    Input:

    • CSV (array of CSV files to load into the database)

    Required Parameters:

    • database_name -> name of the database to load the CSV files into.

• create_mode -> strict creates the database and tables from scratch; optimistic creates the database and tables only if they do not already exist.

    • insert_mode -> append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.

    • table_name -> array of table names, one for each corresponding CSV file by array index.

    • type -> the cluster type, "spark" for Spark apps

    Other Options:

    • spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.

    • spark_read_csv_sep -> default , -- the separator character used by each CSV.

    • spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.

    Basic Run

    The following case creates a brand new database and loads data into two new tables:

    Run a JupyterLab Session and Create Notebooks

    1. Launch DXJupyterLab and View the Project

    First, launch DXJupyterLab in the project of your choice, as described in the Running DXJupyterLab guide.

    After starting your JupyterLab session, click on the DNAnexus tab on the left sidebar to see all the files and folders in the project.

    2. Create an Empty Notebook

    To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.

    This creates an untitled ipynb file, viewable in the DNAnexus project browser, which refreshes every few seconds.

    To rename your file, right-click on its name and select Rename.

    3. Edit and Save the Notebook in the Project

    You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, hit Ctrl/Command + S or click on the save icon in the Toolbar (an area below the tab bar at the top). A new notebook version lands in the project, and you should see in the "Last modified" column that the file was created recently.

Since DNAnexus files are immutable, each notebook save creates a new version in the project, replacing the file of the same name. The previous version moves to the .Notebook_archive folder, with a timestamp suffix added to its name. Saving notebooks directly in the project as new files preserves your analyses beyond the end of the DXJupyterLab session.

    4. Download the Data to the Execution Environment

    To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).

    You can download input data from a project for your notebook using dx download in a notebook cell:

    You can also use the terminal to execute the dx command.

    5. Upload Data to the Project

    For any data generated by your notebook that needs to be preserved, upload it to the project before the session ends and the JupyterLab worker terminates. Upload data directly in the notebook by running dx upload from a notebook cell or from the terminal:

    If you create a notebook from the Launcher or from the top menu (File > New > Notebook), the notebook is not created in the project but in the local execution environment. To move it to the project, you must upload it to the project manually. Make sure you upload your local notebooks to the project before the session expires, or work on your notebooks directly from the project, so as not to lose your work.

    Next Steps

    • Check the References guide for tips on the most useful operations and features in the DNAnexus JupyterLab.

    dx build_asset tensorflow_asset
    "runSpec": {
        ...
        "assetDepends": [
        {
          "id": "record-xxxx
        }
      ]
        ...
    }
    dx build -f tensorboard-web-app
    dx run tensorboard-web-app
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    │       ├── applet script
    samtools view -c "${mappings_bam_name}" > "${mappings_bam_prefix}.txt"
    dx-download-all-inputs
    mkdir -p out/counts_txt
    # [VARIABLE]_path is the absolute path to the downloaded file
    $ echo $mappings_bam_path
    /home/dnanexus/in/mappings_bam/my_mappings.bam
    # [VARIABLE]_prefix is the file name minus the longest matching pattern in the dxapp.json file
    $ echo $mappings_bam_prefix
    my_mappings
    # [VARIABLE]_name is the file name as it appears on the platform
    $ echo $mappings_bam_name
    my_mappings.bam
    samtools view -c "${mappings_bam_path}" \
      > out/counts_txt/"${mappings_bam_prefix}.txt"
    dx-upload-all-outputs
    main() {
      R -e "shiny::runApp('~/my_app', host='0.0.0.0', port=443)"
    }
    dx build_asset shiny-asset
    "runSpec": {
        ...
        "assetDepends": [
        {
          "id": "record-xxxx
        }
      ]
        ...
    }
    dx build -f dash-web-app
    dx run dash-web-app
    {
      "treeTurnaroundTimeThreshold": {threshold},
      ...
    }
    {
      "perWorkflowDxAttributes": {
        {workflow_name}: {
          "treeTurnaroundTimeThreshold": {threshold},
          ...
        },
        ...
      }
    }
    dx run app-csv-loader \
       -i database_name=pheno_db \
       -i create_mode=strict \
       -i insert_mode=append \
       -i spark_read_csv_header=true \
       -i spark_read_csv_sep=, \
       -i spark_read_csv_infer_schema=true \
       -i csv=file-xxxx \
       -i table_name=sample_metadata \
       -i csv=file-yyyy \
       -i table_name=gwas_result
    %%bash
    dx download input_data/reads.fastq
    %%bash
    dx upload results.csv


    Using Histograms in the Cohort Browser

    In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or similar values, in a particular field.

The Cohort Browser automatically groups records into bins based on the distribution of the field's values in the dataset. Values are distributed linearly along the x axis.

    Below is a sample histogram showing the distribution of values in a field Critical care total days. The label under the chart title indicates the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.

    Histogram in the Cohort Browser

    Non-Numeric Data in Histograms

    A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you see the following informational message below the chart:

    Histogram Displaying Data for a Field Containing Non-Numeric Values

    Clicking the "non-numeric values" link displays detail on those values, and the number of record in which each appears:

    Detail on Non-Numeric Values Omitted from a Histogram

In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the chevron ("ˇ") icon in the lower right corner of the tile containing the chart opens a tooltip showing the cohort names and the colors used to represent data in each.

    Histogram in Cohort Compare Mode

    See Comparing Cohorts for more on using Cohort Compare mode.

    Preparing Data for Visualization in Histograms

    When ingesting data using Data Model Loader, the following data types can be visualized in histograms:

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

    • Date

    • Date Sparse

    • Datetime

    • Datetime Sparse

Supported data types for histograms: Numerical (Integer), Numerical (Float), Date, and Datetime.

    Supported field types for scatter plots:

    • Primary field: Numerical (Integer) or Numerical (Float)

    • Secondary field: Numerical (Integer) or Numerical (Float)

    Using Scatter Plots in the Cohort Browser

    In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that more records share a particular combination.

    Scatter Plot: Insurance Billed x Cost

    Non-Numeric Data in Scatter Plots

    Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a scatter plot. The message "This field contains non-numeric values" appears below the scatter plot, as in this sample chart:

    Scatter Plot Based on Field or Fields Containing Non-Numeric Values

    Clicking the "non-numeric values" link displays detail on those values, and the number of record in which each appears.

    Detail on Non-Numeric Values

    Limit on Number of Data Points

In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require more data points than this, you see the following message above the chart:

    Scatter Plot with Warning Message about Data Point Limit

In this scenario, add a cohort filter to reduce the cohort size, so that the scatter plot can show data for all members of the cohort.

    Cohort Compare

    Scatter plots are not supported in Cohort Compare.

    Preparing Data for Visualization in Scatter Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in scatter plots:

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

  • The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.

  • GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.

  • Apollo GLnexus

    Basic Run

    Advanced Run

    To learn more about GLnexus, see GLnexus or Getting started with GLnexus.

    Annotating Variants

    VCF files can include variant annotations. SnpEff annotations provided as INFO/ANN tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.
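If you do annotate the pVCF yourself, a typical SnpEff invocation looks roughly like the sketch below; the genome database name, file names, and memory setting are placeholders to adapt to your data:

    java -Xmx8g -jar snpEff.jar GRCh38.99 harmonized.vcf.gz > harmonized.ann.vcf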

    The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.

    Annotation flow

VCF annotation flows. In (a) the annotation step is external to the VCF Loader, whereas in (b) the annotation step is internal. In either case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF Loader.


    Executions and Cost and Spending Limits

    Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.

    Types of Cost and Spending Limits

    A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.

    Execution Cost Limits

An execution cost limit is an optional limit on the usage charges an execution tree can incur. This limit is set when a root execution is launched. Once this limit is reached, the DNAnexus Platform terminates running executions in the affected execution tree.
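For example, a cost limit can be supplied when launching a root execution from the CLI; this is a sketch, with the app name and amount as placeholders:

    dx run app-xxxx --cost-limit 5.00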

    Errors

When an execution is terminated in this fashion, the Platform sets CostLimitExceeded as the failure reason. This failure code is displayed on the UI, on the relevant project's Monitor page.

    Billing Account Spending Limits

    Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.

    Billing account spending limits apply to cumulative charges incurred by projects billed to the account.

    If cumulative charges reach this limit, the Platform terminates running jobs in projects billed to the account, and prevents new executions from being launched.

    Errors

When a job is terminated in this fashion, the Platform sets SpendingLimitExceeded as the failure reason. This failure reason is displayed on the UI, on the relevant project's Monitor page.

    Project-Level Compute and Egress Spending Limits

A license is required to use the Enforce Monthly Spending Limit for Computing and Egress feature. Contact DNAnexus Sales for more information.

Monthly project compute spending limits can be set by project admins, and can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.

    If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.

For more information on these limits, see the billing and account management overview, and the detailed explanation of setting org spending limit policies.

    Compute Charges Incurred by Using Relational Database Clusters

Monthly project compute limits do not apply to compute charges incurred by using relational database clusters.

    Compute Charges for Using Public IPv4 Addresses for Workers

Using public IPv4 addresses for workers incurs additional charges. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any compute spending limit that applies to the project in which the job is running.

For information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions, see the org-xxxx/describe method.

    Getting Info on Cost and Spending Limits

    Execution Costs and Cost Limits

    The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.

    While an execution or execution tree is running, information is displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.

    Spending Limits

Org spending limit information is available from the Billing page for each org.

    Project-Level Monthly Spending Limits

If project-level monthly spending limits have been set for a project, detailed information is available via the CLI, using the command dx describe project-id.

    Chart Types

    Get an overview of the range of different charts you can build and use in the Cohort Browser.

    While working in the Cohort Browser, you can visualize data using a variety of different types of charts.

To visualize data stored in a particular field, follow these directions to browse through the fields in a dataset, select one, and create a chart based on the values in the field. When you select a field, the Cohort Browser suggests a chart type suited to the type of data it contains. You can also create multi-variable charts, displaying data from two fields, to help clarify the relationship between the data stored in each.

    Single-Variable Charts

    The following single-variable chart types are available in the Cohort Browser:

    Multi-Variable Charts

    The following multi-variable chart types are available in the Cohort Browser:

    When creating multi-variable charts using datasets that include data related to multiple entities, the entity relationship between the selected data fields affects chart type availability. Often, data fields related to the same entity, or data fields related to entities that in turn relate to one another in 1:1, N:1, or 1:N fashion, can be used together in a multi-variable chart.

    Interpreting Chart Data

    Chart Totals and Missing Data

    In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.

    This figure is not always the same as the number of records in the cohort.

    In a single-variable chart, if a field in a record is empty or contains a null value, that record is not included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon appears next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.

    The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record isn't included in the chart total count, as its data can't be visualized.

    Stacked Row Chart

    Learn to build and use stacked row charts in the Cohort Browser.

    When to Use Stacked Row Charts

Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.

    When creating a stacked row chart:

    • Both the primary and secondary fields must contain categorical data

    • Both the primary and secondary fields must contain no more than 20 distinct category values

    Supported Data Types

    Categorical multiple and categorical hierarchical data are not supported in stacked row charts.

    Using Stacked Row Charts in the Cohort Browser

    In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."

    The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, you can see that 3,179 records contain the value "Out-patient" in the VisitType field.

    Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, you can see that 87 records in the first group share the value "specialist" in the DoctorType field.

    Cohort Compare

    Stacked row charts are not supported in Cohort Compare. Use a instead.

    Preparing Data for Visualization in Stacked Row Charts

When ingesting data using Data Model Loader, the following data types can be visualized in stacked row charts:

    • String Categorical

    • String Categorical Sparse

    • Integer Categorical

    Kaplan-Meier Survival Curve

    Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.

    Building a Kaplan-Meier Survival Curve Chart

    To generate a survival chart, select one numerical field representing time, and one categorical field, which is transformed into the individual's status.

    The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": living, alive, diseasefree, disease-free.

    For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.

    Calculating Survival Percentage

To calculate the survival percentage at the current event, the system evaluates the following formula:

    S_t = S_(t-1) × (N_t − D_t) / N_t

    • S_t: Survival at the current event

    • N_t: Number of subjects living at the start of the period or event

    • D_t: Number of subjects that died during the period

    For each time period the following values are generated:

    • Status: Each individual is considered Dead unless they qualify as Living.

• Number of Subjects Living at the Start (N_t): For the initial event, this is the total number of records returned by the backend from survival data with Living or Dead status.

    The resulting survival percentage (S_t) is the actual point drawn on the survival plot.

    Learn More

    SAMtools count

    View full source code on GitHub

    This applet performs a basic samtools view -c {bam} command, referred to as "SAMtools count", on the DNAnexus Platform.

    Download BAM Files

    For bash scripts, inputs to a job execution become environment variables. The inputs from the dxapp.json file are formatted as shown below:

The object mappings_bam, a DNAnexus link containing the file ID of that file, is available as an environment variable in the applet's execution. Use the command dx download to download the BAM file. By default, downloading a file preserves the filename of the object on the platform.
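In the applet's bash script, that download step is a single command; this minimal sketch relies on dx download resolving the DNAnexus link stored in the input variable:

    # Download the BAM input; the platform filename (my_mappings.bam) is preserved by default
    dx download "$mappings_bam"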

    SAMtools Count

    Use the bash helper variable mappings_bam_name for file inputs. For these inputs, the DNAnexus Platform creates a bash variable [VARIABLE]_name that holds the platform filename. Because the file was downloaded with default parameters, the worker filename matches the platform filename. The helper variable [VARIABLE]_prefix contains the filename minus any suffixes specified in the input field patterns (for example, the platform removes the trailing .bam to create [VARIABLE]_prefix).

    Upload Result

Use the dx upload command to upload data to the platform. This uploads the file into the job container, a temporary project that holds onto files associated with the job. When running the command dx upload with the flag --brief, the command returns only the file ID.
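A minimal sketch of that upload step, capturing the returned file ID in a shell variable for the next step:

    # Upload the counts file to the job container and keep only the new file ID
    counts_txt_id=$(dx upload "${mappings_bam_prefix}.txt" --brief)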

Job containers are an integral part of the execution process. To learn more, see Containers for Execution.

    Associate With Output

    The output of an applet must be declared before the applet is even built. Looking back to the dxapp.json file, you see the following:

    The applet declares a file type output named counts_txt. In the applet script, specify which file should be associated with the output counts_txt. On job completion, this file is copied from the temporary job container to the project that launched the job.
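A minimal sketch of that association, using the dx-jobutil-add-output helper and the file ID captured above:

    # Declare the uploaded file as the job's counts_txt output
    dx-jobutil-add-output counts_txt "$counts_txt_id" --class=file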

    Row Chart

    Learn to build and use row charts in the Cohort Browser.

    When to Use Row Charts

    Row charts can be used to visualize categorical data.

    When creating a row chart:

    • The data must be from a field that contains either categorical or categorical multi-select data

    • This field must contain no more than 20 distinct category values

    • The values cannot be organized in a hierarchy

    Supported Field Types

See When to Use List Views for Categorical Data, below, if you need to visualize hierarchical categorical data.

    When to Use List Views for Categorical Data

Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a list view.

    Row charts aren't supported in Cohort Compare mode. In Cohort Compare mode, row charts are converted to .

    Using Stacked Row Charts for Multivariate Visualizations

    Row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a .

    Using Row Charts in the Cohort Browser

    In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."

    Below is a sample row chart showing the distribution of values in a field Salt added to food. In the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.

When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See Chart Totals and Missing Data for more information on how missing data affects chart calculations.

    Preparing Data for Visualization in Row Charts

When ingesting data using Data Model Loader, the following data types can be visualized in row charts, if category values are specified as such in the coding file used at ingestion:

    • String Categorical

    • String Categorical Sparse

    • String Categorical Multi-select

    • Integer Categorical

    While sparse serial data can be visualized using row charts, non-encoded values are not supported. These values do not appear as rows.

    Executions and Time Limits

    Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.

    Types of Time Limits

    On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.

    Job Timeouts

Each job has a timeout setting. This setting denotes the maximum amount of "wall clock time" that the job can spend in the "running" state, that is, actively running on the DNAnexus Platform.

    If the job is still running when this limit is reached, the job is terminated.

The default job timeout setting is 30 days, though individual apps may have different timeout settings, as specified by the app's creator. A job may also be given a custom timeout setting.

    How Job Timeouts Work

    As noted above, job timeouts only apply to the time a job spends in the "running" state.

    Job timeouts do not apply to any time a job spends waiting to begin running - as, for example, when a job is waiting for inputs to become available.

    Job timeouts also do not apply to the time a job may spend between exiting the "running" state, and entering the "done" state - as, for example, when it is waiting for subjobs to finish.

To learn more about timeouts, see the documentation on the job lifecycle and job states.

    Errors

If a job fails to complete before reaching its timeout limit, it is terminated, with the Platform returning JobTimeoutExceeded as the job's failure reason.

    Execution Tree Expiration

Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree's root execution.

    After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.

    If an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree's root execution.

    Errors

If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform does it fail. Depending on the access pattern, the Platform returns AppInternalError, AppError, or AuthError as the job's failure reason.

    Monitoring Time Limits

    To see information on time limits for execution and execution trees:

    1. Navigate to the project in which the execution or execution tree is being run.

    2. Click the Monitor tab.

    3. Click the name of the execution or execution tree to open a page showing detailed information on it.

    If a time limit is approaching, a warning message provides information on when the limit is reached.

    If a job is waiting for subjobs to finish, it is shown as running, but job timeout information is not displayed. Execution tree information continues to be displayed.

    Cohort Browser

    Visualize your data and browse your multi-omics datasets.

    Cohort Browser is a visualization tool for exploring and filtering structured datasets. It provides an intuitive interface for creating visualizations, defining patient cohorts, and analyzing complex data.

    Cohort Browser supports multiple types of datasets:

    • Clinical and phenotypic data - Patient demographics, clinical measurements, and outcomes

    • Germline variants - Inherited genetic variations

    Dash Example Web App

    This is an example web app made with Dash, which in turn uses Flask underneath.

    Creating the web application

    After configuring an app with Dash, start the server on port 443.

Inside the dxapp.json, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose this port.

    Creating Charts and Dashboards

    Create charts, manage dashboards, and build visualizations to explore your datasets in the Cohort Browser.

    Create interactive visualizations and manage dashboard layouts in the Cohort Browser.

    If you'd like to filter your dataset to specific samples, see .

    Managing Dashboards

    Running DXJupyterLab

    Learn to launch a JupyterLab session on the DNAnexus Platform, via the DXJupyterLab app.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

For DNAnexus Platform users, a license is required to access DXJupyterLab. Contact DNAnexus Sales for more information.

    Grouped Box Plot

    Learn to build and use grouped box plots in the Cohort Browser.

    When to Use Grouped Box Plots

    Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.

    When creating a grouped box plot:

    Environment Variables

    The command-line client and the client bindings use a set of environment variables to communicate with the API server and to store state on the current default project and directory. These settings are set when you run dx login and can be changed through other dx commands. To display the active settings in human-readable format, use the dx env command:

    To print the bash commands for setting the environment variables to match what dx is using, you can run the same command with the --bash flag.
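For example:

    # Human-readable summary of the current settings
    dx env
    # Bash export commands matching what dx is currently using
    dx env --bash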

    Running a dx command from the command-line does not (and cannot) overwrite your shell's environment variables. The environment variables are stored in the ~/.dnanexus_config/environment file.

    Box Plot

    Learn to build and use box plots in the Cohort Browser.

    When to Use Box Plots

    Box plots can be used to visualize the distribution of values in a field containing numerical data.

    Supported Data Types

    • Numerical (Integer)

    • Numerical (Float)

    dx run app-glnexus \
        -i common.gvcf_manifest=<manifest_file_id> \
        -i common.config=gatk_unfiltered \
        -i common.targets_bed=<bed_target_ranges>
    dx run workflow-glnexus \
        -i common.gvcf_manifest=<manifest_file_id> \
        -i common.config=gatk_unfiltered \
        -i common.targets_bed=<bed_target_ranges> \
        -i unify.shards_bed=<bed_genomic_partition_ranges> \
        -i etl.shards=<num_sample_partitions>
    {
      "inputSpec": [
        {
          "name": "mappings_bam",
          "label": "Mapping",
          "class": "file",
          "patterns": ["*.bam"],
          "help": "BAM format file."
        }
      ]
    }

    For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
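    As a minimal sketch (assuming the Dash 0.39-era packages listed below, with a purely illustrative layout), the core of such an app reduces to building a layout and then handing control to the server on port 443:

    import dash
    import dash_html_components as html

    app = dash.Dash(__name__)
    app.layout = html.Div([html.H1("Hello from a DNAnexus web applet")])

    # This call blocks forever; nothing after it is executed.
    app.run_server(host="0.0.0.0", port=443)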

    The rest of these instructions apply to building any applet with dependencies stored in an asset.

    Creating an applet on DNAnexus

    Install the DNAnexus SDK and log in, then run dx-app-wizard with default options.

    Creating the asset

    The dash-asset directory specifies all the packages and versions needed. These come from the Dash installation guide.

    Add these into dash-asset/dxasset.json:

    Build the asset:

    Use the asset from the applet

    Add this asset to the applet's dxapp.json:

    Build the applet

    Build and run the applet itself:

    You can always use dx ssh job-xxxx to SSH into the worker and inspect what's going on or experiment with quick changes. Then go to that job's special URL, https://job-xxxx.dnanexus.cloud/, to see the result.

    Optional local testing

    The main code is in dash-web-app/resources/home/dnanexus/my_app.py with a local launcher script called local_test.py in the same folder. This allows you to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.

    Install locally the same libraries listed above.

    To launch the web app locally:

    Once it spins up, open the web app in your browser to see the result.

    View full source code on GitHub
    Configuration File Prioritization

    The following is the order in which DNAnexus utilities load values from configuration sources:

    1. Command line options (if available)

    2. Environment variables already set in the shell

    3. ~/.dnanexus_config/environment.json (dx configuration file)

    4. Hardcoded defaults
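    As an illustrative sketch only (this is not how dx itself is implemented, and it assumes the configuration file stores settings under their environment-variable names), the priority above amounts to a lookup like this:

    import json
    import os

    CONFIG_PATH = os.path.expanduser("~/.dnanexus_config/environment.json")

    def resolve_setting(name, cli_value=None, default=None):
        """Resolve a setting: command line > shell environment > config file > default."""
        if cli_value is not None:          # 1. command-line option
            return cli_value
        if name in os.environ:             # 2. environment variable already set in the shell
            return os.environ[name]
        try:
            with open(CONFIG_PATH) as handle:
                stored = json.load(handle) # 3. dx configuration file
            if name in stored:
                return stored[name]
        except (OSError, ValueError):
            pass
        return default                     # 4. hardcoded default

    print(resolve_setting("DX_APISERVER_HOST", default="api.dnanexus.com"))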

    Overriding the dx Configuration File

    The dx command always prioritizes the environment variables that are set in the shell. This means that if you have set the DX_SECURITY_CONTEXT environment variable and then use dx login to log in as a different user, dx still uses the original environment variable. When not run in a script, dx prints a warning to stderr whenever the environment variables and its stored state do not match. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables directly is an approach best reserved for shell scripts or for the job environment in the cloud.

    In the interaction below, environment variables have already been set. The user then uses dx login to select a different project, but the shell's environment variables still override the values stored by dx.

    Clearing dx-set Variables

    If you instead want to discard the values which dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.

    Command Line Options

    Most dx commands have the following additional flags to temporarily override the values of the respective variables.

    For example, you can temporarily override the current default project used:

    dx download "${mappings_bam}"
    readcount=$(samtools view -c "${mappings_bam_name}")
    echo "Total reads: ${readcount}" > "${mappings_bam_prefix}.txt"
    counts_txt_id=$(dx upload "${mappings_bam_prefix}.txt" --brief)
    {
      "name": "counts_txt",
      "class": "file",
      "label": "Read count file",
      "patterns": [
        "*.txt"
      ],
      "help": "Output file with Total reads as the first line."
    }
    dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    app.run_server(host='0.0.0.0', port=443)
    pip install dash==0.39.0  # The core dash backend
    pip install dash-html-components==0.14.0  # HTML components
    pip install dash-core-components==0.44.0  # Supercharged components
    pip install dash-table==3.6.0  # Interactive DataTable component
    pip install dash-daq==0.1.0  # DAQ components
    {
      ...
      "execDepends": [
        {"name": "dash", "version": "0.39.0", "package_manager": "pip"},
        {"name": "dash-html-components", "version": "0.14.0", "package_manager": "pip"},
        {"name": "dash-core-components", "version": "0.44.0", "package_manager": "pip"},
        {"name": "dash-table", "version": "3.6.0", "package_manager": "pip"},
        {"name": "dash-daq", "version": "0.1.0", "package_manager": "pip"}
      ],
      ...
    }
    dx build_asset dash-asset
    "runSpec": {
        ...
        "assetDepends": [
        {
          "id": "record-xxxx
        }
      ]
        ...
    }
    dx build -f dash-web-app
    dx run dash-web-app
    cd dash-web-app/resources/home/dnanexus/
    python3 local_test.py
    $ dx env
    Auth token used         adLTkSNkjxoAerREqbB1dVkspQzCOuug
    API server protocol     https
    API server host         api.dnanexus.com
    API server port         443
    Current workspace       project-9zVpbQf4Zg2641v5BGY00001
    Current workspace name  "Scratch Project"
    Current folder          /
    Current user            alice
    $ dx env --bash
    export DX_SECURITY_CONTEXT='{"auth_token_type": "bearer", "auth_token": "adLTkSNkjxoAerREqbB1dVkspQzCOuug"}'
    export DX_APISERVER_PROTOCOL=https
    export DX_APISERVER_HOST=api.dnanexus.com
    export DX_APISERVER_PORT=443
    export DX_PROJECT_CONTEXT_ID=project-9zVpbQf4Zg2641v5BGY00001
    $ dx ls -l
    Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
    Folder : /
    <Contents of Sample Project>
    $ dx login
    Acquiring credentials from https://auth.dnanexus.com
    Username: alice
    Password:
    
    Note: Use "dx select --level VIEW" or "dx select --public" to select from
    projects for which you only have VIEW permissions.
    
    Available projects:
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Mouse (ADMINISTER)
    
    Pick a numbered choice [1]: 2
    Setting current project to: Mouse
    $ dx ls
    WARNING: The following environment variables were found to be different than the
    values last stored by dx: DX_SECURITY_CONTEXT, DX_PROJECT_CONTEXT_ID
    To use the values stored by dx, unset the environment variables in your shell by
    running "source ~/.dnanexus_config/unsetenv". To clear the dx-stored values,
    run "dx clearenv".
    Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
    Folder : /
    <Contents of Sample Project>
    $ source ~/.dnanexus_config/unsetenv
    $ dx ls -l
    Project: Mouse (project-9zVpbQf4Zg2641v5BGY00001)
    Folder : /
    <Contents of Mouse>
    $ dx --env-help
    usage: dx command ... [--apiserver-host APISERVER_HOST]
                          [--apiserver-port APISERVER_PORT]
                          [--apiserver-protocol APISERVER_PROTOCOL]
                          [--project-context-id PROJECT_CONTEXT_ID]
                          [--workspace-id WORKSPACE_ID]
                          [--security-context SECURITY_CONTEXT]
                          [--auth-token AUTH_TOKEN]
    
    optional arguments:
      --apiserver-host APISERVER_HOST
                            API server host
      --apiserver-port APISERVER_PORT
                            API server port
      --apiserver-protocol APISERVER_PROTOCOL
                            API server protocol (http or https)
      --project-context-id PROJECT_CONTEXT_ID
                            Default project or project context ID
      --workspace-id WORKSPACE_ID
                            Workspace ID (for jobs only)
      --security-context SECURITY_CONTEXT
                            JSON string of security context
      --auth-token AUTH_TOKEN
                            Authentication token
    $ dx env --project-context-id project-B0VK6F6gpqG6z7JGkbqQ000Q
    Auth token used         R54BN6Ws6Zl3Y0VqBA9o1qweUswYW5o4
    API server protocol     https
    API server host         api.dnanexus.com
    API server port         443
    Current workspace       project-B0VK6F6gpqG6z7JGkbqQ000Q
    Current folder          /

    For follow-up events, this is the number of subjects at the start of the previous event, minus the number of subjects who died in the previous event and the subjects who dropped out or were censored in the previous event.

  • Number of Subjects Who Died (D): 1 for each individual who, at the event, does not have a status of Living.

  • Number of Subjects Dropped or Censored: 1 for each individual who, at the event, has a status of Living.

  • Survival Percent at the Current Event ($S_T$): $S_T = \frac{L_{T0} - D}{L_{T0}}$, where $L_{T0}$ is the number of subjects at the start of the event and $D$ is the number of subjects who died at the event.

  • Cumulative Survival ($S$): $S = S_{T-1} \cdot S_T$, where $S_{T-1}$ is the survival percent at the previous event.

    For more background, see the Survival Curve and Kaplan-Meier articles on Wikipedia.

  • Somatic variants - Cancer-related genetic changes
  • Gene expressions - Molecular expression measurements

  • Multi-assay datasets - Datasets combining multiple assay types or instances of the same assay type

    If you need to perform custom statistical analysis, you can also use JupyterLab environments with Spark clusters to query your data programmatically.

    Prerequisites

    You need to ingest your data before you can access it through a dataset in the Cohort Browser.

    Opening Datasets Using the Cohort Browser

    1. In Projects, select the project where your dataset is located.

    2. Go to the Manage tab.

    3. Select your dataset.

    4. Click Explore Data.

    Selecting a dataset to explore

    You can also use the Info Panel to view information about the selected dataset, such as its creator or sponsorship.

    Getting familiar with Cohort Browser

    Depending on your dataset, the Cohort Browser shows the following tabs:

    • Overview - Clinical data using interactive charts and dashboards

    • Data Preview - Clinical data in tabular format

    • Assay-specific tabs - Additional tabs appear based on your dataset content:

      • Germline Variants - For datasets containing germline genomic variants

      • Somatic Variants - For datasets containing somatic variants and mutations

      • Gene Expression - For datasets containing molecular expression data

    Exploring Data in a Dataset

    In the Cohort Browser's Overview tab, you can visualize your data using charts. These visualizations provide an introduction to the dataset and insights on the clinical data it contains.

    Sample clinical dashboard with multiple tiles and cohorts

    When you open a dataset, Cohort Browser automatically creates an empty cohort that includes all records in the dataset. From here, you can add filters to create specific cohorts, build visualizations to explore your data, and export filtered data for further analysis outside the platform.

    Next Steps

    • Creating Charts and Dashboards - Build visualizations and manage dashboard layouts

    • Defining and Managing Patient Cohorts - Filter data and create patient groups

    • Analyzing Germline Genomic Variants - Work with inherited genetic variations

    • Analyzing Somatic Variants and Mutations - Explore cancer-related genetic changes

    • Analyzing Gene Expression Data - Examine molecular expression patterns

    Dashboards contain your charts and define their layout. Each such configuration is called a dashboard view. Dashboard views can be specific to a saved cohort or standalone (custom dashboard view). You can create multiple dashboard views, allowing you to switch between different visualizations and analyses.

    By using Dashboard Actions, you can save or load your own dashboard views. This lets you quickly switch between different visualizations without having to set them up each time.

    • Save Dashboard View - Saves the current dashboard configuration as a record of the DashboardView type, including all tiles and their settings.

    • Load Dashboard View - Loads a custom dashboard view, restoring the tiles and their configurations.

    Using Dashboard Actions

    After loading a dashboard view once, you can access it again from Dashboard Actions > Custom Dashboard Views.

    Moving dashboards between datasets? If you want to use your dashboard views with a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your custom dashboard configurations to a new target dataset.

    Visualizing Data

    Add charts to your dashboards to visualize the clinical and phenotypical data in your dataset. For example, you can add charts to display patient demographics or clinical measurements.

    Working with Multi-Assay Visualizations

    For omics datasets, such as those for germline variants, somatic variants, or gene expression, you have additional predefined visualization options available:

    • Germline and somatic variants are visualized using lollipop plots and variant frequency matrices. For details, see Analyzing Germline Variants and Analyzing Somatic Variants.

    • Gene expression data is visualized using expression level and feature correlation charts. For details, see Analyzing Gene Expression Data.

    Adding Tiles to Visualize Data

    Each chart is represented as a tile on the dashboard. You can add multiple tiles to visualize different aspects of your data.

    1. In the Overview tab, click + Add Tile on the top-right.

    2. In the hierarchical list of the dataset fields, select the field you want to visualize.

    3. In Data Field Details, choose your preferred chart type.

      • The available chart types depend on the field's value type.

    4. Click Add Tile.

    The tile immediately appears on the dashboard. You can add up to 15 tiles.

    Creating Multi-Variable Charts

    When selecting data fields to visualize, you can add a secondary data field to create a multi-variable chart. This allows you to visualize relationships between two data fields in the same chart.

    To visualize the relationship between two data fields in the same chart, first select your primary data field from the hierarchical list. This opens a Data Field Details panel, showing the field's information and a preview of a basic chart.

    To add a secondary field, keep the primary field selected and search for the desired field. When you find it, click the Add as Secondary Field icon (+) next to its name rather than selecting it directly. This adds the new field to the visualization. The Data Field Details panel updates to show the combined information for both fields.

    You can click the + icon only when at least one chart type is supported for the specified combination.

    For certain chart types, such as Stacked Row Chart and Scatter Plot, you can re-order the primary and secondary data fields by dragging the data field in Data Field Details.

    Adding grouped box plot by combining two data fields

    For more details on multi-variable charts, including how to build a survival curve, see Multi-Variable Charts.

    Chart Optimization

    When working with large datasets, keep these tips in mind:

    • Limit dashboard tiles: To ensure fast loading times and a clear overview, it's best to limit the number of charts on a single dashboard. Typically, 8-10 tiles is a good number for human comprehension and optimal performance.

    • Filter data first: Reduce the volume of data by applying filters before you create complex visualizations. This improves chart loading speed.

    Running from the UI
    1. In the main menu, navigate to Tools > JupyterLab. If you have used DXJupyterLab before, the page shows your previous sessions across different projects.

    2. Click New JupyterLab.

    3. Configure your JupyterLab session:

      • Specify the session name and select an instance type.

      • Choose the project where JupyterLab should run.

      • Set the session duration after which the environment automatically shuts down.

      • Optionally, provide a snapshot file to load a previously saved environment.

      • If needed, enable Spark Cluster and set the number of nodes.

    4. Select a feature option based on your analysis needs:

      • PYTHON_R (default): Python3 and R kernel and interpreter

      • ML: Python3 with machine learning packages (TensorFlow, PyTorch, CNTK) and image processing (Nipype), but no R

      • IMAGE_PROCESSING: Python3 with image processing packages (Nipype, FreeSurfer, FSL), but no R. FreeSurfer requires a license.

      • STATA: Stata requires a license to run

      • MONAI_ML: Extends the ML feature with specialized medical imaging frameworks, such as MONAI Core, MONAI Label, and 3D Slicer.

    5. Review the pricing estimate (if you have billing access) based on your selected duration and instance type.

    6. Click Start Environment to launch your session. The JupyterLab shows an "Initializing" state while the worker spins up and the server starts.

    7. Open your JupyterLab environment by clicking the session name link once the state changes to "Ready". You can also access it directly via https://job-xxxx.dnanexus.cloud, where job-xxxx is your job ID.

    Snapshots created using older versions of DXJupyterLab are incompatible with the current version. If you need to use an older DXJupyterLab snapshot, see environment snapshot guidelines.

    For a detailed list of libraries included in each feature option, see the in-product documentation.

    Running DXJupyterLab from the CLI

    You can start the JupyterLab environment directly from the command line by running the app:

    Once the app starts, you can check whether the JupyterLab server is ready to serve connections, which is indicated by the job's property httpsAppState being set to running. Once it is running, you can open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
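    For example, a small polling loop along these lines (a sketch using the dxpy bindings, assuming the job's describe output includes its properties) waits until the server is ready:

    import time
    import dxpy

    job_id = "job-xxxx"  # replace with the ID reported by dx run

    while True:
        desc = dxpy.DXJob(job_id).describe()
        if desc.get("properties", {}).get("httpsAppState") == "running":
            print("JupyterLab is ready at https://%s.dnanexus.cloud" % job_id)
            break
        if desc["state"] in ("failed", "terminated"):
            raise RuntimeError("Job ended in state " + desc["state"])
        time.sleep(30)  # poll every 30 seconds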

    To run the Spark version of the app, use the command:

    You can check the optional input parameters for the apps on the DNAnexus Platform (platform login required to access the links):

    • DXJupyterLab App

    • DXJupyterLab Spark Cluster Enabled App

    From the CLI, you can learn more about dx run with the following command:

    where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.

    Next Steps

    See the Quickstart and References pages for more details on how to use DXJupyterLab.

  • The primary field must contain categorical or categorical multiple data

  • The primary field must contain no more than 15 distinct category values

  • The secondary field must contain numerical data

    Supported Data Types

  • Primary Field: Categorical or Categorical Multiple (<=15 categories)

  • Secondary Field: Numerical (Integer) or Numerical (Float)

    Using Grouped Box Plots in the Cohort Browser

    The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:

    Grouped Box Plot

    Non-Numeric Values in Grouped Box Plots

    A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart above for an example of the informational message that shows below the chart, in this scenario.

    Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:

    Grouped Box Plot: Detail on Non-Numeric Values

    Outliers

    Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:

    Outlier Value in a Grouped Box Plot

    This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. In multiple groups, one member was recorded as consuming far more cups of coffee per day than others in the group.

    Grouped Box Plots in Cohort Compare

    In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.

    In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.

    Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group.

    Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.

    Grouped Box Plot in cohort compare mode

    Preparing Data for Visualization in Grouped Box Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in grouped box plots:

    Primary Field

    • String Categorical

    • String Categorical Multi-Select

    • String Categorical Sparse

    • Integer Categorical

    • Integer Categorical Multi-Select

    Secondary Field

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse

    Numerical data can also be visualized using histograms.

    Using Box Plots in the Cohort Browser

    Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:

    • Max - The maximum, or highest value

    • Med - The median value

    • Min - The minimum, or lowest value

    The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.

    Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values. "Q3" is the highest value in the third quartile.

    Also shown in this window is the total count of values covered by the box plot, along with the name of the entity to which the data relates.

    Box Plot with Detail on Value Distribution

    Non-Numeric Data in Box Plots

    Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a box plot. See the chart above for an example of the informational message that shows below the chart when non-numeric values are present.

    Clicking the "non-numeric values" link displays detail on those values, and the number of record in which each appears:

    Detail on Non-Numeric Values Omitted from a Box Plot

    In this scenario, a discrepancy exists between the "count" figure shown in the chart label and the one shown in the informational window that opens when hovering over the middle of a box plot. The latter figure is smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.

    Outliers

    Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:

    Outlier Value in a Box Plot

    This box plot displays data on the number of cups of coffee consumed per day, by patients of a particular cohort. One cohort patient was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.

    Box Plots in Cohort Compare Mode

    In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.

    Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort.

    Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.

    Box Plot in Cohort Compare Mode

    Preparing Data for Visualization in Box Plots

    When ingesting data using Data Model Loader, the following data types can be visualized in box plots:

    • Integer

    • Integer Sparse

    • Float

    • Float Sparse


    Developer Tutorials

    Access developer tutorials and examples.

    Developers new to the DNAnexus Platform may find it helpful to learn by doing. This page contains a collection of tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus Platform. After reading through the tutorials and examples you should be able to develop app(let)s that:

    • Run efficiently: use cloud computing methodologies.

    • Are straightforward to debug: let developers understand and resolve issues.

    • Use the scale of the cloud: take advantage of the DNAnexus Platform's flexibility

    • Are straightforward to use: reduce support and enable collaboration.

    If it's your first time developing an app(let), read the Getting Started series first. This series introduces terms and concepts that the tutorials and examples build on.

    These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. These tutorials showcase varied implementations of the SAMtools view command on the DNAnexus Platform.

    Bash App(let) Tutorials

    Bash app(let)s use dx-toolkit, the platform SDK, and the command-line interface along with common Bash practices to create bioinformatic pipelines in the cloud.

    Python App(let) Tutorials

    Python app(let)s make use of dx-toolkit's Python implementation (dxpy) along with common Python modules such as subprocess to create bioinformatic pipelines in the cloud.

    Web App(let) Tutorials

    To create a web applet, you need access to Titan or Apollo features. Web applets can be made as either Python or Bash applets. The only difference is that they launch a web server and expose port 443 (for HTTPS) to allow a user to interact with that web application through a web browser.

    Concurrent Computing Tutorials

    A bit of terminology before starting the discussion of parallel and distributed computing paradigms on the DNAnexus Platform.

    Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud. (Here's a particularly helpful Stack Exchange post on the subject.) To make the documentation easier to understand when discussing concurrent computing paradigms, this guide refers to:

    • Parallel: Using multiple threads or logical cores to concurrently process a workload.

    • Distributed: Using multiple machines (in this case, cloud instances) that communicate to concurrently process a workload.

    Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.

    Parallel

    Distributed

    Parallel by Chr (py)

    This applet tutorial performs a SAMtools count using parallel threads.

    View full source code on GitHub

    To take full advantage of the scalability that cloud computing offers, your scripts have to implement the correct methodologies. This applet tutorial shows you how to:

    1. Install SAMtools

    2. Download BAM file

    3. Count regions in parallel

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

    For additional information, refer to the execDepends documentation.

    Download BAM file

    The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:

    • <var>_path (string): full absolute path to where the file was downloaded.

    • <var>_name (string): name of the file, including extension.

    • <var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.

    The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, the dictionary has the following key-value pairs:
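    The exact keys depend on your inputSpec. As a sketch, assuming a single file input named mappings_bam (as in the inputSpec example shown in these tutorials), the returned dictionary would contain entries along these lines; note that each value is a list, since an input may be an array of files:

    import dxpy

    inputs = dxpy.download_all_inputs()

    # Hypothetical contents for a single file input named "mappings_bam":
    #   inputs["mappings_bam_path"]   -> ["/home/dnanexus/in/mappings_bam/<file name>.bam"]
    #   inputs["mappings_bam_name"]   -> ["<file name>.bam"]
    #   inputs["mappings_bam_prefix"] -> ["<file name>"]
    bam_path = inputs["mappings_bam_path"][0]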

    Count Regions in Parallel

    Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to 10 and the workload per thread is set to 1 chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use multiprocessing.dummy, a wrapper around Python's threading module.

    Each worker creates a string to be called in a subprocess.Popen call. The multiprocessing.dummy.Pool.map(<func>, <iterable>) function is used to call the helper function run_cmd for each string in the iterable of view commands. Because multithreaded processing is performed using subprocess.Popen, the process does not alert to any failed processes. Closed workers are verified in the verify_pool_status helper function.

    Important: In this example, you use subprocess.Popen to process and verify results in verify_pool_status. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
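    A condensed sketch of this pattern (run_cmd and verify_pool_status are named after the tutorial's helpers, but the bodies below are illustrative, and input.bam stands in for the downloaded BAM):

    import subprocess
    from multiprocessing.dummy import Pool  # thread-backed drop-in for multiprocessing.Pool

    def run_cmd(cmd):
        """Run one shell command and return (stdout, stderr, exit code)."""
        proc = subprocess.Popen(cmd, shell=True,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        return out, err, proc.returncode

    def verify_pool_status(results):
        """Fail loudly if any worker exited non-zero, since Popen alone will not."""
        errors = [err for _, err, code in results if code != 0]
        if errors:
            raise RuntimeError(b"\n".join(errors).decode())

    chromosomes = ["chr%s" % c for c in list(range(1, 23)) + ["X", "Y"]]
    view_cmds = ["samtools view -c input.bam %s" % c for c in chromosomes]

    pool = Pool(10)                          # 10 worker threads
    results = pool.map(run_cmd, view_cmds)   # one samtools view -c per chromosome
    pool.close()
    pool.join()
    verify_pool_status(results)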

    Gather Results

    Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file is used to upload and generate a DXFile corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function is used to generate the appropriate DXLink value.
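    As a sketch of the gather-and-output step (it reuses the results list from the previous sketch; the file name read_counts.txt and the exact parsing are illustrative, while counts_txt matches the output name used in these tutorials):

    import dxpy

    # Each worker's stdout is a single number: the read count for one chromosome.
    total = sum(int(out.split()[0]) for out, _, _ in results)

    with open("read_counts.txt", "w") as handle:
        handle.write("Total reads: %d\n" % total)

    counts_file = dxpy.upload_local_file("read_counts.txt")  # returns a DXFile handler

    # Job outputs map names from dxapp.json's outputSpec to values; files are DXLinks.
    output = {"counts_txt": dxpy.dxlink(counts_file)}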

    Parallel by Region (py)

    This applet tutorial performs a SAMtools count using parallel threads.

    View full source code on GitHub

    To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial shows you how to:

    1. Install SAMtools

    2. Download BAM file

    3. Split workload

    4. Count regions in parallel

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.

    Download Inputs

    This applet downloads all inputs at once using dxpy.download_all_inputs:

    Split workload

    Using the Python multiprocessing module, you can split the workload into multiple processes for parallel execution:

    With this pattern, you can quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.

    Specific helpers are created in the applet script to manage the workload. One helper you may have seen before is run_cmd. This function manages the subprocess calls:

    Before the workload can be split, you need to identify the regions present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
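    A sketch of what such a parser might look like (the function name follows the tutorial; the implementation below, based on samtools view -H, is illustrative):

    import subprocess

    def parse_sam_header_for_region(bam_path):
        """Return the reference names (@SQ SN: fields) listed in the BAM header."""
        header = subprocess.check_output(["samtools", "view", "-H", bam_path]).decode()
        regions = []
        for line in header.splitlines():
            if line.startswith("@SQ"):
                for field in line.split("\t"):
                    if field.startswith("SN:"):
                        regions.append(field[3:])
        return regions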

    Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.

    The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs from the workers are parsed to determine whether the run failed or passed.

    Git Dependency

    View full source code on GitHub

    What does this applet do?

    This applet performs a basic SAMtools count of alignments present in an input BAM.

    Prerequisites

    The app must have network access to the hostname where the git repository is located. In this example, access.network is set to:

    To learn more about access and network fields, see the Execution Environment Reference.

    How is the SAMtools dependency added?

    SAMtools is cloned and built from the SAMtools GitHub repository. The following is a closer look at the dxapp.json file's runSpec.execDepends property:

    The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, the git fetch dependencies for htslib and SAMtools are specified. Dependencies resolve in the order listed. Specify htslib first, before the SAMtools build_commands, because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:

    • package_manager - Details the type of dependency and how to resolve it.

    • url - Must point to the server containing the repository. In this case, a GitHub URL.

    • tag/branch - Git tag/branch to fetch.

    • destdir - Directory on worker to which the git repository is cloned.

    • build_commands - Commands to build the dependency, run from the repository destdir. In this example, htslib is built when SAMtools is built, so only the SAMtools entry includes build_commands.

    The build_commands are executed from the destdir. Use cd when appropriate.

    How is SAMtools called in the src script?

    Because "destdir": "/home/dnanexus" is set in dxapp.json, the git repository is cloned to the same directory from which the script executes. The example directory's structure:

    The SAMtools command in the app script is samtools/samtools.

    Applet Script

    You can build SAMtools in a directory that is on the $PATH or add the binary directory to $PATH. Keep this in mind for your app(let) development.

    Parallel by Region (sh)

    This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

    Debugging

    The command set -e -x -o pipefail assists you in debugging this applet:

    • -e causes the shell to immediately exit if a command returns a non-zero exit code.

    • -x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.

    • -o pipefail makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)

    The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. You can then download or create a *.bai index as needed.

    Parallel Run

    Bash's job control system allows for convenient management of multiple processes. In this example, bash commands are run in the background while the maximum number of job executions is controlled in the foreground. You can place processes in the background using the character & after a command.

    Job Output

    Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The directory structure required for dx-upload-all-outputs is shown below:

    In your applet, upload all outputs by creating the output directory and then using dx-upload-all-outputs to upload the output files.

    Pysam

    This applet performs a SAMtools count on an input BAM using Pysam, a Python wrapper for SAMtools.

    View full source code on GitHub

    How is Pysam provided?

    Pysam is provided through a pip3 install using the pip3 package manager in the dxapp.json's runSpec.execDepends property:

    The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, pip3 is specified as the package manager and pysam version 0.15.4 as the dependency to resolve.

    Downloading Input

    The fields mappings_sorted_bam and mappings_sorted_bai are passed to the main function as parameters for the job. These parameters are dictionary objects with the key-value pair {"$dnanexus_link": "file-xxxx"}. File objects from the Platform are handled through DXFile handles. If an index file is not supplied, then a *.bai index is created.
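    A sketch of that step using the dxpy handlers (local file names are illustrative):

    import dxpy
    import pysam

    def main(mappings_sorted_bam, mappings_sorted_bai=None):
        # Wrap the {"$dnanexus_link": "file-xxxx"} input in a DXFile handler and download it.
        dxpy.download_dxfile(dxpy.DXFile(mappings_sorted_bam).get_id(), "mappings.bam")

        if mappings_sorted_bai is not None:
            dxpy.download_dxfile(dxpy.DXFile(mappings_sorted_bai).get_id(), "mappings.bam.bai")
        else:
            pysam.index("mappings.bam")  # build the *.bai index when none was supplied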

    Working with Pysam

    Pysam provides key methods that mimic SAMtools commands. In this applet example, the focus is only on canonical chromosomes. The Pysam object representation of a BAM file is pysam.AlignmentFile.

    The helper function get_chr extracts the list of canonical chromosomes from the input BAM.

    Once a list of canonical chromosomes is established, you can iterate over them and perform the Pysam version of samtools view -c, pysam.AlignmentFile.count.
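    A condensed sketch of that loop (the set of canonical names and the local file name are illustrative):

    import pysam

    CANONICAL = {"chr%s" % c for c in list(range(1, 23)) + ["X", "Y"]}

    def get_chr(alignment_file):
        """Return the canonical chromosomes present in an open AlignmentFile."""
        return [name for name in alignment_file.references if name in CANONICAL]

    bam = pysam.AlignmentFile("mappings.bam", "rb")
    counts = {chrom: bam.count(region=chrom) for chrom in get_chr(bam)}
    total_count = sum(counts.values())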

    Uploading Outputs

    The summarized counts are returned as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file uploads and generates a DXFile corresponding to the tabulated result file.

    Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function generates the appropriate DXLink value.

    Parallel xargs by Chr

    This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:

    When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

    /usr/bin is part of the $PATH variable, so the script can reference the samtools command directly, for example, samtools view -c ....

    Parallel Run

    Splice BAM

    First, download the BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.

    To split a BAM by regions, you need a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, you generate the *.bai in the applet, sorting the BAM if necessary.

    Xargs SAMtools view

    In the previous section, you recorded the name of each sliced BAM file into a record file. Next, perform a samtools view -c on each slice using the record file as input.

    Upload results

    The results file is uploaded using the standard bash process:

    1. Upload a file to the job execution's container.

    2. Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>

    VCF Loader

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.

    The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many files case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In any case, every input VCF file must be a syntactically correct, sorted VCF file.

    VCF Preprocessing

    Although VCF data can be loaded into Apollo databases after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, preprocessing and harmonizing the data before loading is recommended. To learn more, see VCF Preprocessing.

    How to Run VCF Loader

    Input:

    • vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned and every specified partition must be a valid VCF file. After the partition-merge step in preprocessing, the complete VCF file must be valid. (See the sketch below for one way to generate such a manifest.)
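    For example, a small helper like this sketch (the project ID and folder are placeholders) lists the .vcf.gz files in a project folder and writes their IDs one per line:

    import dxpy

    project = "project-xxxx"  # placeholder: project containing the VCF partitions
    folder = "/vcfs"          # placeholder: folder containing the *.vcf.gz files

    with open("vcf_manifest.txt", "w") as manifest:
        for result in dxpy.find_data_objects(classname="file", name="*.vcf.gz",
                                             name_mode="glob", project=project,
                                             folder=folder):
            manifest.write(result["id"] + "\n")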

    Required Parameters:

    • database_name: (string) name of the database into which to load the VCF files.

    • create_mode: (string) strict mode creates the database and tables from scratch, and optimistic mode creates databases and tables if they do not already exist.

    Other Options:

    • snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.

    • snpeff_human_genome: (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.

    • insert_mode: (string) append appends data to the end of tables, and overwrite is equivalent to truncating the tables and then appending to them.

    • run_mode: (string) site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data, and all mode processes both types of data.

    • etl_spec_id: (string) Only the genomics-phenotype schema choice is supported.

    • is_sample_partitioned: (boolean) whether the raw VCF data is partitioned.

    • snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.

    • snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.

    • calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). This option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.

    • calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.

    • snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.

    • num_init_partitions: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.

    Basic Run


    Exploring and Querying Datasets

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Extracting Data From a Dataset With Spark

    The dx

    Analyzing Germline Variants

    Analyze germline genomic variants, including filtering, visualization, and detailed variant annotation in the Cohort Browser.

    Explore and analyze datasets with germline data by opening them in the Cohort Browser and switching to the Germline Variants tab. You can create cohorts based on germline variants, visualize variant patterns, and examine detailed variant information.

    Filtering by Germline Variants

    You can filter your cohort to include only samples with specific germline variants.

    To apply a germline filter to your cohort:

    Analyzing Gene Expression Data

    Analyze gene expression data, including expression-based filtering, visualization, and molecular profiling in the Cohort Browser.

    Explore and analyze datasets with gene expression assays by opening them in the Cohort Browser and switching to the Gene Expression tab. You can create cohorts based on expression levels, visualize expression patterns, and examine detailed gene information.

    Gene expression datasets are created using the .

    Smart Reuse (Job Reuse)

    Speed workflow development and reduce testing costs by reusing computational outputs.

    A license is required to access the Smart Reuse feature. Contact DNAnexus Sales for more information.

    DNAnexus allows organizations to optionally reuse outputs of jobs that share the same executable and input IDs, even if these outputs are across projects or entire organizations. This feature has two primary use cases.

    dx run app-dxjupyterlab
    dx run app-dxjupyterlab_spark_cluster
    dx run -h APP_NAME
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    {
      "runSpec": {
        ...
        "execDepends": [
          {
            "name": "pysam",
            "package_manager": "pip3",
            "version": "0.15.4"
          }
        ]
        ...
      }
    }
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── <samtools binary>

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.


    Distributed by Chr (sh)

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends.

    For additional information, see execDepends.
    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:

    main

    The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.

    Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.

    Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.

    The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.

    count_func

    This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.

    sum_reads

    The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.

    This function returns read_sum_file as the entry point output.

    View full source code on GitHub
    Debugging

    The command set -e -x -o pipefail assists you in debugging this applet:

    • -e causes the shell to immediately exit if a command returns a non-zero exit code.

    • -x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.

• -o pipefail makes a pipeline's return code the first non-zero exit code produced by any command in the pipeline. (By default, a pipeline's return code is the exit code of its last command, which can hide failures and create difficult-to-debug problems.) See the short example below.
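A minimal illustration of the pipefail difference, separate from the applet code:

# Without pipefail, the failure of 'false' is masked by the final 'true'
set +o pipefail
false | true
echo $?   # prints 0

# With pipefail, the pipeline reports the first failing command
set -o pipefail
false | true
echo $?   # prints 1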

The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. Then, you can download or create a *.bai index as needed.

    Parallel Run

    Bash's job control system allows for convenient management of multiple processes. In this example, you can run bash commands in the background as you control maximum job executions in the foreground. Place processes in the background using the character & after a command.

    Job Output

Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The directory structure required by dx-upload-all-outputs is shown below:

    In your applet, upload all outputs by:

    View full source code on GitHub
Apt-Get

The resources/ folder is placed in the worker's root directory /:

    /usr/bin is part of the $PATH variable, so in the script, you can reference the samtools command directly, as in samtools view -c ....

    Parallel Run

    Splice BAM

First, download the BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.

    To split a BAM by regions, you need to have a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, the *.bai is generated in the applet, sorting the BAM if necessary.

    Xargs SAMtools view

    In the previous section, the name of each sliced BAM file was recorded into a record file. Next, perform a samtools view -c on each slice using the record file as input.

    Upload results

    The results file is uploaded using the standard bash process:

    1. Upload a file to the job execution's container.

    2. Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>

    View full source code on GitHub
    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:

    main

The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.

    Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.

    Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.

    The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.

    count_func

    This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.

    sum_reads

    The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.

    This function returns read_sum_file as the entry point output.

    View full source code on GitHub
The extract_dataset and extract_assay germline commands let you either retrieve the data dictionary of a dataset or extract the underlying data described by that dictionary. You can also use these commands to get dataset metadata, such as the names and titles of entities and fields, or to list all relevant assays in a dataset.

    Often, you can retrieve data without using Spark, and extra compute resources are not required (see the example OpenBio notebooks). However, if you need more compute power—such as when working with complex data models, large datasets, or extracting large volumes of data—you can use a private Spark resource. In these scenarios, data is returned through the DNAnexus Thrift Server. While the Thrift Server is highly available, it has a fixed timeout that may limit the number of queries you can run. Using private compute resources helps avoid these timeouts by scaling resources as needed.

    If you use the --sql flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.

    Initiating a Spark Session

The most common way to use Spark on the DNAnexus Platform is via a Spark-enabled JupyterLab notebook.

    After creating a Jupyter notebook within a project, enter the commands shown below to start a Spark session.

    Python:

    R:

    Executing SQL Queries

    Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:

    Python:

    R:

    Query to Extract Data From Database Using extract_dataset

    Python:

    Where dataset is the record-id or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset."

    R:

    Where dataset is the record-id or the path to the dataset or cohort.

    Query to Filter and Extract Data from Database Using extract_assay germline

    Python:

    R:

In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with the filters for the --retrieve-allele command. For more information, refer to the notebooks in the DNAnexus OpenBio dx-toolkit examples.

    Run SQL Query to Extract Data

    Python:

    R:

    Best Practices

• When querying large datasets, such as those containing genomic data, ensure that your Spark cluster is scaled up appropriately, with enough nodes to parallelize across.

• Ensure that your Spark session is initialized only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (for example, by running notebook 1 and then notebook 2, or by running a notebook from start to finish multiple times), the Spark session becomes corrupted and you must restart that notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.

    • If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically this looks something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) as opposed to the database name (metabric_data in this case).
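For example, referencing such a database in a Spark SQL query might look like the following sketch; the table name patient is illustrative:

# Use the fully qualified (unique) database name when the database is outside the project scope.
df = spark.sql(
    "SELECT * FROM database_fjf3y28066y5jxj2b0gz4g85__metabric_data.patient LIMIT 5"
)
df.show()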

    "runSpec": {
      ...
      "execDepends": [
        {"name": "samtools"}
      ]
    }
    {
    mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/SRR504516.bam']
    mappings_bam_name: [u'SRR504516.bam']
    mappings_bam_prefix: [u'SRR504516']
    index_file_path: [u'/home/dnanexus/in/index_file/SRR504516.bam.bai']
    index_file_name: [u'SRR504516.bam.bai']
    index_file_prefix: [u'SRR504516']
    }
    inputs = dxpy.download_all_inputs()
    shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
    input_bam = inputs['mappings_bam_name'][0]
    
    bam_to_use = create_index_file(input_bam)
    print("Dir info:")
    print(os.listdir(os.getcwd()))
    
    regions = parseSAM_header_for_region(bam_to_use)
    
    view_cmds = [
        create_region_view_cmd(bam_to_use, region)
        for region
        in regions]
    
    print('Parallel counts')
    t_pools = ThreadPool(10)
    results = t_pools.map(run_cmd, view_cmds)
    t_pools.close()
    t_pools.join()
    
    verify_pool_status(results)
    def verify_pool_status(proc_tuples):
        err_msgs = []
        for proc in proc_tuples:
            if proc[2] != 0:
                err_msgs.append(proc[1])
        if err_msgs:
            raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
    resultfn = bam_to_use[:-4] + '_count.txt'
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    
    return output
    {
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    inputs = dxpy.download_all_inputs()
# download_all_inputs returns a dictionary that maps inputs to file locations.
# Additionally, helper key-value pairs are added to the dictionary, similar to the bash helper variables.
    inputs
    #     mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
    #     mappings_sorted_bam_name: u'SRR504516.bam'
    #     mappings_sorted_bam_prefix: u'SRR504516'
    #     mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
    #     mappings_sorted_bai_name: u'SRR504516.bam.bai'
    #     mappings_sorted_bai_prefix: u'SRR504516'
    print("Number of cpus: {0}".format(cpu_count()))  # Get cpu count from multiprocessing
    worker_pool = Pool(processes=cpu_count())         # Create a pool of workers, 1 for each core
    results = worker_pool.map(run_cmd, collection)    # map run_cmds to a collection
                                                      # Pool.map handles orchestrating the job
    worker_pool.close()
    worker_pool.join()  # Make sure to close and join workers when done
    def run_cmd(cmd_arr):
        """Run shell command.
        Helper function to simplify the pool.map() call in our parallelization.
        Raises OSError if command specified (index 0 in cmd_arr) isn't valid
        """
        proc = subprocess.Popen(
            cmd_arr,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        exit_code = proc.returncode
        proc_tuple = (stdout, stderr, exit_code)
        return proc_tuple
    def parse_sam_header_for_region(bamfile_path):
        """Helper function to match SN regions contained in SAM header
    
        Returns:
            regions (list[string]): list of regions in bam header
        """
        header_cmd = ['samtools', 'view', '-H', bamfile_path]
        print('parsing SAM headers:', " ".join(header_cmd))
        headers_str = subprocess.check_output(header_cmd).decode("utf-8")
        rgx = re.compile(r'SN:(\S+)\s')
        regions = rgx.findall(headers_str)
        return regions
    # Write results to file
    resultfn = inputs['mappings_sorted_bam_name'][0]
    resultfn = (
        resultfn[:-4] + '_count.txt'
        if resultfn.endswith(".bam")
        else resultfn + '_count.txt')
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    return output
    def verify_pool_status(proc_tuples):
        """
        Helper to verify worker succeeded.
    
        As failed commands are detected, the `stderr` from that command is written
        to the job_error.json file. This file is printed to the Platform
        job log on App failure.
        """
        all_succeed = True
        err_msgs = []
        for proc in proc_tuples:
            if proc[2] != 0:
                all_succeed = False
                err_msgs.append(proc[1])
        if err_msgs:
            raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
    "access": {
      "network": ["github.com"]
    }
      "runSpec": {
     ...
        "execDepends": [
            {
              "name": "htslib",
              "package_manager": "git",
              "url": "https://github.com/samtools/htslib.git",
              "tag": "1.3.1",
              "destdir": "/home/dnanexus"
            },
            {
              "name": "samtools",
              "package_manager": "git",
              "url": "https://github.com/samtools/samtools.git",
              "tag": "1.3.1",
              "destdir": "/home/dnanexus",
              "build_commands": "make samtools"
            }
        ],
    ...
      }
    ├── home
    │   ├── dnanexus
    │       ├── < app script >
    │       ├── htslib
    │       ├── samtools
    │           ├── < samtools binary >
    main() {
      set -e -x -o pipefail
    
      dx download "$mappings_bam"
    
      count_filename="${mappings_bam_prefix}.txt"
      readcount=$(samtools/samtools view -c "${mappings_bam_name}")
      echo "Total reads: ${readcount}" > "${count_filename}"
    
      counts_txt=$(dx upload "${count_filename}" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt}" --class=file
    }
    set -e -x -o pipefail
    echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
    echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
    
    mkdir workspace
    cd workspace
    dx download "${mappings_sorted_bam}"
    
    if [ -z "$mappings_sorted_bai" ]; then
      samtools index "$mappings_sorted_bam_name"
    else
      dx download "${mappings_sorted_bai}"
    fi
    # Extract valid chromosome names from BAM header
    chromosomes=$(
      samtools view -H "${mappings_sorted_bam_name}" | \
      grep "@SQ" | \
      awk -F '\t' '{print $2}' | \
      awk -F ':' '{
        if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
          print $2
        }
      }'
    )
    
    # Split BAM by chromosome and record output file names
    for chr in $chromosomes; do
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    
    # Parallel counting of reads per chromosome BAM
    busyproc=0
    
    while read -r b_file; do
      echo "${b_file}"
    
  # If the busy-process count hits the limit, wait for the running processes to finish
      if [[ "${busyproc}" -ge "$(nproc)" ]]; then
        echo "Processes hit max"
        while [[ "${busyproc}" -gt 0 ]]; do
          wait -n
          busyproc=$((busyproc - 1))
        done
      fi
    
      # Count reads in background
      samtools view -c "${b_file}" > "count_${b_file%.bam}" &
      busyproc=$((busyproc + 1))
    
    done < bamfiles.txt
    while [[ "${busyproc}" -gt  0 ]]; do
      wait -n # p_id
      busyproc=$((busyproc-1))
    done
    ├── $HOME
    │   ├── out
    │       ├── < output name in dxapp.json >
    │           ├── output file
    outputdir="${HOME}/out/counts_txt"
    mkdir -p "${outputdir}"
    cat count* \
      | awk '{sum+=$1} \
      END{print "Total reads = ",sum}' \
      > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
    
    dx-upload-all-outputs
    print(mappings_sorted_bai)
    print(mappings_sorted_bam)
    
    mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
    sorted_bam_name = mappings_sorted_bam.name
    dxpy.download_dxfile(mappings_sorted_bam.get_id(),
                            sorted_bam_name)
    ascii_bam_name = unicodedata.normalize(  # Pysam requires ASCII not Unicode string.
        'NFKD', sorted_bam_name).encode('ascii', 'ignore')
    
    if mappings_sorted_bai is not None:
        mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
        dxpy.download_dxfile(mappings_sorted_bai.get_id(),
                                mappings_sorted_bai.name)
    else:
        pysam.index(ascii_bam_name)
    mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
    regions = get_chr(mappings_obj, canonical_chr)
    def get_chr(bam_alignment, canonical=False):
        """Helper function to return canonical chromosomes from SAM/BAM header
    
        Arguments:
            bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
            canonical (boolean): Return only canonical chromosomes
        Returns:
            regions (list[str]): Region strings
        """
        regions = []
        headers = bam_alignment.header
        seq_dict = headers['SQ']
    
        if canonical:
            re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
            for seq_elem in seq_dict:
                if re_canonical_chr.match(seq_elem['SN']):
                    regions.append(seq_elem['SN'])
        else:
            regions = [''] * len(seq_dict)
            for i, seq_elem in enumerate(seq_dict):
                regions[i] = seq_elem['SN']
    
        return regions
    total_count = 0
    count_filename = "{bam_prefix}_counts.txt".format(
        bam_prefix=ascii_bam_name[:-4])
    
    with open(count_filename, "w") as f:
        for region in regions:
            temp_count = mappings_obj.count(region=region)
            f.write("{region_name}: {counts}\n".format(
                region_name=region, counts=temp_count))
            total_count += temp_count
    
        f.write("Total reads: {sum_counts}".format(sum_counts=total_count))
    counts_txt = dxpy.upload_local_file(count_filename)
    output = {}
    output["counts_txt"] = dxpy.dxlink(counts_txt)
    
    return output
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    # Download BAM from DNAnexus
    dx download "${mappings_bam}"
    
    # Attempt to index the BAM file
    indexsuccess=true
    bam_filename="${mappings_bam_name}"
    samtools index "${mappings_bam_name}" || indexsuccess=false
    
    # If indexing fails, sort then index
    if [[ $indexsuccess == false ]]; then
      samtools sort -o "${mappings_bam_name}" "${mappings_bam_name}"
      samtools index "${mappings_bam_name}"
      bam_filename="${mappings_bam_name}"
    fi
    
    # Extract chromosome names from header
    chromosomes=$(
      samtools view -H "${bam_filename}" | \
      grep "@SQ" | \
      awk -F '\t' '{print $2}' | \
      awk -F ':' '{
        if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
          print $2
        }
      }'
    )
    
    # Split BAM by chromosome and record filenames
    for chr in $chromosomes; do
      samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    counts_txt_name="${mappings_bam_prefix}_count.txt"
    
    # Sum all read counts across split BAM files
    sum_reads=$(
      < bamfiles.txt xargs -I {} \
      samtools view -c $view_options '{}' | \
      awk '{s += $1} END {print s}'
    )
    
    # Write the total read count to a file
    echo "Total Count: ${sum_reads}" > "${counts_txt_name}"
    counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
    dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    dx run vcf-loader \
       -i vcf_manifest=file-xxxx \
       -i is_sample_partitioned=false \
       -i database_name=<my_favorite_db> \
       -i etl_spec_id=genomics-phenotype \
       -i create_mode=strict \
       -i insert_mode=append \
       -i run_mode=genotype
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    {
        mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/SRR504516.bam']
        mappings_bam_name: [u'SRR504516.bam']
        mappings_bam_prefix: [u'SRR504516']
        index_file_path: [u'/home/dnanexus/in/index_file/SRR504516.bam.bai']
        index_file_name: [u'SRR504516.bam.bai']
        index_file_prefix: [u'SRR504516']
    }
    inputs = dxpy.download_all_inputs()
    shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
    input_bam = inputs['mappings_bam_name'][0]
    
    bam_to_use = create_index_file(input_bam)
    print("Dir info:")
    print(os.listdir(os.getcwd()))
    
    regions = parseSAM_header_for_region(bam_to_use)
    
    view_cmds = [
        create_region_view_cmd(bam_to_use, region)
        for region
        in regions]
    
    print('Parallel counts')
    t_pools = ThreadPool(10)
    results = t_pools.map(run_cmd, view_cmds)
    t_pools.close()
    t_pools.join()
    
    verify_pool_status(results)
    def verify_pool_status(proc_tuples):
        err_msgs = []
        for proc in proc_tuples:
            if proc[2] != 0:
                err_msgs.append(proc[1])
        if err_msgs:
            raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
    resultfn = bam_to_use[:-4] + '_count.txt'
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    
    return output
    {
    
      ...
      "runSpec": {
        ...
        "execDepends": [
          {
            "name": "samtools"
          }
        ]
      }
      ...
    }
    {
      "runSpec": {
        ...
        "systemRequirements": {
          "main": {
            "instanceType": "mem1_ssd1_x4"
          },
          "count_func": {
            "instanceType": "mem1_ssd1_x2"
          },
          "sum_reads": {
            "instanceType": "mem1_ssd1_x4"
          }
        },
        ...
      }
    }
dx download "${mappings_sorted_bam}"
chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    if [ -z "${mappings_sorted_bai}" ]; then
        samtools index "${mappings_sorted_bam_name}"
    else
        dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}.bai"
    fi
    
    count_jobs=()
    
    for chr in $chromosomes; do
        seg_name="${mappings_sorted_bam_prefix}_${chr}.bam"
        samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
        bam_seg_file=$(dx upload "${seg_name}" --brief)
        count_jobs+=($(dx-jobutil-new-job \
            -isegmentedbam_file="${bam_seg_file}" \
            -ichr="${chr}" \
            count_func))
    done
    for job in "${count_jobs[@]}"; do
        readfiles+=("-ireadfiles=${job}:counts_txt")
    done
    
    sum_reads_job=$(
        dx-jobutil-new-job \
            "${readfiles[@]}" \
            -ifilename="${mappings_sorted_bam_prefix}" \
            sum_reads
    )
    count_func () {
        echo "Value of segmentedbam_file: '${segmentedbam_file}'"
        echo "Chromosome being counted '${chr}'"
    
        dx download "${segmentedbam_file}"
    
        readcount=$(samtools view -c "${segmentedbam_file_name}")
        printf "${chr}:\t%s\n" "${readcount}" > "${segmentedbam_file_prefix}.txt"
    
        readcount_file=$(dx upload "${segmentedbam_file_prefix}.txt" --brief)
        dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
    }
    sum_reads () {
        set -e -x -o pipefail
    
        printf "Value of read file array %s" "${readfiles[@]}"
        echo "Filename: ${filename}"
        echo "Summing values in files and creating output read file"
    
        for read_f in "${readfiles[@]}"; do
            echo "${read_f}"
            dx download "${read_f}" -o - >> chromosome_result.txt
        done
    
        count_file="${filename}_chromosome_count.txt"
        total=$(awk '{s+=$2} END {print s}' chromosome_result.txt)
        echo "Total reads: ${total}" >> "${count_file}"
    
        readfile_name=$(dx upload "${count_file}" --brief)
        dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
    }
    set -e -x -o pipefail
    echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
    echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
    
    mkdir workspace
    cd workspace
    dx download "${mappings_sorted_bam}"
    
    if [ -z "$mappings_sorted_bai" ]; then
      samtools index "$mappings_sorted_bam_name"
    else
      dx download "${mappings_sorted_bai}"
    fi
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    
    for chr in $chromosomes; do
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    
    busyproc=0
    while read -r b_file; do
      echo "${b_file}"
      if [[ "${busyproc}" -ge "$(nproc)" ]]; then
        echo Processes hit max
        while [[ "${busyproc}" -gt  0 ]]; do
          wait -n # p_id
          busyproc=$((busyproc-1))
        done
      fi
      samtools view -c "${b_file}"> "count_${b_file%.bam}" &
      busyproc=$((busyproc+1))
    done <bamfiles.txt
    while [[ "${busyproc}" -gt  0 ]]; do
      wait -n # p_id
      busyproc=$((busyproc-1))
    done
    ├── $HOME
    │   ├── out
    │       ├── < output name in dxapp.json >
    │           ├── output file
    outputdir="${HOME}/out/counts_txt"
    mkdir -p "${outputdir}"
    cat count* \
      | awk '{sum+=$1} END{print "Total reads = ",sum}' \
      > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
    
    dx-upload-all-outputs
      counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── < samtools binary >
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    dx download "${mappings_bam}"
    
    indexsuccess=true
    bam_filename="${mappings_bam_name}"
    samtools index "${mappings_bam_name}" || indexsuccess=false
    if [[ $indexsuccess == false ]]; then
      samtools sort -o "${mappings_bam_name}" "${mappings_bam_name}"
      samtools index "${mappings_bam_name}"
      bam_filename="${mappings_bam_name}"
    fi
    
    chromosomes=$( \
      samtools view -H "${bam_filename}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    
    for chr in $chromosomes; do
  samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}.bam"
      echo "bam_${chr}.bam"
    done > bamfiles.txt
    counts_txt_name="${mappings_bam_prefix}_count.txt"
    
    sum_reads=$( \
      <bamfiles.txt xargs -I {} samtools view -c $view_options '{}' \
      | awk '{s+=$1} END {print s}')
    echo "Total Count: ${sum_reads}" > "${counts_txt_name}"
    {
    ...
        "runSpec": {
       ...
          "execDepends": [
            {"name": "samtools"}
          ]
        }
    ...
    }
    {
      "runSpec": {
        ...
        "systemRequirements": {
          "main": {
            "instanceType": "mem1_ssd1_x4"
          },
          "count_func": {
            "instanceType": "mem1_ssd1_x2"
          },
          "sum_reads": {
            "instanceType": "mem1_ssd1_x4"
          }
        },
        ...
      }
    }
    dx download "${mappings_sorted_bam}"
    chromosomes=$( \
      samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" \
      | awk -F '\t' '{print $2}' \
      | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
    if [ -z "${mappings_sorted_bai}" ]; then
      samtools index "${mappings_sorted_bam_name}"
    else
      dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}".bai
    fi
    
    count_jobs=()
    for chr in $chromosomes; do
      seg_name="${mappings_sorted_bam_prefix}_${chr}".bam
      samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
      bam_seg_file=$(dx upload "${seg_name}" --brief)
      count_jobs+=($(dx-jobutil-new-job -isegmentedbam_file="${bam_seg_file}" -ichr="${chr}" count_func))
    done
    for job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${job}:counts_txt")
    done
    
    sum_reads_job=$(dx-jobutil-new-job "${readfiles[@]}" -ifilename="${mappings_sorted_bam_prefix}" sum_reads)
    count_func ()
    {
        echo "Value of segmentedbam_file: '${segmentedbam_file}'";
        echo "Chromosome being counted '${chr}'";
        dx download "${segmentedbam_file}";
        readcount=$(samtools view -c "${segmentedbam_file_name}");
        printf "${chr}:\t%s\n" "${readcount}" > "${segmentedbam_file_prefix}.txt";
        readcount_file=$(dx upload "${segmentedbam_file_prefix}".txt --brief);
        dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
    }
    sum_reads ()
    {
        set -e -x -o pipefail;
        printf "Value of read file array %s" "${readfiles[@]}";
        echo "Filename: ${filename}";
        echo "Summing values in files and creating output read file";
        for read_f in "${readfiles[@]}";
        do
            echo "${read_f}";
            dx download "${read_f}" -o - >> chromosome_result.txt;
        done;
        count_file="${filename}_chromosome_count.txt";
        total=$(awk '{s+=$2} END {print s}' chromosome_result.txt);
        echo "Total reads: ${total}" >> "${count_file}";
        readfile_name=$(dx upload "${count_file}" --brief);
        dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
    }
    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    install.packages("sparklyr")
    library(sparklyr)
    port <- Sys.getenv("SPARK_MASTER_PORT")
    master <- paste("spark://master:", port, sep = '')
    sc = spark_connect(master)
    retrieve_sql = 'select .... from .... '
    df = spark.sql(retrieve_sql)
    library(DBI)
    retrieve_sql <- 'select .... from .... '
    df = dbGetQuery(sc, retrieve_sql)
    import subprocess
    cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o", "extracted_data.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
    system(cmd)
    import subprocess
    cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o", "extract_allele.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o extracted_allele.sql")
    system(cmd)
    with open("extracted_data.sql", "r") as file:
        retrieve_sql=""
        for line in file:
            retrieve_sql += line.strip()
    df = spark.sql(retrieve_sql.strip(";"))
    install.packages("tidyverse")
    library(readr)
    retrieve_sql <-read_file("extracted_data.sql")
    retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
    df <- dbGetQuery(sc, retrieve_sql)

• For the cohort you want to edit, click Add Filter.

  • In Add Filter to Cohort > Assays > Genomic Sequencing, select a genomic filter.

  • In Edit Filter: Variant (Germline), specify your filtering criteria:

    • For datasets with multiple germline variant assays, select the specific assay to filter by.

    • On the Genes / Effects tab, select variants of specific types and variant consequences within the specified genes and/or genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.

    • On the Variant IDs tab, specify a list of variant IDs, with a maximum of 100 variants.

    • To enter multiple genes, genomic ranges, or variants, separate them with commas or place each on a new line.

  • Click Apply Filter.

Adding a germline filter

    Exploring Variant Patterns in Your Cohort

    The Germline Variants tab includes a lollipop plot displaying allele frequencies for variants in a specified genomic region. This visualization helps you identify patterns in germline variants across your cohort and understand the distribution of allelic frequencies.

    Genomic Variant Browser and Details

    If your dataset contains multiple germline variant assays, such as WES and WGS assays, you can choose the assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. When you switch between assays, your charts and their display settings are preserved.

    Examining Variant Annotations

    The allele table, located below the lollipop plot, shows the same variants in a tabular format with comprehensive annotation information. It allows you to examine specific variant characteristics and compare allele frequencies within your selected cohort, the entire dataset, and from annotation databases, including gnomAD.

    The annotation information includes:

    • Type: whether the variant is an SNP, deletion, insertion, or mixed.

• Consequences: The impact of the variant according to SnpEff. For variants with multiple gene annotations, this column displays the most severe consequence per gene.

• Population Allele Frequency: Allele frequency calculated across the entire dataset from which the cohort is created.

    • Cohort Allele Frequency: Allele frequency calculated across the current cohort selection.

• GnomAD Allele Frequency: Allele frequency of the specified allele from the public gnomAD dataset.

    If canonical transcript information is available, the following three columns with additional annotation information appear in the Table:

• Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.

    • HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.

    • HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.

    Exporting Variant Metadata

    You can export the selected variants in the table as a list of variant IDs or a CSV file.

    • To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.

    • To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).

    For large datasets, you can use the SQL Runner app to download data in a more efficient way.

    Accessing Detailed Variant Information

In the Allele table > Location column, you can click a specific location to open the locus details. The locus details view provides in-depth annotations and population genetics data for the selected genomic position.

    When genomic information is ingested and made available in the Cohort Browser, variants are annotated using NCBI dbSNP and gnomAD. The specific versions of each are provided during the ingestion process and create a set of tables optimized for cohort creation through the Cohort Browser.

    Viewing specific locus details

    The locus details page displays three main sections of pre-calculated information from dataset ingestion: Location Summary, Genotype Distribution, and Allele Annotations. These sections provide a comprehensive view starting with a locus summary, including genotype frequencies, followed by detailed annotations for each allele.

    • Location Info provides a quick overview of the genomic locus in your dataset, including the chromosome and starting position, the frequency of both the reference allele and no-calls, and the total number of alleles available.

    • Genotypes shows a detailed breakdown of genotypes in the dataset at the specific location. Since allele order is not preserved, genotypes like C/A and A/C are counted in the same category, which is why only half of the comparison table is populated. These genotype frequencies represent the entire dataset at this location, not only your selected cohort.

    • Alleles displays detailed information for each allele, collected from dbSNP and gnomAD during data ingestion. When available, rsID or AffyID appear with direct links to the corresponding NCBI dbSNP page. The section provides allele type, affected samples (dataset), and gnomAD frequency for quick reference, with additional details sorted by transcript ID. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.

    Integrating with Advanced Analysis Tools

    For more sophisticated genomic analysis beyond the Cohort Browser's visualization capabilities, you can connect your variant data with other DNAnexus tools. Export variant lists for detailed analysis in JupyterLab, leverage Spark clusters for large-scale genomic computations, or connect to SQL Runner for complex queries across your dataset.


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Customizing the Gene Expression Dashboard

    You can customize your Gene Expression dashboard to focus on the most relevant analyses for your research:

    • Create new Expression Distribution or Feature Correlation charts.

    • Remove charts you no longer need.

    • Resize and reposition charts to optimize your workspace.

    • Save your dashboard customizations along with your cohort.

    Visualizing gene expression data in Cohort Browser

    The Gene Expression dashboard supports up to 15 charts, allowing you to create comprehensive expression analysis workspaces.

    For datasets with multiple gene expression assays, you can choose the specific assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. Switching between assays preserves your charts and their display settings.

    Filtering by Gene Expression

    You can define your cohort by gene expression to include only patients with specific expression characteristics.

    To apply a gene expression filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Assays > Gene Expression, select a genomic filter.

    3. In Edit Filter: Gene Expression, specify the criteria:

      • For datasets with multiple gene expression assays, select the specific assay to filter by.

      • In Expression Level, specify inclusive minimum and maximum values. For an individual to be included, all their expression values across all samples for the feature must fall within the range.

      • In Gene / Transcript, enter a gene symbol, such as BRCA1, or feature ID, such as ENSG00000012048 or ENST00000309586. Search is case insensitive.

    4. Click Apply Filter.

    Adding a gene expression filter

    You can specify up to 10 gene expression filters for each cohort. All filters use an AND relationship.

    Visualizing Expression Distribution

    The Expression Level charts help you visualize gene expression patterns for individual transcript or gene features. You can examine how expression values are distributed across your cohort, identify outliers, and compare patterns between different patient groups.

The chart displays data for one gene or transcript at a time. You can directly enter a transcript or gene feature ID, such as an ID starting with ENST (https://useast.ensembl.org/Help/View?id=151) or ENSG (https://useast.ensembl.org/info/genome/genebuild/index.html), or search by gene symbol to see available options.

    Visualizing TP53 gene expression

    You can view the data as either a histogram showing frequency distribution or a box plot displaying quartiles and outliers. To switch between these views or adjust display statistics, click ⛭ Chart Settings.

    When comparing cohorts, the chart shows data from each cohort on the same axes for direct comparison.

    You can also customize your charts by selecting different transcript or gene features, resizing and rearranging them on your dashboard, or adjusting display settings to focus on the most relevant analyses for your research.

    Exploring Feature Correlations

    The Feature Correlation charts help you understand how the expression levels of two genes or transcripts relate to each other. You can use these charts to identify genes or transcripts that are co-expressed, explore potential pathway interactions, and compare correlation patterns between different cohorts.

    The chart displays a scatter plot where each point represents a sample, with the X and Y axes showing expression values for your two selected features. A best fit line shows the overall relationship trend, and you can swap which gene appears on which axis to view the data from different perspectives.

    Exploring feature correlations between ERBB2 and TP53

    The correlation analysis includes statistical measures to help you determine if the relationship you're seeing is meaningful. The Pearson correlation coefficient shows both the strength and direction of the linear relationship (ranging from -1 to +1), while the p-value indicates whether the correlation is statistically significant.

    You can toggle these statistics on or off as needed. The chart updates automatically when you change your feature selections or switch between viewing single cohorts versus comparing multiple cohorts. This quantitative analysis helps you assess whether observed correlations are both statistically sound and biologically relevant to your research.

    Examining Detailed Gene Expression Information

    The Expression Per Feature table provides gene metadata and expression statistics for all features in your dataset. Use the search bar to find specific genes by symbol or explore genes within genomic ranges.

    The table displays one row per feature ID with the following columns:

    • Feature ID: The unique transcript or gene identifier, such as ENST for a transcript or ENSG for a gene

    • Gene Symbol: The official gene name or symbol associated with the feature ID, such as TP53

    • Location: The genomic coordinates in "chromosome:start-end" format

    • Strand: The DNA strand orientation (+ or -)

    • Expression (Mean): The average expression value for this feature across the current cohort

    • Expression (SD): The standard deviation of expression values

    • Expression (Median): The median expression value

    When comparing cohorts, the table shows separate expression statistics for each cohort, allowing direct comparison of expression patterns.

    Examining TP53 expression per feature

    Each feature includes links to external annotation resources:

    • Ensembl transcript pages: Detailed transcript information and annotations

    • Ensembl gene pages: Comprehensive gene summaries and functional data

    These links provide quick access to additional context about genes and transcripts of interest.

    Molecular Expression Assay Loader

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Example Use Cases

    Dramatically Speed Up R&D of Workflows

For example, suppose you are developing a workflow, and at each stage you end up debugging an issue. Each stage takes about one hour to develop and run. If you do not reuse outputs during development, the process takes 1 + 2 + 3 + ... + n hours, that is, n(n+1)/2 hours in total, because at every stage you fix something and must recompute results from previous stages. By reusing results for stages that have matured and are no longer modified, the total development time equals the time it takes to develop and run the pipeline (in this case n hours; for a 10-stage workflow, 10 hours instead of 55). This is roughly an order-of-magnitude reduction in development time, and the improvement becomes more pronounced for longer workflows.

    This feature also saves time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the earlier stages, the time required for R&D of the forked version equals the time for the last stages.

    Dramatically Reduce Costs When Testing at Scale

In production environments, test R&D modifications to a workflow at scale. This is especially relevant for workflows used in clinical tests. For example, suppose you are testing a workflow like the forked workflow discussed earlier. This clinical workflow must be tested on thousands of samples (let that number be represented by m) before it is vetted for production. Suppose the whole workflow takes n hours but only the last k stages changed. You save (n-k)m total compute hours; for example, with n = 10, k = 2, and m = 1,000 samples, that is 8,000 compute hours saved. This can add up to dramatic cost savings as m grows and k stays small.

    Example Reuse with WDL

    To show Smart Reuse, the following example uses WDL syntax as supported by DNAnexus SDK and dxCompiler.
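The workflow definitions themselves are not reproduced in this export; a minimal sketch consistent with the description, with illustrative task bodies and file names, is:

version 1.0

task dupfile {
  input {
    File f
  }
  command <<<
    # make a duplicate copy of the input file
    cat ~{f} > duplicated.txt
  >>>
  output {
    File out = "duplicated.txt"
  }
}

task headfile {
  input {
    File f
  }
  command <<<
    head -n 10 ~{f} > first_10_lines.txt
  >>>
  output {
    File out = "first_10_lines.txt"
  }
}

workflow basic_reuse {
  input {
    File f
  }
  call dupfile { input: f = f }
  call headfile { input: f = dupfile.out }
  output {
    File result = headfile.out
  }
}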

    The workflow above is a two-step workflow that duplicates a file and takes the first 10 lines from the duplicate.

    Suppose the user has run the workflow above on some file and wants to tweak headfile to output the first 15 lines instead:
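Again sketching the missing block under the same assumptions:

version 1.0

task dupfile {
  input {
    File f
  }
  command <<<
    # make a duplicate copy of the input file (unchanged from basic_reuse)
    cat ~{f} > duplicated.txt
  >>>
  output {
    File out = "duplicated.txt"
  }
}

task headfile2 {
  input {
    File f
  }
  command <<<
    head -n 15 ~{f} > first_15_lines.txt
  >>>
  output {
    File out = "first_15_lines.txt"
  }
}

workflow basic_reuse_tweaked {
  input {
    File f
  }
  call dupfile { input: f = f }
  call headfile2 { input: f = dupfile.out }
  output {
    File result = headfile2.out
  }
}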

Here the only differences are that headfile and basic_reuse are renamed (to headfile2 and basic_reuse_tweaked) and the line count changes from 10 to 15. The compilation process automatically detects that dupfile is the same but the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a different executable ID for headfile2.

When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results from the dupfile task are reused: because there is already a job on the DNAnexus Platform that has run that exact executable with the same input file, the system can reuse its output.

    When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the --preserve-job-outputs option. This preserves the outputs of all jobs in the execution tree in the project and increases the potential for subsequent Smart Reuse.

    Specific Properties

    Smart Reuse applies under the following conditions:

    • It is available only for jobs run in projects billed to organizations with Smart Reuse enabled.

    • It applies only to jobs completed after the organization's policies have been updated to enable Smart Reuse.

    Jobs can reuse results from previous jobs if all these criteria are met:

• A previous job exists that used the exact same executable and input IDs (including the function called within the applet). If an input is watermarked, both the watermark and its version must match. Other settings, such as the instance type, do not affect reuse.

    • If ignoreReuse: true is set, the job is not eligible for future reuse.

    • The job being reused must have all outputs available and accessible at the time of reuse. If any output is missing or inaccessible, reuse is impossible.

    • Each reused job includes an outputReusedFrom field, which points to the original job ID that produced the outputs. This field never refers to another reused job.

• Results can be reused across projects only if the app(let)'s dxapp.json file includes "allProjects": "VIEW" in the "access" field.

    • You must have at least VIEW access to the original job's outputs, and those outputs must still exist on the Platform. Outputs that have been deleted cannot be reused.

    • Reused jobs are reported as having run for 0 seconds and are billed at $0.

    • Outputs are assumed to be deterministic.

    • If the reused job or workflow is in a different project or folder, the output data is not cloned to the new project or destination folder, since the job or workflow is not actually rerun.

    Enable/Disable Smart Reuse

    If you are an administrator of a licensed org and want to enable Smart Reuse, run this command:
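The command itself is not reproduced in this export. A sketch using a dx api call, assuming the relevant org policy field is named jobReuse (confirm against the current API documentation for the org update method), is:

# Replace org-myorg with your org ID; jobReuse is the presumed policy name.
dx api org-myorg update '{"policies": {"jobReuse": true}}'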

If you plan to use this feature across projects, you must modify all applet and app configurations to include "allProjects": "VIEW" in the "access" field, as described above.
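For example, the relevant addition to each dxapp.json is:

"access": {
  "allProjects": "VIEW"
}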

    Conversely, set the value to false to disable it. If you are a licensed customer and cannot run the command above, contact DNAnexus Support. If you are interested in this feature and are not a licensed customer, reach out to DNAnexus Sales or your account executive for more information.


    Login and Logout

    Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.

    Logging In and Out via the User Interface

    To log in via the user interface (UI), open the login page and enter your username and password.

    To log out via the UI, click on your avatar at the far right end of the main Platform menu, then select Sign Out:

    Logging in and out

    Logging In via the Command-Line Interface

To log in via the command-line interface (CLI), make sure you've downloaded and installed the dx command-line client. From the CLI, enter the command dx login.

    Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.

See below for directions on using a token to log in.

See the command-line reference for details on optional arguments that can be used with dx login.

    Logging Out via the Command-Line Interface

When using the CLI, log out by entering the command dx logout.

If you use a token to log in, logging out invalidates that token. To log in again, you must generate a new token.

See the command-line reference for details on optional arguments that can be used with dx logout.

    Auto Logout

The system logs out users after fifteen minutes of inactivity. Exceptions apply to users logged in with an API token that specifies a different session duration, or users in an org with a custom autoLogoutAfter policy.

Contact DNAnexus Support for more information on setting a custom autoLogoutAfter policy for an org.

    Using Tokens

    You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.

    Exercise caution when sharing DNAnexus Platform tokens. Anyone with a token can access the Platform and impersonate you as a user. They gain your access level to any projects accessible by the token, enabling them to run jobs and potentially incur charges to your account.

    Generating a Token

    To generate a token, click on your avatar at the top right corner of the main Platform menu, then select My Profile from the dropdown menu.

    Next, click on the API Tokens tab. Then click the New Token button:

    The New Token form opens in a modal window:

    Consider the following points when filling out the form:

• The token provides access to each project at the level at which you have access. See the documentation on project access levels.

    • If the token provides access to a project within which you have PHI data access, it enables access to that PHI data.

    • Tokens without a specified expiration date expire in one month.

    After completing the form, click Generate Token. The system generates a 32-character token and displays it with a confirmation message.

    Copy your token immediately. The token is inaccessible after dismissing the confirmation message or navigating away from the API Tokens screen.

    Using a Token to Log In

To log in with a token via the CLI, enter the command dx login --token, followed by a valid 32-character token.

    Token Use Cases

    Tokens are useful in multiple scenarios, such as:

• Logging in via the CLI with single sign-on enabled - If your organization uses single sign-on (SSO), logging in via the CLI might require a token instead of a username and password.

    • Logging in via a script - Scripts can use tokens to authenticate with the Platform.

When incorporating a token into a script, set the token's expiration date so that the script has Platform access only for as long as necessary. Also ensure that the token grants access only to the project or projects the script needs in order to function properly.

    Revoking a Token

    To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button:

    In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token is revoked, and its name no longer appears in the list of tokens on the API Tokens screen.

    When to Revoke a Token

    • Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.

    • Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.

    Logging In Non-Interactively

    Though logging in typically requires direct interaction with the Platform through the UI or CLI, non-interactive login is also possible. Scripts commonly automate both login and project selection.

Non-interactive login uses dx login with the --token argument. The command automates project selection. For manual project selection, add the --noprojects argument to dx login and then choose a project separately (for example, with dx select).
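A minimal sketch of a scripted login; the token variable and project ID are placeholders:

# Authenticate non-interactively; the token value should come from a secure source.
dx login --token "$DX_API_TOKEN" --noprojects

# Then select a working project explicitly.
dx select project-xxxx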

    Two-Factor Authentication

    DNAnexus recommends adding two-factor authentication to your account, to provide an extra means of ensuring the security of all data to which you have access, on the Platform.

    With two-factor authentication enabled, you must enter a two-factor authentication code to log into the Platform and access certain other services. This code is a time-based one-time password valid for a single session, generated by a third-party two-factor authenticator application, such as Google Authenticator.

    Two-factor authentication protects your account by requiring both your credentials and an authentication code. This prevents unauthorized access even if your username and password are compromised.

    Enabling Two-Factor Authentication

    To enable two-factor authentication, select Account Security from the dropdown menu accessible via your avatar, at the top right corner of the main menu.

    In the Account Security screen, click the button labeled Enable 2FA. Then follow the instructions to select and set up a third-party authenticator application.

DNAnexus recommends using Google Authenticator on your mobile device. Google Authenticator is a popular, free application that's available for both Apple iOS and Android mobile devices. Get it on Google Play or from the Apple App Store.

    If you are unable to use a smartphone application, compatible two-factor authenticator applications, using the TOTP (Time-based One-time Password) algorithm, exist for other platforms.

    Enabling two-factor authentication redirects you to a page containing back-up codes. These codes serve as alternatives to two-factor authentication codes if you lose access to your authenticator application.

    Store the back-up codes in a secure place. Without them and without access to your authenticator application, Platform login becomes impossible.

Contact DNAnexus Support if you lose both your back-up codes and access to your authenticator application.

    Disabling Two-Factor Authentication

    DNAnexus recommends keeping two-factor authentication enabled after activation. If disabling is necessary, navigate to the Account Security screen of your profile, then click the Turn Off button in the Two-Factor Authentication section. The system requires your password and a two-factor authentication code to confirm this change.

    Disabling and re-enabling two-factor authentication requires reconfiguration of your authenticator application. Reconfiguration involves scanning a new QR code or entering a new secret key code, and saving a new set of back-up codes.

    Mkfifo and dx cat

    View full source code on GitHub

    This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipes) special files, run the command man fifo in your shell.

    Named pipes require BOTH a stdin and stdout. The following examples run incomplete named pipes in background processes so the foreground script does not block.

    To approach this use case, outline the desired steps for the applet:

    1. Stream the BAM file from the platform to a worker.

    2. While the BAM streams, count the number of reads present.

    3. Write the result to a file.

    4. Stream the result file to the platform.

    Stream BAM file from the platform to a worker

    First, establish a named pipe on the worker. Then, stream to stdin of the named pipe by downloading the file as a stream from the platform using dx cat.

    Output BAM file read count

    Having created the FIFO special file representing the streamed BAM, you can call the samtools command as you normally would. The samtools command reading the BAM provides the BAM FIFO file with a stdout. However, remember that you want to stream the output back to the Platform. You must create a named pipe representing the output file too.

    The directory structure created here (~/out/counts_txt) is required to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> are uploaded to the corresponding <output name> specified in the dxapp.json.

    Stream the result file to the platform

    A stream from the platform has been established, piped into a samtools command, and the results are output to another named pipe. However, the background process remains blocked without a stdout for the output file. Creating an upload stream to the platform resolves this.

    Upload as a stream to the platform using the commands dx-upload-all-outputs or dx upload -. Specify --buffer-size when needed.

    Alternatively, dx upload - can upload directly from stdin, eliminating the need for the directory structure required for dx-upload-all-outputs. Warning: When uploading a file that exists on disk, dx upload is aware of the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not automatically known and dx upload uses default parameters. While these parameters are fine for most use cases, you may need to specify upload part size with the --buffer-size option.
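    As a rough sketch of the upload stream described above (using the FIFO created under ~/out/counts_txt earlier in this tutorial; starting the upload in the background gives that FIFO its stdout):

    dx-upload-all-outputs &
    upload_pid="$!"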

    Wait for background processes

    With background processes running, wait in the foreground for those processes to finish.

    Without waiting, the app script running in the foreground would finish and terminate the job prematurely.

    How is the SAMtools dependency provided?

    The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the worker's root directory. In this case:

    When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

    /usr/bin is part of the $PATH variable, so the samtools command can be referenced directly in the script as samtools view -c ....

    Distributed by Region (sh)

    Entry Points

    Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:

    • main

    • count_func

    • sum_reads

    main

    The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Every 10 regions are sent, as input, to the count_func entry point using the dx-jobutil-new-job command.

    Job outputs from the count_func entry point are referenced as Job Based Object References (JBORs) and used as inputs for the sum_reads entry point.

    The job output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference in the dx-jobutil-add-output command.

    count_func

    This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.

    Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the dx-jobutil-add-output command.

    sum_reads

    The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.

    This entry point returns read_sum as a JBOR, which is then referenced as job output.

    In the main function, the output is referenced via the dx-jobutil-add-output command.

    List View

    Learn to build and use list views in the Cohort Browser.

    An Apollo license is required to use the Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    When to Use List Views

    List views can be used to visualize categorical data.

    When creating a list view:

    • The data must be from a field that contains either categorical or categorical multi-select data

    • This field must contain no more than 20 distinct category values

    • The values can be organized in a hierarchy

    Supported Data Types

    • Categorical (<=20 distinct category values)

    • Categorical Multiple (<=20 distinct category values)

    • Categorical Hierarchical (<=20 distinct category values)

    • Categorical Hierarchical Multiple (<=20 distinct category values)

    Using List Views to Visualize Hierarchically Organized Data

    List views, unlike row charts, can be used to visualize categorical data with values that are organized in a hierarchical fashion.

    Using List Views to Visualize Data from Two Different Fields

    List views can be used to visualize categorical data from two different fields. The same restrictions apply to the fields whose values are displayed, as when creating a basic list view.

    Using List Views in the Cohort Browser

    Visualizing Data from a Single Field

    In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort - the "count" - that contain this value. Also shown is a figure labeled "freq." - this is the percentage of all cohort records that contain the value.

    Below is a sample list view showing the distribution of values in a field Episode type. In the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.

    When records are missing values for the displayed field, the sum of the "count" figures is smaller than the total cohort size, and the sum of the "freq." figures is less than 100%. See Chart Totals and Missing Data for more information on how missing data affects chart calculations.

    Visualizing Data from Two Fields

    To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.

    Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:

    Critical care record origin is the primary field, Critical care record format is the secondary field.

    Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:

    Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.

    In these additional rows, "count" and "freq." figures refer to records having a particular combination of values in the two fields.

    Visualizing Complex Categorical Data

    Below is an example of a list view used to visualize data in a categorical hierarchical field Home State/Province:

    By default, only values in the category at the top level of the hierarchy are displayed.

    Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:

    In these additional rows, "count" and "freq." figures refer to records having a particular combination of values across the hierarchy levels. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category and "British Columbia" for the second-level category.

    The following example shows how "count" and "freq." are calculated, for list views based on fields containing categorical data organized into multiple levels of hierarchy:

    For the bottommost row, "count" and "freq" refer to records having the following values:

    • "Yes" for the category at the top of the hierarchy

    • "9" for the category at the second level of the hierarchy

    • "8" for the category at the third level of the hierarchy

    • "7" for the category at the fourth level of the hierarchy

    Locating Values in a List View

    In cases where the field has categories at multiple levels and this makes it difficult to find a particular value, use the search box at the bottom of the list view to home in on a row or rows containing that value:

    List Views in Cohort Compare

    In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:

    In each column, count and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.

    Preparing Data for Visualization in List Views

    When ingesting data using the Data Model Loader, the following data types can be visualized in list views:

    • String Categorical

    • String Categorical Hierarchical

    • String Categorical Multi-Select

    • String Categorical Multi-Select Hierarchical

    • String Categorical Sparse

    • String Categorical Sparse Hierarchical

    • Integer Categorical

    • Integer Categorical Hierarchical

    • Integer Categorical Multi-Select

    • Integer Categorical Multi-Select Hierarchical

    Developer Quickstart

    Learn to build an app that you can run on the Platform.

    This tutorial provides a quick intro to the DNAnexus developer experience, and progresses to building a fully functional, useful app on the Platform. For a more in-depth discussion of the Platform, see Intro to Building Apps.

    The steps below require the DNAnexus SDK. You must download and install it if you have not done so already.

    Besides this Quickstart, there are Developer Tutorials located in the sidebar that go over helpful tips for new users as well. A few of them include:

    • Distributed by Region (sh)

    • Distributed by Chr (sh)

    • Parallel by Chr (py)

    • R Shiny Example Web App

    Running Batch Jobs

    To launch a DNAnexus application or workflow on many files automatically, one may write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the DNAnexus SDK provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see these instructions.

    Overview

    In this tutorial, you batch process a series of sample FASTQ files (forward and reverse reads). Use the dx generate_batch_inputs command to generate a batch file -- a tab-delimited (TSV) file where each row corresponds to a single run in the batch. Then you process the batch using the dx run command with the --batch-tsv option.

    Path Resolution

    When using the command-line client, you may refer to objects either through their ID or by name.

    In the DNAnexus Platform, every data object has a unique ID starting with the class of the object followed by a hyphen ('-') and 24 alphanumeric characters. Common object classes include "record", "file", and "project". An example ID would be record-9zGPKyvvbJ3Q3P8J7bx00005. A string matching this format is always interpreted to be meant as the ID of such an object and is not further resolved as a name.

    The command-line client, however, also accepts names and paths as input in a particular syntax.

    Path Syntax

    Job Lifecycle

    Learn about the states through which a job or analysis may go, during its lifecycle.

    Example Execution Tree

    The following example shows a workflow that has two stages, one of which is an applet, and the other of which is an app.

    When the workflow runs, it generates an analysis with an attached workspace for storing intermediate output from its stages. Jobs are created to run the two stages. These jobs can spawn additional jobs to run other functions in the same executable or to run separate executables. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).

    Parallel by Region (py)

    This applet tutorial performs a SAMtools count using parallel threads.

    To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial:

    1. Install SAMtools

    2. Download BAM file

    3. Split workload

    4. Count regions in parallel

    DXJupyterLab Reference

    This page is a reference for most useful operations and features in the DNAnexus JupyterLab environment.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Connect to Thrift

    Learn about the DNAnexus Thrift server, a service that allows JDBC and ODBC clients to run Spark SQL queries.

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    About the DNAnexus Thrift Server

    task dupfile {
        File infile
    
        command { cat ${infile} ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    task headfile {
        File infile
    
        command { head -10 ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    workflow basic_reuse {
        File infile
        call dupfile { input: infile=infile }
        call headfile { input: infile=dupfile.outfile }
    }
    task dupfile {
        File infile
    
        command { cat ${infile} ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    task headfile2 {
        File infile
    
        command { head -15 ${infile} > outfile.txt  }
        output { File outfile = 'outfile.txt' }
    }
    
    workflow basic_reuse_tweaked {
        File infile
        call dupfile { input: infile=infile }
        call headfile2 { input: infile=dupfile.outfile }
    }
    dx api org-myorg update '{"policies":{"jobReuse":true}}'


    [Diagram: FIFO stdin/stdout status at each step of this tutorial. Step 1: the BAM FIFO has a stdin (YES) but no stdout (NO). Step 2: the BAM FIFO has both a stdin and a stdout (YES/YES), while the output-file FIFO has a stdin (YES) but no stdout (NO). Step 3: both the BAM FIFO and the output-file FIFO have a stdin and a stdout (YES/YES).]
    Step 1. Build an App

    Every DNAnexus app starts with 2 files:

    • dxapp.json: a file containing the app's metadata: its inputs and outputs, how the app is run, and execution requirements

    • a script that is executed in the cloud when the app is run

    Start by creating a file called dxapp.json with the following text:

    The example specifies the app name (coolapp), the interpreter (python3) to run the script, and the path (code.py) to the script created next. ("version":"0") refers to the Ubuntu 24.04 application execution environment version that supports the python3 interpreter.

    Next, create the script in a file called code.py with the following text:

    That's all you need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:

    Next, run the app and watch the output:

    That's it! You have made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.

    The app is available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.

    Step 2. Run BLAST

    Next, make the app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.

    In the cloud, your app runs on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json like this:

    Next, update code.py to run BLAST:

    Rebuild the app and test it on some real data. You can use demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is the same region as the Demo Data project.

    Rebuild the app with dx build -a, and run it like this:

    Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.

    Step 3. Provide an Input/Output Spec

    Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add the app to a workflow and connect its inputs and outputs to other apps, specify both input and output specifications. Update the dxapp.json as follows:

    Rebuild the app with dx build -a. Run it as before, and add the applet to a workflow by clicking "New Workflow" while viewing your project on the website, then click coolapp once to add it to the workflow. Inputs and outputs appear on the workflow stage and can be connected to other stages.

    If you run dx run coolapp with no input arguments from the command line, the command prompts for the input values for seq1 and seq2.

    Step 4. Configure App Settings

    Besides specifying input files, the I/O specification can also configure settings the app uses. For example, configure the E-value setting and other BLAST settings with this code and dxapp.json:

    Rebuild the app again and add it in the workflow builder. You should see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.

    Step 5. Use SDK Tools

    One of the utilities provided in the SDK is dx-app-wizard. This tool prompts you with a series of questions with which it creates the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Run dx-app-wizard to try it out.

    Learn More

    For additional information and examples of how to run jobs using the CLI, the reference guide section Working with Files Using dx run may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.


    Generate Batch File

    The project My Research Project contains the following files in the project's root directory:

    Batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, specify the following inputs:

    • reads_fastqgzs - FASTQ containing the left mates

    • reads2_fastqgzs - FASTQ containing the right mates

    • genomeindex_targz - BWA reference genome index

    The BWA reference genome index from the public Reference Genome (requires platform login) project is used for all runs. However, for the forward and reverse reads, the read pairs used vary from run to run. To generate a batch file that pairs the input reads:

    You can optionally provide a --path argument specifying a folder to search recursively within your project. Specifically, the value for --path must be a directory specified as:

    /path/to/directory or project-xxxx:/path/to/directory

    Any file present within this directory or recursively within any subdirectory of this directory is considered a candidate for a batch run.

    The (.*) are regular expression groups. You can provide arbitrary regular expressions as input. The first match in the group is the pattern used to group pairs in the batch. These matches are called batch identifiers (batch IDs). To explain this behavior in more detail, consider the output of the dx generate_batch_inputs command above:

    The dx generate_batch_inputs command creates the dx_batch.0000.tsv that looks like:

    Recall the regular expression was RP(.*)_R1_(.*).fastq.gz. Although there are two grouped matches in this example, only the first one is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz is 10B_S1 which corresponds to the first grouped match while the second one is ignored.

    Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus Platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.

    If an input for the app is an array, the input file IDs within the batch.tsv file need to be in square brackets to work. The following bash command adds brackets to the file IDs in column 4 and 5. You may need to change the variables in the command ($4 and $5) to match the correct columns in your file. The command's output file, "new.tsv", is ready for the dx run --batch-tsv command.

    The example above is for a case where all files have been paired properly. dx generate_batch_inputs creates a TSV for all files that can be successfully matched for a particular batch ID. Two classes of errors may occur for batch IDs that are not successfully matched:

    • A particular input is missing. This could occur when reads_fastqgzs has a pattern but no corresponding match can be found for reads2_fastqgzs.

    • More than one file ID matches the exact same name.

    For both of these cases, dx generate_batch_inputs returns a description of these errors to STDERR.

    When matching more than 500 files, multiple batch files are generated in groups of 500 to limit the number of jobs in a single batch run.

    Run a Batch Job

    With the batch file prepared, you can execute the BWA-MEM batch process:

    Here, genomeindex_targz is a parameter set at execution time that is common to all groups in the batch and --batch-tsv corresponds to the input file generated above.

    To monitor a batch job, use the 'Monitor' tab like you normally would for jobs you launch.

    Setting Output Folders for Batch Jobs

    To direct the output of each run into a separate folder, the --batch-folders flag can be used, for example:

    This command outputs the results for each sample in folders named after batch IDs, such as /10B_S1/, /10T_S5/, /15B_S4/, and /15T_S8/. If the folders do not exist, they are created.

    The output folders are created under a path defined with --destination, which by default is set to the current project and the "/" folder. For example, this command outputs the result files in /run_01/10B_S1/, /run_01/10T_S5/, and other sample-specific folders:

    Batching Multiple Inputs

    The dx generate_batch_inputs command works well for batch processing with file inputs, but it has limitations. If you need to vary other input types (like strings, numbers, or file arrays), or want to customize run properties like job names, a for loop provides more flexibility.

    Here's an example of using a loop to launch multiple jobs with different inputs:

    You can also use the dx run command to run a workflow by name and address its stages by stage ID. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:

    The \ character is needed to escape the : in the workflow name.

    Inputs to the workflow can be specified using dx run <workflow> --input <stage_id>.<input name>=<value>, where stage_id is a numeric ID starting at 0. More help can be found by running the commands dx run --help and dx run <workflow> --help.

    To batch multiple inputs then, do the following:

    Additional Resources

    For additional information and examples of how to run batch jobs, Chapter 6 of this reference guide may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.


    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.

    Download Inputs

    This applet downloads all inputs at once using dxpy.download_all_inputs:

    Split workload

    This tutorial processes data in parallel using the Python multiprocessing module with a straightforward pattern shown below:

    This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.

    The applet script includes helper functions to manage the workload. One helper is run_cmd, which manages subprocess calls:

    Before splitting the workload, determine what regions are present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:

    Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.

    The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs are parsed from the workers to determine whether the run failed or passed.

    View full source code on GitHub
    mkdir workspace
    mappings_fifo_path="workspace/${mappings_bam_name}"
    mkfifo "${mappings_fifo_path}" # FIFO file is created
    dx cat "${mappings_bam}" > "${mappings_fifo_path}" &
    input_pid="$!"
    mkdir -p ./out/counts_txt/
    
    counts_fifo_path="./out/counts_txt/${mappings_bam_prefix}_counts.txt"
    
    mkfifo "${counts_fifo_path}" # FIFO file is created, readcount.txt
    samtools view -c "${mappings_fifo_path}" > "${counts_fifo_path}" &
    process_pid="$!"
    mkdir -p ./out/counts_txt/
    
    counts_fifo_path="./out/counts_txt/${mappings_bam_prefix}_counts.txt"
    
    mkfifo "${counts_fifo_path}" # FIFO file is created, readcount.txt
    samtools view -c "${mappings_fifo_path}" > "${counts_fifo_path}" &
    process_pid="$!"
    wait -n  # "$input_pid"
    wait -n  # "$process_pid"
    wait -n  # "$upload_pid"
    ├── Applet dir
    │   ├── src
    │   ├── dxapp.json
    │   ├── resources
    │       ├── usr
    │           ├── bin
    │               ├── < samtools binary >
    /
    ├── usr
    │   ├── bin
    │       ├── < samtools binary >
    ├── home
    │   ├── dnanexus
    regions=$(samtools view -H "${mappings_sorted_bam_name}" \
      | grep "\@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')
    
    echo "Segmenting into regions"
    count_jobs=()
    counter=0
    temparray=()
    for r in $(echo $regions); do
      if [[ "${counter}" -ge 10 ]]; then
        echo "${temparray[@]}"
        count_jobs+=( \
          $(dx-jobutil-new-job \
          -ibam_file="${mappings_sorted_bam}" \
          -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
        temparray=()
        counter=0
      fi
      temparray+=("-iregions=${r}") # Here we add to an array of -i<parameter>'s
      counter=$((counter+1))
    done
    
    if [[ counter -gt 0 ]]; then # Previous loop misses last iteration if it's < 10
      echo "${temparray[@]}"
      count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
    fi
    echo "Merge count files, jobs:"
    echo "${count_jobs[@]}"
    readfiles=()
    for count_job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${count_job}:counts_txt")
    done
    echo "file name: ${sorted_bamfile_name}"
    echo "Set file, readfile variables:"
    echo "${readfiles[@]}"
    countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    count_func() {
      set -e -x -o pipefail
    
      echo "Value of bam_file: '${bam_file}'"
      echo "Value of bambai_file: '${bambai_file}'"
      echo "Regions being counted '${regions[@]}'"
    
      dx-download-all-inputs
    
      mkdir workspace
      cd workspace || exit
      mv "${bam_file_path}" .
      mv "${bambai_file_path}" .
      outputdir="./out/samtool/count"
      mkdir -p "${outputdir}"
      samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
    
      counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    }
    sum_reads() {
    
      set -e -x -o pipefail
      echo "$filename"
    
      echo "Value of read file array '${readfiles[@]}'"
      dx-download-all-inputs
      echo "Value of read file path array '${readfiles_path[@]}'"
    
      echo "Summing values in files"
      readsum=0
      for read_f in "${readfiles_path[@]}"; do
        temp=$(cat "$read_f")
        readsum=$((readsum + temp))
      done
    
      echo "Total reads: ${readsum}" > "${filename}_counts.txt"
    
      read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
      dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
    }
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    dxapp.json
    { "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py"
      }
    }
    code.py
    import dxpy
    
    @dxpy.entry_point('main')
    def main(**kwargs):
        print("Hello, DNAnexus!")
        return {}
    dx login
    dx build -a
    dx run coolapp --watch
    dxapp.json
    { "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      }
    }
    code.py
    import dxpy, subprocess
    
    @dxpy.entry_point('main')
    def main(seq1, seq2):
        dxpy.download_dxfile(seq1, "seq1.fasta")
        dxpy.download_dxfile(seq2, "seq2.fasta")
    
        subprocess.call("blastn -query seq1.fasta -subject seq2.fasta > report.txt", shell=True)
    
        report = dxpy.upload_local_file("report.txt")
        return {"blast_result": report}
    dx run coolapp \
      -i seq1="Demo Data:/Developer Quickstart/NC_000868.fasta" \
      -i seq2="Demo Data:/Developer Quickstart/NC_001422.fasta" \
      --watch
    dxapp.json
    {
      "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      },
      "inputSpec": [
        {"name": "seq1", "class": "file"},
        {"name": "seq2", "class": "file"}
      ],
      "outputSpec": [
        {"name": "blast_result", "class": "file"}
      ]
    }
    code.py
    import dxpy, subprocess
    
    @dxpy.entry_point('main')
    def main(seq1, seq2, evalue, blast_args):
        dxpy.download_dxfile(seq1, "seq1.fasta")
        dxpy.download_dxfile(seq2, "seq2.fasta")
    
        command = "blastn -query seq1.fasta -subject seq2.fasta -evalue {e} {args} > report.txt".format(e=evalue, args=blast_args)
        subprocess.call(command, shell=True)
    
        report = dxpy.upload_local_file("report.txt")
        return {"blast_result": report}
    dxapp.json
    {
      "name": "coolapp",
      "runSpec": {
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "interpreter": "python3",
        "file": "code.py",
        "execDepends": [ {"name": "ncbi-blast+"} ]
      },
      "inputSpec": [
        {"name": "seq1", "class": "file"},
        {"name": "seq2", "class": "file"},
        {"name": "evalue", "class": "float", "default": 0.01},
        {"name": "blast_args", "class": "string", "default": ""}
      ],
      "outputSpec": [
        {"name": "blast_result", "class": "file"}
      ]
    }
    # Extract list of reference regions from BAM header
    regions=$(
      samtools view -H "${mappings_sorted_bam_name}" | \
      grep "@SQ" | \
      sed 's/.*SN:\(\S*\)\s.*/\1/'
    )
    
    echo "Segmenting into regions"
    
    count_jobs=()
    counter=0
    temparray=()
    
    # Loop through each region
    for r in $(echo "$regions"); do
      if [[ "${counter}" -ge 10 ]]; then
        echo "${temparray[@]}"
        count_jobs+=($(
          dx-jobutil-new-job \
            -ibam_file="${mappings_sorted_bam}" \
            -ibambai_file="${mappings_sorted_bai}" \
            "${temparray[@]}" \
            count_func
        ))
        temparray=()
        counter=0
      fi
      # Add region to temp array of -i<parameter>s
      temparray+=("-iregions=${r}")
      counter=$((counter + 1))
    done
    
    # Handle remaining regions (less than 10)
    if [[ $counter -gt 0 ]]; then
      echo "${temparray[@]}"
      count_jobs+=($(
        dx-jobutil-new-job \
          -ibam_file="${mappings_sorted_bam}" \
          -ibambai_file="${mappings_sorted_bai}" \
          "${temparray[@]}" \
          count_func
      ))
    fi
    echo "Merge count files, jobs:"
    echo "${count_jobs[@]}"
    readfiles=()
    for count_job in "${count_jobs[@]}"; do
      readfiles+=("-ireadfiles=${count_job}:counts_txt")
    done
    echo "file name: ${sorted_bamfile_name}"
    echo "Set file, readfile variables:"
    echo "${readfiles[@]}"
    countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    count_func() {
    
      set -e -x -o pipefail
    
      echo "Value of bam_file: '${bam_file}'"
      echo "Value of bambai_file: '${bambai_file}'"
      echo "Regions being counted '${regions[@]}'"
    
    
      dx-download-all-inputs
    
    
      mkdir workspace
      cd workspace || exit
      mv "${bam_file_path}" .
      mv "${bambai_file_path}" .
      outputdir="./out/samtool/count"
      mkdir -p "${outputdir}"
      samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
    
    
      counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
      dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
    }
    sum_reads() {
    
      set -e -x -o pipefail
      echo "$filename"
    
      echo "Value of read file array '${readfiles[@]}'"
      dx-download-all-inputs
      echo "Value of read file path array '${readfiles_path[@]}'"
    
      echo "Summing values in files"
      readsum=0
      for read_f in "${readfiles_path[@]}"; do
        temp=$(cat "$read_f")
        readsum=$((readsum + temp))
      done
    
      echo "Total reads: ${readsum}" > "${filename}_counts.txt"
    
      read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
      dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
    }
    echo "Specifying output file"
    dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
    $ dx select "My Research Project"
    Selected project My Research Project
    $ dx ls /
    RP10B_S1_R1_001.fastq.gz
    RP10B_S1_R2_001.fastq.gz
    RP10T_S5_R1_001.fastq.gz
    RP10T_S5_R2_001.fastq.gz
    RP15B_S4_R1_002.fastq.gz
    RP15B_S4_R2_002.fastq.gz
    RP15T_S8_R1_002.fastq.gz
    RP15T_S8_R2_002.fastq.gz
    $ dx generate_batch_inputs \
        -i reads_fastqgzs='RP(.*)_R1_(.*).fastq.gz' \
        -i reads2_fastqgzs='RP(.*)_R2_(.*).fastq.gz'
    Found 4 valid batch IDs matching desired pattern.
    Created batch file dx_batch.0000.tsv
    
    CREATED 1 batch files each with at most 500 batch IDs.
    $ cat dx_batch.0000.tsv
    batch ID  reads_fastqgzs              reads2_fastqgzs              pair1 ID    pair2 ID
    10B_S1    RP10B_S1_R1_001.fastq.gz    RP10B_S1_R2_001.fastq.gz     file-aaa    file-bbb
    10T_S5    RP10T_S5_R1_001.fastq.gz    RP10T_S5_R2_001.fastq.gz     file-ccc    file-ddd
    15B_S4    RP15B_S4_R1_002.fastq.gz    RP15B_S4_R2_002.fastq.gz     file-eee    file-fff
    15T_S8    RP15T_S8_R1_002.fastq.gz    RP15T_S8_R2_002.fastq.gz     file-ggg    file-hhh
    head -n 1 dx_batch.0000.tsv > temp.tsv && \
    tail -n +2 dx_batch.0000.tsv | \
    awk '{sub($4, "[&]"); print}' | \
    awk '{sub($5, "[&]"); print}' >> temp.tsv && \
    tr -d '\r' < temp.tsv > new.tsv && \
    rm temp.tsv
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="Reference Genome Files":\
    "/H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.bwa-index.tar.gz" \
      --batch-tsv dx_batch.0000.tsv
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
    file-BFBy4G805pXZKqV1ZVGQ0FG8" \
      --batch-tsv dx_batch.0000.tsv \
      --batch-folders
    dx run bwa_mem_fastq_read_mapper \
      -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
    file-BFBy4G805pXZKqV1ZVGQ0FG8" \
      --batch-tsv dx_batch.0000.tsv \
      --batch-folders \
      --destination=My_project:/run_01
    for i in 1 2; do
        dx run swiss-army-knife -icmd="wc *>${i}.out" -iin="fileinput_batch${i}a" -iin="file_input_batch${i}b" --name "sak_batch${i}"
    done
    dx login
    dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am"
    dx cd /path/to/inputs
    for i in $(dx ls); do
        dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am" --input 0.reads="$i"
    done
    {
      "runSpec": {
        ...
        "execDepends": [
          {"name": "samtools"}
        ]
      }
    inputs = dxpy.download_all_inputs()
    # download_all_inputs returns a dictionary that maps each input to its file location(s).
    # Additionally, helper key-value pairs are added to the dictionary, similar to the bash helper variables
    inputs
    #     mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
    #     mappings_sorted_bam_name: u'SRR504516.bam'
    #     mappings_sorted_bam_prefix: u'SRR504516'
    #     mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
    #     mappings_sorted_bai_name: u'SRR504516.bam.bai'
    #     mappings_sorted_bai_prefix: u'SRR504516'
    # Get cpu count from multiprocessing
    print("Number of cpus: {0}".format(cpu_count()))
    
    # Create a pool of workers, 1 for each core
    worker_pool = Pool(processes=cpu_count())
    
    # Map run_cmds to a collection
    # Pool.map handles orchestrating the job
    results = worker_pool.map(run_cmd, collection)
    
    # Make sure to close and join workers when done
    worker_pool.close()
    worker_pool.join()
    def run_cmd(cmd_arr):
        """Run shell command.
        Helper function to simplify the pool.map() call in our parallelization.
        Raises OSError if command specified (index 0 in cmd_arr) isn't valid
        """
        proc = subprocess.Popen(
            cmd_arr,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        exit_code = proc.returncode
        proc_tuple = (stdout, stderr, exit_code)
        return proc_tuple
    def parse_sam_header_for_region(bamfile_path):
        """Helper function to match SN regions contained in SAM header
    
        Returns:
            regions (list[string]): list of regions in bam header
        """
        header_cmd = ['samtools', 'view', '-H', bamfile_path]
        print('parsing SAM headers:', " ".join(header_cmd))
        headers_str = subprocess.check_output(header_cmd).decode("utf-8")
        rgx = re.compile(r'SN:(\S+)\s')
        regions = rgx.findall(headers_str)
        return regions
    # Write results to file
    resultfn = inputs['mappings_sorted_bam_name'][0]
    resultfn = (
        resultfn[:-4] + '_count.txt'
        if resultfn.endswith(".bam")
        else resultfn + '_count.txt')
    with open(resultfn, 'w') as f:
        sum_reads = 0
        for res, reg in zip(results, regions):
            read_count = int(res[0])
            sum_reads += read_count
            f.write("Region {0}: {1}\n".format(reg, read_count))
        f.write("Total reads: {0}".format(sum_reads))
    
    count_file = dxpy.upload_local_file(resultfn)
    output = {}
    output["count_file"] = dxpy.dxlink(count_file)
    return output
    def verify_pool_status(proc_tuples):
        """
        Helper to verify worker succeeded.
    
        As failed commands are detected, the `stderr` from that command is written
        to the job_error.json file. This file is printed to the Platform
        job log on App failure.
        """
        all_succeed = True
        err_msgs = []
        for proc in proc_tuples:
            if proc[2] != 0:
                all_succeed = False
                err_msgs.append(proc[1])
        if err_msgs:
            raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
    The DNAnexus Platform recognizes three main types of paths for referring to data objects: project paths, job-based object references (JBORs), and DNAnexus links.

    Project Paths

    To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" is interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:
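    For example, either of the following forms (shown here as an illustration) refers to that file:

    Genomes:human/hg19.fq.gz
    Genomes:/human/hg19.fq.gz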

    The folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.

    Exceptions to this are when commands take in arbitrary names. This applies to commands like dx describe which accepts app names, user IDs, and other identifiers. In this case, all possible interpretations are attempted. However, it is always assumed that it is not a project name unless it ends in ":".

    Job-Based Object References (JBORs)

    To refer to the output of a particular job, you can use the syntax <job id>:<output name>.

    Examples

    If you have the job ID handy, you can use it directly.

    Or if you know it's the last analysis you ran:

    You can also automatically download a file once the job producing it is done:

    If the output is an array, you can extract a single element by specifying its array index (starting from 0) as follows:
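    As an illustration of the syntax (the job ID, output name, and applet name below are placeholders, not values from this documentation):

    # Refer to the output named "mappings" of a particular job:
    job-B2Jjqk80Q3vQFFQ66Gq00001:mappings

    # A JBOR can be supplied wherever a data object is expected, for example as an input to dx run:
    dx run my_applet -imappings=job-B2Jjqk80Q3vQFFQ66Gq00001:mappings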

    DNAnexus Links

    DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link, and have as a value either

    • a string representing a data object ID

    • another hash with two keys:

      • project a string representing a project or other data container ID

      • id a string representing a data object ID

    For example:
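    For example (the file and project IDs below are placeholders):

    {"$dnanexus_link": "file-B2Jjqk80Q3vQFFQ66Gq00002"}

    {"$dnanexus_link": {"project": "project-B2Jjqk80Q3vQFFQ66Gq00003", "id": "file-B2Jjqk80Q3vQFFQ66Gq00002"}}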

    Special Characters

    When naming data objects, certain characters require special handling because they have specific meanings in the DNAnexus Platform:

    • The colon (:) identifies project names

    • The forward slash (/) separates folder names

    • Asterisks (*) and question marks (?) are used for wildcard matching

    To use these characters in object names, you must escape them with backslashes. Spaces may also need escaping depending on your shell environment and whether you use quotes.

    For the best experience, we recommend avoiding special characters in names when possible. If you need to work with objects that have special characters, using their object IDs directly is often simpler.

    The table below shows how to escape special characters when accessing objects with these characters in their names:

    Character          Without Quotes     With Quotes

    (single space)                        ' '

    :                  \\\\:              '\\:'

    /                  \\\\/              '\/'

    *                  \\\\\\\\*

    The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.

    For commands where the argument supplied involves naming or renaming something, the only escaping necessary is whatever is necessary for your shell or for setting it apart from a project or folder path.

    Name Conflicts

    It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you are prompted to select the desired data object.

    Some commands (like mv here) allow you to enter * so that all matches are used. Other commands may automatically apply the command to all matches. This includes commands like ls and describe. Some commands require that exactly one object be chosen, such as the run command.

    The subjob or child job of stage 1's origin job shares the same temporary workspace as its parent job. API calls to run a new applet or app using /applet-xxxx/run or /app-xxxx/run launch a master job that has its own separate workspace, and (by default) no visibility into its parent job's workspace.

    Job States

    Successful Jobs

    Every successful job goes through at least the following four states:

    1. idle: the initial state of every new job, regardless of what API call was made to create it.

    2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.

    3. running: the job has been assigned to and is being run on a worker in the cloud.

    4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state; no job transitions to a different state after reaching done.

    Jobs may also pass through the following transitional states as part of more complicated execution patterns:

    • waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:

      • it has an unresolved job-based object reference in its input

      • it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state

      • it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job). Linked hidden objects are implicitly included in this list

    • waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:

      • it has a descendant job that has not been moved to the done state

      • it has an unresolved job-based object reference in its output

    Unsuccessful Jobs

    Two terminal job states exist other than the done state: terminated and failed. A job can enter either of these states from any other state except another terminal state.

    Terminated Jobs

    The terminated state occurs when a user requests termination of the job (or another job sharing the same origin job). For all terminated jobs, the failureReason in their describe hash contains "Terminated", and the failureMessage indicates the user responsible for termination. Only the user who launched the job or administrators of the job's project context can terminate the job.

    Failed Jobs

    Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job not in the same job tree has a job-based object reference or otherwise depends on a failed job, then it also fails. For more information about errors that jobs can encounter, see the Error Information page.

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with JobTimeoutExceeded error.

    Restartable Jobs

    Jobs can automatically restart when they encounter specific types of failures. You configure which failure types trigger restarts in the executionPolicy of an app, applet, or workflow. Common restartable failure types include:

    • UnresponsiveWorker

    • ExecutionError

    • AppInternalError

    • JobTimeoutExceeded

    • SpotInstanceInterruption

    How job restarts work

    When a job fails for a restartable reason, the system determines where to restart based on the restartableEntryPoints configuration:

    • master setting (default): The failure propagates to the nearest master job, which then restarts

    • all setting: The job restarts itself directly

    The system restarts a job up to the maximum number of times specified in the executionPolicy. Once this limit is reached, the entire job tree fails.
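    As a rough sketch only (the exact schema and placement are assumptions, not taken from this page), an applet might request restarts for specific failure types through an executionPolicy in its dxapp.json runSpec, mapping each restartable failure type to a maximum number of restarts:

    {
      "runSpec": {
        "interpreter": "bash",
        "file": "src/code.sh",
        "executionPolicy": {
          "restartOn": {
            "UnresponsiveWorker": 2,
            "ExecutionError": 1,
            "SpotInstanceInterruption": 2
          }
        }
      }
    }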

    During the restart process, jobs transition through specific states:

    • restartable: The job is ready to be restarted

    • restarted: The job attempt was restarted (a new attempt begins)

    Job try tracking

    For jobs in root executions launched after July 12, 2023 00:13 UTC, the platform tracks restart attempts using a try integer attribute:

    • First attempt: try = 0

    • Second attempt (first restart): try = 1

    • Third attempt (second restart): try = 2

    Multiple API methods support job try operations and include try information in their responses:

    • /job-xxxx/describe

    • /job-xxxx/addTags

    • /job-xxxx/removeTags

    • /job-xxxx/setProperties

    • /system/findExecutions

    • /system/findJobs

    • /system/findAnalyses

    When you provide a job ID without specifying a try argument, these methods automatically refer to the most recent attempt for that job.
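    For example, a hedged sketch of inspecting a specific attempt through the raw API (the job ID is a placeholder, and the try input field follows the description above):

    dx api job-B2Jjqk80Q3vQFFQ66Gq00001 describe '{"try": 1}'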

    Additional States

    For unsuccessful jobs, additional states exist between the running state and the terminal state of terminated or failed. Unsuccessful jobs starting in other non-terminal states transition directly to the appropriate terminal state.

    • terminating: the transitional state when the cloud worker begins terminating the job and tearing down the execution environment. The job moves to its terminal state after the worker reports successful termination or becomes unresponsive.

    • debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.

    Analysis States

    All analyses start in the state in_progress, and, like jobs, reach one of the terminal states done, failed, or terminated. The following diagram shows the state transitions for successful analyses.

    If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:

    • partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages are also allowed to finish successfully if they can.

    • terminating: an analysis may enter this state either via an API call where a user has terminated the analysis, or there is some failure condition under which the analysis is terminating any remaining stages. This may happen if the executionPolicy for the analysis (or a stage of an analysis) had the onNonRestartableFailure value set to "failAllStages".

    Billing

    Compute and data storage costs for jobs that fail due to user error are charged to the project running those jobs. This includes errors such as InputError and OutputError. The same applies to terminated jobs. For DNAnexus Platform internal errors, these costs are not billed.

    The costs for each stage in an analysis are determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed, and the second is not.

    Download Files from the Project to the Local Execution Environment

    Bash

    You can download input data from a project using dx download in a notebook cell:

    The %%bash keyword converts the whole cell to a magic cell which allows you to run bash code in that cell without exiting the Python kernel. See examples of magic commands in the IPython documentation. The ! prefix achieves the same result:

    Alternatively, the dx command can be executed from the terminal.

    Python

    To download data with Python in the notebook, you can use the download_dxfile function:

    Check the dxpy helper functions for details on how to download files and folders.

    Upload Data from the Session to the Project

    Bash

    Any files from the execution environment can be uploaded to the project using dx upload:

    Python

    To upload data using Python in the notebook, you can use the upload_local_file function:

    Check the dxpy helper functions for details on how to upload files and folders.

    Download and Upload Data to Your Local Machine

    By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer).

    You may upload and download data to the local execution environment in a similar way, that is, by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download.

    Use the Terminal

It is useful to have the terminal provided by JupyterLab at hand; it uses the bash shell by default and lets you execute shell scripts or interact with the Platform via the dx toolkit. For example, the following command confirms what the current project context is:

    Running pwd shows you that the working directory of the execution environment is /opt/notebooks. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.

    To open a terminal window, go to File > New > Terminal or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File > New Launcher.

    Install Custom Packages in the Session Environment

    You can install pip, conda, apt-get, and other packages in the execution environment from the notebook:

    By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.
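For example, assuming you have saved a snapshot file from a previous session, you can provide it to a new session as the snapshot input; file-xxxx is a placeholder for the snapshot file ID:

dx run dxjupyterlab -isnapshot=file-xxxx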

    Access Public and Private GitHub Repositories from the JupyterLab Terminal

You can access public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private ssh key that's registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories using git clone and push any changes back to GitHub using git push from the JupyterLab terminal.
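A minimal sketch of this setup, assuming the private key has been stored in the project as a file named github_key and that my-org/private-repo is a placeholder repository:

mkdir -p /root/.ssh
dx download github_key -o /root/.ssh/id_rsa
chmod 600 /root/.ssh/id_rsa
ssh-keyscan github.com >> /root/.ssh/known_hosts
git clone git@github.com:my-org/private-repo.git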

    Below is a screenshot of a JupyterLab session with a terminal displaying a script that:

    • sets up ssh key to access a private GitHub repository and clones it,

    • clones a public repository,

    • downloads a JSON file from the DNAnexus project,

    • modifies an open-source notebook to convert the JSON file to CSV format,

    • saves the modified notebook to the private GitHub repository,

    • and uploads the results of JSON to CSV conversion back to the DNAnexus project.

    This animation shows the first part of the script in action:

    Run Notebooks Non-Interactively

    A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files using the in input file array. The command runs in the directory where the JupyterLab server is started and notebooks are run, that is, /opt/notebooks/. Any output files generated in this directory are uploaded to the project and returned in the out output.

The cmd input makes it possible to use the papermill tool, pre-installed in the JupyterLab environment, to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

where notebook.ipynb is the input notebook passed to papermill (it also needs to be provided via the in input), and output_notebook.ipynb is the name of the output notebook that stores the results of the cell execution. The output is uploaded to the project at the end of the app execution.

If the snapshot parameter is specified, execution of cmd takes place in the specified Docker container. The duration argument is ignored when running the app with cmd. The app can be run from the command line with the --extra-args flag to limit the runtime, for example, dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.

    If cmd is not specified, the in parameter is ignored and the output of an app consists of an empty array.

    Use newer NVIDIA GPU-accelerated software

If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA kernel-mode driver (nvidia.ko) installed outside of the DXJupyterLab environment does not support the newer CUDA version required by your application. In that case, you can install the NVIDIA Forward Compatibility packages to use the newer CUDA version by following the steps below in a DXJupyterLab terminal.

    Session Inactivity

    After 15 to 30 minutes of inactivity in the JupyterLab browser tabs, the system logs you out automatically from the JupyterLab session and displays a "Server Connection Error" message. To re-enter the JupyterLab session, reload the JupyterLab webpage and log into the platform to be redirected to the JupyterLab session.

    Contact DNAnexus Sales
    The DNAnexus Thrift server connects to a high availability Apache Spark cluster integrated with the platform. It leverages the same security, permissions, and sharing features built into DNAnexus.

    Connecting to Thrift Server

    Prerequisites:

    1. The JDBC URL:

    2. The username in the format TOKEN__PROJECTID, where:

      • TOKEN is a DNAnexus user-generated token, separated by a double underscore (__) from the project ID.

      • PROJECTID is a DNAnexus project ID used as the project context (when creating databases).

      • The Thrift server and the project must be in the same region.

    Generate a DNAnexus Platform Authentication Token

    See the Authentication tokens page.

    Getting the Project ID

1. Navigate to https://platform.dnanexus.com and log in using your username and password.

2. In Projects > your project > Settings, find Project ID and click Copy to Clipboard.

    Using Beeline

    Beeline is a JDBC client bundled with Apache Spark that can be used to run interactive queries on the command line.

    Installing Apache Spark

You can download Apache Spark 3.5.2 for Hadoop 3.x from the Apache Spark downloads page.

    You need to have Java installed in your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

    Single Command Connection

    If you already have beeline installed and all the credentials, you can quickly connect with the following command:

In the following AWS example, you must escape semicolons (;) with a backslash (\):

    The command for connecting to Thrift on Azure has a different format:

    Running Beeline Guided Connection

    The beeline client is located under $SPARK_HOME/bin/.

    Connect to beeline using the JDBC URL:

    Once successfully connected, you should see the message:

    Querying in Beeline

    After connecting to the Thrift server using your credentials, you can view all databases you have access to within your current region.

You can query using the unique database name, which includes the lowercase database ID (for example, database_fjf3y28066y5jxj2b0gz4g85__metabric_data). If the database was created under the same username and project used to connect to the Thrift server, you can use just the database name (for example, metabric_data). For databases outside the project, use the unique database name.

    Databases stored in other projects can be found by specifying the project context in the LIKE option of SHOW DATABASES, using the format '<project-id>:<database pattern>' as shown below:

    After connecting, you can run SQL queries.

    Contact DNAnexus Sales

    Distributed by Region (py)

    This applet creates a count of reads from a BAM format file.

    View full source code on GitHub

    How is the SAMtools dependency provided?

    The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

    For additional information, refer to the execDepends documentation.

    Entry Points

    Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:

    • main

    • samtoolscount_bam

    • combine_files

Each entry point is executed on a new worker with its own system requirements. In this example, the files are split and merged on basic mem1_ssd1_x2 instances and the more intensive processing step is performed on a mem1_ssd1_x4 instance. Instance types can be set in the dxapp.json runSpec.systemRequirements:

    main

    The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.

Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function.

    Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.

    samtoolscount_bam

    This entry point downloads and creates a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.

This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed, which exaggerates the pattern for tutorial purposes; you can also pass types other than file, such as int.

    combine_files

    The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.

Important: While the main entry point triggers the processing and gathering entry points, remember that the main entry point doesn't do any heavy lifting or processing. Notice in the runSpec JSON above that the process starts with a lightweight instance, scales up for the processing entry point, then finally scales down for the gathering step.

    Project Navigation

You can use dx as a prefix for navigating the data objects on the DNAnexus Platform. By adding dx in front of commonly used bash commands, you can manage objects on the Platform directly from the command line. Common commands include dx ls, dx cd, dx mv, and dx cp, which let you list objects, change folders, move data objects, and copy objects.

    Listing Objects

    Listing Objects in Your Current Project

By default, when you set your current project, you are placed in the root folder / of the project. You can list the objects and folders in your current folder with dx ls.

    Listing Object Details

To see more details, you can run the command with the -l option: dx ls -l.

As in bash, you can list the contents of a path.

    Listing Objects in a Different Project

You can also list the contents of a different project. To specify a path that points to a different project, start with the project ID, followed by a colon (:), then the path within the project, where / is the project's root folder.

Enclose the path in quotes (" ") so dx interprets the spaces as part of the folder name, not as argument separators.

    Listing Objects That Match a Pattern

You can also list only the objects that match a pattern. Here, a * is used as a wildcard to match all objects whose names contain .fasta. This returns only a subset of the objects returned by the original query. Again, the path is enclosed in " " so that dx correctly interprets the asterisk and the spaces in the path.

    Switching Contexts

    Changing Folders

To find out your present folder location, use the dx pwd command. You can switch contexts to a subfolder in a project using dx cd.

    Moving or Renaming Data Objects

You can move and rename data objects and folders using the dx mv command.

To rename an object or a folder, "move" it to a new name in the same folder. Here, a file named ce10.fasta.gz is renamed to C.elegans10.fasta.gz.

    If you want to move the renamed file into a folder, specify the path to the folder as the destination of the move command (dx mv).

    Copying Objects or Folders to Another Project

You can copy data objects or folders to another project by running the command dx cp. The following example shows how to copy a human reference genome FASTA file (hs37d5.fa.gz) from a public project, "Reference Genome Files", to a project, "Scratch Project", to which the user has ADMINISTER permission.

    You can also copy folders between projects by running dx cp folder_name destination_path. Folders are automatically copied recursively.
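For example, to copy the C. elegans reference folder from the public project shown in the listing examples on this page into the destination project (both project IDs are the same placeholders used elsewhere on this page):

dx cp "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10" project-9z94ZPZvbJ3qP0pyK1P0000p:/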

The Platform prevents copying a data object within the same project, since each specific data object exists only once in a project. The system also prohibits copying any data object between projects that are located in different cloud regions through dx cp.

    Changing Your Current Project

    Changing to Another Project With a Project Prompt List

You can change to another project you want to work in by running the command dx select. This brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".

    Changing to a Public Project

To view and select from all public projects (projects available to all DNAnexus users), you can run the command dx select --public:

    Changing to a Project With VIEW Permission

By default, dx select lists projects to which you have at least CONTRIBUTE permission. To switch to a project in which you only have VIEW permission, run dx select --level VIEW to list all the projects to which you have at least VIEW permission.

    Changing Directly to a Specific Project

If you know the project ID or name, you can also give it directly to switch to the project, as dx select [project-ID | project-name]:

    Stata in DXJupyterLab

Using Stata via DXJupyterLab, working with project files, and creating datasets with Spark.

    Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel, in Jupyter notebooks.

    Before You Begin

    Project License Requirement

On the DNAnexus Platform, use the DXJupyterLab app to create and edit Jupyter notebooks.

You can only run this app within a project that's billed to an account with a license that allows the use of both DXJupyterLab and HTTPS apps. Contact DNAnexus Sales if you need to upgrade your license.

DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment. A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Stata License Requirement

To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details in a plain text file with the extension .json, according to the instructions below, then upload this file to the project's root directory. You only need to do this once per project.

    Creating a Stata License Details File

Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username, and <organization> is the org of which you're a member:

    Save the file according to the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json

    Some operating systems may not support the naming of files with a "." as the first character. If this is the case, you can rename the .json file after uploading it to your project by hovering over the name of your file and clicking the pencil icon that appears.

    Uploading the Stata License Details File

    Open the project in which you want to use Stata. Upload the Stata license details file to the project's root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.
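Alternatively, you can upload the file from the command line with dx upload; a sketch, where jsmith is a placeholder username:

dx upload .stataSettings.user-jsmith.json --destination /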

    Secure Indirect Format Option for Shared Projects

    When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.

    Create a private project. Then create and save a Stata license details file in that project's root directory, per the instructions above.

    Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the name of the private project, and file-xxxx is the license details file ID, in that private project:

When working on the Research Analysis Platform, you can only create a private credentials project from the Research Analysis Platform Projects page.

    Launching DXJupyterLab

    1. Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.

    2. Select the app DXJupyterLab with Python, R, Stata, ML.

3. Click the Run Selected button. If you haven't run this app before, you are prompted to install it. Next, you are taken to the Run Analysis screen.

4. On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.

  • Add your Stata settings file as an input. This is the .json file you created, containing your Stata license details.

  • In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.

  • Click the Start Analysis button at the top right corner of the screen. This launches the DXJupyterLab app, and takes you to the project's Monitor tab, where you can monitor the app's status as it loads.

The app can take some time to load and start running.

    Once the analysis starts, you see the notification "Running" appear under the name of the app.

    Opening JupyterLab

    Click the Monitor tab heading. This opens a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.

    If you do not see the worker URL, click on the name of the job in the Monitor page.

    Using Stata Within JupyterLab

    Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.

    Open a new Stata notebook by clicking the Stata tile in the Notebooks section.

    Working with Project Files

You can download DNAnexus data files to the DXJupyterLab container from a Stata notebook with:

Data files in the current project can also be accessed via the /mnt/project folder from a Stata notebook. To load a DTA file:

    To load a CSV file:

    To write a DTA file to the DXJupyterLab container:

    To write a CSV file to the DXJupyterLab container:

    To upload a data file from the DXJupyterLab container to the project, use the following command in a Stata notebook:

    Alternatively, open a new Launcher tab, open Terminal, and run:

    The /mnt/project directory is read-only, so trying to write to it results in an error.

    Creating a Stata Dataset with Spark

The DXJupyterLab Spark Cluster app can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:

The pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:

To upload a data file from the JupyterLab container to the DNAnexus project in the DXJupyterLab spark cluster app, use:

    Once saved to the project, data files can be used in a DXJupyterLab Stata session using the instructions above.

    Visualizing Data

    The DNAnexus Platform offers multiple different methods for viewing your files and data.

    Previewing Files

    DNAnexus allows users to preview and open the following file types directly on the platform:

    • TXT

    • PNG

    • PDF

    • HTML

    To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, the "Preview" and "Open in New Tab" options appear in the toolbar above.

    Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.

    "Preview" opens a fixed-sized box in your current tab to preview the file of interest. "Open in New Tab" enables viewing the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may produce different results.

    The file type is not necessarily determined by the file extension. For example, you can preview a FASTA file reads.fa, even though the file extension is not .txt. However, you cannot preview a BAM file (a binary file) using the Preview option.

    Preview Restrictions

File preview and viewer functionality are subject to project access controls. When a project has the previewViewerRestricted flag enabled, preview and viewer capabilities are disabled for all project members. This flag is automatically set to true when downloadRestricted is enabled on a project (for both new projects and when updating existing projects), though project admins can override this behavior by explicitly providing the previewViewerRestricted flag.
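For example, a project admin could keep downloads restricted while explicitly re-enabling preview and viewers by providing the flag in a project update call; a minimal sketch, where project-xxxx is a placeholder:

dx api project-xxxx update '{"downloadRestricted": true, "previewViewerRestricted": false}'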

    Using File Viewers

    For files not listed in the section above, the DNAnexus Platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.

A Viewer is an HTML file to which you can pass one or more DNAnexus URLs representing the files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.

    The data you select to be viewed is accessible by the Viewer, which can also access the Internet. You should only run Viewers from trusted sources.

    Launching a Viewer

    You can launch a viewer by clicking on the Visualize tab within a project.

    This tab opens a window displaying all Viewers available to you within your project. Any Viewers you've created and saved within your current project appear in this list along with the DNAnexus-provided Viewers.

    Clicking on a Viewer opens a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any other of your data.) From there, you can either create a Viewer Shortcut or launch the Viewer.

    Example Viewers

    Human Genome Browsers (BioDalliance, IGV.js)

    The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either one of these viewers, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi for each variant track you want to add. Also, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.

For more information about BioDalliance, consult BioDalliance's Getting Started guide. For IGV.js, see https://igv.org/.

    BAM Header Viewer

    The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used). When launching this viewer, tick one or more BAM files (*.bam).

    Jupyter Notebook Viewer

    The Jupyter notebook viewer displays *.ipynb notebook files, showing notebook images, highlighted code blocks and rendered markdown blocks as shown below.

    Gzipped File Viewers

This viewer allows you to decompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).

    Troubleshooting Viewers

    If a viewer fails to load, try temporarily disabling browser extensions such as AdBlock and Privacy Badger. Also, viewers are not supported in Incognito browser windows.

    Custom Viewers

Developers comfortable with HTML and JavaScript can create custom viewers to visualize data on the platform.

    Viewer Shortcuts

Viewer Shortcuts are objects that, when opened, display a data selector for choosing inputs and launching a specified Viewer. The Viewer Shortcut includes a Viewer and an array of inputs that are selected by default.

    The Viewer Shortcut appears in your project as an object of type "Viewer Shortcut." You can modify the name of the Viewer Shortcut and move it within your folders and projects like any other object in the DNAnexus Platform.

    Apps and Workflows

    Every analysis in DNAnexus is run using apps. Apps can be linked together to create workflows. Learn the basics of using both.

    You must set up billing for your account before you can perform an analysis, or upload or egress data.

    Finding the Right App or Workflow

    User Interface Quickstart

    Learn to create a project, add members and data to the project, and run a simple workflow.

    You must set up billing for your account before you can perform an analysis, or upload or egress data.

    Step 1. Create Your First Project

On the DNAnexus Platform, all data is stored within projects. Before you upload, browse, or analyze any data, you must create a project to house that data.

    Genomes:human/hg19.fa.gz
    dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads
    dx describe $(dx find jobs -n 1 --brief):reads
    dx download $(dx run some_exporter_app -iinput=my_input -y --brief --wait):file_output
    dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads.0
    $ dx ls '{"$dnanexus_link": "file-B2VBGXyK8yjzxF5Y8j40001Y"}'
    file-name
    $ dx ls Project\ Mouse:
    name: with/special*characters?
    $ dx cd Project\ Mouse:
    $ dx describe name\\\\:\ with\\\\/special\\\\\\\\*characters\\\\\\\\?
    ID              file-9zz0xKJkf6V4yzQjgx2Q006Y
    Class           file
    Project         project-9zb014Jkf6V33pgy75j0000G
    Folder          /
    Name            name: with/special*characters?
    State           closed
    Hidden          visible
    Types           -
    Properties      -
    Tags            -
    Outgoing links  -
    Created         Wed Jul 11 16:39:37 2012
    Created by      alice
    Last modified   Sat Jul 21 14:19:55 2012
    Media type      text/plain
    Size (bytes)    4
    $ dx describe "name\: with\/special\\\\\\*characters\\\\\\?"
    ...
    $ dx new record -o "must\: escape\/everything\*once\?at creation"
    ID              record-B13BBVK4Zg29fvVv08q00005
    ...
    Name            must: escape/everything*once? at creation
    ...
    $ dx rename record-B13BBVK4Zg29fvVv08q00005 "no:escaping/necessary*even?wildcards"
    $ dx ls
    sample : file-9zbpq72y8x6F0xPzKZB00003
    sample : file-9zbjZf2y8x61GP1199j00085
    $ dx mv sample mouse_sample
    The given path "sample" resolves to the following data objects:
    0) closed  2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
    1) closed  2012-06-27 15:34:00 sample (file-9zbjZf2y8x61GP1199j00085)
    
    Pick a numbered choice or "*" for all: 1
    $ dx ls -l
    closed  2012-06-27 15:34:00 mouse_sample (file-9zbjZf2y8x61GP1199j00085)
    closed  2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
    %%bash
    dx download input_data/reads.fastq
    ! dx download input_data/reads.fastq
    import dxpy
    dxpy.download_dxfile(dxid='file-xxxx',
                         filename='unique_name.txt')
    %%bash
    dx upload Readme.ipynb
    import dxpy
    dxpy.upload_local_file('variants.vcf')
    $ dx pwd
    MyProject:/
    %%bash
    pip install torch
    pip install torchvision
    conda install -c conda-forge opencv
    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
    # NVIDIA-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    // Let's upgrade CUDA 11.4 to 12.5
    # apt-get update
    # apt-get -y install cuda-toolkit-12-5 cuda-compat-12-5
    # echo /usr/local/cuda/compat > /etc/ld.so.conf.d/NVIDIA-compat.conf
    # ldconfig
    # NVIDIA-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 12.5     |
    |-------------------------------+----------------------+----------------------+
    // CUDA 12.5 is now usable from terminal and notebooks
     AWS US (East): jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true
     AWS London (UKB): jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/;ssl=true
     Azure US (West): jdbc:hive2://query.westus.apollo.dnanexus.com:10001/;ssl=true;transportMode=http;httpPath=cliservice
     AWS Frankfurt (General): jdbc:hive2://query.eu-central-1.apollo.dnanexus.com:10000/;ssl=true
    tar -zxvf spark-3.5.2-bin-hadoop3.tgz
    <beeline> -u <thrift path> -n <token>__<project-id>
    $SPARK_HOME/bin/beeline -u jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/\;ssl=true -n yourToken__project-xxxx
    $SPARK_HOME/bin/beeline -u jdbc:hive2://query.westus.apollo.dnanexus.com:10001/\;ssl=true\;transportMode=http\;httpPath=cliservice -n yourToken__project-xxxx
    cd spark-3.5.2-bin-hadoop3/bin
    ./beeline
    $ beeline> !connect jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true
    
    Enter username: <TOKEN__PROJECTID>
    Enter password: <empty - press RETURN>
    Connected to: Spark SQL (version 3.5.2)
    Driver: Hive JDBC (version 2.3.9)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://query.us-east-1.apollo.dnanex> show databases;
    +---------------------------------------------------------+--+
    |                      databaseName                       |
    +---------------------------------------------------------+--+
    | database_fj7q18009xxzzzx0gjfk6vfz__genomics_180718_01   |
    | database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01  |
    | database_fj96qx00v10vj50j0gyfv00z__af_result2           |
    | database_fjf3y28066y5jxj2b0gz4g85__metabric_data        |
    | database_fjj1jkj0v10p8pvx78vkkpz3__pchr1_test           |
    | database_fjpz6fj0v10fjy3fjy282ybz__af_result1           |
    +---------------------------------------------------------+--+
    0: jdbc:hive2://query.us-east-1.apollo.dnanex> use metabric_data;
    0: jdbc:hive2://query.us-east-1.apollo.dnanex> SHOW DATABASES LIKE 'project-xxx:af*';
    +---------------------------------------------------------+--+
    |                      databaseName                       |
    +---------------------------------------------------------+--+
    | database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01  |
    | database_fj96qx00v10vj50j0gyfv00z__af_result2           |
    | database_fjpz6fj0v10fjy3fjy282ybz__af_result1           |
    +---------------------------------------------------------+--+
    0: jdbc:hive2://query.us-east-1.apollo.dnanex> select * from cna limit 10;
    +--------------+-----------------+------------+--------+--+
    | hugo_symbol  | entrez_gene_id  | sample_id  | value  |
    +--------------+-----------------+------------+--------+--+
    | MIR3675      | NULL            | MB-6179    | -1     |
    | MIR3675      | NULL            | MB-6181    | 0      |
    | MIR3675      | NULL            | MB-6182    | 0      |
    | MIR3675      | NULL            | MB-6183    | 0      |
    | MIR3675      | NULL            | MB-6184    | 0      |
    | MIR3675      | NULL            | MB-6185    | -1     |
    | MIR3675      | NULL            | MB-6187    | 0      |
    | MIR3675      | NULL            | MB-6188    | 0      |
    | MIR3675      | NULL            | MB-6189    | 0      |
    | MIR3675      | NULL            | MB-6190    | 0      |
    +--------------+-----------------+------------+--------+--+
    "runSpec": {
      ...
      "execDepends": [
        {"name": "samtools"}
      ]
}

    '\\\\\\*'

    ?

    \\\\\\\\?

    '\\\\\\?'



    "runSpec": {
      ...
      "systemRequirements": {
        "main": {
          "instanceType": "mem1_ssd1_x2"
        },
        "samtoolscount_bam": {
          "instanceType": "mem1_ssd1_x4"
        },
        "combine_files": {
          "instanceType": "mem1_ssd1_x2"
        }
      },
      ...
    }
    regions = parseSAM_header_for_region(filename)
    split_regions = [regions[i:i + region_size]
                      for i in range(0, len(regions), region_size)]
    
    if not index_file:
        mappings_bam, index_file = create_index_file(filename, mappings_bam)
    print('creating subjobs')
    subjobs = [dxpy.new_dxjob(
                fn_input={"region_list": split,
                          "mappings_bam": mappings_bam,
                          "index_file": index_file},
                fn_name="samtoolscount_bam")
                for split in split_regions]
    
    fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
                    for subjob in subjobs]
    print('combining outputs')
    postprocess_job = dxpy.new_dxjob(
        fn_input={"countDXlinks": fileDXLinks, "resultfn": filename},
        fn_name="combine_files")
    
    countDXLink = postprocess_job.get_output_ref("countDXLink")
    
    output = {}
    output["count_file"] = countDXLink
    
    return output
    def samtoolscount_bam(region_list, mappings_bam, index_file):
        """Processing function.
    
        Arguments:
            region_list (list[str]): Regions to count in BAM
            mappings_bam (dict): dxlink to input BAM
            index_file (dict): dxlink to input BAM
    
        Returns:
            Dictionary containing dxlinks to the uploaded read counts file
        """
        #
        # Download inputs
        # -------------------------------------------------------------------
        # dxpy.download_all_inputs will download all input files into
        # the /home/dnanexus/in directory.  A folder will be created for each
    # input and the file(s) will be downloaded to that directory.
        #
        # In this example our dictionary inputs has the following key, value pairs
        # Note that the values are all list
        #     mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
        #     mappings_bam_name: [u'<bam filename>.bam']
        #     mappings_bam_prefix: [u'<bam filename>']
        #     index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
        #     index_file_name: [u'<bam filename>.bam.bai']
        #     index_file_prefix: [u'<bam filename>']
        #
    
        inputs = dxpy.download_all_inputs()
    
    # SAMtools view command requires the BAM and index file to be in the same directory
        shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
        shutil.move(inputs['index_file_path'][0], os.getcwd())
        input_bam = inputs['mappings_bam_name'][0]
    
        #
        # Per region perform SAMtools count.
        # --------------------------------------------------------------
        # Output count for regions and return DXLink as job output to
        # allow other entry points to download job output.
        #
    
        with open('read_count_regions.txt', 'w') as f:
            for region in region_list:
                    view_cmd = create_region_view_cmd(input_bam, region)
                    region_proc_result = run_cmd(view_cmd)
                    region_count = int(region_proc_result[0])
                    f.write("Region {0}: {1}\n".format(region, region_count))
        readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
        readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())
    
        return {"readcount_fileDX": readCountDXlink}
    def combine_files(countDXlinks, resultfn):
        """The 'gather' subjob of the applet.
    
        Arguments:
            countDXlinks (list[dict]): list of DXlinks to process job output files.
            resultfn (str): Filename to use for job output file.
    
        Returns:
            DXLink for the main function to return as the job output.
    
        Note: Only the DXLinks are passed as parameters.
        Subjobs work on a fresh instance so files must be downloaded to the machine
        """
        if resultfn.endswith(".bam"):
            resultfn = resultfn[:-4] + '.txt'
    
        sum_reads = 0
        with open(resultfn, 'w') as f:
            for i, dxlink in enumerate(countDXlinks):
                dxfile = dxpy.DXFile(dxlink)
                filename = "countfile{0}".format(i)
                dxpy.download_dxfile(dxfile, filename)
                with open(filename, 'r') as fsub:
                    for line in fsub:
                        sum_reads += parse_line_for_readcount(line)
                        f.write(line)
            f.write('Total Reads: {0}'.format(sum_reads))
    
        countDXFile = dxpy.upload_local_file(resultfn)
        countDXlink = dxpy.dxlink(countDXFile.get_id())
    
    return {"countDXLink": countDXlink}
    $ dx ls
    Developer Quickstart/
    Developer Tutorials/
    Quickstart/
    RNA-seq Workflow Example/
    SRR100022/
    _README.1st.txt
    $ dx ls -l
    Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
    Folder : /
    Developer Quickstart/
    Developer Tutorials/
    Quickstart/
    RNA-seq Workflow Example/
    SRR100022/
    State   Last modified       Size     Name (ID)
    closed  2015-09-01 17:55:33 712 bytes _README.1st.txt (file-BgY4VzQ0bvyg22pfZQpXfzgK)
    $ dx ls SRR100022/
    SRR100022_1.filt.fastq.gz
    SRR100022_2.filt.fastq.gz
    $ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/"
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ce10.cw2-index.tar.gz
    ce10.fasta.fai
    ce10.fasta.gz
    ce10.tmap-index.tar.gz
    $ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/*.fasta*"
    ce10.fasta.fai
    ce10.fasta.gz
    $ dx pwd
    Demo Data:/
    $ dx cd Quickstart/
    $ dx ls
    SRR100022_20_1.fq.gz
    SRR100022_20_2.fq.gz
    $ dx ls
    some_folder/
    an_applet
    ce10.fasta.gz
    Variation Calling Workflow
    $ dx mv ce10.fasta.gz C.elegans10.fasta.gz
    $ dx ls
    some_folder/
    an_applet
    C.elegans10.fasta.gz
    Variation Calling Workflow
    $ dx mv C.elegans10.fasta.gz some_folder/
    $ dx ls some_folder/
    Hg19
    C.elegans10.fasta.gz
    ...
    $ dx select project-BQpp3Y804Y0xbyG4GJPQ01xv
    Selected project project-BQpp3Y804Y0xbyG4GJPQ01x
    $ dx cd H.\ Sapiens\ -\ GRCh37\ -\ hs37d5\ (1000\ Genomes\ Phase\ II)/
    $ dx ls
    hs37d5.2bit
    hs37d5.bt2-index.tar.gz
    hs37d5.bwa-index.tar.gz
    hs37d5.cw2-index.tar.gz
    hs37d5.fa.fai
    hs37d5.fa.gz
    hs37d5.fa.sa
    hs37d5.tmap-index.tar.gz
    $ dx cp hs37d5.fa.gz project-9z94ZPZvbJ3qP0pyK1P0000p:/
    $ dx select project-9z94ZPZvbJ3qP0pyK1P0000p
    $ dx ls
    some_folder/
    an_applet
    C.elegans10.fasta.gz
    hs37d5.fa.gz
    Variation Calling Workflow
    $ dx select
    
    Note: Use "dx select --level VIEW" or "dx select --public" to select from
    projects for which you only have VIEW permission to.
    
    Available projects (CONTRIBUTE or higher):
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Mouse (ADMINISTER)
    
    Project # [1]: 2
    Setting current project to: Mouse
    $ dx ls -l
    Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
    Folder : /
    $ dx select --public
    
    Available public projects:
    0) Example 1 (VIEW)
    1) Apps Data (VIEW)
    2) Parliament (VIEW)
    3) CNVkit Tests (VIEW)
    ...
    m) More options not shown...
    
    Pick a numbered choice or "m" for more options: 1
    $ dx select --level VIEW
    
    Available projects (VIEW or higher):
    0) SAM importer test (CONTRIBUTE)
    1) Scratch Project (ADMINISTER)
    2) Shared Applets (VIEW)
    3) Mouse (ADMINISTER)
    
    Pick a numbered choice or "m" for more options: 2
    $ dx select project-9zVfbG2y8x65kxKY7x20005G
    Selected project project-9zVfbG2y8x65kxKY7x20005G
    $ dx ls -l
    Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
    Folder : /
    {
      "license": {
        "serialNumber": "<Serial number from Stata>",
        "code": "<Code from Stata>",
        "authorization": "<Authorization from Stata>",
        "user": "<Registered user line 1>",
        "organization": "<Registered user line 2>"
      }
    }
    {
      "licenseFile": {
        "$dnanexus_link": {
          "id": "file-xxxx",
          "project": "project-yyyy"
        }
      }
    }
    !dx download project-xxxx:file-yyy
    use /mnt/project/<path>/data_in.dta
    import delimited /mnt/project/<path>/data_in.csv
    save data_out
    export delimited data_out.csv
    !dx upload <file> --destination=<destination>
    dx upload <file> --destination=<destination>
    pandas_df = spark_df.toPandas()
    pandas_df.to_stata("data_out.dta")
    pandas_df.to_csv("data_out.csv")
    %%bash
    dx upload <file>
The Tools Library provides a list of available apps and workflows. To see this list, select Tools Library from the Tools entry in the main Platform menu.

    On the DNAnexus Platform, apps and workflows are generically referred to as "tools."

    To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:

    Find all tools with 'assay' in their name.

    To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row is highlighted in blue. The tool's inputs and outputs are displayed in a pane to the right of the list:

    Check a tool's list of inputs and outputs.

To make sure you can find a tool later, you can pin it to the top of the list. Click the More actions (⋮) icon at the far right end of the row showing the tool's name and key details about it. Then click Add Pin.

    Pin your favorite apps to the top of the list.

To learn more about a tool, click on its name in the list. The tool's detail page opens, showing a wide range of info, including guidance on how to use it, version history, pricing, and more:

    View app details with usage instructions.

    Running Apps and Workflows

    Launching a Tool

    Launching from the Tools Library

You can quickly launch the latest version of any given tool from the Tools Library page. Or you can navigate to the app's details page and click Run.

    By default, you run the latest app version.

    Launching from a Project

    From within a project, navigate to the Manage pane, then click the Start Analysis button.

    A dialog window opens, showing a list of tools. These include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:

Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in its folder location, and click Run.

    Launch Configuration

Confirm the details of the tool you are about to run. You must select a project location before any tool can be run, and you need at minimum CONTRIBUTE access to that project.

    Provide name and output location before the launch

    Specialized tools, such as JupyterLab and Spark Apps, require special licenses to run.

    Configure Inputs and Outputs

    The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.

    Fill in the required inputs before starting the run

    You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.

    Help information for each field and the tool overall

    To configure instance type settings for a given tool or stage, click the Instance Type icon located on the top-right corner of the stage.

    Show / Hide instance type settings

    To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.

    Configure output locations for each stage of the workflow

    The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.

    The workflow's I/O graph visualization

    Once all required inputs have been configured, the page indicates that the run is ready to start. Click on Start Analysis to proceed to the final step.

    The tool has been fully configured and ready to start the run

    Configure Runtime Settings

    As the last step before launching the tool, you can review and confirm specific runtime settings, including execution name, output location, priority, job rank, spending limit, and resource allocation. You can also review and modify instance type settings before starting the run.

    Once you have confirmed final details, click Launch Analysis to start the run.

    Review and confirm runtime settings before starting the run
    Configure advanced runtime settings

    A license is required to use the Job Ranking feature. Contact DNAnexus Support for more information.

    Batch Run

    Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.

    Specify Batch Inputs

    To enable batch run, start from any input that you wish to specify for batch run, and open its I/O Options menu on the right hand side. From the list of options available, select Enable Batch Run.

    Input fields with batch run enabled are highlighted with a Batch label. Click any of the batch enabled input fields to enter the batch run configuration page.

    Not all input classes are supported for batch run configuration. See table below.

Input Class: Batch Run Support

• Files and other data objects: Yes

• Files and other data objects (array): Partially supported; can accept entry of a single-value array

• String: Yes

• Integer: Yes

• Float: Yes

• Boolean: Yes

    Configure Batch Inputs

    The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.

Similar to the configuration of inputs for non-batch runs, you need to fill in all the required input fields to proceed to the next step. Optional inputs, or required inputs with a predefined default value, can be left empty.

    Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.

A total of 10 batch runs have been fully configured and are ready to launch

    Starting and Monitoring Your Analysis

    Once you've finished setting up your tool, start your analysis by clicking the Start Analysis button. Follow these instructions to monitor the job as it runs.

    Learn More

    Learn in depth about running apps and workflows, leveraging advanced techniques like Smart Reuse.

    Learn how to build an app.

    Learn more about building apps using Bash or Python.

    Learn in depth about building and deploying apps, including Spark apps.

    Learn in depth about importing, building, and running workflows.

    Follow these instructions to set up billing.
    Step 1. Create a Project

    Before you can add and analyze data on the Platform, you need a project to house that data.

    To create a project:

    1. In the DNAnexus Platform, select Projects > All Projects.

    2. In the Projects page, click New Project.

    3. In the New Project dialog:

      1. In Project Name, enter your project's name.

      2. (Optional) In More Info, you can enter Tags or custom-defined Properties. These make it easier to find the project later, and organize it among other projects.

      3. (Optional) In More Info, you can enter a Project Summary and Project Description to help other users understand the project's purpose.

      4. In Billing > Billed To, choose a billing account to which project charges are billed.

      5. In Billing, choose a cloud region to use for storing project files and running analyses. Feel free to use the default region.

      6. (Optional) In Usage Limits, available in Billed To orgs with compute and egress usage limits configured, you can set project-level limits for each.

      7. In Access, you can specify access levels for specific types of users, defining who can copy, delete, and download data. Feel free to accept the defaults.

      8. Click Create Project.

    After the project is created, you can add data in the Manage page.

    Once you add data to your project, this is where you can see and get info on this data, and launch analyses that use it.

    Step 2. Add Project Members

    Once you've created a project, you can add members by doing the following:

    1. From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.

    2. Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.

    3. In Access, choose the type of access the user or org has to the project. For more on this, see the detailed explanation of project access levels.

    4. If you don't want the user to receive an email notification on being added to the project, toggle Email Notification to "Off."

    5. Click the Add User button.

    6. Repeat Steps 2-5, for each user you want to add to the project.

    7. Click Done when you're finished adding members.

    Step 3. Add Data to Your Project

    To add data to your project, click the Add button in the top right corner of the project's Manage screen. You see three options for adding data:

    • Upload Data - Use your web browser to upload data from your computer. For long upload times, you must stay logged into the Platform and keep your browser window open until the upload completes.

    • Add Data from Server - Specify a URL of an accessible server from which the file is uploaded.

    • Copy Data from Project - Copy data from another project on the Platform.

    When uploading large files, consider using the Upload Agent, a command-line tool that's both faster and more reliable than uploading via the UI.

    Adding Data to Use in Your First Analysis

    To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:

    1. From the project's Manage screen, click the Add button, then select Copy Data from Project.

    2. In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.

    3. Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.

    4. Click the box next to the Name header, to select both files.

    5. Click Copy to copy the files to your project.

    Step 4. Install Apps

    Next, install the apps you need, to analyze the data you added to the project in Step 3:

    1. Select Tools Library from the Tools link in the main menu.

    2. A list of available tools opens.

    3. Find the BWA-MEM FASTQ Read Mapper in the list and click on its name.

    4. A tool detail page opens, showing a full range of information about the tool and how to use it.

    5. Click the Install button in the upper left part of the screen, under the name of the tool.

    6. In the Install App modal, click the Agree and Install button.

    7. After the tool has been installed, you are returned to the tool detail page.

    8. Use your browser's "Back" button to return to the tools list page.

    9. Repeat Steps 3-6 to install the FreeBayes Variant Caller.

    Step 5. Build a Workflow

    Build a workflow using the two apps you installed, and configure it to use the data you added to your project in Step 3.

    Adding Workflow Steps

    A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:

    1. Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.

    2. Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder opens.

    3. In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.

    4. Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.

    5. Repeat Step 4 for the FreeBayes Variant Caller.

    6. Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You return to the main Workflow Builder screen.

    Setting Inputs for Each Step

    In the Workflow Builder, required inputs have orange placeholder text, while optional inputs have black placeholder text.

    Set the required inputs for each step by doing the following:

    1. To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file. Then click the Select button.

    2. Since the SRR100022 exome was sequenced using paired-end sequencing, you need to provide the right-mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.

    3. Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there is a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.

    4. Next set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second step input labeled "Sorted mappings [array]."

    5. Click on the second step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.

    Each tool has different input and output requirements. To learn about a tool's required and optional inputs and outputs, file format restrictions, and other configuration details, refer to its detail page in the Tools Library.

    Step 6. Launch the Workflow

    You're ready to launch your workflow, by doing the following:

    1. Click the Start Analysis button at the upper right corner of the Workflow Builder.

    2. In the modal window that opens, click the Run as Analysis button.

    The BWA-MEM FASTQ Read Mapper starts executing immediately. Once it finishes, the FreeBayes Variant Caller starts, using the Read Mapper's output as an input.

    Step 7. Monitor Your Job

    Once you've launched your workflow, you are taken to your project's Monitor screen. Here, you see a list of both current and past analyses run within the project, along with key information about each run.

    As your workflow runs, its status shows as "In Progress."

    Terminating Your Job

    If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you see a red button labeled Terminate. Click the button to terminate the job. This process may take some time. While the job is being terminated, the job's status shows as "Terminating."

    Step 8. Access the Results

    When your workflow completes, output files are placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.

    Running the Workflow Using the Full SRR100022 Exome

    You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder, in the "Demo Data" project. Because this means working with a much larger file, running the workflow using the exome data takes longer.

    Learn More

    See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:

    • Projects

    • Apps and Workflows

    For a video intro to the Platform, watch the series of short, task-oriented tutorials.

    For a more in-depth video intro to the Platform, watch the DNAnexus Platform Essentials video.

    Follow these instructions to set up billing.

    Analyzing Somatic Variants

    Analyze somatic variants, including cancer-specific filtering, visualization, and variant landscape exploration in the Cohort Browser.

    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Explore and analyze datasets with somatic variant assays by opening them in the Cohort Browser and switching to the Somatic Variants tab. You can create cohorts based on somatic variants, visualize variant patterns, and examine detailed variant information.

    You can analyze somatic variants across four main categories: Single Nucleotide Variants (SNVs) & Indels for small genomic changes, Copy Number Variants (CNVs) for alterations in gene copy numbers, Fusions for structural rearrangements involving gene coding sequences, and Structural Variants (SVs) for larger genomic rearrangements.

    Somatic assay datasets are created using the Somatic Variant Assay Loader.

    Variant Classification

    The somatic data model classifies all genomic variants into four main classes, defined by their size, structure, and representation in VCF files. Each variant type has specific criteria that must be met for classification.

    Variant Type | Classification Criteria | Examples
    SNV & Indel (single base substitutions and small insertions/deletions with precise allele sequences) | All must match: variant size ≤ 50bp; ALT field contains a precise allele (NOT symbolic like <DEL>, <INS>, <DUP>, <CNV>) | A→G, ATCG→A, A→ATCG
    Copy Number Variant (CNV) (changes in gene copy number) | All must match: ALT field contains a symbolic allele (<CNV>, <DEL>, <DUP>); explicit copy number value present in FORMAT field key CN | <CNV>, <DEL>, <DUP>
    Fusion (structural rearrangements involving gene coding sequences) | All must match: ALT field contains breakend notation with square brackets ([ or ]); at least one breakpoint overlaps with an annotated gene or transcript | [chr2:123456[, ]chr5:789012]
    Structural Variant (SV) (large or complex structural changes) | Either must match: variant length > 50bp; ALT field contains a symbolic allele (<DEL>, <INV>, <CNV>, <BND>) | <DEL>, <INV>, large insertions

    Somatic Variants in Cohort Browser

    For optimal performance and annotation scalability, the Cohort Browser processes SVs and CNVs between 50bp and 10Mbp differently than larger variants:

    • SVs and CNVs ≤ 10Mbp: Fully annotated with gene symbols and consequences, and appear in all visualizations including the Variant Frequency Matrix.

    • SVs and CNVs > 10Mbp: Ingested and visible in the Variants & Events table but lack gene-level annotations. These larger variants do not appear in the Variant Frequency Matrix and cannot be filtered using gene symbols or consequence terms. Use genomic coordinates or variant IDs to filter for these variants (see Working with Large Structural Variants below).

    • Fusions are not affected by this size limit as they are considered two single-position events.

    • CNVs and Fusions are also classified as Structural Variants in the Cohort Browser because they use symbolic allele representations (<CNV>, <DEL>, <DUP>, <BND>). This dual classification ensures they are correctly distinguished from SNVs regardless of their physical length.

    Filtering by Somatic Variants

    You can define your cohort to include only samples with specific somatic variants.

    To apply a somatic filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Assays > Variant (Somatic), select a genomic filter.

    3. In Edit Filter: Variant (Somatic), specify the criteria:

      • For datasets with multiple somatic variant assays, select the specific assay to filter by.

      • Choose whether to include patients with at least one detected variant matching the specified criteria (WITH Variant), or include only patients who have no detected variants matching the criteria (WITHOUT Variant). By default, the filter includes those with matching variants. This choice applies to all specified filtering criteria.

      • On the Genes / Effects tab, select variants of specific types and variant consequences within specified genes and genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.

      • On the HGVS tab, specify a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol. Example: KRAS p.Arg1459Ter.

      • On the Variant IDs tab, specify variant IDs using the standard format chr_pos_ref_alt (for example, 17_7674257_A_G). You can enter up to 10 variant IDs in a comma-separated list.

      • Enter multiple genes, ranges, or variants by separating them with commas or placing each on a new line.

      • Click Apply Filter.

    You can specify up to 10 somatic variant filters for each cohort.

    Working with Large Structural Variants (>10Mbp)

    Structural variants larger than 10 megabases lack gene-level annotations, which limits how you can filter and visualize them. Use these alternative filtering approaches:

    • Filter by genomic coordinates: In the Genes / Effects filter, enter genomic coordinates in the format chr:start-end, for example, 17:7661779-7687538 for the TP53 gene region. Set the variant type scope to SV or CNV and leave consequence types blank. Find gene coordinates by typing the gene symbol in the search icon next to the Variants & Events table.

    • Filter by variant IDs: In the Variant IDs filter, enter up to 10 variant IDs in the format chr_pos_ref_alt, for example, 17_7674257_A_<DEL>. To get variant IDs, navigate to the gene region in the Variants & Events table, select variants of interest, and download the CSV file - the Location column contains the variant IDs.

    For comprehensive structural variant analysis, combine multiple filtering approaches. Use gene symbol filters to capture annotated structural variants ≤ 10Mbp, then add coordinate-based filters to include larger structural variants in the same genomic regions.

    Large structural variants are visible in the Variants & Events table with full details, but they do not appear in the Variant Frequency Matrix due to missing gene-level annotations.

    Comparing Variant Patterns Across Your Cohort

    The Variant Frequency Matrix provides a visual overview of how often somatic variants appear throughout your cohort. The matrix helps you identify variant patterns across tumor samples and discover which variants frequently occur together. You can also measure the mutation burden in different genes and compare how mutation profiles differ between two cohorts. This makes it easier to spot trends and relationships in your data that might not be apparent when examining individual variants.

    The Variant Frequency Matrix is interactive. You can filter by genes and consequences, view details of specific genes and samples, and zoom in on specific genes or regions.

    In the Variant Frequency Matrix, rows represent genes and columns represent samples; both are sorted by variant frequency.

    • Sorted gene list: Genes are ranked from most to least frequently affected by variants. A sample is considered "affected" by a gene if it is a tumor sample with at least one detected variant of high or moderate impact in that gene's canonical transcript. Matched normal samples are not included in this calculation.

    • Sorted sample list: Samples are ordered by the total number of genes that contain variants. This ranking is independent of how frequently each individual gene is affected.

    The Variant Frequency Matrix displays up to the top 50 genes with the most variants and up to 500 samples for any given cohort. The samples shown are the 500 with the highest number of genes containing variants. If your cohort has fewer than 500 samples, the matrix shows all samples.

    Filtering by Genes and Consequences

    By default, the Variant Frequency Matrix includes all genes and samples. To narrow your view, you can filter the matrix to specific classes of somatic variants, such as SNVs & Indels, Structural Variants, CNVs, or Fusions.

    Using the legend in the bottom right, you can focus on specific variants, events, or consequences. This allows you to better explore particular areas of interest, such as high-impact mutations or specific consequences relevant to your research.

    When comparing cohorts, the matrix can display the top 200 samples (columns) from both the primary and secondary cohorts. The top genes are selected and sorted by their variant frequency within the primary cohort.

    Viewing Gene and Sample Details

    The Variant Frequency Matrix is highly interactive, allowing you to quickly access more details and apply filters.

    When you hover over a cell, the matrix shows a unique identifier for the sample, along with a breakdown of the variants detected in that gene, organized by their consequence type. You can copy the sample ID to your clipboard to apply it to a cohort filter.

    When you hover over a gene ID on the left axis, the matrix shows more information about that gene. This includes a unique identifier for the gene, along with a quick breakdown of available external annotations, with direct links to the CIViC and OncoKB databases (when available).

    To create a filter, hover over the gene and click + Add to Filter, or copy the gene ID to your clipboard for use in a custom filter.

    Color Coding and Consequences

    The Variant Frequency Matrix uses color coding to represent the consequences of detected variants, providing a quick visual assessment of variant types. Only high and moderate impact consequences, as defined by Ensembl VEP version 109, are included in this visualization.

    Samples with two or more detected variants are color-coded as "Multi Hit", indicating a complex variant profile.

    Exploring Gene-Level Mutation Patterns

    The Lollipop Plot is a visualization tool that shows the somatic variants of a cohort on a single gene's canonical protein. With Lollipop Plot, you can identify mutation hotspots within a specific gene, understand the functional impact of variants in the context of protein domains, compare mutation patterns across different patient cohorts, and explore recurrent mutations in cancer driver genes.

    Use the Go to Gene field to quickly navigate to a gene of interest, such as TP53.

    When you hover over a lollipop, you can see details about the amino acid change, such as the HGVS notation and the frequency of that change in the current cohort. The plot also shows the location of each mutation along the protein sequence, with color coding to indicate the consequence type.

    The Lollipop Plot displays SNV & Indel data from the same genomic region as the Variants & Events table. When you change the genomic region in the table, the Lollipop Plot updates to reflect the change, and vice versa.

    Reading the Lollipop Plot

    • Each lollipop on the plot represents amino acid changes at a specific location.

    • The horizontal position (X axis) indicates the location of the change, while the height (Y axis) represents the frequency of that change within the current cohort.

    • Lollipops are color-coded by consequence based on the canonical transcript.

    • If a lollipop represents multiple consequence types, it is coded as "Multi Hit".

    • You can identify mutation hotspots for a given gene and see protein changes in HGVS short form notation, such as T322A, and HGVS.p notation, such as p.Thr322Ala.

    Examining Detailed Variant Information

    The Variants & Events table displays details on the same genomic region as the Lollipop Plot. You can filter the table to focus on specific variant types, such as SNV & Indels, SV (Structural Variants), CNV, or Fusion.

    Unlike the Variant Frequency Matrix, the Variants & Events table displays all structural variants including those larger than 10Mbp. Use this table to examine large SVs that may not appear in other visualizations.

    Information displayed in the Variants & Events table includes:

    • Location of variant, with a link to its locus details

    • Reference allele of variant

    • Alternate allele of variant

    • Type of variant, such as SNV, Indel, or Structural Variant

    • Variant consequences, with entries color-coded by level of severity

    • HGVS cDNA

    • HGVS Protein

    • COSMIC ID

    • RSID, with a link to the dbSNP entry for the variant

    Exporting Variant Information

    You can export the selected variants in the Variants & Events table as a list of variant IDs or a CSV file.

    • To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.

    • To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).

    Accessing External Annotations and Resources

    In Variants & Events > Location column, you can click on the specific location to open the locus details.

    The locus details show specific SNV & Indel variants as well as up to 200 structural variants overlapping with the specific location. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.

    The locus details include enhanced annotations to external resources:

    • Gene-level links - Direct links to gene information in external databases

    • Variant-level links - Links to variant-specific annotation resources

    These links allow you to quickly navigate to external annotation resources for further information about genes or variants of interest.

    Spark Cluster-Enabled DXJupyterLab

    Learn to use the DXJupyterLab Spark Cluster app.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    The DXJupyterLab Spark Cluster App is a Spark application that runs a fully-managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis from directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.

    Besides the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:

    • Explore the available databases and get an overview of the available datasets

    • Perform analyses and visualizations directly on data available in the database

    • Create databases

    • Submit data analysis jobs to the Spark cluster

    Check the general Overview for an introduction to DNAnexus JupyterLab products.

    Running and Using DXJupyterLab Spark Cluster

    The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus Platform. The References page has additional useful tips for using the environment.

    Instantiating the Spark Context

    Having created your notebook in the project, you can populate your first cells as shown below. It is good practice to instantiate your Spark context at the beginning of your analyses.
    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)

    Basic Operations on DNAnexus Databases

    Exploring Existing Databases

    To view any databases to which you have access in your current region and project context, run a cell with the following code:
    spark.sql("show databases").show(truncate=False)

    A sample output should be:
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+

    You can inspect one of the returned databases by running:
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)

    which should return an output similar to:
    +------------------------------------+-----------+-----------+
    |namespace                           |tableName  |isTemporary|
    +------------------------------------+-----------+-----------+
    |database_xxxx__brca_pheno           |cna        |false      |
    |database_xxxx__brca_pheno           |methylation|false      |
    |database_xxxx__brca_pheno           |mrna       |false      |
    |database_xxxx__brca_pheno           |mutations  |false      |
    |database_xxxx__brca_pheno           |patient    |false      |
    |database_xxxx__brca_pheno           |sample     |false      |
    +------------------------------------+-----------+-----------+

    To find a database in your current region that may be in a different project than your current context, run the following code:
    show databases like "<project_id_pattern>:<database_name_pattern>";
    show databases like "project-*:<database_name>";

    A sample output should be:
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+

    To inspect one of the databases listed in the output, use the unique database name. If you use only the database name, results are limited to the current project. For example:
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)

    Creating Databases

    Here's an example of how to create and populate your own database:
    # Create a database and a table, insert a row, then query it
    my_database = "my_database"
    spark.sql("create database " + my_database + " location 'dnax://'")
    spark.sql("create table " + my_database + ".foo (k string, v string) using parquet")
    spark.sql("insert into table " + my_database + ".foo values ('1', '2')")
    spark.sql("select * from " + my_database + ".foo")

    You can separate each line of code into different cells to view the outputs iteratively.

    Using Hail

    Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is available with DXJupyterLab Spark Cluster. It is included in the app and can be used when the app is run with the feature input set to HAIL (the default).

    Initialize the Hail context when beginning to use Hail. It's important to pass the previously started Spark context sc as an argument:
    import hail as hl
    hl.init(sc=sc)

    We recommend continuing your exploration of Hail with the GWAS using Hail tutorial. For example:
    # Download example data from 1k genomes project and inspect the matrix table
    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
    mt = hl.read_matrix_table('data/1kg.mt')
    mt.rows().select().show(5)

    Using VEP with Hail

    To use VEP (Ensembl Variant Effect Predictor) with Hail, select "Feature," then "HAIL-VEP" when launching Spark Cluster-Enabled DXJupyterLab via the CLI.

    VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequence, and regulatory regions. The LoF plugin is included as well, and is used when the VEP configuration includes the LoF plugin, as shown in the configuration file below.
    # Annotate hail matrix table with VEP and LoF using configuration specified in the
    # vep-GRCh38.json file in the project you're working in.
    #
    # Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
    # as well as VEP and LoF resources that are pre-installed on every Spark node when
    # HAIL-VEP feature is selected.
    annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")

    % cat /mnt/project/vep-GRCh38.json
    {"command": [
         "docker", "run", "-i", "-v", "/cluster/vep:/root/.vep", "dnanexus/dxjupyterlab-vep",
         "./vep", "--format", "vcf", "__OUTPUT_FORMAT_FLAG__", "--everything", "--allele_number",
         "--no_stats", "--cache", "--offline", "--minimal", "--assembly", "GRCh38", "-o", "STDOUT",
         "--check_existing", "--dir_cache", "/root/.vep/",
         "--fasta", "/root/.vep/homo_sapiens/109_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
        "--plugin", "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"],
      "env": {
          "PERL5LIB": "/root/.vep/Plugins"
      },
      "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele: String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:      String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:  String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:            Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:          Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:   String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:   String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:                        Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],    distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:   String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,             sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"

    Behind the Scenes

    The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.

    The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.

    The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:

    The default user in the container is root.

    The option --network host is used when starting Docker to remove the network isolation between the host and the Docker container, which allows the container to bind to the host's network and access Spark's master port directly.

    Accessing AWS S3 Buckets

    S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. The s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.

    Public Bucket Access

    To access public s3 buckets, you do not need to have s3 credentials. The example below shows how to access the public 1000Genomes bucket in a JupyterLab notebook:
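    The following is a minimal sketch rather than the original example: it assumes the Spark session spark created earlier on this page, and the file path inside the public 1000genomes bucket is illustrative.

    # Minimal sketch: anonymous access to a public S3 bucket via the s3a connector.
    # Assumes the Spark session "spark" from "Instantiating the Spark Context".
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )

    # The path below is illustrative; point it at any object in the public bucket.
    df = spark.read.csv("s3a://1000genomes/20131219.populations.tsv", sep="\t", header=True)
    df.show(5)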

    When the above is run in a notebook, the first rows of the file are displayed.

    Private Bucket Access

    To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
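    As a sketch (the credential values and bucket path are placeholders), the required S3 credentials can be supplied through the Hadoop configuration of the existing Spark session:

    # Minimal sketch: supply AWS credentials for a private bucket via s3a settings.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY_ID")
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_ACCESS_KEY")

    df = spark.read.parquet("s3a://my-private-bucket/path/to/data.parquet")
    df.printSchema()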

    SQL Runner

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Overview

    The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.

    How to Run Spark SQL Runner

    Input:

    • sqlfile: [Required] A SQL file which contains an ordered list of SQL queries.

    • substitutions: A JSON file which contains the variable substitutions.

    • user_config: User configuration JSON file, in case you want to set or override certain Spark configurations.

    Other Options:

    • export: (boolean) default false. Exports output files with results for the queries in the sqlfile.

    • export_options: A JSON file which contains the export configurations.

    • collect_logs: (boolean) default false. Collects cluster logs from all nodes.

    • executor_memory: (string) Amount of memory to use per executor process, in MiB unless otherwise specified. Common values include 2g or 8g. This is passed as --executor-memory to Spark submit.

    • executor_cores: (integer) Number of cores to use per executor process. This is passed as --executor-cores to Spark submit.

    • driver_memory: (string) Amount of memory to use for the driver process. Common values include 2g or 8g. This is passed as --driver-memory to Spark submit.

    • log_level: (string) default INFO. Logging level for both driver and executors. [ALL, TRACE, DEBUG, INFO]

    Output:

    • output_files: Output files include report SQL file and query export files.

    Basic Run

    Examples

    sqlfile

    How sqlfile is Processed

    1. The SQL runner extracts each command in sqlfile and runs them in sequential order.

    2. Every SQL command needs to be separated with a semicolon ;.

    3. Any command starting with -- is ignored (comments). Any comment within a command should be inside /*...*/. The following are examples of valid comments:
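    For illustration only (the database and table names are placeholders), a sqlfile might contain comments such as:

    -- This full-line comment is ignored by the SQL runner
    select * from mydb.patient /* inline comment inside a command */ limit 10;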

    Variable Substitution

    Variable substitution can be done by specifying the variables to replace in substitutions.
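    As an illustration, a substitutions file mapping the variables used below might look something like this (the exact JSON layout expected by the app may differ):

    {
      "srcdb": "sskrdemo1",
      "patient_table": "patient"
    }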

    In the above example, each reference to srcdb in sqlfile within ${...} is substituted with sskrdemo1. For example, select * from ${srcdb}.${patient_table};. The script adds the set command before executing any of the SQL commands in sqlfile. As a result, select * from ${srcdb}.${patient_table}; translates to:
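    Continuing the sketch above, with patient_table set to patient, the substituted command would be:

    select * from sskrdemo1.patient;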

    Export

    If enabled, the results of the SQL commands are exported to a CSV file. export_options defines an export configuration.

    1. num_files: default 1. This defines the maximum number of output files to generate. The number generally depends on how many executors are running in the cluster and how many partitions of this file exist in the system. Each output file corresponds to a part file in parquet.

    2. fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example, 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv

    3. header: Default is true. If true, a header is added to each exported file.
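    A sketch of an export_options file using the settings described above (the values shown are illustrative):

    {
      "num_files": 1,
      "fileprefix": "demo",
      "header": true
    }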

    User Configuration

    Values in spark-defaults.conf override or add to the default Spark configuration.
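    As an illustration, standard Spark properties of the kind that can be overridden this way include (these are generic Spark settings, not app-specific values):

    spark.sql.shuffle.partitions    200
    spark.driver.maxResultSize      4g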

    Output Files

    The export folder contains two generated files:

    • <JobId>-export.tar: Contains all the query results.

    • <JobId>-outfile.sql: SQL debug file.

    Export Files

    After extracting the export tar file, the structure appears as follows:

    In the above example, demo is the fileprefix used. The export produces one folder per query. Each folder contains a SQL file with the query executed and a .csv folder containing the result CSV.

    SQL Report File

    Every SQL run execution generates a SQL runner debug report file. This is a SQL file.

    It lists all the queries executed and status of the execution (Success or Fail). It also lists the name of the output file for that command and the time taken. If there are any failures, it reports the query and stops executing subsequent commands.

    SQL Errors

    During execution of the series of SQL commands, a command may fail (error, syntax, etc). In that case, the app quits and uploads a SQL debug file to the project:

    The output identifies the line with the SQL error and its response.

    The query in the .sql file can be fixed, and this report file can be used as input for a subsequent run, allowing you to resume from where execution stopped.

    Using Spark

    Connect with Spark for database sharing, big data analytics, and rich visualizations.

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is straightforward: platform access levels map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.

    The DNAnexus Platform provides two ways to connect to the Spark service: through the Thrift server or, for more scalable throughput, using Spark applications.

    Thrift Server

    DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC, using a client like beeline, to run Spark SQL interactively. Refer to the Thrift Server page for more details.

    Spark Applications

    You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs leverage the features of normal jobs. You have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You also have access to logs from workers and can monitor the job in the Spark UI.

    Visualization

    With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts quickly without the need to write complex SQL queries by hand.

    Databases

    A database is a data object on the Platform. A database object is stored in a project.

    Database Sharing

    Databases can be shared with other users or organizations through project sharing. Access to a database can be revoked at any time by the project administrator revoking access to the project. If revoking access to the project is impossible, the database can be relocated to another project with a different set of collaborators.

    Database and Project Policies

    Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context, when connecting to Thrift. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:

    • The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").

    • If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects are available for retrieving data.

    • If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects are available to add new data.

    A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.

    Database Access

    As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, and data, and into specific actions on the database object in the project.

    The following tables list the supported actions on a database and on the database object, with the lowest access level necessary for an open and a closed database.

    Spark SQL Function | Open Database | Closed Database
    ALTER DATABASE SET DBPROPERTIES | CONTRIBUTE | N/A
    ALTER TABLE DROP PARTITION | CONTRIBUTE (*) | N/A
    ALTER TABLE RENAME | CONTRIBUTE | N/A
    ALTER TABLE RENAME PARTITION | CONTRIBUTE | N/A
    ANALYZE TABLE COMPUTE STATISTICS | UPLOAD | N/A
    CACHE TABLE, CLEAR CACHE | N/A | N/A
    CREATE DATABASE | UPLOAD | UPLOAD
    CREATE FUNCTION | N/A | N/A
    CREATE TABLE | UPLOAD | N/A
    CREATE VIEW | UPLOAD | UPLOAD
    DESCRIBE DATABASE, TABLE, FUNCTION | VIEW | VIEW
    DROP DATABASE | CONTRIBUTE (*) | ADMINISTER
    DROP FUNCTION | N/A | N/A
    DROP TABLE | CONTRIBUTE (*) | N/A
    EXPLAIN | VIEW | VIEW
    INSERT | UPLOAD | N/A
    REFRESH TABLE | VIEW | VIEW
    RESET | VIEW | VIEW
    SELECT | VIEW | VIEW
    SET | VIEW | VIEW
    SHOW COLUMNS | VIEW | VIEW
    SHOW DATABASES | VIEW | VIEW
    SHOW FUNCTIONS | VIEW | VIEW
    SHOW PARTITIONS | VIEW | VIEW
    SHOW TABLES | VIEW | VIEW
    TRUNCATE TABLE | UPLOAD | N/A
    UNCACHE TABLE | N/A | N/A

    Data Object Action | Open Database | Closed Database
    Add Tags | UPLOAD | CONTRIBUTE
    Add Types | UPLOAD | N/A
    Close | UPLOAD | N/A
    Get Details | VIEW | VIEW
    Remove | CONTRIBUTE (*) | ADMINISTER
    Remove Tags | UPLOAD | CONTRIBUTE
    Remove Types | UPLOAD | N/A
    Rename | UPLOAD | CONTRIBUTE
    Set Details | UPLOAD | N/A
    Set Properties | UPLOAD | CONTRIBUTE
    Set Visibility | UPLOAD | N/A

    (*) If a project is protected, then ADMINISTER access is required.

    Database Naming Conventions

    The system handles database names in two ways:

    • User-provided name: Your database name is converted to lowercase and stored as the databaseName attribute.

    • System-generated unique name: A unique identifier is created by combining your lowercase database name with the database object ID (also converted to lowercase with hyphens changed to underscores) separated by two underscores. This is stored as the uniqueDatabaseName attribute.

    When a database is created using the following SQL statement and a user-generated database name (referenced below as db_name):
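    A minimal sketch of such a statement, following the CREATE DATABASE ... LOCATION 'dnax://' pattern used elsewhere in this documentation:

    CREATE DATABASE db_name LOCATION 'dnax://'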

    The platform database object, database-xxxx, is created with all lowercase characters. However, when creating a database using dxpy, the Python module supported by the DNAnexus SDK (dx-toolkit), the following case-sensitive command returns a database ID based on the user-generated database name, assigned here to the variable db_name:
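    A sketch of such a lookup using dxpy (the exact call in your own workflow may differ):

    import dxpy

    # Case-sensitive lookup of the database object by name
    db = dxpy.find_one_data_object(classname="database", name=db_name)
    print(db["id"])   # database-xxxx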

    With that in mind, it is suggested to either use lowercase characters in your db_name assignment or to instead apply a forcing function like .lower() to the user-generated database name:
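    For example, as a sketch:

    db = dxpy.find_one_data_object(classname="database", name=db_name.lower())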

    Filtering Objects and Jobs

    You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.

    The filter bar lets you specify different criteria on which to filter your data. You can combine multiple filters for greater control over your results.

    To use this feature, first choose the field you want to filter your data by, then enter your filter criteria. For example, select the "Name" filter then search for "NA12878". The filter activates when you press enter or click outside of the filter bar.

    Filtering Projects

    The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.

    • Billed to: The user or org ID that the project is billed to, for example, "user-xxxx" or "org-xxxx". When viewing a partner organization's projects, the "Billed to" field is fixed to the org ID.

    • Project Name: Search by case insensitive string or regex, for example, "Example" or "exam$" both match "Example Project"

    • ID: Search by project ID, for example, "project-xxxx"

    • Created date: Search by projects created before, after, or between different dates

    • Modified date: Search by projects modified before, after, or between different dates

    • Creator: The user ID who created the project, for example, "user-xxxx"

    • Shared with member: A user ID with whom the project is shared, for example, "user-xxxx" or "org-xxxx"

    • Level: The minimum permission level to the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only". For example, "Contributor+" filters projects with access CONTRIBUTOR or ADMINISTER

    • Tags: Search by tag. The filter bar automatically populates with tags available on projects

    • Properties: Search by properties. The filter bar automatically provides properties available on projects

    Filtering Objects

    The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.

    • Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.

    • Object name: Search by case insensitive string or regex, for example, NA1 or bam$ both match NA12878.bam

    • ID: Search by object ID, for example, file-xxxx or applet-xxxx

    • Class: such as "File", "Applet", "Folder"

    • Types: such as "File" or custom Type

    • Created date: Search by objects created before, after, or between different dates

    • Modified date: Search by objects modified before, after, or between different dates

    • Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder

    • Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder

    When filtering on anything other than the current folder, results appear from many different places in the project. The folders appear in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.

    Filtering Jobs and Analyses

    The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.

    • Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead

    • State: for example, Failed, Waiting, Done, Running, In Progress, Terminated

    • Name: Search by case-insensitive string or regex, for example, "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.

    • ID: Search by job or analysis ID, for example, "job-1234" or "analysis-5678"

    • Created date: Search by executions created before, after, or between different dates

    • Launched by: Search by the user ID of the user who launched the job. The filter bar shows users who have run jobs visible in the project

    • Tags: Search by tag. The filter bar automatically populates with tags available on the visible executions

    • Properties: Search by properties. The filter bar automatically provides properties available on executions visible in the project

    • Executable: Search by the ID of the executable run by the executions in question. Examples include app-1234 or applet-5678

    • Class: for example, Analysis or Job

    • Origin Jobs: ID of origin job

    • Parent Jobs: ID of parent job

    • Parent Analysis: ID of parent analysis

    • Root Executions: ID of root execution

    Multi-Word Queries in Filters and Searches

    When filtering on a name, any spaces expand to include intermediate words. For example, filtering by "b37 build" also returns "b37 dbSNP build".

    Filtering by Date

    Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time in minutes, hours, days, weeks, months, or an absolute time by specifying a certain date.

    For relative time, specify an amount of time before the access time. For example, selecting "Day" and typing 5 sets the datetime to 5 days before the current time.

    Alternatively, you can use the calendar to represent an exact (absolute) datetime.

    Setting only the beginning datetime ("From") creates a range from that time to the access time. Setting only the end datetime ("To") creates a range from the earliest records to the "To" time.

    A filter with a relative time period updates each time it is accessed. For example, a filter for items created within two hours shows different results at different times: items from 9am at 11am, and items from 2pm at 4pm. For consistent results, use absolute datetimes from the calendar widget.

    Filtering by Tags and Properties

    Tags

    To search by tag, enter or select the tags you want to find. For example, to find all objects tagged with "human", type "human" in the filter box and select the checkbox next to the tag.

    Unlike other searches where you can enter partial text, tag searches require the complete tag name. However, capitalization doesn't matter. For example, searching for "HUMAN", "human", or "Human" all find objects with the "Human" tag. Partial matches like "Hum" do not return results.

    Properties

    Properties have two parts: a key and a value. The system prompts for both when creating a new property. Like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.

    To search for all items that have a property, regardless of the value of that property, select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that have a property with a specific value, enter that property's key and value.

    The keys and values must be entered in their entirety. For example, entering the key sample and the value NA does not match objects with {"sample_id": "NA12878"}.

    Any vs. All Queries

    Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you have a choice whether to search for objects containing any of the selected tags or containing all the selected tags.

    Given the following set of objects:

    • Object 1 (tags: "human", "normal")

    • Object 2 (tags: "human", "tumor")

    • Object 3 (tags: "mouse", "tumor")

    Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.

    Clearing All Filters

    Click the "Clear All Filters" button on the filter bar to reset your filters.

    Saving Filters

    If you wish to save your filters, active filters are saved in the URL of the filtered page. You can bookmark this URL in your browser to return to your filtered view in the future.

    Bookmarking a filtered URL saves the search parameters, not the search results. The filters are applied to the data present when accessing the bookmarked link. For example, filters for items created in the last thirty days show items from the thirty days before viewing the results, not the thirty days before creating the bookmark. Results update based on when you access the saved search.

    Relational Database Clusters

    Relational Database Clusters

    The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.

    The Relational Database Service is accessible through the application program interface (API) in AWS regions only. See DBClusters API page for details.

    A license is required to access the Relational Database Service. Contact DNAnexus Sales for more information.

    Overview of the Relational Database Service

    DNAnexus Relational DB Cluster States

    When describing a DNAnexus DBCluster, the status field can be any of the following:

    DBCluster status | Details
    creating | The database cluster is being created, but not yet available for reading/writing.
    available | The database cluster is created and all replicas are available for reading/writing.
    stopping | The database cluster is stopping.
    stopped | The database cluster is stopped.
    starting | The database cluster is restarting from a stopped state, transitioning to available when ready.
    terminating | The database cluster is being terminated.
    terminated | The database cluster has been terminated and all data deleted.

    Connecting to a DB Cluster

    DB Clusters are not accessible from outside of the DNAnexus Platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page on cloud workstations for one possible way to access a DB Cluster from within a job. Executions such as apps/applets can access a DB Cluster as well.

    The parameters needed for connecting to the database are:

    • host: Use endpoint as returned from dbcluster-xxxx/describe

    • port: 3306 for MySQL Engines or 5432 for PostgreSQL Engines

    • user: root

    • password: Use the adminPassword specified when creating the database via dbcluster/new

    • For MySQL: ssl-mode 'required'

    • For PostgreSQL: sslmode 'require'. Note: For connecting and verifying certs, see Using SSL/TLS to encrypt a connection to a DB instance or cluster
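    As a minimal sketch (the endpoint and password are placeholders, and the PyMySQL client shown here is an assumption -- any MySQL or PostgreSQL client installed in the job's execution environment can be used with the parameters above):

    import pymysql

    # Connect from inside a DNAnexus job using the parameters listed above.
    conn = pymysql.connect(
        host="<endpoint from dbcluster-xxxx/describe>",
        port=3306,  # 5432 for PostgreSQL engines (use a PostgreSQL client instead)
        user="root",
        password="<adminPassword set at dbcluster/new>",
        # Enable TLS according to your client's SSL options (ssl-mode 'required').
    )
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())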

    DBCluster Instance Types

    The table below provides all the valid configurations of dxInstanceClass, database engine, and versions.

    DxInstanceClass | Engine + Version Supported | Memory (GB) | # Cores
    db_std1_x2 (*) | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 4 | 2
    db_mem1_x2 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 16 | 2
    db_mem1_x4 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 32 | 4
    db_mem1_x8 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 64 | 8
    db_mem1_x16 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 128 | 16
    db_mem1_x32 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 244 | 32
    db_mem1_x48 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 384 | 48
    db_mem1_x64 | aurora-mysql: 8.0.mysql_aurora.3.04.1; aurora-postgresql: 12.9, 13.9, 14.6 | 488 | 64
    db_mem1_x96 | aurora-postgresql: 12.9, 13.9, 14.6 | 768 | 96

    * - db_std1 instances may incur CPU Burst charges similar to AWS T3 DB instances described in AWS Instance Types. Regular hourly charges for this instance type are based on 1 core; CPU Burst charges are based on 2 cores.

    Restriction on Transfers of Projects Containing DBClusters

    If a project contains a DBCluster, its ownership cannot be changed. A PermissionDenied error occurs when attempting to change the billTo of such a project.

    Archiving Files

    Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.

    A license is required to use the DNAnexus Archive Service. Contact DNAnexus Sales for more information.

    Archiving in DNAnexus is file-based. You can archive individual files, folders with files, or entire projects' files and save on storage costs. You can also unarchive one or more files, folders, or projects when you need to make the data available for further analyses.

    The DNAnexus Archive Service is available via the API in Amazon AWS and Microsoft Azure regions.
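    As a sketch of API access from Python, the example below assumes that the auto-generated dxpy.api bindings expose the project's archive and unarchive routes as project_archive and project_unarchive; the project and file IDs are placeholders, and you should confirm the binding names against the API documentation for your dxpy version.

    import dxpy

    # Archive two files in a project (assumed binding for the /project-xxxx/archive route)
    dxpy.api.project_archive("project-xxxx", {"files": ["file-aaaa", "file-bbbb"]})

    # Later, restore the same files for analysis (assumed binding for /project-xxxx/unarchive)
    dxpy.api.project_unarchive("project-xxxx", {"files": ["file-aaaa", "file-bbbb"]})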







    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    spark.sql("show databases").show(truncate=False)
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    +------------------------------------+-----------+-----------+
    |namespace                           |tableName  |isTemporary|
    +------------------------------------+-----------+-----------+
    |database_xxxx__brca_pheno           |cna        |false      |
    |database_xxxx__brca_pheno           |methylation|false      |
    |database_xxxx__brca_pheno           |mrna       |false      |
    |database_xxxx__brca_pheno           |mutations  |false      |
    |database_xxxx__brca_pheno           |patient    |false      |
    |database_xxxx__brca_pheno           |sample     |false      |
    +------------------------------------+-----------+-----------+
    show databases like "<project_id_pattern>:<database_name_pattern>";
    show databases like "project-*:<database_name>";
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    # Create a database
    my_database = "my_database"
    spark.sql("create database " + my_database + " location 'dnax://'")
    spark.sql("create table " + my_database + ".foo (k string, v string) using parquet")
    spark.sql("insert into table " + my_database + ".foo values ('1', '2')")
    sql("select * from " + my_database + ".foo")
    import hail as hl
    hl.init(sc=sc)
    # Download example data from 1k genomes project and inspect the matrix table
    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
    mt = hl.read_matrix_table('data/1kg.mt')
    mt.rows().select().show(5)
    # Annotate hail matrix table with VEP and LoF using configuration specified in the
    # vep-GRCh38.json file in the project you're working in.
    #
    # Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
    # as well as VEP and LoF resources that are pre-installed on every Spark node when
    # HAIL-VEP feature is selected.
    annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
    % cat /mnt/project/vep-GRCh38.json
    {"command": [
         "docker", "run", "-i", "-v", "/cluster/vep:/root/.vep", "dnanexus/dxjupyterlab-vep",
         "./vep", "--format", "vcf", "__OUTPUT_FORMAT_FLAG__", "--everything", "--allele_number",
         "--no_stats", "--cache", "--offline", "--minimal", "--assembly", "GRCh38", "-o", "STDOUT",
         "--check_existing", "--dir_cache", "/root/.vep/",
         "--fasta", "/root/.vep/homo_sapiens/109_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
        "--plugin", "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"],
      "env": {
          "PERL5LIB": "/root/.vep/Plugins"
      },
      "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele: String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:      String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:  String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:            Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:          Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:   String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:   String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:                        Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],    distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:   String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,             sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
     }
    source /home/dnanexus/environment
    source /cluster/dx-cluster.environment
    #read csv from public bucket
    df = spark.read.options(delimiter='\t', header='True', inferSchema='True').csv("s3://1000genomes/20131219.populations.tsv")
    df.select(df.columns[:4]).show(10, False)
    #access private data in S3 by first unsetting the default credentials provider
    sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', '')
    
    # replace "redacted" with your keys
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'redacted')
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'redacted')
    df=spark.read.csv("s3a://your_private_bucket/your_path_to_csv")
    df.select(df.columns[:5]).show(10, False)
    dx run spark-sql-runner \
       -i sqlfile=file-FQ4by2Q0Yy3pGp21F7vp8XGK \
       -i paramfile=file-FK7Qpj00GQ8Q7ybZ0pqYJj6G \
       -i export=true
    SELECT * FROM ${srcdb}.${patient_table};
    DROP DATABASE IF EXISTS ${dstdb} CASCADE;
    CREATE DATABASE IF NOT EXISTS ${dstdb} LOCATION 'dnax://';
    CREATE VIEW ${dstdb}.patient_view AS SELECT * FROM ${srcdb}.patient;
    SELECT * FROM ${dstdb}.patient_view;
    SHOW DATABASES;
    SELECT * FROM dbname.tablename1;
    SELECT * FROM
    dbname.tablename2;
    DESCRIBE DATABASE EXTENDED dbname;
    -- SHOW DATABASES;
    -- SELECT * FROM dbname.tablename1;
    SHOW /* this is valid comment */ TABLES;
    {
        "srcdb": "sskrdemo1",
        "dstdb": "sskrtest201",
        "patient": "patient_new",
        "f2c":"patient_f2c",
        "derived":"patient_derived",
        "composed":"patient_composed",
        "complex":"patient_complex",
        "patient_view": "patient_newview",
        "brca": "brca_new",
        "patient_table":"patient",
        "cna": "cna_new"
    }
    set srcdb=sskrdemo1;
    set patient_table=patient;
    select * from ${srcdb}.${patient_table};
    {
       "num_files" : 2,
       "fileprefix":"demo",
       "header": true
    }
    {
      "spark-defaults.conf": [
        {
          "name": "spark.app.name",
          "value": "SparkAppName"
        },
        {
          "name": "spark.test.conf",
          "value": true
        }
      ]
    }
    $ dx tree export
    export
    ├── job-FFp7K2j0xppVXZ791fFxp2Bg-export.tar
    ├── job-FFp7K2j0xppVXZ791fFxp2Bg-debug.sql
    ├── demo-0
    │   ├── demo-0-out.csv
    │   │   ├── _SUCCESS
    │   │   ├── part-00000-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
    │   │   └── part-00001-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
    │   └── demo-0.sql
    ├── demo-1
    │   ├── demo-1-out.csv
    │   │   ├── _SUCCESS
    │   │   └── part-00000-b21522da-0e5f-42ba-8197-e475841ba9c3-c000.csv
    │   └── demo-1.sql
    ├── demo-2
    │   ├── demo-2-out.csv
    │   │   ├── _SUCCESS
    │   │   ├── part-00000-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
    │   │   └── part-00001-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
    │   └── demo-2.sql
    ├── demo-3
    │   ├── demo-3-out.csv
    │   │   ├── _SUCCESS
    │   │   └── part-00000-5a48ba0f-d761-4aa5-bdfa-b184ca7948b5-c000.csv
    │   └── demo-3.sql
    -- [SQL Runner Report] --;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
    -- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
    -- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
    -- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
    -- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] SHOW DATABASES;
    -- [SUCCESS][OutputFile: demo-1-out.csv, TimeTaken: 3.85295510292 secs] create database sskrdemo2 location 'dnax://';
    -- [SUCCESS][OutputFile: demo-2-out.csv, TimeTaken: 4.8106200695 secs] use sskrdemo2;
    -- [SUCCESS][OutputFile: demo-3-out.csv , TimeTaken: 1.00737595558 secs] create table patient (first_name string, last_name string, age int, glucose int, temperature int, dob string, temp_metric string) stored as parquet;
    -- [SQL Runner Report] --;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
    -- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
    -- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
    -- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
    -- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
    -- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
    -- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] select * from ${srcdb}.${patient_table};
    -- [FAIL] SQL ERROR while below command [ Reason: u"\nextraneous input '`' expecting <EOF>(line 1, pos 45)\n\n== SQL ==\ndrop database if exists sskrtest2011 cascade `\n---------------------------------------------^^^\n"];
    drop database if exists ${dstdb} cascade `;
    create database if not exists ${dstdb} location 'dnax://';
    create view ${dstdb}.patient_view as select * from ${srcdb}.patient;
    select * from ${dstdb}.patient_view;
    drop database if exists ${dstdb} cascade `;
    CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'
    db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']
    db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']
    Overview

    File Archival States

    To understand the archival life cycle as well as which operations can be performed on files and how billing works, it's helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:

    | Archival state | Details |
    | --- | --- |
    | live | The file is in standard storage, such as AWS S3 or Azure Blob. |
    | archival | Archival has been requested on the current file, but other copies of the same file are in the live state in projects with the same billTo entity. The file is still in standard storage. |
    | archived | The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE. |
    | unarchiving | A restore has been requested on the current file. The file is in transition from archival storage to standard storage. |

    A file's state determines which operations can be performed on it. See the table below for the operations allowed in each archival state.

    | Archival state | Download | Clone | Compute | Archive | Unarchive |
    | --- | --- | --- | --- | --- | --- |
    | live | Yes | Yes | Yes | Yes | No |
    | archival | No | Yes* | No | No | Yes (Cancel archive) |
    | archived | No | Yes | No | No | Yes |
    | unarchiving | No | No | No | No | No |

    * Clone operation would fail if the object is actively transitioning from archival to archived.

    File Archival Life Cycle

    When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. Only when all copies of the file, in all projects with the same billTo organization, are in the archival state does the Platform automatically transition the file to the archived state.

    Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from the archived state to the unarchiving state. During the unarchiving state, the file is being restored by the third-party storage provider, such as AWS or Azure. The unarchiving process may take a while, depending on the retrieval option selected for the specific platform. When unarchiving completes and the file becomes available in standard storage, the file transitions back to the live state.

    Archive Service Operations

    The File-based Archive Service allows users who have the CONTRIBUTE or ADMINISTER permissions to a project to archive or unarchive files that reside in the project.

    Using the API, users can archive or unarchive files, folders, or entire projects, although the archiving process itself happens at the file level. The API accepts a list of up to 1000 files per archiving or unarchiving request.

    When archiving or unarchiving folders or projects, the API by default archives or unarchives all the files at the root level and those in the subfolders recursively. If you archive a folder or a project that includes files in different states, the Service only archives files that are in the live state and skips files that are in other states. Likewise, if you unarchive a folder or a project that includes files in different states, the Service only unarchives files that are in the archived state, transitions archival files back to the live state, and skips files in other states.
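    As a minimal sketch of what these requests can look like from the CLI, assuming a project project-xxxx and file IDs of your own (the archive call mirrors the allCopies example later in this section; the unarchive call is its counterpart and is assumed to accept the same files list):

    dx api project-xxxx archive '{"files": ["file-aaaa", "file-bbbb"]}'
    dx api project-xxxx unarchive '{"files": ["file-aaaa", "file-bbbb"]}'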

    Archival Billing

    The archival process incurs specific charges, all billed to the billTo organization of the project:

    • Standard storage charge: The monthly storage charge for files that are located in the standard storage on the platform. The files in the live and archival state incur this charge. The archival state indicates that the file is waiting to be archived or that other copies of the same file in other projects are still in the live state, so the file is in standard storage, such as AWS S3. The standard storage charge continues to get billed until all copies of the file are requested to be archived and eventually the file is moved to archival storage and transitioned into the archived state.

    • Archival storage charge: The monthly storage charge for files that are located in archival storage on the platform. Files in the archived state incur a monthly archival storage charge.

    • Retrieval fee: The retrieval fee is a one-time charge at the time of unarchiving based on the volume of data being unarchived.

    • Early retrieval fee: If you retrieve or delete data from archival storage before the required retention period is met, an early retrieval fee applies. The retention period is 90 days for AWS regions and 180 days for Microsoft Azure regions. You are charged a pro-rated fee equivalent to the archival storage charges for any remaining days within that period.
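    For example, under these rules, unarchiving or deleting a file from archival storage in an AWS region 30 days after it was archived would incur a pro-rated early retrieval fee equivalent to the archival storage charge for the remaining 60 days of the 90-day retention period.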

    Best Practices

    When using the Archive Service, we recommend the following best practices.

    • The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.

    • If a file is shared in multiple projects, archiving one copy in one of the projects only transitions the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you need to ensure that all copies of the file in all projects with the same billTo org are archived. When all copies of the file reach the archival state, the Service moves the file from the archival to the archived state. Consider using the allCopies option of the API to archive all copies of the file. You must be an org ADMIN of the billTo org of the current project to use the allCopies option.

      Refer to the following example: file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, all of which share the same billTo org (org-xxxx). You have ADMINISTER access to project-xxxx and CONTRIBUTE access to project-yyyy, but no role in project-zzzz. As the org ADMIN of the project's billTo org, you can archive all copies of the file in all projects with the same billTo org using the API:

      1. List all the copies of the file in org-xxxx.

      2. Force archiving of all the copies of file-xxxx.

      3. All copies of file-xxxx transition into the archived state.


    Searching Data Objects

    You can use the dx ls command to list the objects in your current project. You can determine the current project and folder you are in by using the command dx pwd. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.

    Searching Objects with Glob Patterns

    Searching Objects in Your Current Folder

    By listing objects in your current directory with the wildcard characters * and ?, you can search for objects whose filenames match a glob pattern. The examples below use the folder "C. Elegans - Ce10/" in the public project "Reference Genome Files" (platform login required to access this project).

    Printing the Current Working Directory

    Listing Folders and/or Objects in a Folder

    Listing Objects Named Using a Pattern

    Searching Across Objects in the Current Project

    To search the entire project with a filename pattern, use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data in the current project. Below, dx find data is used in the public project "Reference Genome Files" (platform login required to access this project), with the --name option specifying the filename pattern of the objects you're searching for.

    Escaping Special Characters

    When filenames contain special characters, escape these characters with a backslash (\) during searches. Characters requiring escaping include wildcards (* and ?) as well as colons (:) and slashes (/), which have special meaning in DNAnexus paths.

    Shell behavior affects escaping rules. In many shells, you need to either double-escape (\\) or use single quotes to prevent the shell from interpreting the backslash.

    The following examples show proper escaping techniques:

    Searching Objects with Other Criteria

    dx find data also allows you to search data using metadata fields, such as when the data was created, the data tags, or the project the data exists in.

    Searching Objects Created Within a Certain Period of Time

    You can use the flags --created-after and --created-before to search for data objects created within a specific time period.

    Searching Objects by Their Metadata

    You can search for objects based on their metadata. An object's metadata can be set using the commands dx tag or dx set_properties, which respectively tag the object or set key-value property pairs describing it. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags.

    To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.

    Searching Objects in Another Project

    You can search for an object that lives in a different project than your current working project by specifying a project and folder path with the flag --path. Below, the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project "Exome Analysis Demo" (platform login required to access this project) is specified as an example.

    Searching Objects Across Projects with VIEW and Above Permissions

    To search for data objects in all projects where you have VIEW and above permissions, use the --all-projects flag. Public projects are not shown in this search.

    Scoping Within Projects

    To describe small numbers of files (typically fewer than 100), scope findDataObjects to a single project.

    The below is an example of code used to scope a project:

    See the API method system/findDataObjects for more information about usage.

    Symlinks

    Use Symlinks to access, work with, and modify files that are stored on an external cloud service.

    A license is required to use Symlinks. Contact DNAnexus Sales for more information.

    Overview

    The DNAnexus Symlinks feature enables users to link external data files on AWS S3 and Azure blob storage as objects on the platform and access such objects for any usage as though they are native DNAnexus file objects.

    No storage costs are incurred when using symlinked files on the Platform. When used by jobs, symlinked files are downloaded to the Platform at runtime.

    DNAnexus validates the integrity of symlinked files on the DNAnexus Platform using recorded md5 checksums. But DNAnexus cannot control or monitor changes made to these files in a customer's cloud storage. It is each customer's responsibility to safeguard files from any modifications, removals, and security breaches, while in the customer's cloud storage.

    Quickstart

    Symlinked files stored in AWS S3 or Azure blob storage are made accessible on DNAnexus via a Symlink Drive. The drive contains the necessary cloud storage credentials, and can be created by following Step 1 below.

    Step 1. Create a Symlink Drive

    To set up Symlink Drives, use the CLI to provide the following information:

    • A name for the Symlink Drive

    • The cloud service (AWS or Azure) where your files are stored

    • The access credentials required by the service

    AWS

    Azure

    After you've entered the appropriate command, a new drive object is created. You can see a confirmation message that includes the id of the new Symlink Drive in the format drive-xxxx.

    When your cloud service access credentials change, you must update the definition of each Symlink Drive that links to the cloud service. See Updating Cloud Service Access Credentials below.

    Step 2. Linking a Project with a Symlink Drive

    By associating a DNAnexus Platform project with a Symlink Drive, you can both:

    • Have all new project files automatically uploaded to the AWS S3 bucket or Azure blob, to which the Drive links

    • Enable project members to work with those files

    "New project files" includes the following:

    • Newly created files

    • File outputs from jobs

    • Files uploaded to the project

    Non-symlinked files cloned into a symlinked project are not uploaded to the linked AWS S3 bucket or Azure blob.

    Linking a New Project with a Symlink Drive via the UI

    When creating a new project via the UI, you can link it with an existing Symlink Drive by toggling the Enable Auto-Symlink in This Project setting to "On":

    Next:

    • In the Symlink Drive field, select the drive with which the project should be linked

    • In the Container field, enter the name of the AWS S3 bucket or Azure blob where newly created files should be stored

    • Optionally, in the Prefix field, enter the name of a folder within the AWS S3 bucket or Azure blob where these files should be stored

    Linking a New Project with a Symlink Drive via the CLI

    When creating a new project via the CLI, you can link it to a Symlink Drive by using the optional argument --default-symlink with dx new project. See the manual for dx new project for details on inputs and input format.
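    As a sketch only, assuming the --default-symlink value is a JSON object whose fields mirror the Drive, Container, and Prefix settings described above for the UI (check dx new project --help for the exact input format):

    dx new project "My Symlinked Project" --default-symlink '{"drive": "drive-xxxx", "container": "my-s3-bucket", "prefix": "/symlinked-files"}'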

    Step 3. Enable CORS

    To ensure that files can be saved to your AWS S3 bucket or Azure blob, you must enable CORS for that remote storage container.

    Enabling CORS for an AWS S3 bucket

    Refer to Amazon documentation for guidance on enabling CORS for an S3 bucket.

    Use the following JSON object when configuring CORS for the bucket:

    Enabling CORS for an Azure Blob

    Refer to Microsoft documentation for guidance on enabling CORS for Azure Storage.

    Working with Symlinked Files

    Working with Symlinked files is similar to working with files that are stored on the Platform. These files can, for example, be used as inputs to apps, applets, or workflows.

    Renaming Symlinks

    If you rename a symlink on DNAnexus, this does not change the name of the file in S3 or Azure blob storage. In this example, the symlink has been renamed from the original name file.txt, to Example File. The remote filename, as shown in the Remote Path field in the right-side info pane, remains file.txt:

    Deleting Symlinks

    If you delete a symlink on the Platform, the file to which it points is not deleted.

    Working with Symlink Drives

    Updating Cloud Service Access Credentials

    If your cloud access credentials change, you must update the definition of all Symlink Drives to keep using files to which those Drives provide access.

    AWS

    To update a drive definition with new AWS access credentials, use the following command:

    Azure

    To update a drive definition with new Azure access credentials, use the following command:

    Learn More

    For more information, see the API endpoints for working with Symlink Drives.

    FAQ

    What happens if I move a symlinked file from one folder to another, within a DNAnexus project? Does the file also mirror that move within the AWS S3 bucket or Azure blob?

    No, the symlinked file only moves within the project. The change is not mirrored in the linked S3 or Azure blob container.

    What happens if I delete a symlinked file directly on S3 or Azure blob storage, and a job tries to access the symlinked object on DNAnexus?

    The job fails after it is unable to retrieve the source file.

    Can I copy a symlinked file from one project to another and still keep access?

    Yes, you can copy a symlinked file from one project to another. This includes copying symlinked files from a symlink-enabled project to a project without this feature enabled.

    Can I create a symlink in another region relative to my project's region?

    Yes - egress charges are incurred.

    What if I upload a file to my auto-symlink-enabled project with a filename that already matches the name of a file in the S3 bucket or Azure blob linked to the project?

    In this scenario, the uploaded file overwrites, or "clobbers," the file that shares its name, and only the newly uploaded file is stored in the AWS S3 bucket or Azure blob.

    This is true even if, within your project, you first renamed the symlinked file and uploaded a new file with the prior name. For example, if you upload a file named file.txt to your DNAnexus project, the file is automatically uploaded to your S3 or Azure blob to the specified directory. If you then rename the file on DNAnexus from file.txt to file.old.txt, and upload a new file to the project called file.txt, the original file.txt that was uploaded to S3 or Azure blob is overwritten. However, you are still left with file.txt and file.old.txt symlinks in your DNAnexus project. Trying to access the original file.old.txt symlink results in a checksum error.

    What happens if I try to transfer billing responsibility of an auto-symlink-enabled project to someone else?

    If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. Attempting to do so via API call returns a PermissionDenied error.

    Defining and Managing Cohorts

    Create, filter, and manage patient cohorts using clinical, genomic, and other data fields in the Cohort Browser.

    Create comprehensive patient cohorts by filtering your datasets. You can combine, compare, and export your cohorts for further analysis.

    If you'd like to visualize data in your cohorts, see Creating Charts and Dashboards.

    Managing Cohorts

    When you start exploring a dataset, Cohort Browser automatically creates an empty cohort that includes all patients/samples. You can then define your cohort criteria by adding filters, and repeat this process multiple times to create additional cohorts.

    The Cohorts panel gives you an overview of your active cohorts on the dashboard (up to 2) and the recently used cohorts (up to 8) in your current session. These can be temporary unsaved cohorts as well as saved cohorts.


    dx api file-xxxx listProjects '{"archivalInfoForOrg":"org-xxxx"}'
    {
    "project-xxxx": "ADMINISTER",
    "project-yyyy": "CONTRIBUTE",
    "liveProjects": [
     "project-xxxx",
     "project-yyyy",
     "project-zzzz"
    ]
    }
    dx api project-xxxx archive '{"files": ["file-xxxx"], "allCopies": true}'
    {
      "id": "project-xxxx",
      "count": 1
    }
    $ dx select "Reference Genome Files"
    $ dx cd "C. Elegans - Ce10/"
    $ dx pwd # Print current working directory
    Reference Genome Files:/C. Elegans - Ce10
    $ dx ls
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ce10.cw2-index.tar.gz
    ce10.fasta.fai
    ce10.fasta.gz
    ce10.hisat2-index.tar.gz
    ce10.star-index.tar.gz
    ce10.tmap-index.tar.gz
    $ dx ls '*.fa*' # List objects with filenames of the pattern "*.fa*"
    ce10.fasta.fai
    ce10.fasta.gz
    $ dx ls ce10.???-index.tar.gz # List objects with filenames of the pattern "ce10.???-index.tar.gz"
    ce10.cw2-index.tar.gz
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    $ dx find data --name "*.fa*.gz"
    closed  2014-10-09 09:50:51 776.72 MB /M. musculus - mm10/mm10.fasta.gz (file-BQbYQPj0Z05ZzPpb1xf000Xy)
    closed  2014-10-09 09:50:30 767.47 MB /M. musculus - mm9/mm9.fasta.gz (file-BQbYK6801fFJ9Fj30kf003PB)
    closed  2014-10-09 09:49:27 49.04 MB /D. melanogaster - Dm3/dm3.fasta.gz (file-BQbYVf80yf3J9Fj30kf00PPk)
    closed  2014-10-09 09:48:55 29.21 MB /C. Elegans - Ce10/ce10.fasta.gz (file-BQbY9Bj015pB7JJVX0vQ7vj5)
    closed  2014-10-08 13:52:26 818.96 MB /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.fa.gz (file-B6ZY7VG2J35Vfvpkj8y0KZ01)
    closed  2014-10-08 13:51:31 876.79 MB /H. Sapiens - hg19 (UCSC)/ucsc_hg19.fa.gz (file-B6qq93v2J35fB53gZ5G0007K)
    closed  2014-10-08 13:50:53 827.95 MB /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.fa.gz (file-B6ZYPQv2J35xX095VZyQBq2j)
    closed  2014-10-08 13:50:17 818.88 MB /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.fa.gz (file-BFBv6J80634gkvZ6z100VGpp)
    closed  2014-10-08 13:49:53 810.45 MB /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.fa.gz (file-B6ZXxfG2J35Vfvpkj8y0KXF5)
    # Searching for a file with colons in the name
    dx find data --name "sample\:123.txt"
    # Or alternatively with single quotes
    dx find data --name 'sample\:123.txt'
    
    # Searching for a file with a literal asterisk
    dx find data --name "experiment\*.fastq"
    $ dx find data --created-after 2017-02-22 --created-before 2017-02-25
    closed  2017-02-27 19:14:51 3.90 GB  /H. Sapiens - hg19 (UCSC)/ucsc_hg19.hisat2-index.tar.gz (file-F2pJvF80Vzx54f69K4J8K5xy)
    closed  2017-02-27 19:14:21 3.55 GB  /M. musculus - mm10/mm10.hisat2-index.tar.gz (file-F2pJqk00Vq161bzq44Vjvpf5)
    closed  2017-02-27 19:13:57 3.51 GB  /M. musculus - mm9/mm9.hisat2-index.tar.gz (file-F2pJpKj0G0JxZxBZ4KJq0Q6B)
    closed  2017-02-27 19:13:41 3.85 GB  /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.hisat2-index.tar.gz (file-F2pJkp00BjBk99xz4Jk74V0y)
    closed  2017-02-27 19:13:28 3.85 GB  /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.hisat2-index.tar.gz (file-F2pJpy007bGBzj7X446PzxJJ)
    closed  2017-02-27 19:13:02 3.90 GB  /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.hisat2-index.tar.gz (file-F2pJpb000vFpzj7X446PzxF0)
    closed  2017-02-27 19:12:31 3.91 GB  /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.hisat2-index.tar.gz (file-F2pK5y00F8Bp9BYk4KX7Qb4P)
    closed  2017-02-27 19:12:18 224.54 MB /D. melanogaster - Dm3/dm3.hisat2-index.tar.gz (file-F2pJP7j0QkbQ3ZqG269589pj)
    closed  2017-02-27 19:11:56 139.76 MB /C. Elegans - Ce10/ce10.hisat2-index.tar.gz (file-F2pJK300KKz8bx1126Ky5b3P)
    $ dx find data --tag sampleABC --tag batch123
    closed  2017-01-01 09:00:00 6.08 GB  /Input/SRR504516_1.fastq.gz (file-xxxx)
    closed  2017-01-01 09:00:00 5.82 GB  /Input/SRR504516_2.fastq.gz (file-wwww)
    $ dx find data --property sequencing_providor=CRO_XYZ
    closed  2017-01-01 09:00:00 8.06 GB  /Input/SRR504555_1.fastq.gz (file-qqqq)
    closed  2017-01-01 09:00:00 8.52 GB  /Input/SRR504555_2.fastq.gz (file-rrrr)
    $ dx find data --name "*.fastq.gz" \
        --path project-BQfgzV80bZ46kf6pBGy00J38:/Input
      closed  2014-10-03 12:04:16 6.08 GB  /Input/SRR504516_1.fastq.gz (file-B40jg7v8KfPy38kjz1vQ001y)
      closed  2014-10-03 12:04:16 5.82 GB  /Input/SRR504516_2.fastq.gz (file-B40jgYG8KfPy38kjz1vQ0020)
    $ dx find data --name "SRR*_1.fastq.gz" --all-projects
    closed  2017-01-01 09:00:00 6.08 GB  /Exome Analysis Demo/Input/SRR504516_1.fastq.gz (project-xxxx:file-xxxx)
    closed  2017-07-01 10:00:00 343.58 MB /input/SRR064287_1.fastq.gz (project-yyyy:file-yyyy)
    closed  2017-01-01 09:00:00 6.08 GB  /data/exome_analysis_demo/SRR504516_1.fastq.gz (project-zzzz:file-xxxx)
    dx api system findDataObjects '{"scope": {"project": "project-xxxx"}, "describe":{"fields":{"state":true}}}'
    dx api drive new '{
        "name" : "<drive_name>",
        "cloud" : "aws",
        "credentials" : {
            "accessKeyId" : "<my_aws_access_key>",
            "secretAccessKey" : "<my_aws_secret_access_key>"
        }
    }'
    dx api drive new '{
        "name" : "<drive_name>",
        "cloud" : "azure",
        "credentials" : {
            "account" : "<my_azure_storage_account_name>",
            "key" : "<my_azure_storage_access_key>"
        }
    }'
    [
        {
            "AllowedHeaders": [
                "Content-Length",
                "Origin",
                "Content-MD5",
                "accept",
                "content-type"
            ],
            "AllowedMethods": [
                "PUT",
                "POST"
            ],
            "AllowedOrigins": [
                "https://*"
            ],
            "ExposeHeaders": [
                "Retry-After"
            ],
            "MaxAgeSeconds": 3600
        }
    ]
    dx api <driveID> update '{
        "credentials" : {
            "accessKeyId" : "<my_new_aws_access_key>",
            "secretAccessKey" : "<my_new_aws_secret_access_key>"
        }
    }'
    dx api <driveID> update '{
        "credentials" : {
            "account" : "<my_azure_storage_account_name>",
            "key" : "<my_azure_storage_access_key>"
        }
    }'
    You can toggle the Cohorts panel by clicking Manage Cohorts.

    To change the active cohorts on the dashboard, you need to swap them between the Dashboard and Recent sections:

    1. In Cohorts > Dashboard, click In Dashboard to remove a cohort from the dashboard.

    2. In Cohorts > Recent, click Add to Dashboard next to the cohort you want to add to the dashboard.

    This way you can quickly explore, compare, and iterate across multiple cohorts within a single session.

    Defining Cohort Criteria

    Adding Clinical and Phenotypic Filters

    To apply a filter to your cohort:

    1. For the cohort you want to edit, click Add Filter.

    2. In Add Filter to Cohort > Clinical, select a data field to filter by.

    3. Click Add Cohort Filter.

    4. In Edit Filter, select operators and enter the values to filter by.

    5. Click Apply Filter.

    After you apply the filter, the dashboard automatically refreshes and displays the updated cohort size below the filtered cohort's name.

    Cohort size updates after filters are applied

    Adding Assay Filters

    With multi-assay datasets, you can create cohorts by applying filters from multiple assay types and instances.

    When adding filters, you can find assay types under the Assays tab. This allows you to create cohorts that combine different types of data. For example, you can filter patients based on both clinical characteristics and germline variants, merge somatic mutation criteria with gene expression levels, or build cohorts that span multiple assays of the same type.

    To learn more about filtering by specific assay types, see:

    • Analyzing Germline Variants

    • Analyzing Somatic Variants

    • Analyzing Gene Expression Data

    When working with an omics dataset that includes multiple assays, such as a germline dataset with both WES and WGS assays, you can:

    • Select specific assays to choose which assay to filter on.

    • Apply different filters per assay.

    • Create separate cohorts for different assays of the same type and compare results.

    Filter Limits by Assay Type

    The maximum number of filters allowed varies by assay type and is shared across all instances of that type:

    • Germline variant assays: 1 filter maximum

    • Somatic variant assays: Up to 10 filter criteria

    • Gene expression assays: Up to 10 filter criteria

    Creating Filter Groups

    If you add multiple filters from the same category, such as Patient or Sample, they automatically form a filter group.

    By default, filters within a filter group are joined by the logical operator 'AND', meaning that all filters in the group must be satisfied for a record to be included in the cohort. You can change the logical operator used within the group to 'OR' by clicking on the operator.

    Adding a filter and toggling 'AND' and 'OR' functionality

    Joining Multiple Filters

    Join filters allow you to create cohorts by combining criteria across multiple related data entities within your dataset. This is useful when working with complex datasets that contain interconnected information, such as patient records linked to visits, medications, lab tests, or other clinical data.

    Understanding Data Entities

    An entity is a grouping of data around a unique item, event, or concept.

    In the Cohort Browser, an entity can refer either to a data model object, such as patient or visit, or to a specific input parameter in the Table Exporter app.

    Common examples of data entities include:

    • Patient: Demographics, medical history, baseline characteristics

    • Visit: Hospital admissions, appointments, encounters

    • Medication: Prescriptions, dosages, administration records

    • Lab Test: Results, procedures, sample information

    Creating Join Filters

    To create join filters that span multiple data entities:

    1. Start a new join filter: On the cohort panel, click Add Filter or, on a chart tile, click Cohort Filters > Add Cohort Filter.

    2. Select secondary entity: Choose data fields from a secondary entity (different from your primary entity) to create the join relationship.

    3. Add criteria to existing joins: To expand an existing join filter, click Add additional criteria on the row of the chosen filter.

    Working with Logical Operators

    Join filters support both AND as well as OR logical operators to control how criteria are combined:

    • AND logic: All specified criteria must be met

    • OR logic: Any of the specified criteria can be met

    Key rules for logical operators:

    • Click on the operator buttons to switch between the AND logic and OR logic.

    • For a specific level of join filtering, joins are either all AND or all OR.

    • When using OR for join filters, the existence condition applies first: "where exists, join 1 OR join 2".

    Example of 'OR' and 'AND' filtering.

    Building Complex Join Structures

    As your filtering needs become more sophisticated, you can create multi-layered join structures:

    • Add criteria to branches: Further define secondary entities by adding additional criteria to existing join branches

    • Create nested joins: Add more layers of join filters that derive from the current branch

    • Automatic field filtering: The field selector automatically hides fields that are ineligible based on the current join structure

    Practical Examples

    The following examples show how join filters work in practice:

    1. First Example Cohort - Separate Conditions: This cohort identifies all patients with a "high" or "medium" risk level who meet both of these conditions:

      • Have a first hospital visit (visit instance = 1)

      • Have had a "nasal swab" lab test at any point (not necessarily during the first visit)

    2. Second Example Cohort - Connected Conditions: This cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed specifically during their first visit, creating a more restrictive temporal relationship between the visit and lab test.

    Join filters used with two example cohorts

    Saving Cohorts

    You can save your cohort selection to a project as a cohort record by clicking Save Cohort in the top-right corner of the cohort panel.

    Save cohort action

    Cohorts are saved with their applied filters, as well as the latest visualizations and dashboard layout. Like other dataset objects, you can find your saved cohorts under the Manage tab in your project.

    To open a cohort, double-click it or click Explore Data.

    CohortBrowser object in a project

    Need to use your cohorts with a different dataset? If you want to apply your cohort definitions to a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your saved cohorts to a new target dataset.

    Exporting Data from Cohorts

    For each cohort, you can export a list of main entity IDs in your current cohort selection as a CSV file by clicking Export sample IDs.

    Exporting a list of sample IDs

    Data Preview

    On the Data Preview tab, you can export tabular information as record IDs or a CSV file. Select multiple table rows to see export options in the top-right corner. Exports include only the fields displayed in the Data Preview tab.

    The Data Preview supports up to 30 columns per tab. Tables with 30-200 columns show column names only. In such cases, you can save cohorts but data is not queried. Tables with over 200 columns are not supported.

    You can view up to 30,000 records in the Data Preview. If your cohort exceeds this size, the table may not display all data. For larger exports, use the Table Exporter app.

    If your view contains more than one table, such as a participants table and a hospital records table, exporting to CSV or TSV generates a separate file for each table.

    Download Restrictions

    The Cohort Browser follows your project's download policy restrictions. Downloads are blocked when:

    • Database restrictions apply: If the database storing your dataset has restricted download permissions, you cannot download data from any Cohort Browser view of that dataset, regardless of which project contains the cohort or dashboard.

    • All dataset copies are restricted: When every copy of your dataset exists in projects with restricted download policies, downloads are blocked. However, if at least one copy exists in a project that allows downloads, then downloads are permitted.

    • Cohort or dashboard restrictions apply: If the specific cohort or dashboard you're viewing has restricted download permissions, downloads are blocked regardless of the underlying dataset permissions.

    Combining Cohorts

    You can create complex cohorts by combining existing cohorts from the same dataset.

    • Near the cohort name, click + > Combine Cohorts.

    • In the Cohorts panel, click Combine Cohorts.

    Depending on the combination logic, you can combine up to 5 cohorts.

    You can also create a combined cohort based on the cohorts already being compared.

    Combine cohorts from a comparison view

    The Cohort Browser supports the following combination logic:

    | Logic | Description | Number of Cohorts Supported |
    | --- | --- | --- |
    | Intersection | Select members that are present in ALL selected cohorts. Example: intersection of cohorts A, B, and C would be A ∩ B ∩ C. | Up to 5 cohorts |
    | Union | Select members that are present in ANY of the selected cohorts. Example: union of cohorts A, B, and C would be A ∪ B ∪ C. | Up to 5 cohorts |
    | Subtraction | Select members that are present only in the first selected cohort and not in the second. Example: subtraction of cohorts A and B would be A - B. | 2 cohorts |
    | Unique | Select members that appear in exactly one of the selected cohorts. Example: unique of cohorts A and B would be (A - B) ∪ (B - A). | 2 cohorts |

    Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.

    Inspect combination logic on a combined cohort

    Cohorts already combined cannot be combined a second time.

    Comparing Cohorts

    You can compare two cohorts from the same dataset by adding both cohorts into the Cohort Browser.

    To compare cohorts, click + next to the cohort name. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.

    When comparing cohorts:

    • All visualizations are converted to show data from both cohorts.

    • You can continue to edit both cohorts and visualize the results dynamically.

    You can compare a cohort with its complement in the dataset by selecting Compare / Combine Cohorts > Not In …. Similar to combining cohorts, you first need to save your current cohort before creating its not-in counterpart.

    | Logic | Description |
    | --- | --- |
    | Not In | Select patients that are present in the dataset, but not in the current cohort. Example: In dataset U, the result of "Not In" A would be U - A. |

    Creating Comparison Using the "Not In" logic

    Cohorts created using Not In cannot be used for further creation of combined or not-in cohorts. "Not In" cohorts are linked to the cohort they are originally based on. Once a not-in cohort is created, further changes to the original cohort definition are not reflected.

    Creating Cohorts via CLI

    The dx command create_cohort generates a new Cohort object on the platform using an existing Dataset or Cohort object, and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object.

    When the input is a CohortBrowser typed record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined in a way such that the resulting record is an intersection of the IDs present in the original input and the IDs passed through CLI.

    For additional details, see the create_cohort command reference and example notebooks in the public GitHub repository, DNAnexus/OpenBio.
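    As a sketch of typical usage, assuming a source dataset record and a text file of primary IDs (the flag names shown here are assumptions; check dx create_cohort --help and the command reference for the exact options):

    dx create_cohort "My Cohort" --from project-xxxx:record-xxxx --cohort-ids-file ids.txt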


    An Apollo license is required to use Cohort Browser on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.


    Projects

    Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.

    About Projects

    Within the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.

    Projects have a series of features designed for collaboration, helping project members coordinate and organize their work, and ensuring appropriate control over both data and tools.

    See DNAnexus Essentials for details on how to create a project, share it with other users, and run an analysis.

    Managing Project Content

    A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.

    Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.

    The following are four common actions you can perform on objects from within the Manage screen.

    Downloading Files

    You can directly download file objects from the system.

    1. Select the file's row.

    2. Click More Actions (⋮).

    3. From the list of available actions, select Download.

    4. Follow the instructions in the modal window that opens.

    Getting More Information on Objects

    To learn more about an object:

    1. Select its row, then click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen.

    2. Select the row showing the name of the object about which you want to know more. An info panel opens on the right, displaying a range of information about the object. This includes its unique ID, as well as metadata about its owner, time of creation, size, tags, properties, and more.

    Deleting Objects

    Deletion is permanent and cannot be undone.

    To delete an object:

    1. Select its row.

    2. Click More Actions (⋮).

    3. From the list of available actions, select Delete.

    4. Follow the instructions in the modal window that opens.

    Copying Data to Another Project

    To copy a data object or objects to another project, you must have CONTRIBUTE or ADMINISTER access to that project.

    1. Select the object or objects you want to copy to a new project, by clicking the box to the left of the name of each object in the objects list.

    2. Click the Copy button in the upper right corner of the Manage screen. A modal window opens.

    3. Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.
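    The same copy can also be performed from the CLI with dx cp; a minimal sketch, assuming a destination project (project-yyyy here) to which you have CONTRIBUTE or ADMINISTER access:

    dx cp "My File.fastq.gz" project-yyyy:/Input/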

    Access and Sharing

    Adding Project Members

    You can collaborate on the platform by sharing projects with other users. On sharing a project with a user, or with a group of users in an org, they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.

    Removing Project Members

    To remove a user or org from a project to which you have ADMINISTER access:

    1. On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window opens, showing a list of project members.

    2. Find the row showing the user you want to remove from the project.

    3. Move your mouse over that row, then click the Remove from Members button at the right end of the row.
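    Project membership can also be managed from the CLI; a minimal sketch, assuming the dx invite and dx uninvite commands and a collaborator's username (user-xxxx here):

    dx invite user-xxxx project-xxxx VIEW
    dx uninvite user-xxxx project-xxxx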

    Project Access Levels

    | Access Level | Description |
    | --- | --- |
    | VIEW | View the project and the data objects it contains. |
    | UPLOAD | VIEW access, plus the ability to upload new data to the project. |
    | CONTRIBUTE | UPLOAD access, plus the ability to modify and delete project data (subject to the project's data access controls). |
    | ADMINISTER | CONTRIBUTE access, plus full administrative control over the project, including sharing the project, revoking member access, changing project settings, and transferring billing responsibility. |

    Project Access Levels: Two Examples

    Suppose you have a set of samples sequenced at your lab, and you have a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.

    Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project, then upload your sequenced data to the project. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You both are then able to use each other's data to perform downstream analyses.

    Restricting Access to Executables

    A project admin can configure a project to allow project members to run only specific executables as root executions. The list of allowed executables is set by entering the following command via the CLI:
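    A minimal sketch of the form this command can take, assuming the project-xxxx/update API call and its allowedExecutables field (substitute your own project and executable IDs):

    dx api project-xxxx update '{"allowedExecutables": ["app-xxxx", "applet-xxxx"]}'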

    This command overwrites any existing list of allowed executables.

    To discard the allowed executables list, that is, let project members run all available executables as root executions, enter the following command:
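    Again as a sketch under the same assumption, passing null clears the list:

    dx api project-xxxx update '{"allowedExecutables": null}'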

    Executables that are called by a permitted executable can run even if they are not included in the list.

    Project Data Access Controls

    Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false and can be viewed and modified via the CLI and the platform API (see the sketch after this list). The protected, restricted, downloadRestricted, externalUploadRestricted, and containsPHI settings can be viewed and modified in the project's Settings web screen as described below.

    • protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.

    • restricted: If set to true,
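    As a sketch of how these flags can be read and changed from the CLI, assuming the project-xxxx/describe and project-xxxx/update API calls accept the flag names listed above:

    dx api project-xxxx describe '{"fields": {"protected": true, "restricted": true, "downloadRestricted": true}}'
    dx api project-xxxx update '{"protected": true, "downloadRestricted": true}'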

    PHI Data Protection

    Only projects billed to org billing accounts can have PHI Data Protection enabled.

    A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.

    Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

    When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:

    • Data in this project cannot be cloned to other projects that do not have containsPHI set to true

    • Any jobs that run in non-PHI projects cannot access any data that can only be found in PHI projects

    • Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging in to the Platform and opening the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.

    Billing and Charges

    On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account to which invoices are sent, covering all billable activities carried out within the project.

    For information on configuring your billing account, see the Billing section of the Administrator documentation.

    You link a project to a billing account, that is, an organization to which the expenses are billed, when you create the project.

    Monthly Project Spending and Usage Limits

    Licenses are required for both the Monthly Project Compute and Egress Usage Limit and the Monthly Project Storage Spending Limit features. Contact DNAnexus Sales for more information.

    The Monthly Project Usage Limit for Compute and Egress and the Monthly Project Storage Spending Limit features can help project admins monitor project costs and keep them under control.

    In the project's Settings tab under the Usage Limits section, project admins can view the project's compute and egress usage limits.

    For details on how to set and retrieve project-specific compute and egress usage limits, and storage spending limits, see the API documentation.

    Transferring Project Billing Responsibility

    Transferring Billing Responsibility to Another User

    If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:

    1. On the project's Settings screen, scroll down to the Administration section.

    2. Click the Transfer Billing button. A modal window opens.

    3. Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.

    The user receives an email notification of your request. To finalize the transfer, they need to log onto the Platform and formally accept it.

    Transferring Billing Responsibility to an Org

    If you have billable activities access in the org to which you wish to transfer the project, you can change the billing account of the project to the org. To do this, navigate to the project settings page by clicking the gear icon in the project header. On the project settings page, select the billing account to which the project should be billed.

    If you do not have billable activities access in the org you wish to transfer the project to, you need to transfer the project to a user who does have this access. The recipient is then able to follow the instructions below to accept a project transfer on behalf of an org.

    Cancelling a Transfer of Billing Responsibility

    You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the recipient. To do this:

    1. Select All Projects from the Projects link in the main menu. Open the project. You see a Pending Project Ownership Transfer notification at the top of the screen.

    2. Click the Cancel Transfer button to cancel the transfer.

    Accepting a Transfer Request

    When another user initiates a project transfer to you, you receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.

    If you did not already have access to the project being transferred, you receive VIEW access and the project appears in the list on the Projects screen.

    To accept the transfer:

    1. Open the project. You see a Pending Project Ownership Transfer notification in the project header.

    2. Click the Accept Transfer button.

    3. Select a new billing account for the project from the dropdown of eligible accounts.

    Projects with Auto-Symlink Enabled

    Projects with auto-symlink enabled cannot be transferred to a different billing account. For more information, see symlinks.

    Projects with PHI Data Protection Enabled

    If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.

    Sponsored Projects

    Ownership of sponsored projects may not be transferred without the sponsorship first being terminated.

    Project Sponsorship

    A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.

    When setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.

    Billing responsibility for sponsored projects may not be transferred.

    Sponsored projects may not be deleted unless the project sponsor first ends the sponsorship by changing its end date to a date in the past.

    For more information about sponsorship, contact DNAnexus Support.

    Project Access Levels

    VIEW

    Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.

    UPLOAD

    Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.

    CONTRIBUTE

    Gives users UPLOAD access, plus the ability to run executions directly in the project.

    ADMINISTER

    Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.

    Learn More

    See the Org Management page for detailed information on projects that are billed to an org.

    Learn about accessing and working with projects via the CLI: see Project Navigation and Path Resolution.

    Learn about working with projects as a developer: see Project API Specifications and Project Permissions and Sharing.

    Using DXJupyterLab

    Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.

    DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

    Jupyter notebooks are a popular way to track the work performed in computational experiments the way a lab notebook tracks the work done in a wet lab setting. DXJupyterLab, or JupyterLab, is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. DXJupyterLab allows users on the DNAnexus Platform to collaborate on notebooks and extends JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.

    Why Use DXJupyterLab?

    DXJupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.

    DXJupyterLab is a versatile application that can be used to:

    • Collaborate on exploratory analysis of data

    • Reproduce and fork work performed in computational analyses

    • Visualize and gain insights into data generated from biological experiments

    • Create figures and tables for scientific publications

    • Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows

    • Test and train machine/deep learning models

    • Interactively run commands on a terminal

    The DNAnexus Platform offers two different DXJupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the DNAnexus Apollo framework.

    Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.

    The DXJupyterLab Spark Cluster app contains all the features found in the general-purpose DXJupyterLab, along with access to a fully managed, on-demand Spark cluster for big data processing and translational informatics.

    Version Information

    DXJupyterLab 2.2 is the default version on the DNAnexus Platform. Previous versions remain available.

    Creating Interactive Notebooks

    A step-by-step guide on how to start with DXJupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.

    DXJupyterLab Environments

    Creating a DXJupyterLab session requires the use of two different environments:

    1. The DNAnexus project (accessible through the web platform and the CLI).

    2. The worker execution environment.

    The Project on the DNAnexus Platform

    You have direct access to the project in which the application is run from the JupyterLab session. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal:

    The project is selected when the DXJupyterLab app is started and cannot be subsequently changed.

    The DNAnexus file browser shows:

    • Up to 1,000 of your most recently modified files and folders

    • All Jupyter notebooks in the project

    • Databases (Spark-enabled app only, limited to 1,000 most recent)

    The file list refreshes automatically every 10 seconds. You can also refresh manually by clicking the circular arrow icon in the top right corner.

    Need to see more files? Use dx ls in the terminal or access them programmatically through the API.
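    For example, from the JupyterLab terminal (a minimal sketch; the folder name is a placeholder):

    $ dx ls                      # list the root folder of the current project
    $ dx ls -l "Original files"  # detailed listing of a specific folder, including IDs and sizes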

    Worker Execution Environment

    When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have [DX] prepended to the notebook name in the tab of all opened notebooks.

    The execution environment file browser is accessible from the left sidebar (notice the folder icon at the top) or from the terminal:

    To create Jupyter notebooks in the worker execution environment, use the File menu. These notebooks are stored on the local file system of the DXJupyterLab execution environment and must be saved to a DNAnexus project to persist after the session ends. More information about saving appears in the following section.

    Local vs. DNAnexus Notebooks

    DNAnexus Notebooks

    You can create, edit, and save notebooks directly in the DNAnexus project, as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are accessed through the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks", and they persist in the DNAnexus project after the DXJupyterLab instance is terminated.

    DNAnexus notebooks can be recognized by the [DX] that is prepended to their names in the tab of all opened notebooks.

    DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.

    Local Notebooks

    To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them in the DNAnexus file viewer in the left panel.

    Accessing Data

    In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.

    • For reading the input file multiple times or for reading a large fraction of the file in random order:

      • Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from Jupyter notebook.

    • For scanning the content of the input file once or for reading only a small fraction of the file's content:

      • The project in which the app is running is mounted in a read-only fashion at /mnt/project. Reading the content of files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment, but uses more API calls to fetch the content.
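    As a minimal sketch (the file path is a placeholder), the two approaches look like this in the JupyterLab terminal:

    # Option 1: download a local copy once, then read it as many times as needed
    $ dx download "/inputs/sample.vcf.gz"
    $ zcat sample.vcf.gz | head

    # Option 2: read the file in place from the read-only project mount at /mnt/project
    $ zcat "/mnt/project/inputs/sample.vcf.gz" | head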

    Uploading Data

    Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:

    • dx upload in bash console.

    • Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This uploads the file into the selected DNAnexus folder.

    • Create a snapshot of the DXJupyterLab environment.
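    For example, from the JupyterLab terminal (file and folder names are placeholders):

    $ dx upload my_local_notebook.ipynb   # uploads into the currently selected project folder
    $ dx upload -r results/               # recursively uploads a local directory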

    Exporting DNAnexus Notebooks

    Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported. However, you can dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. For exporting a local notebook to certain formats, the following commands might be needed beforehand: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
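    A minimal sketch of this workflow (the notebook name is a placeholder):

    $ dx download "/my_analysis.ipynb"                 # fetch the DNAnexus notebook to the local file system
    $ jupyter nbconvert --to html my_analysis.ipynb    # or --to pdf, after installing the texlive packages above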

    Non-Interactive Execution of Notebooks

    A command can be executed in the DXJupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files via the in input file array to the DXJupyterLab app. The provided command runs in the /opt/notebooks directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the DXJupyterLab app.

    The cmd input makes it possible to use the papermill command that is pre-installed in the DXJupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"

    Here, notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the DXJupyterLab app page for details.

    Collaboration in the Cloud

    Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.

    Notebook Locking During Editing

    If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.

    It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.

    Once the editing user closes the notebook, the lock is released and anybody else with access to the project can open it.

    Notebook Versioning

    Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses are not lost when the DXJupyterLab session ends.
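    For example, earlier versions of a saved notebook can be reviewed from the terminal (a minimal sketch; the notebook name is a placeholder):

    $ dx ls "/.Notebook_archive"          # archived copies carry a timestamp suffix
    $ dx describe "/my_analysis.ipynb"    # the properties include the ID of the previous version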

    Session Timeout Control

    DXJupyterLab sessions begin with a set duration and shut down automatically at the end of this period. The timeout clock appears in the footer on the right side and can be adjusted using the Update duration button. The session terminates at the set timestamp even if the DXJupyterLab webpage is closed. Job lengths have an upper limit of 30 days, which cannot be extended.

    A session can be terminated immediately from the top menu (DNAnexus > End Session).

    Environment Snapshots

    It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).

    A DXJupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to a snapshot file, except for the directories /home/dnanexus and /mnt/, which are not included. This file is then uploaded to the project folder .Notebook_snapshots and can be passed as input the next time the app is started.

    If many large files are created locally, the resulting snapshots take a long time to save and load. In general, it is recommended not to snapshot more than 1 GB of locally saved data/packages and rely on downloading larger files as needed.
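    For example, a later session can be started from a saved snapshot (a minimal sketch; the file name is a placeholder, and snapshot is assumed to be the name of the app's snapshot input):

    $ dx run dxjupyterlab -isnapshot="/.Notebook_snapshots/jupyterlab_snapshot.2024-01-01.gz" -y --brief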

    Snapshots Created in Older Versions of DXJupyterLab

    Snapshots created with DXJupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These previous snapshots contain tool versions that may conflict with the newer environment, potentially causing problems.

    Using Previous Snapshots in the Current Version of DXJupyterLab

    To use a snapshot from a previous version in the current version of DXJupyterLab, recreate the snapshot as follows:

    1. Create a tarball incorporating all the necessary data files and packages.

    2. Save the tarball in a project.

    3. Launch the current version of DXJupyterLab.

    4. Import and unpack the tarball file.
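    A minimal sketch of these steps, with placeholder file and directory names:

    # in the old session (or locally): bundle the data files and packages you need
    $ tar czvf my_env.tar.gz my_packages/ my_data/
    $ dx upload my_env.tar.gz

    # in a session running the current version of DXJupyterLab: fetch and unpack the tarball
    $ dx download my_env.tar.gz
    $ tar xzvf my_env.tar.gz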

    Accessing an Older Snapshot in an Older Version of DXJupyterLab

    If you don't want to recreate your older snapshot, you can run an older version of DXJupyterLab and access the snapshot there.

    Viewing Other Files in the Project

    Viewing other file types from your project, such as CSV, JSON, or PDF files, images, or scripts, is convenient because JupyterLab displays them accordingly. For example, JSON files are collapsible and navigable, and CSV files are presented in tabular format.

    However, editing and saving any open files from the project other than IPython notebooks results in an error.

    Permissions in the JupyterLab Session

    The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.

    When running the DXJupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to read databases, which may be located in different projects and cannot be cloned.

    Running Jobs in the JupyterLab Session

    Use dx run to start new jobs from within a notebook or the terminal. If the billTo for the project where your JupyterLab session runs does not have a license for detached executions, any started jobs run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job uses the JupyterLab session's workspace instead of the specified project. If a subjob fails or terminates on the DNAnexus Platform, the entire job tree—including your interactive JupyterLab session—terminates as well.
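    As a minimal sketch (the applet ID and input name are placeholders), jobs can be started from the JupyterLab terminal, or from a notebook cell prefixed with !:

    $ dx run applet-xxxx -iinput_name=file-xxxx -y --brief

    # with the detached-executions license described above, the job can be detached from the session
    $ dx run applet-xxxx -iinput_name=file-xxxx --detach -y --brief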

    Jobs are limited to a runtime of 30 days. The system automatically terminates jobs running longer than 30 days.

    Environment and Feature Options

    The DXJupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users see a "403 Permission Forbidden" message under the JupyterLab session's URL.

    On the DNAnexus Platform, the JupyterLab server runs in a Python 3.9.16 environment, in a container running Ubuntu 20.04 as its operating system.

    Feature Options

    When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML.

    • PYTHON_R (default option): Loads the environment with Python3 and R kernel and interpreter.

    • ML: Loads the environment with Python3 and machine learning packages, such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype, but it does not contain R.

    • IMAGE_PROCESSING: Loads the environment with Python3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the FreeSurfer in DXJupyterLab guide.

    • STATA: Requires a license to run. See Stata in DXJupyterLab for more information about running Stata in JupyterLab.

    • MONAI_ML: Loads the environment with Python3 and extends the ML feature. This feature is ideal for medical imaging research involving machine learning model development and testing. It includes medical imaging frameworks designed for AI-powered analysis:

      • MONAI Core and MONAI Label: Medical imaging AI frameworks for deep learning workflows.

      • 3D Slicer: Medical image visualization and analysis platform accessible through the SlicerJupyter kernel.

    Full List of Pre-Installed Packages

    For the full list of pre-installed packages, see the DNAnexus JupyterLab in-product documentation. This list includes details on feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML features.

    Installing Additional Packages

    Additional packages can be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
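    For example, from the JupyterLab terminal (the package names are placeholders; Bioconda packages are available as noted above):

    $ pip install scikit-learn
    $ conda install -c bioconda samtools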

    JupyterLab Documentation

    For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.

    Next Steps

    • Create your first notebooks by following the instructions in the Quickstart guide.

    • See the DXJupyterLab Reference guide for tips and info on the most useful DXJupyterLab features.

    Describing Data Objects

    You can describe objects (files, app(let)s, and workflows) on the DNAnexus Platform using the command dx describe.

    Describing an Object by Name

    Objects can be described using their DNAnexus Platform name via the command line interface (CLI) using a path.

    Describe an Object With a Relative Path

    Objects can be described relative to the user's current directory on the DNAnexus Platform. In the following example, the indexed reference genome file human_g1k_v37.bwa-index.tar.gz is described.

    $ dx describe "Original files/human_g1k_v37.bwa-index.tar.gz"
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder            /Original files
    Name              human_g1k_v37.bwa-index.tar.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              3.21 GB

    The entire path is enclosed in quotes because the folder name Original files contains whitespace. Instead of quotes, escape special characters with \: dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.

    Describe an Object in a Different Project Using an Absolute Path

    Objects can be described using an absolute path. This allows you to describe objects outside the current project context. In the following example, dx select selects the project "My Research Project", and dx describe describes the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.

    $ dx select "My Research Project"
    $ dx describe "Reference Genome Files:H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.fa.gz"
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder            /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)
    Name              human_g1k_v37.fa.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              810.45 MB

    Describe an Object Using Object ID

    Objects can be described using a unique object ID.

    This example describes the workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.

    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Result 1:
    ID                  workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Class               workflow
    Project             project-BQfgzV80bZ46kf6pBGy00J38
    Folder              /
    Name                Exome Analysis Workflow
    ....
    Stage 0             bwa_mem_fastq_read_mapper
      Executable        app-bwa_mem_fastq_read_mapper/2.0.1
    Stage 1             fastqc
      Executable        app-fastqc/3.0.1
    Stage 2             gatk4_bqsr
      Executable        app-gatk4_bqsr_parallel/2.0.1
    Stage 3             gatk4_haplotypecaller
      Executable        app-gatk4_haplotypecaller_parallel/2.0.1
    Stage 4             gatk4_genotypegvcfs
      Executable        app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0

    Because workflows can include many app(let)s, inputs/outputs, and default parameters, the dx describe output can seem overwhelming.

    Manipulating Outputs

    The output from a dx describe command can be used for multiple purposes. The optional argument --json converts the output from dx describe into JSON format for advanced scripting and command line use.

    In this example, the publicly available workflow object "Exome Analysis Workflow" is described and the output is returned in JSON format.

    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json
      {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "name": "Exome Analysis Workflow",
        "inputSpec": [
          {
            "name": "bwa_mem_fastq_read_mapper.reads_fastqgzs",
            "class": "array:file",
            "help": "An array of files, in gzipped FASTQ format, with the first read mates to be mapped.",
            "patterns": [ "*.fq.gz", "*.fastq.gz" ],
            ...
          },
          ...
        ],
        "stages": [
          {
            "id": "bwa_mem_fastq_read_mapper",
            "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
            "input": {
              "genomeindex_targz": {
                "$dnanexus_link": {
                  "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
                  "id": "file-FFJPKp0034KY8f20F6V9yYkk"
                }
              }
            },
            ...
          },
          {
            "id": "fastqc",
            "executable": "app-fastqc/3.0.1",
            ...
          }
          ...
        ]
      }

    Parse, process, and query the JSON output using jq. Below, the dx describe --json output is processed to generate a list of all stages in the exome analysis pipeline.

    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq .stages
    [{
        "id": "bwa_mem_fastq_read_mapper",
        "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
      ...
      }, {
        "id": "fastqc",
        "executable": "app-fastqc/3.0.1",
      ...
      }, {
        "id": "gatk4_bqsr",
        "executable": "app-gatk4_bqsr_parallel/2.0.1",
      ...
      }
      ...
    }]

    To get the "executable" value of each stage present in the "stages" array value of the dx describe output above, use the following command:

    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq '.stages | map(.executable) | .[]'
      "app-bwa_mem_fastq_read_mapper/2.0.1"
      "app-fastqc/3.0.1"
      "app-gatk4_bqsr_parallel/2.0.1"
      "app-gatk4_haplotypecaller_parallel/2.0.1"
      "app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0"

    General Response Fields Overview

    Field name
    Objects
    Description


    Name

    All

    Object name on the platform.

    State

    All

    Status of the object on the platform.

    Visibility

    All

    Whether the file is visible to the user through the platform web interface.

    Tags

    All

    Set of tags associated with an object. Tags are strings used to organize or annotate objects.

    Properties

    All

    Key/value pairs attached to object.

    Outgoing Links

    All

    JSON reference to another object on the platform. Linked objects are copied along with the object if the object is cloned to another project.

    Created

    All

    Date and time object was created.

    Created by

    All

    DNAnexus user who created the object. Contains subfield "via the job" if the object was created by an app or applet.

    Last modified

    All

    Date and time the object was last modified.

    Input Spec

    App(let)s and Workflows

    App(let) or workflow input names and classes. With workflows, the corresponding applet stage ID is also provided.

    Output Spec

    App(let) and Workflows

    App(let) or workflow output names and classes. With workflows, the corresponding applet stage ID is also provided.

    ID

    All

    Unique ID assigned to a DNAnexus object.

    Class

    All

    DNAnexus object type.

    Project

    All

    Container where the object is stored.

    Folder

    All


    Objects inside a container (project) can be organized into folders. Objects can only exist in one path within a project.

    Organizations

    Learn about organizations, which associate users, projects, and resources with one another, enabling fluid collaboration, and simplifying the management of access, sharing, and billing.

    This functionality is also available via command line interface (CLI) tools. You may find it easier to use the CLI tools to perform some actions, such as inviting multiple users or exporting information into a machine-readable format.
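    As a minimal sketch (org-demo_org and smithj are placeholders), member management and reporting from the CLI might look like this:

    $ dx find org members org-demo_org --json            # export the member list in machine-readable form
    $ dx add member org-demo_org smithj --level MEMBER   # add an existing user as a MEMBER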

    What Is an Org?

    $ dx describe "Original files/human_g1k_v37.bwa-index.tar.gz"
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder            /Original files
    Name              human_g1k_v37.bwa-index.tar.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              3.21 GB
    $ dx select "My Research Project"
    $ dx describe Reference\ Genome\ Files:H.\ Sapiens\ -\ GRCh37\ -\ b37\ (1000\ Genomes\ Phase\ I)/human_g1k_v37.fa.gz
    Result 1:
    ID                file-xxxx
    Class             file
    Project           project-xxxx
    Folder           /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)
    Name              human_g1k_v37.fa.gz
    State             closed
    Visibility        visible
    Types             -
    Properties        -
    Tags              -
    Outgoing links    -
    Created           ----
    Created by        Amy
     via the job      job-xxxx
    Last modified     ----
    archivalState     "live"
    Size              810.45 MB
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Result 1:
    ID                  workflow-G409jQQ0bZ46x5GF4GXqKxZ0
    Class               workflow
    Project             project-BQfgzV80bZ46kf6pBGy00J38
    Folder              /
    Name                Exome Analysis Workflow
    ....
    Stage 0             bwa_mem_fastq_read_mapper
      Executable        app-bwa_mem_fastq_read_mapper/2.0.1
    Stage 1             fastqc
      Executable        app-fastqc/3.0.1
    Stage 2             gatk4_bqsr
      Executable        app-gatk4_bqsr_parallel/2.0.1
    Stage 3             gatk4_haplotypecaller
      Executable        app-gatk4_haplotypecaller_parallel/2.0.1
    Stage 4             gatk4_genotypegvcfs
      Executable        app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json
      {
        "project": "project-BQfgzV80bZ46kf6pBGy00J38",
        "name": "Exome Analysis Workflow",
        "inputSpec": [
          {
            "name": "bwa_mem_fastq_read_mapper.reads_fastqgzs",
            "class": "array:file",
            "help": "An array of files, in gzipped FASTQ format, with the first read mates to be mapped.",
            "patterns": [ "*.fq.gz", "*.fastq.gz" ],
            ...
          },
          ...
        ],
        "stages": [
          {
            "id": "bwa_mem_fastq_read_mapper",
            "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
            "input": {
              "genomeindex_targz": {
                "$dnanexus_link": {
                  "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
                  "id": "file-FFJPKp0034KY8f20F6V9yYkk"
                }
              }
            },
            ...
          },
          {
            "id": "fastqc",
            "executable": "app-fastqc/3.0.1",
            ...
          }
          ...
        ]
      }
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json |jq .stages
    [{
        "id": "bwa_mem_fastq_read_mapper",
        "executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
      ...
      }, {
        "id": "fastqc",
        "executable": "app-fastqc/3.0.1",
      ...
      }, {
        "id": "gatk4_bqsr",
        "executable": "app-gatk4_bqsr_parallel/2.0.1",
      ...
      }
      ...
    }]
    $ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq '.stages | map(.executable) | .[]'
      "app-bwa_mem_fastq_read_mapper/2.0.1"
      "app-fastqc/3.0.1"
      "app-gatk4_bqsr_parallel/2.0.1"
      "app-gatk4_haplotypecaller_parallel/2.0.1"
      "app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0"
    State
    Outgoing Links
    An organization (or "org") is a DNAnexus entity used to manage a group of users. Use orgs to group users, projects, and other resources together, in a way that models real-world collaborative structures.

    In its simplest form, an org can be thought of as referring to a group of users on the same project. An org can be used efficiently to share projects and data with multiple users - and, if necessary, to revoke access.

    Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting DNAnexus Sales.

    Orgs are referenced on the DNAnexus Platform by a unique org ID, such as org-dnanexus. Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.
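    For example, sharing a project with an org from the CLI is a single command (a minimal sketch; the org ID, project ID, and access level are placeholders):

    $ dx invite org-demo_org project-xxxx VIEW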

    Org Membership Levels

    Users may have one of two membership levels in an org:

    • ADMIN

    • MEMBER

    An ADMIN-level user is granted all possible access in the org and may perform org administrative functions. These functions include adding/removing users or modifying org policies. A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses in the org and has no administrative power in the org.

    Members

    A user with MEMBER level can be configured to have a subset of the following org access. These access levels determine which actions each user can perform in an org.

    Access
    Description
    Options

    Billable activities access

    If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.

    [Allowed] or [Not Allowed]

    Shared apps access

    If allowed, the org member has access to view and run apps in which the org has been added as an "authorized user".

    [Allowed] or [Not Allowed]

    Shared projects access

    The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member has at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project.

    [NONE], [VIEW], [UPLOAD], [CONTRIBUTE] or [ADMINISTER]

    These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.

    Admins

    Org admins are granted all possible access in the org. More specifically, org admins receive the following set of accesses:

    Access
    Level

    Billable activities access

    Allowed

    Shared apps access

    Allowed

    Shared projects access

    ADMINISTER

    Org admins also have the following special privileges:

    Viewing Metadata for All Org Projects

    Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin can see all projects billed to the org, including Project-P. Org admins can also invite themselves to Project-P at any time to get access to objects and jobs in the project.

    Becoming a Developer for All Org Apps

    Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A. However, any org admins may add themselves as developers at any time.

    Examples of Using Orgs

    Org Structure Diagram

    In the diagram below, there are 3 examples of how organizations can be structured.

    ORG-1

    The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. One admin (user A) manages ORG-1.

    ORG-2 and ORG-3

    The second example shows ORG-2 and ORG-3, demonstrating a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.

    In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.

    ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).

    In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.

    Example 1: Creating an Org for Sharing Data

    You can create a non-billable org as an alias for a group of users. For example, you have a group of users who all need access to a shared dataset. You can make an org which represents all the users who need access to the dataset, for example, an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org have at least VIEW "shared project access" and "shared app access" so that they are all given permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, and then no longer have access to any projects or apps shared with org-dataset_access.

    Example 2: Only Admins can Create Projects

    You can contact DNAnexus Sales to create a billable org where only one member, the org admin, can create new org projects. All other org members are not granted the "billable activities access", and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER) and share every org project with the org with ADMINISTER access. The members' permissions to the projects are restricted by their respective "shared project access."

    For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group are only given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you need to add them to the org as a new member and assign them the appropriate permission levels.

    This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.

    Example 3: Shared Billing Account

    You can contact DNAnexus Sales to create a billable org where users work independently and bill their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources. These resources might include incoming samples or reference datasets.

    In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.

    The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked by removing them from the org.

    Other Cases

    In general, it is possible to apply many different schemas to orgs as they were designed for many different real-life collaborative structures. If you have a type of collaboration you would like to support, contact DNAnexus Support for more information about how orgs can work for you.

    Managing Your Orgs

    If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.

    The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.

    Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.

    Viewing and Updating Org Information

    You can find the org overview on the Settings tab. From here, you can:

    • View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).

    • View the organization ID, the unique ID used to reference a particular org on the CLI. An example org ID would be org-demo_org.

    • View the number of org members, org projects, and org apps.

    • View the list of organization admins.

    Managing Org Members

    Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.

    From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.

    Inviting a New Member

    To add an existing DNAnexus user to your org, use the + Invite New Member button on the org's Members tab. This opens a screen where you can enter the user's username, such as smithj, or user ID, such as user-smithj. Then you can configure the user's access level in the org.

    If you add a member to the org with billable activities access set to billing allowed, they have the ability to create new projects billed to the org.

    However, adding the member does not change their default billing account. If the user wishes to use the org as their default billing account, they must set their own default billing account.

    If the member has any pre-existing projects that are not billed to the org, the user must transfer the project to an org if they wish to have the project billed to the org.

    The user receives an email notification informing them that they have been added to the organization.

    Creating New DNAnexus Accounts

    Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user then receives an email with instructions to activate their account and set their password.

    For information on a license that enables account provisioning, contact DNAnexus Sales.

    If this feature has already been turned on for an org you administer, you see an option to Create New User when you go to invite a new member.

    Here you can specify a username, such as alice or smithj, the new user's name, and their email address. The system automatically creates a new user account for the given email address and adds them as a member in the org.

    If you create a new user and set their Billable Activities Access to Billing Allowed, consider setting the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.

    Editing Member Access

    From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.

    When you edit multiple members, you have the option of changing only one access while leaving the rest alone.

    Removing Members

    From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.

    Removing a member revokes the user's access to all projects and apps billed to or shared with the org.

    Org Projects

    The org's Projects tab shows the list of all projects billed to the org. This list includes all projects in which you have VIEW or higher permissions, as well as projects billed to the org in which you have no permissions (projects of which you are not a member).

    You can view all project metadata, such as the list of members, data usage, and creation date. You can also view other optional columns such as project creator. To enable the optional columns, select the column from the dropdown menu to the right of the column names.

    Granting Admin Access to Org Projects

    Org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you are still able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.

    You can also grant yourself ADMINISTER permissions if you are a member of a project billed to your org but you only have VIEW, CONTRIBUTE, or UPLOAD permissions.

    Org Billing

    Accessing Org Billing Information

    To access your org's billing information:

    1. In the main menu, click Orgs > All Orgs.

    2. Select an organization you want to view.

    3. Select the Billing tab to view billing information.

    Setting Up or Updating Billing Information for an Org

    To set up or update the billing information for an org you administer, contact the DNAnexus Billing team.

    Setting up billing for an organization designates someone to receive and pay DNAnexus invoices, including usage by organization members. The billing contact can be you, someone from your finance department, or another designated person.

    When you click Confirm Billing, DNAnexus sends an email to the designated billing contact requesting confirmation of their responsibility for receiving and paying invoices. The organization's billing contact information does not update until DNAnexus receives this confirmation.

    Setting and Modifying an Org Spending Limit

    The org spending limit is the total amount of outstanding usage charges that can be incurred by projects linked to the org.

    If you are an org admin, you can set or modify this spending limit:

    1. In the main menu, click Orgs > All Orgs.

    2. Select the org for which you'd like to set or modify a spending limit.

    3. In the org details, select the Billing tab.

    4. In Summary, click Increase Spending Limit to request increasing the limit via DNAnexus Support.

    Doing this only submits your request.

    Before approving your request, DNAnexus Support may follow up with you via email with questions about the change.

    Viewing Estimated Charges

    The Usage Charges section allows users with billable access to view total charges incurred to date. You can see how much is left of the org's spending limit. This section is only visible if your org is a billable org, which means your org has confirmed billing information.

    If your org doesn't have a spending limit, spending is unlimited and the limit shows as "N/A."

    Using Monthly Project Spending and Usage Limits

    You need a license to use both the Monthly Project Usage Limit for Computing and Egress, and Monthly Project Spending Limit for Storage features. Contact DNAnexus Sales for more information.

    For orgs with the Monthly Project Usage Limit for Computing and Egress and/or the Monthly Project Spending Limit for Storage feature enabled, org admins can set, update, and view default limits for each spending type, and set limit enforcement actions via API calls.

    • To set the org policies for compute, egress, and storage spending limits and related enforcement actions, use the API methods org/new, org-xxxx/update, or org-xxxx/bulkUpdateProjectLimit.

    • To retrieve and view your org's policies configuration for spending and usage limits, use the API method org-xxxx/describe.

    In orgs with the Monthly Project Usage Limit for Computing and Egress feature enabled, org admins can set default limits and enforcement actions:

    1. In the main menu, click Orgs > All Orgs.

    2. In the Organizations list, click the organization name you want to configure.

    3. In the Usage Limits section:

      1. Set the default compute and egress spending limits for linked projects.

      2. Configure the enforcement action when limits are reached.

      3. Choose whether to prevent new executions and terminate ongoing ones, or send alerts while allowing executions to continue.

    For details on these limits, see How Spending and Usage Limits Work.

    Configuring limits and their enforcement in org details > Billing > Usage Limits.

    Monitoring Spending and Usage

    Licenses are required to use the Per-Project Usage Report and Root Execution Stats Report features. Contact DNAnexus Sales for more information.

    Configuration of these features, and report delivery, is handled by DNAnexus Support.

    The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide detailed breakdowns of charges incurred by org members. These reports help you track and analyze spending patterns across your organization. For more information, see organization management and usage monitoring.

    Org Policies

    Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:

    Policy
    Description
    Options

    Membership List Visibility

    Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.

    [ADMIN], [MEMBER], or [PUBLIC]

    Project Transfer

    Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).

    [ADMIN] or [MEMBER]

    Project Sharing

    Dictates the minimum org membership level allowed for a user to invite that org to a project.

    [ADMIN] or [MEMBER]

    DNAnexus recommends, as a starting point, restricting the "membership list visibility" policy to ADMIN and the "project transfer" policy to ADMIN. This ensures that only org admins can see the list of members and their access within the org, and that org projects always remain under the control of the org.

    You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here, you can both change the membership list visibility and restrict project transfer policies for the org and contact DNAnexus Support to enable PHI data policies for org projects.

    Glossary of Org Terms

    • Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.

    • Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.

    • Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the billing account of the app is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.

    • Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.

    • Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Remember that ADMINISTER is a type of access level.

    • Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.

    • Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and do not have any org projects or org apps. Any user can share a project with a non-billable org.

    • Org access is granted to a user to determine which actions the user can perform in an org.

    • Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.

    • Org app is an app billed to an org.

    • Org ID is the unique ID used to reference a particular org on the DNAnexus Platform. An example is org-dnanexus.

    • Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.

    • Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.

    • Org project describes a project billed to an org.

    • Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.

    • Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.

    • Share with an org means to give the members of an org access to a project or app via giving the org access to the project or adding the org as an "authorized user" of an app.

    • Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."

    • Shared projects access is an org access level that can be granted to org members: the maximum access level a user can have in projects shared with an org.

    Learn More

    Learn in depth about setting up and managing orgs as an administrator.

    Learn about what you can do as an org member.

    Learn about creating and managing orgs as a developer, via the DNAnexus API.

    Running Workflows

    You can run workflows from the command-line using the command dx run. The inputs to these workflows can be from any project for which you have VIEW access.

    The examples here use the publicly available Exome Analysis Workflow (platform login required to access this link).

    For information on how to run a Nextflow pipeline, see Running Nextflow Pipelines.

    Running in Interactive Mode

Running dx run without specifying any inputs launches interactive mode. The system prompts for each required input, followed by options to select from a list of optional parameters to modify. Optional parameters include all modifiable parameters for each stage of the workflow. The interface prints the input JSON detailing the inputs specified and generates an analysis ID of the form analysis-xxxx that is unique to this particular run of the workflow.

    Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.

    Running in Non-Interactive Mode

You can specify each input on the command line using the -i or --input flag with the syntax -i<stage ID>.<input name>=<input value>. The <input value> must be either a DNAnexus object ID or the name of a file in your currently selected project. It is also possible to specify the number of a stage in place of the stage ID for a given workflow, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only to illustrate this point. The parentheses around the <input value> in the help string are omitted when entering input.

    Possible values for the input name field can be found by running the command dx run workflow-xxxx -h, as shown below using the Exome Analysis Workflow.

    This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message first lists the required inputs for that stage, specifying the requisite type in the <input-value> field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it lists advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form

    {"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}.

    This link format can also be used to specify output from any prior stage in the workflow as input for the current stage.

    For the Exome Analysis Workflow, one required input parameter needs to be specified manually: -ibwa_mem_fastq_read_mapper.reads_fastqgzs.

    This parameter targets the first stage of the workflow. For convenience, use the stage number instead of the full stage ID. Since this is the first stage (and workflow stages are zero-indexed), replace bwa_mem_fastq_read_mapper with 0 like this: -i0.reads_fastqgzs.

    The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.

    Specifying Array Input

Array input can be specified by repeating the -i flag for a single parameter in a stage. For example, the following flags would add files 1 through 3 to the file_inputs parameter for stage-xxxx of the workflow:

    If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>.

    Job-Based Object References (JBORs)

The -i flag can also be used to specify job-based object references (JBORs) with the syntax -i<stage ID or number>.<input name>=<job id>:<output name>. The --brief flag, when used with dx run, outputs only the execution's ID, and the -y flag skips the interactive prompt confirming the execution.

The example below calls the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to produce the sorted_bam output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h. This output is then used as input to the first stage of the Parliament Workflow featured on the DNAnexus Platform (platform login required to access this link).

    Advanced Options

    Quiet Output

    Using the --brief flag at the end of a dx run command causes the command line to print the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.

    Rerunning Analyses With Modified Settings

    To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]. The [options] parameters override anything set by the --clone flag, and take the form of options passed as input from the command line.

The --clone flag does not copy the usage of the --allow-ssh or --debug-on flags, which must be set with the new execution. Only the executable, instance type, and input spec are copied. See the Connecting to Jobs page for more information on the usage of these flags.

    For example, the command below redirects the output of the analysis to the outputs/ folder and reruns all stages.

Only the outputs of stages that are rerun are placed in the specified destination.

    Rerunning Specific Stages

When rerunning workflows, if a stage runs identically to how it ran in a previous analysis, the stage itself is not rerun; the outputs of that stage are not copied or rewritten to a new location. To force a specific stage to run again, use the option --rerun-stage STAGE_ID, where STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). If you want to rerun all stages of an analysis, you can use --rerun-stage "*", where the asterisk is enclosed in quotes to prevent the shell from expanding it, via globbing, to the names of files and folders in your current directory.

The command below reruns the third and final stage of analysis-xxxx:

    Specifying Analysis Output Folders

    The --destination flag allows you to specify the path of the output of a workflow. By default, every output of every stage is written to the destination specified.
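For example, the following sketch (workflow-xxxx, file-xxxx, and the /results folder are placeholders) writes the outputs of every stage to the /results folder of project-xxxx:

$ dx run workflow-xxxx \
  -i0.reads_fastqgzs=file-xxxx \
  --destination project-xxxx:/results -y --brief
analysis-xxxx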

    Specifying Output Folders

You can use the --stage-output-folder <stage_ID> <folder> option to specify the output destination of a particular stage in the analysis being run. Here stage_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). The folder argument is the project and path to which you wish the stage to write, using the syntax project-xxxx:/PATH, where PATH is the path to the folder in project-xxxx where you wish to write outputs.

    The following command reruns all stages of analysis-xxxx and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:

    Specifying Stage-Relative Output Folders

If you want to specify the output folder of a stage within the current output folder of the entire analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>, where stage_id is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, specify a quoted path, relative to the output folder of the analysis, to which the output of that stage should be written.

    The following command reruns all stages of analysis-xxxx, setting the output destination of the analysis to /exome_run, and the output destination of stage 0 to /exome_run/mappings in the current project:

    Specifying a Different Instance Type

    To specify the instance type of all stages in your analysis or a specific set of stages in your analysis, use the flag --instance-type. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE sets one instance type for all stages. The two options can be combined, for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16 sets all stages' instance types to mem2_ssd1_x2 except for the stage my_stage_0, for which mem3_ssd1_x16 is used.

    Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).

The example below reruns all stages of analysis-xxxx and specifies that the first and second stages should be run on mem1_hdd2_x8 and mem1_ssd2_x4 instances respectively:

    Adding Metadata to an Analysis

This is identical to adding metadata to a job. See Adding Metadata to a Job for details.
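As a brief sketch (the tag and property values here are arbitrary), the same --tag and --property flags accepted by dx run for jobs can be passed when launching or rerunning an analysis:

$ dx run --clone analysis-xxxx --rerun-stage "*" \
  --tag batch1 --property sample=SRR504516 --brief -y
analysis-xxxx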

    Monitoring an Analysis

Command-line monitoring of an analysis is not available. For information about monitoring a job from the command line, see Monitoring Executions.

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs that run longer than 30 days are automatically terminated.

    Providing Input JSON

This is identical to providing an input JSON to a job. For more information, see Providing Input JSON.

    As in running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>. Here STAGE_ID is either an ID of the form stage-xxxx or the index of that stage in the workflow (starting with the first stage at index 0).
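For example, a minimal workflow input JSON piped via stdin might look like the following sketch (the stage index, project ID, and file ID are placeholders):

$ echo '{
  "0.reads_fastqgzs": [
    { "$dnanexus_link": { "project": "project-xxxx", "id": "file-xxxx" } }
  ]
}' | dx run workflow-xxxx -f - -y --brief
analysis-xxxx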

    $ dx run "Exome Analysis Demo:Exome Analysis Workflow"
    Entering interactive mode for input selection.
    
    Input:   Reads (bwa_mem_fastq_read_mapper.reads_fastqgzs)
    Class:   array:file
    
    Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
        current directory, '?' for more options)
    bwa_mem_fastq_read_mapper.reads_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
    
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
     [0] Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
     [1] Read group information (bwa_mem_fastq_read_mapper.rg_info_csv)
    .
    .
    .
     [33] Output prefix (gatk4_genotypegvcfs.prefix)
     [34] Extra command line options (gatk4_genotypegvcfs.extra_options) [default="-G StandardAnnotation --only-output-calls-starting-in-intervals"]
    
    Optional param #: 0
    
    Input:   Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
    Class:   array:file
    
    Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
       current directory, '?' for more options)
    bwa_mem_fastq_read_mapper.reads2_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_2.fastq.gz"
    bwa_mem_fastq_read_mapper.reads2_fastqgzs[1]:
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
      "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jg7v8KfPy38kjz1vQ001y"
          }
        }
      ],
      "bwa_mem_fastq_read_mapper.reads2_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jgYG8KfPy38kjz1vQ0020"
          }
        }
      ]
    }
    
    Confirm running the executable with this input [Y/n]: <ENTER>
    Calling workflow-xxxx with output destination project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run "Exome Analysis Demo:Exome Analysis Workflow" -h
    usage: dx run Exome Analysis Demo:Exome Analysis Workflow [-iINPUT_NAME=VALUE ...]
    
    Workflow: GATK4 Exome FASTQ to VCF (hs38DH)
    
    Runs GATK4 Best Practice for Exome on hs38DH reference genome
    
    Inputs:
     bwa_mem_fastq_read_mapper
      Reads: -ibwa_mem_fastq_read_mapper.reads_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads_fastqgzs=... [...]]
            An array of files, in gzipped FASTQ format, with the first read mates
            to be mapped.
    
      Reads (right mates): [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=... [...]]]
            (Optional) An array of files, in gzipped FASTQ format, with the second
            read mates to be mapped.
      BWA reference genome index: [-ibwa_mem_fastq_read_mapper.genomeindex_targz=(file, default={"$dnanexus_link": {"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv", "id": "file-FFJPKp0034KY8f20F6V9yYkk"}})]
            A file, in gzipped tar archive format, with the reference genome
            sequence already indexed with BWA.
      ...
     fastqc
      Reads: [-ifastqc.reads=(file, default={"$dnanexus_link": {"stage": "bwa_mem_fastq_read_mapper", "outputField": "sorted_bam"}})]
            A file containing the reads to be checked. Accepted formats are
            gzipped-FASTQ and BAM.
      ...
     gatk4_bqsr
      Sorted mappings: [-igatk4_bqsr.mappings_sorted_bam=(file, default={"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}})]
            A coordinate-sorted BAM or CRAM file with the base quality scores to
            be recalibrated.
       ...
     ...
    
    Outputs:
      Sorted mappings: bwa_mem_fastq_read_mapper.sorted_bam (file)
            A coordinate-sorted BAM file with the resulting mappings.
    
      Sorted mappings index: bwa_mem_fastq_read_mapper.sorted_bai (file)
            The associated BAM index file.
      ...
      Variants index: gatk4_genotypegvcfs.variants_vcfgztbi (file)
            The associated TBI file.
    $ dx run "Exome Analysis Demo:Exome Analysis Workflow" \
     -i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
     -ibwa_mem_fastq_read_mapper.genomeindex_targz='Reference Genome Files\: AWS US (East):/H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.bwa-index.tar.gz' -y
    Using input JSON:
    {
      "bwa_mem_fastq_read_mapper.reads_fastqgzs": [
        {
          "$dnanexus_link": {
            "project": "project-BQfgzV80bZ46kf6pBGy00J38",
            "id": "file-B40jg7v8KfPy38kjz1vQ001y"
          }
        }
      ],
      "bwa_mem_fastq_read_mapper.genomeindex_targz": {
        "$dnanexus_link": {
          "project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
          "id": "file-B6ZY4942J35xX095VZyQBk0v"
        }
      }
    }
    
    Calling workflow-xxxx with output destination
      project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run workflow \
    -istage-xxxx.file_inputs=project-xxxx:file-1xxxx \
    -istage-xxxx.file_inputs=project-xxxx:file-2xxxx \
    -istage-xxxx.file_inputs=project-xxxx:file-3xxxx
    
    Using input JSON:
    {
      "stage-xxxx.file_inputs": [
          {
           "$dnanexus_link": {
              "project": "project-xxxx",
              "id": "file-1xxxx"
          },
          {
           "$dnanexus_link": {
              "project": "project-xxxx",
              "id": "file-2xxxx"
          },
          {
           "$dnanexus_link": {
              "project": "project-xxxx",
              "id": "file-3xxxx"
          }
      ]
    }
    $ dx run Parliament \
      -i0.illumina_bam=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=file-xxxx -ireads2_fastqgzs=file-xxxx -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y --brief):sorted_bam \
      -i0.ref_fasta=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V \
      -y
    
    Using input JSON:
    {
        "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.illumina_bam": {
            "$dnanexus_link": {
                "field": "sorted_bam",
                "job": "job-xxxx"
            }
        },
        "stage-F14F5qQ0Jz1gfpjX8y1JxG3y.ref_fasta": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-B6qq53v2J35Qyg04XxG0000V"
            }
        }
    }
    
    Calling workflow-xxxx with output destination project-xxxx:/
    
    Analysis ID: analysis-xxxx
    $ dx run workflow-xxxx -i0.input_file=Input/SRR504516_1.fastq.gz -y --brief
    analysis-xxxx
    dx run --clone analysis-xxxx \
      --rerun-stage "*" \
      --destination project-xxxx:/output -y
    dx run --clone analysis-xxxx --rerun-stage 2 --brief -y
    dx run --clone analysis-xxxx --rerun-stage "*" \
      --stage-output-folder 0 "mappings" --brief -y
    dx run --clone analysis-xxxx --rerun-stage "*" \
      --destination "exome_run" \
      --stage-relative-output-folder 0 "mappings" --brief -y
    dx run --clone analysis-xxxx \
      --rerun-stage "*" \
      --instance-type '{"0": "mem1_hdd2_x8", "1": "mem1_ssd2_x4"}' \
      --brief -y

    Running Apps and Applets

You can run apps and applets from the command line using the command dx run, or from the UI. The inputs to these app(let)s can be from any project for which you have VIEW access.

    Running in Interactive Mode

    If dx run is run without specifying any inputs, interactive mode launches. When you run this command, the platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to run with the inputs you have selected.

Running in Non-Interactive Mode

    Naming Each Input

    You can also specify each input parameter by name using the ‑i or ‑‑input flags with syntax ‑i<input name>=<input value>. Names of data objects in your project are resolved to the appropriate IDs and packaged correctly for the API method as shown below.

When specifying input parameters using the ‑i/‑‑input flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h, as shown below using the Swiss Army Knife app (platform login required to access this link).

The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, the Swiss Army Knife app has two primary inputs: one or more input files and a command string to be executed, specified as -iin=file-xxxx and -icmd=<string>, respectively.

    The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.

    Specifying Array Input

To specify array inputs, reuse the ‑i/‑‑input flag for each input in the array; each file specified is appended to the array in the same order as it was entered on the command line. Below is an example of how to use the Swiss Army Knife app to index multiple BAM files (platform login required to access this link).

    Job-Based Object References (JBORs)

Job-based object references (JBORs) can also be provided using the -i flag with syntax ‑i<input name>=<job id>:<output name>. Combined with the --brief flag (which allows dx run to output only the job ID) and the -y flag (to skip confirmation), you can string together two jobs using one command.

Below is an example of how to run the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), producing the output named sorted_bam as described in the dx help output of the command dx run app-bwa_mem_fastq_read_mapper -h. The sorted_bam output is then used as input for the Swiss Army Knife app (platform login required to access this link).

    Advanced Options

    Some examples of additional functionalities provided by dx run are listed below.

    Quiet Output

    Regardless of whether you run a job interactively or non-interactively, the command dx run always prints the exact input JSON with which it is calling the applet or app. If you don't want to print this verbose output, you can use the --brief flag which tells dx to print out only the job ID instead. This job ID can then be saved.

    To run jobs without being prompted for confirmation, use the -y or --yes option. This is especially helpful when scripting or automating job submissions.

    If you want to both skip confirmation and immediately monitor the job's progress, use -y --watch. This starts the job and displays its logs in your terminal as it runs.

    Rerunning a Job With the Same Settings

If you are debugging applet-xxxx and wish to rerun a previous job with the same settings (destination project and folder, inputs, instance type requests) but with a new executable applet-yyyy, you can use the --clone flag.

In the above command, the Swiss Army Knife app (platform login required to access this link) specified on the command line overrides the executable used by the job given to --clone job-xxxx.

    If you want to modify some but not all settings from the previous job, you can run dx run <executable> --clone job-xxxx [options]. The command-line arguments you provide in [options] override the settings reused from --clone. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.

The example shown below redirects the outputs of the job to the folder "output/" in project-xxxx.

While the --clone job-xxxx flag copies the applet, instance type, and inputs, it does not copy usage of the --allow-ssh or --debug-on flags. These must be re-specified for each job run. For more information, see the Connecting to Jobs page.

    Specifying the Job Output Folder

The --destination flag allows you to specify the full project-ID:/folder/ path in which to output the results of the app(let). If this flag is unspecified, the output of the job defaults to the present working directory, which can be determined by running dx pwd.

    In the above command, the flag --destination project-xxxx:/mappings instructs the job to output all results into the "mappings" folder of project-xxxx.

    Specifying a Different Instance Type

The --instance-type option of dx run allows you to specify the instance types to use for the job. More information is available by running the command dx run --instance-type-help.

Some apps and applets have multiple entry points, meaning that different instance types can be specified for different functions executed by the app(let). In the example below, the Parliament app (platform login required to access this link) is run while specifying the instance types for the entry points honey, ssake, ssake_insert, and main. Specifying the instance types for each entry point requires a JSON-like string, meaning that the string should be wrapped in single quotes, as explained earlier and demonstrated below.

    Adding Metadata to a Job

    If you are running many jobs that have varying purposes, you can organize the jobs using metadata. Two types of metadata are available on the DNAnexus Platform: properties and tags.

    Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property flag allows you to attach a property to a job, and the --tag flag allows you to tag a job.

Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs makes it easier for you to search for a particular job in your job history. This is useful when you want to tag all jobs run with a particular sample, for instance.
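For example, jobs tagged and given properties as in the example for this section can later be located with dx find jobs (the tag and property values here match that example):

$ dx find jobs --tag dna
$ dx find jobs --property foo=bar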

    Specifying an App Version

    If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job. Append the app name with the version required, for example, app-xxxx/0.0.1 if the current version is app-xxxx/1.0.0.

    Watching a Job

    To monitor your job as it runs, use the --watch flag to display the job's logs in your terminal window as it progresses.

    Providing Input JSON

You can also specify the input JSON in its entirety. To specify a data object, you must wrap it in DNAnexus link form (a key-value pair with a key of $dnanexus_link and the value of the data object's ID). Because you are already providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, confirmation before the job starts is not required. Three methods exist for entering the full input JSON, discussed in separate sections below.

    From the CLI

    If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.

    From a File

    If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file followed by the name of the JSON file.

    From stdin

    Entering the input JSON file using stdin is done the same way as entering the file using the -f flag with the substitution of using "-" as the filename. Below is an example that demonstrates how to echo the input JSON to stdin and pipe the output to the input of dx run. As before, use single quotes to wrap the JSON input to avoid interfering with the double quotes used by the JSON itself.

    Getting Additional Information on dx run

Executing the dx run --help command shows the flags available to use with dx run. The message printed by this command is identical to the one displayed in the brief description of dx run.

    Cost Run Limits

The --cost-limit cost_limit option sets the maximum cost of the job before termination. For workflows, the limit applies to the cost of the entire analysis. For batch runs, this limit applies per job. See the dx run --help command for more information.
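For example, the following sketch (the limit of 5 is an arbitrary amount in your billing currency, and the input IDs are placeholders) terminates the job if its cost exceeds the limit:

$ dx run app-swiss-army-knife \
  -iin=file-xxxx \
  -icmd="samtools index *.bam" \
  --cost-limit 5 -y --brief
job-xxxx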

    Job Runtime Limits

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.

    $ dx run app-bwa_mem_fastq_read_mapper
    Entering interactive mode for input selection.
    
    Input:   Reads (reads_fastqgz)
    Class:   file
    
Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
    reads_fastqgz: reads.fastq.gz
    
    Input:   BWA reference genome index (genomeindex_targz)
    Class:   file
    Suggestions:
        project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes)
    
    Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
    genomeindex_targz: "Reference Genome Files:/H. Sapiens - hg19 (UCSC)/ucsc_hg19.bwa-index.tar.gz"
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
     [0] Reads (right mates) (reads2_fastqgz)
     [1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
     [2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
     [3] Read group platform (read_group_platform) [default="ILLUMINA"]
     [4] Read group platform unit (read_group_platform_unit) [default="None"]
     [5] Read group library (read_group_library) [default="1"]
     [6] Read group sample (read_group_sample) [default="1"]
     [7] Output all alignments for single/unpaired reads? (all_alignments)
     [8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
     [9] Advanced command line options (advanced_options)
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
        "reads_fastqgz": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-xxxx"
            }
        },
        "genomeindex_targz": {
            "$dnanexus_link": {
                "project": "project-xxxx",
                "id": "file-xxxx"
            }
        }
    }
    
    Confirm running the applet/app with this input [Y/n]: <ENTER>
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    Watch launched job now? [Y/n] n
    $ dx run app-swiss-army-knife \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
    -icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam \
    SRR100022_chrom20_mapped_to_b37.bam" -y
    
    Using input JSON:
    {
        "cmd": "samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGpj0x05xKxZ42QPqZkJY \
    -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGzj0x05b66kqQv51011q \
    -icmd="ls *.bam | xargs -n1 -P5 samtools index" -y
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
        -iin=$(dx run app-bwa_mem_fastq_read_mapper \
        -ireads_fastqgz=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXKk80fPFj4Jbfpxb6Ffv2 \
        -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y \
        --brief):sorted_bam \
        -icmd="samtools index *.bam" -y
    
    Using input JSON:
    {
        "in": [
            {
                "$dnanexus_link": {
                    "field": "sorted_bam",
                    "job": "job-xxxx"
                }
            }
        ],
        "cmd": "samtools index *.bam"
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-bwa_mem_fastq_read_mapper \
        -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
        -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
        -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
        --destination "mappings" -y --brief
    $ dx run app-swiss-army-knife --clone job-xxxx -y
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife \
--clone job-xxxx --destination project-xxxx:/output -y
    $ dx run app-bwa_mem_fastq_read_mapper \
    -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
    -ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
    -igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
    --destination "mappings" -y --brief
    dx run parliament \
      -iillumina_bam=illumina.bam \
      -iref_fasta=ref.fa.gz \
      --instance-type '{
        "honey": "mem1_ssd1_x32",
        "ssake": "mem1_ssd1_x8",
        "ssake_insert": "mem1_ssd1_x32",
        "main": "mem1_ssd1_x16"
      }' \
      -y \
      --brief
    $ dx run app-swiss-army-knife \
        -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
        -icmd="samtools sort -T /tmp/aln.sorted -o \
        SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
        --property foo=bar --tag dna -y
    $ dx run app-swiss-army-knife/2.0.1 \
        -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
        -icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
        -y --brief
    $ dx run app-swiss-army-knife \
      -iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
      -icmd="samtools sort -T /tmp/aln.sorted \
        -o SRR100022_chrom20_mapped_to_b37.sorted.bam \
        SRR100022_chrom20_mapped_to_b37.bam" \
      --watch \
      -y \
      --brief
    
    job-xxxx
    
    Job Log
    -------
    Watching job job-xxxx. Press Ctrl+C to stop.
    $ dx run app-swiss-army-knife -j '{
      "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
      "in": [
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BQbXVY0093Jk1KVY1J082y7v"
          }
        },
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
          }
        },
        { "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGzj0x05b66kqQv51011q"
          }
        }
      ]
    }' -y
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ dx run app-swiss-army-knife -f input.json
    
    Using input JSON:
    {
        "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
        "in": [
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BQbXVY0093Jk1KVY1J082y7v"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
                }
            },
            {
                "$dnanexus_link": {
                    "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
                    "id": "file-BZ9YGzj0x05b66kqQv51011q"
                }
            }
        ]
    }
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx
    $ echo '{
      "cmd": "ls *.bam | xargs -n1 -P5 samtools index",
      "in": [
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BQbXVY0093Jk1KVY1J082y7v"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
          }
        },
        {
          "$dnanexus_link": {
            "project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
            "id": "file-BZ9YGzj0x05b66kqQv51011q"
          }
        }
      ]
    }' | dx run app-swiss-army-knife -f - -y
    
    Calling app-xxxx with output destination project-xxxx:/
    
    Job ID: job-xxxx

    Command Line Quickstart

    Learn to use the dx client for command-line access to the full range of DNAnexus Platform features.

    You must set up billing for your account before you can perform an analysis, or upload or egress data. Follow these instructions to set up billing.

    The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform, to upload, browse, and organize data, and to launch analyses.

    All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.

    Before You Begin

If you haven't already done so, download and install the DNAnexus Platform toolkit, which includes the dx command-line client, as well as a range of useful utilities.

    Getting Help

As you work, use the index of dx commands as a reference.

    On the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.

To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:

    Step 1: Log In

The first step is to log in. If you have not created a DNAnexus account, open the DNAnexus Platform and sign up. User signup is not supported on the command line.

Your authentication token and your current project settings are saved in a local configuration file, and you can start accessing your project.

You can generate an authentication token from the online DNAnexus Platform using the UI.
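If you have already generated a token, you can log in non-interactively by supplying it on the command line (the token shown is a placeholder):

$ dx login --token YOUR_AUTH_TOKEN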

    Step 2: Explore

    Public Projects

    Look inside some public projects that have already been set up. From the command line, enter the command:

By running the dx select command and picking a project, you perform the command-line equivalent of going to the project page for Reference Genome Files: AWS US (East) (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular genomes for use in analyses with your own data.

For more information about the dx select command, see the Changing Your Current Project page.

    DNAnexus-sponsored data is free to copy from this project as many times as needed.

List the data in the top-level directory of the project you've selected by running the command dx ls. View the contents of a folder by running the command dx ls <folder_name>.

    You can avoid typing out the full name of the folder by typing in dx ls C and then pressing <TAB>. The folder name auto-completes from there.

    You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, the contents of the publicly available project "Demo Data" are listed using both its name and ID.

    As shown above, you can use the -l flag with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.

    Describing DNAnexus Objects

You can use the dx describe command to learn more about files and other objects on the platform. Given a DNAnexus object ID or name, dx describe returns detailed information about the object. dx describe only returns results for data objects to which you have access.

    Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.

    Describing a File

Below, the C. elegans reference genome file located in the "Reference Genome Files: AWS US (East)" project used earlier is described (the project should be accessible from other regions as well). Note that you need to add a colon (:) after the project name; here that would be Reference Genome Files\: AWS US (East): .

    Describing a Project

Below, the publicly available Reference Genome Files project used earlier is described.

    Step 3: Create Your Own Project

Use the dx new project command to create a new project.

The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the Entity IDs page.

    The project is ready for uploading data and running analyses.

The dx new command can also be used to create other new entities, including new orgs or users. Use the command dx help new to see additional information. For more details, see the full list of dx commands.

    Step 4: Upload and Manage Your Data

To analyze a sample, you first need to upload it, using the dx upload command or the Upload Agent if installed. For this tutorial, download the file small-celegans-sample.fastq, which contains the first 25,000 C. elegans reads from SRR070372. This file is used in the sample analysis below.

For uploading multiple or large files, use the Upload Agent. It compresses files, uploads them in parallel over multiple HTTP connections, and supports resumable uploads.

The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until uploading is complete before returning the prompt and describing the result.

    If you run the same command but add the flag --brief, only the file ID (in the form of file-xxxx) is printed to the terminal. Other dx commands also accept the --brief flag and report only object IDs.
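For example (the file ID shown is a placeholder):

$ dx upload small-celegans-sample.fastq --brief
file-xxxx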

    Examining Data

To take a quick look at the first few lines of the file you uploaded, use the dx head command. By default, it prints the first 10 lines of the given file.

    Run it on the file you uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.

    Downloading Data

If you'd like to download a file from the platform, use the dx download command. This command uses the name of the file for the filename unless you specify your own with the -o or --output flag. The example below downloads the same C. elegans file that was uploaded previously.

    About Metadata

    Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".
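As a sketch, you can attach a tag or a property to the file you uploaded using the dx tag and dx set_properties commands (the tag and property values here are arbitrary examples):

$ dx tag small-celegans-sample.fastq worm
$ dx set_properties small-celegans-sample.fastq species="C. elegans"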

    Step 5: Analyze a Sample

For the next few steps, if you would like to follow along, you need a C. elegans FASTQ file. This tutorial maps the reads against the ce10 genome. If you haven't already, you can download and use the following FASTQ file, which contains the first 25,000 reads from SRR070372: small-celegans-sample.fastq.

You can also substitute your own reads file for a different species (though it may take longer to run the example). For convenience, DNAnexus has already imported a variety of reference genomes to the platform. If you have a FASTA file to use, upload it and create genome indices for BWA using the BWA FASTA Indexer app (platform login required to access these links).

The following walkthrough explains what each command does and shows which apps run. If you only want to convert a gzipped FASTQ file to a VCF via BWA and the FreeBayes Variant Caller, skip ahead to the Automation section to see the commands required to run the apps.

    Uploading Reads

    If you have not yet done so, you can upload a FASTQ file for analysis.

For more information about using the command dx upload, see the dx upload page.

    Mapping Reads

Next, use the BWA-MEM app (platform login required to access this link) to map the uploaded reads file to a reference genome.

    Finding the App Name

    If you don't know the command-line name of the app to run, you have two options:

1. Navigate to its web page from the Apps page (platform login required to access this link). The app's page shows how to run it from the command line. See the BWA-MEM FASTQ Read Mapper page for details on the app used here (platform login required).

2. Alternatively, search for apps from the command line by running dx find apps. The command-line name appears in parentheses in the output, as shown below.

    Installing and Running the App

Install the app using dx install and check that it has been installed. While you do not always need to install an app to run it, you may find it useful as a bookmarking tool.

You can run the app using dx run. When you run it without any arguments, it prompts you for required and then optional arguments. The reference file genomeindex_targz for this C. elegans sample is in .tar.gz format and can be found in the Reference Genome Files project for the region your project is in.

    Monitoring Your Job

You can use the dx watch command to monitor jobs. The command prints out the log file of the job, including the STDOUT, STDERR, and INFO printouts.

You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, you can use the dx find jobs command to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.

    Additional options are available to restrict your search of previous jobs, such as by their names or when they were run.
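For example, the following sketch restricts the listing by job name and by creation date (the name pattern and date are placeholders):

$ dx find jobs --name "BWA-MEM*"
$ dx find jobs --created-after 2023-01-01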

    Terminating Your Job

If for some reason you need to terminate your job before it completes, use the command dx terminate.
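For example, replacing job-xxxx with the ID of the job you want to stop:

$ dx terminate job-xxxx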

    After Your Job Finishes

    You should see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!

    Variant Calling

You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.

    This time, instead of relying on interactive mode to enter inputs, you provide them directly. First, look up the app's spec to determine the input names. Run the command dx run freebayes -h.

    Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. Notice that there are two required inputs that must be specified:

    1. Sorted mappings (sorted_bams): A list of files with a .bam extension.

    2. Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.

    You can also run dx describe freebayes for a more compact view of the input and output specifications. By default, it hides the advanced input options, but you can view them using the --verbose flag.
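For example, the following commands show the compact view and, with the --verbose flag, the advanced options as well (output omitted here):

$ dx describe freebayes
$ dx describe freebayes --verbose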

    Running the App with a One-Liner Using a Job-Based Object Reference

    It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. Use the -i flag to specify inputs as suggested by the output of dx run freebayes ‑h:

• sorted_bams: The output of the previous BWA step (see the Mapping Reads section for more information).

    • genome_fastagz: The ce10 genome in the Reference Genomes project.

To specify new job input using the output of a previous job, use a job-based object reference via the job-xxxx:<output field> syntax used earlier.

    You can use job-based object references as input even before the referenced jobs have finished. The system waits until the input is ready to begin the new job.

    Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.

    Automatically Running a Command After a Job Finishes

Use the dx wait command to wait for a job to finish. If you run the following command immediately after launching the FreeBayes app, it shows recent jobs only after the job has finished, as shown in the example below.

    Congratulations! You have called variants on a reads sample using the command line. Next, see how to automate this process.

    Automation

    The CLI enables automation of these steps. The following script assumes that you are logged in. It is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.

    Learn More

You can start scripting using dx. The --brief flag is useful for scripting. A list of all dx commands and flags is on the Index of dx Commands page.

For more detailed information about running apps and applets from the command line, see the Running Apps and Applets page.

For a comprehensive guide to the DNAnexus SDK, see the SDK documentation.

Want to start writing your own apps? Check out the Developer Portal for some useful tutorials.

    $ dx help ls
    usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
    [--env-help] [--brief | --summary | --verbose] [-a] [-l] [--obj]
    [--folders] [--full]
    [path]
    
    List folders and/or objects in a folder
    ... # output truncated for brevity
    $ dx login
    Acquiring credentials from https://auth.dnanexus.com
    Username: <your username>
    Password: <your password>
    
    No projects to choose from. You can create one with the command "dx new project".
    To pick from projects for which you only have VIEW permissions, use "dx select --level VIEW" or "dx select --public".
    dx select --public --name "Reference Genome Files*"
    $ dx ls
    C. Elegans - Ce10/
    D. melanogaster - Dm3/
    H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/
    H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/
    H. Sapiens - GRCh38/
    H. Sapiens - hg19 (Ion Torrent)/
    H. Sapiens - hg19 (UCSC)/
    M. musculus - mm10/
    M. musculus - mm9/
    $ dx ls "C. Elegans - Ce10/"
    ce10.bt2-index.tar.gz
    ce10.bwa-index.tar.gz
    ... # output truncated for brevity
    $ dx ls "Demo Data:/SRR100022/"
    SRR100022_1.filt.fastq.gz
    SRR100022_2.filt.fastq.gz
    $ dx ls -l "project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/"
    Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
    Folder : /SRR100022
    State   Last modified       Size     Name (ID)
    ... # output truncated for brevity
    $ dx describe "Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
    Result 1:
    ID                  file-BQbY9Bj015pB7JJVX0vQ7vj5
    Class               file
    Project             project-BQpp3Y804Y0xbyG4GJPQ01xv
    Folder              /C. Elegans - Ce10
    Name                ce10.fasta.gz
    State               closed
    Visibility          visible
    Types               -
    Properties          Assembly=UCSC ce10,
                        Origin=http://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZip
                        s/ce10.2bit, Species=Caenorhabditis elegans, Taxonomy
                        ID=6239
    Tags                -
    Outgoing links      -
    Created             Tue Sep 30 18:54:35 2014
    Created by          bhannigan
     via the job        job-BQbY8y80KKgP380QVQY000qz
    Last modified       Thu Mar  2 12:17:27 2017
    Media type          application/x-gzip
    archivalState       "live"
    Size                29.21 MB, sponsored by DNAnexus
    $ dx describe "Reference Genome Files\: AWS US (East):"
    Result 1:
    ID                  project-BQpp3Y804Y0xbyG4GJPQ01xv
    Class               project
    Name                Reference Genome Files: AWS US (East)
    Summary             
    Billed to           org-dnanexus
    Access level        VIEW
    Region              aws:us-east-1
    Protected           true
    Restricted          false
    Contains PHI        false
    Created             Wed Oct  8 16:42:53 2014
    Created by          tnguyen
    Last modified       Tue Oct 23 14:15:59 2018
    Data usage          0.00 GB
    Sponsored data      519.77 GB
    Sponsored egress    0.00 GB used of 0.00 GB total
    Tags                -
    Properties          -
    downloadRestricted  false
    defaultInstanceType "mem2_hdd2_x2"
    $ dx new project "My First Project"
    Created new project called "My First Project"
    (project-xxxx)
    Switch to new project now? [y/N]: y
    $ dx upload --wait small-celegans-sample.fastq
    [===========================================================>] Uploaded (16801690 of 16801690 bytes) 100% small-celegans-sample.fastq
    ID              file-xxxx
    Class           file
    Project         project-xxxx
    Folder          /
    Name            small-celegans-sample.fastq
    State           closed
    Visibility      visible
    Types           -
    Properties      -
    Tags            -
    Details         {}
    Outgoing links  -
    Created         Sun Jan  1 09:00:00 2017
    Created by      amy
    Last modified   Sat Jan  1 09:00:00 2017
    Media type      text/plain
    Size            16.02 MB
    $ dx head -n 12 small-celegans-sample.fastq
    @SRR070372.1 FV5358E02GLGSF length=78
    TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATATATATA
    +SRR070372.1 FV5358E02GLGSF length=78
    ...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=<96632
    @SRR070372.2 FV5358E02FQJUJ length=177
    TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
    +SRR070372.2 FV5358E02FQJUJ length=177
    222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
    @SRR070372.3 FV5358E02GYL4S length=70
    TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCATAAGGG
    +SRR070372.3 FV5358E02GYL4S length=70
    @@@@@DFFFFFHHHHHHHFBB@FDDBBBB=?::5555BBBBD??@?DFFHHFDDDDFFFDDBBBB<<410
    $ dx download small-celegans-sample.fastq
    [                                                            ] Downloaded 0 byte
    [===========================================================>] Downloaded 16.02 of
    [===========================================================>] Completed 16.02 of 16.02 bytes (100%) small-celegans-sample.fastq
    dx upload small-celegans-sample.fastq --wait
    $ dx find apps
    ...
    x BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
    ...
    $ dx install bwa_mem_fastq_read_mapper
    Installed the bwa_mem_fastq_read_mapper app
    $ dx find apps --installed
    BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
    $ dx run bwa_mem_fastq_read_mapper
    Entering interactive mode for input selection.
    
    Input:   Reads (reads_fastqgz)
    Class:   file
    Enter file ID or path (<TAB> twice for compatible files in current directory,'?' for help)
    reads_fastqgz[0]: <small-celegans-sample.fastq.gz>
    
    Input:   BWA reference genome index (genomeindex_targz)
    Class:   file
    
    Suggestions:
    project-BQpp3Y804Y0xbyG4GJPQ01xv://file-\* (DNAnexus Reference Genomes)
    Enter file ID or path (<TAB> twice for compatible files in current
    directory,'?' for more options)
    genomeindex_targz: <"Reference Genome Files\: <REGION_OF_PROJECT>:/C. Elegans - Ce10/ce10.bwa-index.tar.gz">
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
    [0] Reads (right mates) (reads2_fastqgz)
    [1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
    [2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
    [3] Read group platform (read_group_platform) [default="ILLUMINA"]
    [4] Read group platform unit (read_group_platform_unit) [default="None"]
    [5] Read group library (read_group_library) [default="1"]
    [6] Read group sample (read_group_sample) [default="1"]
    [7] Output all alignments for single/unpaired reads? (all_alignments)
    [8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
    [9] Advanced command line options (advanced_options)
    
    Optional param #: <ENTER>
    
    Using input JSON:
    {
        "reads_fastqgz": {
            "$dnanexus_link": {
                "project": "project-B3X8bjBqqBk1y7bVPkvQ0001",
                "id": "file-B3P6v02KZbFFkQ2xj0JQ005Y"
        }
    },
    "genomeindex_targz": {
            "$dnanexus_link": {
                "project": "project-xxxx(project ID for the reference genome in your region)",
                "id": "file-BQbYJpQ09j3x9Fj30kf003JG"
            }
        }
    }
    
    Confirm running the applet/app with this input [Y/n]: <ENTER>
    Calling app-BP2xVx80fVy0z92VYVXQ009j with output destination
         project-xxxx:/
    
    Job ID: job-xxxx
    $ dx find jobs
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main)(done) job-xxxx
    user-amy 20xx-xx-xx 0x:00:00 (runtime 0:00:xx)
    $ dx describe job-xxxx
    ...
    $ dx ls
    small-celegans-sample.bam
    small-celegans-sample.bam.bai
    small-celegans-sample.fastq
    $ dx describe small-celegans-sample.bam
    ...
    $ dx describe job-xxxx:sorted_bam
    ...
    $ dx run freebayes -y \
     -igenome_fastagz=Reference\ Genome\ Files:/C.\ Elegans\ -\ Ce10/ce10.fasta.gz \
     -isorted_bams=job-xxxx:sorted_bam
    
    Using input JSON:
    {
      "genome_fastagz": {
        "$dnanexus_link": {
          "project": "project-xxxx",
          "id": "file-xxxx"
        }
      },
      "sorted_bams": {
        "field": "sorted_bam",
        "job": "job-xxxx"
      }
    }
    
    Calling app-BFG5k2009PxyvYXBBJY00BK1 with output destination
    project-xxxx:/
    
    Job ID: job-xxxx
    $ dx wait job-xxxx && dx find jobs
    Waiting for job-xxxx to finish running...
    Done
    * FreeBayes Variant Caller (done) job-xxxx
    user-amy 2017-01-01 09:00:00 (runtime 0:05:24)
    ...
    #!/usr/bin/env bash
    # Usage: <script_name.sh> local_fastq_filename.fastq.gz
    
    reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
    bwa_indexed_reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.bwa-index.tar.gz"
    local_reads_file="$1"
    
    reads_file_id=$(dx upload "$local_reads_file" --brief)
    bwa_job=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=$reads_file_id -igenomeindex_targz="$bwa_indexed_reference" -y --brief)
    freebayes_job=$(dx run freebayes -isorted_bams=$bwa_job:sorted_bam -igenome_fastagz="$reference" -y --brief)
    
    dx wait $freebayes_job
    
    dx download $freebayes_job:variants_vcfgz -o "$local_reads_file".vcf.gz
    gunzip "$local_reads_file".vcf.gz
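
    For example, assuming the script above is saved as run_celegans_pipeline.sh (a hypothetical filename), it can be run against a local gzipped FASTQ file, leaving the uncompressed VCF alongside it:
    $ bash run_celegans_pipeline.sh small-celegans-sample.fastq.gz
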
    usage: dx run app-swiss-army-knife [-iINPUT_NAME=VALUE ...]
    
    App: Swiss Army Knife
    
    Version: 5.1.0 (published)
    
    A multi-purpose tool for all your basic analysis needs
    
    See the app page for more information:
      https://platform.dnanexus.com/app/swiss-army-knife
    
    Inputs:
      Input files: [-iin=(file) [-iin=... [...]]]
            (Optional) Files to download to instance temporary folder before
            command is executed.
    
      Command line: -icmd=(string)
            Command to execute on instance. View the app readme for details.
    
      Whether to use "dx-mount-all-inputs"?: [-imount_inputs=(boolean, default=false)]
            (Optional) Whether to mount all files that were supplied as inputs to
            the app instead of downloading them to the local storage of the
            execution worker.
    
      Public Docker image identifier: [-iimage=(string)]
            (Optional) Instead of using the default Ubuntu 24.04 environment, the
            input command <CMD> will be run using the specified publicly
            accessible Docker image <IMAGE> as it would be when running 'docker
            run <IMAGE> <CMD>'. Example image identifiers are 'ubuntu:25.04',
            'quay.io/ucsc_cgl/samtools'. Cannot be specified together with
            'image_file'. This input relies on access to internet and is unusable
            in an internet-restricted project.
    
      Platform file containing Docker image accepted by `docker load`: [-iimage_file=(file)]
            (Optional) Instead of using the default Ubuntu 24.04 environment, the
            input command <CMD> will be run using the Docker image <IMAGE> loaded
            from the specified image file <IMAGE_FILE> as it would be when running
            'docker load -i <IMAGE_FILE> && docker run <IMAGE> <CMD>'. Cannot be
            specified together with 'image'.
    
    Outputs:
      Output files: [out (array:file)]
            (Optional) New files that were created in temporary folder.

    Monitoring Executions

    Learn how to get information on current and past executions via both the UI and the CLI.

    Monitoring an Execution via the UI

    Getting Basic Information on an Execution

    To get basic information on a job (the execution of an app or applet) or an analysis (the execution of a workflow):

    1. Click on Projects in the main Platform menu.

    2. On the Projects list page, find and click on the name of the project within which the execution was launched.

    3. Click on the Monitor tab to open the Monitor screen.

    Available Basic Information on Executions

    The list on the Monitor screen displays the following information for each execution that is running or has been run within the project:

    • Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either via the UI, or via the CLI. The execution's name is used in Platform email alerts related to the execution. Clicking on a name in the executions list opens the execution details page, giving in-depth information on the execution.

    • State - This is the execution's state. State values include:

      • "Waiting" - The execution awaits Platform resource allocation or completion of dependent executions.

    Additional Basic Information

    Additional basic information can be displayed for each execution. To do this:

    1. Click on the "table" icon at the right edge of the table header row.

    2. Select one or more of the entries in the list, to display an additional column or columns.

    Available additional columns include:

    • Stopped Running - The time at which the execution stopped running.

    • Custom properties columns - If a custom property or properties have been assigned to any of the listed executions, a column can be added to the table for each such property, showing the value assigned to each execution for that property.

    Customizing the Executions List Display

    To remove columns from the list, click on the "table" icon at the right edge of the table header row, then de-select one or more of the entries in the list, to hide the column or columns.

    Filtering the Executions List

    A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.

    By default, pills are available to set search criteria for filtering executions by one or more of these attributes:

    • Name - Execution name

    • State - Execution state

    • ID - An execution's job ID or analysis ID

    • Executable

    Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.

    Search Scope

    By default, filters are set to display only root executions that meet the criteria defined in the filter. To include all executions, including those run during individual stages of workflows, click the button above the left edge of the executions list showing the default value "Root Executions Only," then click "All Executions."

    Saving and Reusing Filters

    To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.

    To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.

    Terminating an Execution from the Monitor Screen

    If you launched an execution or have contributor access to the project in which the execution is running, you can terminate the execution from the list on the Monitor screen when it is in a non-terminal state. You can also terminate executions launched by other project members if you have project admin status.

    To terminate an execution:

    1. Find the execution in the list:

      • Select the execution by clicking on the row. Click the red Terminate button that appears at the end of the header.

      • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Terminate in the menu.

    Getting Detailed Information on an Execution via the UI

    For additional information about an execution, click its name in the list on the Monitor screen to open its details page.

    Available Detailed Information on Executions

    The details page for an execution displays a range of information, including:

    • High-level details - The high-level information in this section includes:

      • For a standalone execution - such as a job without children - the display shows a single entry with details about the execution state, start and stop times, and duration in the running state.

      • For an execution with descendants - such as an analysis with multiple stages - the display shows a list with each row containing details about stage executions. For executions with descendants, click the "+" icon next to the name to expand the row and view descendant information. A page displaying detailed information on a stage appears when clicking on its name in the list. To navigate back to the workflow's details page, click its name in the "breadcrumb" navigation menu in the top right corner of the screen.

    Getting Help with Failed Executions

    For failed executions, a Cause of Failure pane appears above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:

    1. Click the button labeled Send Failure Report to DNAnexus Support.

    2. A form opens in a modal window, with pre-populated Subject and Message fields containing diagnostic information for DNAnexus Support.

    3. Click the button in the Grant Access section to grant DNAnexus Support "View" access to the project, enabling faster issue diagnosis and resolution.

    Launching a New Execution

    To re-launch a job from the execution details screen:

    1. Click the Launch as New Job button in the upper right corner of the screen.

    2. A new browser tab opens, displaying the Run App / Applet form.

    3. Configure the run, then click Start Analysis.

    To re-launch an analysis from the execution details screen:

    1. Click the Launch as New Analysis button in the upper right corner of the screen.

    2. A new browser tab opens, displaying the Run Analysis form.

    3. Configure the run, then click Start Analysis.

    Saving a Workflow as a New Workflow

    To save a copy of a workflow along with its input configurations under a new name from the execution details screen:

    1. Click the Save as New Workflow button in the upper right corner of the screen.

    2. In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.

    3. Click Save.

    Viewing Initial Tries for Restarted Jobs

    As described in Restartable Jobs, jobs can be configured to restart automatically on certain types of failures.

    If you want to view the execution details for the initial tries for a restarted job:

    1. Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.

    2. A modal window opens.

    3. Click the name of the try for which you'd like to view execution details.

    You can only send a failure report for the most recent try, not for any previous tries.

    Monitoring a Job via the CLI

    You can use dx watch to view the log of a running job or any past jobs, which may have finished successfully, failed, or been terminated.

    Monitoring a Running Job

    Use dx watch to view a job's log stream during execution. The log stream includes stdout, stderr, and additional worker output information.

    Terminating a Job

    To terminate a job before completion, use the command dx terminate.

    Monitoring Past Jobs

    Use the dx watch command to view the logs of completed jobs. The log stream includes stdout, stderr, and additional worker output information from the execution.

    Finding Executions via the CLI

    Use dx find executions to display the ten most recent executions in your current project. Specify a different number of executions by using dx find executions -n <specified number>. The output matches the information shown in the "Monitor" tab on the DNAnexus web UI.

    Below is an example of dx find executions. In this case, only two executions have been run in the current project. An individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow, are shown. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).

    The job running the DeepVariant Germline Variant Caller executable is running and has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of 2 stages, FreeBayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.

    Using dx find executions

    The dx find executions operation searches for jobs or analyses created when a user runs an app or applet. For jobs that are part of an analysis, the results appear in a tree representation linking related jobs together.

    By default, dx find executions displays up to ten of the most recent executions in your current project, ordered by creation time.

    Filter executions by job type using command flags: --origin-jobs shows only original jobs, while --all-jobs includes both original jobs and subjobs.
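
    For example, a minimal sketch using the two flags named above (output elided):
    # Show only original (top-level) jobs
    $ dx find executions --origin-jobs
    # Include subjobs as well as original jobs
    $ dx find executions --all-jobs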

    Finding Analyses via the CLI

    You can monitor analyses by using the command dx find analyses, which displays the top-level analyses, excluding contained jobs. Analyses are executions of workflows and consist of one or more app(let)s being run.

    Below is an example of dx find analyses:

    Finding Jobs via the CLI

    Jobs are runs of an individual app(let) and compose analyses. Monitor jobs using the dx find jobs command to display a flat list of jobs. For jobs within an analysis, the command returns all jobs in that analysis.

    Below is an example of dx find jobs:

    Advanced CLI Monitoring Options

    Searches for executions can be restricted using specific parameters, as described below.

    Viewing stdout and/or stderr from a Job Log

    • To extract stdout only from this job, run the command dx watch job-xxxx --get-stdout.

    • To extract stderr only from this job, run the command dx watch job-xxxx --get-stderr.

    • To extract both stdout and stderr from this job, run the command dx watch job-xxxx --get-streams.

    Below is an example of viewing stdout lines of a job log:

    Viewing Subjobs

    To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
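
    For example, a minimal sketch (the job ID is a placeholder):
    $ dx watch job-xxxx --tree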

    Viewing the First n Messages of a Job Log

    To view the first n messages of a job log, use the command dx watch job-xxxx -n 8, where 8 is the number of messages to display. If the job has already run, its output is displayed as well.

    In the example below, the app Sample Prints doesn't have any output.

    Finding and Examining Initial Tries for Restarted Jobs

    Jobs can be configured to restart automatically on certain types of failures, as described in the Restartable Jobs section. To view initial tries of the restarted jobs along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.

    Searching Across All Projects

    By default, dx find restricts searches to your current project context. Use the --all-projects flag to search across all accessible projects.

    Returning More Than Ten Results

    By default, dx find returns up to ten of the most recently launched executions matching your search query. Use the -n option to change the number of executions returned.

    Searching by Executable

    A user can restrict the search to executions of a specific app(let) or workflow, based on its entity ID.

    Searching by Execution Start Time

    Users can also use the --created-before and --created-after options to search based on when the execution began.

    Searching by Date

    Searching by Time

    Searching by Execution State

    Users can also restrict the search to a specific state, for example, "done", "failed", "terminated".

    Scripting

    Delimiters

    The --delim flag produces tab-delimited output, suitable for processing by other shell commands.

    Returning Only IDs

    Use the --brief flag to display only the object IDs for objects returned by your search query. The --origin-jobs flag excludes subjob information.

    Below is an example usage of the --brief flag:

    Below is an example of using the flags --origin-jobs and --brief. In the example below, the last job run in the current default project is described.

    Rerunning Time-Specific Failed Jobs With Updated Instance Types
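
    The search options above can be combined with dx run's --clone and --instance-type options to rerun a failed job on a larger instance. A minimal sketch (the job ID, dates, and instance type are placeholders):
    # Find jobs that failed within a given time period
    $ dx find jobs --state failed --created-after=2024-01-01 --created-before=2024-02-01 --brief
    # Rerun one of them with the same inputs and settings, but on a larger instance type
    $ dx run --clone job-xxxx --instance-type mem1_ssd1_x16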

    Rerunning Failed Executions With an Updated Executable

    Getting Information on Jobs That Share a Tag

    See more on using dx find jobs.

    Forwarding Job Logs to Splunk for Analysis

    A license is required to use this feature. Contact DNAnexus Sales for more information.

    Job logs can be automatically forwarded to a customer's Splunk instance for analysis.

    The Monitor screen shows a list of executions launched within the project. By default, executions appear in reverse chronological order, with the most recently launched execution at the top.
  • Find the row displaying information on the execution.

    • For an analysis (the execution of a workflow), click the "+" icon to the left of the analysis name to expand the row and view information on its stages. For executions with further descendants, click the "+" icon next to the name to expand the row and show additional details.

  • To see additional information on an execution, click on its name to be taken to its details page.

    • The following shortcuts allow you to view information from the details page directly on the list page, or relaunch an execution:

      • To view the Info pane:

        • Click the Info icon, above the right edge of the executions list, if it's not already selected, and then select the execution by clicking on the row.

        • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Info in the fly out menu.

      • To view the log file for a job, do either of the following:

        • Select the execution by clicking on the row. When a View Log button appears in the header, click it.

        • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Log in the fly out menu.

      • To re-launch a job, do either of the following:

        • Select the execution by clicking on the row. When a Launch as New Job button appears in the header, click it.

        • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row, then select Launch as New Job in the menu.

      • To re-launch an analysis, do either of the following:

        • Select the execution by clicking on the row. When a Launch as New Analysis button appears in the header, click it.

        • Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Launch as New Analysis in the menu.

  • "Running" - The job is actively executing.

  • "In Progress" - The analysis is actively processing.

  • "Done" - The execution completed successfully without errors.

  • "Failed" - The execution encountered an error and could not complete. See Types of Errors for troubleshooting assistance.

  • "Partially Failed" - An analysis reaches "Partially Failed" state if one or more workflow stages did not finish successfully, with at least one stage not in a terminal state (either "Done," "Failed," or "Terminated").

  • "Terminating" - The worker has initiated but not completed the termination process.

  • "Terminated" - The execution stopped before completion.

  • "Debug Hold" - The execution, run with debugging options, encountered an applicable failure and entered debugging hold.

  • Executable - The executable or executables run during the execution. If the execution is an analysis, each stage appears in a separate row, including the name of the executable run during the stage. If an informational page exists with details about the executable's configuration and use, the executable name becomes clickable, and clicking displays that page.

  • Tags - Tags are strings associated with objects on the platform. They are a type of metadata that can be added to an execution.

  • Launched By - The name of the user who launched the execution.

  • Launched On - The time at which the execution was launched. This time often precedes the time in the Started Running column due to executions waiting for available resources before starting.

  • Started Running - The time at which the execution started running, if it has done so. This is not always the same as its launch time, if it requires time waiting for available resources before starting.

  • Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.

  • Cost - A value is displayed in this column when the user has access to billing info for the execution. The figure shown represents either, for a running execution, an estimate of the charges it has incurred so far, or, for a completed execution, the total costs it incurred.

  • Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, either via the CLI or via the UI. This setting determines the scheduling priority of the execution relative to other executions that are waiting to be launched.

  • Worker URL - If the execution runs an executable, such as DXJupyterLab, with direct web URL connection capability, the URL appears here. Clicking the URL opens a connection to the executable in a new browser tab.

  • Output Folder - For each execution, the value shows a path relative to the project's root folder. Click the value to open the folder containing the execution's outputs.

  • Executable - A specific executable
  • Launched By - The user who launched an execution or executions

  • Launch Time - The time range within which executions were launched

  • A modal window opens, asking you to confirm the termination. Click Terminate to confirm.
  • The execution's state changes to "Terminating" during termination, then to "Terminated" once complete.

  • Execution state - In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times compared to each other. The colors include:

    • Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.

    • Green - A green bar indicates that the execution is in the "Done" state.

    • Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.

    • "Grey" indicates that the execution is in the "Terminated" state.

  • Execution start and stop times - Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").

  • Inputs - This section lists the execution inputs. Available input files appear as hyperlinks to their project locations. For inputs from other workflow executions, the source execution name appears as a hyperlink to its details page.

  • Outputs - This section lists the execution's outputs. Available output files appear as hyperlinks. Click a link to open the folder containing the output file.

  • Log files - An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files, and, as needed, download them in .txt format:

    • To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.

    • To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage, in the Execution Tree section.

  • Basic info - The Info pane, on the right side of the screen, displays a range of basic information on the execution, along with additional detail such as the execution's unique ID, and custom properties and tags assigned to it.

  • Reused results - For executions reusing results from another execution, the information appears in a blue pane above the Execution Tree section. Click the source execution's name to see details about the execution that generated these results.

  • Click Send Report to send the report.

    $ dx watch job-xxxx
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (running) job-xxxx
      amy 2024-01-01 09:00:00 (running for 0:00:37)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
    2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
    2024-01-01 09:06:37 Sample Prints STDOUT 0
    ...
    $ dx watch job-xxxx
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (running) job-xxxx
      amy 2024-01-01 09:00:00 (running for 0:00:37)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
    2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
    2024-01-01 09:06:37 Sample Prints STDOUT 0
    2024-01-01 09:06:37 Sample Prints STDOUT 1
    2024-01-01 09:06:37 Sample Prints STDOUT 2
    2024-01-01 09:06:37 Sample Prints STDOUT 3
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:08:11 (runtime 0:02:11)
      Output: -
    $ dx find executions
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:00:18 (running for 0:10:28)
    * Variant Calling Workflow (in_progress) analysis-xxxx
    │ amy 2024-01-01 09:00:18
    ├── * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
    │     amy 2024-01-01 09:00:18
    └── * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
          amy 2024-01-01 09:00:18 (running for 0:10:18)
    $ dx find analyses
    * Variant Calling Workflow (in_progress) analysis-xxxx
      amy 2024-01-01 09:00:18
    $ dx find jobs
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:10:00 (running for 0:00:28)
    * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
      amy 2024-01-01 09:00:18
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
      amy 2024-01-01 09:00:18 (running for 0:10:18)
    $ dx watch job-xxxx --get-streams
    Watching job job-xxxx. Press Ctrl+C to stop.
    dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    Invoking main with {}
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    $ dx watch job-xxxx -n 8
    Watching job job-xxxx. Press Ctrl+C to stop.
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:00:00 (runtime 0:02:11)
    2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
    2024-01-01 09:08:11 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
    2024-01-01 09:08:11 Sample Prints INFO Setting SSH public key
    2024-01-01 09:08:11 Sample Prints dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:00:00 (runtime 0:02:11)
      Output: -
    $ dx run swiss-army-knife -icmd="exit 1" \
        --extra-args '{"executionPolicy": { "restartOn":{"*":2}}}'
    
    $ dx find executions --include-restarted
    * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx tries
    ├── * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx try 2
    │     amy 2023-08-02 16:33:40 (runtime 0:01:45)
    ├── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 1
    │     amy 2023-08-02 16:33:40
    └── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
          amy 2023-08-02 16:33:40
    
    $ dx watch job-xxxx --try 0
    Watching job job-xxxx try 0. Press Ctrl+C to stop watching.
    * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
      amy 2023-08-02 16:33:40
    2023-08-02 16:35:26 Swiss Army Knife INFO Logging initialized (priority)
    $ dx find executions -n 3 --all-projects
    * Sample Prints (sample_prints:main) (done) job-xxxx
      amy 2024-01-01 09:15:00 (runtime 0:02:11)
    * Sample Applet (sample_applet:main) (done) job-yyyy
      ben 2024-01-01 09:10:00 (runtime 0:00:28)
    * Sample Applet (sample_applet:main) (failed) job-zzzz
      amy 2024-01-01 09:00:00 (runtime 0:19:02)
    # Find the 100 most recently launched jobs in your project
    $ dx find executions -n 100
    # Find most recent executions running app-deepvariant_germline in the current project
    $ dx find executions --executable app-deepvariant_germline
    * DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
      amy 2024-01-01 09:00:18 (running for 0:10:18)
    # Find executions run on January 2, 2024
    $ dx find executions --created-after=2024-01-01 --created-before=2024-01-03
    # Find executions created in the last 2 hours
    $ dx find executions --created-after=-2h
    # Find analyses created in the last 5 days
    $ dx find analyses --created-after=-5d
    # Find failed jobs in the current project
    $ dx find jobs --state failed
    $ dx find jobs --delim
    * Cloud Workstation (cloud_workstation:main) done  job-xxxx    amy   2024-01-07 09:00:00 (runtime 1:00:00)
    * GATK3 Human Exome Pipeline(gatk3_human_exome_pipeline:main)    done  job-yyyy amy 2024-01-07  09:00:00 (runtime 0:21:16)
    $ dx find jobs -n 3 --brief
    job-xxxx
    job-yyyy
    job-zzzz
    $ dx describe $(dx find jobs -n 1 --origin-jobs --brief)
    Result 1:
    ID                  job-xxxx
    Class               job
    Job name            BWA-MEM FASTQ Read Mapper
    Executable name     bwa_mem_fastq_read_mapper
    Project context     project-xxxx
    Billed to           amy
    Workspace           container-xxxx
    Cache workspace     container-yyyy
    Resources           container-zzzz
    App                 app-xxxx
    Instance Type       mem1_ssd1_x8
    Priority            high
    State               done
    Root execution      job-zzzz
    Origin job          job-zzzz
    Parent job          -
    Function            main
    Input               genomeindex_targz = file-xxxx
                    reads_fastqgz = file-xxxx
                    [read_group_library = "1"]
                    [mark_as_secondary = true]
                    [read_group_platform = "ILLUMINA"]
                    [read_group_sample = "1"]
                    [add_read_group = true]
                    [read_group_id = {"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
                    [read_group_platform_unit = "None"]
    Output              -
    Output folder       /
    Launched by         amy
    Created             Sun Jan  1 09:00:09 2024
    Started running     Sun Jan  1 09:00:10 2024
    Stopped running     Sun Jan  1 09:00:27 2024 (Runtime: 0:00:16)
    Last modified       Sun Jan  1 09:00:28 2024
    Depends on          -
    Sys Requirements    {"main": {"instanceType": "mem1_ssd1_x8"}}
    Tags                -
    Properties          -
    # Find failed jobs in the current project from a time period
    $ dx find jobs --state failed --created-after=2024-01-01 --created-before=2024-02-01
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
      amy 2024-01-22 09:00:00 (runtime 0:02:12)
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (done) job-yyyy
      amy 2024-01-07 06:00:00 (runtime 0:11:22)
    # Find all failed executions of specified executable
    $ dx find executions --state failed --executable app-bwa_mem_fastq_read_mapper
    * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
      amy 2024-01-01 09:00:00 (runtime 0:02:12)
    # From within the app's source directory, rebuild and update the app
    $ dx build -a
    INFO:dxpy:Archived app app-xxxx to project-xxxx:"/.App_archive/bwa_mem_fastq_read_mapper (Sun Jan  1 09:00:00 2024)"
    {"id": "app-yyyy"}
    # Rerun job with updated app
    $ dx run bwa_mem_fastq_read_mapper --clone job-xxxx
    $ dx find jobs --tag TAG
    usage: dx run freebayes [-iINPUT_NAME=VALUE ...]
    
    App: FreeBayes Variant Caller
    
    Version: 3.0.1 (published)
    
    Calls variants (SNPs, indels, and other events) using FreeBayes
    
    See the app page for more information:
      https://platform.dnanexus.com/app/freebayes
    
    Inputs:
      Sorted mappings: -isorted_bams=(file) [-isorted_bams=... [...]]
            One or more coordinate-sorted BAM files containing mappings to call
            variants for.
    
      Genome: -igenome_fastagz=(file)
            A file, in gzipped FASTA format, with the reference genome that the
            reads were mapped against.
    
            Suggestions:
              project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes: AWS US (East))
              project-F3zxk7Q4F30Xp8fG69K1Vppj://file-* (DNAnexus Reference Genomes: AWS Germany)
              project-F0yyz6j9Jz8YpxQV8B8Kk7Zy://file-* (DNAnexus Reference Genomes: Azure US (West))
              project-F4gXb605fKQyBq5vJBG31KGG://file-* (DNAnexus Reference Genomes: AWS Sydney)
              project-FGX8gVQB9X7K5f1pKfPvz9yG://file-* (DNAnexus Reference Genomes: Azure Amsterdam)
              project-GvGXBbk36347jYPxP0j755KZ://file-* (DNAnexus Reference Genomes: Bahrain)
    
      Target regions: [-itargets_bed=(file)]
            (Optional) A BED file containing the coordinates of the genomic
            regions to intersect results with. Supplying this will cause 'bcftools
            view -R' to be used, to limit the results to that subset. This option
            does not speed up the execution of FreeBayes.
    
            Suggestions:
              project-B6JG85Z2J35vb6Z7pQ9Q02j8:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS US (East))
              project-F3zqGV04fXX5j7566869fjFq:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Germany)
              project-F29g0xQ90fvQf5z1BX6b5106:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure US (West))
              project-F4gYG1850p1JXzjp95PBqzY5:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Sydney)
              project-FGXfq9QBy7Zv5BYQ9Yvqj9Xv:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure Amsterdam)
              project-GvGXBZk3f624QVfBPjB8916j:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Bahrain)
    
     Common
      Output prefix: [-ioutput_prefix=(string)]
            (Optional) The prefix to use when naming the output files (they will
            be called prefix.vcf.gz, prefix.vcf.gz.tbi). If not provided, the
            prefix will be the same as the first BAM file given.
    
      Apply standard filters?: [-istandard_filters=(boolean, default=true)]
            Select this to use stringent input base and mapping quality filters,
            which may reduce false positives. This will supply the
            '--standard-filters' option to FreeBayes.
    
      Normalize variants representation?: [-inormalize_variants=(boolean, default=true)]
            Select this to use 'bcftools norm' in order to normalize the variants
            representation, which may help with downstream compatibility.
    
      Perform parallelization?: [-iparallelized=(boolean, default=true)]
            Select this to parallelize FreeBayes using multiple threads. This will
            use the 'freebayes-parallel' script from the FreeBayes package, with a
            granularity of 3 million base pairs. WARNING: This option may be
            incompatible with certain advanced command-line options.
    
     Advanced
      Report genotype qualities?: [-igenotype_qualities=(boolean, default=false)]
            Select this to have FreeBayes report genotype qualities.
    
      Add RG tags to BAM files?: [-ibam_add_rg=(boolean, default=false)]
            Select this to have FreeBayes add read group tags to the input BAM
            files so each file will be treated as an individual sample. WARNING:
            This may increase the memory requirements for FreeBayes.
    
      Advanced command line options: [-iadvanced_options=(string)]
            (Optional) Advanced command line options that will be supplied
            directly to the FreeBayes program.
    
    Outputs:
      Variants: variants_vcfgz (file)
            A bgzipped VCF file with the called variants.
    
      Variants index: variants_tbi (file)
            A tabix index (TBI) file with the associated variants index.

    Tools List

    Public Tools

    Annotation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Data Transfer Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    DNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    GWAS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    File Transfer Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Imaging Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Import Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Interactive Analysis Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Joint Genotyping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Mapping Manipulation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    PheWAS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    PRS Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    QC Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Quantification Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Read Mapping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Read Manipulation Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    RNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    RNAseq Notebooks

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Utility Apps

    Name of Tool
    Name in CLI
    Scientific Algorithm
    Common Uses

    Variant Calling Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Visualization Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Titan Tools

    A Titan license is required to access and use these tools. Contact DNAnexus Sales for more information.

    Statistics Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Apollo Tools

    An Apollo license is required to access and use these tools. Contact DNAnexus Sales for more information.

    Dataset Administration Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Dataset Management Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Data Science Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Third Party Tools

    Tools in this section are created and maintained by their respective vendors and may require separate licenses.

    DNAseq Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Joint Cohort Genotyping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Read Mapping Apps

    Name of Tool
    Name in CLI
    Scientific Algorithms
    Common Uses

    Retrieve reads in FASTQ format from SRA

    Short read alignment

    gatk4_haplotypecaller_parallel

    Variant calling, post-alignment QC

    gatk4_genotypegvcfs_single_sample_parallel

    Variant calling

    picard_mark_duplicates

    Variant Calling- remove duplicates, post-alignment

    saige_gwas_gbat

    saige_gwas_svat

    saige_gwas_grm

    saige_gwas_sparse_grm

    plink_pipeline

    Plink2

    plato_pipeline

    Plato, Plink2

    locuszoom

    LocusZoom

    GWAS, visualization

    Interactive radiology image analysis

    Transcriptomics Expression Quantification

    fastqc

    FastQC

    Building reference for BWA alignment

    bwa_mem_fastq_read_mapper

    BWA-MEM

    Short read alignment

    star_generate_genome_index

    (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    RNA Seq- indexing

    star_mapping

    (Spliced Transcripts Alignment to a Reference)

    RNA Seq- mapping

    subread_feature_counts

    featureCounts

    Read summarization, RNAseq

    salmon_index_builder

    Salmon

    Transcriptomics Expression Quantification

    salmon_mapping_quant

    Salmon

    Transcriptomics Expression Quantification

    Read quality trimming, adapter trimming

    RNA Seq- mapping

    subread_feature_counts

    featureCounts

    Read summarization, RNAseq

    star_generate_genome_index

    (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    Transcriptomics Expression Quantification

    star_mapping

    (Spliced Transcripts Alignment to a Reference)

    Transcriptomics Expression Quantification

    salmon_index_builder

    Salmon

    Transcriptomics Expression Quantification

    salmon_mapping_quant

    Salmon

    Transcriptomics Expression Quantification

    salmon_quant

    Salmon

    Transcriptomics Expression Quantification

    Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb

    GENIE3

    Data processing tools

    ttyd

    N/A

    Unix shell on a platform cloud worker in your browser. Use it for on-demand CLI operations and to launch https apps on 2 extra ports

    Variant calling

    picard_mark_duplicates

    Variant Calling- remove duplicates, post-alignment

    freebayes

    Use for short variant calls

    gatk4_mutect2_variant_caller_and_filter

    Somatic variant calling and post calling filtering

    gatk4_somatic_panel_of_normals_builder

    Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2.

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in spark databases and tables

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, nipype, freesurfer, FSL

    Running imaging processing related analysis

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, keras, scikit-learn, TensorFlow, torch, monai, monailabel

    Running image processing related analysis, building and testing models and algorithms in an interactive way

    Data Extraction

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    WGS, WES, accelerated analysis

    sentieon-tnbam

    Sentieon's BAM to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    pbdeepvariant

    Deepvariant

    Variant calling, accelerated analysis

    sentieon-umi

    Sentieon's pre-processing and alignment pipeline for next-generation sequence

    WGS, WES, accelerated analysis

    sentieon-dnabam

    Sentieon's BAM to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    sentieon-joint_genotyping

    Sentieon GVCFtyper

    WGS, WES, accelerated analysis

    sentieon-ccdg

    Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline

    WGS, WES, accelerated analysis

    sentieon-dnaseq

    Sentieon's FASTQ to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    WGS, WES, accelerated analysis

    pbdeepvariant

    Deepvariant

    Variant calling, accelerated analysis

    sentieon-umi

    Sentieon's pre-processing and alignment pipeline for next-generation sequence

    WGS, WES, accelerated analysis

    sentieon-ccdg

    Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline

    WGS, WES, accelerated analysis

    sentieon-dnaseq

    Sentieon's FASTQ to VCF germline analysis pipeline

    WGS, WES, accelerated analysis

    SnpEff Annotate

    snpeff_annotate

    SnpEff

    Annotation

    SnpSift Annotate

    snpsift_annotate

    SnpSift

    Annotation

    AWS S3 Exporter

    aws_platform_to_s3_file_transfer

    AWS S3

    AWS S3 Importer

    aws_s3_to_platform_files

    AWS S3

    SRA FASTQ Importer

    sra_fastq_importer

    Original Quality Functionally Equivalent

    oqfe

    A revision of Functionally Equivalent

    WGS, WES- alignment and duplicate marking

    GATK4 Base Quality Score Recalibrator (Parallel Per-Chrom)

    gatk4_bqsr_parallel

    GATK4- BaseRecalibrator and ApplyBQSR

    Variant calling

    BWA-MEM FASTQ Read Mapper

    bwa_mem_fastq_read_mapper

    REGENIE

    regenie

    REGENIE

    GWAS

    PLINK GWAS

    plink_gwas

    PLINK2

    Raremetal2 - Raremetal and Raremetalworker

    raremetal2

    URL Fetcher

    url_fetcher

    N/A

    Fetches a file from a URL onto the DNAnexus Platform

    Imaging Multitool - MONAI

    imaging_multitool_monai

    MONAI Core

    Image processing

    3D Slicer

    3d_slicer

    3D Slicer

    Visualization and image analysis

    3D Slicer with MONAI Label

    3d_slicer_monai

    SRA FASTQ Importer

    sra_fastq_importer

    SRA tools: fasterq-dump

    Retrieve reads in FASTQ format from SRA

    URL Fetcher

    url_fetcher

    N/A

    Fetches a file from a URL onto the DNAnexus Platform

    Cloud Workstation

    cloud_workstation

    N/A

    SSH-accessible unix shell on a platform cloud worker. Use it for on-demand analysis of platform data.

    ttyd

    ttyd

    N/A

    Unix shell on a platform cloud worker in your browser. Use it for on-demand CLI operations and to launch https apps on 2 extra ports

    GLnexus

    glnexus

    GLnexus

    This app can also be used to create pVCF without running joint genotyping

    SAMtools Mappings Indexer

    samtools_index

    SAMtools - SAMtools index

    Building BAM index file

    SAMtools Mappings Sorter

    samtools_sort

    SAMtools - SAMtools sort

    Sort alignment result based on coordinates

    PHESANT

    phesant

    PHESANT

    PheWAS

    PRSice 2

    prsice2

    PRSice-2

    Polygenic risk scores

    MultiQC

    multiqc

    MultiQC

    QC reporting

    Qualimap2 Analysis

    qualimap2_anlys

    Qualimap2

    QC

    RNASeQC

    rnaseqc

    Salmon Quantification

    salmon_quant

    Salmon

    Transcriptomics Expression Quantification

    Salmon Mapping and Quantification

    salmon_mapping_quant

    Salmon

    Transcriptomics Expression Quantification

    Bowtie2 FASTA Indexer

    bowtie2_fasta_indexer

    bowtie2: bowtie2-build

    Building reference for Bowite2 alignment

    Bowtie2 FASTQ Read Mapper

    bowtie2_fastq_read_mapper

    bowtie2, SAMtools view, SAMtools sort, SAMtools index

    Short read alignment

    BWA FASTA Indexer

    bwa_fasta_indexer

    GATK4 Base Quality Score Recalibrator (Parallel Per-Chrom)

    gatk4_bqsr_parallel

    GATK4- BaseRecalibrator and ApplyBQSR

    Variant calling

    Flexbar FASTQ Read Trimmer

    flexbar_fastq_read_trimmer

    flexbar

    QC

    Trimmomatic

    trimmomatic

    RNASeQC

    rnaseqc

    RNASeQC 2

    Transcriptomics Expression Quantification

    STAR Generate Genome Index

    star_generate_genome_index

    STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)

    RNA Seq- indexing

    STAR Mapping

    star_mapping

    Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb

    Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb

    DESeq2

    Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb

    Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb

    WebGestaltR

    Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb

    Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb

    File Concatenator

    file_concatenator

    N/A

    Gzip File Compressor

    gzip

    gzip

    Swiss Army Knife

    swiss-army-knife

    CNVkit

    cnvkit_batch

    CNVkit

    Copy Number Variant

    GATK4 HaplotypeCaller (Parallel Per-IntervalByNs)

    gatk4_haplotypecaller_parallel

    GATK4- HaplotypeCaller module

    Variant calling, post-alignment QC

    GATK4 Single-Sample GenotypeGVCFs (Parallel Per-Chrom)

    gatk4_genotypegvcfs_single_sample_parallel

    LocusZoom

    locuszoom

    LocusZoom

    GWAS, visualization

    JupyterLab with ML

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, cntk, keras, scikit-learn, TensorFlow, torch

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Python_R

    dxjupyterlab

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Stata

    dxjupyterlab

    Data Model Loader

    data_model_loader_v2

    Dataset Creation

    Dataset Creation

    Dataset Extender

    dataset-extender

    Dataset Extension

    Dataset Extension

    CSV Loader

    csv-loader

    N/A

    Data Loading

    Spark SQL Runner

    spark-sql-runner

    Spark SQL

    Dynamic SQL Execution

    Table Exporter

    table-exporter

    JupyterLab with Spark Cluster with Glow

    dxjupyterlab_spark_cluster

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, Glow

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Spark Cluster with Hail

    dxjupyterlab_spark_cluster

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL

    Running analyses, visualizing data, building and testing models and algorithms in an interactive way

    JupyterLab with Spark Cluster with Hail and VEP

    dxjupyterlab_spark_cluster

    Sentieon somatic FASTQ to VCF

    sentieon-tnseq

    Sentieon's FASTQ to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    Sentieon BWA-MEM and Sentieon De-duplication

    sentieon-bwa

    Sentieon's FASTQ to BAM/CRAM pipeline

    WGS, WES, accelerated analysis

    Germline Pipeline (NVIDIA Clara Parabricks accelerated)

    pbgermline

    Sentieon distributed Joint Genotyping

    sentieon-joint_genotyping

    Sentieon GVCFtyper

    WGS, WES, accelerated analysis

    Sentieon somatic FASTQ to VCF

    sentieon-tnseq

    Sentieon's FASTQ to VCF somatic analysis pipeline

    WGS, WES, accelerated analysis

    Sentieon BWA-MEM and Sentieon De-duplication

    sentieon-bwa

    Sentieon's FASTQ to BAM/CRAM pipeline

    WGS, WES, accelerated analysis

    Germline Pipeline (NVIDIA Clara Parabricks accelerated)

    pbgermline


    BWA-MEM

    RAREMETALWORKER, RAREMETAL


    BWA- bwa index

    (Spliced Transcripts Alignment to a Reference)

    WGCNA, topGO

    bcftools, bedtools, bgzip, plink, sambamba, SAMtools, seqtk, tabix, vcflib, Plato, QCTool, vcftools, plink2, Picard

    dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization

    N/A

    dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow

    BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration

    BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration

    REGENIE, BOLT-LMM, BGEN, outbreaks, prevalence, Stata, stata_kernel, 3D Slicer, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL, Ensembl Variant Effect Predictor
    SRA toolkit fasterq-dump
    GATK4 HaplotypeCaller (Parallel Per-IntervalByNs)
    GATK4- HaplotypeCaller module
    GATK4 Single-Sample GenotypeGVCFs (Parallel Per-Chrom)
    GATK4- GenotypeGVCFs module
    Picard MarkDuplicates Mappings Deduplicator
    MarkDuplicates from the Picard suite of tools
    SAIGE GWAS - Gene and region based association tests
    SAIGE
    SAIGE GWAS - Single variant association tests
    SAIGE
    SAIGE GWAS GRM
    SAIGE
    SAIGE-GWAS Sparse GRM
    SAIGE
    PLINK GWAS Pipeline
    PLATO GWAS Pipeline
    LocusZoom
    3D Slicer
    MONAI Label
    RNASeQC 2
    FastQC Reads Quality Control
    BWA-MEM FASTQ Read Mapper
    STAR Generate Genome Index
    STAR
    STAR Mapping
    STAR
    Subread featureCounts
    Salmon Index Builder
    Salmon Mapping and Quantification
    trimmomatic
    STAR
    Subread featureCounts
    STAR Generate Genome Index
    STAR
    STAR Mapping
    STAR
    Salmon Index Builder
    Salmon Mapping and Quantification
    Salmon Quantification
    Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb
    ttyd
    GATK4- GenotypeGVCFs module
    Picard MarkDuplicates Mappings Deduplicator
    MarkDuplicates from the Picard suite of tools
    FreeBayes Variant Caller
    FreeBayes
    GATK4 Mutect2 Variant Caller and Filter
    GATK Mutect2
    GATK4 Somatic Panel Of Normals Builder
    GATK CreateSomaticPanelOfNormals
    JupyterLab with IMAGE_PROCESSING
    JupyterLab with MONAI_ML
    Sentieon somatic BAM/CRAM to VCF
    DeepVariant Pipeline (Parabricks accelerated)
    Sentieon UMI
    Sentieon germline BAM/CRAM to VCF
    Sentieon distributed Joint Genotyping
    Sentieon Functional equivalent protocol
    Sentieon germline FASTQ to VCF
    DeepVariant Pipeline (Parabricks accelerated)
    Sentieon UMI
    Sentieon Functional equivalent protocol
    Sentieon germline FASTQ to VCF

    Running Nextflow Pipelines

    This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.

    A license is required to create a DNAnexus app or applet from the Nextflow script folder. Contact DNAnexus Sales for more information.

    This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.

    To run a Nextflow pipeline on the DNAnexus Platform:

    1. Import the pipeline script from a remote repository or local disk.

    2. Convert the script to an app or applet.

    3. Run the app or applet.

You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx command-line client.

Use the latest version of dx-toolkit to take advantage of recent improvements and bug fixes.

As of dx-toolkit version v0.391.0, pipelines built using dx build --nextflow default to running on Ubuntu 24.04. To use Ubuntu 20.04 instead, override the default by specifying the release in --extra-args:
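dx build --nextflow --extra-args='{"runSpec": {"release": "20.04"}}'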

This documentation covers features available in dx-toolkit versions beginning with v0.378.0.

    Quickstart

    Pipeline Script Folder Structure

    A Nextflow pipeline script is structured as a folder with Nextflow scripts with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

    • (Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.

• (Optional) A nextflow.config configuration file.

• (Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. For more information on how the exposed parameters are used at run time, see Specifying Input Values to a Nextflow Pipeline Executable.

• (Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.

An nf-core flavored folder structure is encouraged but not required.

    Importing a Nextflow Pipeline

    Import via UI

    To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project's Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.

Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, https://github.com/nextflow-io/hello. Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.

    An example of the Import Pipeline/Workflow modal:

    Click the Start Import button after providing the necessary information. This starts a pipeline import job in the project specified in the Import To field (default is the current project).

    After launching the import job, a status message "External workflow import job started" appears.

    Access information about the pipeline import job in the project's Monitor tab:

    After the import finishes, the imported pipeline executable exists as an applet. This is the output of the pipeline import job:

    The newly created Nextflow pipeline applet appears in the project, for example, hello.

    Import via CLI from a Remote Repository

To import a Nextflow pipeline from a remote repository via the CLI, run the following command and specify the repository's URL. You can also provide optional information, such as a repository tag and an import destination:
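A sketch of such a command (the tag value v1.1 and the project ID are placeholders):

$ dx build --nextflow \
  --repository https://github.com/nextflow-io/hello \
  --repository-tag v1.1 \
  --destination project-xxxx:/applets/hello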

Use the latest version of dx-toolkit to take advantage of recent improvements and bug fixes.

    All versions beginning with v0.338.0 support converting Nextflow pipelines to apps or applets.

    This documentation covers features available in dx-toolkit versions beginning with v0.370.0.

A Nextflow license is required for your destination project's billTo to build Nextflow pipeline applets. Contact DNAnexus Sales for more information.

For Nextflow pipelines stored in private repositories, access requires credentials provided via the --git-credentials option, with a DNAnexus file containing your authentication details. Specify the file using either its qualified ID or its path on the Platform. See the Private Nextflow Pipeline Repository section for more details on setting up and formatting these credentials.

    Once the pipeline import job finishes, it generates a new Nextflow pipeline applet with an applet ID in the form applet-zzzz.

    Use dx run -h to get more information about running the applet:

    Building from a Local Disk

Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the nextflow-io/hello pipeline from GitHub on your local laptop, stored in a directory named hello, which contains the following files:

Ensure that the folder structure is in the required format, as described in the Pipeline Script Folder Structure section above.

To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:

A Nextflow license is required for your destination project's billTo to build Nextflow pipeline applets. Contact DNAnexus Sales for more information.

    This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.

The dx run -h command can be run to see information about this applet, similar to the example shown above.

    A Nextflow pipeline applet has a type nextflow under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.

For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and find the Nextflow section for all arguments that are supported for building a Nextflow pipeline applet.

    Building a Nextflow Pipeline App from a Nextflow Pipeline Applet

You can also build a Nextflow pipeline app from a Nextflow pipeline applet by running the command dx build --app --from applet-xxxx.

    Running a Nextflow Pipeline Executable (App or Applet)

    Running a Nextflow Pipeline Executable via UI

    You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab is displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.

    Running a Nextflow Pipeline Applet via CLI

To run the Nextflow pipeline applet or app, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:

You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:

    Monitoring Jobs

    Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob handles a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).

    Monitor the detail log of the head job and the subjobs through each job's DNAnexus log via the UI or the CLI.

    On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.

    Monitoring in the UI

    Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. The costs (when your account has permission) and resource usage of a job are also viewable.

    An example of the log of a head job:

    An example of the log of a subjob:

    Monitoring in the CLI

From the CLI, you can use the dx watch command to check the status and view the log of the head job or each subjob.

    Monitoring the head job:

    Monitoring a subjob:

    Advanced Options: Running a Nextflow Pipeline Executable (App or Applet)

    Nextflow Execution on DNAnexus

The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor, and multiple subjobs each running a single Nextflow process. Throughout the pipeline's execution, the head job remains in the "running" state and supervises the job tree's execution.

    Nextflow Execution Log File

When a Nextflow head job (job-xxxx) enters its terminal state, either "done" or "failed", the system writes a Nextflow log file named nextflow-<job-xxxx>.log to the destination path of the head job.

    Private Docker Repository

DNAnexus supports Docker as the container engine for the Nextflow pipeline execution environment. The pipeline developer may reference a public Docker repository or a private one. When the pipeline references a private Docker repository, provide your Docker credential file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree.

    Syntax of a private Docker credential:
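A minimal sketch, assuming a registry/username/token layout; consult the current Platform documentation for the authoritative field names:

{
  "docker_registry": {
    "registry": "quay.io",
    "username": "my-registry-user",
    "token": "my-registry-api-token"
  }
}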

    Store this credential file in a separate project with restricted access permissions for security.

    Nextflow Pipeline Executable Inputs and Outputs

    Specifying Input Values to a Nextflow Pipeline Executable

    Below are all possible means that you can specify an input value at build time and runtime. They are listed in order of precedence (items listed first have greater precedence and override items listed further down the list):

    1. Executable (app or applet) run time

      1. DNAnexus Platform app or applet input.

        • CLI example: dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy

    Formats of PATH to File, Folder, or Wildcards

While you can specify a file input parameter's value in different places, as seen above, the valid PATH format referring to the same file differs. It depends on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable's input parameter. Examples are given below.

    Scenarios
    Valid PATH format

    Specifying a Nextflow Job Tree Output Folder

When launching a DNAnexus job, you can specify a job-level output destination such as project-xxxx:/destination/ using the platform-level optional parameter in the UI or on the CLI. For pipelines with publishDir settings, each output file is saved to <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned by the Nextflow script's process.

Read more detail about the output folder specification and publishDir below. Find an example of how to construct output paths of an nf-core pipeline job tree at run time in the FAQ.

    Using an AWS S3 Bucket as a Work Directory for Nextflow Pipeline Runs

    You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.

    Step 1. Configure Your AWS Account to Trust the DNAnexus Platform as an OIDC Identity Provider

Follow the steps outlined in the Platform documentation to configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to note the value entered in the "Audience" field. This value is required in a configuration file used by your pipeline to enable pipeline runs to access the S3 bucket.

    Step 2. Configure an AWS IAM Role with the Proper Trust and Permissions Policies

Next, configure an AWS Identity and Access Management (IAM) role such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket.

    Permissions Policy

    The following example shows how to structure an IAM role's permission policy, to enable the role to use an S3 bucket - accessible via the S3 URI s3://my-nextflow-s3-workdir - as the work directory of Nextflow pipeline runs:
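A sketch of such a permissions policy; the action list shown reflects the description below and can be tailored to your needs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-nextflow-s3-workdir",
        "arn:aws:s3:::my-nextflow-s3-workdir/*"
      ]
    }
  ]
}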

    In the above example:

    • The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.

• The two entries in the "Resource" list enable the role to access the bucket accessible via the S3 URI s3://my-nextflow-s3-workdir, as well as all objects within it.

    Trust Policy

    The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:
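A rough sketch of such a trust policy, assuming standard AWS OIDC web-identity federation syntax; the provider ARN, audience value, and exact condition-key formats should be taken from your own AWS setup and the Platform documentation:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<your-aws-account-id>:oidc-provider/job-oidc.dnanexus.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "job-oidc.dnanexus.com/:aud": "<audience-value-from-step-1>",
          "job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
        }
      }
    }
  ]
}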

    In the above example:

    • To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).

    • To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).

    • Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, as accessible at job-oidc.dnanexus.com.

    Step 3. Configure Your Nextflow Pipeline's Configuration File to Access the S3 Bucket

Next, you need to configure your pipeline so that, when it's run, it can access the S3 bucket. To do this, add a dnanexus config scope, in a configuration file, that includes the properties shown in this example:
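A sketch of the relevant nextflow.config fragment; the bucket name, audience value, role ARN, and region below are placeholders:

dnanexus {
    workDir               = 's3://my-nextflow-s3-workdir'
    jobTokenAudience      = '<audience-value-from-step-1>'
    jobTokenSubjectClaims = 'project_id,launched_by'
    iamRoleArnToAssume    = 'arn:aws:iam::<your-aws-account-id>:role/<role-name>'
}

aws {
    region = 'us-east-1'
}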

    In the above example:

    • workDir is the path to the bucket to be used as a work directory, in S3 URI format.

• jobTokenAudience is the value of "Audience" you defined in Step 1 above.

• jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus job identity token custom claims - for example, project_id, launched_by - that the job must present to assume the role that enables bucket access.

• iamRoleArnToAssume is the Amazon Resource Name (ARN) of the role that you configured in Step 2 above, and that is assumed by jobs to access the bucket.

• You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter, within an aws config scope.

    Using Subject Claims to Control Bucket Access

    When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:

Having included custom subject claims in the trust policy for the role, you then need, in the aforementioned Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, entered in the same order in which you entered them in the trust policy.

For example, if you configured a role's trust policy per the example above, you are requiring a job, in order to assume the role, to present the custom subject claims project_id and launched_by, in that order. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, to 'project_id,launched_by', as in the example scope earlier in this section.

Within the dnanexus config scope, you must also set the value of iamRoleArnToAssume to the ARN of the appropriate role, as in that same example scope.

    Advanced Options: Building a Nextflow Pipeline Executable

    Nextflow Pipeline Executable Permissions

By default, the Platform limits apps' and applets' ability to read and write data. Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:

    • External internet access ("network": ["*"]) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external Docker registries at runtime.

    • UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - This is required in order for Nextflow pipeline jobs to record the progress of executions, and preserve the run cache, to enable resume functionality.

You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building from a local disk, using the --extra-args flag with dx build. An example:
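A sketch of such an override, assuming the permission settings are supplied under the executable specification's "access" key:

$ dx build --nextflow /path/to/pipeline \
  --extra-args '{"access": {"network": [], "allProjects": "VIEW"}}'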

    Here are the key points:

    • "network": [] prevents jobs from accessing the internet.

    • "allProjects":"VIEW" increases jobs' access permission level to VIEW. This means that each job has "read" access to projects that can be accessed by the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs - via a , for example, - from projects other than the one in which a job is being run.

    Advanced Building and Importing Pipelines

    Additional options exist for dx build --nextflow:

    Option
    Class
    Description

    Use dx build --help for more information.

    Private Nextflow Pipeline Repository

    When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:

To safeguard this credentials file object, store it in a separate project that only you can access.

    Platform File Objects as Runtime Docker Images

When building a Nextflow pipeline executable, you can replace any Docker container reference with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.

    This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. Also, it fortifies data security by eliminating the need for internet access to external resources, during pipeline execution.

Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching or manually preparing the tarballs.

    Built-in Docker Image Caching vs. Manually Preparing Tarballs

    Built-in Docker image caching
    Manually preparing tarballs

    Built-in Docker Image Caching

This method initiates a building job that begins by taking the pipeline script, then identifies the Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling in the tarballs as bundledDepends.

    You can use built-in caching via the CLI by using the flag --cache-docker at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.

    An example:
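# Sketch: build from the local hello pipeline folder and cache its Docker images
$ dx build --nextflow /path/to/hello \
  --destination project-xxxx:/applets/hello \
  --cache-docker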

    If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag --docker-secrets, a file object that contains the credentials needed to access the repository. This object must be in the following format:

• When a pipeline requires specific inputs, such as file objects, sample values must be present within the project in which the building job is to execute. These values must be provided along with the flag --nextflow-pipeline-params.

      • It's crucial that these sample values be structured in the same way as actual input data is structured. This ensures that the execution logic of the Nextflow pipeline remains intact. During the build process, use small files, containing data representative of the larger dataset, as sample data, to reduce file localization overhead.

    Manually Preparing Tarballs

    You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:

    Option A: Reference each tarball by its unique Platform ID such as dx://project-xxxx:file-yyyy. Use this approach if you want deterministic execution behavior.

    You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:
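// In main.nf - the process name and file ID below are placeholders
process align_reads {
    container 'dx://project-xxxx:file-yyyy'

    script:
    """
    echo "running inside the container loaded from the Platform tarball"
    """
}

// Or in nextflow.config
process {
    withName: 'align_reads' {
        container = 'dx://project-xxxx:file-yyyy'
    }
}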

    When accessing a Platform project, a Nextflow pipeline job needs the VIEW or higher permission to the project.

Option B: Within a Nextflow pipeline script, you can also reference a Docker image by using its full image name. Use this name within a path that's in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.

    An example:
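// One reading of Option B, sketched in nextflow.config with the tabix image from the
// table below; the process name is a placeholder
process {
    withName: 'index_vcf' {
        container = 'project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0'
    }
}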

File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included - latest is not an accepted tag in this context.

    At Nextflow pipeline executable runtime:

    1. If no image is found at the path provided, the Nextflow pipeline job attempts to pull the Docker image from the remote external registry, based on the image name. This pull attempt requires internet access.

2. When the version is referenced as latest, or when no version tag is provided, the Nextflow pipeline job attempts to look up the digest of the image's latest reference from the external Docker repository and uses it to search for the corresponding tarball on the Platform. This digest search requires internet access. If no digest is found, or if there is no internet access, the execution fails.

    Here are examples of tarball file object paths and names, as constructed from image names and version tags:

    Image Name
    Version Tag
    Tarball File Object Path and Name

Option C: You can also reference Docker image names in pipeline scripts by digest - for example, <image_name>@sha256:XYZ123…. When referring to a tarball file on the Platform using this method, the file must have an object property image_digest assigned to it. A typical format would be "image_digest":"<IMAGE_DIGEST_HERE>".

    An example:
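// Sketch: reference the image by digest in nextflow.config; the matching tarball file
// object on the Platform must carry an "image_digest" property set to the same digest
process {
    withName: 'index_vcf' {
        container = 'quay.io/biocontainers/tabix@sha256:<IMAGE_DIGEST_HERE>'
    }
}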

    Nextflow Input Parameter Type Conversion to DNAnexus Executable Input Parameter Class

Based on the input parameter's type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter is assigned to the corresponding DNAnexus input class (see the conversion table below).

    File Input as String or File Class

As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which is assigned to the "file" or "string" class, respectively. When running the executable, based on the class (file or string) of the executable's input parameter, you use a specific PATH format to specify the value. See the Formats of PATH to File, Folder, or Wildcards section for the acceptable PATH format for each class.

    Converting a URL path to a String

When converting a file reference from a URL format to a String, use the method toUriString(). An example of a URL format would be dx://project-xxxx:/path/to/file for a DNAnexus URI. The method toURI().toString() does not give the same result, because toURI() removes the context ID, such as project-xxxx, and toString() removes the scheme, such as dx://. More information about these Nextflow methods is available in the Nextflow documentation on opening files.
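An illustrative sketch (assuming the dx:// filesystem is available, as it is inside a DNAnexus Nextflow job):

def ref = file('dx://project-xxxx:/path/to/file')
println ref.toUriString()   // dx://project-xxxx:/path/to/file - scheme and project context kept
// ref.toURI().toString() would drop the project context and the dx:// scheme, as noted above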

    Managing intermediate files and publishing outputs

    Pipeline Output Setting Using output: block and publishDir

    All files generated by a Nextflow job tree are stored in its session's corresponding workDir, which is the path where the temporary results are stored. On DNAnexus, when the Nextflow pipeline job is run with "preserve_cache=true", the workDir is set at the path: project-xxxx:/.nextflow_cache_db/<session_id>/work/. The project-xxxx is the project where the job took place, and you can follow the path to access all preserved temporary results. It is useful to be able to access these results for investigating the detailed pipeline progress, and use them for resuming job runs for pipeline development purposes.

    When the Nextflow pipeline job is run with "preserve_cache=false" (default), temporary files are stored in the job's which is deconstructed when the head job enters its terminate state - "done", "failed", or "terminated". Since a lot of these files are intermediate input/output being passed between processes and expected to be cleaned up after the job is completed, running with "preserve_cache=false" helps reduce project storage cost for files that are not of interest. It also saves you from remembering to clean up all temporary files.

To save the final results of interest, and to display them as the Nextflow pipeline executable's output, declare output files matching the declaration under the script's output: block, and use Nextflow's optional publishDir directive to publish them.

This makes the published output files available as the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, which is of class array:file. The files are then organized under the relative folder structure assigned via publishDir. This works for both "preserve_cache=true" and "preserve_cache=false". Only the "copy" publish mode is supported on DNAnexus.

    Values of publishDir

    At pipeline development time, the valid value of publishDir can be:

    • A local path string, for example, "publishDir path: ./path/to/nf/publish_dir/",

    • A dynamic string value defined as a pipeline input parameter such as "params.outdir", where "outdir" is a string-class input. This allows pipeline users to determine parameter values at runtime. For example, "publishDir path: '${params.outdir}/some/dir/'" or './some/dir/${params.outdir}/' or './some/dir/${params.outdir}/some/dir/' .

Find an example of how to construct output paths for an nf-core pipeline job tree at run time in the FAQ.

    publishDir is NOT supported on DNAnexus when assigned as an absolute path starting at root (/), such as /path/to/nf/publish_dir/. If an absolute path is defined for the publishDir, no output files are generated as the job's output parameter "published_files".

    Queue Size Configuration

The queueSize option is part of Nextflow's executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this is the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration has a value assigned to queueSize, it overrides the default value. If the value exceeds the upper limit (1000) on DNAnexus, the root job errors out. See the Nextflow executor page for examples.
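For example, a pipeline's configuration might raise the limit from the default of 5 (the value 20 here is arbitrary):

executor {
    queueSize = 20
}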

    Instance Type Determination

    Head job instance type determination

    The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs. Changing the instance type for the head job does not affect the computing resources available for subjobs, where most of the heavy computation takes place (see below where to configure instance types for Nextflow processes). Changing the instance type for the head job may be necessary only if it runs out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.

    Subjob instance type determination

Each subjob's instance type is determined based on the profile information provided in the Nextflow pipeline script. Specify required instances by DNAnexus instance type name via Nextflow's machineType directive (example below). Alternatively, use a set of system requirements such as cpus, memory, disk, and other resource parameters, according to the official Nextflow documentation. The executor matches instance types to the minimal requirements described in the Nextflow pipeline profile using this logic:

    1. Choose the cheapest instance that satisfies the system requirements.

    2. Use only SSD type instances.

3. All other things being equal (price and instance specifications), it prefers a version2 (v2) instance type.

    Order of precedence for subjob instance type determination:

    1. The value assigned to machineType directive.

2. Values assigned to the cpus, memory, and disk directives in their configuration.

An example of specifying machineType by DNAnexus instance type name is provided below:
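// Sketch, in nextflow.config: pin a process to a specific DNAnexus instance type
// (the process name and instance type name are placeholders)
process {
    withName: 'big_alignment_task' {
        machineType = 'mem1_ssd1_v2_x16'
    }
}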

Values assigned to the cpus, memory, and disk directives serve two purposes: they determine the instance type, and they can be recalled via Nextflow's implicit task object variables, such as ${task.cpus}, ${task.memory}, and ${task.disk}, at runtime for task allocation.

    Nextflow Resume

    Preserve Run Caches and Resuming Previous Jobs

Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without needing to start from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.

    Nextflow uses a scratch storage area for caching and preserving each task's temporary results. The directory is called "working directory", and the directory's path is defined by

    • The session id, a universally unique identifier (UUID) associated with current execution

    • Each task's unique hash ID: a hash number composed of each task's input values, input files, command line strings, container ID such as Docker image, conda environment, environment modules, and executed scripts in the bin directory, when applicable.

    You can use the Nextflow resume feature with the following Nextflow pipeline executable parameters:

• preserve_cache Boolean type. Default value is false. When set to true, the run is cached in the current project for future resumes (see the example command after this list).

      • This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID and each subjob's unique ID.

    • When preserve_cache=true, DNAnexus executor overrides the value of workDir of the job tree to be project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job tree was executed.
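For example (reads_fastqgz is the example input used elsewhere in this document):

$ dx run applet-xxxx \
  -i reads_fastqgz="project-xxxx:file-yyyy" \
  -i preserve_cache=true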

Below are four possible scenarios and the recommended use cases for -i resume:

    Scenarios
    Parameters
    Use Cases
    Note

    Cache Preserve Limitations and Cleaning Up workDir

To save on storage costs, clean up the workDir regularly. The maximum number of sessions that can be preserved in a DNAnexus project is 20. If you exceed the limit, the job generates an error with the following message:

    "The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run. "

To clean up all preserved sessions under a project, delete the entire /.nextflow_cache_db folder. To clean up a specific session's cached folder, delete the specific /.nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:
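# Remove a single session's cached folder (this cannot be undone)
$ dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/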

Be aware that deleting an object in the UI or using the CLI command dx rm cannot be undone. Once the session work directory is deleted or moved, subsequent runs cannot resume from that session.

    For each session, only one job can resume the session's cached results and preserve its own progress to this session. Multiple jobs can resume and preserve different sessions without limitations, as long as each job preserves a different session. Similarly, multiple jobs can resume the same session without limitations, as long as only one or none is preserving the progress to the session.

    Nextflow's errorStrategy

Nextflow's errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default, the process and other pending processes stop immediately (the default is errorStrategy terminate). This forces the entire pipeline execution to be terminated.

    Four error strategy options exist for Nextflow executor: terminate, finish, ignore, and retry. Below is a table of behaviors for each strategy. The "all other subjobs" referenced in the third column have not yet entered their terminal states.

When more than one errorStrategy directive is applied to a pipeline job tree, the following rules apply, depending on the first errorStrategy triggered.

    • When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs result in the "failed" state immediately.

• When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjobs also applies the finish errorStrategy, ignoring any other error strategies set in the pipeline's source code or configuration.

Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as ExecutionError, UnresponsiveWorker, JMInternalError, AppInternalError, or JobTimeoutExceeded, the subjob follows the executionPolicy applied to the subjob and restarts itself. It does not restart from the head job.

    FAQ

    My Nextflow job tree failed, how do I find where the errors are?

A: Find the errored subjob's job ID from the head job's nextflow_errored_subjob and nextflow_errorStrategy properties to investigate which subjob failed and which errorStrategy was applied. To query these errorStrategy-related properties in the CLI, run the following command:
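A sketch of one way to do this (jq is used here only to filter the JSON output and is not required):

$ dx describe job-xxxx --json | jq .properties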

where job-xxxx is the head job's job ID.

After finding the errored subjob, investigate the job log on the Monitor page by accessing the URL https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>, where <jobID> is the subjob's ID, such as job-yyyy. Alternatively, watch the job log in the CLI using dx watch job-yyyy.

    With the preserve_cache value set to true when starting the Nextflow pipeline executable, trace the cache workDir such as project-xxxx:/.nextflow_cache_db/<session_id>/work/ to investigate the intermediate results of this run.

    What is the version of Nextflow that is used?

    A: Find the Nextflow version by reading the log of the head job. Each built Nextflow executable is locked down to the specific version of Nextflow executor.

    What container runtimes are supported?

A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.

    My job hangs at the end of the analysis. What can I do to avoid this problem?

A: There are many possible causes of the head job becoming unresponsive. One known cause is the trace report file being written directly to a DNAnexus URI such as dx://project-xxxx:/path/to/file. To avoid this, specify -with-trace path/to/tracefile (using a local path string) in the Nextflow pipeline applet's nextflow_run_opts input parameter.

    Can I have an example of how to construct an output path when I run a Nextflow pipeline with params.outdir, publishDir and job-level destination?

Taking nf-core/sarek (3.3.1) as an example, start by reading the pipeline's logic:

1. The pipeline's publishDir is constructed with a prefix of the params.outdir variable followed by each task's name for each subfolder: publishDir = [ path: { "${params.outdir}/${...}" }, ... ]

2. params.outdir is a required input parameter to the pipeline, and the default value of params.outdir is null. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir to:

• Meet the input requirement for executing the pipeline.

• Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.

To specify a value of params.outdir for the Nextflow pipeline executable built from the nf-core/sarek pipeline script, pass the executable's outdir input. You can also set a job tree's output destination using --destination. A combined sketch, assuming the pipeline's schema exposes params.outdir as an applet input named outdir, is shown below:
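$ dx run applet-xxxx \
  -i outdir="local/to/outdir" \
  --destination project-xxxx:/path/to/jobtree/destination/ \
  -y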

    This command constructs the final output paths as follows:

    1. project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.

2. project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of all tasks/processes/subjobs of this pipeline.

    3. project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.

1. This example is built based on the Specifying a Nextflow Job Tree Output Folder and Managing Intermediate Files and Publishing Outputs sections.

2. Not all Nextflow pipelines have params.outdir as an input, nor do all use params.outdir in publishDir. Read the source script of the Nextflow pipeline for the actual context of usage and requirements for params.outdir and publishDir.


  • reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).

  • When the input parameter is expecting a file, you need to specify the value in a certain format based on the class of the input parameter. When the input is of the "file" class, use DNAnexus qualified ID, which is the absolute path to the file object such as "project-xxxx:file-yyyy". When the input is of the "string" class, use the DNAnexus URI ("dx://project-xxxx:/path/to/file"). See table below for full descriptions of the formatting of PATHs.

  • You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of a file object, while fasta_fai is an input parameter of a string object. You then use DNAnexus qualifiedID format for fasta, and DNAnexus URI format for fasta_fai.

  • The DNAnexus object class of each input parameter is based on the "type" and "format" specified in the pipeline's nextflow_schema.json, when it exists. See additional documentation in the Nextflow Input Parameter Type Conversion section to understand how Nextflow input parameter's type and format (when applicable) converts to an app or applet's input class.

  • It is recommended to always use the app/applet means for specifying input values. The platform validates the input class and existence before the job is created.

  • All inputs for a Nextflow pipeline executable are set as "optional" inputs. This allows users to have flexibility to specify input via other means.

  • Nextflow pipeline command line input parameter, available as nextflow_pipeline_params. This is an optional "string" class input, available for any Nextflow pipeline executable on it being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced in the Nextflow Configuration documentation.

    • Because nextflow_pipeline_params is a string type parameter with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.

  • Nextflow options parameter nextflow_run_opts. This is an optional "string" class input, available for any Nextflow pipeline executable on it being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is single-dash prefix parameter that corresponds to the Nextflow run options pattern, specifying a preset input configuration.

  • Nextflow parameter file nextflow_params_file. This is an optional "file" class input, available for any Nextflow pipeline executable that is being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to -params-file option of nextflow run.

  • Nextflow soft configuration override file nextflow_soft_confs. This is an optional "array:file" class input, available for any Nextflow pipeline executable that is being built.

    • CLI example: dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the file being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to -c option of nextflow run, and the order specified for this array of file input is preserved when passing to the nextflow run execution.

    • The soft configuration file can be used for assigning default values of configuration scopes (such as ).

    • It is highly recommended to use nextflow_params_file as a replacement to using nextflow_soft_confs for the use case of specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this at .

  • Pipeline source code:

    1. nextflow_schema.json

      • Pipeline developers may specify default values of inputs in the nextflow_schema.json file.

      • If an input parameter is of Nextflow's string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.

    2. nextflow.config

      • Pipeline developers may specify default values of inputs in the nextflow.config file.

      • Pipeline developers may specify a default profile value using --profile <value> when building the executable, for example, dx build --nextflow --profile test

    3. main.nf, sourcecode.nf

      • Pipeline developers may specify default values of inputs in the Nextflow source code file (*.nf).

      • If an input parameter is of Nextflow's string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.


  • --cache-docker

    flag

    Stores a container image tarball in the selected project in /.cached_dockerImages. Only Docker engine is supported. Incompatible with --remote.

    --nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS

    string

    Custom pipeline parameters to be referenced when collecting the Docker images.

    --docker-secrets DOCKER_SECRETS

    file

    A DNAnexus file ID with credentials for a private Docker repository.

    For pipelines featuring conditional process trees determined by input values, provide mocked input values for caching Docker containers used by processes affected by the condition.

  • A building job requires CONTRIBUTE or higher permission to the destination project, that is the project for placing tarballs created from Docker containers.

  • Pipeline source code is saved at /.nf_source/<pipeline_folder_name>/ in the destination project. The user handles cleaning up this folder after the executable has been built.


    NA

    int

    number

    NA

    float

    boolean

    NA

    boolean

    object

    NA

    hash

  • When publishDir is defined this way, the user who launches the Nextflow pipeline executable handles constructing the publishDir to be a valid relative path.

  • The actual selected instance type's resources (CPUs, memory, disk capacity) may differ from what is allocated by the task. Instance type selection follows the precedence rules described above, while task allocation uses the values assigned in the configuration file.

  • When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the Docker run command. For example, when task.memory is specified, this becomes the maximum amount of memory allowed for the container: docker run --memory ${task.memory}

  • The session's cache directory containing information on the location of the workDir, the session progress, job status, and configuration data is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.
  • Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is technically the task's unique ID, and project-xxxx is the project where the job tree is executed.

• resume String type. Default value is an empty string, and the run begins without any cached data. When assigned a session id, the run resumes from what is cached for that session id in the project. When assigned "true" or "last", the run determines the session id that corresponds to the latest valid execution in the current project and resumes the run from it. For example, dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"

  • When a new job is launched and resumes a cached session (where session_id has a format like 12345678-1234-1234-1234-123456789012), the new job not only resumes from where the cache left at, but also shares the same session_id with the cached session it resumes. When a new job makes progress in a session and if the job is being cached, it creates temporary results to the same session's workDir. This generates a new cache directory (cache.tar) with the latest cache information.
  • You can have many Nextflow job trees sharing the same sessionID and writing to the same path for workDir and creating its own cache.tar, while only the latest job that ends in "done" or "failed" state is preserved on the project.

  • When the head job enters its terminal state such as "failed" or "terminated" that is not caused by the executor, no cache directory is preserved, even when the job was run with preserve_cache=true. Subsequent new jobs cannot resume from this job run. This can happen when a job tree fails due to exceeding a cost limit or a user terminating a job of the job tree.

  • 4

    resume=<session_ID> | "true" | "last" and preserve_cache=true

    Pipeline development. Only happens for the first few tests.

    Only 1 job with the same <session_ID> can run at each time point.

    ignore

    - Job properties set with: "nextflow_errorStrategy":"ignore" "nextflow_errored_subjob":"self" - Ends in "done" state immediately

    - Job properties set with: "nextflow_errorStrategy":"ignore" "nextflow_errored_subjob":"job-1xxx, job-2xxx" - Shows "subjobs <job-1xxx>, <job-2xxx> runs into Nextflow process errors' ignore errorStrategy were applied" at end of job log. - Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").

    - Keep running until terminal state. - If error occurs, their own errorStrategy is applied.

  • If the retry errorStrategy is the first directive triggered in a subjob, if any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, these other errorStrategy directives are applied to the corresponding subjob.

  • When ignore is the first errorStrategy directive to trigger in a subjob , and if any of terminate, finish, or retry errorStrategy directives applies to the remaining subjobs, that other errorStrategy is applied to the corresponding subjob.


    • App or applet input parameter class as file object • CLI/API level, such as dx run --destination PATH

    DNAnexus qualified ID (absolute path to the file object). • Example (file): project-xxxx:file-yyyy project-xxxx:/path/to/file • Example (folder): project-xxxx:/path/to/folder/

    • App or applet input parameter class as string • Nextflow configuration and source code files, such as nextflow_schema.json, nextflow.config, main.nf, and sourcecode.nf

    DNAnexus URI. • Example (file): dx://project-xxxx:/path/to/file • Example (folder): dx://project-xxxx:/path/to/folder/ • Example (wildcard): dx://project-xxxx:/path/to/wildcard_files

    Values of StringEquals:job-oidc.dnanexus.com/:sub

    Which jobs can assume the role that enables bucket access?

    project_id;project-xxxx

    Any Nextflow pipeline jobs that are running in project-xxxx

    launched_by;user-aaaa

    Any Nextflow pipeline jobs that are launched by user-aaaa

    project_id;project-xxxx;launched_by;user-aaaa

    Any Nextflow pipeline jobs that are launched by user-aaaa in project-xxxx

    bill_to;org-zzzz

    Any Nextflow pipeline jobs that are billed to org-zzzz

    --profile PROFILE

    string

    Set default profile for the Nextflow pipeline executable.

    --repository REPOSITORY

    string

    Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.

    --repository-tag TAG

    string

    Specifies tag for Git repository. Can be used only with --repository.

    --git-credentials GIT_CREDENTIALS

    file

    Requires running a "building job" with external internet access?

    Yes, if building an applet for the first time or if any image is going to be updated. No internet access required on rebuild.

    No

    Docker images packaged as bundledDepends?

    Yes. For Docker images that are used in the execution, they are cached and bundled at build time.

    No. Docker tarballs are resolved at runtime.

    At runtime

    Job first attempts to access Docker images cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the images from the external repository, via the internet.

    Job attempts to find the Docker image based on the Docker cache path referenced. If this fails, the job attempts to pull from the external repository, via the internet.

    quay.io/biocontainers/tabix

    1.11--hdfd78af_0

    project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0

    python

    3.9-slim

    project-xxxx:/.cached_docker_images/python/python_3.9-slim

    python

    latest

    Nextflow pipeline job attempts to pull from remote external registry

    From: Nextflow Input Parameter (defined at nextflow_schema.json) Type

    Format

    To: DNAnexus Input Parameter Class

    string

    file-path

    file

    string

    directory-path

    string

    string

    path

    string

    string

    NA

    string

    1 (default)

    resume="" (empty string) and preserve_cache=false

    Production data processing. Most high volume use cases

    2

    resume="" (empty string) and preserve_cache=true

    Pipeline development. Only happens for the first few pipeline tests. During development, it is useful to see all intermediate results in workDir.

    Only up to 20 Nextflow sessions can be preserved per project.

    3

    resume=<session_ID> | "true" | "last" and preserve_cache=false

    Pipeline development. Pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh

    errorStrategy

    Subjob Error

    Head Job

    All Other Subjobs

    terminate

    - Job properties set with: "nextflow_errorStrategy":"terminate" "nextflow_errored_subjob":"self" - Ends in "failed" state immediately

    - Job properties set with: "nextflow_errorStrategy":"terminate" "nextflow_errored_subjob":"job-xxxx" "nextflow_terminated_subjob":"job-yyyy, job-zzzz" where job-xxxx is the errored subjob, and job-yyyy, job-zzzz are other subjobs terminated due to this error. - Ends in "failed" state immediately, with error message: "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure."

    End in "failed" state immediately.

    finish

    - Job properties set with: "nextflow_errorStrategy":"finish" "nextflow_errored_subjob":"self" - Ends in "done" state immediately

    - Job properties set with: "nextflow_errorStrategy":"finish" "nextflow_errored_subjob":"job-xxxx, job-2xxx" where job-xxxx and job-2xxx are errored subjobs. - No new subjobs created after error. - Ends in "failed" state eventually, after other subjobs enter terminal states, with error message: "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure."

    - Keep running until terminal state. - If error occurs in any, finish errorStrategy is applied (ignoring other error strategies), per Nextflow default behavior.

    retry

    - Job properties set with: "nextflow_errorStrategy":"retry" "nextflow_errored_subjob":"self" - Ends in "done" state immediately

    - Spins off a new subjob to retry the errored job, named <name> (retry: <RetryCount>). - Ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").

    - Keep running until terminal state. - If error occurs, their own errorStrategy is applied.

    The "Estimated Price" value shown here is only an example. The actual price depends on the pricing model and runtime of the import job.

Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found in the Private Nextflow Pipeline Repository section.

    integer

    $ dx build --nextflow \
      --repository https://github.com/nextflow-io/hello \
      --destination project-xxxx:/applets/hello
    
    Started builder job job-aaaa
    Created Nextflow pipeline applet-zzzz
    $ dx run project-xxxx:/applets/hello -h
    usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]
    
    Applet: hello
    
    hello
    
    Inputs:
     Nextflow options
      Nextflow Run Options: [-inextflow_run_opts=(string)]
            Additional run arguments for Nextflow (e.g. -profile docker).
    
      Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
            Additional top-level options for Nextflow (e.g. -quiet).
    
      Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
            (Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
            configuration set
    
      Script Parameters File: [-inextflow_params_file=(file)]
            (Optional) A file, in YAML or JSON format, for specifying input parameter values
    
     Advanced Executable Development Options
      Debug Mode: [-idebug=(boolean, default=false)]
            Shows additional information in the job log. If true, the execution log messages from
            Nextflow are also included.
    
      Resume: [-iresume=(string)]
            Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
            the sessionID, resumes the latest resumable session run by an applet with the same name
            in the current project in the last 6 months.
    
      Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
            Enable storing pipeline cache and local working files to the current project. If true, local
            working files and cache files are uploaded to the platform, so the current session could
            be resumed in the future
    
    Outputs:
      Published files of Nextflow pipeline: [published_files (array:file)]
            Output files published by current Nextflow pipeline and uploaded to the job output
            destination.
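    For example, the Nextflow-specific inputs listed above can be combined in a single run. The sketch below is illustrative only; the file IDs, profile name, and output folder are placeholders.
    # Illustrative invocation combining the Nextflow-specific inputs shown above;
    # file IDs, profile names, and paths are placeholders
    $ dx run project-xxxx:/applets/hello \
      -inextflow_run_opts="-profile docker" \
      -inextflow_params_file=file-aaaa \
      -inextflow_soft_confs=file-bbbb \
      -ipreserve_cache=true \
      -idebug=true \
      --destination project-xxxx:/outputs/hello/ \
      --brief -y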
    $ pwd
    /path/to/hello
    $ ls
    LICENSE         README.md       main.nf         nextflow.config
    $ dx build --nextflow /path/to/hello \
      --destination project-xxxx:/applets2/hello
    {"id": "applet-yyyy"}
    $ dx run project-yyyy:applet-xxxx \
      -i debug=false \
      --destination project-xxxx:/path/to/destination/ \
      --brief -y
    
    job-bbbb
    # See subjobs in progress
    $ dx find jobs --origin job-bbbb
    * hello (done) job-bbbb
    │ amy 2023-09-20 14:57:58 (runtime 0:02:03)
    ├── sayHello (3) (hello:nf_task_entry) (done) job-1111
    │   amy 2023-09-20 14:58:57 (runtime 0:00:45)
    ├── sayHello (1) (hello:nf_task_entry) (done) job-2222
    │   amy 2023-09-20 14:58:52 (runtime 0:00:52)
    ├── sayHello (2) (hello:nf_task_entry) (done) job-3333
    │   amy 2023-09-20 14:58:48 (runtime 0:00:53)
    └── sayHello (4) (hello:nf_task_entry) (done) job-4444
        amy 2023-09-20 14:58:43 (runtime 0:00:50)
    # Monitor job in progress
    $ dx watch job-bbbb
    Watching job job-bbbb. Press Ctrl+C to stop watching.
    * hello (done) job-bbbb
      amy 2023-09-20 14:57:58 (runtime 0:02:03)
    ... [deleted]
    2023-09-20 14:58:29 hello STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
    2023-09-20 14:58:30 hello STDOUT bash running (job ID job-bbbb)
    2023-09-20 14:58:31 hello STDOUT =============================================================
    2023-09-20 14:58:31 hello STDOUT === NF projectDir   : /home/dnanexus/hello
    2023-09-20 14:58:31 hello STDOUT === NF session ID   : 0eac8f92-1216-4fce-99cf-dee6e6b04bc2
    2023-09-20 14:58:31 hello STDOUT === NF log file     : dx://project-xxxx:/applets/nextflow-job-bbbb.log
    2023-09-20 14:58:31 hello STDOUT === NF command      : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
    2023-09-20 14:58:31 hello STDOUT === Built with dxpy : 0.358.0
    2023-09-20 14:58:31 hello STDOUT =============================================================
    2023-09-20 14:58:34 hello STDOUT N E X T F L O W  ~  version 22.10.7
    2023-09-20 14:58:35 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
    2023-09-20 14:58:43 hello STDOUT [0a/6a81ca] Submitted process > sayHello (4)
    2023-09-20 14:58:48 hello STDOUT [f5/87df8b] Submitted process > sayHello (2)
    2023-09-20 14:58:53 hello STDOUT [4b/21374a] Submitted process > sayHello (1)
    2023-09-20 14:58:57 hello STDOUT [f6/8c44f5] Submitted process > sayHello (3)
    2023-09-20 14:59:51 hello STDOUT Hola world!
    2023-09-20 14:59:51 hello STDOUT 
    2023-09-20 14:59:51 hello STDOUT Ciao world!
    2023-09-20 14:59:51 hello STDOUT 
    2023-09-20 15:00:06 hello STDOUT Bonjour world!
    2023-09-20 15:00:06 hello STDOUT 
    2023-09-20 15:00:06 hello STDOUT Hello world!
    2023-09-20 15:00:06 hello STDOUT 
    2023-09-20 15:00:07 hello STDOUT === Execution completed — cache and working files will not be resumable
    2023-09-20 15:00:07 hello STDOUT === Execution completed — upload nextflow log to job output destination project-xxxx:/applets/
    2023-09-20 15:00:09 hello STDOUT Upload nextflow log as file: file-GZ5ffkj071zqZ9Qj22qv097J
    2023-09-20 15:00:09 hello STDOUT === Execution succeeded — upload published files to job output destination project-xxxx:/applets/
    * hello (done) job-bbbb
      amy 2023-09-20 14:57:58 (runtime 0:02:03)
      Output: -
    # Monitor job in progress
    $ dx watch job-cccc
    Watching job job-cccc. Press Ctrl+C to stop watching.
    sayHello (1) (hello:nf_task_entry) (done) job-cccc
    amy 2023-09-20 14:58:52 (runtime 0:00:52)
    ... [deleted]
    2023-09-20 14:59:28 sayHello (1) STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
    2023-09-20 14:59:30 sayHello (1) STDOUT bash running (job ID job-cccc)
    2023-09-20 14:59:33 sayHello (1) STDOUT file-GZ5ffQj047j3Vq7QX220Q5vQ
    2023-09-20 14:59:34 sayHello (1) STDOUT Bonjour world!
    2023-09-20 14:59:36 sayHello (1) STDOUT file-GZ5ffVQ047j2QXZ2ZkFx4YxG
    2023-09-20 14:59:38 sayHello (1) STDOUT file-GZ5ffX0047j2QXZ2ZkFx4YxK
    2023-09-20 14:59:41 sayHello (1) STDOUT file-GZ5ffXQ047jGYZ91x6KG32Jp
    2023-09-20 14:59:43 sayHello (1) STDOUT file-GZ5ffY8047jF2PY3609JPBKB
    sayHello (1) (hello:nf_task_entry) (done) job-cccc
    amy 2023-09-20 14:58:52 (runtime 0:00:52)
    Output: exit_code = 0
    {
      "docker_registry": {
        "registry": "url-to-registry",
        "username": "name123",
        "token": "12345678"
      }
    }
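    Once this JSON file has been uploaded to the platform, it can be passed to the Nextflow pipeline applet at run time. The sketch below assumes the applet exposes a file-class credentials input (named docker_creds here as a placeholder) and that the JSON was uploaded as file-cccc.
    # Hypothetical example: pass the uploaded registry credentials file to the applet.
    # The input name docker_creds and the file ID are placeholders.
    $ dx run project-xxxx:applet-zzzz \
      -idocker_creds=project-xxxx:file-cccc \
      --brief -y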
    # Query for the class of each input parameter
    $ dx run project-yyyy:applet-xxxx --help
    usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]
    
    Applet: example_applet
    
    example_applet
    
    Inputs:
    …
      fasta: [-ifasta=(file)]
    …
    
      fasta_fai: [-ifasta_fai=(string)]
    …
    
    
    # Assign values of the parameter based on the class of the parameter
    $ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:DeleteObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
          ],
          "Resource": [
            "arn:aws:s3:::my-nextflow-s3-workdir",
            "arn:aws:s3:::my-nextflow-s3-workdir/*"
          ]
        }
      ]
    }
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Principal": {
            "Federated": "arn:aws:iam::123456789012:oidc-provider/job-oidc.dnanexus.com/"
          },
          "Condition": {
            "StringEquals": {
              "job-oidc.dnanexus.com/:aud": "dx_nextflow_s3_scratch_token_aud",
              "job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
            }
          }
        }
      ]
    }
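    If the two policy documents above are saved locally (for example as trust-policy.json and s3-access-policy.json, names chosen here for illustration), the IAM role could be created and granted the S3 permissions with the standard AWS CLI, roughly as follows.
    # Illustrative AWS CLI sketch; the role name matches the configuration example
    # below, while the policy and file names are placeholders
    $ aws iam create-role \
        --role-name NextflowRunIdentityToken \
        --assume-role-policy-document file://trust-policy.json
    $ aws iam put-role-policy \
        --role-name NextflowRunIdentityToken \
        --policy-name nextflow-s3-workdir-access \
        --policy-document file://s3-access-policy.json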
    # In a nextflow configuration file:
    
    aws { region = '<aws region>'}
    
    dnanexus {
     workDir = '<S3 URI path>'
     jobTokenAudience = '<OIDC_audience_name>'
     jobTokenSubjectClaims = '<list of claims separated by commas>'
     iamRoleArnToAssume = '<arn of the role who is set with permission>'
    }
    # In a nextflow configuration file:
    dnanexus {
     ...
     jobTokenSubjectClaims = 'project_id,launched_by'
     ...
    }
    # In a nextflow configuration file:
    dnanexus {
     ...
     iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
     ...
    }
    $ dx build --nextflow /path/to/hello --extra-args \
        '{"access":{"network": [], "allProjects":"VIEW"}}'
    ...
    {"id": "applet-yyyy"}
    providers {
      github {
        user = 'username'
        password = 'ghp_xxxx'
      }
    }
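    A credentials file in this format can then be supplied when building from a private repository. In the sketch below, the --git-credentials flag name, repository URL, and credentials file location are illustrative assumptions.
    # Hypothetical build from a private Git repository; the --git-credentials flag,
    # repository URL, and credentials file path are assumptions
    $ dx build --nextflow \
      --repository https://github.com/org/private-pipeline \
      --git-credentials dx://project-xxxx:/credentials/github_creds \
      --destination project-xxxx:/applets/private-pipeline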
    # Note: --nextflow-pipeline-params is only needed when the pipeline requires
    # parameter values to resolve which Docker images to cache
    $ dx build --nextflow /path/to/hello \
    --cache-docker \
    --nextflow-pipeline-params "--alpha=1 --beta=foo" \
    --destination project-xxxx:/applets2/hello
    ...
    {"id": "applet-yyyy"}
    
    $ dx tree /.cached_docker_images/
    /.cached_docker_images/
    ├── samtools
    │   └── samtools_1.16.1--h6899075_1
    ├── multiqc
    │   └── multiqc_1.18--pyhdfd78af_0
    └── fastqc
        └── fastqc_0.11.9--0
    "docker_registry": {
      "registry": "url-to-registry",
      "username": "name123",
      "token": "12345678"
    }
    # In a Nextflow pipeline script:
    process foo {
      container 'dx://project-xxxx:file-yyyy'
    
      '''
      do this
      '''
    }
    # In nextflow.config (at the root folder of the Nextflow pipeline):
    process {
        withName:foo {
            container = 'dx://project-xxxx:file-yyyy'
        }
    }
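    The platform file referenced by the dx:// URI is a Docker image saved as a tarball. One way to prepare such a file manually is sketched below; the image name, file names, and upload path are placeholders, and the exact tarball format expected may depend on your executor version.
    # Illustrative preparation of a Docker image tarball for use as a dx:// container;
    # image name, file names, and destination path are placeholders
    $ docker pull quay.io/biocontainers/tabix:1.11--hdfd78af_0
    $ docker save quay.io/biocontainers/tabix:1.11--hdfd78af_0 -o tabix_1.11.tar
    $ dx upload tabix_1.11.tar --path project-xxxx:/docker_images/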
    # In nextflow configuration file:
    docker.enabled = true
    docker.registry = 'quay.io'
    
    # In the Nextflow pipeline script:
    process bar {
      container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'
    
      '''
      do this
      '''
    }
    # In nextflow configuration file:
    docker.enabled = true
    docker.registry = 'quay.io'
    
    # In the Nextflow pipeline script:
    process bar {
      container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
      '''
      do this
      '''
    }
    process foo {
      machineType 'mem1_ssd1_v2_x36'
    
      """
      <your script here>
      """
    }
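    Instead of naming an instance type directly, a process can declare standard Nextflow resource directives such as cpus, memory, and disk, and an instance type satisfying them is selected. A minimal sketch:
    // Sketch: standard Nextflow resource directives; an instance type meeting these
    // requirements is selected automatically
    process foo {
      cpus 8
      memory '32 GB'
      disk '200 GB'

      """
      <your script here>
      """
    }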
    dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true
    dx rm -r project-xxxx:/.nextflow_cache_db/              # cleanup ALL sessions caches
    dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session's cache
    $ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
    job-yyyy
    $ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
    terminate
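    The error strategy reported above is controlled in the usual way from the pipeline configuration. The snippet below is a standard Nextflow setting, shown for illustration and not specific to DNAnexus.
    // Illustration: retry a failed task up to two times, then terminate the run
    process {
      errorStrategy = { task.attempt <= 2 ? 'retry' : 'terminate' }
      maxRetries = 2
    }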
    # Assign "./local/to/outdir" to params.outdir
    dx run project-xxxx:applet-zzzz \
    -i outdir=./local/to/outdir \
    --brief -y
    # Assign "./local/to/outdir" to params.outdir and set the job tree output folder
    dx run project-xxxx:applet-zzzz \
    -i outdir=./local/to/outdir \
    --destination project-xxxx:/path/to/jobtree/destination/ \
    --brief -y