Understanding projects, organizations, apps, and workflows will give you a solid foundation for working with the DNAnexus Platform.
Get to know commonly used features in a series of short, task-oriented tutorials.
User: Learn to access and use the Platform via both its command-line interface and its user interface.
Developer: Learn to manage data, users, and work on the Platform via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.
Administrator: This section is targeted at organizational leads who have permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the Platform.
Downloads: Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.
Get details on new features, changes, and bug fixes for each Platform and toolkit release.
In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).
To use the CLI, you need to install the DNAnexus Platform SDK, which includes the dx command-line client.
If you're not familiar with the dx client, check the CLI Quickstart.
This section provides detailed instructions on using the dx client to perform such common actions as logging in, selecting projects, listing, copying, moving, and deleting objects, and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.
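For orientation, here is a minimal sketch of the kinds of commands covered in this section; the project, file, app, and job IDs are placeholders.
dx login                                  # authenticate and select a default project
dx select project-xxxx                    # switch the default project
dx ls -l                                  # list objects in the current folder
dx cp file-xxxx project-yyyy:/inputs/     # copy an object to another project
dx mv /old_name /new_name                 # move or rename an object
dx rm file-xxxx                           # delete an object
dx run app-xxxx -i input_file=file-xxxx   # launch a job
dx watch job-xxxx                         # stream a running job's log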
A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the Key Concepts section.
Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.
Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud (Here's a particularly helpful Stack Exchange post on the subject). To help make the documentation easier to understand, when discussing concurrent computing paradigms this guide refers to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in this case instances in the cloud) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
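As an illustration only (tool names and inputs are hypothetical), the first command below parallelizes across threads on a single worker, while the second distributes work by spawning a subjob on an additional worker from within an applet script.
# Parallel: one worker, several threads of the same process
samtools sort -@ 8 input.bam -o sorted.bam
# Distributed: a second worker, started as a subjob that runs the "process_chunk" entry point
dx-jobutil-new-job process_chunk -ichunk=file-xxxx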
Get up and running quickly using the Platform via both its user interface (UI) and its command-line interface (CLI):
Learn the basics of developing for the Platform:
Get to know features you'll use every day, in these short, task-oriented tutorials.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a step-by-step written tutorial on using the Platform via its UI, see the UI Quickstart.
For a step-by-step written tutorial on using the Platform via its CLI, see the CLI Quickstart.
For a more in-depth video introduction to the Platform, watch the introductory video.
As a developer, you may be interested in the following:
As a bioinformatician, see the walkthrough and other content in the .
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
View full source code on GitHub
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, the training script in TensorFlow needs to include code that saves specific data to a log directory where TensorBoard can then find the data to display it.
This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website (external link).
The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).
The training script runs in the background to start TensorBoard immediately, which allows you to see the results while training is still running. This is particularly important for long-running training scripts.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud, to see the result.
This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).
View full source code on GitHub
In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can use the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary runs on future workers.
See Cloud Workstation in the App Library for more information.
The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory are packaged, uploaded to the Platform, and then extracted into the root directory / of the worker. In this case, the resources/ dir is structured as follows:
When this applet is run on a worker, the resources/ directory is placed in the worker's root directory /:
The SAMtools command is available because the respective binary is visible from the default $PATH variable. The directory /usr/bin/ is part of $PATH, so the script can reference the samtools command directly:
Learn how to run an older version of DXJupyterLab via the user interface or command-line interface.
The primary reason to run an older version of DXJupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.
From the main Platform menu, select Tools, then Tools Library.
Find and select, from the list of tools, either DXJupyterLab with Python, R, Stata, ML, Image Processing or DXJupyterLab with Spark Cluster.
From the tool detail page, click on the Versions tab.
Select the project in which you want to run DXJupyterLab.
Launch the version of DXJupyterLab you want to run, substituting the version number for x.y.z in the following commands:
For DXJupyterLab without Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.
After launching DXJupyterLab, access the DXJupyterLab environment using your browser. To do this:
Get the job ID for the job created when you launched DXJupyterLab. See the for details on how to get the job ID, via either the UI or the CLI.
Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.
You may see an error message "502 Bad Gateway" if DXJupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.
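For example, a minimal sketch of finding the job ID from the CLI (the result limit is an assumption; adjust it to locate your DXJupyterLab job):
# List your most recent executions in the current project and note the DXJupyterLab job ID
dx find jobs -n 5
# The URL is then derived from that ID, e.g. https://job-xxxx.dnanexus.cloud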
The Spark application is an extension of the current app(let) framework. App(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow for an additional optional cluster specification with type=dxspark.
Calling /app(let)-xxxx/run for Spark apps creates a Spark cluster (+ master VM).
The master VM (where the app shell code runs) acts as the driver node for Spark.
Code in the master VM leverages the Spark infrastructure.
Job mechanisms (monitoring, termination, and management) are the same for Spark apps as for any other regular app(let)s on the Platform.
Spark apps can be launched over a distributed Spark cluster.
Learn key terms used to describe apps and workflows.
On the DNAnexus Platform, the following terms are used when discussing apps and workflows:
Execution: An analysis or job.
Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via /executable-xxxx/run API call with detach flag set to true are also root executions.
Execution tree: The set of all jobs and/or analyses that are created because of running a root execution.
Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).
Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.
Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.
Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.
Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.
Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.
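As a sketch of how a job-based object reference can be used from the command line (the applet IDs and field names here are hypothetical), the output field of a not-yet-finished job can be passed directly as the input of another job:
# Launch a first job and capture its ID
first_job=$(dx run applet-xxxx -i reads=file-xxxx --brief -y)
# Reference one of its output fields as an input of a second job;
# the platform resolves it once the first job reaches the "done" state
dx run applet-yyyy -i mappings="${first_job}:aligned_bam"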
This is an example web applet that demonstrates how to build and run an R Shiny application on DNAnexus.
Learn how to set job notification thresholds on the DNAnexus Platform.
Being notified when a job may be stuck can help users troubleshoot problems. On DNAnexus, users can set job timeouts to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time.
When the threshold is reached for a root execution, the system sends an email notification to both the user who launched the executable and the org admin.
Learn to build and use scatter plots in the Cohort Browser.
Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.
Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.
Spark apps use the same dx-based communication between the master VM and the DNAnexus API servers as other apps.
A log collection mechanism gathers logs from all nodes.
You can use the Spark UI to monitor the running job, using SSH tunneling.
For DXJupyterLab with Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high
Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.
Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.
Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.
Job tree: A set of all jobs that share the same origin job.
FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies.
The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING feature of DXJupyterLab.
To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license at the FreeSurfer registration page.
To use the FreeSurfer license, complete the following steps:
Upload the license text file to your project on the DNAnexus Platform.
Launch the DXJupyterLab app and specify the IMAGE_PROCESSING feature.
Once DXJupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory.
The commands to download the license file are as follows:
Python kernel: !dx download license.txt -o $FREESURFER_HOME
Bash kernel: dx download license.txt -o $FREESURFER_HOME
# Start the training script and put it into the background,
# so the next line of code will run immediately
python3 mnist_tensorboard_example.py --log_dir LOGS_FOR_TENSORBOARD &
# Run TensorBoard
tensorboard --logdir LOGS_FOR_TENSORBOARD --host 0.0.0.0 --port 443
├── Applet dir
│   ├── src
│   ├── dxapp.json
│   ├── resources
│   │   ├── usr
│   │   │   ├── bin
│   │   │   │   ├── <samtools binary>
Download input files using the dx-download-all-inputs command. The dx-download-all-inputs command goes through all inputs and downloads them into folders with the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].
Create an output directory in preparation for the dx-upload-all-outputs DNAnexus command in the Upload Results section.
After executing the dx-download-all-inputs command, three helper variables are created to aid in scripting. For this applet, the input variable mappings_bam, with platform filename my_mappings.bam, has the following helper variables:
Use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.
Use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The dx-upload-all-outputs command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It uploads matching files and then associates them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. After creating the folders, place the outputs there.
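Sketched out for the counts_txt output used in this tutorial, the pattern looks like this (the samtools command mirrors the snippet shown later in this example):
# Create the folder that dx-upload-all-outputs expects for the counts_txt output
mkdir -p out/counts_txt
# Place the result file there
samtools view -c "${mappings_bam_path}" > out/counts_txt/"${mappings_bam_prefix}.txt"
# Upload everything under out/ and register it as the job's output
dx-upload-all-outputs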
The application code (server.R and ui.R) is placed in resources/home/dnanexus/my_app/. Because the contents of resources/ are extracted into the worker's root directory and ~ resolves to /home/dnanexus, the app ends up in ~/my_app. From the main applet script code.sh, start Shiny pointing to ~/my_app, serving its mini-application on port 443.
For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running indefinitely. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
To make your own applet with R Shiny, copy the source code from this example and modify server.R and ui.R inside resources/home/dnanexus/my_app.
To build the asset, run the dx build_asset command and pass shiny-asset, the name of the directory holding dxasset.json:
This outputs a record ID record-xxxx that you can then put into the applet's dxapp.json in place of the existing one:
Build and run the applet itself:
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
For a root execution, the turnaround time is the time between its creation time and the time it reaches a terminal state (or the current time if it is not in a terminal state). The terminal states of an execution are done, terminated, and failed. The job tree turnaround time threshold can be set from the dxapp.json app metadata file using the treeTurnaroundTimeThreshold field, where the threshold time is set in seconds. When a user runs an executable that has a threshold, the threshold applies only to the resulting root execution. See here for more details on the treeTurnaroundTimeThreshold API.
Example of including the treeTurnaroundTimeThreshold field in dxapp.json:
In the command-line interface (CLI), the dx build and dx build --app commands can accept the treeTurnaroundTimeThreshold field from dxapp.json, and the resulting app is built with the job tree turnaround time threshold from the JSON file.
To check the treeTurnaroundTimeThreshold value of an executable, users can use dx describe {app, applet, workflow or global workflow id} --json command.
Using the dx describe {execution_id} --json command displays the selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, and treeTurnaroundTime values of root executions.
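For instance, a minimal sketch of inspecting these values from the CLI (the use of jq to pick fields out of the JSON is an assumption, not part of the Platform tooling):
# Threshold configured on an executable
dx describe app-xxxx --json | jq '.treeTurnaroundTimeThreshold'
# Values reported for a root execution
dx describe analysis-xxxx --json | \
  jq '{selectedTreeTurnaroundTimeThreshold, selectedTreeTurnaroundTimeThresholdFrom, treeTurnaroundTime}'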
For WDL workflows and tasks, dxCompiler enables tree turnaround time specification using the extras JSON file. dxCompiler reads the treeTurnaroundTimeThreshold field from the perWorkflowDxAttributes and defaultWorkflowDxAttributes sections in extras and applies this threshold to the generated workflow. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold field to the perTaskDxAttributes and defaultTaskDxAttributes sections in the extras JSON file.
Example of including the treeTurnaroundTimeThreshold field in perWorkflowDxAttributes:
For example:
All files must have the same separator. This can be a comma, tab, or another consistent delimiter.
All files must include a header line, or all files must exclude it
Input:
CSV (array of CSV files to load into the database)
Required Parameters:
database_name -> name of the database to load the CSV files into.
create_mode -> strict mode creates database and tables from scratch and optimistic mode creates databases and tables if they do not already exist.
insert_mode -> append appends data to the end of tables and overwrite is equivalent to truncating the tables and then appending to them.
table_name -> array of table names, one for each corresponding CSV file by array index.
type -> the cluster type, "spark" for Spark apps
Other Options:
spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.
spark_read_csv_sep -> default , -- the separator character used by each CSV.
spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
The following case creates a brand new database and loads data into two new tables:
First, launch DXJupyterLab in the project of your choice, as described in the Running DXJupyterLab guide.
After starting your JupyterLab session, click on the DNAnexus tab on the left sidebar to see all the files and folders in the project.
To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.
This creates an untitled ipynb file, viewable in the DNAnexus project browser, which refreshes every few seconds.
To rename your file, right-click on its name and select Rename.
You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, hit Ctrl/Command + S or click on the save icon in the Toolbar (an area below the tab bar at the top). A new notebook version lands in the project, and you should see in the "Last modified" column that the file was created recently.
Since DNAnexus files are immutable, each notebook save creates a new version in the project, replacing the file of the same name. The previous version moves to the .Notebook_archive with a timestamp suffix added to its name. Saving notebooks directly in the project as new files preserves your analyses beyond the DXJupyterLab session's end.
To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).
You can download input data from a project for your notebook using dx download in a notebook cell:
You can also use the terminal to execute the dx command.
For any data generated by your notebook that needs to be preserved, upload it to the project before the session ends and the JupyterLab worker terminates. Upload data directly in the notebook by running dx upload from a notebook cell or from the terminal:
Check the References guide for tips on the most useful operations and features in the DNAnexus JupyterLab.
dx build_asset tensorflow_asset
"runSpec": {
  ...
  "assetDepends": [
    {
      "id": "record-xxxx"
    }
  ]
  ...
}
dx build -f tensorboard-web-app
dx run tensorboard-web-app
├── usr
│   ├── bin
│   │   ├── <samtools binary>
├── home
│   ├── dnanexus
│   │   ├── <applet script>
samtools view -c "${mappings_bam_name}" > "${mappings_bam_prefix}.txt"
dx-download-all-inputs
mkdir -p out/counts_txt
# [VARIABLE]_path: the absolute string path to the file.
$ echo $mappings_bam_path
/home/dnanexus/in/mappings_bam/my_mappings.bam
# [VARIABLE]_prefix the file name minus the longest matching pattern in the dxapp.json file
$ echo $mappings_bam_prefix
my_mappings
# [VARIABLE]_name the file name from the platform
$ echo $mappings_bam_name
my_mappings.bam
samtools view -c "${mappings_bam_path}" \
  > out/counts_txt/"${mappings_bam_prefix}.txt"
dx-upload-all-outputs
main() {
  R -e "shiny::runApp('~/my_app', host='0.0.0.0', port=443)"
}
dx build_asset shiny-asset
"runSpec": {
  ...
  "assetDepends": [
    {
      "id": "record-xxxx"
    }
  ]
  ...
}
dx build -f dash-web-app
dx run dash-web-app
{
"treeTurnaroundTimeThreshold": {threshold},
...
}
{
"perWorkflowDxAttributes": {
{workflow_name}: {
"treeTurnaroundTimeThreshold": {threshold},
...
},
...
}
}
dx run app-csv-loader \
-i database_name=pheno_db \
-i create_mode=strict \
-i insert_mode=append \
-i spark_read_csv_header=true \
-i spark_read_csv_sep=, \
-i spark_read_csv_infer_schema=true \
-i csv=file-xxxx \
-i table_name=sample_metadata \
-i csv=file-yyyy \
-i table_name=gwas_result
%%bash
dx download input_data/reads.fastq
%%bash
dx upload results.csv
In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or similar values, in a particular field.
The Cohort Browser automatically groups records into bins, based on the distribution of values in the dataset, for the field. Values are distributed in a linear fashion, on the x axis.
Below is a sample histogram showing the distribution of values in a field Critical care total days. The label under the chart title indicates the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.
A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you see the following informational message below the chart:
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
See Comparing Cohorts for more on using Cohort Compare mode.
When ingesting data using Data Model Loader, the following data types can be visualized in histograms:
Integer
Integer Sparse
Float
Float Sparse
Date
Date Sparse
Datetime
Datetime Sparse
Numerical (Integer)
Numerical (Float)
Date
Primary Field: Numerical (Integer) or Numerical (Float)
Secondary Field: Numerical (Integer) or Numerical (Float)
In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that more records share a particular combination.
Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a scatter plot. The message "This field contains non-numeric values" appears below the scatter plot, as in this sample chart:
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears.
In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require that more data points be shown, you see this message above the chart:
In this scenario, add a cohort filter to generate a scatter plot that shows data for all the members of a cohort.
Scatter plots are not supported in Cohort Compare.
When ingesting data using Data Model Loader, the following data types can be visualized in scatter plots:
Integer
Integer Sparse
Float
Float Sparse
The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.
To learn more about GLnexus, see GLnexus or Getting started with GLnexus.
VCF files can include variant annotations. SnpEff annotations provided as INFO/ANN tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.
The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.
VCF annotation flows. In (a) the annotation step is external to the VCF Loader, whereas in (b) the annotation step is internal. In any case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF Loader.
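A minimal sketch of the external annotation path (a), assuming a local SnpEff installation and the GRCh38.99 database; the file names are placeholders:
# Annotate the harmonized pVCF with SnpEff before loading it;
# SnpEff writes INFO/ANN tags that the VCF Loader will pick up
java -Xmx8g -jar snpEff.jar GRCh38.99 harmonized.pvcf.vcf > annotated.pvcf.vcf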
Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.
A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.
A cost limit can be set when a root execution is launched. Once this limit is reached, the DNAnexus Platform terminates running executions in the affected execution tree.
When an execution is terminated in this fashion, the Platform sets a corresponding failure code. This failure code is displayed on the UI, on the relevant project's Monitor page.
Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.
Billing account spending limits apply to cumulative charges incurred by projects billed to the account.
If cumulative charges reach this limit, the Platform terminates running jobs in projects billed to the account, and prevents new executions from being launched.
When a job is terminated in this fashion, the Platform sets a corresponding failure reason. This failure reason is displayed on the UI, on the relevant project's Monitor page.
Monthly project-level compute spending limits can be set by project admins, and can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.
If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.
For more information on these limits, see the , and the .
Monthly project compute limits do not apply to compute charges incurred by using .
Using public IPv4 addresses for workers incurs additional charges. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any .
For information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions, see the .
The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.
While an execution or execution tree is running, information is displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.
Org spending limit information is available from the .
If project-level monthly spending limits have been set for a project, detailed information is available via the CLI, using the command .
Get an overview of the range of different charts you can build and use in the Cohort Browser.
While working in the Cohort Browser, you can visualize data using a variety of different types of charts.
To visualize data stored in a particular field, browse through the fields in a dataset, select one, then create a chart based on the values in the field. When you select a field, the Cohort Browser suggests a chart type to use, to visualize the type of data it contains. You can also create multi-variable charts, displaying data from two fields, to help clarify the relationship between the data stored in each.
The following single-variable chart types are available in the Cohort Browser:
The following multi-variable chart types are available in the Cohort Browser:
In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.
This figure is not always the same as the number of records in the cohort.
In a single-variable chart, if a field in a record is empty or contains a null value, that record is not included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon appears next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.
The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record isn't included in the chart total count, as its data can't be visualized.
Learn to build and use stacked row charts in the Cohort Browser.
Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.
When creating a stacked row chart:
Both the primary and secondary fields must contain categorical data
Both the primary and secondary fields must contain no more than 20 distinct category values
In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."
The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, you can see that 3,179 records contain the value "Out-patient" in the VisitType field.
Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, you can see that 87 records in the first group share the value "specialist" in the DoctorType field.
Stacked row charts are not supported in Cohort Compare. Use a instead.
When ingesting data using Data Model Loader, the following data types can be visualized in stacked row charts:
String Categorical
String Categorical Sparse
Integer Categorical
Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.
To generate a survival chart, select one numerical field representing time, and one categorical field, which is transformed into the individual's status.
The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": living, alive, diseasefree, disease-free.
For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.
To calculate survival percent at the current event, the system evaluates the following formula:
S_i = (n_i - d_i) / n_i
where:
S_i: Survival at the current event
n_i: Number of subjects living at the start of the period or event
d_i: Number of subjects that died
For each time period the following values are generated:
Status: Each individual is considered Dead unless they qualify as Living.
Number of Subjects Living at the Start (n_i)
For the initial value this is the total number of records returned by the backend from survival data with Living or Dead Status.
This is the actual point drawn on the survival plot.
View full source code on GitHub
This applet performs a basic samtools view -c {bam} command, referred to as "SAMtools count", on the DNAnexus Platform.
For bash scripts, inputs to a job execution become environment variables. The inputs from the dxapp.json file are formatted as shown below:
The object mappings_bam, a DNAnexus link containing the file ID of that file, is available as an environmental variable in the applet's execution. Use the command dx download to download the BAM file. By default, downloading a file preserves the filename of the object on the platform.
Use the bash helper variable mappings_bam_name for file inputs. For these inputs, the DNAnexus Platform creates a bash variable [VARIABLE]_name that holds the platform filename. Because the file was downloaded with default parameters, the worker filename matches the platform filename. The helper variable [VARIABLE]_prefix contains the filename minus any suffixes specified in the input field patterns (for example, the platform removes the trailing .bam to create [VARIABLE]_prefix).
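Putting the pieces above together (this mirrors the applet code shown later in this tutorial):
# Download the BAM input; the platform filename is preserved by default
dx download "${mappings_bam}"
# The helper variable then points at the downloaded file
readcount=$(samtools view -c "${mappings_bam_name}")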
Use the dx upload command to upload data to the platform. This uploads the file into the job container, a temporary project that holds files associated with the job. When running the command dx upload with the flag --brief, the command returns only the file ID.
Job containers are an integral part of the execution process. To learn more see .
The output of an applet must be declared before the applet is even built. Looking back to the dxapp.json file, you see the following:
The applet declares a file type output named counts_txt. In the applet script, specify which file should be associated with the output counts_txt. On job completion, this file is copied from the temporary job container to the project that launched the job.
Learn to build and use row charts in the Cohort Browser.
Row charts can be used to visualize categorical data.
When creating a row chart:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values cannot be organized in a hierarchy
See if you need to visualize hierarchical categorical data.
Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a .
Row charts aren't supported in Cohort Compare mode. In Cohort Compare mode, row charts are converted to .
Row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a stacked row chart.
In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."
Below is a sample row chart showing the distribution of values in a field Salt added to food. In the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.
When ingesting data using Data Model Loader, the following data types can be visualized in row charts, if category values are specified as such in the coding file used at ingestion:
String Categorical
String Categorical Sparse
String Categorical Multi-select
Integer Categorical
Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.
On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.
Each job has a timeout setting. This setting denotes the maximum amount of "wall clock time" that the job can spend in the "running" state, that is, running on the DNAnexus Platform.
If the job is still running when this limit is reached, the job is terminated.
The default job timeout setting is 30 days, though this default can be changed. A job may be given a different timeout.
As noted above, job timeouts only apply to the time a job spends in the "running" state.
Job timeouts do not apply to any time a job spends waiting to begin running - as, for example, when a job is waiting for inputs to become available.
Job timeouts also do not apply to the time a job may spend between exiting the "running" state, and entering the "done" state - as, for example, when it is waiting for subjobs to finish.
To learn more about timeouts, see .
If a job fails to complete running before reaching its timeout limit, it is terminated, and a corresponding failure reason is set.
Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree's root execution.
After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.
If an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree's root execution.
If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform does it fail. Depending on the access pattern, this failure may occur some time after the limit is reached.
To see information on time limits for execution and execution trees:
Navigate to the project in which the execution or execution tree is being run.
Click the Monitor tab.
Click the name of the execution or execution tree to open a page showing detailed information on it.
If a time limit is approaching, a warning message provides information on when the limit is reached.
If a job is waiting for subjobs to finish, it is shown as running, but job timeout information is not displayed. Execution tree information continues to be displayed.
Visualize your data and browse your multi-omics datasets.
Cohort Browser is a visualization tool for exploring and filtering structured datasets. It provides an intuitive interface for creating visualizations, defining patient cohorts, and analyzing complex data.
Cohort Browser supports multiple types of datasets:
Clinical and phenotypic data - Patient demographics, clinical measurements, and outcomes
Germline variants - Inherited genetic variations
Create charts, manage dashboards, and build visualizations to explore your datasets in the Cohort Browser.
Learn to build and use grouped box plots in the Cohort Browser.
Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.
When creating a grouped box plot:
The command-line client and the client bindings use a set of environment variables to communicate with the API server and to store state on the current default project and directory. These settings are set when you run dx login and can be changed through other dx commands. To display the active settings in human-readable format, use the dx env command:
To print the bash commands for setting the environment variables to match what dx is using, you can run the same command with the --bash flag.
Running a dx command from the command-line does not (and cannot) overwrite your shell's environment variables. The environment variables are stored in the ~/.dnanexus_config/environment file.
dx run app-glnexus \
-i common.gvcf_manifest=<manifest_file_id> \
-i common.config=gatk_unfiltered \
-i common.targets_bed=<bed_target_ranges>
dx run workflow-glnexus \
-i common.gvcf_manifest=<manifest_file_id> \
-i common.config=gatk_unfiltered \
-i common.targets_bed=<bed_target_ranges> \
-i unify.shards_bed=<bed_genomic_partition_ranges> \
-i etl.shards=<num_sample_partitions>
{
"inputSpec": [
{
"name": "mappings_bam",
"label": "Mapping",
"class": "file",
"patterns": ["*.bam"],
"help": "BAM format file."
}
]
}

For all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server keeps it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts are not executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Install the DNAnexus SDK and log in, then run dx-app-wizard with default options.
dash-asset specifies all the packages and versions needed. These come from the Dash installation guide.
Add these into dash-asset/dxasset.json:
Build the asset:
Add this asset to the applet's dxapp.json:
Build and run the applet itself:
You can always use dx ssh job-xxxx to SSH into the worker and inspect what's going on, or experiment with quick changes. Then go to that job's special URL https://job-xxxx.dnanexus.cloud/ and see the result.
The main code is in dash-web-app/resources/home/dnanexus/my_app.py with a local launcher script called local_test.py in the same folder. This allows you to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.
Install locally the same libraries listed above.
To launch the web app locally:
Once it spins up, you can go to that job's designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
The following is the order in which DNAnexus utilities load values from configuration sources (see the short illustration after the list):
Command line options (if available)
Environment variables already set in the shell
~/.dnanexus_config/environment.json (dx configuration file)
Hardcoded defaults
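A small illustration of this precedence order (project IDs are placeholders):
# An environment variable set in the shell overrides the value stored by dx
# for every subsequent command in this shell session
export DX_PROJECT_CONTEXT_ID=project-xxxx
dx ls
# A command-line flag overrides both, but only for the single invocation
dx env --project-context-id project-yyyy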
The dx command always prioritizes environment variables that are set in the shell. This means that if you have set the DX_SECURITY_CONTEXT environment variable and then use dx login to log in as a different user, dx still uses the original environment variable. When not run in a script, dx prints a warning to stderr whenever the environment variables and its stored state do not match. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables is generally best reserved for shell scripts or for the job environment in the cloud.
In the interaction below, environment variables have already been set; the user then uses dx login, but the shell's environment variables still override the newly stored values.
If you instead want to discard the values which dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.
Most dx commands have the following additional flags to temporarily override the values of the respective variables.
For example, you can temporarily override the current default project used:
dx download "${mappings_bam}"
readcount=$(samtools view -c "${mappings_bam_name}")
echo "Total reads: ${readcount}" > "${mappings_bam_prefix}.txt"
counts_txt_id=$(dx upload "${mappings_bam_prefix}.txt" --brief)
{
"name": "counts_txt",
"class": "file",
"label": "Read count file",
"patterns": [
"*.txt"
],
"help": "Output file with Total reads as the first line."
}
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
app.run_server(host='0.0.0.0', port=443)
pip install dash==0.39.0 # The core dash backend
pip install dash-html-components==0.14.0 # HTML components
pip install dash-core-components==0.44.0 # Supercharged components
pip install dash-table==3.6.0 # Interactive DataTable component
pip install dash-daq==0.1.0 # DAQ components
{
...
"execDepends": [
{"name": "dash", "version":"0.39.0", "package_manager": "pip"},
{"name": "dash-html-components", "version":"0.14.0", "package_manager": "pip"},
{"name": "dash-core-components", "version":"0.44.0", "package_manager": "pip"},
{"name": "dash-table", "version":"3.6.0", "package_manager": "pip"},
{"name": "dash-daq", "version":"0.1.0", "package_manager": "pip"}
],
...
}
dx build_asset dash-asset
"runSpec": {
  ...
  "assetDepends": [
    {
      "id": "record-xxxx"
    }
  ]
  ...
}
dx build -f dash-web-app
dx run dash-web-app
cd dash-web-app/resources/home/dnanexus/
python3 local_test.py
$ dx env
Auth token used adLTkSNkjxoAerREqbB1dVkspQzCOuug
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-9zVpbQf4Zg2641v5BGY00001
Current workspace name "Scratch Project"
Current folder /
Current user alice
$ dx env --bash
export DX_SECURITY_CONTEXT='{"auth_token_type": "bearer", "auth_token": "adLTkSNkjxoAerREqbB1dVkspQzCOuug"}'
export DX_APISERVER_PROTOCOL=https
export DX_APISERVER_HOST=api.dnanexus.com
export DX_APISERVER_PORT=443
export DX_PROJECT_CONTEXT_ID=project-9zVpbQf4Zg2641v5BGY00001
$ dx ls -l
Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
Folder : /
<Contents of Sample Project>
$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: alice
Password:
Note: Use "dx select --level VIEW" or "dx select --public" to select from
projects for which you only have VIEW permissions.
Available projects:
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Mouse (ADMINISTER)
Pick a numbered choice [1]: 2
Setting current project to: Mouse
$ dx ls
WARNING: The following environment variables were found to be different than the
values last stored by dx: DX_SECURITY_CONTEXT, DX_PROJECT_CONTEXT_ID
To use the values stored by dx, unset the environment variables in your shell by
running "source ~/.dnanexus_config/unsetenv". To clear the dx-stored values,
run "dx clearenv".
Project: Sample Project (project-9zVpbQf4Zg2641v5BGY00001)
Folder : /
<Contents of Sample Project>
$ source ~/.dnanexus_config/unsetenv
$ dx ls -l
Project: Mouse (project-9zVpbQf4Zg2641v5BGY00001)
Folder : /
<Contents of Mouse>
$ dx --env-help
usage: dx command ... [--apiserver-host APISERVER_HOST]
[--apiserver-port APISERVER_PORT]
[--apiserver-protocol APISERVER_PROTOCOL]
[--project-context-id PROJECT_CONTEXT_ID]
[--workspace-id WORKSPACE_ID]
[--security-context SECURITY_CONTEXT]
[--auth-token AUTH_TOKEN]
optional arguments:
--apiserver-host APISERVER_HOST
API server host
--apiserver-port APISERVER_PORT
API server port
--apiserver-protocol APISERVER_PROTOCOL
API server protocol (http or https)
--project-context-id PROJECT_CONTEXT_ID
Default project or project context ID
--workspace-id WORKSPACE_ID
Workspace ID (for jobs only)
--security-context SECURITY_CONTEXT
JSON string of security context
--auth-token AUTH_TOKEN
Authentication token
$ dx env --project-context-id project-B0VK6F6gpqG6z7JGkbqQ000Q
Auth token used R54BN6Ws6Zl3Y0VqBA9o1qweUswYW5o4
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-B0VK6F6gpqG6z7JGkbqQ000Q
Current folder /
For follow-up events, this is the number of subjects at the start of the previous event, minus the number of subjects who died in the previous event and the subjects who dropped out or were censored in the previous event.
Number of Subjects Who Died (d_i): incremented by 1 for each individual who, at the event, does not have a status of Living.
Number of Subjects Dropped or Censored: incremented by 1 for each individual who, at the event, has a status of Living.
Survival Percent at the Current Event (S_i): S_i = (n_i - d_i) / n_i
Cumulative Survival: S_i multiplied by the survival percent at the previous event.
Integer Categorical Multi-select
Supported Data Types and Limitations:
Categorical: ≤20 distinct category values
Categorical Multi-Select: ≤20 distinct category values

Gene expressions - Molecular expression measurements
Multi-assay datasets - Datasets combining multiple assay types or instances of the same assay type
If you need to perform custom statistical analysis, you can also use JupyterLab environments with Spark clusters to query your data programmatically.
You need to ingest your data before you can access it through a dataset in the Cohort Browser.
In Projects, select the project where your dataset is located.
Go to the Manage tab.
Select your dataset.
Click Explore Data.
You can also use the Info Panel to view information about the selected dataset, such as its creator or sponsorship.
Depending on your dataset, the Cohort Browser shows the following tabs:
Overview - Clinical data using interactive charts and dashboards
Data Preview - Clinical data in tabular format
Assay-specific tabs - Additional tabs appear based on your dataset content:
Germline Variants - For datasets containing germline genomic variants
Somatic Variants - For datasets containing somatic variants and mutations
Gene Expression - For datasets containing molecular expression data
In the Cohort Browser's Overview tab, you can visualize your data using charts. These visualizations provide an introduction to the dataset and insights on the clinical data it contains.
When you open a dataset, Cohort Browser automatically creates an empty cohort that includes all records in the dataset. From here, you can add filters to create specific cohorts, build visualizations to explore your data, and export filtered data for further analysis outside the platform.
Creating Charts and Dashboards - Build visualizations and manage dashboard layouts
Defining and Managing Patient Cohorts - Filter data and create patient groups
Analyzing Germline Genomic Variants - Work with inherited genetic variations
Analyzing Somatic Variants and Mutations - Explore cancer-related genetic changes
Analyzing Gene Expression - Examine molecular expression patterns
By using Dashboard Actions, you can save or load your own dashboard views. This lets you quickly switch between different visualizations without having to set them up each time.
Save Dashboard View - Saves the current dashboard configuration as a record of the DashboardView type, including all tiles and their settings.
Load Dashboard View - Loads a custom dashboard view, restoring the tiles and their configurations.
After loading a dashboard view once, you can access it again from Dashboard Actions > Custom Dashboard Views.
Moving dashboards between datasets? If you want to use your dashboard views with a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your custom dashboard configurations to a new target dataset.
Add charts to your dashboards to visualize the clinical and phenotypical data in your dataset. For example, you can add charts to display patient demographics or clinical measurements.
Each chart is represented as a tile on the dashboard. You can add multiple tiles to visualize different aspects of your data.
In the Overview tab, click + Add Tile on the top-right.
In the hierarchical list of the dataset fields, select the field you want to visualize.
In Data Field Details, choose your preferred chart type.
The available chart types depend on the field's value type.
Click Add Tile.
The tile immediately appears on the dashboard. You can add up to 15 tiles.
When selecting data fields to visualize, you can add a secondary data field to create a multi-variable chart. This allows you to visualize relationships between two data fields in the same chart.
To visualize the relationship between two data fields in the same chart, first select your primary data field from the hierarchical list. This opens a Data Field Details panel, showing the field's information and a preview of a basic chart.
To add a secondary field, keep the primary field selected and search for the desired field. When you find it, click the Add as Secondary Field icon (+) next to its name rather than selecting it directly. This adds the new field to the visualization. The Data Field Details panel updates to show the combined information for both fields.
For certain chart types, such as Stacked Row Chart and Scatter Plot, you can re-order the primary and secondary data fields by dragging the data field in Data Field Details.
For more details on multi-variable charts, including how to build a survival curve, see Multi-Variable Charts.
When working with large datasets, keep these tips in mind:
Limit dashboard tiles: To ensure fast loading times and a clear overview, it's best to limit the number of charts on a single dashboard. Typically, 8-10 tiles is a good number for human comprehension and optimal performance.
Filter data first: Reduce the volume of data by applying filters before you create complex visualizations. This improves chart loading speed.
In the main menu, navigate to Tools > JupyterLab. If you have used DXJupyterLab before, the page shows your previous sessions across different projects.
Click New JupyterLab.
Configure your JupyterLab session:
Specify the session name and select an instance type.
Choose the project where JupyterLab should run.
Set the session duration after which the environment automatically shuts down.
Optionally, provide a snapshot file to load a previously saved environment.
If needed, enable Spark Cluster and set the number of nodes.
Select a feature option based on your analysis needs:
PYTHON_R (default): Python3 and R kernel and interpreter
ML: Python3 with machine learning packages (TensorFlow, PyTorch, CNTK) and image processing (Nipype), but no R
Review the pricing estimate (if you have billing access) based on your selected duration and instance type.
Click Start Environment to launch your session. The JupyterLab session shows an "Initializing" state while the worker spins up and the server starts.
Open your JupyterLab environment by clicking the session name link once the state changes to "Ready". You can also access it directly via https://job-xxxx.dnanexus.cloud, where job-xxxx is your job ID.
For a detailed list of libraries included in each feature option, see the in-product documentation.
You can start the JupyterLab environment directly from the command line by running the app:
Once the app starts, you can check whether the JupyterLab server is ready to serve connections, which is indicated by the job's httpsAppState property being set to running. Once it is running, you can open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
To run the Spark version of the app, use the command:
You can check the optional input parameters for the apps on the DNAnexus Platform (platform login required to access the links):
From the CLI, you can learn more about dx run with the following command:
where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.
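As a rough sketch of that workflow from a script (assuming you are logged in with dx and a project is selected; the flags shown are standard dx options, and the polling interval is arbitrary):

import json
import subprocess
import time

# Launch the JupyterLab app non-interactively; --brief prints only the job ID.
job_id = subprocess.check_output(
    ["dx", "run", "app-dxjupyterlab", "-y", "--brief"], text=True
).strip()

# Poll the job description until the server reports it is ready to serve
# connections (properties.httpsAppState == "running").
while True:
    desc = json.loads(subprocess.check_output(["dx", "describe", job_id, "--json"], text=True))
    if desc.get("properties", {}).get("httpsAppState") == "running":
        break
    time.sleep(30)

print("Open https://{0}.dnanexus.cloud in your browser".format(job_id))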
See the Quickstart and References pages for more details on how to use DXJupyterLab.
The primary field must contain no more than 15 distinct category values
The secondary field must contain numerical data
Primary Field: Categorical or Categorical Multiple (<=15 categories)
Secondary Field: Numerical (Integer) or Numerical (Float)
The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:
A field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart above for an example of the informational message that shows below the chart, in this scenario.
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:
This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. In multiple groups, one member was recorded as consuming far more cups of coffee per day than others in the group.
In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.
In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.
Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
When ingesting data using Data Model Loader, the following data types can be visualized in grouped box plots:
String Categorical
String Categorical Multi-Select
String Categorical Sparse
Integer Categorical
Integer Categorical Multi-Select
Integer
Integer Sparse
Float
Float Sparse
Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:
Max - The maximum, or highest value
Med - The median value
Min - The minimum, or lowest value
The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.
Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values. "Q3" is the highest value in the third quartile.
Also shown in this window is the total count of values covered by the box plot, along with the name of the entity to which the data relates.
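For illustration only, the summary statistics a box plot reports can be reproduced with a few lines of Python (numpy is assumed here and is not part of the Cohort Browser; its quartile interpolation may differ slightly from the values the chart displays):

import numpy as np

values = np.array([1, 1, 2, 2, 2, 3, 3, 4, 5, 42])  # toy data; 42 is an outlier

q1, med, q3 = np.percentile(values, [25, 50, 75])
print("Max:", values.max())   # top line of the box plot
print("Q3:", q3)              # top of the box
print("Med:", med)            # line inside the box
print("Q1:", q1)              # bottom of the box
print("Min:", values.min())   # bottom line of the box plot
print("Count:", values.size)  # total count shown in the hover window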
Fields containing primarily numeric data may also include non-numeric values. These non-numeric values cannot be represented in a box plot. See the chart above for an example of the informational message that shows below the chart when non-numeric values are present.
Clicking the "non-numeric values" link displays detail on those values, and the number of records in which each appears:
In this scenario, a discrepancy exists between the "count" figure shown in the chart label and the one shown in the informational window that opens when hovering over the middle of a box plot. The latter figure is smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.
Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:
This box plot displays data on the number of cups of coffee consumed per day, by patients of a particular cohort. One cohort patient was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.
In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.
Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
When ingesting data using Data Model Loader, the following data types can be visualized in box plots:
Integer
Integer Sparse
Float
Float Sparse
Numerical (Integer)
Numerical (Float)
Primary Field: Categorical (<=20 distinct category values)
Secondary Field: Categorical (<=20 distinct category values)
Access developer tutorials and examples.
Developers new to the DNAnexus Platform may find it helpful to learn by doing. This page contains a collection of tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus Platform. After reading through the tutorials and examples you should be able to develop app(let)s that:
Run efficiently: use cloud computing methodologies.
Are straightforward to debug: let developers understand and resolve issues.
Use the scale of the cloud: take advantage of the DNAnexus Platform's flexibility.
Are straightforward to use: reduce support and enable collaboration.
If it's your first time developing an app(let), read the introductory series first. It introduces terms and concepts that the tutorials and examples build on.
These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. These tutorials showcase varied implementations of the SAMtools view command on the DNAnexus Platform.
Bash app(let)s use dx-toolkit, the platform SDK, and the command-line interface along with common Bash practices to create bioinformatic pipelines in the cloud.
Python app(let)s make use of the dx-toolkit Python SDK along with common Python modules to create bioinformatic pipelines in the cloud.
To create a web applet, you need access to Titan or Apollo features. Web applets can be made as either Python or Bash applets. The only difference is that they launch a web server and expose port 443 (for HTTPS) to allow a user to interact with that web application through a web browser.
A bit of terminology before starting the discussion of parallel and distributed computing paradigms on the DNAnexus Platform.
Many definitions and approaches exist for tackling the concept of parallelization and distributing workloads in the cloud (there's a helpful Stack Exchange post on the subject). To make the documentation easier to understand when discussing concurrent computing paradigms, this guide refers to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in this case, cloud instances) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
This applet tutorial performs a SAMtools count using parallel threads.
View full source code on GitHub
To take full advantage of the scalability that cloud computing offers, your scripts have to implement the correct methodologies. This applet tutorial shows you how to:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.
For additional information, refer to the execDepends documentation.
The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:
<var>_path (string): full absolute path to where the file was downloaded.
<var>_name (string): name of the file, including extension.
<var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, the dictionary has the following key-value pairs:
Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to 10 and the workload per thread is set to 1 chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use multiprocessing.dummy, a wrapper around Python's threading module.
Each worker creates a string to be called in a subprocess.Popen call. The multiprocessing.dummy.Pool.map(<func>, <iterable>) function is used to call the helper function run_cmd for each string in the iterable of view commands. Because multithreaded processing is performed using subprocess.Popen, the process does not alert to any failed processes. Closed workers are verified in the verify_pool_status helper function.
Important: In this example, you use subprocess.Popen to process and verify results in verify_pool_status. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
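Pulled together, the pattern looks roughly like the sketch below. This is a condensed illustration rather than the applet's full source: the BAM file name and region list are placeholders, and the real script builds them from the downloaded inputs.

import subprocess
from multiprocessing.dummy import Pool as ThreadPool  # thread-based wrapper around threading

def run_cmd(cmd):
    # Run one command; return (stdout, stderr, exit_code) for later verification.
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

def verify_pool_status(results):
    # Popen does not raise on failure, so check every worker's exit code here.
    errors = [stderr for _, stderr, code in results if code != 0]
    if errors:
        raise RuntimeError(b"\n".join(errors).decode())

# One `samtools view -c` command per region (placeholder BAM name and regions).
view_cmds = ["samtools view -c input.bam {0}".format(r) for r in ("chr1", "chr2", "chrX")]

pool = ThreadPool(10)                 # 10 worker threads, as in the tutorial
results = pool.map(run_cmd, view_cmds)
pool.close()
pool.join()
verify_pool_status(results)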
Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file is used to upload and generate a DXFile corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function is used to generate the appropriate DXLink value.
This applet tutorial performs a SAMtools count using parallel threads.
View full source code on GitHub
To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial shows you how to:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.
This applet downloads all inputs at once using dxpy.download_all_inputs:
Using the Python multiprocessing module, you can split the workload into multiple processes for parallel execution:
With this pattern, you can quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the official Python documentation.
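A minimal sketch of that split is shown below; the worker function is a placeholder (in the applet it wraps the samtools subprocess call described next), and the command list is illustrative.

from multiprocessing import Pool, cpu_count

def run_cmd(cmd):
    return cmd  # placeholder worker; the applet runs the command via subprocess

commands = ["samtools view -c bam_chr{0}.bam".format(c) for c in (1, 2, "X")]

pool = Pool(processes=cpu_count())  # one worker process per core
results = pool.map(run_cmd, commands)
pool.close()
pool.join()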
Specific helpers are created in the applet script to manage the workload. One helper you may have seen before is run_cmd. This function manages the subprocess calls:
Before the workload can be split, you need to identify the regions present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs from the workers are parsed to determine whether the run failed or passed.
View full source code on GitHub
This applet performs a basic SAMtools count of alignments present in an input BAM.
The app must have network access to the hostname where the git repository is located. In this example, access.network is set to:
To learn more about the access and network fields, see the app permissions documentation.
SAMtools is cloned and built from the repository. The following is a closer look at the dxapp.json file's runSpec.execDepends property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, the git fetch dependencies for htslib and SAMtools are specified. Dependencies resolve in the order listed. Specify htslib first, before the SAMtools build_commands, because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:
package_manager - Details the type of dependency and how to resolve it.
url - Must point to the server containing the repository. In this case, a GitHub URL.
tag/branch - Git tag/branch to fetch.
The build_commands are executed from the destdir. Use cd when appropriate.
Where is the cloned repository located relative to the src script? Because "destdir": "/home/dnanexus" is set in dxapp.json, the git repository is cloned into the same directory from which the script executes. The example directory structure is:
The SAMtools command in the app script is samtools/samtools.
You can build SAMtools in a directory that is on the $PATH or add the binary directory to $PATH. Keep this in mind for your app(let) development.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.
The command set -e -x -o pipefail assists you in debugging this applet:
-e causes the shell to immediately exit if a command returns a non-zero exit code.
-x prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.
-o pipefail makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)
The *.bai file was an optional job input. You can check for an empty or unset variable using the bash built-in test [[ -z "${var}" ]]. You can then download or create a *.bai index as needed.
Bash's job control system allows for convenient management of multiple processes. In this example, bash commands are run in the background while the maximum number of concurrent executions is controlled in the foreground. You can place processes in the background using the character & after a command.
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The directory structure required by dx-upload-all-outputs is shown below:
In your applet, upload all outputs by creating the output directory and then using dx-upload-all-outputs to upload the output files.
This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.
View full source code on GitHub
Pysam is provided through a pip3 install using the pip3 package manager in the dxapp.json's runSpec.execDepends property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, pip3 is specified as the package manager and pysam version 0.15.4 as the dependency to resolve.
The fields mappings_sorted_bam and mappings_sorted_bai are passed to the main function as parameters for the job. These parameters are dictionary objects with the key-value pair {"$dnanexus_link": "<file>-<xxxx>"}. File objects from the platform are handled through dxpy.DXFile handles. If an index file is not supplied, then a *.bai index is created.
Pysam provides key methods that mimic SAMtools commands. In this applet example, the focus is only on canonical chromosomes. The Pysam object representation of a BAM file is pysam.AlignmentFile.
The helper function get_chr parses the BAM header and returns the list of regions, optionally restricted to canonical chromosomes.
Once a list of canonical chromosomes is established, you can iterate over them and perform the Pysam version of samtools view -c, pysam.AlignmentFile.count.
The summarized counts are returned as the job output. The dx-toolkit Python SDK function dxpy.upload_local_file uploads and generates a DXFile corresponding to the tabulated result file.
Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. The dxpy.dxlink function generates the appropriate DXLink value.
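Condensed, that output step amounts to the following sketch (build_output is a hypothetical helper name, not part of the applet):

import dxpy

def build_output(count_filename):
    # Upload the tabulated counts and return the job output dictionary;
    # the key must match the output name declared in dxapp.json.
    counts_txt = dxpy.upload_local_file(count_filename)
    return {"counts_txt": dxpy.dxlink(counts_txt)}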
This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:
When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:
/usr/bin is part of the $PATH variable, so the script can reference the samtools command directly, for example, samtools view -c ....
First, download the BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.
To split a BAM by regions, you need a *.bai index. You can either create an app(let) which takes the *.bai as an input or generate a *.bai in the applet. In this tutorial, you generate the *.bai in the applet, sorting the BAM if necessary.
In the previous section, you recorded the name of each sliced BAM file into a record file. Next, perform a samtools view -c on each slice using the record file as input.
The results file is uploaded using the standard bash process:
Upload a file to the job execution's container.
Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>
VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many files case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In any case, every input VCF file must be a syntactically correct, sorted VCF file.
Although VCF data can be loaded into Apollo databases after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, preprocessing and harmonizing the data before loading is recommended.
Input:
vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned and every specified partition must be a valid VCF file. After the partition-merge step in preprocessing, the complete VCF file must be valid.
Required Parameters:
database_name: (string) name of the database into which to load the VCF files.
create_mode: (string) strict mode creates the database and tables from scratch, while optimistic mode creates databases and tables only if they do not already exist.
Other Options:
snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome: (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
Analyze germline genomic variants, including filtering, visualization, and detailed variant annotation in the Cohort Browser.
Explore and analyze datasets with germline data by opening them in the Cohort Browser and switching to the Germline Variants tab. You can create cohorts based on germline variants, visualize variant patterns, and examine detailed variant information.
You can filter your cohort to include only samples with specific germline variants.
To apply a germline filter to your cohort:
Analyze gene expression data, including expression-based filtering, visualization, and molecular profiling in the Cohort Browser.
Speed workflow development and reduce testing costs by reusing computational outputs.
dx run app-dxjupyterlab

dx run app-dxjupyterlab_spark_cluster

dx run -h APP_NAME

"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}

{
"runSpec": {
...
"execDepends": [
{"name": "pysam",
"package_manager": "pip3",
"version": "0.15.4"
}
]
...
}

├── Applet dir
│ ├── src
│ ├── dxapp.json
│ ├── resources
│ ├── usr
│ ├── bin
│ ├── <samtools binary>

IMAGE_PROCESSING: Python3 with image processing packages (Nipype, FreeSurfer, FSL), but no R. FreeSurfer requires a license.
STATA: Stata requires a license to run.
MONAI_ML: Extends the ML feature with specialized medical imaging frameworks, such as MONAI Core, MONAI Label, and 3D Slicer.
destdir - Directory on the worker to which the git repository is cloned.
build_commands - Commands to build the dependency, run from the repository destdir. In this example, htslib is built when SAMtools is built, so only the SAMtools entry includes build_commands.
insert_mode: (string) either append or overwrite.
run_mode: (string) site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data, and all mode processes both types of data.
etl_spec_id: (string) Only the genomics-phenotype schema choice is supported.
is_sample_partitioned: (boolean) whether the raw VCF data is partitioned.
snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). This option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'
calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.
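As a sketch of how these inputs fit together, the invocation below mirrors the dx run example that appears later in this section; the manifest file ID and database name are placeholders to replace with your own.

import subprocess

cmd = [
    "dx", "run", "vcf-loader",
    "-i", "vcf_manifest=file-xxxx",          # placeholder manifest file ID
    "-i", "database_name=my_vcf_db",         # placeholder database name
    "-i", "etl_spec_id=genomics-phenotype",
    "-i", "create_mode=strict",
    "-i", "insert_mode=append",
    "-i", "run_mode=all",
    "-i", "is_sample_partitioned=false",
    "-y", "--brief",
]
print(subprocess.check_output(cmd, text=True).strip())  # prints the job ID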
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:
The main function slices the initial *.bam file and generates an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output.
This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file as the entry point output.
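The entry points above are written in bash. As a rough, hypothetical sketch only, the same fan-out/fan-in shape in a Python-interpreter applet would use dxpy.new_dxjob and job-based object references; the chromosome list and entry-point bodies below are placeholders, not the tutorial's code.

import dxpy

@dxpy.entry_point("count_func")
def count_func(bam, chrom):
    raise NotImplementedError  # placeholder: download the slice, count, upload counts_txt

@dxpy.entry_point("sum_reads")
def sum_reads(readfiles):
    raise NotImplementedError  # placeholder: download each count file, sum, upload read_sum_file

@dxpy.entry_point("main")
def main(mappings_sorted_bam):
    # Fan out: one subjob per chromosome.
    count_jobs = [dxpy.new_dxjob({"bam": mappings_sorted_bam, "chrom": c}, "count_func")
                  for c in ["chr1", "chr2", "chrX"]]
    # Fan in: pass each subjob's output to sum_reads as a JBOR.
    sum_job = dxpy.new_dxjob(
        {"readfiles": [j.get_output_ref("counts_txt") for j in count_jobs]}, "sum_reads")
    # Use the sum_reads output as the main job's output, again via JBOR.
    return {"read_sum_file": sum_job.get_output_ref("read_sum_file")}

dxpy.run()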
Often, you can retrieve data without using Spark, and extra compute resources are not required (see the example OpenBio notebooks). However, if you need more compute power—such as when working with complex data models, large datasets, or extracting large volumes of data—you can use a private Spark resource. In these scenarios, data is returned through the DNAnexus Thrift Server. While the Thrift Server is highly available, it has a fixed timeout that may limit the number of queries you can run. Using private compute resources helps avoid these timeouts by scaling resources as needed.
If you use the --sql flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.
The most common way to use Spark on the DNAnexus Platform is via a Spark-enabled JupyterLab notebook.
After creating a Jupyter notebook within a project, enter the commands shown below to start a Spark session.
Python:
R:
Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:
Python:
R:
Python:
Where dataset is the record-id or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset."
R:
Where dataset is the record-id or the path to the dataset or cohort.
Python:
R:
In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele command. For more information, refer to the notebooks in the DNAnexus OpenBio dx-toolkit examples.
Python:
R:
When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately, with multiple nodes to parallelize across.
Ensure that your Spark session is only initialized once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter Job - for example, run notebook 1 and also run notebook 2 OR run a notebook from start to finish multiple times - the Spark session becomes corrupted and you need to restart the specific notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically this looks something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) as opposed to the database name (metabric_data in this case).
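One way to reduce the risk of accidentally creating a second session is to ask Spark for the active session instead of constructing a new one; this uses the standard pyspark builder API rather than the initialization snippet shown earlier, so treat it as an optional pattern:

import pyspark

# Returns the already-active SparkSession for this kernel if one exists,
# instead of building a second one.
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sc = spark.sparkContext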
"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}

{
mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/SRR504516.bam']
mappings_bam_name: [u'SRR504516.bam']
mappings_bam_prefix: [u'SRR504516']
index_file_path: [u'/home/dnanexus/in/index_file/SRR504516.bam.bai']
index_file_name: [u'SRR504516.bam.bai']
index_file_prefix: [u'SRR504516']
}

inputs = dxpy.download_all_inputs()
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())

input_bam = inputs['mappings_bam_name'][0]
bam_to_use = create_index_file(input_bam)
print("Dir info:")
print(os.listdir(os.getcwd()))
regions = parseSAM_header_for_region(bam_to_use)
view_cmds = [
create_region_view_cmd(bam_to_use, region)
for region
in regions]
print('Parallel counts')
t_pools = ThreadPool(10)
results = t_pools.map(run_cmd, view_cmds)
t_pools.close()
t_pools.join()
verify_pool_status(results)

def verify_pool_status(proc_tuples):
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))

resultfn = bam_to_use[:-4] + '_count.txt'
with open(resultfn, 'w') as f:
sum_reads = 0
for res, reg in zip(results, regions):
read_count = int(res[0])
sum_reads += read_count
f.write("Region {0}: {1}\n".format(reg, read_count))
f.write("Total reads: {0}".format(sum_reads))
count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)
return output

{
"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}

inputs = dxpy.download_all_inputs()
# download_all_inputs returns a dictionary that contains mapping from inputs to file locations.
# Additionaly, helper keys, value pairs are added to the dicitonary, similar to bash helper functions
inputs
# mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
# mappings_sorted_bam_name: u'SRR504516.bam'
# mappings_sorted_bam_prefix: u'SRR504516'
# mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
# mappings_sorted_bai_name: u'SRR504516.bam.bai'
# mappings_sorted_bai_prefix: u'SRR504516'

print("Number of cpus: {0}".format(cpu_count())) # Get cpu count from multiprocessing
worker_pool = Pool(processes=cpu_count()) # Create a pool of workers, 1 for each core
results = worker_pool.map(run_cmd, collection) # map run_cmds to a collection
# Pool.map handles orchestrating the job
worker_pool.close()
worker_pool.join() # Make sure to close and join workers when done

def run_cmd(cmd_arr):
"""Run shell command.
Helper function to simplify the pool.map() call in our parallelization.
Raises OSError if command specified (index 0 in cmd_arr) isn't valid
"""
proc = subprocess.Popen(
cmd_arr,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
exit_code = proc.returncode
proc_tuple = (stdout, stderr, exit_code)
return proc_tuple

def parse_sam_header_for_region(bamfile_path):
"""Helper function to match SN regions contained in SAM header
Returns:
regions (list[string]): list of regions in bam header
"""
header_cmd = ['samtools', 'view', '-H', bamfile_path]
print('parsing SAM headers:', " ".join(header_cmd))
headers_str = subprocess.check_output(header_cmd).decode("utf-8")
rgx = re.compile(r'SN:(\S+)\s')
regions = rgx.findall(headers_str)
return regions

# Write results to file
resultfn = inputs['mappings_sorted_bam_name'][0]
resultfn = (
resultfn[:-4] + '_count.txt'
if resultfn.endswith(".bam")
else resultfn + '_count.txt')
with open(resultfn, 'w') as f:
sum_reads = 0
for res, reg in zip(results, regions):
read_count = int(res[0])
sum_reads += read_count
f.write("Region {0}: {1}\n".format(reg, read_count))
f.write("Total reads: {0}".format(sum_reads))
count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)
return output

def verify_pool_status(proc_tuples):
"""
Helper to verify worker succeeded.
As failed commands are detected, the `stderr` from that command is written
to the job_error.json file. This file is printed to the Platform
job log on App failure.
"""
all_succeed = True
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
all_succeed = False
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))

"access": {
"network": ["github.com"]
}

"runSpec": {
...
"execDepends": [
{
"name": "htslib",
"package_manager": "git",
"url": "https://github.com/samtools/htslib.git",
"tag": "1.3.1",
"destdir": "/home/dnanexus"
},
{
"name": "samtools",
"package_manager": "git",
"url": "https://github.com/samtools/samtools.git",
"tag": "1.3.1",
"destdir": "/home/dnanexus",
"build_commands": "make samtools"
}
],
...
}

├── home
│ ├── dnanexus
│ ├── < app script >
│ ├── htslib
│ ├── samtools
│ ├── < samtools binary >

main() {
set -e -x -o pipefail
dx download "$mappings_bam"
count_filename="${mappings_bam_prefix}.txt"
readcount=$(samtools/samtools view -c "${mappings_bam_name}")
echo "Total reads: ${readcount}" > "${count_filename}"
counts_txt=$(dx upload "${count_filename}" --brief)
dx-jobutil-add-output counts_txt "${counts_txt}" --class=file
}

set -e -x -o pipefail
echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
mkdir workspace
cd workspace
dx download "${mappings_sorted_bam}"
if [ -z "$mappings_sorted_bai" ]; then
samtools index "$mappings_sorted_bam_name"
else
dx download "${mappings_sorted_bai}"
fi

# Extract valid chromosome names from BAM header
chromosomes=$(
samtools view -H "${mappings_sorted_bam_name}" | \
grep "@SQ" | \
awk -F '\t' '{print $2}' | \
awk -F ':' '{
if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
print $2
}
}'
)
# Split BAM by chromosome and record output file names
for chr in $chromosomes; do
samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
echo "bam_${chr}.bam"
done > bamfiles.txt
# Parallel counting of reads per chromosome BAM
busyproc=0
while read -r b_file; do
echo "${b_file}"
# If busy processes hit limit, wait for one to finish
if [[ "${busyproc}" -ge "$(nproc)" ]]; then
echo "Processes hit max"
while [[ "${busyproc}" -gt 0 ]]; do
wait -n
busyproc=$((busyproc - 1))
done
fi
# Count reads in background
samtools view -c "${b_file}" > "count_${b_file%.bam}" &
busyproc=$((busyproc + 1))
done < bamfiles.txt

while [[ "${busyproc}" -gt 0 ]]; do
wait -n # p_id
busyproc=$((busyproc-1))
done

├── $HOME
│ ├── out
│ ├── < output name in dxapp.json >
│ ├── output file

outputdir="${HOME}/out/counts_txt"
mkdir -p "${outputdir}"
cat count* \
| awk '{sum+=$1} \
END{print "Total reads = ",sum}' \
> "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
dx-upload-all-outputs

print(mappings_sorted_bai)
print(mappings_sorted_bam)
mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
sorted_bam_name = mappings_sorted_bam.name
dxpy.download_dxfile(mappings_sorted_bam.get_id(),
sorted_bam_name)
ascii_bam_name = unicodedata.normalize( # Pysam requires ASCII not Unicode string.
'NFKD', sorted_bam_name).encode('ascii', 'ignore')
if mappings_sorted_bai is not None:
mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
dxpy.download_dxfile(mappings_sorted_bai.get_id(),
mappings_sorted_bai.name)
else:
pysam.index(ascii_bam_name)

mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
regions = get_chr(mappings_obj, canonical_chr)

def get_chr(bam_alignment, canonical=False):
"""Helper function to return canonical chromosomes from SAM/BAM header
Arguments:
bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
canonical (boolean): Return only canonical chromosomes
Returns:
regions (list[str]): Region strings
"""
regions = []
headers = bam_alignment.header
seq_dict = headers['SQ']
if canonical:
re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
for seq_elem in seq_dict:
if re_canonical_chr.match(seq_elem['SN']):
regions.append(seq_elem['SN'])
else:
regions = [''] * len(seq_dict)
for i, seq_elem in enumerate(seq_dict):
regions[i] = seq_elem['SN']
return regions

total_count = 0
count_filename = "{bam_prefix}_counts.txt".format(
bam_prefix=ascii_bam_name[:-4])
with open(count_filename, "w") as f:
for region in regions:
temp_count = mappings_obj.count(region=region)
f.write("{region_name}: {counts}\n".format(
region_name=region, counts=temp_count))
total_count += temp_count
f.write("Total reads: {sum_counts}".format(sum_counts=total_count))

counts_txt = dxpy.upload_local_file(count_filename)
output = {}
output["counts_txt"] = dxpy.dxlink(counts_txt)
return output

/
├── usr
│ ├── bin
│ ├── < samtools binary >
├── home
│ ├── dnanexus

# Download BAM from DNAnexus
dx download "${mappings_bam}"
# Attempt to index the BAM file
indexsuccess=true
bam_filename="${mappings_bam_name}"
samtools index "${mappings_bam_name}" || indexsuccess=false
# If indexing fails, sort then index
if [[ $indexsuccess == false ]]; then
samtools sort -o "${mappings_bam_name}" "${mappings_bam_name}"
samtools index "${mappings_bam_name}"
bam_filename="${mappings_bam_name}"
fi
# Extract chromosome names from header
chromosomes=$(
samtools view -H "${bam_filename}" | \
grep "@SQ" | \
awk -F '\t' '{print $2}' | \
awk -F ':' '{
if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
print $2
}
}'
)
# Split BAM by chromosome and record filenames
for chr in $chromosomes; do
samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}.bam"
echo "bam_${chr}.bam"
done > bamfiles.txt

counts_txt_name="${mappings_bam_prefix}_count.txt"
# Sum all read counts across split BAM files
sum_reads=$(
< bamfiles.txt xargs -I {} \
samtools view -c $view_options '{}' | \
awk '{s += $1} END {print s}'
)
# Write the total read count to a file
echo "Total Count: ${sum_reads}" > "${counts_txt_name}"

counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file

dx run vcf-loader \
-i vcf_manifest=file-xxxx \
-i is_sample_partitioned=false \
-i database_name=<my_favorite_db> \
-i etl_spec_id=genomics-phenotype \
-i create_mode=strict \
-i insert_mode=append \
-i run_mode=genotype

"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}

{
...
"runSpec": {
...
"execDepends": [
{
"name": "samtools"
}
]
}
...
}

{
"runSpec": {
...
"systemRequirements": {
"main": {
"instanceType": "mem1_ssd1_x4"
},
"count_func": {
"instanceType": "mem1_ssd1_x2"
},
"sum_reads": {
"instanceType": "mem1_ssd1_x4"
}
},
...
}
}

dx download "${mappings_sorted_bam}"
chromosomes=$( \
samtools view -H "${mappings_sorted_bam_name}" \
| grep "\@SQ" \
| awk -F '\t' '{print $2}' \
| awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')

if [ -z "${mappings_sorted_bai}" ]; then
samtools index "${mappings_sorted_bam_name}"
else
dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}.bai"
fi
count_jobs=()
for chr in $chromosomes; do
seg_name="${mappings_sorted_bam_prefix}_${chr}.bam"
samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
bam_seg_file=$(dx upload "${seg_name}" --brief)
count_jobs+=($(dx-jobutil-new-job \
-isegmentedbam_file="${bam_seg_file}" \
-ichr="${chr}" \
count_func))
done

for job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${job}:counts_txt")
done
sum_reads_job=$(
dx-jobutil-new-job \
"${readfiles[@]}" \
-ifilename="${mappings_sorted_bam_prefix}" \
sum_reads
)

count_func () {
echo "Value of segmentedbam_file: '${segmentedbam_file}'"
echo "Chromosome being counted '${chr}'"
dx download "${segmentedbam_file}"
readcount=$(samtools view -c "${segmentedbam_file_name}")
printf "${chr}:\t%s\n" "${readcount}" > "${segmentedbam_file_prefix}.txt"
readcount_file=$(dx upload "${segmentedbam_file_prefix}.txt" --brief)
dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
}

sum_reads () {
set -e -x -o pipefail
printf "Value of read file array %s" "${readfiles[@]}"
echo "Filename: ${filename}"
echo "Summing values in files and creating output read file"
for read_f in "${readfiles[@]}"; do
echo "${read_f}"
dx download "${read_f}" -o - >> chromosome_result.txt
done
count_file="${filename}_chromosome_count.txt"
total=$(awk '{s+=$2} END {print s}' chromosome_result.txt)
echo "Total reads: ${total}" >> "${count_file}"
readfile_name=$(dx upload "${count_file}" --brief)
dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
}

set -e -x -o pipefail
echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
mkdir workspace
cd workspace
dx download "${mappings_sorted_bam}"
if [ -z "$mappings_sorted_bai" ]; then
samtools index "$mappings_sorted_bam_name"
else
dx download "${mappings_sorted_bai}"
fi

"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}

chromosomes=$( \
samtools view -H "${mappings_sorted_bam_name}" \
| grep "\@SQ" \
| awk -F '\t' '{print $2}' \
| awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
for chr in $chromosomes; do
samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
echo "bam_${chr}.bam"
done > bamfiles.txt
busyproc=0
while read -r b_file; do
echo "${b_file}"
if [[ "${busyproc}" -ge "$(nproc)" ]]; then
echo Processes hit max
while [[ "${busyproc}" -gt 0 ]]; do
wait -n # p_id
busyproc=$((busyproc-1))
done
fi
samtools view -c "${b_file}"> "count_${b_file%.bam}" &
busyproc=$((busyproc+1))
done <bamfiles.txt

while [[ "${busyproc}" -gt 0 ]]; do
wait -n # p_id
busyproc=$((busyproc-1))
done

├── $HOME
│ ├── out
│ ├── < output name in dxapp.json >
│ ├── output file

outputdir="${HOME}/out/counts_txt"
mkdir -p "${outputdir}"
cat count* \
| awk '{sum+=$1} END{print "Total reads = ",sum}' \
> "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
dx-upload-all-outputs

counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file

├── Applet dir
│ ├── src
│ ├── dxapp.json
│ ├── resources
│ ├── usr
│ ├── bin
│ ├── < samtools binary >

/
├── usr
│ ├── bin
│ ├── < samtools binary >
├── home
│ ├── dnanexus

dx download "${mappings_bam}"
indexsuccess=true
bam_filename="${mappings_bam_name}"
samtools index "${mappings_bam_name}" || indexsuccess=false
if [[ $indexsuccess == false ]]; then
samtools sort -o "${mappings_bam_name}" "${mappings_bam_name}"
samtools index "${mappings_bam_name}"
bam_filename="${mappings_bam_name}"
fi
chromosomes=$( \
samtools view -H "${bam_filename}" \
| grep "\@SQ" \
| awk -F '\t' '{print $2}' \
| awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
for chr in $chromosomes; do
samtools view -b "${bam_filename}" "${chr}" -o "bam_${chr}."bam
echo "bam_${chr}.bam"
done > bamfiles.txt

counts_txt_name="${mappings_bam_prefix}_count.txt"
sum_reads=$( \
<bamfiles.txt xargs -I {} samtools view -c $view_options '{}' \
| awk '{s+=$1} END {print s}')
echo "Total Count: ${sum_reads}" > "${counts_txt_name}"

{
...
"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}
...
}

{
"runSpec": {
...
"systemRequirements": {
"main": {
"instanceType": "mem1_ssd1_x4"
},
"count_func": {
"instanceType": "mem1_ssd1_x2"
},
"sum_reads": {
"instanceType": "mem1_ssd1_x4"
}
},
...
}
}

dx download "${mappings_sorted_bam}"
chromosomes=$( \
samtools view -H "${mappings_sorted_bam_name}" \
| grep "\@SQ" \
| awk -F '\t' '{print $2}' \
| awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')

if [ -z "${mappings_sorted_bai}" ]; then
samtools index "${mappings_sorted_bam_name}"
else
dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}".bai
fi
count_jobs=()
for chr in $chromosomes; do
seg_name="${mappings_sorted_bam_prefix}_${chr}".bam
samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
bam_seg_file=$(dx upload "${seg_name}" --brief)
count_jobs+=($(dx-jobutil-new-job -isegmentedbam_file="${bam_seg_file}" -ichr="${chr}" count_func))
done

for job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${job}:counts_txt")
done
sum_reads_job=$(dx-jobutil-new-job "${readfiles[@]}" -ifilename="${mappings_sorted_bam_prefix}" sum_reads)

count_func ()
{
echo "Value of segmentedbam_file: '${segmentedbam_file}'";
echo "Chromosome being counted '${chr}'";
dx download "${segmentedbam_file}";
readcount=$(samtools view -c "${segmentedbam_file_name}");
printf "${chr}:\t%s\n" "${readcount}" > "${segmentedbam_file_prefix}.txt";
readcount_file=$(dx upload "${segmentedbam_file_prefix}".txt --brief);
dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
}

sum_reads ()
{
set -e -x -o pipefail;
printf "Value of read file array %s" "${readfiles[@]}";
echo "Filename: ${filename}";
echo "Summing values in files and creating output read file";
for read_f in "${readfiles[@]}";
do
echo "${read_f}";
dx download "${read_f}" -o - >> chromosome_result.txt;
done;
count_file="${filename}_chromosome_count.txt";
total=$(awk '{s+=$2} END {print s}' chromosome_result.txt);
echo "Total reads: ${total}" >> "${count_file}";
readfile_name=$(dx upload "${count_file}" --brief);
dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
}

import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

install.packages("sparklyr")
library(sparklyr)
port <- Sys.getenv("SPARK_MASTER_PORT")
master <- paste("spark://master:", port, sep = '')
sc = spark_connect(master)

retrieve_sql = 'select .... from .... '
df = spark.sql(retrieve_sql)

library(DBI)
retrieve_sql <- 'select .... from .... '
df = dbGetQuery(sc, retrieve_sql)

import subprocess
cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o", "extracted_data.sql"]
subprocess.check_call(cmd)

cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
system(cmd)

import subprocess
cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o", "extract_allele.sql"]
subprocess.check_call(cmd)

cmd <- paste("dx extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o extracted_allele.sql")
system(cmd)

with open("extracted_data.sql", "r") as file:
    retrieve_sql = ""
    for line in file:
        retrieve_sql += line.strip()
df = spark.sql(retrieve_sql.strip(";"))

install.packages("tidyverse")
library(readr)
retrieve_sql <- read_file("extracted_data.sql")
retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
df <- dbGetQuery(sc, retrieve_sql)

For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Genomic Sequencing, select a genomic filter.
In Edit Filter: Variant (Germline), specify your filtering criteria:
For datasets with multiple germline variant assays, select the specific assay to filter by.
On the Genes / Effects tab, select variants of specific types and variant consequences within the specified genes and/or genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.
On the Variant IDs tab, specify a list of variant IDs, with a maximum of 100 variants.
To enter multiple genes, genomic ranges, or variants, separate them with commas or place each on a new line.
Click Apply Filter.
The Germline Variants tab includes a lollipop plot displaying allele frequencies for variants in a specified genomic region. This visualization helps you identify patterns in germline variants across your cohort and understand the distribution of allelic frequencies.
The allele table, located below the lollipop plot, shows the same variants in a tabular format with comprehensive annotation information. It allows you to examine specific variant characteristics and compare allele frequencies within your selected cohort, the entire dataset, and from annotation databases, including gnomAD.
The annotation information includes:
Type: whether the variant is an SNP, deletion, insertion, or mixed.
Consequences: The impact of the variant according to SnpEff. For variants with multiple gene annotations, this column displays the most severe consequence per gene.
Population Allele Frequency: Allele frequency calculated across the entire dataset from which the cohort is created.
Cohort Allele Frequency: Allele frequency calculated across the current cohort selection.
GnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset .
If canonical transcript information is available, the following three columns with additional annotation information appear in the table:
Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.
HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each associated gene with this variant.
HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each associated gene with this variant.
You can export the selected variants in the table as a list of variant IDs or a CSV file.
To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.
To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).
For large datasets, you can use the SQL Runner app to download data in a more efficient way.
In the Allele table's Location column, you can click a specific location to open the locus details. The locus details view provides in-depth annotations and population genetics data for the selected genomic position.
The locus details page displays three main sections of pre-calculated information from dataset ingestion: Location Summary, Genotype Distribution, and Allele Annotations. These sections provide a comprehensive view starting with a locus summary, including genotype frequencies, followed by detailed annotations for each allele.
Location Info provides a quick overview of the genomic locus in your dataset, including the chromosome and starting position, the frequency of both the reference allele and no-calls, and the total number of alleles available.
Genotypes shows a detailed breakdown of genotypes in the dataset at the specific location. Since allele order is not preserved, genotypes like C/A and A/C are counted in the same category, which is why only half of the comparison table is populated. These genotype frequencies represent the entire dataset at this location, not only your selected cohort.
Alleles displays detailed information for each allele, collected from dbSNP and gnomAD during data ingestion. When available, rsID or AffyID appear with direct links to the corresponding NCBI dbSNP page. The section provides allele type, affected samples (dataset), and gnomAD frequency for quick reference, with additional details sorted by transcript ID. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.
For more sophisticated genomic analysis beyond the Cohort Browser's visualization capabilities, you can connect your variant data with other DNAnexus tools. Export variant lists for detailed analysis in JupyterLab, leverage Spark clusters for large-scale genomic computations, or connect to SQL Runner for complex queries across your dataset.
You can customize your Gene Expression dashboard to focus on the most relevant analyses for your research:
Create new Expression Distribution or Feature Correlation charts.
Remove charts you no longer need.
Resize and reposition charts to optimize your workspace.
Save your dashboard customizations along with your cohort.
For datasets with multiple gene expression assays, you can choose the specific assay to visualize at the top of the dashboard. The Cohort Browser displays data from only one assay at a time. Switching between assays preserves your charts and their display settings.
You can define your cohort by gene expression to include only patients with specific expression characteristics.
To apply a gene expression filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Gene Expression, select a gene expression filter.
In Edit Filter: Gene Expression, specify the criteria:
For datasets with multiple gene expression assays, select the specific assay to filter by.
In Expression Level, specify inclusive minimum and maximum values. For an individual to be included, all their expression values across all samples for the feature must fall within the range.
In Gene / Transcript, enter a gene symbol, such as BRCA1, or feature ID, such as ENSG00000012048 or ENST00000309586. Search is case insensitive.
Click Apply Filter.
The Expression Level charts help you visualize gene expression patterns for individual transcript or gene features. You can examine how expression values are distributed across your cohort, identify outliers, and compare patterns between different patient groups.
The chart displays data for one gene or transcript at a time. You can directly enter a transcript or gene feature ID, such as an ID starting with [ENST](https://useast.ensembl.org/Help/View?id=151) or [ENSG](https://useast.ensembl.org/info/genome/genebuild/index.html), or search by gene symbol to see available options.
You can view the data as either a histogram showing frequency distribution or a box plot displaying quartiles and outliers. To switch between these views or adjust display statistics, click ⛭ Chart Settings.
When comparing cohorts, the chart shows data from each cohort on the same axes for direct comparison.
You can also customize your charts by selecting different transcript or gene features, resizing and rearranging them on your dashboard, or adjusting display settings to focus on the most relevant analyses for your research.
The Feature Correlation charts help you understand how the expression levels of two genes or transcripts relate to each other. You can use these charts to identify genes or transcripts that are co-expressed, explore potential pathway interactions, and compare correlation patterns between different cohorts.
The chart displays a scatter plot where each point represents a sample, with the X and Y axes showing expression values for your two selected features. A best fit line shows the overall relationship trend, and you can swap which gene appears on which axis to view the data from different perspectives.
The correlation analysis includes statistical measures to help you determine if the relationship you're seeing is meaningful. The Pearson correlation coefficient shows both the strength and direction of the linear relationship (ranging from -1 to +1), while the p-value indicates whether the correlation is statistically significant.
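For reference, the Pearson correlation coefficient reported here is the standard statistic computed over the n plotted samples; the formula below is the general definition, not something specific to the Cohort Browser:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$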
You can toggle these statistics on or off as needed. The chart updates automatically when you change your feature selections or switch between viewing single cohorts versus comparing multiple cohorts. This quantitative analysis helps you assess whether observed correlations are both statistically sound and biologically relevant to your research.
The Expression Per Feature table provides gene metadata and expression statistics for all features in your dataset. Use the search bar to find specific genes by symbol or explore genes within genomic ranges.
The table displays one row per feature ID with the following columns:
Feature ID: The unique transcript or gene identifier, such as ENST for a transcript or ENSG for a gene
Gene Symbol: The official gene name or symbol associated with the feature ID, such as TP53
Location: The genomic coordinates in "chromosome:start-end" format
Strand: The DNA strand orientation (+ or -)
Expression (Mean): The average expression value for this feature across the current cohort
Expression (SD): The standard deviation of expression values
Expression (Median): The median expression value
When comparing cohorts, the table shows separate expression statistics for each cohort, allowing direct comparison of expression patterns.
Each feature includes links to external annotation resources:
Ensembl transcript pages: Detailed transcript information and annotations
Ensembl gene pages: Comprehensive gene summaries and functional data
These links provide quick access to additional context about genes and transcripts of interest.
For example, suppose you are developing an n-stage workflow, and at each stage you end up debugging an issue. Each stage takes about one hour to develop and run. If you do not reuse outputs during development, the process takes 1 + 2 + 3 + ... + n = n(n+1)/2 hours, because at every stage you fix something and must recompute the results of all previous stages. By reusing results for stages that have matured and are no longer modified, the total development time drops to the time it takes to develop and run the pipeline once (n hours): for a 10-stage workflow, that is 10 hours instead of 55. The improvement grows with the length of the workflow.
This feature also saves time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the earlier stages, the time required for R&D of the forked version equals the time for the last stages.
Smart Reuse also helps when testing R&D modifications to a workflow at scale in production environments. This is especially relevant for workflows used in clinical tests. For example, suppose you are testing a workflow like the forked workflow discussed earlier, and this clinical workflow must be tested on thousands of samples (call that number m) before it is vetted for production. If the whole workflow takes n hours per sample but only the last k stages changed, you save (n - k) × m compute hours in total; with n = 10, k = 2, and m = 1,000, that is 8,000 compute hours. The savings grow with m and are greatest when k is small.
To show Smart Reuse, the following example uses WDL syntax as supported by the DNAnexus SDK and dxCompiler.
The workflow above is a two-step workflow that duplicates a file and takes the first 10 lines from the duplicate.
Suppose the user has run the workflow above on some file and wants to tweak headfile to output the first 15 lines instead:
Here the only differences are that headfile and basic_reuse have been renamed to headfile2 and basic_reuse_tweaked, and that the head command takes the first 15 lines instead of 10. The compilation process automatically detects that dupfile is unchanged while the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a new executable ID for headfile2.
When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results of the dupfile task are reused: because a job on the DNAnexus Platform has already run that specific executable with the same input file, the system can reuse its output.
When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the --preserve-job-outputs option. This preserves the outputs of all jobs in the execution tree in the project and increases the potential for subsequent Smart Reuse.
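For example, a minimal sketch of launching a compiled workflow with this option (workflow-xxxx is a placeholder for your workflow's ID or path):

```bash
# Run the workflow, keeping the outputs of all jobs in the execution tree
# so later runs have more results available for Smart Reuse
dx run workflow-xxxx --preserve-job-outputs -y
```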
Smart Reuse applies under the following conditions:
It is available only for jobs run in projects billed to organizations with Smart Reuse enabled.
It applies only to jobs completed after the organization's policies have been updated to enable Smart Reuse.
Jobs can reuse results from previous jobs if all these criteria are met:
A previous job used the exact same executable and input IDs (including the function called within the applet). If an input is watermarked, both the watermark and its version must match. Other settings, such as the instance type, do not affect reuse.
If ignoreReuse: true is set, the job is not eligible for future reuse.
The job being reused must have all outputs available and accessible at the time of reuse. If any output is missing or inaccessible, reuse is impossible.
Each reused job includes an outputReusedFrom field, which points to the original job ID that produced the outputs. This field never refers to another reused job.
Results can be reused across projects only if the application's dxapp.json file includes "allProjects": "VIEW" in the "access" field.
You must have at least VIEW access to the original job's outputs, and those outputs must still exist on the Platform. Outputs that have been deleted cannot be reused.
Reused jobs are reported as having run for 0 seconds and are billed at $0.
Outputs are assumed to be deterministic.
If the reused job or workflow is in a different project or folder, the output data is not cloned to the new project or destination folder, since the job or workflow is not actually rerun.
If you are an administrator of a licensed org and want to enable Smart Reuse, run this command:
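For example, assuming your org ID is org-myorg (replace it with your own):

```bash
# Enable Smart Reuse for all projects billed to the org
dx api org-myorg update '{"policies":{"jobReuse":true}}'
```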
Conversely, set the value to false to disable it. If you are a licensed customer and cannot run the command above, contact DNAnexus Support. If you are interested in this feature and are not a licensed customer, reach out to DNAnexus Sales or your account executive for more information.
Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.
Logging In and Out via the User Interface
To log in via the user interface (UI), open the login page and enter your username and password.
To log out via the UI, click on your avatar at the far right end of the main Platform menu, then select Sign Out:
To log in via the command-line interface (CLI), make sure you've installed the dx command-line client. From the CLI, enter the command dx login.
Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.
See below for directions on logging in with a token.
See the command-line reference for detail on optional arguments that can be used with dx login.
When using the CLI, log out by entering the command dx logout.
See the command-line reference for detail on optional arguments that can be used with dx logout.
The system logs out users after fifteen minutes of inactivity. Exceptions apply to users logged in with an API token that specifies a different session duration, or users in an org with a custom autoLogoutAfter policy.
You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.
Exercise caution when sharing DNAnexus Platform tokens. Anyone with a token can access the Platform and impersonate you as a user. They gain your access level to any projects accessible by the token, enabling them to run jobs and potentially incur charges to your account.
To generate a token, click on your avatar at the top right corner of the main Platform menu, then select My Profile from the dropdown menu.
Next, click on the API Tokens tab. Then click the New Token button:
The New Token form opens in a modal window:
Consider the following points when filling out the form:
The token provides access to each project at the level at which you have access.
If the token provides access to a project within which you have PHI data access, it enables access to that PHI data.
Tokens without a specified expiration date expire in one month.
After completing the form, click Generate Token. The system generates a 32-character token and displays it with a confirmation message.
To log in with a token via the CLI, enter the command dx login --token, followed by a valid 32-character token.
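For example (the token shown here is a made-up placeholder):

```bash
dx login --token aBcDeFgHiJkLmNoPqRsTuVwXyZ012345
```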
Tokens are useful in multiple scenarios, such as:
Logging in via the CLI with single sign-on enabled - If your organization uses single sign-on (SSO), logging in via the CLI might require a token instead of a username and password.
Logging in via a script - Scripts can use tokens to authenticate with the Platform.
When incorporating a token into a script, take care to set the token's expiration date so that the script has Platform access only for as long as necessary. Also ensure that the script has access only to the project or projects it needs in order to function properly.
To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button:
In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token is revoked, and its name no longer appears in the list of tokens on the API Tokens screen.
Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.
Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.
Though logging in typically requires direct interaction with the Platform through the UI or CLI, non-interactive login is also possible. Scripts commonly automate both login and project selection.
Non-interactive login uses dx login with the --token argument. This automates project selection as well; for manual project selection, pass an additional argument to dx login.
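A minimal sketch of a scripted, non-interactive login follows. DX_TOKEN and project-xxxx are placeholders, and the --noprojects flag is used here on the assumption that you want to skip the interactive project-selection prompt:

```bash
#!/bin/bash
set -e
# Log in without any interactive prompts; the token comes from the environment
dx login --token "$DX_TOKEN" --noprojects
# Select the project the script should operate in
dx select project-xxxx
```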
DNAnexus recommends adding two-factor authentication to your account, as an extra means of securing all the data you can access on the Platform.
With two-factor authentication enabled, you must enter a two-factor authentication code to log into the Platform and access certain other services. This code is a time-based one-time password valid for a single session, generated by a third-party two-factor authenticator application, such as Google Authenticator.
Two-factor authentication protects your account by requiring both your credentials and an authentication code. This prevents unauthorized access even if your username and password are compromised.
To enable two-factor authentication, select Account Security from the dropdown menu accessible via your avatar, at the top right corner of the main menu.
In the Account Security screen, click the button labeled Enable 2FA. Then follow the instructions to select and set up a third-party authenticator application.
Enabling two-factor authentication redirects you to a page containing back-up codes. These codes serve as alternatives to two-factor authentication codes if you lose access to your authenticator application.
Store the back-up codes in a secure place. Without them and without access to your authenticator application, Platform login becomes impossible.
Contact DNAnexus Support if you lose both your back-up codes and access to your authenticator application.
DNAnexus recommends keeping two-factor authentication enabled after activation. If disabling is necessary, navigate to the Account Security screen of your profile, then click the Turn Off button in the Two-Factor Authentication section. The system requires your password and a two-factor authentication code to confirm this change.
View full source code on GitHub
This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipes) special files, run the command man fifo in your shell.
Named pipes require BOTH a stdin and stdout. The following examples run incomplete named pipes in background processes so the foreground script does not block.
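Before walking through the applet, here is a minimal local sketch (independent of the Platform, and assuming a small local file named input.txt) showing that a named pipe needs both a writer and a reader before either side can finish:

```bash
mkfifo demo_pipe
# Writer: provides the pipe's stdin; backgrounded so it doesn't block the script
cat input.txt > demo_pipe &
# Reader: provides the pipe's stdout; once it starts, the writer above can finish
wc -l < demo_pipe
rm demo_pipe
```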
To approach this use case, outline the desired steps for the applet:
Stream the BAM file from the platform to a worker.
While the BAM streams, count the number of reads present.
Write the result to a file.
Stream the result file to the platform.
First, establish a named pipe on the worker. Then stream to the named pipe's stdin by downloading the file as a stream from the platform using dx cat.
Having created the FIFO special file representing the streamed BAM, you can call the samtools command as you normally would. The samtools command reading the BAM provides the BAM FIFO file with a stdout. However, remember that you want to stream the output back to the Platform. You must create a named pipe representing the output file too.
The directory structure created here (~/out/counts_txt) is required in order to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> are uploaded to the corresponding <output name> specified in the dxapp.json.
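As a minimal sketch of that convention (assuming the applet's output is named counts_txt, as elsewhere in this tutorial):

```bash
# Create the directory matching the output name declared in dxapp.json
mkdir -p ~/out/counts_txt
# ...write or stream the result into ~/out/counts_txt/ ...
# Upload everything under ~/out/<output name> and assign it to the matching output
dx-upload-all-outputs
```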
A stream from the platform has been established, piped into a samtools command, and the results are output to another named pipe. However, the background process remains blocked without a stdout for the output file. Creating an upload stream to the platform resolves this.
Upload as a stream to the platform using the dx upload or dx-upload-all-outputs commands. Specify --buffer-size when needed.
Alternatively, dx upload - can upload directly from stdin, eliminating the need for the directory structure required for dx-upload-all-outputs. Warning: When uploading a file that exists on disk, dx upload is aware of the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not automatically known and dx upload uses default parameters. While these parameters are fine for most use cases, you may need to specify upload part size with the --buffer-size option.
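A hedged sketch of that streaming alternative follows; the --path destination name shown here is an assumption for illustration:

```bash
# Count reads from the streamed BAM and upload the result directly from stdin,
# without staging it on disk first
samtools view -c "${mappings_fifo_path}" | dx upload - --path counts.txt --brief &
upload_pid="$!"
# Add --buffer-size if the default upload part size does not suit your stream
```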
With background processes running, wait in the foreground for those processes to finish.
Without waiting, the app script running in the foreground would finish and terminate the job prematurely.
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the worker's root directory. In this case:
When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:
/usr/bin is part of the $PATH variable, so the samtools command can be referenced directly in the script as samtools view -c ....
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
main
The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Every 10 regions are sent, as input, to the count_func entry point using the dx-jobutil-new-job command.
Job outputs from the count_func entry point are referenced as Job Based Object References (JBORs) and used as inputs for the sum_reads entry point.
The job output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference in the dx-jobutil-add-output command.
count_func
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker. As a result, variables from other functions are not accessible here. This includes variables from the main() function.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the dx-jobutil-add-output command.
sum_reads
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as job output.
In the main function, this output is referenced as a JBOR when specifying the applet's counts_txt output.
Learn to build and use list views in the Cohort Browser.
List views can be used to visualize categorical data.
When creating a list view:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values can be organized in a hierarchy
Unlike some other chart types, list views can be used to visualize categorical data whose values are organized in a hierarchical fashion.
List views can be used to visualize categorical data from two different fields. The same restrictions apply to the fields whose values are displayed, as when creating a basic list view.
In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort that contain this value (the "count"). Also shown is a figure labeled "freq.": the percentage of all cohort records that contain the value.
Below is a sample list view showing the distribution of values in a field Episode type. In the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.
To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.
Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:
Critical care record origin is the primary field, Critical care record format is the secondary field.
Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:
Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.
In these additional rows, the "count" and "freq." figures refer to records having a particular combination of values in the two fields.
Below is an example of a list view used to visualize data in a categorical hierarchical field Home State/Province:
By default, only values in the category at the top level of the hierarchy are displayed.
Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:
In these additional rows, the "count" and "freq." figures refer to records having a particular combination of values across the hierarchy levels. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category and "British Columbia" for the second-level category.
The following example shows how "count" and "freq." are calculated, for list views based on fields containing categorical data organized into multiple levels of hierarchy:
For the bottommost row, "count" and "freq" refer to records having the following values:
"Yes" for the category at the top of the hierarchy
"9" for the category at the second level of the hierarchy
"8" for the category at the third level of the hierarchy
"7" for the category at the fourth level of the hierarchy
When a field has categories at multiple levels, making it difficult to find a particular value, use the search box at the bottom of the list view to home in on the row or rows containing that value:
In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:
In each column, count and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.
When , the following data types can be visualized in list views:
String Categorical
String Categorical Hierarchical
String Categorical Multi-Select
String Categorical Multi-Select Hierarchical
Learn to build an app that you can run on the Platform.
The steps below require the DNAnexus SDK. You must download and install it if you have not done so already.
Besides this Quickstart, the Developer Tutorials located in the sidebar also go over helpful tips for new users.
To launch a DNAnexus application or workflow on many files automatically, you can write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the dx command-line client provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see the corresponding UI instructions.
In this tutorial, you batch process a series of sample FASTQ files (forward and reverse reads). Use the dx generate_batch_inputs command to generate a batch file: a tab-delimited (TSV) file in which each row corresponds to a single run in the batch. Then process the batch using the dx run command with the --batch-tsv option.
When using the command-line client, you may refer to objects either through their ID or by name.
In the DNAnexus Platform, every data object has a unique ID starting with the class of the object, followed by a hyphen ('-') and 24 alphanumeric characters. Common object classes include "record", "file", and "project". An example ID would be record-9zGPKyvvbJ3Q3P8J7bx00005. A string matching this format is always interpreted as the ID of such an object and is not further resolved as a name.
The command-line client, however, also accepts names and paths as input in a particular syntax.
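For example, both of the following refer to data objects, the first by ID and the second by project, folder, and name (reusing identifiers mentioned on this page):

```bash
# By object ID: always unambiguous, never resolved as a name
dx describe record-9zGPKyvvbJ3Q3P8J7bx00005
# By project:folder/name path
dx describe "Genomes:human/hg19.fq.gz"
```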
Learn about the states through which a job or analysis may go, during its lifecycle.
The following example shows a workflow that has two stages, one of which is an applet, and the other of which is an app.
When the workflow runs, it generates an analysis with an attached workspace for storing intermediate output from its stages. Jobs are created to run the two stages. These jobs can spawn additional jobs to run other functions in the same executable or to run separate executables. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).
This applet tutorial performs a SAMtools count using parallel threads.
To take full advantage of the scalability that cloud computing offers, your scripts must implement the correct methodologies. This applet tutorial:
Install SAMtools
Download BAM file
task dupfile {
File infile
command { cat ${infile} ${infile} > outfile.txt }
output { File outfile = 'outfile.txt' }
}
task headfile {
File infile
command { head -10 ${infile} > outfile.txt }
output { File outfile = 'outfile.txt' }
}
workflow basic_reuse {
File infile
call dupfile { input: infile=infile }
call headfile { input: infile=dupfile.outfile }
}
task dupfile {
File infile
command { cat ${infile} ${infile} > outfile.txt }
output { File outfile = 'outfile.txt' }
}
task headfile2 {
File infile
command { head -15 ${infile} > outfile.txt }
output { File outfile = 'outfile.txt' }
}
workflow basic_reuse_tweaked {
File infile
call dupfile { input: infile=infile }
call headfile2 { input: infile=dupfile.outfile }
}
dx api org-myorg update '{"policies":{"jobReuse":true}}'
"3" for the category at the bottom level of the hierarchy
String Categorical Sparse Hierarchical
Integer Categorical
Integer Categorical Hierarchical
Integer Categorical Multi-Select
Integer Categorical Multi-Select Hierarchical
Categorical (<=20 distinct category values)
Categorical Multiple (<=20 distinct category values)
Categorical Hierarchical (<=20 distinct category values)
Categorical Hierarchical Multiple (<=20 distinct category values)
Named pipe status at each step of the applet:

| FIFO | stdin | stdout |
| --- | --- | --- |
| BAM file | YES | NO |

| FIFO | stdin | stdout |
| --- | --- | --- |
| BAM file | YES | YES |
| output file | YES | NO |

| FIFO | stdin | stdout |
| --- | --- | --- |
| BAM file | YES | YES |
| output file | YES | YES |
Every DNAnexus app starts with 2 files:
dxapp.json: a file containing the app's metadata: its inputs and outputs, how the app is run, and execution requirements
a script that is executed in the cloud when the app is run
Start by creating a file called dxapp.json with the following text:
The example specifies the app name (coolapp), the interpreter (python3) to run the script, and the path (code.py) to the script created next. ("version":"0") refers to the Ubuntu 24.04 application execution environment version that supports the python3 interpreter.
Next, create the script in a file called code.py with the following text:
That's all you need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:
Next, run the app and watch the output:
That's it! You have made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.
The app is available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.
Next, make the app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.
In the cloud, your app runs on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json like this:
Next, update code.py to run BLAST:
Rebuild the app and test it on some real data. You can use demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is the same region as the Demo Data project.
Rebuild the app with dx build -a, and run it like this:
Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.
Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add the app to a workflow and connect its inputs and outputs to other apps, specify both input and output specifications. Update the dxapp.json as follows:
Rebuild the app with dx build -a. Run it as before, and add the applet to a workflow by clicking "New Workflow" while viewing your project on the website, then click coolapp once to add it to the workflow. Inputs and outputs appear on the workflow stage and can be connected to other stages.
If you run dx run coolapp with no input arguments from the command line, the command prompts for the input values for seq1 and seq2.
Besides specifying input files, the I/O specification can also configure settings the app uses. For example, configure the E-value setting and other BLAST settings with this code and dxapp.json:
Rebuild the app again and add it in the workflow builder. You should see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.
One of the utilities provided in the SDK is dx-app-wizard. This tool prompts you with a series of questions with which it creates the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Run dx-app-wizard to try it out.
For additional information and examples of how to run jobs using the CLI, the Working with files using dx run guide may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.
The project My Research Project contains the following files in the project's root directory:
Batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, specify the following inputs:
reads_fastqgzs - FASTQ containing the left mates
reads2_fastqgzs - FASTQ containing the right mates
genomeindex_targz - BWA reference genome index
The BWA reference genome index from the public Reference Genome (requires platform login) project is used for all runs. However, for the forward and reverse reads, the read pairs used vary from run to run. To generate a batch file that pairs the input reads:
The (.*) patterns are regular expression groups. You can provide arbitrary regular expressions as input. The match from the first group is the pattern used to group pairs in the batch; these matches are called batch identifiers (batch IDs). To explain this behavior in more detail, consider the output of the dx generate_batch_inputs command above:
The dx generate_batch_inputs command creates the file dx_batch.0000.tsv, which looks like this:
Recall that the regular expression was RP(.*)_R1_(.*).fastq.gz. Although there are two grouped matches in this example, only the first is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz is 10B_S1, which corresponds to the first grouped match, while the second is ignored.
Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus Platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.
If an input for the app is an array, the input file IDs within the batch.tsv file need to be enclosed in square brackets. The following bash command adds brackets to the file IDs in columns 4 and 5. You may need to change the column variables in the command ($4 and $5) to match the correct columns in your file. The command's output file, "new.tsv", is ready for the dx run --batch-tsv command.
The example above is for a case where all files have been paired properly. dx generate_batch_inputs creates a TSV for all files that can be successfully matched for a particular batch ID. Two classes of errors may occur for batch IDs that are not successfully matched:
A particular input is missing. This could occur when reads_fastqgzs has a pattern but no corresponding match can be found for reads2_fastqgzs.
More than one file ID matches the exact same name.
For both of these cases, dx generate_batch_inputs returns a description of these errors to STDERR.
With the batch file prepared, you can execute the BWA-MEM batch process:
Here, genomeindex_targz is a parameter set at execution time that is common to all groups in the batch and --batch-tsv corresponds to the input file generated above.
To monitor a batch job, use the 'Monitor' tab like you normally would for jobs you launch.
To direct the output of each run into a separate folder, use the --batch-folders flag. For example:
This command outputs the results for each sample in folders named after batch IDs, such as /10B_S1/, /10T_S5/, /15B_S4/, and /15T_S8/. If the folders do not exist, they are created.
The output folders are created under a path defined with --destination, which by default is set to the current project and the "/" folder. For example, this command outputs the result files in /run_01/10B_S1/, /run_01/10T_S5/, and other sample-specific folders:
The dx generate_batch_inputs command works well for batch processing with file inputs, but it has limitations. If you need to vary other input types (like strings, numbers, or file arrays), or want to customize run properties like job names, a for loop provides more flexibility.
Here's an example of using a loop to launch multiple jobs with different inputs:
You can also use the dx run command to run a workflow by name and refer to its stages by stage ID. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:
The \ character is needed to escape the : in the workflow name.
Inputs to the workflow can be specified using dx run <workflow> --input <stage_id>.<input name>=<value>, where stage_id is a numeric ID starting at 0. More help can be found by running the commands dx run --help and dx run <workflow> --help.
To batch multiple inputs then, do the following:
For additional information and examples of how to run batch jobs, Chapter 6 of this reference guide may be useful. This material is not a part of the official DNAnexus documentation and is for reference only.
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends field.
This applet downloads all inputs at once using dxpy.download_all_inputs:
This tutorial processes data in parallel using the Python multiprocessing module with a straightforward pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
The applet script includes helper functions to manage the workload. One helper is run_cmd, which manages subprocess calls:
Before splitting the workload, determine what regions are present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs are parsed from the workers to determine whether the run failed or passed.
mkdir workspace
mappings_fifo_path="workspace/${mappings_bam_name}"
mkfifo "${mappings_fifo_path}" # FIFO file is created
dx cat "${mappings_bam}" > "${mappings_fifo_path}" &
input_pid="$!"
mkdir -p ./out/counts_txt/
counts_fifo_path="./out/counts_txt/${mappings_bam_prefix}_counts.txt"
mkfifo "${counts_fifo_path}" # FIFO file is created, readcount.txt
samtools view -c "${mappings_fifo_path}" > "${counts_fifo_path}" &
process_pid="$!"
mkdir -p ./out/counts_txt/
counts_fifo_path="./out/counts_txt/${mappings_bam_prefix}_counts.txt"
mkfifo "${counts_fifo_path}" # FIFO file is created, readcount.txt
samtools view -c "${mappings_fifo_path}" > "${counts_fifo_path}" &
process_pid="$!"
wait -n # "$input_pid"
wait -n # "$process_pid"
wait -n # "$upload_pid"
├── Applet dir
│   ├── src
│   ├── dxapp.json
│   ├── resources
│   │   ├── usr
│   │   │   ├── bin
│   │   │   │   ├── < samtools binary >/
├── usr
│   ├── bin
│   │   ├── < samtools binary >
├── home
│   ├── dnanexus
regions=$(samtools view -H "${mappings_sorted_bam_name}" \
| grep "\@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')
echo "Segmenting into regions"
count_jobs=()
counter=0
temparray=()
for r in $(echo $regions); do
if [[ "${counter}" -ge 10 ]]; then
echo "${temparray[@]}"
count_jobs+=( \
$(dx-jobutil-new-job \
-ibam_file="${mappings_sorted_bam}" \
-ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
temparray=()
counter=0
fi
temparray+=("-iregions=${r}") # Here we add to an array of -i<parameter>'s
counter=$((counter+1))
done
if [[ counter -gt 0 ]]; then # Previous loop misses last iteration if it's < 10
echo "${temparray[@]}"
count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
fi
echo "Merge count files, jobs:"
echo "${count_jobs[@]}"
readfiles=()
for count_job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${count_job}:counts_txt")
done
echo "file name: ${sorted_bamfile_name}"
echo "Set file, readfile variables:"
echo "${readfiles[@]}"
countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
echo "Specifying output file"
dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
count_func() {
set -e -x -o pipefail
echo "Value of bam_file: '${bam_file}'"
echo "Value of bambai_file: '${bambai_file}'"
echo "Regions being counted '${regions[@]}'"
dx-download-all-inputs
mkdir workspace
cd workspace || exit
mv "${bam_file_path}" .
mv "${bambai_file_path}" .
outputdir="./out/samtool/count"
mkdir -p "${outputdir}"
samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
}
sum_reads() {
set -e -x -o pipefail
echo "$filename"
echo "Value of read file array '${readfiles[@]}'"
dx-download-all-inputs
echo "Value of read file path array '${readfiles_path[@]}'"
echo "Summing values in files"
readsum=0
for read_f in "${readfiles_path[@]}"; do
temp=$(cat "$read_f")
readsum=$((readsum + temp))
done
echo "Total reads: ${readsum}" > "${filename}_counts.txt"
read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
}
echo "Specifying output file"
dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
{ "name": "coolapp",
"runSpec": {
"distribution": "Ubuntu",
"release": "24.04",
"version": "0",
"interpreter": "python3",
"file": "code.py"
}
}
import dxpy
@dxpy.entry_point('main')
def main(**kwargs):
print("Hello, DNAnexus!")
return {}
dx login
dx build -a
dx run coolapp --watch
{ "name": "coolapp",
"runSpec": {
"distribution": "Ubuntu",
"release": "24.04",
"version": "0",
"interpreter": "python3",
"file": "code.py",
"execDepends": [ {"name": "ncbi-blast+"} ]
}
}
import dxpy, subprocess
@dxpy.entry_point('main')
def main(seq1, seq2):
dxpy.download_dxfile(seq1, "seq1.fasta")
dxpy.download_dxfile(seq2, "seq2.fasta")
subprocess.call("blastn -query seq1.fasta -subject seq2.fasta > report.txt", shell=True)
report = dxpy.upload_local_file("report.txt")
return {"blast_result": report}
dx run coolapp \
-i seq1="Demo Data:/Developer Quickstart/NC_000868.fasta" \
-i seq2="Demo Data:/Developer Quickstart/NC_001422.fasta" \
--watch
{
"name": "coolapp",
"runSpec": {
"distribution": "Ubuntu",
"release": "24.04",
"version": "0",
"interpreter": "python3",
"file": "code.py",
"execDepends": [ {"name": "ncbi-blast+"} ]
},
"inputSpec": [
{"name": "seq1", "class": "file"},
{"name": "seq2", "class": "file"}
],
"outputSpec": [
{"name": "blast_result", "class": "file"}
]
}
import dxpy, subprocess
@dxpy.entry_point('main')
def main(seq1, seq2, evalue, blast_args):
dxpy.download_dxfile(seq1, "seq1.fasta")
dxpy.download_dxfile(seq2, "seq2.fasta")
command = "blastn -query seq1.fasta -subject seq2.fasta -evalue {e} {args} > report.txt".format(e=evalue, args=blast_args)
subprocess.call(command, shell=True)
report = dxpy.upload_local_file("report.txt")
return {"blast_result": report}
{
"name": "coolapp",
"runSpec": {
"distribution": "Ubuntu",
"release": "24.04",
"version": "0",
"interpreter": "python3",
"file": "code.py",
"execDepends": [ {"name": "ncbi-blast+"} ]
},
"inputSpec": [
{"name": "seq1", "class": "file"},
{"name": "seq2", "class": "file"},
{"name": "evalue", "class": "float", "default": 0.01},
{"name": "blast_args", "class": "string", "default": ""}
],
"outputSpec": [
{"name": "blast_result", "class": "file"}
]
}
# Extract list of reference regions from BAM header
regions=$(
samtools view -H "${mappings_sorted_bam_name}" | \
grep "@SQ" | \
sed 's/.*SN:\(\S*\)\s.*/\1/'
)
echo "Segmenting into regions"
count_jobs=()
counter=0
temparray=()
# Loop through each region
for r in $(echo "$regions"); do
if [[ "${counter}" -ge 10 ]]; then
echo "${temparray[@]}"
count_jobs+=($(
dx-jobutil-new-job \
-ibam_file="${mappings_sorted_bam}" \
-ibambai_file="${mappings_sorted_bai}" \
"${temparray[@]}" \
count_func
))
temparray=()
counter=0
fi
# Add region to temp array of -i<parameter>s
temparray+=("-iregions=${r}")
counter=$((counter + 1))
done
# Handle remaining regions (less than 10)
if [[ $counter -gt 0 ]]; then
echo "${temparray[@]}"
count_jobs+=($(
dx-jobutil-new-job \
-ibam_file="${mappings_sorted_bam}" \
-ibambai_file="${mappings_sorted_bai}" \
"${temparray[@]}" \
count_func
))
fi
echo "Merge count files, jobs:"
echo "${count_jobs[@]}"
readfiles=()
for count_job in "${count_jobs[@]}"; do
readfiles+=("-ireadfiles=${count_job}:counts_txt")
done
echo "file name: ${sorted_bamfile_name}"
echo "Set file, readfile variables:"
echo "${readfiles[@]}"
countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
echo "Specifying output file"
dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
count_func() {
set -e -x -o pipefail
echo "Value of bam_file: '${bam_file}'"
echo "Value of bambai_file: '${bambai_file}'"
echo "Regions being counted '${regions[@]}'"
dx-download-all-inputs
mkdir workspace
cd workspace || exit
mv "${bam_file_path}" .
mv "${bambai_file_path}" .
outputdir="./out/samtool/count"
mkdir -p "${outputdir}"
samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"
counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
}
sum_reads() {
set -e -x -o pipefail
echo "$filename"
echo "Value of read file array '${readfiles[@]}'"
dx-download-all-inputs
echo "Value of read file path array '${readfiles_path[@]}'"
echo "Summing values in files"
readsum=0
for read_f in "${readfiles_path[@]}"; do
temp=$(cat "$read_f")
readsum=$((readsum + temp))
done
echo "Total reads: ${readsum}" > "${filename}_counts.txt"
read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
}
echo "Specifying output file"
dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref
$ dx select "My Research Project"
Selected project My Research Project
$ dx ls /
RP10B_S1_R1_001.fastq.gz
RP10B_S1_R2_001.fastq.gz
RP10T_S5_R1_001.fastq.gz
RP10T_S5_R2_001.fastq.gz
RP15B_S4_R1_002.fastq.gz
RP15B_S4_R2_002.fastq.gz
RP15T_S8_R1_002.fastq.gz
RP15T_S8_R2_002.fastq.gz
$ dx generate_batch_inputs \
-i reads_fastqgzs='RP(.*)_R1_(.*).fastq.gz' \
-i reads2_fastqgzs='RP(.*)_R2_(.*).fastq.gz'
Found 4 valid batch IDs matching desired pattern.
Created batch file dx_batch.0000.tsv
CREATED 1 batch files each with at most 500 batch IDs.
$ cat dx_batch.0000.tsv
batch ID reads_fastqgzs reads2_fastqgzs pair1 ID pair2 ID
10B_S1 RP10B_S1_R1_001.fastq.gz RP10B_S1_R2_001.fastq.gz file-aaa file-bbb
10T_S5 RP10T_S5_R1_001.fastq.gz RP10T_S5_R2_001.fastq.gz file-ccc file-ddd
15B_S4 RP15B_S4_R1_002.fastq.gz RP15B_S4_R2_002.fastq.gz file-eee file-fff
15T_S8 RP15T_S8_R1_002.fastq.gz RP15T_S8_R2_002.fastq.gz file-ggg file-hhh
head -n 1 dx_batch.0000.tsv > temp.tsv && \
tail -n +2 dx_batch.0000.tsv | \
awk '{sub($4, "[&]"); print}' | \
awk '{sub($5, "[&]"); print}' >> temp.tsv && \
tr -d '\r' < temp.tsv > new.tsv && \
rm temp.tsv
dx run bwa_mem_fastq_read_mapper \
-igenomeindex_targz="Reference Genome Files":\
"/H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.bwa-index.tar.gz" \
--batch-tsv dx_batch.0000.tsv
dx run bwa_mem_fastq_read_mapper \
-igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
file-BFBy4G805pXZKqV1ZVGQ0FG8" \
--batch-tsv dx_batch.0000.tsv \
--batch-folders
dx run bwa_mem_fastq_read_mapper \
-igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:\
file-BFBy4G805pXZKqV1ZVGQ0FG8" \
--batch-tsv dx_batch.0000.tsv \
--batch-folders \
--destination=My_project:/run_01
for i in 1 2; do
dx run swiss-army-knife -icmd="wc *>${i}.out" -iin="fileinput_batch${i}a" -iin="file_input_batch${i}b" --name "sak_batch${i}"
done
dx login
dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am"
dx cd /path/to/inputs
for i in $(dx ls); do
dx run "Trio Exome Workflow - Jan 1st 2020 9\:00am" --input 0.reads="$i"
done
{
"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}
}
inputs = dxpy.download_all_inputs()
# download_all_inputs returns a dictionary that contains mapping from inputs to file locations.
# Additionally, helper key-value pairs are added to the dictionary, similar to the bash helper variables
inputs
# mappings_sorted_bam_path: [u'/home/dnanexus/in/mappings_sorted_bam/SRR504516.bam']
# mappings_sorted_bam_name: u'SRR504516.bam'
# mappings_sorted_bam_prefix: u'SRR504516'
# mappings_sorted_bai_path: u'/home/dnanexus/in/mappings_sorted_bai/SRR504516.bam.bai'
# mappings_sorted_bai_name: u'SRR504516.bam.bai'
# mappings_sorted_bai_prefix: u'SRR504516'
# Get cpu count from multiprocessing
print("Number of cpus: {0}".format(cpu_count()))
# Create a pool of workers, 1 for each core
worker_pool = Pool(processes=cpu_count())
# Map run_cmds to a collection
# Pool.map handles orchestrating the job
results = worker_pool.map(run_cmd, collection)
# Make sure to close and join workers when done
worker_pool.close()
worker_pool.join()
def run_cmd(cmd_arr):
"""Run shell command.
Helper function to simplify the pool.map() call in our parallelization.
Raises OSError if command specified (index 0 in cmd_arr) isn't valid
"""
proc = subprocess.Popen(
cmd_arr,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
exit_code = proc.returncode
proc_tuple = (stdout, stderr, exit_code)
return proc_tuple
def parse_sam_header_for_region(bamfile_path):
"""Helper function to match SN regions contained in SAM header
Returns:
regions (list[string]): list of regions in bam header
"""
header_cmd = ['samtools', 'view', '-H', bamfile_path]
print('parsing SAM headers:', " ".join(header_cmd))
headers_str = subprocess.check_output(header_cmd).decode("utf-8")
rgx = re.compile(r'SN:(\S+)\s')
regions = rgx.findall(headers_str)
return regions
# Write results to file
resultfn = inputs['mappings_sorted_bam_name'][0]
resultfn = (
resultfn[:-4] + '_count.txt'
if resultfn.endswith(".bam")
else resultfn + '_count.txt')
with open(resultfn, 'w') as f:
sum_reads = 0
for res, reg in zip(results, regions):
read_count = int(res[0])
sum_reads += read_count
f.write("Region {0}: {1}\n".format(reg, read_count))
f.write("Total reads: {0}".format(sum_reads))
count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)
return output
def verify_pool_status(proc_tuples):
"""
Helper to verify worker succeeded.
As failed commands are detected, the `stderr` from that command is written
to the job_error.json file. This file is printed to the Platform
job log on App failure.
"""
all_succeed = True
err_msgs = []
for proc in proc_tuples:
if proc[2] != 0:
all_succeed = False
err_msgs.append(proc[1])
if err_msgs:
raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" is interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:
The folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.
Exceptions to this are when commands take in arbitrary names. This applies to commands like dx describe which accepts app names, user IDs, and other identifiers. In this case, all possible interpretations are attempted. However, it is always assumed that it is not a project name unless it ends in ":".
To refer to the output of a particular job, you can use the syntax <job id>:<output name>.
If you have the job ID handy, you can use it directly.
Or if you know it's the last analysis you ran:
You can also automatically download a file once the job producing it is done:
If the output is an array, you can extract a single element by specifying its array index (starting from 0) as follows:
DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link, and have as a value either
a string representing a data object ID
another hash with two keys:
project a string representing a project or other data container ID
id a string representing a data object ID
For example:
When naming data objects, certain characters require special handling because they have specific meanings in the DNAnexus Platform:
The colon (:) identifies project names
The forward slash (/) separates folder names
Asterisks (*) and question marks (?) are used for wildcard matching
To use these characters in object names, you must escape them with backslashes. Spaces may also need escaping depending on your shell environment and whether you use quotes.
For the best experience, we recommend avoiding special characters in names when possible. If you need to work with objects that have special characters, using their object IDs directly is often simpler.
The table below shows how to escape special characters when accessing objects with these characters in their names:
Character         Escaped (outside quotes)    Escaped (inside quotes)
(single space)    \                           ' '
:                 \\\\:                       '\\:'
/                 \\\\/                       '\/'
*                 \\\\\\\\*                   '\\\\\\*'
?                 \\\\\\\\?                   '\\\\\\?'
The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.
For commands where the supplied argument names or renames something, the only escaping needed is whatever your shell requires, plus whatever is needed to set the name apart from a project or folder path.
It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you are prompted to select the desired data object.
Some commands (like mv here) allow you to enter * so that all matches are used. Other commands may automatically apply the command to all matches. This includes commands like ls and describe. Some commands require that exactly one object be chosen, such as the run command.
Every successful job goes through at least the following four states:
1. idle: the initial state of every new job, regardless of which API call was made to create it.
2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.
3. running: the job has been assigned to and is being run on a worker in the cloud.
4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state; no job transitions to a different state after reaching done.
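A job's current state can be checked programmatically; a minimal dxpy sketch, with a placeholder job ID:
import dxpy

# Describe the job and read its state, e.g. "idle", "runnable", "running", or "done".
job = dxpy.DXJob("job-xxxx")
print(job.describe()["state"])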
Jobs may also pass through the following transitional states as part of more complicated execution patterns:
waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:
it has an unresolved job-based object reference in its input
it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state
it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job). Linked hidden objects are implicitly included in this list
waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:
it has a descendant job that has not been moved to the done state
it has an unresolved job-based object reference in its output
Two terminal job states exist other than the done state: terminated and failed. A job can enter either of these states from any other state except another terminal state.
The terminated state occurs when a user requests termination of the job (or another job sharing the same origin job). For all terminated jobs, the failureReason in their describe hash contains "Terminated", and the failureMessage indicates the user responsible for termination. Only the user who launched the job or administrators of the job's project context can terminate the job.
Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job not in the same job tree has a job-based object reference or otherwise depends on a failed job, then it also fails. For more information about errors that jobs can encounter, see the Error Information page.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with a JobTimeoutExceeded error.
Jobs can automatically restart when they encounter specific types of failures. You configure which failure types trigger restarts in the executionPolicy of an app, applet, or workflow. Common restartable failure types include:
UnresponsiveWorker
ExecutionError
AppInternalError
JobTimeoutExceeded
SpotInstanceInterruption
When a job fails for a restartable reason, the system determines where to restart based on the restartableEntryPoints configuration:
master setting (default): The failure propagates to the nearest master job, which then restarts
all setting: The job restarts itself directly
The system restarts a job up to the maximum number of times specified in the executionPolicy. Once this limit is reached, the entire job tree fails.
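As a minimal sketch of what such a policy can look like when launching an executable with dxpy (the applet and project IDs are placeholders and the restart counts are arbitrary; the same executionPolicy stanza can also be set in an executable's dxapp.json):
import dxpy

# restartOn maps restartable failure types to the maximum number of automatic
# restarts allowed for each type.
dxpy.api.applet_run("applet-xxxx", {
    "project": "project-xxxx",
    "input": {},
    "executionPolicy": {
        "restartOn": {
            "UnresponsiveWorker": 2,
            "ExecutionError": 1
        }
    }
})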
During the restart process, jobs transition through specific states:
restartable: The job is ready to be restarted
restarted: The job attempt was restarted (a new attempt begins)
For jobs in root executions launched after July 12, 2023 00:13 UTC, the platform tracks restart attempts using a try integer attribute:
First attempt: try = 0
Second attempt (first restart): try = 1
Third attempt (second restart): try = 2
Multiple API methods support job try operations and include try information in their responses:
/job-xxxx/describe
/job-xxxx/addTags
/job-xxxx/removeTags
/job-xxxx/setProperties
/system/findExecutions
/system/findJobs
/system/findAnalyses
When you provide a job ID without specifying a try argument, these methods automatically refer to the most recent attempt for that job.
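For example, with dxpy (the job ID is a placeholder; try numbering is as described above):
import dxpy

# Describe a specific attempt of a restarted job; the first attempt is try 0.
first_attempt = dxpy.api.job_describe("job-xxxx", {"try": 0})

# Without a "try" argument, the most recent attempt is returned.
latest_attempt = dxpy.api.job_describe("job-xxxx")
print(latest_attempt.get("try"))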
For unsuccessful jobs, additional states exist between the running state and the terminal state of terminated or failed. Unsuccessful jobs starting in other non-terminal states transition directly to the appropriate terminal state.
terminating: the transitional state when the cloud worker begins terminating the job and tearing down the execution environment. The job moves to its terminal state after the worker reports successful termination or becomes unresponsive.
debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.
All analyses start in the state in_progress, and, like jobs, reach one of the terminal states done, failed, or terminated. The following diagram shows the state transitions for successful analyses.
If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:
partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages are also allowed to finish successfully if they can.
terminating: an analysis may enter this state either via an API call where a user has terminated the analysis, or there is some failure condition under which the analysis is terminating any remaining stages. This may happen if the executionPolicy for the analysis (or a stage of an analysis) had the onNonRestartableFailure value set to "failAllStages".
Compute and data storage costs for jobs that fail due to user error are charged to the project running those jobs. This includes errors such as InputError and OutputError. The same applies to terminated jobs. For DNAnexus Platform internal errors, these costs are not billed.
The costs for each stage in an analysis are determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed, and the second is not.


You can download input data from a project using dx download in a notebook cell:
The %%bash cell magic converts the whole cell into a bash cell, which allows you to run bash code in that cell without exiting the Python kernel. See examples of magic commands in the IPython documentation. The ! prefix achieves the same result for a single command:
Alternatively, the dx command can be executed from the terminal.
To download data with Python in the notebook, you can use the download_dxfile function:
Check the dxpy helper functions for details on how to download files and folders.
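For example, an entire folder can be fetched with dxpy.download_folder; a minimal sketch in which the project ID and folder paths are placeholders:
import dxpy

# Download everything under /inputs in the project into a local directory.
dxpy.download_folder("project-xxxx", "local_inputs", folder="/inputs")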
Any files from the execution environment can be uploaded to the project using dx upload:
To upload data using Python in the notebook, you can use the upload_local_file function:
Check the dxpy helper functions for details on how to upload files and folders.
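For example, a file can be uploaded into a specific folder of the project; a minimal sketch in which the folder path is a placeholder:
import dxpy

# Upload into /results, creating the folder if it does not already exist.
dxpy.upload_local_file("variants.vcf", folder="/results", parents=True)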
By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer).
You may upload and download data to the local execution environment in a similar way, that is, by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download.
It is useful to have the terminal provided by JupyterLab at hand; it uses the bash shell by default and lets you execute shell scripts or interact with the Platform via the dx toolkit. For example, the following command confirms what the current project context is:
Running pwd shows you that the working directory of the execution environment is /opt/notebooks. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.
To open a terminal window, go to File > New > Terminal or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File > New Launcher.
You can install packages with pip, conda, apt-get, and other package managers in the execution environment from the notebook:
By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.
You can access public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private SSH key that's registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories using git clone and push any changes back to GitHub using git push from the JupyterLab terminal.
Below is a screenshot of a JupyterLab session with a terminal displaying a script that:
sets up ssh key to access a private GitHub repository and clones it,
clones a public repository,
downloads a JSON file from the DNAnexus project,
modifies an open-source notebook to convert the JSON file to CSV format,
saves the modified notebook to the private GitHub repository,
and uploads the results of JSON to CSV conversion back to the DNAnexus project.
This animation shows the first part of the script in action:
A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files using the in input file array. The command runs in the directory where the JupyterLab server is started and notebooks are run, that is, /opt/notebooks/. Any output files generated in this directory are uploaded to the project and returned in the out output.
The cmd input makes it possible to use a papermill tool pre-installed in the JupyterLab environment that executes notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
where notebook.ipynb is the input notebook to "papermill", which needs to be passed in the "in" input, and output_notebook.ipynb is the name of the output notebook, which stores the result of the cells' execution. The output is uploaded to the project at the end of the app execution.
If the snapshot parameter is specified, execution of cmd takes place in the specified Docker container. The duration argument is ignored when running the app with cmd. The app can be run from the command line with the --extra-args flag to limit the runtime, for example, dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.
If cmd is not specified, the in parameter is ignored and the output of an app consists of an empty array.
If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA kernel-mode driver (nvidia.ko) installed outside of the DXJupyterLab environment does not support the newer CUDA version your application requires. You can install NVIDIA Forward Compatibility packages to use that newer CUDA version by following the steps below in a DXJupyterLab terminal.
After 15 to 30 minutes of inactivity in the JupyterLab browser tabs, the system logs you out automatically from the JupyterLab session and displays a "Server Connection Error" message. To re-enter the JupyterLab session, reload the JupyterLab webpage and log into the platform to be redirected to the JupyterLab session.
Prerequisites:
The JDBC URL:
The username in the format TOKEN__PROJECTID, where:
TOKEN is a DNAnexus user-generated token, separated by a double underscore (__) from the project ID.
PROJECTID is a DNAnexus project ID used as the project context (when creating databases).
The Thrift server and the project must be in the same region.
See the Authentication tokens page.
Navigate to https://platform.dnanexus.com and log in using your username and password.
In Projects > your project > Settings, find Project ID and click Copy to Clipboard.
Beeline is a JDBC client bundled with Apache Spark that can be used to run interactive queries on the command line.
You can download Apache Spark 3.5.2 for Hadoop 3.x from here.
You need to have Java installed in your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If you already have beeline installed and all the credentials, you can quickly connect with the following command:
In the following AWS example, you must escape some characters (; with \):
The command for connecting to Thrift on Azure has a different format:
The beeline client is located under $SPARK_HOME/bin/.
Connect to beeline using the JDBC URL:
Once successfully connected, you should see the message:
After connecting to the Thrift server using your credentials, you can view all databases you have access to within your current region.
You can query using the unique database name, which includes the lowercase database ID (for example, database_fjf3y28066y5jxj2b0gz4g85__metabric_data). If the database is in the same username and project used to connect to the Thrift server, you can use only the database name (for example, metabric_data). For databases outside the project, use the unique database name.
Databases stored in other projects can be found by specifying the project context in the LIKE option of SHOW DATABASES, using the format '<project-id>:<database pattern>' as shown below:
After connecting, you can run SQL queries.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.
For additional information, refer to the execDepends documentation.
Distributed Python-interpreter apps use Python decorators on functions to declare entry points; a short sketch follows the list below. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
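A minimal sketch of how these entry points are declared with dxpy is shown below; the main function's input names and the function bodies are illustrative placeholders, and the full implementation is in the source linked above:
import dxpy

@dxpy.entry_point("main")
def main(mappings_sorted_bam, mappings_sorted_bai=None, region_size=10):
    # Scatter: split the BAM header regions into bins and launch one subjob per bin.
    ...

@dxpy.entry_point("samtoolscount_bam")
def samtoolscount_bam(region_list, mappings_bam, index_file):
    # Process: run samtools view -c for each region in this bin.
    ...

@dxpy.entry_point("combine_files")
def combine_files(countDXlinks, resultfn):
    # Gather: merge the per-bin counts into a single output file.
    ...

# Dispatches to whichever entry point the job was created to run.
dxpy.run()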
Each entry point is executed on a new worker with its own system requirements. In this example, the files are split and merged on basic mem1_ssd1_x2 instances and the more intensive processing step is performed on a mem1_ssd1_x4 instance. Instance types can be set in the dxapp.json runSpec.systemRequirements:
main
The main function scatters work into region bins based on user input. If no *.bai index file is present, the applet generates one.
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
samtoolscount_bam
This entry point downloads the BAM and its index, then builds and runs a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed; it exaggerates a scatter-gather for tutorial purposes, and you can also pass types other than file, such as int.
combine_files
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, remember that the main entry point doesn't do any heavy lifting or processing. Notice in the runSpec JSON that the process starts with a lightweight instance, scales up for the processing entry point, then finally scales down for the gathering step.
You can treat dx as an invocation command for navigating the data objects on the DNAnexus Platform. By adding dx in front of commonly used bash commands, you can manage objects in the platform directly from the command-line. Common commands include dx ls, dx cd, dx mv, and dx cp, which let you list objects, change folders, move data objects, and copy objects.
By default, when you set your current project, you are placed in the root folder / of the project. You can list the objects and folders in your current folder with dx ls.
To see more details, you can run the command with the option dx ls -l.
As in bash, you can list the contents on a path.
You can also list the contents of a different project. To specify a path that points to a different project, start with the project-ID, followed by a :, then the path within the project where / is the root folder of the project.
Enclose the path in quotes (" ") so dx interprets the spaces as part of the folder name, not as a new command.
You can also list only the objects which match a pattern. Here, a * is used as a wildcard to represent all objects whose names contain .fasta. This returns only a subset of the objects returned in the original query. Again the path is enclosed in " " so dx correctly interprets the asterisk and the spaces in the path.
To find out your present folder location, use the dx pwd command. You can switch to a subfolder in a project using dx cd.
You can move and rename data objects and folders using the dx mv command.
To rename an object or a folder, "move" it to a new name in the same folder. Here, a file named ce10.fasta.gz is renamed to C.elegans10.fastq.gz.
If you want to move the renamed file into a folder, specify the path to the folder as the destination of the move command (dx mv).
You can copy data objects or folders to another project by running the command dx cp. The following example shows how to copy a human reference genome FASTA file (hs37d5.fa.gz) from a public project, "Reference Genome Files", to a project "Scratch Project" that the user has ADMINISTER permission to.
You can also copy folders between projects by running dx cp folder_name destination_path. Folders are automatically copied recursively.
The Platform prevents copying a data object within the same project, since each specific data object exists only once in a project. The system also prohibits copying data objects between projects located in different regions through dx cp.
You can change to another project where you want to work by running the command dx select. It brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".
To view and select from all public projects (projects available to all DNAnexus users), you can run the command dx select --public:
By default, dx select lists projects for which you have at least CONTRIBUTE permission. If you want to switch to a project for which you only have VIEW permission, you can run dx select --level VIEW to list all the projects in which you have at least VIEW permission.
If you know the project ID or name, you can also give it directly to switch to the project as dx select [project-ID | project-name]:
Using Stata via DXJupyterlab, working with project files, and creating datasets with Spark.
Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel, in Jupyter notebooks.
On the DNAnexus Platform, use the DXJupyterLab app to create and edit Jupyter notebooks.
To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details according to the instructions below in a plain text file with the extension .json, then upload this file to the project's root directory. You only need to do this once per project.
Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username and <organization> is the org of which you're a member:
Save the file according to the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json
Open the project in which you want to use Stata. Upload the Stata license details file to the project's root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.
When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.
Create a private project. Then create and save a Stata license details file in that project's root directory, per the instructions above.
Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the name of the private project, and file-xxxx is the license details file ID, in that private project:
Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.
Select the app DXJupyterLab with Python, R, Stata, ML.
Click the Run Selected button. If you haven't run this app before, you are prompted to install it. Next, you are taken to the Run Analysis screen.
Once the analysis starts, you see the notification "Running" appear under the name of the app.
Click the Monitor tab heading. This opens a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.
Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.
Open a new Stata notebook by clicking the Stata tile in the Notebooks section.
You can download DNAnexus data files to the DXJupyterLab container from a Stata notebook with:
Data files in the current project can also be accessed via the /mnt/project folder from a Stata notebook. To load a DTA file:
To load a CSV file:
To write a DTA file to the DXJupyterLab container:
To write a CSV file to the DXJupyterLab container:
To upload a data file from the DXJupyterLab container to the project, use the following command in a Stata notebook:
Alternatively, open a new Launcher tab, open Terminal, and run:
The /mnt/project directory is read-only, so trying to write to it results in an error.
Spark SQL can be used to query and filter DNAnexus databases, returning a PySpark DataFrame. The PySpark DataFrame can be converted to a pandas DataFrame with:
The pandas DataFrame can then be exported to CSV or Stata DTA files in the JupyterLab container with:
To upload a data file from the JupyterLab container to the DNAnexus project in the DXJupyterLab spark cluster app, use
Once saved to the project, data files can be used in a DXJupyterLab Stata session using the instructions above.
The DNAnexus Platform offers multiple different methods for viewing your files and data.
DNAnexus allows users to preview and open the following file types directly on the platform:
TXT
PNG
HTML
To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, the "Preview" and "Open in New Tab" options appear in the toolbar above.
Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.
"Preview" opens a fixed-sized box in your current tab to preview the file of interest. "Open in New Tab" enables viewing the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may produce different results.
File preview and viewer functionality are subject to project download restrictions. When a project has the previewViewerRestricted flag enabled, preview and viewer capabilities are disabled for all project members. This flag is automatically set to true when downloadRestricted is enabled on a project (for both new projects and when updating existing projects), though project admins can override this behavior by explicitly providing the previewViewerRestricted flag.
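As a sketch of how a project administrator might set these flags through the API (the project ID is a placeholder; the flag names are those described above):
import dxpy

# Restricting downloads also restricts preview/viewer access by default;
# passing previewViewerRestricted explicitly overrides that default.
dxpy.api.project_update("project-xxxx", {
    "downloadRestricted": True,
    "previewViewerRestricted": False
})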
For files not listed in the section above, the DNAnexus Platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.
A Viewer is an HTML file that can be given one or more DNAnexus URLs representing files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.
You can launch a viewer by clicking on the Visualize tab within a project.
This tab opens a window displaying all Viewers available to you within your project. Any Viewers you've created and saved within your current project appear in this list along with the DNAnexus-provided Viewers.
Clicking on a Viewer opens a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any other of your data.) From there, you can either create a Viewer Shortcut or launch the Viewer.
The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either one of these viewers, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi for each variant track you want to add. Also, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.
For more information, consult the BioDalliance and IGV.js documentation.
The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used). When launching this viewer, tick one or more BAM files (*.bam).
The Jupyter notebook viewer displays *.ipynb notebook files, showing notebook images, highlighted code blocks and rendered markdown blocks as shown below.
This viewer allows you to decompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).
If a viewer fails to load, try temporarily disabling browser extensions such as AdBlock and Privacy Badger. Also, viewers are not supported in Incognito browser windows.
Developers comfortable with HTML and JavaScript can create their own Viewers to visualize data on the platform.
Viewer Shortcuts are objects which, when opened, open a data selector to select inputs for launching a specified Viewer. The Viewer Shortcut includes a Viewer and an array of inputs that are selected by default.
The Viewer Shortcut appears in your project as an object of type "Viewer Shortcut." You can modify the name of the Viewer Shortcut and move it within your folders and projects like any other object in the DNAnexus Platform.
Genomes:human/hg19.fa.gz
dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads
dx describe $(dx find jobs -n 1 --brief):reads
dx download $(dx run some_exporter_app -iinput=my_input -y --brief --wait):file_output
dx describe job-B0kK3p64Zg2FG1J75vJ00004:reads.0
$ dx ls '{"$dnanexus_link": "file-B2VBGXyK8yjzxF5Y8j40001Y"}'
file-name
$ dx ls Project\ Mouse:
name: with/special*characters?
$ dx cd Project\ Mouse:
$ dx describe name\\\\:\ with\\\\/special\\\\\\\\*characters\\\\\\\\?
ID file-9zz0xKJkf6V4yzQjgx2Q006Y
Class file
Project project-9zb014Jkf6V33pgy75j0000G
Folder /
Name name: with/special*characters?
State closed
Hidden visible
Types -
Properties -
Tags -
Outgoing links -
Created Wed Jul 11 16:39:37 2012
Created by alice
Last modified Sat Jul 21 14:19:55 2012
Media type text/plain
Size (bytes) 4
$ dx describe "name\: with\/special\\\\\\*characters\\\\\\?"
...
$ dx new record -o "must\: escape\/everything\*once\?at creation"
ID record-B13BBVK4Zg29fvVv08q00005
...
Name must: escape/everything*once? at creation
...
$ dx rename record-B13BBVK4Zg29fvVv08q00005 "no:escaping/necessary*even?wildcards"
$ dx ls
sample : file-9zbpq72y8x6F0xPzKZB00003
sample : file-9zbjZf2y8x61GP1199j00085
$ dx mv sample mouse_sample
The given path "sample" resolves to the following data objects:
0) closed 2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
1) closed 2012-06-27 15:34:00 sample (file-9zbjZf2y8x61GP1199j00085)
Pick a numbered choice or "*" for all: 1
$ dx ls -l
closed 2012-06-27 15:34:00 mouse_sample (file-9zbjZf2y8x61GP1199j00085)
closed 2012-06-27 18:04:28 sample (file-9zbpq72y8x6F0xPzKZB00003)
%%bash
dx download input_data/reads.fastq
! dx download input_data/reads.fastq
import dxpy
dxpy.download_dxfile(dxid='file-xxxx',
                     filename='unique_name.txt')
%%bash
dx upload Readme.ipynb
import dxpy
dxpy.upload_local_file('variants.vcf')
$ dx pwd
MyProject:/
%%bash
pip install torch
pip install torchvision
conda install -c conda-forge opencv
my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
// Let's upgrade CUDA 11.4 to 12.5
# apt-get update
# apt-get -y install cuda-toolkit-12-5 cuda-compat-12-5
# echo /usr/local/cuda/compat > /etc/ld.so.conf.d/NVIDIA-compat.conf
# ldconfig
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 12.5 |
|-------------------------------+----------------------+----------------------+
// CUDA 12.5 is now usable from terminal and notebooks
AWS US (East): jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true
AWS London (UKB): jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/;ssl=true
Azure US (West): jdbc:hive2://query.westus.apollo.dnanexus.com:10001/;ssl=true;transportMode=http;httpPath=cliservice
AWS Frankfurt (General): jdbc:hive2://query.eu-central-1.apollo.dnanexus.com:10000/;ssl=true
tar -zxvf spark-3.5.2-bin-hadoop3.tgz
<beeline> -u <thrift path> -n <token>__<project-id>
$SPARK_HOME/bin/beeline -u jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/\;ssl=true -n yourToken__project-xxxx
$SPARK_HOME/bin/beeline -u jdbc:hive2://query.westus.apollo.dnanexus.com:10001/\;ssl=true\;transportMode=http\;httpPath=cliservice -n yourToken__project-xxxx
cd spark-3.5.2-bin-hadoop3/bin
./beeline
beeline> !connect jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true
Enter username: <TOKEN__PROJECTID>
Enter password: <empty - press RETURN>
Connected to: Spark SQL (version 3.5.2)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://query.us-east-1.apollo.dnanex> show databases;
+---------------------------------------------------------+--+
| databaseName |
+---------------------------------------------------------+--+
| database_fj7q18009xxzzzx0gjfk6vfz__genomics_180718_01 |
| database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01 |
| database_fj96qx00v10vj50j0gyfv00z__af_result2 |
| database_fjf3y28066y5jxj2b0gz4g85__metabric_data |
| database_fjj1jkj0v10p8pvx78vkkpz3__pchr1_test |
| database_fjpz6fj0v10fjy3fjy282ybz__af_result1 |
+---------------------------------------------------------+--+
0: jdbc:hive2://query.us-east-1.apollo.dnanex> use metabric_data;
0: jdbc:hive2://query.us-east-1.apollo.dnanex> SHOW DATABASES LIKE 'project-xxx:af*';
+---------------------------------------------------------+--+
| databaseName |
+---------------------------------------------------------+--+
| database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01 |
| database_fj96qx00v10vj50j0gyfv00z__af_result2 |
| database_fjpz6fj0v10fjy3fjy282ybz__af_result1 |
+---------------------------------------------------------+--+
0: jdbc:hive2://query.us-east-1.apollo.dnanex> select * from cna limit 10;
+--------------+-----------------+------------+--------+--+
| hugo_symbol | entrez_gene_id | sample_id | value |
+--------------+-----------------+------------+--------+--+
| MIR3675 | NULL | MB-6179 | -1 |
| MIR3675 | NULL | MB-6181 | 0 |
| MIR3675 | NULL | MB-6182 | 0 |
| MIR3675 | NULL | MB-6183 | 0 |
| MIR3675 | NULL | MB-6184 | 0 |
| MIR3675 | NULL | MB-6185 | -1 |
| MIR3675 | NULL | MB-6187 | 0 |
| MIR3675 | NULL | MB-6188 | 0 |
| MIR3675 | NULL | MB-6189 | 0 |
| MIR3675 | NULL | MB-6190 | 0 |
+--------------+-----------------+------------+--------+--+
"runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}








On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.
Add your Stata settings file as an input. This is the .json file you created, containing your Stata license details.
In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.
Click the Start Analysis button at the top right corner of the screen. This launches the DXJupyterLab app, and takes you to the project's Monitor tab, where you can monitor the app's status as it loads.


"runSpec": {
...
"systemRequirements": {
"main": {
"instanceType": "mem1_ssd1_x2"
},
"samtoolscount_bam": {
"instanceType": "mem1_ssd1_x4"
},
"combine_files": {
"instanceType": "mem1_ssd1_x2"
}
},
...
}
regions = parseSAM_header_for_region(filename)
split_regions = [regions[i:i + region_size]
for i in range(0, len(regions), region_size)]
if not index_file:
mappings_bam, index_file = create_index_file(filename, mappings_bam)
print('creating subjobs')
subjobs = [dxpy.new_dxjob(
fn_input={"region_list": split,
"mappings_bam": mappings_bam,
"index_file": index_file},
fn_name="samtoolscount_bam")
for split in split_regions]
fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
for subjob in subjobs]
print('combining outputs')
postprocess_job = dxpy.new_dxjob(
fn_input={"countDXlinks": fileDXLinks, "resultfn": filename},
fn_name="combine_files")
countDXLink = postprocess_job.get_output_ref("countDXLink")
output = {}
output["count_file"] = countDXLink
return output
def samtoolscount_bam(region_list, mappings_bam, index_file):
"""Processing function.
Arguments:
region_list (list[str]): Regions to count in BAM
mappings_bam (dict): dxlink to input BAM
index_file (dict): dxlink to input BAM
Returns:
Dictionary containing dxlinks to the uploaded read counts file
"""
#
# Download inputs
# -------------------------------------------------------------------
# dxpy.download_all_inputs will download all input files into
# the /home/dnanexus/in directory. A folder will be created for each
# input and the file(s) will be download to that directory.
#
# In this example our dictionary inputs has the following key, value pairs
# Note that the values are all list
# mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
# mappings_bam_name: [u'<bam filename>.bam']
# mappings_bam_prefix: [u'<bam filename>']
# index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
# index_file_name: [u'<bam filename>.bam.bai']
# index_file_prefix: [u'<bam filename>']
#
inputs = dxpy.download_all_inputs()
# SAMtools view command requires the bam and index file to be in the same directory
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
shutil.move(inputs['index_file_path'][0], os.getcwd())
input_bam = inputs['mappings_bam_name'][0]
#
# Per region perform SAMtools count.
# --------------------------------------------------------------
# Output count for regions and return DXLink as job output to
# allow other entry points to download job output.
#
with open('read_count_regions.txt', 'w') as f:
for region in region_list:
view_cmd = create_region_view_cmd(input_bam, region)
region_proc_result = run_cmd(view_cmd)
region_count = int(region_proc_result[0])
f.write("Region {0}: {1}\n".format(region, region_count))
readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())
return {"readcount_fileDX": readCountDXlink}
def combine_files(countDXlinks, resultfn):
"""The 'gather' subjob of the applet.
Arguments:
countDXlinks (list[dict]): list of DXlinks to process job output files.
resultfn (str): Filename to use for job output file.
Returns:
DXLink for the main function to return as the job output.
Note: Only the DXLinks are passed as parameters.
Subjobs work on a fresh instance so files must be downloaded to the machine
"""
if resultfn.endswith(".bam"):
resultfn = resultfn[:-4] + '.txt'
sum_reads = 0
with open(resultfn, 'w') as f:
for i, dxlink in enumerate(countDXlinks):
dxfile = dxpy.DXFile(dxlink)
filename = "countfile{0}".format(i)
dxpy.download_dxfile(dxfile, filename)
with open(filename, 'r') as fsub:
for line in fsub:
sum_reads += parse_line_for_readcount(line)
f.write(line)
f.write('Total Reads: {0}'.format(sum_reads))
countDXFile = dxpy.upload_local_file(resultfn)
countDXlink = dxpy.dxlink(countDXFile.get_id())
return {"countDXLink": countDXlink}
$ dx ls
Developer Quickstart/
Developer Tutorials/
Quickstart/
RNA-seq Workflow Example/
SRR100022/
_README.1st.txt
$ dx ls -l
Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
Folder : /
Developer Quickstart/
Developer Tutorials/
Quickstart/
RNA-seq Workflow Example/
SRR100022/
State Last modified Size Name (ID)
closed 2015-09-01 17:55:33 712 bytes _README.1st.txt (file-BgY4VzQ0bvyg22pfZQpXfzgK)
$ dx ls SRR100022/
SRR100022_1.filt.fastq.gz
SRR100022_2.filt.fastq.gz
$ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/"
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
ce10.cw2-index.tar.gz
ce10.fasta.fai
ce10.fasta.gz
ce10.tmap-index.tar.gz
$ dx ls "project-BQpp3Y804Y0xbyG4GJPQ01xv:/C. Elegans - Ce10/*.fasta*"
ce10.fasta.fai
ce10.fasta.gz
$ dx pwd
Demo Data:/
$ dx cd Quickstart/
$ dx ls
SRR100022_20_1.fq.gz
SRR100022_20_2.fq.gz
$ dx ls
some_folder/
an_applet
ce10.fasta.gz
Variation Calling Workflow
$ dx mv ce10.fasta.gz C.elegans10.fasta.gz
$ dx ls
some_folder/
an_applet
C.elegans10.fasta.gz
Variation Calling Workflow
$ dx mv C.elegans10.fasta.gz some_folder/
$ dx ls some_folder/
Hg19
C.elegans10.fasta.gz
...
$ dx select project-BQpp3Y804Y0xbyG4GJPQ01xv
Selected project project-BQpp3Y804Y0xbyG4GJPQ01xv
$ dx cd H.\ Sapiens\ -\ GRCh37\ -\ hs37d5\ (1000\ Genomes\ Phase\ II)/
$ dx ls
hs37d5.2bit
hs37d5.bt2-index.tar.gz
hs37d5.bwa-index.tar.gz
hs37d5.cw2-index.tar.gz
hs37d5.fa.fai
hs37d5.fa.gz
hs37d5.fa.sa
hs37d5.tmap-index.tar.gz
$ dx cp hs37d5.fa.gz project-9z94ZPZvbJ3qP0pyK1P0000p:/
$ dx select project-9z94ZPZvbJ3qP0pyK1P0000p
$ dx ls
some_folder/
an_applet
C.elegans10.fasta.gz
hs37d5.fa.gz
Variation Calling Workflow
$ dx select
Note: Use "dx select --level VIEW" or "dx select --public" to select from
projects for which you only have VIEW permission to.
Available projects (CONTRIBUTE or higher):
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Mouse (ADMINISTER)
Project # [1]: 2
Setting current project to: Mouse
$ dx ls -l
Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
Folder : /
$ dx select --public
Available public projects:
0) Example 1 (VIEW)
1) Apps Data (VIEW)
2) Parliament (VIEW)
3) CNVkit Tests (VIEW)
...
m) More options not shown...
Pick a numbered choice or "m" for more options: 1
$ dx select --level VIEW
Available projects (VIEW or higher):
0) SAM importer test (CONTRIBUTE)
1) Scratch Project (ADMINISTER)
2) Shared Applets (VIEW)
3) Mouse (ADMINISTER)
Pick a numbered choice or "m" for more options: 2
$ dx select project-9zVfbG2y8x65kxKY7x20005G
Selected project project-9zVfbG2y8x65kxKY7x20005G
$ dx ls -l
Project: Mouse (project-9zVfbG2y8x65kxKY7x20005G)
Folder : /
{
"license": {
"serialNumber": "<Serial number from Stata>",
"code": "<Code from Stata>",
"authorization": "<Authorization from Stata>",
"user": "<Registered user line 1>",
"organization": "<Registered user line 2>"
}
}
{
"licenseFile": {
"$dnanexus_link": {
"id": "file-xxxx",
"project": "project-yyyy"
}
}
}
!dx download project-xxxx:file-yyy
use /mnt/project/<path>/data_in.dta
import delimited /mnt/project/<path>/data_in.csv
save data_out
export delimited data_out.csv
!dx upload <file> --destination=<destination>
dx upload <file> --destination=<destination>
pandas_df = spark_df.toPandas()
pandas_df.to_stata("data_out.dta")
pandas_df.to_csv("data_out.csv")
%%bash
dx upload <file>
To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:
To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row is highlighted in blue. The tool's inputs and outputs are displayed in a pane to the right of the list:
To make sure you can find a tool later, you can pin it to the top of the list. Click the More actions (⋮) icon at the far right end of the row showing the tool's name and key details about it. Then click Add Pin.
To learn more about a tool, click on its name in the list. The tool's detail page opens, showing a wide range of info, including guidance in how to use it, version history, pricing, and more:
You can quickly launch the latest version of any given tool from the Tools Library page. Or you can navigate to the app's details page and click Run.
From within a project, navigate to the Manage pane, then click the Start Analysis button.
A dialog window opens, showing a list of tools. These include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:
Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in their folder location, and click Run.
Confirm details of the tool you are about to run. Selection of a project location is required for any tool to be run. You need at minimum Contributor access level to the project.
The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.
You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.
To configure instance type settings for a given tool or stage, click the Instance Type icon located on the top-right corner of the stage.
To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.
The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.
Once all required inputs have been configured, the page indicates that the run is ready to start. Click on Start Analysis to proceed to the final step.
As the last step before launching the tool, you can review and confirm specific runtime settings, including execution name, output location, priority, job rank, spending limit, and resource allocation. You can also review and modify instance type settings before starting the run.
Once you have confirmed final details, click Launch Analysis to start the run.
Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.
To enable batch run, start from any input that you wish to specify for batch run, and open its I/O Options menu on the right hand side. From the list of options available, select Enable Batch Run.
Input fields with batch run enabled are highlighted with a Batch label. Click any of the batch enabled input fields to enter the batch run configuration page.
Batch run support by input type:
Files and other data objects: Yes
Files and other data objects (array): Partially supported. Can accept entry of a single-value array.
String: Yes
Integer: Yes
Float: Yes
Boolean: Yes
The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.
Similar to configuration of inputs for non-batch runs, you need to fill all the required input fields to proceed to next steps. Optional inputs, or required inputs with a predefined default value, can be left empty.
Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.
Once you've finished setting up your tool, start your analysis by clicking the Start Analysis button. Follow these instructions to monitor the job as it runs.
Learn in depth about running apps and workflows, leveraging advanced techniques like Smart Reuse.
To create a project:
In the DNAnexus Platform, select Projects > All Projects.
In the Projects page, click New Project.
In the New Project dialog:
In Project Name, enter your project's name.
(Optional) In More Info, you can enter tags or custom-defined properties. These make it easier to find the project later, and organize it among other projects.
(Optional) In More Info, you can enter a Project Summary and Project Description to help other users understand the project's purpose.
In Billing > Billed To, choose a billing account to which project charges are billed.
In Billing, choose a region to use for storing project files and running analyses. Feel free to use the default region.
(Optional) In Usage Limits, available in Billed To orgs with compute and egress usage limits configured, you can set project-level limits for each.
In Access, you can specify restrictions for specific operations, defining who can copy, delete, and download data. Feel free to accept the defaults.
Click Create Project.
After the project is created, you can add data in the Manage page.
Once you add data to your project, this is where you can see and get info on this data, and launch analyses that use it.
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
In Access, choose the type of access the user or org has to the project. For more on this, see the detailed explanation of project access levels.
If you don't want the user to receive an email notification on being added to the project, set Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5, for each user you want to add to the project.
Click Done when you're finished adding members.
To add data to your project, click the Add button in the top right corner of the project's Manage screen. You see three options for adding data:
Upload Data - Use your web browser to upload data from your computer. For long upload times, you must stay logged into the Platform and keep your browser window open until the upload completes.
Add Data from Server - Specify a URL of an accessible server from which the file is uploaded.
Copy Data from Project - Copy data from another project on the Platform.
To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:
From the project's Manage screen, click the Add button, then select Copy Data from Project.
In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.
Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.
Click the box next to the Name header, to select both files.
Click Copy to copy the files to your project.
Next, install the apps you need, to analyze the data you added to the project in Step 3:
Select Tools Library from the Tools link in the main menu.
A list of available tools opens.
Find the BWA-MEM FASTQ Read Mapper in the list and click on its name.
A tool detail page opens, showing a full range of information about the tool and how to use it.
Click the Install button in the upper left part of the screen, under the name of the tool.
In the Install App modal, click the Agree and Install button.
After the tool has been installed, you are returned to the tool detail page.
Use your browser's "Back" button to return to the tools list page.
Repeat Steps 3-6 to install the FreeBayes Variant Caller.
Build a workflow using the two apps you installed, and configure it to use the data you added to your project in Step 3.
A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:
Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.
Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder opens.
In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.
Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.
Repeat Step 4 for the FreeBayes Variant Caller.
Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You return to the main Workflow Builder screen.
Set the required inputs for each step by doing the following:
To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file. Then click the Select button.
Since the SRR100022 exome was sequenced using paired-end sequencing, you need to provide the right-mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.
Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there is a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.
Next set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second step input labeled "Sorted mappings [array]."
Click on the second step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.
You're ready to launch your workflow, by doing the following:
Click the Start Analysis button at the upper right corner of the Workflow Builder.
In the modal window that opens, click the Run as Analysis button.
The BWA-MEM FASTQ Read Mapper starts executing immediately. Once it finishes, the FreeBayes Variant Caller starts, using the Read Mapper's output as an input.
Once you've launched your workflow, you are taken to your project's Monitor screen. Here, you see a list of both current and past analyses run within the project, along with key information about each run.
As your workflow runs, its status shows as "In Progress."
If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you see a red button labeled Terminate. Click the button to terminate the job. This process may take some time. While the job is being terminated, the job's status shows as "Terminating."
When your workflow completes, output files are placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.
You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder, in the "Demo Data" project. Because this means working with a much larger file, running the workflow using the exome data takes longer.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a video intro to the Platform, watch the series of short, task-oriented tutorials.
For a more in-depth video intro to the Platform, watch the DNAnexus Platform Essentials video.
Analyze somatic variants, including cancer-specific filtering, visualization, and variant landscape exploration in the Cohort Browser.
Explore and analyze datasets with somatic variant assays by opening them in the Cohort Browser and switching to the Somatic Variants tab. You can create cohorts based on somatic variants, visualize variant patterns, and examine detailed variant information.
You can analyze somatic variants across four main categories: Single Nucleotide Variants (SNVs) & Indels for small genomic changes, Copy Number Variants (CNVs) for alterations in gene copy numbers, Fusions for structural rearrangements involving gene coding sequences, and Structural Variants (SVs) for larger genomic rearrangements.
The somatic data model classifies all genomic variants into four main classes, defined by their size, structure, and representation in VCF files. Each variant type has specific criteria that must be met for classification.
You can filter your cohort to include only samples with specific somatic variants.
To apply a somatic filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Assays > Variant (Somatic), select a genomic filter.
In Edit Filter: Variant (Somatic), specify the criteria:
Structural variants larger than 10 megabases lack gene-level annotations, which limits how you can filter and visualize them. Use these alternative filtering approaches:
Filter by genomic coordinates: In the Genes / Effects filter, enter genomic coordinates in the format chr:start-end, for example, 17:7661779-7687538 for the TP53 gene region. Set the variant type scope to SV or CNV and leave consequence types blank. Find gene coordinates by typing the gene symbol in the search icon next to the Variants & Events table.
Filter by variant IDs: In the Variant IDs filter, enter up to 10 variant IDs in the format chr_pos_ref_alt, for example, 17_7674257_A_<DEL>. To get variant IDs, navigate to the gene region in the Variants & Events table, select variants of interest, and download the CSV file - the Location column contains the variant IDs.
For comprehensive structural variant analysis, combine multiple filtering approaches. Use gene symbol filters to capture annotated structural variants ≤ 10Mbp, then add coordinate-based filters to include larger structural variants in the same genomic regions.
Large structural variants are visible in the Variants & Events table with full details, but they do not appear in the Variant Frequency Matrix due to missing gene-level annotations.
The Variant Frequency Matrix provides a visual overview of how often somatic variants appear throughout your cohort. The matrix helps you identify variant patterns across tumor samples and discover which variants frequently occur together. You can also measure the mutation burden in different genes and compare how mutation profiles differ between two cohorts. This makes it easier to spot trends and relationships in your data that might not be apparent when examining individual variants.
The Variant Frequency Matrix is interactive: you can filter it, hover over cells and gene labels for details, and zoom in on specific genes or regions.
In the Variant Frequency Matrix, rows represent genes and columns represent samples; both are sorted by variant frequency.
Sorted gene list: Genes are ranked from most to least frequently affected by variants. A sample is considered "affected" by a gene if it is a tumor sample with at least one detected variant of high or moderate impact in that gene's canonical transcript. Matched normal samples are not included in this calculation.
Sorted sample list: Samples are ordered by the total number of genes that contain variants. This ranking is independent of how frequently each individual gene is affected.
By default, the Variant Frequency Matrix includes all genes and samples. To narrow your view, you can filter the matrix to specific classes of somatic variants, such as SNVs & Indels, Structural Variants, CNVs, or Fusions.
Using the legend in the bottom right, you can focus on specific variants, events, or consequences. This allows you to better explore particular areas of interest, such as high-impact mutations or specific consequences relevant to your research.
The Variant Frequency Matrix is highly interactive, allowing you to quickly access more details and apply filters.
When you hover over a cell, the matrix shows a unique identifier for the sample, along with a breakdown of the variants detected in that gene, organized by their consequence type. You can copy the sample ID to your clipboard to apply it to a cohort filter.
When you hover over a gene ID on the left axis, the matrix shows more information about that gene. This includes a unique identifier for the gene, along with a quick breakdown of available external annotations, with direct links to external annotation databases (when available).
To create a filter, hover over the gene and click + Add to Filter, or copy the gene ID to your clipboard for use in a custom filter.
The Variant Frequency Matrix uses color coding to represent the consequences of detected variants, providing a quick visual assessment of variant types. Only high and moderate impact consequences, as defined by the annotation source, are included in this visualization.
Samples with two or more detected variants are color-coded as "Multi Hit", indicating a complex variant profile.
The Lollipop Plot is a visualization tool that shows the somatic variants of a cohort on a single gene's canonical protein. With the Lollipop Plot, you can identify mutation hotspots within a specific gene, understand the functional impact of variants in the context of protein domains, compare mutation patterns across different patient cohorts, and explore recurrent mutations in cancer driver genes.
Use the Go to Gene field to quickly navigate to a gene of interest, such as TP53.
When you hover over a lollipop, you can see details about the amino acid change, such as the HGVS notation and the frequency of that change in the current cohort. The plot also shows the location of each mutation along the protein sequence, with color coding to indicate the consequence type.
Each lollipop on the plot represents amino acid changes at a specific location.
The horizontal position (X axis) indicates the location of the change, while the height (Y axis) represents the frequency of that change within the current cohort.
Lollipops are color-coded by consequence based on the canonical transcript.
If a lollipop represents multiple consequence types, it is coded as "Multi Hit".
The Variants & Events table displays details on the same genomic region as the Lollipop Plot. You can filter the table to focus on specific variant types, such as SNV & Indels, SV (Structural Variants), CNV, or Fusion.
Information displayed in the Variants & Events table includes:
Location of variant, with a link to its locus details
Reference allele of variant
Alternate allele of variant
Type of variant, such as SNV, Indel, or Structural Variant
You can export the selected variants in the Variants & Events table as a list of variant IDs or a CSV file.
To copy a comma-separated list of variant IDs to your clipboard, select the set of IDs you want to copy, and click Copy.
To export variants as a CSV file, select the set of IDs you need, and click Download (.csv file).
In Variants & Events > Location column, you can click on the specific location to open the locus details.
The locus details show specific SNV & Indel variants as well as up to 200 structural variants overlapping with the specific location. For canonical transcripts, a blue indicator appears next to the transcript ID, identifying the primary transcript annotations.
The locus details include enhanced annotations to external resources:
Gene-level links - Direct links to gene information in external databases
Variant-level links - Links to variant-specific annotation resources
These links allow you to quickly navigate to external annotation resources for further information about genes or variants of interest.
Learn to use the DXJupyterLab Spark Cluster app.
The DXJupyterLab Spark Cluster App is an app that runs a fully managed, standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.
Besides the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:
Explore the available databases and get an overview of the available datasets
Perform analyses and visualizations directly on data available in the database
Create databases
Submit data analysis jobs to the Spark cluster
Check the general DXJupyterLab documentation for an introduction to DNAnexus JupyterLab products.
The referenced pages contain information on how to start a JupyterLab session and create notebooks on the DNAnexus Platform, along with additional useful tips for using the environment.
Having created your notebook in the project, you can populate your first cells as shown below. It is good practice to instantiate your Spark context at the beginning of your analyses.
To view any databases to which you have access in your current region and project context, run a cell with the following code:
A sample output should be:
You can inspect one of the returned databases by running:
which should return an output similar to:
To find a database in your current region that may be in a different project than your current context, run the following code:
A sample output should be:
To inspect one of the databases listed in the output, use the unique database name. If you use only the database name, results are limited to the current project. For example:
Here's an example of how to create and populate your own database:
You can separate each line of code into different cells to view the outputs iteratively.
Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is available with DXJupyterLab Spark Cluster. It is included in the app and can be used when the app is run with the feature input set to HAIL (set as default).
Initialize the Hail context when you begin using Hail. It's important to pass the previously started Spark context sc as an argument:
We recommend continuing your exploration of Hail with the Hail documentation. For example:
To use VEP (Ensembl Variant Effect Predictor) with Hail, set the feature input to "HAIL-VEP" when launching Spark Cluster-Enabled DXJupyterLab via the CLI.
VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequences, and regulatory regions. The LOFTEE LoF plugin is included as well, and is used when the VEP configuration includes the LoF plugin, as shown in the configuration file below.
The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.
The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.
The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:
The default user in the container is root.
The option --network host is used when starting Docker to remove the network isolation between the host and the Docker container, which allows the container to bind to the host's network and access Spark's master port directly.
S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. The s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.
To access public S3 buckets, you do not need S3 credentials. The example below shows how to access the public 1000 Genomes bucket in a JupyterLab notebook:
When the above is run in a notebook, the following is displayed:
To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.
Input:
sqlfile: [Required] A SQL file which contains an ordered list of SQL queries.
substitutions: A JSON file which contains the variable substitutions.
user_config: User configuration JSON file, in case you want to set or override certain Spark configurations.
Other Options:
export: (boolean) default false. Exports output files with results for the queries in the sqlfile.
export_options: A JSON file which contains the export configurations.
collect_logs
Output:
output_files: Output files include report SQL file and query export files.
How sqlfile is processed: The SQL Runner extracts each command in sqlfile and runs them in sequential order.
Every SQL command needs to be separated with a semicolon ;.
Any command starting with -- is ignored (comments). Any comment within a command should be inside /*...*/. The following are examples of valid comments:
Variable substitution can be done by specifying the variables to replace in substitutions.
In the above example, each reference to srcdb in sqlfile within ${...} is substituted with sskrdemo1. For example, select * from ${srcdb}.${patient_table};. The script adds the set command before executing any of the SQL commands in sqlfile. As a result, select * from ${srcdb}.${patient_table}; translates to:
If enabled, the results of the SQL commands are exported to a CSV file. export_options defines an export configuration.
num_files: default 1. This defines the maximum number of output files to generate. The number generally depends on how many executors are running in the cluster and how many partitions of this file exist in the system. Each output file corresponds to a part file in parquet.
fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example, 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv
Values in spark-defaults.conf override or add to the default Spark configuration.
The export folder contains two generated files:
<JobId>-export.tar: Contains all the query results.
<JobId>-outfile.sql: SQL debug file.
After extracting the export tar file, the structure appears as follows:
In the above example, demo is the fileprefix used. The export produces one folder per query. Each folder contains a SQL file with the query executed and a .csv folder containing the result CSV.
Every SQL Runner execution generates a debug report, which is itself a SQL file.
It lists all the queries executed and the status of each execution (Success or Fail), along with the name of the output file for that command and the time taken. If there are any failures, it reports the failing query and stops executing subsequent commands.
During execution of the series of SQL commands, a command may fail (for example, due to a syntax error). In that case, the app quits and uploads a SQL debug file to the project:
The output identifies the line with the SQL error and its response.
The query in the .sql file can be fixed, and this report file can be used as input for a subsequent run, allowing you to resume from where execution stopped.
Connect with Spark for database sharing, big data analytics, and rich visualizations.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is straightforward: platform access levels map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
The DNAnexus Platform provides two ways to connect to the Spark service: through the Thrift server or, for more scalable throughput, using Spark applications.
DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC with a client such as Beeline to run Spark SQL interactively. Refer to the page for more details.
You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs leverage the features of normal jobs. You have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You also have access to logs from workers and can monitor the job in the Spark UI.
With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts quickly without the need to write complex SQL queries by hand.
A database is a type of data object on the Platform. A database object is stored in a project.
Databases can be shared with other users or organizations through project sharing. Access to a database can be revoked at any time by the project administrator revoking access to the project. If revoking access to the project is impossible, the database can be cloned to another project with a different set of collaborators.
Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context. Databases also adhere to the project's Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:
The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects are available for retrieving data.
If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects are available to add new data.
As with all DNAnexus data objects, database access is controlled by project access. These access levels translate into specific SQL abilities on the database, its tables, and its data, as well as into allowed actions on the database object in the project.
The following tables list supported actions on a database and on the database object, with the lowest necessary access level for an open and a closed database.
(*) If a project is protected, then ADMINISTER access is required.
The system handles database names in two ways:
User-provided name: Your database name is converted to lowercase and stored as the databaseName attribute.
System-generated unique name: A unique identifier is created by combining your lowercase database name with the database object ID (also converted to lowercase with hyphens changed to underscores) separated by two underscores. This is stored as the uniqueDatabaseName attribute.
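As a minimal illustration of this naming scheme, using a hypothetical name and object ID and following the pattern of the sample outputs shown elsewhere in this section (such as database_xxxx__brca_pheno):

# Sketch of how uniqueDatabaseName is derived; the name and ID are hypothetical.
db_name = "BRCA_Pheno"
database_id = "database-Fpq0Xq00bZ46kf6pBGy00J38"

database_name = db_name.lower()                       # "brca_pheno"
unique_database_name = (
    database_id.lower().replace("-", "_") + "__" + database_name
)
print(unique_database_name)   # database_fpq0xq00bz46kf6pbgy00j38__brca_pheno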
When a database is created using the following SQL statement and a user-generated database name (referenced below as db_name):
The platform database object, database-xxxx, is created with all lowercase characters. However, when creating a database using dxpy, the Python module supported by the DNAnexus SDK (dx-toolkit), the following case-sensitive command returns a database ID based on the user-generated database name, assigned here to the variable db_name:
With that in mind, it is suggested to either use lowercase characters in your db_name assignment or to apply a function such as .lower() to the user-generated database name:
You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.
The filter bar lets you specify different criteria on which to filter your data. You can combine multiple filters for greater control over your results.
To use this feature, first choose the field you want to filter your data by, then enter your filter criteria. For example, select the "Name" filter then search for "NA12878". The filter activates when you press enter or click outside of the filter bar.
The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.
Billed to: The user or org ID that the project is billed to, for example, "user-xxxx" or "org-xxxx". When viewing a partner organization's projects, the "Billed to" field is fixed to the org ID.
Project Name: Search by case insensitive string or regex, for example, "Example" or "exam$" both match "Example Project"
ID: Search by project ID, for example, "project-xxxx"
The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.
Object name: Search by case insensitive string or regex, for example, NA1 or bam$ both match NA12878.bam
ID: Search by object ID, for example, file-xxxx
When filtering on anything other than the current folder, results appear from many different places in the project. The folders appear in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.
The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead
State: for example, Failed, Waiting, Done, Running, In Progress, Terminated
Name: Search by case-insensitive string or regex, for example, "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.
When filtering on a name, any spaces expand to include intermediate words. For example, filtering by "b37 build" also returns "b37 dbSNP build".
Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time in minutes, hours, days, weeks, months, or an absolute time by specifying a certain date.
For relative time, specify an amount of time before the access time. For example, selecting "Day" and typing 5 sets the datetime to 5 days before the current time.
Alternatively, you can use the calendar to represent an exact (absolute) datetime.
Setting only the beginning datetime ("From") creates a range from that time to the access time. Setting only the end datetime ("To") creates a range from the earliest records to the "To" time.
A filter with a relative time period updates each time it is accessed. For example, a filter for items created within two hours shows different results at different times: items from 9am at 11am, and items from 2pm at 4pm. For consistent results, use absolute datetimes from the calendar widget.
To search by tag, enter or select the tags you want to find. For example, to find all objects tagged with "human", type "human" in the filter box and select the checkbox next to the tag.
Unlike other searches where you can enter partial text, tag searches require the complete tag name. However, capitalization doesn't matter. For example, searching for "HUMAN", "human", or "Human" all find objects with the "Human" tag. Partial matches like "Hum" do not return results.
Properties have two parts: a key and a value. The system prompts for both when creating a new property. Like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.
To search for all items that have a property, regardless of the value of that property, select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that have a property with a specific value, enter that property's key and value.
The keys and values must be entered in their entirety. For example, entering the key sample and the value NA does not match objects with {"sample_id": "NA12878"}.
Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you have a choice whether to search for objects containing any of the selected tags or containing all the selected tags.
Given the following set of objects:
Object 1 (tags: "human", "normal")
Object 2 (tags: "human", "tumor")
Object 3 (tags: "mouse", "tumor")
Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.
Click the "Clear All Filters" button on the filter bar to reset your filters.
If you wish to save your filters, active filters are saved in the URL of the filtered page. You can bookmark this URL in your browser to return to your filtered view in the future.
Bookmarking a filtered URL saves the search parameters, not the search results. The filters are applied to the data present when accessing the bookmarked link. For example, filters for items created in the last thirty days show items from the thirty days before viewing the results, not the thirty days before creating the bookmark. Results update based on when you access the saved search.
The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.
The Relational Database Service is accessible through the application program interface (API) in AWS regions only. See DBClusters API page for details.
When describing a DNAnexus DBCluster, the status field can be any of the following:
DB Clusters are not accessible from outside of the DNAnexus Platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page for one possible way to access a DB Cluster from within a job. Other executions can access a DB Cluster as well.
The parameters needed for connecting to the database are:
host: Use the endpoint returned when describing the dbcluster
port: 3306 for MySQL engines or 5432 for PostgreSQL engines
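The sketch below shows one possible connection from within a job, assuming a PostgreSQL engine and the psycopg2 client (which is an assumption, not a Platform requirement); the endpoint and admin password are placeholders taken from the dbcluster's describe output and creation call.

# Hedged sketch: connect to a PostgreSQL-engine DB Cluster from inside a job.
# Assumes psycopg2 is installed in the job's execution environment.
import psycopg2

conn = psycopg2.connect(
    host="<endpoint from the dbcluster describe output>",       # placeholder
    port=5432,                                                   # 3306 for MySQL engines
    user="root",
    password="<adminPassword set when calling dbcluster/new>",   # placeholder
    dbname="postgres",
    sslmode="require",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()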
The table below provides all the valid configurations of dxInstanceClass, database engine, and versions.
* db_std1 instances may incur CPU Burst charges, similar to the AWS T3 DB instances described in AWS documentation. Regular hourly charges for this instance type are based on 1 core; CPU Burst charges are based on 2 cores.
If a project contains a dbcluster, its ownership cannot be changed. An error is returned when attempting to change the billTo of such a project.
Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.
Archiving in DNAnexus is file-based. You can archive individual files, folders with files, or entire projects' files and save on storage costs. You can also unarchive one or more files, folders, or projects when you need to make the data available for further analyses.
The DNAnexus Archive Service is available via the API in Amazon AWS and Microsoft Azure regions.
Modified date: Search by projects modified before, after, or between different dates
Creator: The user ID who created the project, for example, "user-xxxx"
Shared with member: A user ID with whom the project is shared, for example, "user-xxxx" or "org-xxxx"
Level: The minimum permission level to the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only". For example, "Contributor+" filters projects with access CONTRIBUTOR or ADMINISTER
Tags: Search by tag. The filter bar automatically populates with tags available on projects
Properties: Search by properties. The filter bar automatically provides properties available on projects
Modified date: Search by objects modified before, after, or between different dates
Class: such as "File", "Applet", "Folder"
Types: such as "File" or custom Type
Created date: Search by objects created before, after, or between different dates
Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder
Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder
Created date: Search by executions created before, after, or between different dates
Launched by: Search by the user ID of the user who launched the job. The filter bar shows users who have run jobs visible in the project
Tags: Search by tag. The filter bar automatically populates with tags available on the visible executions
Properties: Search by properties. The filter bar automatically provides properties available on executions visible in the project
Executable: Search by the ID of executable run by the executions in question. Examples include app-1234 or applet-5678
Class: for example, Analysis or Job
Origin Jobs: ID of origin job
Parent Jobs: ID of parent job
Parent Analysis: ID of parent analysis
Root Executions: ID of root execution
String (array)
No
Integer (array)
No
Float (array)
No
Boolean (array)
No
Hash
No
For optimal performance and annotation scalability, the Cohort Browser processes SVs and CNVs between 50bp and 10Mbp differently than larger variants:
SVs and CNVs ≤ 10Mbp: Fully annotated with gene symbols and consequences, appear in all visualizations including the Variant Frequency Matrix
SVs and CNVs > 10Mbp: Ingested and visible in the Variants & Events table but lack gene-level annotations. These larger variants do not appear in the Variant Frequency Matrix and cannot be filtered using gene symbols or consequence terms. Use genomic coordinates or variant IDs to filter for these variants (see Working with Large Structural Variants below).
Fusions are not affected by this size limit as they are considered two single-position events.
Choose whether to include patients with at least one detected variant matching the specified criteria (WITH Variant), or include only patients who have no detected variants matching the criteria (WITHOUT Variant). By default, the filter includes those with matching variants. This choice applies to all specified filtering criteria.
On the Genes / Effects tab, select variants of specific types and variant consequences within specified genes and genomic ranges. You can specify up to 5 genes or genomic ranges in a comma-separated list.
On the HGVS tab, specify a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol. Example: KRAS p.Arg1459Ter.
On the Variant IDs tab, specify variant IDs using the standard format chr_pos_ref_alt (for example, 17_7674257_A_G). You can enter up to 10 variant IDs in a comma-separated list.
Enter multiple genes, ranges, or variants, by separating them with commas or placing each on a new line.
Click Apply Filter.
You can identify mutation hotspots for a given gene and see protein changes in HGVS short form notation, such as T322A, and HGVS.p notation, such as p.Thr322Ala.
Variant consequences, with entries color-coded by level of severity
HGVS cDNA
HGVS Protein
COSMIC ID
RSID, with a link to the dbSNP entry for the variant
SNV & Indel: Single base substitutions and small insertions/deletions with precise allele sequences.
Criteria (all must match): variant size ≤ 50bp; ALT field contains a precise allele (not a symbolic allele such as <DEL>, <INS>, <DUP>, <CNV>).
Examples: A→G, ATCG→A, A→ATCG

Copy Number Variant (CNV): Changes in gene copy number.
Criteria (all must match): ALT field contains a symbolic allele (<CNV>, <DEL>, <DUP>); an explicit copy number value is present in the FORMAT field key CN.
Examples: <CNV>, <DEL>, <DUP>

Fusion: Structural rearrangements involving gene coding sequences.
Criteria (all must match): ALT field contains breakend notation with square brackets ([ or ]); at least one breakpoint overlaps with an annotated gene or transcript.
Examples: [chr2:123456[, ]chr5:789012]

Structural Variant (SV): Large or complex structural changes.
Criteria (either must match): variant length > 50bp; ALT field contains a symbolic allele (<DEL>, <INV>, <CNV>, <BND>).
Examples: <DEL>, <INV>, large insertions
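To make these rules concrete, here is a minimal, purely illustrative sketch of how a single VCF record could be classified according to the criteria above. The helper function and its inputs are hypothetical and not part of any Platform API.

# Illustrative sketch of the classification rules above.
# `ref`, `alt`, and `format_keys` come from a parsed VCF record; this helper is
# hypothetical and not part of any DNAnexus API.
def classify_variant(ref, alt, format_keys=(), overlaps_gene=False):
    symbolic = alt.startswith("<") and alt.endswith(">")
    breakend = "[" in alt or "]" in alt
    size = abs(len(alt) - len(ref)) if not (symbolic or breakend) else None

    if breakend and overlaps_gene:
        return "Fusion"
    if symbolic and "CN" in format_keys and alt in ("<CNV>", "<DEL>", "<DUP>"):
        return "CNV"
    if symbolic or (size is not None and size > 50):
        return "SV"
    return "SNV & Indel"

print(classify_variant("A", "G"))                                    # SNV & Indel
print(classify_variant("N", "<DUP>", format_keys=("GT", "CN")))      # CNV
print(classify_variant("N", "]chr5:789012]", overlaps_gene=True))    # Fusion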
The possible status values are:

creating: The database cluster is being created, but not yet available for reading/writing.
available: The database cluster is created and all replicas are available for reading/writing.
stopping: The database cluster is stopping.
stopped: The database cluster is stopped.
starting: The database cluster is restarting from a stopped state, transitioning to available when ready.
terminating: The database cluster is being terminated.
terminated: The database cluster has been terminated and all data deleted.

Additional connection parameters:

user: root
password: Use the adminPassword specified when creating the database (dbcluster/new)
For MySQL: ssl-mode 'required'
For PostgreSQL: sslmode 'require'. Note: for connecting and verifying certs, see Using SSL/TLS to encrypt a connection to a DB instance or cluster.

Valid dxInstanceClass configurations (supported engine versions, memory, and cores):

db_std1_x2 (*): aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 4 GB memory; 2 cores
db_mem1_x2: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 16 GB memory; 2 cores
db_mem1_x4: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 32 GB memory; 4 cores
db_mem1_x8: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 64 GB memory; 8 cores
db_mem1_x16: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 128 GB memory; 16 cores
db_mem1_x32: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 244 GB memory; 32 cores
db_mem1_x48: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 384 GB memory; 48 cores
db_mem1_x64: aurora-mysql 8.0.mysql_aurora.3.04.1, aurora-postgresql 12.9, 13.9, 14.6; 488 GB memory; 64 cores
db_mem1_x96: aurora-postgresql 12.9, 13.9, 14.6; 768 GB memory; 96 cores
collect_logs: (boolean) default false.
executor_memory: (string) Amount of memory to use per executor process, in MiB unless otherwise specified. Common values include 2g or 8g. This is passed as --executor-memory to Spark submit.
executor_cores: (integer) Number of cores to use per executor process. This is passed as --executor-cores to Spark submit.
driver_memory: (string) Amount of memory to use for the driver process. Common values include 2g or 8g. This is passed as --driver-memory to Spark submit.
log_level: (string) default INFO. Logging level for both driver and executors. [ALL, TRACE, DEBUG, INFO]
header: Default is true. If true, a header is added to each exported file.
SQL operations on a database, with the lowest access level needed for an open database / a closed database:

ANALYZE TABLE COMPUTE STATISTICS: UPLOAD / N/A
ALTER DATABASE SET DBPROPERTIES: CONTRIBUTE / N/A
ALTER TABLE RENAME: CONTRIBUTE / N/A
ALTER TABLE DROP PARTITION: CONTRIBUTE (*) / N/A
ALTER TABLE RENAME PARTITION: CONTRIBUTE
CACHE TABLE, CLEAR CACHE: N/A / N/A
CREATE DATABASE: UPLOAD / UPLOAD
CREATE FUNCTION: N/A / N/A
CREATE TABLE: UPLOAD / N/A
CREATE VIEW: UPLOAD / UPLOAD
DESCRIBE DATABASE, TABLE, FUNCTION: VIEW / VIEW
DROP DATABASE: CONTRIBUTE (*) / ADMINISTER
DROP FUNCTION: N/A / N/A
DROP TABLE: CONTRIBUTE (*) / N/A
EXPLAIN: VIEW / VIEW
INSERT: UPLOAD / N/A
REFRESH TABLE: VIEW / VIEW
RESET: VIEW / VIEW
SELECT: VIEW / VIEW
SET: VIEW / VIEW
SHOW COLUMNS: VIEW / VIEW
SHOW DATABASES: VIEW / VIEW
SHOW FUNCTIONS: VIEW / VIEW
SHOW PARTITIONS: VIEW / VIEW
SHOW TABLES: VIEW / VIEW
TRUNCATE TABLE: UPLOAD / N/A
UNCACHE TABLE: N/A / N/A

Actions on the database object, with the lowest access level needed for an open database / a closed database:

Add Tags: UPLOAD / CONTRIBUTE
Add Types: UPLOAD / N/A
Close: UPLOAD / N/A
Remove: CONTRIBUTE (*) / ADMINISTER
Remove Tags: UPLOAD / CONTRIBUTE
Remove Types: UPLOAD / N/A
Rename: UPLOAD / CONTRIBUTE
Set Details: UPLOAD / N/A
Set Properties: UPLOAD / CONTRIBUTE
Set Visibility: UPLOAD / N/A
Get Details
VIEW

N/A
VIEW
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

spark.sql("show databases").show(truncate=False)

+------------------------------------------------------------+
|namespace |
+------------------------------------------------------------+
|database_xxxx__brca_pheno |
|database_yyyy__gwas_vitamind_chr1 |
|database_zzzz__meta_data |
|database_tttt__genomics_180820 |
+------------------------------------------------------------+

db = "database_xxxx__brca_pheno"
spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)

+------------------------------------+-----------+-----------+
|namespace |tableName |isTemporary|
+------------------------------------+-----------+-----------+
|database_xxxx__brca_pheno |cna |false |
|database_xxxx__brca_pheno |methylation|false |
|database_xxxx__brca_pheno |mrna |false |
|database_xxxx__brca_pheno |mutations |false |
|database_xxxx__brca_pheno |patient |false |
|database_xxxx__brca_pheno |sample |false |
+------------------------------------+-----------+-----------+

show databases like "<project_id_pattern>:<database_name_pattern>";
show databases like "project-*:<database_name>";

+------------------------------------------------------------+
|namespace |
+------------------------------------------------------------+
|database_xxxx__brca_pheno |
|database_yyyy__gwas_vitamind_chr1 |
|database_zzzz__meta_data |
|database_tttt__genomics_180820 |
+------------------------------------------------------------+

db = "database_xxxx__brca_pheno"
spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)

# Create a database
my_database = "my_database"
spark.sql("create database " + my_database + " location 'dnax://'")
spark.sql("create table " + my_database + ".foo (k string, v string) using parquet")
spark.sql("insert into table " + my_database + ".foo values ('1', '2')")
spark.sql("select * from " + my_database + ".foo")

import hail as hl
hl.init(sc=sc)

# Download example data from 1k genomes project and inspect the matrix table
hl.utils.get_1kg('data/')
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')
mt.rows().select().show(5)

# Annotate hail matrix table with VEP and LoF using configuration specified in the
# vep-GRCh38.json file in the project you're working in.
#
# Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
# as well as VEP and LoF resources that are pre-installed on every Spark node when
# HAIL-VEP feature is selected.
annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
% cat /mnt/project/vep-GRCh38.json
{"command": [
"docker", "run", "-i", "-v", "/cluster/vep:/root/.vep", "dnanexus/dxjupyterlab-vep",
"./vep", "--format", "vcf", "__OUTPUT_FORMAT_FLAG__", "--everything", "--allele_number",
"--no_stats", "--cache", "--offline", "--minimal", "--assembly", "GRCh38", "-o", "STDOUT",
"--check_existing", "--dir_cache", "/root/.vep/",
"--fasta", "/root/.vep/homo_sapiens/109_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
"--plugin", "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"],
"env": {
"PERL5LIB": "/root/.vep/Plugins"
},
"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele: String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele: String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele: String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed: Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms: Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact: String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype: String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences: Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String], distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact: String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String, sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
}

source /home/dnanexus/environment
source /cluster/dx-cluster.environment

#read csv from public bucket
df = spark.read.options(delimiter='\t', header='True', inferSchema='True').csv("s3://1000genomes/20131219.populations.tsv")
df.select(df.columns[:4]).show(10, False)

#access private data in S3 by first unsetting the default credentials provider
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', '')
# replace "redacted" with your keys
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'redacted')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'redacted')
df=spark.read.csv("s3a://your_private_bucket/your_path_to_csv")
df.select(df.columns[:5]).show(10, False)

dx run spark-sql-runner \
-i sqlfile=file-FQ4by2Q0Yy3pGp21F7vp8XGK \
-i paramfile=file-FK7Qpj00GQ8Q7ybZ0pqYJj6G \
-i export=true

SELECT * FROM ${srcdb}.${patient_table};
DROP DATABASE IF EXISTS ${dstdb} CASCADE;
CREATE DATABASE IF NOT EXISTS ${dstdb} LOCATION 'dnax://';
CREATE VIEW ${dstdb}.patient_view AS SELECT * FROM ${srcdb}.patient;
SELECT * FROM ${dstdb}.patient_view;

SHOW DATABASES;
SELECT * FROM dbname.tablename1;
SELECT * FROM
dbname.tablename2;
DESCRIBE DATABASE EXTENDED dbname;

-- SHOW DATABASES;
-- SELECT * FROM dbname.tablename1;
SHOW /* this is valid comment */ TABLES;

{
"srcdb": "sskrdemo1",
"dstdb": "sskrtest201",
"patient": "patient_new",
"f2c":"patient_f2c",
"derived":"patient_derived",
"composed":"patient_composed",
"complex":"patient_complex",
"patient_view": "patient_newview",
"brca": "brca_new",
"patient_table":"patient",
"cna": "cna_new"
}

set srcdb=sskrdemo1;
set patient_table=patient;
select * from ${srcdb}.${patient_table};

{
"num_files" : 2,
"fileprefix":"demo",
"header": true
}

{
"spark-defaults.conf": [
{
"name": "spark.app.name",
"value": "SparkAppName"
},
{
"name": "spark.test.conf",
"value": true
}
]
}

$ dx tree export
export
├── job-FFp7K2j0xppVXZ791fFxp2Bg-export.tar
├── job-FFp7K2j0xppVXZ791fFxp2Bg-debug.sql

├── demo-0
│ ├── demo-0-out.csv
│ │ ├── _SUCCESS
│ │ ├── part-00000-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
│ │ └── part-00001-1e2c301e-6b28-47de-b261-c74249cc6724-c000.csv
│ └── demo-0.sql
├── demo-1
│ ├── demo-1-out.csv
│ │ ├── _SUCCESS
│ │ └── part-00000-b21522da-0e5f-42ba-8197-e475841ba9c3-c000.csv
│ └── demo-1.sql
├── demo-2
│ ├── demo-2-out.csv
│ │ ├── _SUCCESS
│ │ ├── part-00000-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
│ │ └── part-00001-e61c6eff-5448-4c39-8c72-546279d8ce6f-c000.csv
│ └── demo-3.sql
├── demo-3
│ ├── demo-3-out.csv
│ │ ├── _SUCCESS
│ │ └── part-00000-5a48ba0f-d761-4aa5-bdfa-b184ca7948b5-c000.csv
│ └── demo-3.sql

-- [SQL Runner Report] --;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
-- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
-- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
-- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
-- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] SHOW DATABASES;
-- [SUCCESS][OutputFile: demo-1-out.csv, TimeTaken: 3.85295510292 secs] create database sskrdemo2 location 'dnax://';
-- [SUCCESS][OutputFile: demo-2-out.csv, TimeTaken: 4.8106200695 secs] use sskrdemo2;
-- [SUCCESS][OutputFile: demo-3-out.csv , TimeTaken: 1.00737595558 secs] create table patient (first_name string, last_name string, age int, glucose int, temperature int, dob string, temp_metric string) stored as parquet;

-- [SQL Runner Report] --;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set f2c=patient_f2c;
-- [SUCCESS][TimeTaken: 1.90734863281e-06 secs ] set srcdb=sskrdemosrcdb1_13;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient=patient_new;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set derived=patient_derived;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set composed=patient_composed;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_table=patient;
-- [SUCCESS][TimeTaken: 1.19209289551e-06 secs ] set complex=patient_complex;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set patient_view=patient_newview;
-- [SUCCESS][TimeTaken: 9.53674316406e-07 secs ] set cna=cna_new;
-- [SUCCESS][TimeTaken: 0.0 secs ] set brca=brca_new;
-- [SUCCESS][TimeTaken: 2.14576721191e-06 secs ] set dstdb=sskrdemodstdb1_13;
-- [SUCCESS][OutputFile: demo-0-out.csv, TimeTaken: 8.83630990982 secs] select * from ${srcdb}.${patient_table};
-- [FAIL] SQL ERROR while below command [ Reason: u"\nextraneous input '`' expecting <EOF>(line 1, pos 45)\n\n== SQL ==\ndrop database if exists sskrtest2011 cascade `\n---------------------------------------------^^^\n"];
drop database if exists ${dstdb} cascade `;
create database if not exists ${dstdb} location 'dnax://';
create view ${dstdb}.patient_view as select * from ${srcdb}.patient;
select * from ${dstdb}.patient_view;

drop database if exists ${dstdb} cascade `;

CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'

db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']

db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']

To understand the archival life cycle as well as which operations can be performed on files and how billing works, it's helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:
live: The file is in standard storage, such as AWS S3 or Azure Blob.
archival: Archival has been requested on the current file, but other copies of the same file are in the live state in multiple projects with the same billTo entity. The file is still in standard storage.
archived: The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE.
unarchiving: Restore has been requested on the current file. The file is in transition from archival storage to standard storage.
Different states of a file allow different operations on the file. See the table below for which operations can be performed based on a file's current archival state.
live
Yes
Yes
Yes
Yes
No
archival
No
Yes*
* Clone operation would fail if the object is actively transitioning from archival to archived.
When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. Only when all copies of a file in all projects with the same billTo organization are in the archival state, does the file transition to the archived state automatically by the platform.
Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from the archived to the unarchiving state. During the unarchiving state, the file is being restored by the third-party storage platform, such as AWS or Azure. The unarchiving process may take a while depending on the retrieval option selected for the specific platform. Finally, when unarchiving is completed, and the file becomes available on standard storage, the file is transitioned to a live state.
The File-based Archive Service allows users who have the CONTRIBUTE or ADMINISTER permissions to a project to archive or unarchive files that reside in the project.
Using the API, users can archive or unarchive files, folders, or entire projects, although the archiving process itself happens at the file level. The API can accept a list of up to 1000 files for archiving and unarchiving.
When archiving or unarchiving folders or projects, the API by default archives or unarchives all the files at the root level and those in the subfolders recursively. If you archive a folder or a project that includes files in different states, the Service only archives files that are in the live state and skips files that are in other states. Likewise, if you unarchive a folder or a project that includes files in different states, the Service only unarchives files that are in the archived state, transitions archival files back to the live state, and skips files in other states.
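As a hedged sketch of what such an API call can look like from Python, using the dxpy wrappers for the archive and unarchive routes (all IDs and the folder path are placeholders; consult the archive API reference for the authoritative input fields):

import dxpy

# Sketch: request archival of specific files, or of a folder, in a project.
# All IDs and paths are placeholders.
dxpy.api.project_archive("project-xxxx", {"files": ["file-aaaa", "file-bbbb"]})
dxpy.api.project_archive("project-xxxx", {"folder": "/alignments"})

# Request unarchival of archived files in the same way.
dxpy.api.project_unarchive("project-xxxx", {"files": ["file-aaaa"]})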
The archival process incurs specific charges, all billed to the billTo organization of the project:
Standard storage charge: The monthly storage charge for files that are located in the standard storage on the platform. The files in the live and archival state incur this charge. The archival state indicates that the file is waiting to be archived or that other copies of the same file in other projects are still in the live state, so the file is in standard storage, such as AWS S3. The standard storage charge continues to get billed until all copies of the file are requested to be archived and eventually the file is moved to archival storage and transitioned into the archived state.
Archival storage charge: The monthly storage charge for files that are located in archival storage on the platform. Files in the archived state incur a monthly archival storage charge.
Retrieval fee: The retrieval fee is a one-time charge at the time of unarchiving based on the volume of data being unarchived.
Early retrieval fee: If you retrieve or delete data from archival storage before the required retention period is met, an early retrieval fee applies. This is 90 days for AWS regions and 180 days for Microsoft Azure regions. You are charged a pro-rated fee equivalent to the archival storage charges for any remaining days within that period.
When using the Archive Service, we recommend the following best practices.
The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.
If a file is shared in multiple projects, archiving one copy in one of the projects only transitions the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you need to ensure that all copies of the file in all projects with the same billTo org are being archived. When all copies of the file reach the archival state, the Service moves the files from archival to archived state. Consider using the allCopies option of the API to archive all copies of the file. You must be the org ADMIN of the billTo org of the current project to use the allCopies option.
Refer to the following example: file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, which share the same billTo org (org-xxxx). You have ADMINISTER access to project-xxxx and CONTRIBUTE access to project-yyyy, but no role in project-zzzz. As an org ADMIN of the project's billTo org, you want to archive all copies of the file in all projects with the same billTo:
List all the copies of the file in org-xxxx.
Force-archive all the copies of file-xxxx.
All copies of file-xxxx transition into the archived state.
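A sketch of this walkthrough using dxpy is shown below; all IDs are placeholders, and allCopies is the API option mentioned above.

import dxpy

# 1. List the projects that contain a copy of file-xxxx (IDs are placeholders).
print(dxpy.DXFile("file-xxxx").list_projects())
# e.g. {"project-xxxx": "ADMINISTER", "project-yyyy": "CONTRIBUTE", ...}

# 2. As an ADMIN of the billTo org, force archival of all copies of the file.
dxpy.api.project_archive("project-xxxx", {"files": ["file-xxxx"], "allCopies": True})

# 3. Once all copies have been processed, the file's archivalState becomes "archived".
print(dxpy.DXFile("file-xxxx", project="project-xxxx").describe()["archivalState"])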
You can use the dx ls command to list the objects in your current project. You can determine the current project and folder you are in by using the command dx pwd. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.
By listing objects in your current directory with the wildcard characters * and ?, you can search for objects with a filename using a glob pattern. The examples below use the folder "C. Elegans - Ce10/" in the public project (platform login required to access this link).
To search the entire project with a filename pattern, use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data under the current project. Below, the command dx find data is used in the public project (platform login required to access this link) using the --name option to specify the filename of objects that you're searching for.
When filenames contain special characters, escape these characters with a backslash (\) during searches. Characters requiring escaping include wildcards (* and ?) as well as colons (:) and slashes (/), which have special meaning in DNAnexus paths.
Shell behavior affects escaping rules. In many shells, you need to either double-escape (\\) or use single quotes to prevent the shell from interpreting the backslash.
The following examples show proper escaping techniques:
dx find data also allows you to search data using metadata fields, such as when the data was created, the data tags, or the project the data exists in.
You can use the flags --created-after and --created-before to search for data objects created within a specific time period.
You can search for objects based on their metadata. An object's metadata can be set via CLI commands that tag the object or attach key-value properties describing it. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags.
To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.
You can search for an object living in a different project than your current working project by specifying a project and folder path with the flag --path. Below, the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project (platform login required to access this link) is specified as an example.
To search for data objects in all projects where you have VIEW and above permissions, use the --all-projects flag. Public projects are not shown in this search.
To describe small numbers of files (typically fewer than 100), scope findDataObjects to a single project.
The following is an example of a call scoped to a single project:
See the for more information about usage.
Use Symlinks to access, work with, and modify files that are stored on an external cloud service.
The DNAnexus Symlinks feature enables users to link external data files on AWS S3 and Azure blob storage as objects on the platform and access such objects for any usage as though they are native DNAnexus file objects.
No storage costs are incurred when using symlinked files on the Platform. When used by jobs, symlinked files are downloaded to the Platform at runtime.
Symlinked files stored in AWS S3 or Azure blob storage are made accessible on DNAnexus via a Symlink Drive. The drive contains the necessary cloud storage credentials, and can be created by following Step 1 below.
To set up Symlink Drives, use the CLI to provide the following information:
A name for the Symlink Drive
The cloud service (AWS or Azure) where your files are stored
The access credentials required by the service
After you've entered the appropriate command, a new drive object is created. You can see a confirmation message that includes the id of the new Symlink Drive in the format drive-xxxx.
By associating a DNAnexus Platform project with a Symlink Drive, you can both:
Have all new project files automatically uploaded to the AWS S3 bucket or Azure blob, to which the Drive links
Enable project members to work with those files
"New project files" includes the following:
Newly created files
File outputs from jobs
Files uploaded to the project
Non-symlinked files cloned into a symlinked project are not uploaded to the linked AWS S3 bucket or Azure blob.
When creating a new project via the UI, you can link it with an existing Symlink Drive by toggling the Enable Auto-Symlink in This Project setting to "On":
Next:
In the Symlink Drive field, select the drive with which the project should be linked
In the Container field, enter the name of the AWS S3 bucket or Azure blob where newly created files should be stored
Optionally, in the Prefix field, enter the name of a folder within the AWS S3 bucket or Azure blob where these files should be stored
When creating a new project via the CLI, you can link it to a Symlink Drive by using the optional argument --default-symlink with dx new project. See for details on inputs and input format.
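As a rough sketch of what this can look like via the CLI (the JSON field names below, drive, container, and prefix, mirror the UI fields described above and are illustrative; check the linked reference for the exact input format):
$ dx new project "My Symlinked Project" --default-symlink '{
  "drive": "drive-xxxx",
  "container": "my-s3-bucket",
  "prefix": "/dnanexus-files"
}'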
To ensure that files can be saved to your AWS S3 bucket or Azure blob, you must enable CORS for that remote storage container.
Refer to Amazon documentation for .
Use the following JSON object when configuring CORS for the bucket:
Refer to Microsoft documentation for .
Working with Symlinked files is similar to working with files that are stored on the Platform. These files can, for example, be used as inputs to apps, applets, or workflows.
If you rename a symlink on DNAnexus, this does not change the name of the file in S3 or Azure blob storage. In this example, the symlink has been renamed from the original name file.txt, to Example File. The remote filename, as shown in the Remote Path field in the right-side info pane, remains file.txt:
If you delete a symlink on the Platform, the file to which it points is not deleted.
If your cloud access credentials change, you must update the definition of all Symlink Drives to keep using files to which those Drives provide access.
To update a drive definition with new AWS access credentials, use the following command:
To update a drive definition with new Azure access credentials, use the following command:
For more information, see .
No, the symlinked file only moves within the project. The change is not mirrored in the linked S3 or Azure blob container.
The job fails after it is unable to retrieve the source file.
Yes, you can copy a symlinked file from one project to another. This includes copying symlinked files from a symlink-enabled project to a project without this feature enabled.
Yes - egress charges are incurred.
In this scenario, the uploaded file overwrites, or "clobbers," the file that shares its name, and only the newly uploaded file is stored in the AWS S3 bucket or Azure blob.
This is true even if, within your project, you first renamed the symlinked file and uploaded a new file with the prior name. For example, if you upload a file named file.txt to your DNAnexus project, the file is automatically uploaded to your S3 or Azure blob to the specified directory. If you then rename the file on DNAnexus from file.txt to file.old.txt, and upload a new file to the project called file.txt, the original file.txt that was uploaded to S3 or Azure blob is overwritten. However, you are still left with file.txt and file.old.txt symlinks in your DNAnexus project. Trying to access the original file.old.txt symlink results in a checksum error.
If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. Attempting to do so via API call returns a PermissionDenied error.
Create, filter, and manage patient cohorts using clinical, genomic, and other data fields in the Cohort Browser.
Create comprehensive patient cohorts by filtering your datasets. You can combine, compare, and export your cohorts for further analysis.
If you'd like to visualize data in your cohorts, see .
When you start exploring a dataset, Cohort Browser automatically creates an empty cohort that includes all patients/samples. You can then refine it by adding filters, and repeat this multiple times to create additional cohorts.
The Cohorts panel gives you an overview of your active cohorts on the dashboard (up to 2) and the recently used cohorts (up to 8) in your current session. These can be temporary unsaved cohorts as well as saved cohorts.
Allowed actions (such as archive, unarchive, and cancel archive) depend on a file's archival state: live, archival, archived, or unarchiving.
dx api file-xxxx listProjects '{"archivalInfoForOrg":"org-xxxx"}'
{
"project-xxxx": "ADMINISTER",
"project-yyyy": "CONTRIBUTE",
"liveProjects": [
"project-xxxx",
"project-yyyy",
"project-zzzz"
]
}
dx api project-xxxx archive '{"files": ["file-xxxx"], "allCopies": true}'
{
"id": "project-xxxx"
"count": 1
}$ dx select "Reference Genome Files"
$ dx cd "C. Elegans - Ce10/"
$ dx pwd # Print current working directory
Reference Genome Files:/C. Elegans - Ce10
$ dx ls
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
ce10.cw2-index.tar.gz
ce10.fasta.fai
ce10.fasta.gz
ce10.hisat2-index.tar.gz
ce10.star-index.tar.gz
ce10.tmap-index.tar.gz
$ dx ls '*.fa*' # List objects with filenames of the pattern "*.fa*"
ce10.fasta.fai
ce10.fasta.gz
$ dx ls ce10.???-index.tar.gz # List objects with filenames of the pattern "ce10.???-index.tar.gz"
ce10.cw2-index.tar.gz
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
$ dx find data --name "*.fa*.gz"
closed 2014-10-09 09:50:51 776.72 MB /M. musculus - mm10/mm10.fasta.gz (file-BQbYQPj0Z05ZzPpb1xf000Xy)
closed 2014-10-09 09:50:30 767.47 MB /M. musculus - mm9/mm9.fasta.gz (file-BQbYK6801fFJ9Fj30kf003PB)
closed 2014-10-09 09:49:27 49.04 MB /D. melanogaster - Dm3/dm3.fasta.gz (file-BQbYVf80yf3J9Fj30kf00PPk)
closed 2014-10-09 09:48:55 29.21 MB /C. Elegans - Ce10/ce10.fasta.gz (file-BQbY9Bj015pB7JJVX0vQ7vj5)
closed 2014-10-08 13:52:26 818.96 MB /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.fa.gz (file-B6ZY7VG2J35Vfvpkj8y0KZ01)
closed 2014-10-08 13:51:31 876.79 MB /H. Sapiens - hg19 (UCSC)/ucsc_hg19.fa.gz (file-B6qq93v2J35fB53gZ5G0007K)
closed 2014-10-08 13:50:53 827.95 MB /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.fa.gz (file-B6ZYPQv2J35xX095VZyQBq2j)
closed 2014-10-08 13:50:17 818.88 MB /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.fa.gz (file-BFBv6J80634gkvZ6z100VGpp)
closed 2014-10-08 13:49:53 810.45 MB /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.fa.gz (file-B6ZXxfG2J35Vfvpkj8y0KXF5)
# Searching for a file with colons in the name
dx find data --name "sample\:123.txt"
# Or alternatively with single quotes
dx find data --name 'sample\:123.txt'
# Searching for a file with a literal asterisk
dx find data --name "experiment\*.fastq"$ dx find data --created-after 2017-02-22 --created-before 2017-02-25
closed 2017-02-27 19:14:51 3.90 GB /H. Sapiens - hg19 (UCSC)/ucsc_hg19.hisat2-index.tar.gz (file-F2pJvF80Vzx54f69K4J8K5xy)
closed 2017-02-27 19:14:21 3.55 GB /M. musculus - mm10/mm10.hisat2-index.tar.gz (file-F2pJqk00Vq161bzq44Vjvpf5)
closed 2017-02-27 19:13:57 3.51 GB /M. musculus - mm9/mm9.hisat2-index.tar.gz (file-F2pJpKj0G0JxZxBZ4KJq0Q6B)
closed 2017-02-27 19:13:41 3.85 GB /H. Sapiens - hg19 (Ion Torrent)/ion_hg19.hisat2-index.tar.gz (file-F2pJkp00BjBk99xz4Jk74V0y)
closed 2017-02-27 19:13:28 3.85 GB /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/human_g1k_v37.hisat2-index.tar.gz (file-F2pJpy007bGBzj7X446PzxJJ)
closed 2017-02-27 19:13:02 3.90 GB /H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.hisat2-index.tar.gz (file-F2pJpb000vFpzj7X446PzxF0)
closed 2017-02-27 19:12:31 3.91 GB /H. Sapiens - GRCh38/GRCh38.no_alt_analysis_set.hisat2-index.tar.gz (file-F2pK5y00F8Bp9BYk4KX7Qb4P)
closed 2017-02-27 19:12:18 224.54 MB /D. melanogaster - Dm3/dm3.hisat2-index.tar.gz (file-F2pJP7j0QkbQ3ZqG269589pj)
closed 2017-02-27 19:11:56 139.76 MB /C. Elegans - Ce10/ce10.hisat2-index.tar.gz (file-F2pJK300KKz8bx1126Ky5b3P)
$ dx find data --tag sampleABC --tag batch123
closed 2017-01-01 09:00:00 6.08 GB /Input/SRR504516_1.fastq.gz (file-xxxx)
closed 2017-01-01 09:00:00 5.82 GB /Input/SRR504516_2.fastq.gz (file-wwww)
$ dx find data --property sequencing_providor=CRO_XYZ
closed 2017-01-01 09:00:00 8.06 GB /Input/SRR504555_1.fastq.gz (file-qqqq)
closed 2017-01-01 09:00:00 8.52 GB /Input/SRR504555_2.fastq.gz (file-rrrr)
$ dx find data --name "*.fastq.gz" --path project-BQfgzV80bZ46kf6pBGy00J38:/Input
closed 2014-10-03 12:04:16 6.08 GB /Input/SRR504516_1.fastq.gz (file-B40jg7v8KfPy38kjz1vQ001y)
closed 2014-10-03 12:04:16 5.82 GB /Input/SRR504516_2.fastq.gz (file-B40jgYG8KfPy38kjz1vQ0020)
$ dx find data --name "SRR*_1.fastq.gz" --all-projects
closed 2017-01-01 09:00:00 6.08 GB /Exome Analysis Demo/Input/SRR504516_1.fastq.gz (project-xxxx:file-xxxx)
closed 2017-07-01 10:00:00 343.58 MB /input/SRR064287_1.fastq.gz (project-yyyy:file-yyyy)
closed 2017-01-01 09:00:00 6.08 GB /data/exome_analysis_demo/SRR504516_1.fastq.gz (project-zzzz:file-xxxx)
dx api system findDataObjects '{"scope": {"project": "project-xxxx"}, "describe":{"fields":{"state":true}}}'
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "aws",
"credentials" : {
"accessKeyId" : "<my_aws_access_key>",
"secretAccessKey" : "<my_aws_secret_access_key>"
}
}'
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "azure",
"credentials" : {
"account" : "<my_azure_storage_account_name>",
"key" : "<my_azure_storage_access_key>"
}
}'
[
{
"AllowedHeaders": [
"Content-Length",
"Origin",
"Content-MD5",
"accept",
"content-type"
],
"AllowedMethods": [
"PUT",
"POST"
],
"AllowedOrigins": [
"https://*"
],
"ExposeHeaders": [
"Retry-After"
],
"MaxAgeSeconds": 3600
}
]
dx api <driveID> update '{
"credentials" : {
"accessKeyId" : "<my_new_aws_access_key>",
"secretAccessKey" : "<my_new_aws_secret_access_key>"
}
}'
dx api <driveID> update '{
"credentials" : {
"account" : "<my_azure_storage_account_name>",
"key" : "<my_azure_storage_access_key>"
}
}'
To change the active cohorts on the dashboard, you need to swap them between the Dashboard and Recent sections:
In Cohorts > Dashboard, click In Dashboard to remove a cohort from the dashboard.
In Cohorts > Recent, click Add to Dashboard next to the cohort you want to add to the dashboard.
This way you can quickly explore, compare, and iterate across multiple cohorts within a single session.
To apply a filter to your cohort:
For the cohort you want to edit, click Add Filter.
In Add Filter to Cohort > Clinical, select a data field to filter by.
Click Add Cohort Filter.
In Edit Filter, select operators and enter the values to filter by.
Click Apply Filter.
After you apply the filter, the dashboard automatically refreshes and displays the updated cohort size below the filtered cohort's name.
With multi-assay datasets, you can create cohorts by applying filters from multiple assay types and instances.
When adding filters, you can find assay types under the Assays tab. This allows you to create cohorts that combine different types of data. For example, you can filter patients based on both clinical characteristics and germline variants, merge somatic mutation criteria with gene expression levels, or build cohorts that span multiple assays of the same type.
To learn more about filtering by specific assay types, see:
When working with an omics dataset that includes multiple assays, such as a germline dataset with both WES and WGS assays, you can:
Select specific assays to choose which assay to filter on.
Apply different filters per assay.
Create separate cohorts for different assays of the same type and compare results.
The maximum number of filters allowed varies by assay type and is shared across all instances of that type:
Germline variant assays: 1 filter maximum
Somatic variant assays: Up to 10 filter criteria
Gene expression assays: Up to 10 filter criteria
If you add multiple filters from the same category, such as Patient or Sample, they automatically form a filter group.
By default, filters within a filter group are joined by the logical operator 'AND', meaning that all filters in the group must be satisfied for a record to be included in the cohort. You can change the logical operator used within the group to 'OR' by clicking on the operator.
Join filters allow you to create cohorts by combining criteria across multiple related data entities within your dataset. This is useful when working with complex datasets that contain interconnected information, such as patient records linked to visits, medications, lab tests, or other clinical data.
An entity is a grouping of data around a unique item, event, or concept.
In the Cohort Browser, an entity can refer either to a data model object, such as patient or visit, or to a specific input parameter in the Table Exporter app.
Common examples of data entities include:
Patient: Demographics, medical history, baseline characteristics
Visit: Hospital admissions, appointments, encounters
Medication: Prescriptions, dosages, administration records
Lab Test: Results, procedures, sample information
To create join filters that span multiple data entities:
Start a new join filter: On the cohort panel, click Add Filter or, on a chart tile, click Cohort Filters > Add Cohort Filter.
Select secondary entity: Choose data fields from a secondary entity (different from your primary entity) to create the join relationship.
Add criteria to existing joins: To expand an existing join filter, click Add additional criteria on the row of the chosen filter.
Join filters support both AND as well as OR logical operators to control how criteria are combined:
AND logic: All specified criteria must be met
OR logic: Any of the specified criteria can be met
Key rules for logical operators:
Click on the operator buttons to switch between the AND logic and OR logic.
For a specific level of join filtering, joins are either all AND or all OR.
When using OR for join filters, the existence condition applies first: "where exists, join 1 OR join 2".
As your filtering needs become more sophisticated, you can create multi-layered join structures:
Add criteria to branches: Further define secondary entities by adding additional criteria to existing join branches
Create nested joins: Add more layers of join filters that derive from the current branch
Automatic field filtering: The field selector automatically hides fields that are ineligible based on the current join structure
The following examples show how join filters work in practice:
First Example Cohort - Separate Conditions: This cohort identifies all patients with a "high" or "medium" risk level who meet both of these conditions:
Have a first hospital visit (visit instance = 1)
Have had a "nasal swab" lab test at any point (not necessarily during the first visit)
Second Example Cohort - Connected Conditions: This cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed specifically during their first visit, creating a more restrictive temporal relationship between the visit and lab test.
You can save your cohort selection to a project as a cohort record by clicking Save Cohort in the top-right corner of the cohort panel.
Cohorts are saved with their applied filters, as well as the latest visualizations and dashboard layout. Like other dataset objects, you can find your saved cohorts under the Manage tab in your project.
To open a cohort, double-click it or click Explore Data.
Need to use your cohorts with a different dataset? If you want to apply your cohort definitions to a different Apollo Dataset, you can use the Rebase Cohorts And Dashboards app to transfer your saved cohorts to a new target dataset.
For each cohort, you can export a list of main entity IDs in your current cohort selection as a CSV file by clicking Export sample IDs.
On the Data Preview tab, you can export tabular information as record IDs or a CSV file. Select multiple table rows to see export options in the top-right corner. Exports include only the fields displayed in the Data Preview tab.
The Data Preview supports up to 30 columns per tab. Tables with 30-200 columns show column names only. In such cases, you can save cohorts but data is not queried. Tables with over 200 columns are not supported.
You can view up to 30,000 records in the Data Preview. If your cohort exceeds this size, the table may not display all data. For larger exports, use the Table Exporter app.
The Cohort Browser follows your project's download policy restrictions. Downloads are blocked when:
Database restrictions apply: If the database storing your dataset has restricted download permissions, you cannot download data from any Cohort Browser view of that dataset, regardless of which project contains the cohort or dashboard.
All dataset copies are restricted: When every copy of your dataset exists in projects with restricted download policies, downloads are blocked. However, if at least one copy exists in a project that allows downloads, then downloads are permitted.
Cohort or dashboard restrictions apply: If the specific cohort or dashboard you're viewing has restricted download permissions, downloads are blocked regardless of the underlying dataset permissions.
You can create complex cohorts by combining existing cohorts from the same dataset.
Near the cohort name, click + > Combine Cohorts.
In the Cohorts panel, click Combine Cohorts.
You can also create a combined cohort based on the cohorts already being compared.
The Cohort Browser supports the following combination logic:
Logic
Description
Number of Cohorts Supported
Intersection
Select members that are present in ALL selected cohorts. Example: intersection of cohort A, B and C would be A ∩ B ∩ C.
Up to 5 cohorts
Union
Select members that are present in ANY of the selected cohorts. Example: union of cohort A, B and C would be A ∪ B ∪ C.
Up to 5 cohorts
Subtraction
Select members that are present only in the first selected cohort and not in the second. Example: Subtraction of cohort A, B would be A - B.
2 cohorts
Unique
Select members that appear in exactly one of the selected cohorts. Example: Unique of cohort A, B would be (A - B) ∪ (B - A).
2 cohorts
Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.
You can compare two cohorts from the same dataset by adding both cohorts into the Cohort Browser.
To compare cohorts, click + next to the cohort name. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.
When comparing cohorts:
All visualizations are converted to show data from both cohorts.
You can continue to edit both cohorts and visualize the results dynamically.
You can compare a cohort with its complement in the dataset by selecting Compare / Combine Cohorts > Not In …. Similar to combining cohorts, you first need to save your current cohort before creating its not-in counterpart.
Logic
Description
Not In
Select patients that are present in the dataset, but not in the current cohort. Example: In dataset U, the result of "Not In" A would be U - A.
Cohorts created using Not In cannot be used for further creation of combined or not-in cohorts. "Not In" cohorts are linked to the cohort they are originally based on. Once a not-in cohort is created, further changes to the original cohort definition are not reflected.
The dx command create_cohort generates a new Cohort object on the platform using an existing Dataset or Cohort object, and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object.
When the input is a CohortBrowser typed record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined in a way such that the resulting record is an intersection of the IDs present in the original input and the IDs passed through CLI.
For additional details, see the create_cohort command reference and example notebooks in the public GitHub repository, DNAnexus/OpenBio.
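As a minimal sketch (the record IDs, sample IDs, and output path are placeholders; the flags shown are the ones commonly documented for this command, so confirm them with dx create_cohort --help):
$ dx create_cohort "project-xxxx:/Cohorts/high_risk_samples" \
    --from project-xxxx:record-yyyy \
    --cohort-ids "sample_001,sample_002,sample_003"
This creates a new CohortBrowser record containing only the listed primary IDs that are also present in the source dataset or cohort.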
Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.
Within the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.
Projects have a series of features designed for collaboration, helping project members coordinate and organize their work, and ensuring appropriate control over both data and tools.
A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.
Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.
The following are four common actions you can perform on objects from within the Manage screen.
You can directly download file objects from the system.
Select the file's row.
Click More Actions (⋮).
From the list of available actions, select Download.
Follow the instructions in the modal window that opens.
To learn more about an object:
Select the row showing the name of the object you want to learn more about, then click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen. An info panel opens on the right, displaying a range of information about the object, including its unique ID, as well as metadata about its owner, time of creation, size, tags, properties, and more.
Deletion is permanent and cannot be undone.
To delete an object:
Select its row.
Click More Actions (⋮).
From the list of available actions, select Delete.
Follow the instructions in the modal window that opens.
Select the object or objects you want to copy to a new project, by clicking the box to the left of the name of each object in the objects list.
Click the Copy button in the upper right corner of the Manage screen. A modal window opens.
Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.
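If you prefer the CLI, the same operation can be performed with dx cp (the project names and paths below are illustrative):
$ dx cp "project-aaaa:/results/variants.vcf.gz" "project-bbbb:/inputs/"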
You can collaborate on the platform by . On sharing a project with a user, or group of users in an , they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.
To remove a user or org from a project to which you have ADMINISTER access:
On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window opens, showing a list of project members.
Find the row showing the user you want to remove from the project.
Move your mouse over that row, then click the Remove from Members button at the right end of the row.
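The CLI provides equivalents for granting and revoking project access, dx invite and dx uninvite (the user name and project ID below are placeholders):
$ dx invite user-alice project-xxxx CONTRIBUTE   # grant CONTRIBUTE access
$ dx uninvite user-alice project-xxxx            # revoke access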
Suppose you have a set of samples sequenced at your lab, and you have a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.
Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project, then upload your sequenced data to the project. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You both are then able to use each other's data to perform downstream analyses.
A project admin can configure a project to allow project members to run only specific executables as . The list of allowed executables is set by entering the following command, via the CLI:
This command overwrites any existing list of allowed executables.
To discard the allowed executables list, that is, let project members run all available executables as root executions, enter the following command:
Executables that are called by a permitted executable can run even if they are not included in the list.
Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false and can be viewed and modified via CLI and platform API. The protected, restricted, downloadRestricted, externalUploadRestricted and containsPHI settings can be viewed and modified in the project's Settings web screen as described below.
protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.
restricted: If set to true,
Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:
Data in this project cannot be cloned to other projects that do not have containsPHI set to true
Any jobs that run in non-PHI projects cannot access any data that can only be found in PHI projects
Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging in to the Platform and opening the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.
On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account to which invoices are sent, covering all billable activities carried out within the project.
The Monthly Project Usage Limit for Compute and Egress and Monthly Project Storage Spending Limit features can help project admins monitor and keep project costs under control. For more information, see .
In the project's Settings tab under the Usage Limits section, project admins can view the project's compute and egress usage limits.
For details on how to set and retrieve project-specific compute and egress usage limits, and storage spending limits, see the .
If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:
On the project's Settings screen, scroll down to the Administration section.
Click the Transfer Billing button. A modal window opens.
Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.
The user receives an email notification of your request. To finalize the transfer, they need to log onto the Platform and formally accept it.
If you have billable activities access in the org to which you wish to transfer the project, you can change the project's billing account to that org. To do this, navigate to the project settings page by clicking the gear icon in the project header, then select the billing account to which the project should be billed.
If you do not have billable activities access in the org you wish to transfer the project to, you need to transfer the project to a user who does have this access. The recipient is then able to follow the instructions below to accept a project transfer on behalf of an org.
You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the recipient. To do this:
Select All Projects from the Projects link in the main menu. Open the project. You see a Pending Project Ownership Transfer notification at the top of the screen.
Click the Cancel Transfer button to cancel the transfer.
When another user initiates a project transfer to you, you receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.
If you did not already have access to the project being transferred, you receive VIEW access and the project appears in the list on the Projects screen.
To accept the transfer:
Open the project. You see a Pending Project Ownership Transfer notification in the project header.
Click the Accept Transfer button.
Select a new billing account for the project from the dropdown of eligible accounts.
Projects with auto-symlink enabled cannot be transferred to a different billing account. For more information, see .
If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.
Ownership of sponsored projects may not be transferred without the sponsorship first being terminated.
A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.
On setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.
Billing responsibility for sponsored projects may not be transferred.
Sponsored projects may not be deleted unless the project sponsor first ends the sponsorship by changing its end date to a date in the past.
For more information about sponsorship, contact .
for detailed information on projects that are billed to an org.
Learn about accessing and working with projects via the CLI:
Learn about working with projects as a developer:
Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.
Jupyter notebooks are a popular way to track the work performed in computational experiments the way a lab notebook tracks the work done in a wet lab setting. DXJupyterLab, or JupyterLab, is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. DXJupyterLab allows users on the DNAnexus Platform to collaborate on notebooks and extends JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.
DXJupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.
DXJupyterLab is a versatile application that can be used to:
Collaborate on exploratory analysis of data
Reproduce and fork work performed in computational analyses
Visualize and gain insights into data generated from biological experiments
Create figures and tables for scientific publications
The DNAnexus Platform offers two different DXJupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the framework.
Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.
The app contains all the features found in the general-purpose DXJupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.
DXJupyterLab 2.2 is the default version on the DNAnexus Platform. .
A step-by-step guide on how to start with DXJupyterLab and create and edit Jupyter notebooks can be found in the .
Creating a DXJupyterLab session requires the use of two different environments:
The DNAnexus project (accessible through the web platform and the CLI).
The worker execution environment.
You have direct access to the project in which the application is run from the JupyterLab session. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the :
The project is selected when the DXJupyterLab app is started and cannot be subsequently changed.
The DNAnexus file browser shows:
Up to 1,000 of your most recently modified files and folders
All Jupyter notebooks in the project
Databases (Spark-enabled app only, limited to 1,000 most recent)
The file list refreshes automatically every 10 seconds. You can also refresh manually by clicking the circular arrow icon in the top right corner.
Need to see more files? Use dx ls in the terminal or access them programmatically through the API.
When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have [DX] prepended to the notebook name in the tab of all opened notebooks.
The execution environment file browser is accessible from the left sidebar (notice the folder icon at the top) or from the terminal:
To create Jupyter notebooks in the worker execution environment, use the File menu. These notebooks are stored on the local file system of the DXJupyterLab execution environment and must be saved to a DNAnexus project to persist after the session ends. More information about saving appears in the .
You can create and save notebooks directly in the DNAnexus project, as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are housed within the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks" and these notebooks persist in the DNAnexus project after the DXJupyterLab instance is terminated.
DNAnexus notebooks can be recognized by the [DX] that is prepended to its name in the tab of all opened notebooks.
DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.
To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them in the DNAnexus file viewer in the left panel.
In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.
For reading the input file multiple times or for reading a large fraction of the file in random order:
Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from Jupyter notebook.
For scanning the content of the input file once or for reading only a small fraction of file's content:
Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:
dx upload in bash console.
Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This uploads the file into the selected DNAnexus folder.
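For example, from the JupyterLab terminal (file names and paths are illustrative):
# fetch an input file from the project into the execution environment
$ dx download "/Input/sample.vcf.gz"
# persist a locally generated result back to the project
$ dx upload results.csv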
Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported. However, you can dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. For exporting local notebook to certain formats, the following commands might be needed beforehand: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
A command can be executed in the DXJupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files using the in input file array to the DXJupyterLab app. The provided command runs in the /opt/notebooks/ directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the DXJupyterLab app.
The cmd input makes it possible to use the papermill command that is pre-installed in the DXJupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
Where notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the for details.
Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.
If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.
It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.
Once the editing user closes the notebook, the lock is released and anybody else with access to the project can open it.
Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses are not lost when the DXJupyterLab session ends.
DXJupyterLab sessions begin with a set duration and shut down automatically at the end of this period. The timeout clock appears in the footer on the right side and can be adjusted using the Update duration button. The session terminates at the set timestamp even if the DXJupyterLab webpage is closed. Job lengths have an upper limit of 30 days, which cannot be extended.
A session can be terminated immediately from the top menu (DNAnexus > End Session).
It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).
A DXJupyterLab session is , and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to a snapshot file, except for directories /home/dnanexus and /mnt/, which are not included. This file is then uploaded to the project to .Notebook_snapshots and can be passed as input the next time the app is started.
Snapshots created with DXJupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These previous snapshots contain tool versions that may conflict with the newer environment, potentially causing problems.
To use a snapshot from a previous version in the current version of DXJupyterLab, recreate the snapshot as follows:
Create a tarball incorporating all the necessary data files and packages.
Save the tarball in a project.
Launch the current version of DXJupyterLab.
Import and unpack the tarball file.
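A minimal sketch of these steps, run from the JupyterLab terminal (the directory and file names are illustrative):
# in the old environment: bundle the data and installed packages you need
$ tar czf my_env.tar.gz my_data/ my_packages/
$ dx upload my_env.tar.gz
# in a session of the current DXJupyterLab version: fetch and unpack it
$ dx download my_env.tar.gz
$ tar xzf my_env.tar.gz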
If you don't want to have to recreate your older snapshot, you can run an and access the snapshot there.
Viewing other file types from your project, such as CSV, JSON, PDF files, images, or scripts, is convenient because JupyterLab displays them accordingly. For example, JSON files are collapsible and navigable, and CSV files are presented in tabular format.
However, editing and saving any open files from the project other than IPython notebooks results in an error.
The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.
When running the DXJupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to be able to read databases, which may be located in different projects and cannot be cloned.
Use dx run to start new jobs from within a notebook or the terminal. If the billTo for the project where your JupyterLab session runs does not have a license for detached executions, any started jobs run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job uses the JupyterLab session's workspace instead of the specified project. If a subjob fails or terminates on the DNAnexus Platform, the entire job tree—including your interactive JupyterLab session—terminates as well.
Jobs are limited to a runtime of 30 days. The system automatically terminates jobs running longer than 30 days.
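For instance, a job might be launched from a notebook cell or the terminal like this (the applet ID and input name are placeholders; add --detach only if your billTo is licensed for detached executions):
$ dx run applet-xxxx -iinput_file=file-yyyy -y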
The DXJupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users see a "403 Permission Forbidden" message under the JupyterLab session's URL.
When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML.
PYTHON_R (default option): Loads the environment with Python3 and R kernel and interpreter.
ML: Loads the environment with Python3 and machine learning packages such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype, but it does not contain R.
IMAGE_PROCESSING: Loads the environment with Python3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the
Full List of Pre-Installed Packages
For the full list of pre-installed packages, see the . This list includes details on feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML features.
Additional packages can be during a JupyterLab session. By creating a Docker container , users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
For more information on the features and benefits of JupyterLab, see the .
Create your first notebooks by following the instructions in the guide.
See the guide for tips and info on the most useful DXJupyterLab features.
You can describe objects (files, app(let)s, and workflows) on the DNAnexus Platform using the command dx describe.
Objects can be described using their DNAnexus Platform name via the command line interface (CLI) using a path.
Objects can be described relative to the user's current directory on the DNAnexus Platform. In the following example, the indexed reference genome file human_g1k_v37.bwa-index.tar.gz is described.
The entire path is enclosed in quotes because the folder name Original files contains whitespace. Instead of quotes, escape special characters with \: dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.
Objects can be described using an absolute path. This allows you to describe objects outside the current project context. In the following example, dx select selects the project "My Research Project" and dx describe describes the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.
Objects can be described using a unique object ID.
This example describes the workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.
Because workflows can include many app(let)s, inputs/outputs, and default parameters, the dx describe output can seem overwhelming.
The output from a dx describe command can be used for multiple purposes. The optional argument --json converts the output from dx describe into JSON format for advanced scripting and command line use.
In this example, the publicly available workflow object "Exome Analysis Workflow" is described and the output is returned in JSON format.
Parse, process, and query the JSON output using jq. Below, the dx describe --json output is processed to generate a list of all stages in the exome analysis pipeline.
To get the "executable" value of each stage present in the "stages" array value of the dx describe output above, use the following command:














data in this project cannot be used as input to a job or an analysis in another project
any running app or applet that reads from this project cannot write results to any other project
a job running in the project has the singleContext flag set to true, irrespective of the singleContext value supplied to /job/new and /executable-xxxx/run, and is only allowed to use the job's DNAnexus authentication token when issuing requests to the proxied DNAnexus API endpoint within the job. Use of any other authentication token results in an error.
This flag corresponds to the Copy Access policy in the project's Settings web interface screen.
downloadRestricted: If set to true, data in this project cannot be downloaded outside of the platform. For database objects, users cannot access the data in the project from outside DNAnexus. When set to true, previewViewerRestricted defaults to true unless explicitly overridden. This flag corresponds to the Download Access policy in the project's Settings web interface screen.
previewViewerRestricted: If set to true, file preview and viewer are disabled for the project. This flag defaults to true when downloadRestricted is set to true. You can override this by explicitly setting previewViewerRestricted to false using the /project-xxxx/update API method.
databaseUIViewOnly: If set to true, project members with VIEW access have their access to project databases restricted to the Cohort Browser only. This feature is only available to customers with an Apollo license. Contact DNAnexus Sales for more information.
containsPHI: If set to true, data in this project is treated as Protected Health Information (PHI), identifiable health information that can be linked to a specific person. PHI data protection safeguards the confidentiality and integrity of the project data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) by imposing the additional restrictions documented in the PHI Data Protection section. This flag corresponds to the PHI Data Protection setting in the Administration section of a project's Settings web interface screen.
displayDataProtectionNotice: If set to true, ADMIN users can turn on/off the ability to show a Data Protection Notice to any users accessing the selected project. If the Data Protection Notice feature is enabled for a project, all users, when first accessing the project, are required to review and confirm their acceptance of a requirement not to egress data from the project. A license is required to use this feature. Contact DNAnexus Sales for more information.
externalUploadRestricted: If set to true, external file uploads to this project (from outside the job context) are rejected. The creation of Apollo databases, tables, and inserts of data into tables is disallowed from Thrift with a non-job token. This flag corresponds to the External Upload Access policy in the project's Settings web interface screen. A license is required to use this feature. Contact DNAnexus Sales for more information.
httpsAppIsolatedBrowsing: If set to true, httpsApp access to jobs launched in this project are wrapped in Isolated Browsing, which restricts data transfers through the httpsApp job interface. A license is required to use this limited-access feature. Contact DNAnexus Sales for more information.
Apollo database access is subject to additional restrictions.
Once PHI Data Protection is activated for a project, it cannot be disabled.
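As an illustration of how the flags above can be inspected and changed outside the UI (the project ID is a placeholder; you need ADMINISTER access, and some flags require a license):
# view current policy flags
$ dx describe project-xxxx --json | jq '{protected, restricted, downloadRestricted}'
# enable the protected and download-restricted policies via the API
$ dx api project-xxxx update '{"protected": true, "downloadRestricted": true}'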
VIEW
Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD
Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE
Gives users UPLOAD access, plus the ability to run executions directly in the project.
ADMINISTER
Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows
Test and train machine/deep learning models
Interactively run commands on a terminal
The project in which the app is running is mounted read-only at the /mnt/project folder. Reading the content of files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment but makes more API calls to fetch the content.
STATA: Requires a license to run. See Stata in DXJupyterLab for more information about running Stata in JupyterLab.
MONAI_ML: Loads the environment with Python3 and extends the ML feature. This feature is ideal for medical imaging research involving machine learning model development and testing. It includes medical imaging frameworks designed for AI-powered analysis:
MONAI Core and MONAI Label: Medical imaging AI frameworks for deep learning workflows.
3D Slicer: Medical image visualization and analysis platform accessible through the SlicerJupyter kernel.



dx update project project-xxxx --allowed-executables applet-yyyy --allowed-executables workflow-zzzz [...]
dx update project project-xxxx --unset-allowed-executables
my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
Name (All): Object name on the platform.
State (All): Status of the object on the platform.
Visibility (All): Whether the file is visible to the user through the platform web interface.
Tags (All): Set of tags associated with an object. Tags are strings used to organize or annotate objects.
Properties (All): Key/value pairs attached to an object.
Outgoing links (All): JSON reference to another object on the platform. Linked objects are copied along with the object if the object is cloned to another project.
Created (All): Date and time the object was created.
Created by (All): DNAnexus user who created the object. Contains the subfield "via the job" if the object was created by an app or applet.
Last modified (All): Date and time the object was last modified.
Input Spec (App(let)s and Workflows): App(let) or workflow input names and classes. With workflows, the corresponding applet stage ID is also provided.
Output Spec (App(let)s and Workflows): App(let) or workflow output names and classes. With workflows, the corresponding applet stage ID is also provided.
ID (All): Unique ID assigned to a DNAnexus object.
Class (All): DNAnexus object type.
Project (All): Container where the object is stored.
Folder (All): Objects inside a container (project) can be organized into folders. Objects can only exist in one path within a project.
$ dx describe "Original files/human_g1k_v37.bwa-index.tar.gz"
Result 1:
ID file-xxxx
Class file
Project project-xxxx
Folder /Original files
Name human_g1k_v37.bwa-index.tar.gz
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created ----
Created by Amy
via the job job-xxxx
Last modified ----
archivalState "live"
Size 3.21 GB
$ dx select "My Research Project"
$ dx describe Reference\ Genome\ Files:H.\ Sapiens\ -\ GRCh37\ -\ b37\ (1000\ Genomes\ Phase\ I)/human_g1k_v37.fa.gz
Result 1:
ID file-xxxx
Class file
Project project-xxxx
Folder /H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)
Name human_g1k_v37.fa.gz
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created ----
Created by Amy
via the job job-xxxx
Last modified ----
archivalState "live"
Size 810.45 MB
$ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0
Result 1:
ID workflow-G409jQQ0bZ46x5GF4GXqKxZ0
Class workflow
Project project-BQfgzV80bZ46kf6pBGy00J38
Folder /
Name Exome Analysis Workflow
....
Stage 0 bwa_mem_fastq_read_mapper
Executable app-bwa_mem_fastq_read_mapper/2.0.1
Stage 1 fastqc
Executable app-fastqc/3.0.1
Stage 2 gatk4_bqsr
Executable app-gatk4_bqsr_parallel/2.0.1
Stage 3 gatk4_haplotypecaller
Executable app-gatk4_haplotypecaller_parallel/2.0.1
Stage 4 gatk4_genotypegvcfs
Executable app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0
$ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json
{
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"name": "Exome Analysis Workflow",
"inputSpec": [
{
"name": "bwa_mem_fastq_read_mapper.reads_fastqgzs",
"class": "array:file",
"help": "An array of files, in gzipped FASTQ format, with the first read mates to be mapped.",
"patterns": [ "*.fq.gz", "*.fastq.gz" ],
...
},
...
],
"stages": [
{
"id": "bwa_mem_fastq_read_mapper",
"executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
"input": {
"genomeindex_targz": {
"$dnanexus_link": {
"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
"id": "file-FFJPKp0034KY8f20F6V9yYkk"
}
}
},
...
},
{
"id": "fastqc",
"executable": "app-fastqc/3.0.1",
...
}
...
]
}

$ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq .stages
[{
"id": "bwa_mem_fastq_read_mapper",
"executable": "app-bwa_mem_fastq_read_mapper/2.0.1",
...
}, {
"id": "fastqc",
"executable": "app-fastqc/3.0.1",
...
}, {
"id": "gatk4_bqsr",
"executable": "app-gatk4_bqsr_parallel/2.0.1",
...
}
...
}]

$ dx describe "Exome Analysis Demo":workflow-G409jQQ0bZ46x5GF4GXqKxZ0 --json | jq '.stages | map(.executable) | .[]'
"app-bwa_mem_fastq_read_mapper/2.0.1"
"app-fastqc/3.0.1"
"app-gatk4_bqsr_parallel/2.0.1"
"app-gatk4_haplotypecaller_parallel/2.0.1"
"app-gatk4_genotypegvcfs_single_sample_parallel/2.0.0"In its simplest form, an org can be thought of as referring to a group of users on the same project. An org can be used efficiently to share projects and data with multiple users - and, if necessary, to revoke access.
Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting DNAnexus Sales.
Orgs are referenced on the DNAnexus Platform by a unique org ID, such as org-dnanexus. Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.
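As a brief sketch of working with an org from the CLI (the org ID, project ID, and access level below are placeholders):

# Show the org's settings
$ dx describe org-dnanexus

# Share a project with every member of the org at VIEW level
$ dx invite org-dnanexus project-xxxx VIEW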
Users may have one of two membership levels in an org:
ADMIN
MEMBER
An ADMIN-level user is granted all possible access in the org and may perform org administrative functions. These functions include adding/removing users or modifying org policies. A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses in the org and has no administrative power in the org.
A MEMBER-level user can be configured with a subset of the following org accesses. These access levels determine which actions each user can perform in an org.
Billable activities access
If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
[Allowed] or [Not Allowed]
Shared apps access
If allowed, the org member has access to view and run apps in which the org has been added as an "authorized user".
[Allowed] or [Not Allowed]
Shared projects access
The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member has at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project.
[NONE], [VIEW], [UPLOAD], [CONTRIBUTE] or [ADMINISTER]
These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.
Org admins are granted all possible access in the org. More specifically, org admins receive the following set of accesses:
Billable activities access
Allowed
Shared apps access
Allowed
Shared projects access
ADMINISTER
Org admins also have the following special privileges:
Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin can see all projects billed to the org, including Project-P. Org admins can also invite themselves to Project-P at any time to get access to objects and jobs in the project.
Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A. However, any org admins may add themselves as developers at any time.
In the diagram below, there are 3 examples of how organizations can be structured.
The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. One admin (user A) manages ORG-1.
The second example shows ORG-2 and ORG-3, demonstrating a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.
In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.
ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).
In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.
You can create a non-billable org as an alias for a group of users. For example, you have a group of users who all need access to a shared dataset. You can make an org which represents all the users who need access to the dataset, for example, an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org have at least VIEW "shared project access" and "shared app access" so that they are all given permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, and then no longer have access to any projects or apps shared with org-dataset_access.
You can contact DNAnexus Sales to create a billable org where only one member, the org admin, can create new org projects. All other org members are not granted the "billable activities access", and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER) and share every org project with the org with ADMINISTER access. The members' permissions to the projects are restricted by their respective "shared project access."
For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group are only given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you need to add them to the org as a new member and assign them the appropriate permission levels.
This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.
You can contact DNAnexus Sales to create a billable org where users work independently and bill their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources. These resources might include incoming samples or reference datasets.
In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.
The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked by removing them from the org.
In general, it is possible to apply many different schemas to orgs as they were designed for many different real-life collaborative structures. If you have a type of collaboration you would like to support, contact DNAnexus Support for more information about how orgs can work for you.
If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.
The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.
Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.
You can find the org overview on the Settings tab. From here, you can:
View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).
View the organization ID, the unique ID used to reference a particular org on the CLI. An example org ID would be org-demo_org.
View the number of org members, org projects, and org apps.
View the list of organization admins.
Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.
From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.
To add an existing DNAnexus user to your org, use the + Invite New Member button on the org's Members tab. This opens a screen where you can enter the user's username, such as smithj, or user ID, such as user-smithj. You can then configure the user's access level in the org.
If you add a member to the org with billable activities access set to billing allowed, they have the ability to create new projects billed to the org.
However, adding the member does not change their default billing account. If the user wishes to use the org as their default billing account, they must set their own default billing account.
If the member has any pre-existing projects that are not billed to the org, the user must transfer the project to an org if they wish to have the project billed to the org.
The user receives an email notification informing them that they have been added to the organization.
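The same invitation can be made from the CLI with dx add member. A minimal sketch with placeholder IDs (flag names may vary by dx-toolkit version; check dx add member --help):

# Add an existing user to the org as a MEMBER who may bill work to the org
$ dx add member org-xxxx user-smithj --level MEMBER --allow-billable-activities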
Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user then receives an email with instructions to activate their account and set their password.
If this feature has already been turned on for an org you administer, you see an option to Create New User when you go to invite a new member.
Here you can specify a username, such as alice or smithj, the new user's name, and their email address. The system automatically creates a new user account for the given email address and adds them as a member in the org.
If you create a new user and set their Billable Activities Access to Billing Allowed, consider setting the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.
From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.
When you edit multiple members, you have the option of changing only one access while leaving the rest alone.
From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.
Removing a member revokes the user's access to all projects and apps billed to or shared with the org.
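From the CLI, the equivalent is dx remove member. A minimal sketch with placeholder IDs:

# Remove the user from the org, revoking the access they held through org membership
$ dx remove member org-xxxx user-smithj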
The org's Projects tab lets you see the list of all projects billed to the org. This list includes all projects in which you have VIEW or higher permissions, as well as projects billed to the org of which you are not a member.
You can view all project metadata, such as the list of members, data usage, and creation date. You can also view other optional columns such as project creator. To enable the optional columns, select the column from the dropdown menu to the right of the column names.
Org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you are still able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.
You can also grant yourself ADMINISTER permissions if you are a member of a project billed to your org but you only have VIEW, CONTRIBUTE, or UPLOAD permissions.
To access your org's billing information:
In the main menu, click Orgs > All Orgs.
Select an organization you want to view.
Select the Billing tab to view billing information.
To set up or update the billing information for an org you administer, contact the DNAnexus Billing team.
Setting up billing for an organization designates someone to receive and pay DNAnexus invoices, including usage by organization members. The billing contact can be you, someone from your finance department, or another designated person.
When you click Confirm Billing, DNAnexus sends an email to the designated billing contact requesting confirmation of their responsibility for receiving and paying invoices. The organization's billing contact information does not update until DNAnexus receives this confirmation.
The org spending limit is the total amount of outstanding usage charges that can be incurred by projects linked to an org.
If you are an org admin, you can set or modify this spending limit:
In the main menu, click Orgs > All Orgs.
Select the org for which you'd like to set or modify a spending limit.
In the org details, select the Billing tab.
In Summary, click Increase Spending Limit to request increasing the limit via DNAnexus Support.
Doing this only submits your request.
Before approving your request, DNAnexus Support may follow up with you via email with questions about the change.
The Usage Charges section allows users with billable access to view total charges incurred to date. You can see how much is left of the org's spending limit. This section is only visible if your org is a billable org, which means your org has confirmed billing information.
For orgs with the Monthly Project Usage Limit for Computing and Egress and/or the Monthly Project Spending Limit for Storage feature enabled, org admins can set, update, and view default limits for each spending type, and set limit enforcement actions via API calls.
To set the org policies for compute, egress, and storage spending limits and related enforcement actions, use the API methods org/new, org-xxxx/update, or org-xxxx/bulkUpdateProjectLimit.
To retrieve and view your org's policies configuration for spending and usage limits, use the API method org-xxxx/describe.
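As a minimal sketch using the dx api wrapper (the org ID is a placeholder, and the exact JSON fields for limits and enforcement actions are defined in the API reference):

# Retrieve the org's configuration, including policies and limits
$ dx api org-xxxx describe

# Update default project limits; supply the fields documented in the API reference
$ dx api org-xxxx bulkUpdateProjectLimit '{ ... }'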
In orgs with the Monthly Project Usage Limit for Computing and Egress feature enabled, org admins can set default limits and enforcement actions:
In the main menu, click Orgs > All Orgs.
In the Organizations list, click the organization name you want to configure.
In the Usage Limits section:
Set the default compute and egress spending limits for linked projects.
Configure the enforcement action when limits are reached.
Choose whether to prevent new executions and terminate ongoing ones, or send alerts while allowing executions to continue.
For details on these limits, see How Spending and Usage Limits Work.
The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide detailed breakdowns of charges incurred by org members. These reports help you track and analyze spending patterns across your organization. For more information, see organization management and usage monitoring.
Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:
Membership List Visibility
Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.
[ADMIN], [MEMBER], or [PUBLIC]
Project Transfer
Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).
[ADMIN] or [MEMBER]
Project Sharing
Dictates the minimum org membership level allowed for a user to invite that org to a project.
[ADMIN] or [MEMBER]
As a starting point, DNAnexus recommends restricting both the membership list visibility policy and the project transfer policy to ADMIN. This ensures that only org admins can see the list of members and their access within the org, and that org projects always remain under the control of the org.
You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here, you can both change the membership list visibility and restrict project transfer policies for the org and contact DNAnexus Support to enable PHI data policies for org projects.
Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.
Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the billing account of the app is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.
Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.
Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Remember that ADMINISTER is a type of access level.
Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.
Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and do not have any org projects or org apps. Any user can share a project with a non-billable org.
Org access is granted to a user to determine which actions the user can perform in an org.
Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.
Org app is an app billed to an org.
Org ID is the unique ID used to reference a particular org on the DNAnexus Platform. An example is org-dnanexus.
Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.
Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.
Org project describes a project billed to an org.
Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.
Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.
Share with an org means to give the members of an org access to a project or app via giving the org access to the project or adding the org as an "authorized user" of an app.
Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."
Shared projects access is an org access level that can be granted to org members: the maximum access level a user can have in projects shared with an org.
Learn in depth about setting up and managing orgs as an administrator.
Learn about what you can do as an org member.
You can run workflows from the command-line using the command dx run. The inputs to these workflows can be from any project for which you have VIEW access.
The examples here use the publicly available Exome Analysis Workflow (platform login required to access this link).
For information on how to run a Nextflow pipeline, see Running Nextflow Pipelines.
Running dx run without specifying an input launches interactive mode. The system prompts for each required input, followed by options to select from a list of optional parameters to modify. Optional parameters include all modifiable parameters for each stage of the workflow. The interface prints the input JSON detailing the inputs specified and generates an analysis ID of the form analysis-xxxx, unique to this particular run of the workflow.
Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.
You can specify each input on the command line using the -i or --input flag with the syntax -i<stage ID>.<input name>=<input value>. The <input value> must be a DNAnexus object ID or the name of a file in the project you have selected. It is also possible to specify the number of a stage in place of the stage ID for a given workflow, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only, to illustrate this point. The parentheses around the <input value> in the help string are omitted when entering input.
Possible values for the input name field can be found by running the command dx run workflow-xxxx -h, as shown below using the Exome Analysis Workflow.
This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message first lists the required inputs for that stage, specifying the requisite type in the <input-value> field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it lists advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form
{"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}.
This link format can also be used to specify output from any prior stage in the workflow as input for the current stage.
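For example, the sketch below runs the Exome Analysis Workflow and explicitly wires the BWA-MEM stage's sorted_bam output into the GATK4 BQSR stage's mappings_sorted_bam input, mirroring the default link shown in the help message above. dx run generally accepts a JSON value for an input; if the quoting becomes awkward in your shell, an input JSON file works as well.

$ dx run "Exome Analysis Demo:Exome Analysis Workflow" \
-i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
-igatk4_bqsr.mappings_sorted_bam='{"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}}' \
-y --brief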
For the Exome Analysis Workflow, one required input parameter needs to be specified manually: -ibwa_mem_fastq_read_mapper.reads_fastqgzs.
This parameter targets the first stage of the workflow. For convenience, use the stage number instead of the full stage ID. Since this is the first stage (and workflow stages are zero-indexed), replace bwa_mem_fastq_read_mapper with 0 like this: -i0.reads_fastqgzs.
The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.
Array inputs can be provided by specifying multiple values for a single parameter in a stage. For example, the following flags would add files 1 through 3 to the file_inputs parameter for stage-xxxx of the workflow:
If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>.
The -i flag can also be used to specify job-based object references (JBORs) with the syntax -i<stage ID or number>.<input name>=<job id>:<output name>. The --brief flag, when used with dx run, outputs only the execution's ID, and the -y flag skips the interactive prompt confirming the execution.
The example below calls the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to produce the sorted_bam output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h. This output is then used as input to the first stage of the Parliament workflow featured on the DNAnexus Platform (platform login required to access this link).
Using the --brief flag at the end of a dx run command causes the command line to print the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.
To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]. The [options] parameters override anything set by the --clone flag, and take the form of options passed as input from the command line.
The --clone flag does not copy the usage of the --allow-ssh or --debug-on flags, which must be set with the new execution. Only the applet, instance type, and input spec are copied. See the page for more information on the usage of these flags.
For example, the command below redirects the output of the analysis to the outputs/ folder and reruns all stages.
When rerunning workflows, if a stage runs identically to how it ran in a previous analysis, the stage itself is not rerun; the outputs of that stage are not copied or rewritten to a new location. To force a stage to run again, use the option --rerun-stage STAGE_ID, where STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). If you want to rerun all stages of an analysis, you can use --rerun-stage "*", where the asterisk is enclosed in quotes to prevent the shell from expanding it to the contents of your current directory via globbing.
The command below reruns the third and final stage of analysis-xxxx.
The --destination flag allows you to specify the path of the output of a workflow. By default, every output of every stage is written to the destination specified.
You can use the --stage-output-folder <stage_ID> <folder> option to specify the output destination of a particular stage in the analysis being run. Here, stage_ID is the stage's ID, its name, or the index of that stage (where the first stage of a workflow is indexed at 0). The folder is the project and path to which you wish the stage to write, using the syntax project-xxxx:/PATH, where PATH is the path to the folder in project-xxxx where you wish to write outputs.
The following command reruns all stages of analysis-xxxx and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:
If you want to specify the output folder of a stage within the output folder of the entire analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>, where stage_id is the stage's ID (of the form stage-xxxx), its name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, specify a quoted path, relative to the output folder of the analysis, to which the stage's output should be written.
The following command reruns all stages of analysis-xxxx, setting the output destination of the analysis to /exome_run, and the output destination of stage 0 to /exome_run/mappings in the current project:
To specify the instance type of all stages in your analysis or a specific set of stages in your analysis, use the flag --instance-type. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE sets one instance type for all stages. The two options can be combined, for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16 sets all stages' instance types to mem2_ssd1_x2 except for the stage my_stage_0, for which mem3_ssd1_x16 is used.
Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).
The example below reruns all stages of analysis-xxxx and specifies that the first and second stages should be run on mem1_hdd2_x8 and mem1_ssd2_x4 instances respectively:
This is identical to adding metadata to a job. See for details.
Command line monitoring of an analysis is not available. For information about monitoring a job from the command line, see .
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs that run longer than 30 days are automatically terminated.
This is identical to providing an input JSON to a job. For more information, see .
As in running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>. Here STAGE_ID is either an ID of the form stage-xxxx or the index of that stage in the workflow (starting with the first stage at index 0).
$ dx run "Exome Analysis Demo:Exome Analysis Workflow"
Entering interactive mode for input selection.
Input: Reads (bwa_mem_fastq_read_mapper.reads_fastqgzs)
Class: array:file
Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_1.fastq.gz"
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
[1] Read group information (bwa_mem_fastq_read_mapper.rg_info_csv)
.
.
.
[33] Output prefix (gatk4_genotypegvcfs.prefix)
[34] Extra command line options (gatk4_genotypegvcfs.extra_options) [default="-G StandardAnnotation --only-output-calls-starting-in-intervals"]
Optional param #: 0
Input: Reads (right mates) (bwa_mem_fastq_read_mapper.reads2_fastqgzs)
Class: array:file
Enter file values, one at a time (^D or <ENTER> to finish, <TAB> twice for compatible files in
current directory, '?' for more options)
bwa_mem_fastq_read_mapper.reads2_fastqgzs[0]: "Exome Analysis Demo:/Input/SRR504516_2.fastq.gz"
bwa_mem_fastq_read_mapper.reads2_fastqgzs[1]:
Optional param #: <ENTER>
Using input JSON:
{
"bwa_mem_fastq_read_mapper.reads_fastqgzs": [
{
"$dnanexus_link": {
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"id": "file-B40jg7v8KfPy38kjz1vQ001y"
}
}
],
"bwa_mem_fastq_read_mapper.reads2_fastqgzs": [
{
"$dnanexus_link": {
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"id": "file-B40jgYG8KfPy38kjz1vQ0020"
}
}
]
}
Confirm running the executable with this input [Y/n]: <ENTER>
Calling workflow-xxxx with output destination project-xxxx:/
Analysis ID: analysis-xxxx$ dx run "Exome Analysis Demo:Exome Analysis Workflow" -h
usage: dx run Exome Analysis Demo:Exome Analysis Workflow [-iINPUT_NAME=VALUE ...]
Workflow: GATK4 Exome FASTQ to VCF (hs38DH)
Runs GATK4 Best Practice for Exome on hs38DH reference genome
Inputs:
bwa_mem_fastq_read_mapper
Reads: -ibwa_mem_fastq_read_mapper.reads_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads_fastqgzs=... [...]]
An array of files, in gzipped FASTQ format, with the first read mates
to be mapped.
Reads (right mates): [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=(file) [-ibwa_mem_fastq_read_mapper.reads2_fastqgzs=... [...]]]
(Optional) An array of files, in gzipped FASTQ format, with the second
read mates to be mapped.
BWA reference genome index: [-ibwa_mem_fastq_read_mapper.genomeindex_targz=(file, default={"$dnanexus_link": {"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv", "id": "file-FFJPKp0034KY8f20F6V9yYkk"}})]
A file, in gzipped tar archive format, with the reference genome
sequence already indexed with BWA.
...
fastqc
Reads: [-ifastqc.reads=(file, default={"$dnanexus_link": {"stage": "bwa_mem_fastq_read_mapper", "outputField": "sorted_bam"}})]
A file containing the reads to be checked. Accepted formats are
gzipped-FASTQ and BAM.
...
gatk4_bqsr
Sorted mappings: [-igatk4_bqsr.mappings_sorted_bam=(file, default={"$dnanexus_link": {"outputField": "sorted_bam", "stage": "bwa_mem_fastq_read_mapper"}})]
A coordinate-sorted BAM or CRAM file with the base quality scores to
be recalibrated.
...
...
Outputs:
Sorted mappings: bwa_mem_fastq_read_mapper.sorted_bam (file)
A coordinate-sorted BAM file with the resulting mappings.
Sorted mappings index: bwa_mem_fastq_read_mapper.sorted_bai (file)
The associated BAM index file.
...
Variants index: gatk4_genotypegvcfs.variants_vcfgztbi (file)
The associated TBI file.

$ dx run "Exome Analysis Demo:Exome Analysis Workflow" \
-i0.reads_fastqgzs="Exome Analysis Demo:/Input/SRR504516_1.fastq.gz" \
-ibwa_mem_fastq_read_mapper.genomeindex_targz='Reference Genome Files\: AWS US (East):/H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/hs37d5.bwa-index.tar.gz' -y
Using input JSON:
{
"bwa_mem_fastq_read_mapper.reads_fastqgzs": [
{
"$dnanexus_link": {
"project": "project-BQfgzV80bZ46kf6pBGy00J38",
"id": "file-B40jg7v8KfPy38kjz1vQ001y"
}
}
],
"bwa_mem_fastq_read_mapper.genomeindex_targz": {
"$dnanexus_link": {
"project": "project-BQpp3Y804Y0xbyG4GJPQ01xv",
"id": "file-B6ZY4942J35xX095VZyQBk0v"
}
}
}
Calling workflow-xxxx with output destination
project-xxxx:/
Analysis ID: analysis-xxxx

$ dx run workflow \
-istage-xxxx.file_inputs=project-xxxx:file-1xxxx \
-istage-xxxx.file_inputs=project-xxxx:file-2xxxx \
-istage-xxxx.file_inputs=project-xxxx:file-3xxxx
Using input JSON:
{
"stage-xxxx.file_inputs": [
{
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-1xxxx"
}
},
{
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-2xxxx"
}
},
{
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-3xxxx"
}
}
]
}

$ dx run Parliament \
-i0.illumina_bam=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=file-xxxx -ireads2_fastqgzs=file-xxxx -igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y --brief):sorted_bam \
-i0.ref_fasta=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V \
-y
Using input JSON:
{
"stage-F14F5qQ0Jz1gfpjX8y1JxG3y.illumina_bam": {
"$dnanexus_link": {
"field": "sorted_bam",
"job": "job-xxxx"
}
},
"stage-F14F5qQ0Jz1gfpjX8y1JxG3y.ref_fasta": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-B6qq53v2J35Qyg04XxG0000V"
}
}
}
Calling workflow-xxxx with output destination project-xxxx:/
Analysis ID: analysis-xxxx

$ dx run workflow-xxxx -i0.input_file=Input/SRR504516_1.fastq.gz -y --brief
analysis-xxxx

dx run --clone analysis-xxxx \
--rerun-stage "*" \
--destination project-xxxx:/output -y

dx run --clone analysis-xxxx --rerun-stage 2 --brief -y

dx run --clone analysis-xxxx --rerun-stage "*" \
--stage-output-folder 0 "mappings" --brief -y

dx run --clone analysis-xxxx --rerun-stage "*" \
--destination "exome_run" \
--stage-relative-output-folder 0 "mappings" --brief -y

dx run --clone analysis-xxxx \
--rerun-stage "*" \
--instance-type '{"0": "mem1_hdd2_x8", "1": "mem1_ssd2_x4"}' \
--brief -y

You can run apps and applets from the command-line using the command dx run. The inputs to these app(let)s can be from any project for which you have VIEW access. You can also launch apps and applets from the UI.
If dx run is run without specifying any inputs, interactive mode launches. When you run this command, the platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to run with the inputs you have selected.
You can also specify each input parameter by name using the ‑i or ‑‑input flags with syntax ‑i<input name>=<input value>. Names of data objects in your project are resolved to the appropriate IDs and packaged correctly for the API method as shown below.
When specifying input parameters using the ‑i/‑‑input flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h, as shown below using the Swiss Army Knife app (platform login required to access this link).
The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, the Swiss Army Knife app has two primary inputs: one or more files and a string to be executed on the command line, specified as -iin=file-xxxx and -icmd=<string>, respectively.
The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.
To specify array inputs, reuse the ‑i/‑‑input flag for each input in the array; each file specified is appended to the array in the same order as it was entered on the command line. Below is an example of how to use the Swiss Army Knife app (platform login required to access this link) to index multiple BAM files.
Job-based object references (JBORs) can also be provided using the -i flag with the syntax ‑i<input name>=<job id>:<output name>. Combined with the --brief flag (which tells dx run to output only the job ID) and the -y flag (to skip confirmation), you can string together two jobs using one command.
Below is an example of how to run the BWA-MEM FASTQ Read Mapper app (platform login required to access this link), producing the output named sorted_bam described in the help output of the command dx run app-bwa_mem_fastq_read_mapper -h. The sorted_bam output is then used as input for the Swiss Army Knife app (platform login required to access this link).
Some examples of additional functionalities provided by dx run are listed below.
Regardless of whether you run a job interactively or non-interactively, the command dx run always prints the exact input JSON with which it is calling the applet or app. If you don't want to print this verbose output, you can use the --brief flag which tells dx to print out only the job ID instead. This job ID can then be saved.
To run jobs without being prompted for confirmation, use the -y or --yes option. This is especially helpful when scripting or automating job submissions.
If you want to both skip confirmation and immediately monitor the job's progress, use -y --watch. This starts the job and displays its logs in your terminal as it runs.
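As a brief sketch combining these flags (the file ID and shell command are placeholders), you can also capture the job ID with --brief and attach to its logs later with dx watch:

# Launch without prompts, printing only the job ID
$ job=$(dx run app-swiss-army-knife -iin=file-xxxx -icmd="ls -l *.bam" -y --brief)

# Stream the job's logs in the terminal
$ dx watch "$job"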
If you are debugging applet-xxxx and wish to rerun a job you previously ran, using the same settings (destination project and folder, inputs, instance type requests), but use a new executable applet-yyyy, you can use the --clone flag.
In this command, the executable specified on the command line (platform login required to access this link) is used in place of the executable from job-xxxx, while the other settings are reused from the cloned job.
If you want to modify some but not all settings from the previous job, you can run dx run <executable> --clone job-xxxx [options]. The command-line arguments you provide in [options] override the settings reused from --clone. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.
The example shown below redirects the outputs of the job to the folder "outputs/".
The --destination flag allows you to specify the full project-ID:/folder/ path in which to output the results of the app(let). If this flag is unspecified, the output of the job defaults to the present working directory, which can be determined by running .
In the above command, the flag --destination project-xxxx:/mappings instructs the job to output all results into the "mappings" folder of project-xxxx.
The dx run --instance-type command allows you to specify the instance types to use for the job. More information is available by running the command dx run --instance-type-help.
Some apps and applets have multiple entry points, meaning that different instance types can be specified for the different functions executed by the app(let). In the example below, the Parliament app (platform login required to access this link) is run while specifying the instance types for the entry points honey, ssake, ssake_insert, and main. Specifying the instance types for each entry point requires a JSON-like string, which should be wrapped in single quotes, as explained earlier and demonstrated below.
If you are running many jobs that have varying purposes, you can organize the jobs using metadata. Two types of metadata are available on the DNAnexus Platform: properties and tags.
Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property flag allows you to attach a property to a job, and the --tag flag allows you to tag a job.
Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs makes it easier for you to search for a particular job in your job history. This is useful when you want to tag all jobs run with a particular sample, for instance.
If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job. Append the app name with the version required, for example, app-xxxx/0.0.1 if the current version is app-xxxx/1.0.0.
To monitor your job as it runs, use the --watch flag to display the job's logs in your terminal window as it progresses.
You can also specify the input JSON in its entirety. To specify a data object, you must wrap it in a DNAnexus link (a key-value pair with a key of $dnanexus_link and a value of the data object's ID). Because you are providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, confirmation before the job starts is not required. Three methods exist for entering the full input JSON, discussed in separate sections below.
If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.
If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file followed by the name of the JSON file.
stdin

Entering the input JSON via stdin is done the same way as entering it from a file with the -f flag, substituting "-" for the filename. Below is an example that demonstrates how to echo the input JSON to stdin and pipe it to dx run. As before, use single quotes to wrap the JSON input to avoid interfering with the double quotes used by the JSON itself.
dx run

Executing the dx run --help command shows the flags available to use with dx run. The message printed by this command is identical to the one displayed in the brief description of .
The --cost-limit cost_limit flag sets the maximum cost of the job before termination. In the case of workflows, the limit applies to the cost of the entire analysis. For batch runs, the limit applies per job. See dx run --help for more information.
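A minimal sketch (the file ID is a placeholder; see dx run --help for how the limit value is interpreted):

# Terminate the execution if its charges reach the specified limit
$ dx run app-swiss-army-knife \
-iin=file-xxxx \
-icmd="samtools index *.bam" \
--cost-limit 5 -y --brief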
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.
$ dx run app-bwa_mem_fastq_read_mapper
Entering interactive mode for input selection.
Input: Reads (reads_fastqgz)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
reads_fastqgz: reads.fastq.gz
Input: BWA reference genome index (genomeindex_targz)
Class: file
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes)
Enter file ID or path (<TAB> twice for compatible files in current directory, '?' for more options)
genomeindex_targz: "Reference Genome Files:/H. Sapiens - hg19 (UCSC)/ucsc_hg19.bwa-index.tar.gz"
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (reads2_fastqgz)
[1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
[2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
[3] Read group platform (read_group_platform) [default="ILLUMINA"]
[4] Read group platform unit (read_group_platform_unit) [default="None"]
[5] Read group library (read_group_library) [default="1"]
[6] Read group sample (read_group_sample) [default="1"]
[7] Output all alignments for single/unpaired reads? (all_alignments)
[8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
[9] Advanced command line options (advanced_options)
Optional param #: <ENTER>
Using input JSON:
{
"reads_fastqgz": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-xxxx"
}
},
"genomeindex_targz": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-xxxx"
}
}
}
Confirm running the applet/app with this input [Y/n]: <ENTER>
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx
Watch launched job now? [Y/n] n

$ dx run app-swiss-army-knife \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
-icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam \
SRR100022_chrom20_mapped_to_b37.bam" -y
Using input JSON:
{
"cmd": "samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam",
"in": [
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
}
]
}
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ dx run app-swiss-army-knife \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGpj0x05xKxZ42QPqZkJY \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BZ9YGzj0x05b66kqQv51011q \
-icmd="ls *.bam | xargs -n1 -P5 samtools index" -y
Using input JSON:
{
"cmd": "ls *.bam | xargs -n1 -P5 samtools index",
"in": [
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGzj0x05b66kqQv51011q"
}
}
]
}
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ dx run app-swiss-army-knife \
-iin=$(dx run app-bwa_mem_fastq_read_mapper \
-ireads_fastqgz=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXKk80fPFj4Jbfpxb6Ffv2 \
-igenomeindex_targz=project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6qq53v2J35Qyg04XxG0000V -y \
--brief):sorted_bam \
-icmd="samtools index *.bam" -y
Using input JSON:
{
"in": [
{
"$dnanexus_link": {
"field": "sorted_bam",
"job": "job-xxxx"
}
}
],
"cmd": "samtools index *.bam"
}
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ dx run app-bwa_mem_fastq_read_mapper \
-ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
-ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
-igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
--destination "mappings" -y --brief$ dx run app-swiss-army-knife --clone job-xxxx -y
Using input JSON:
{
"cmd": "ls *.bam | xargs -n1 -P5 samtools index",
"in": [
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGzj0x05b66kqQv51011q"
}
}
]
}
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ dx run app-swiss-army-knife \
--clone job-xxxx --destination project-xxxx:/output -y

$ dx run app-bwa_mem_fastq_read_mapper \
-ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_1.filt.fastq.gz" \
-ireads_fastqgz="project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/SRR100022_2.filt.fastq.gz" \
-igenomeindex_targz="project-BQpp3Y804Y0xbyG4GJPQ01xv:file-B6ZY4942J35xX095VZyQBk0v" \
--destination "mappings" -y --briefdx run parliament \
-iillumina_bam=illumina.bam \
-iref_fasta=ref.fa.gz \
--instance-type '{
"honey": "mem1_ssd1_x32",
"ssake": "mem1_ssd1_x8",
"ssake_insert": "mem1_ssd1_x32",
"main": "mem1_ssd1_x16"
}' \
-y \
--brief

$ dx run app-swiss-army-knife \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
-icmd="samtools sort -T /tmp/aln.sorted -o \
SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
--property foo=bar --tag dna -y

$ dx run app-swiss-army-knife/2.0.1 \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
-icmd="samtools sort -T /tmp/aln.sorted -o SRR100022_chrom20_mapped_to_b37.sorted.bam SRR100022_chrom20_mapped_to_b37.bam" \
-y --brief

$ dx run app-swiss-army-knife \
-iin=project-BQbJpBj0bvygyQxgQ1800Jkk:file-BQbXVY0093Jk1KVY1J082y7v \
-icmd="samtools sort -T /tmp/aln.sorted \
-o SRR100022_chrom20_mapped_to_b37.sorted.bam \
SRR100022_chrom20_mapped_to_b37.bam" \
--watch \
-y \
--brief
job-xxxx
Job Log
-------
Watching job job-xxxx. Press Ctrl+C to stop.

$ dx run app-swiss-army-knife -j '{
"cmd": "ls *.bam | xargs -n1 -P5 samtools index",
"in": [
{ "$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
},
{ "$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
},
{ "$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGzj0x05b66kqQv51011q"
}
}
]
}' -y
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ dx run app-swiss-army-knife -f input.json
Using input JSON:
{
"cmd": "ls *.bam | xargs -n1 -P5 samtools index",
"in": [
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGzj0x05b66kqQv51011q"
}
}
]
}
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

$ echo '{
"cmd": "ls *.bam | xargs -n1 -P5 samtools index",
"in": [
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BQbXVY0093Jk1KVY1J082y7v"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
},
{
"$dnanexus_link": {
"project": "project-BQbJpBj0bvygyQxgQ1800Jkk",
"id": "file-BZ9YGzj0x05b66kqQv51011q"
}
}
]
}' | dx run app-swiss-army-knife -f - -y
Calling app-xxxx with output destination project-xxxx:/
Job ID: job-xxxx

Learn to use the dx client for command-line access to the full range of DNAnexus Platform features.
The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform, to upload, browse, and organize data, and to launch analyses.
All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.
If you haven't already done so, install the DNAnexus SDK (dx-toolkit), which includes the dx command-line client, as well as a range of useful utilities.
As you work, use the as a reference.
On the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.
To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:
The first step is to . If you have not created a DNAnexus account, open the and sign up. User signup is not supported on the command line.
Your login credentials and your current project settings are saved in a local configuration file, and you can start accessing your project.
Look inside some public projects that have already been set up. From the command line, enter the command:
By running the dx select command and picking a project, you perform the command-line equivalent of going to the project page for the Reference Genome Files project (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular genomes for use in analyses with your own data.
For more information about the dx select command, see the page.
List the data in the top-level directory of the project you've selected by running the command dx ls. View the contents of a folder by running the command dx ls <folder_name>.
You can avoid typing out the full name of the folder by typing in dx ls C and then pressing <TAB>. The folder name auto-completes from there.
You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, the contents of the publicly available project "Demo Data" are listed using both its name and ID.
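As a brief sketch of those commands (output omitted; the project ID is a placeholder):

# List the top level of a public project by name, then by ID with extra details
$ dx ls "Demo Data:/"
$ dx ls -l project-xxxx:/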
As shown above, you can use the -l flag with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.
You can use the dx describe command to learn more about objects on the platform. Given a DNAnexus object ID or name, dx describe returns detailed information about the object. dx describe only returns results for data objects to which you have access.
Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.
Below, the C. elegans reference genome file located in the publicly available "Reference Genome Files: AWS US (East)" project is described (this project should be accessible from other regions as well). Note that you need to add a colon (:) after the project name; here, that would be Reference Genome Files\: AWS US (East): .
Below, the publicly available Reference Genome Files project used above is described.
Use the dx new project command to create a new project.
The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the page.
The project is ready for uploading data and running analyses.
To analyze a sample, first upload it using the dx upload command or the Upload Agent, if installed. For this tutorial, download the file small-celegans-sample.fastq, which represents the first 25,000 C. elegans reads from SRR070372. This file is used in the sample analysis below.
For uploading multiple or large files, use the Upload Agent. It compresses files, uploads them in parallel over multiple HTTP connections, and supports resumable uploads.
The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until uploading is complete before returning the prompt and describing the result.
To take a quick look at the first few lines of the file you uploaded, use the dx head command. By default, it prints the first 10 lines of the given file.
Run it on the file you uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.
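A minimal sketch, using the file uploaded above:

# Print the first 12 lines (3 FASTQ records) of the uploaded file
$ dx head -n 12 small-celegans-sample.fastq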
If you'd like to download a file from the platform, use the dx download command. This command uses the name of the file for the filename unless you specify your own with the -o or --output flag. The example below downloads the same C. elegans file that was uploaded previously.
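A minimal sketch (the local filename is illustrative):

# Download the file, naming the local copy with -o (omit -o to keep the original name)
$ dx download small-celegans-sample.fastq -o celegans_reads.fastq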
Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".
For the next few steps, if you would like to follow along, you need a C. elegans FASTQ file. This tutorial maps the reads against the ce10 genome. If you haven't already, you can download and use the following FASTQ file, which contains the first 25,000 reads from SRR070372: .
The following walkthrough explains what each command does and shows which apps run. If you only want to convert a gzipped FASTQ file to a VCF via BWA and the FreeBayes Variant Caller, skip ahead to see the commands required to run the apps.
If you have not yet done so, you can upload a FASTQ file for analysis.
For more information about using the command , see the page.
Next, use the BWA-MEM FASTQ Read Mapper app (platform login required to access this link) to map the uploaded reads file to a reference genome.
If you don't know the command-line name of the app to run, you have two options:
Navigate to its web page from the (platform login required to access this link). The app's page shows how to run it from the command line. See the for details on the app used here (platform login required).
Alternatively, search for apps from the command line by running dx find apps. The command-line name appears in parentheses in the output below.
Install the app using dx install and check that it has been installed. While you do not always need to install an app to run it, you may find it useful as a bookmarking tool.
You can run the app using dx run. When you run it without any arguments, it prompts you for required and then optional arguments. The reference file genomeindex_targz for this C. elegans sample is in .tar.gz format and can be found in the Reference Genome Files project for the region your project is in.
You can use the dx watch command to monitor jobs. The command prints out the log file of the job, including the STDOUT, STDERR, and INFO printouts.
You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, you can use the dx find jobs command to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.
Additional options are available to restrict your search of previous jobs, such as by their names or when they were run.
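For instance, you could restrict the search to jobs with a matching name that were launched in the last day; the name pattern and time window here are illustrative:
$ dx find jobs --name "BWA-MEM*" --created-after=-1d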
If for some reason you need to terminate your job before it completes, use the command dx terminate job-xxxx.
You should see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!
You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.
This time, instead of relying on interactive mode to enter inputs, you provide them directly. First, look up the app's spec to determine the input names. Run the command dx run freebayes -h.
Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. Notice that there are two required inputs that must be specified:
Sorted mappings (sorted_bams): A list of files with a .bam extension.
Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.
It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. Use the -i flag to specify inputs as suggested by the output of dx run freebayes -h:
sorted_bams: The output of the previous BWA step (see the section for more information).
genome_fastagz: The ce10 genome in the Reference Genomes project.
To specify new job input using the output of a previous job, use a job-based object reference via the job-xxxx:<output field> syntax used earlier.
Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.
Use the dx wait command to wait for a job to finish. If you run the following command immediately after launching the FreeBayes app, it shows recent jobs only after the job has finished, as in the example below.
Congratulations! You have called variants on a reads sample using the command line. Next, see how to automate this process.
The CLI enables automation of these steps. The following script assumes that you are logged in. It is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.
You can start scripting using dx. The --brief flag is useful for scripting. A list of all dx commands and flags is on the page.
For more detailed information about running apps and applets from the command line, see the page.
For a comprehensive guide to the DNAnexus SDK, see the .
Want to start writing your own apps? Check out the for some useful tutorials.
$ dx help ls
usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--brief | --summary | --verbose] [-a] [-l] [--obj]
[--folders] [--full]
[path]
List folders and/or objects in a folder
... # output truncated for brevity
$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: <your username>
Password: <your password>
No projects to choose from. You can create one with the command "dx new project".
To pick from projects for which you only have VIEW permissions, use "dx select --level VIEW" or "dx select --public".
dx select --public --name "Reference Genome Files*"
$ dx ls
C. Elegans - Ce10/
D. melanogaster - Dm3/
H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I)/
H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)/
H. Sapiens - GRCh38/
H. Sapiens - hg19 (Ion Torrent)/
H. Sapiens - hg19 (UCSC)/
M. musculus - mm10/
M. musculus - mm9/
$ dx ls "C. Elegans - Ce10/"
ce10.bt2-index.tar.gz
ce10.bwa-index.tar.gz
... # output truncated for brevity
$ dx ls "Demo Data:/SRR100022/"
SRR100022_1.filt.fastq.gz
SRR100022_2.filt.fastq.gz
$ dx ls -l "project-BQbJpBj0bvygyQxgQ1800Jkk:/SRR100022/"
Project: Demo Data (project-BQbJpBj0bvygyQxgQ1800Jkk)
Folder : /SRR100022
State Last modified Size Name (ID)
... # output truncated for brevity
$ dx describe "Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
Result 1:
ID file-BQbY9Bj015pB7JJVX0vQ7vj5
Class file
Project project-BQpp3Y804Y0xbyG4GJPQ01xv
Folder /C. Elegans - Ce10
Name ce10.fasta.gz
State closed
Visibility visible
Types -
Properties Assembly=UCSC ce10,
Origin=http://hgdownload.cse.ucsc.edu/goldenPath/ce10/bigZip
s/ce10.2bit, Species=Caenorhabditis elegans, Taxonomy
ID=6239
Tags -
Outgoing links -
Created Tue Sep 30 18:54:35 2014
Created by bhannigan
via the job job-BQbY8y80KKgP380QVQY000qz
Last modified Thu Mar 2 12:17:27 2017
Media type application/x-gzip
archivalState "live"
Size 29.21 MB, sponsored by DNAnexus
$ dx describe "Reference Genome Files\: AWS US (East):"
Result 1:
ID project-BQpp3Y804Y0xbyG4GJPQ01xv
Class project
Name Reference Genome Files: AWS US (East)
Summary
Billed to org-dnanexus
Access level VIEW
Region aws:us-east-1
Protected true
Restricted false
Contains PHI false
Created Wed Oct 8 16:42:53 2014
Created by tnguyen
Last modified Tue Oct 23 14:15:59 2018
Data usage 0.00 GB
Sponsored data 519.77 GB
Sponsored egress 0.00 GB used of 0.00 GB total
Tags -
Properties -
downloadRestricted false
defaultInstanceType "mem2_hdd2_x2"
$ dx new project "My First Project"
Created new project called "My First Project"
(project-xxxx)
Switch to new project now? [y/N]: y
$ dx upload --wait small-celegans-sample.fastq
[===========================================================>] Uploaded (16801690 of 16801690 bytes) 100% small-celegans-sample.fastq
ID file-xxxx
Class file
Project project-xxxx
Folder /
Name small-celegans-sample.fastq
State closed
Visibility visible
Types -
Properties -
Tags -
Details {}
Outgoing links -
Created Sun Jan 1 09:00:00 2017
Created by amy
Last modified Sat Jan 1 09:00:00 2017
Media type text/plain
Size 16.02 MB
$ dx head -n 12 small-celegans-sample.fastq
@SRR070372.1 FV5358E02GLGSF length=78
TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATATATATA
+SRR070372.1 FV5358E02GLGSF length=78
...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=<96632
@SRR070372.2 FV5358E02FQJUJ length=177
TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
+SRR070372.2 FV5358E02FQJUJ length=177
222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
@SRR070372.3 FV5358E02GYL4S length=70
TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCATAAGGG
+SRR070372.3 FV5358E02GYL4S length=70
@@@@@DFFFFFHHHHHHHFBB@FDDBBBB=?::5555BBBBD??@?DFFHHFDDDDFFFDDBBBB<<410
$ dx download small-celegans-sample.fastq
[ ] Downloaded 0 byte
[===========================================================>] Downloaded 16.02 of
[===========================================================>] Completed 16.02 of 16.02 bytes (100%) small-celegans-sample.fastq
dx upload small-celegans-sample.fastq --wait
$ dx find apps
...
x BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
...
$ dx install bwa_mem_fastq_read_mapper
Installed the bwa_mem_fastq_read_mapper app
$ dx find apps --installed
BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper), v1.4.0
$ dx run bwa_mem_fastq_read_mapper
Entering interactive mode for input selection.
Input: Reads (reads_fastqgz)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory,'?' for help)
reads_fastqgz[0]: <small-celegans-sample.fastq.gz>
Input: BWA reference genome index (genomeindex_targz)
Class: file
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-\* (DNAnexus Reference Genomes)
Enter file ID or path (<TAB> twice for compatible files in current
directory,'?' for more options)
genomeindex_targz: <"Reference Genome Files\: <REGION_OF_PROJECT>:/C. Elegans - Ce10/ce10.bwa-index.tar.gz">
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] Reads (right mates) (reads2_fastqgz)
[1] Add read group information to the mappings (required by downstream GATK)? (add_read_group) [default=true]
[2] Read group id (read_group_id) [default={"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
[3] Read group platform (read_group_platform) [default="ILLUMINA"]
[4] Read group platform unit (read_group_platform_unit) [default="None"]
[5] Read group library (read_group_library) [default="1"]
[6] Read group sample (read_group_sample) [default="1"]
[7] Output all alignments for single/unpaired reads? (all_alignments)
[8] Mark shorter split hits as secondary? (mark_as_secondary) [default=true]
[9] Advanced command line options (advanced_options)
Optional param #: <ENTER>
Using input JSON:
{
"reads_fastqgz": {
"$dnanexus_link": {
"project": "project-B3X8bjBqqBk1y7bVPkvQ0001",
"id": "file-B3P6v02KZbFFkQ2xj0JQ005Y"
}
},
"genomeindex_targz": {
"$dnanexus_link": {
"project": "project-xxxx(project ID for the reference genome in your region)",
"id": "file-BQbYJpQ09j3x9Fj30kf003JG"
}
}
}
Confirm running the applet/app with this input [Y/n]: <ENTER>
Calling app-BP2xVx80fVy0z92VYVXQ009j with output destination
project-xxxx:/
Job ID: job-xxxx
$ dx find jobs
* BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main)(done) job-xxxx
user-amy 20xx-xx-xx 0x:00:00 (runtime 0:00:xx)
$ dx describe job-xxxx
...
$ dx ls
small-celegans-sample.bam
small-celegans-sample.bam.bai
small-celegans-sample.fastq
$ dx describe small-celegans-sample.bam
...
$ dx describe job-xxxx:sorted_bam
...
$ dx run freebayes -y \
-igenome_fastagz=Reference\ Genome\ Files:/C.\ Elegans\ -\ Ce10/ce10.fasta.gz \
-isorted_bams=job-xxxx:sorted_bam
Using input JSON:
{
"genome_fastagz": {
"$dnanexus_link": {
"project": "project-xxxx",
"id": "file-xxxx"
}
},
"sorted_bams": {
"field": "sorted_bam",
"job": "job-xxxx"
}
}
Calling app-BFG5k2009PxyvYXBBJY00BK1 with output destination
project-xxxx:/
Job ID: job-xxxx
$ dx wait job-xxxx && dx find jobs
Waiting for job-xxxx to finish running...
Done
* FreeBayes Variant Caller (done) job-xxxx
user-amy 2017-01-01 09:00:00 (runtime 0:05:24)
...
#!/usr/bin/env bash
# Usage: <script_name.sh> local_fastq_filename.fastq.gz
reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.fasta.gz"
bwa_indexed_reference="Reference Genome Files\: AWS US (East):/C. Elegans - Ce10/ce10.bwa-index.tar.gz"
local_reads_file="$1"
reads_file_id=$(dx upload "$local_reads_file" --brief)
bwa_job=$(dx run bwa_mem_fastq_read_mapper -ireads_fastqgzs=$reads_file_id -igenomeindex_targz="$bwa_indexed_reference" -y --brief)
freebayes_job=$(dx run freebayes -isorted_bams=$bwa_job:sorted_bam -igenome_fastagz="$reference" -y --brief)
dx wait $freebayes_job
dx download $freebayes_job:variants_vcfgz -o "$local_reads_file".vcf.gz
gunzip "$local_reads_file".vcf.gz
usage: dx run app-swiss-army-knife [-iINPUT_NAME=VALUE ...]
App: Swiss Army Knife
Version: 5.1.0 (published)
A multi-purpose tool for all your basic analysis needs
See the app page for more information:
https://platform.dnanexus.com/app/swiss-army-knife
Inputs:
Input files: [-iin=(file) [-iin=... [...]]]
(Optional) Files to download to instance temporary folder before
command is executed.
Command line: -icmd=(string)
Command to execute on instance. View the app readme for details.
Whether to use "dx-mount-all-inputs"?: [-imount_inputs=(boolean, default=false)]
(Optional) Whether to mount all files that were supplied as inputs to
the app instead of downloading them to the local storage of the
execution worker.
Public Docker image identifier: [-iimage=(string)]
(Optional) Instead of using the default Ubuntu 24.04 environment, the
input command <CMD> will be run using the specified publicly
accessible Docker image <IMAGE> as it would be when running 'docker
run <IMAGE> <CMD>'. Example image identifiers are 'ubuntu:25.04',
'quay.io/ucsc_cgl/samtools'. Cannot be specified together with
'image_file'. This input relies on access to internet and is unusable
in an internet-restricted project.
Platform file containing Docker image accepted by `docker load`: [-iimage_file=(file)]
(Optional) Instead of using the default Ubuntu 24.04 environment, the
input command <CMD> will be run using the Docker image <IMAGE> loaded
from the specified image file <IMAGE_FILE> as it would be when running
'docker load -i <IMAGE_FILE> && docker run <IMAGE> <CMD>'. Cannot be
specified together with 'image'.
Outputs:
Output files: [out (array:file)]
(Optional) New files that were created in temporary folder.
Learn how to get information on current and past executions via both the UI and the CLI.
To get basic information on executions:
Click on Projects in the main Platform menu.
On the Projects list page, find and click on the name of the project within which the execution was launched.
Click on the Monitor tab to open the Monitor screen.
The list on the Monitor screen displays the following information for each execution that is running or has been run within the project:
Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either , or . The execution's name is used in Platform email alerts related to the execution. Clicking on a name in the executions list opens the .
State - This is the execution's state. State values include:
"Waiting" - The execution awaits Platform resource allocation or completion of dependent executions.
Additional basic information can be displayed for each execution. To do this:
Click on the "table" icon at the right edge of the table header row.
Select one or more of the entries in the list, to display an additional column or columns.
Available additional columns include:
Stopped Running - The time at which the execution stopped running.
Custom properties columns - If custom properties have been assigned to any of the listed executions, a column can be added to the table for each such property, showing the value assigned to each execution for that property.
To remove columns from the list, click on the "table" icon at the right edge of the table header row, then de-select one or more of the entries in the list, to hide the column or columns.
A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.
By default, pills are available to set search criteria for filtering executions by one or more of these attributes:
Name - Execution name
State - Execution state
ID - An execution's job ID or analysis ID
Executable
Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.
By default, filters are set to display only root executions that meet the criteria defined in the filter. To include all executions, including those run during individual stages of workflows, click the button above the left edge of the executions list showing the default value "Root Executions Only," then click "All Executions."
To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.
To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.
If you launched an execution or have contributor access to the project in which the execution is running, you can terminate the execution from the list on the Monitor screen when it is in a non-terminal state. You can also terminate executions launched by other project members if you have project admin status.
To terminate an execution:
Find the execution in the list, then do either of the following:
Select the execution by clicking on its row, then click the red Terminate button that appears at the end of the header.
Hover over the row, click the "More Actions" button (three vertical dots) at the end of the row, then select Terminate from the menu.
For additional information about an execution, click its name in the list on the Monitor screen to open its details page.
The details page for an execution displays a range of information, including:
High-level details - The high-level information in this section includes:
For a standalone execution - such as a job without children - the display shows a single entry with details about the execution state, start and stop times, and duration in the running state.
For an execution with descendants - such as an analysis with multiple stages - the display shows a list with each row containing details about stage executions. For executions with descendants, click the "+" icon next to the name to expand the row and view descendant information. A page displaying detailed information on a stage appears when clicking on its name in the list. To navigate back to the workflow's details page, click its name in the "breadcrumb" navigation menu in the top right corner of the screen.
For failed executions, a Cause of Failure pane appears above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:
Click the button labeled Send Failure Report to DNAnexus Support.
A form opens in a modal window, with pre-populated Subject and Message fields containing diagnostic information for DNAnexus Support.
Click the button in the Grant Access section to grant DNAnexus Support "View" access to the project, enabling faster issue diagnosis and resolution.
To re-launch a job from the execution details screen:
Click the Launch as New Job button in the upper right corner of the screen.
A new browser tab opens, displaying the Run App / Applet form.
Configure the run, then click Start Analysis.
To re-launch an analysis from the execution details screen:
Click the Launch as New Analysis button in the upper right corner of the screen.
A new browser tab opens, displaying the Run Analysis form.
Configure the run, then click Start Analysis.
To save a copy of a workflow along with its input configurations under a new name from the execution details screen:
Click the Save as New Workflow button in the upper right corner of the screen.
In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.
Click Save.
As described in , jobs can be configured to restart automatically on certain types of failures.
If you want to view the execution details for the initial tries for a restarted job:
Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.
A modal window opens.
Click the name of the try for which you'd like to view execution details.
You can only for the most recent try, not for any previous tries.
You can use dx watch to view the log of a running job or any past jobs, which may have finished successfully, failed, or been terminated.
Use dx watch to view a job's log stream during execution. The log stream includes stdout, stderr, and additional worker output information.
To terminate a job before completion, use the command dx terminate job-xxxx.
Use the dx watch command to view the logs of completed jobs as well. The log stream includes stdout, stderr, and additional worker output information from the execution.
Use dx find executions to display the ten most recent executions in your current project. Specify a different number of executions by using dx find executions -n <specified number>. The output matches the information shown in the "Monitor" tab on the DNAnexus web UI.
Below is an example of dx find executions. In this case, only two executions have been run in the current project. An individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow, are shown. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).
The job running the DeepVariant Germline Variant Caller executable is running and has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of 2 stages, FreeBayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.
dx find executions
The dx find executions operation searches for jobs or analyses created when a user runs an app or applet. For jobs that are part of an analysis, the results appear in a tree representation linking related jobs together.
By default, dx find executions displays up to ten of the most recent executions in your current project, ordered by creation time.
Filter executions by job type using command flags: --origin-jobs shows only original jobs, while --all-jobs includes both original jobs and subjobs.
You can monitor analyses by using the command dx find analyses, which displays the top-level analyses, excluding contained jobs. Analyses are executions of workflows and consist of one or more app(let)s being run.
Below is an example of dx find analyses:
Jobs are runs of an individual app(let) and compose analyses. Monitor jobs using the dx find jobs command to display a flat list of jobs. For jobs within an analysis, the command returns all jobs in that analysis.
Below is an example of dx find jobs:
Searches for executions can be restricted to specific parameters.
Extracting stdout and/or stderr from a Job Log
To extract stdout only from this job, run the command dx watch job-xxxx --get-stdout.
To extract stderr only from this job, run the command dx watch job-xxxx --get-stderr.
To extract both stdout and stderr from this job, run the command dx watch job-xxxx --get-streams.
Below is an example of viewing stdout lines of a job log:
To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
To limit the output to the most recent log messages, use the -n flag, for example dx watch job-xxxx -n 8. If the job has already finished, the output is displayed as well.
In the example below, the app Sample Prints doesn't have any output.
Jobs can be configured to restart automatically on certain types of failures as described in the section. To view initial tries of the restarted jobs along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.
By default, dx find restricts searches to your current project context. Use the --all-projects flag to search across all accessible projects.
By default, dx find returns up to ten of the most recently launched executions matching your search query. Use the -n option to change the number of executions returned.
A user can search for only executions of a specific app(let) or workflow based on its executable name or ID.
Users can also use the --created-before and --created-after options to search based on when the execution began.
Users can also restrict the search to a specific state, for example, "done", "failed", "terminated".
The --delim flag produces tab-delimited output, suitable for processing by other shell commands.
Use the --brief flag to display only the object IDs for objects returned by your search query. The --origin-jobs flag excludes subjob information.
Below is an example usage of the --brief flag:
Below is an example of using the flags --origin-jobs and --brief. In the example below, the last job run in the current default project is described.
See more on .
Job logs can be automatically for analysis.
Find the row displaying information on the execution.
For an analysis (the execution of a workflow), click the "+" icon to the left of the analysis name to expand the row and view information on its stages. For executions with further descendants, click the "+" icon next to the name to expand the row and show additional details.
To see additional information on an execution, click on its name to be taken to its details page.
The following shortcuts allow you to view information from the details page directly on the list page, or relaunch an execution:
To view the Info pane:
Click the Info icon, above the right edge of the executions list, if it's not already selected, and then select the execution by clicking on the row.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Info in the fly out menu.
To view the log for a job, do either of the following:
Select the execution by clicking on the row. When a View Log button appears in the header, click it,
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select View Log in the fly out menu.
To relaunch a job as a new job, do either of the following:
Select the execution by clicking on the row. When a Launch as New Job button appears in the header, click it.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row, then select Launch as New Job in the menu.
To relaunch an analysis as a new analysis, do either of the following:
Select the execution by clicking on the row. When a Launch as New Analysis button appears in the header, click it.
Hover over the row and click on the "More Actions" button that looks like three vertical dots at the end of the row to select Launch as New Analysis in the menu.
"Running" - The job is actively executing.
"In Progress" - The analysis is actively processing.
"Done" - The execution completed successfully without errors.
"Failed" - The execution encountered an error and could not complete. See Types of Errors for troubleshooting assistance.
"Partially Failed" - An analysis reaches "Partially Failed" state if one or more workflow stages did not finish successfully, with at least one stage not in a terminal state (either "Done," "Failed," or "Terminated").
"Terminating" - The worker has initiated but not completed the termination process.
"Terminated" - The execution stopped before completion.
"Debug Hold" - The execution, run with debugging options, encountered an applicable failure and entered debugging hold.
Executable - The executable or executables run during the execution. If the execution is an analysis, each stage appears in a separate row, including the name of the executable run during the stage. If an informational page exists with details about the executable's configuration and use, the executable name becomes clickable, and clicking displays that page.
Tags - Tags are strings associated with objects on the platform. They are a type of metadata that can be added to an execution.
Launched By - The name of the user who launched the execution.
Launched On - The time at which the execution was launched. This time often precedes the time in the Started Running column due to executions waiting for available resources before starting.
Started Running - The time at which the execution started running, if it has done so. This may differ from the launch time if the execution had to wait for available resources before starting.
Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.
Cost - A value is displayed in this column when the user has access to billing info for the execution. The figure shown represents either, for a running execution, an estimate of the charges it has incurred so far, or, for a completed execution, the total costs it incurred.
Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, either via the CLI or via the UI. This setting determines the scheduling priority of the execution relative to other executions that are waiting to be launched.
Worker URL - If the execution runs an executable, such as DXJupyterLab, with direct web URL connection capability, the URL appears here. Clicking the URL opens a connection to the executable in a new browser tab.
Output Folder - For each execution, the value shows a path relative to the project's root folder. Click the value to open the folder containing the execution's outputs.
Launched By - The user who launched an execution or executions
Launch Time - The time range within which executions were launched
The execution's state changes to "Terminating" during termination, then to "Terminated" once complete.
Execution state - In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times compared to each other. The colors include:
Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.
Green - A green bar indicates that the execution is in the "Done" state.
Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.
Grey - A grey bar indicates that the execution is in the "Terminated" state.
Execution start and stop times - Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").
Inputs - This section lists the execution inputs. Available input files appear as hyperlinks to their project locations. For inputs from other workflow executions, the source execution name appears as a hyperlink to its details page.
Outputs - This section lists the execution's outputs. Available output files appear as hyperlinks. Click a link to open the folder containing the output file.
Log files - An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files, and, as needed, download them in .txt format:
To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.
To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage, in the Execution Tree section.
Basic info - The Info pane, on the right side of the screen, displays a range of basic information on the execution, along with additional detail such as the execution's unique ID, and custom properties and tags assigned to it.
Reused results - For executions reusing results from another execution, the information appears in a blue pane above the Execution Tree section. Click the source execution's name to see details about the execution that generated these results.
Click Send Report to send the report.
$ dx watch job-xxxx
Watching job job-xxxx. Press Ctrl+C to stop.
* Sample Prints (sample_prints:main) (running) job-xxxx
amy 2024-01-01 09:00:00 (running for 0:00:37)
2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
2024-01-01 09:06:37 Sample Prints STDOUT 0
...
$ dx watch job-xxxx
Watching job job-xxxx. Press Ctrl+C to stop.
* Sample Prints (sample_prints:main) (running) job-xxxx
amy 2024-01-01 09:00:00 (running for 0:00:37)
2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
2024-01-01 09:06:37 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
2024-01-01 09:06:37 Sample Prints INFO Setting SSH public key
2024-01-01 09:06:37 Sample Prints STDOUT dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-01-01 09:06:37 Sample Prints STDOUT Invoking main with {}
2024-01-01 09:06:37 Sample Prints STDOUT 0
2024-01-01 09:06:37 Sample Prints STDOUT 1
2024-01-01 09:06:37 Sample Prints STDOUT 2
2024-01-01 09:06:37 Sample Prints STDOUT 3
* Sample Prints (sample_prints:main) (done) job-xxxx
amy 2024-01-01 09:08:11 (runtime 0:02:11)
Output: -
$ dx find executions
* DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
amy 2024-01-01 09:00:18 (running for 0:10:28)
* Variant Calling Workflow (in_progress) analysis-xxxx
│ amy 2024-01-01 09:00:18
├── * FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
│ amy 2024-01-01 09:00:18
└── * BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
amy 2024-01-01 09:00:18 (running for 0:10:18)
$ dx find analyses
* Variant Calling Workflow (in_progress) analysis-xxxx
amy 2024-01-01 09:00:18
$ dx find jobs
* DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
amy 2024-01-01 09:10:00 (running for 0:00:28)
* FreeBayes Variant Caller (freebayes:main) (waiting_on_input) job-yyyy
amy 2024-01-01 09:00:18
* BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (running) job-zzzz
amy 2024-01-01 09:00:18 (running for 0:10:18)
$ dx watch job-xxxx --get-streams
Watching job job-xxxx. Press Ctrl+C to stop.
dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
Invoking main with {}
0
1
2
3
4
5
6
7
8
9
10
$ dx watch job-F5vPQg807yxPJ3KP16Ff1zyG -n 8
Watching job job-xxxx. Press Ctrl+C to stop.
* Sample Prints (sample_prints:main) (done) job-xxxx
amy 2024-01-01 09:00:00 (runtime 0:02:11)
2024-01-01 09:06:00 Sample Prints INFO Logging initialized (priority)
2024-01-01 09:08:11 Sample Prints INFO CPU: 4% (4 cores) * Memory: 547/7479MB * Storage: 74GB free * Net: 0↓/0↑MBps
2024-01-01 09:08:11 Sample Prints INFO Setting SSH public key
2024-01-01 09:08:11 Sample Prints dxpy/0.365.0 (Linux-5.15.0-1050-aws-x86_64-with-glibc2.29) Python/3.8.10
* Sample Prints (sample_prints:main) (done) job-F5vPQg807yxPJ3KP16Ff1zyG
amy 2024-01-01 09:00:00 (runtime 0:02:11)
Output: -
$ dx run swiss-army-knife -icmd="exit 1" \
--extra-args '{"executionPolicy": { "restartOn":{"*":2}}}'
$ dx find executions --include-restarted
* Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx tries
├── * Swiss Army Knife (swiss-army-knife:main) (failed) job-xxxx try 2
│ amy 2023-08-02 16:33:40 (runtime 0:01:45)
├── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 1
│ amy 2023-08-02 16:33:40
└── * Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
amy 2023-08-02 16:33:40
$ dx watch job-xxxx --try 0
Watching job job-xxxx try 0. Press Ctrl+C to stop watching.
* Swiss Army Knife (swiss-army-knife:main) (restarted) job-xxxx try 0
amy 2023-08-02 16:33:40
2023-08-02 16:35:26 Swiss Army Knife INFO Logging initialized (priority)
$ dx find executions -n 3 --all-projects
* Sample Prints (sample_prints:main) (done) job-xxxx
amy 2024-01-01 09:15:00 (runtime 0:02:11)
* Sample Applet (sample_applet:main) (done) job-yyyy
ben 2024-01-01 09:10:00 (runtime 0:00:28)
* Sample Applet (sample_applet:main) (failed) job-zzzz
amy 2024-01-01 09:00:00 (runtime 0:19:02)
# Find the 100 most recently launched jobs in your project
$ dx find executions -n 100
# Find most recent executions running app-deepvariant_germline in the current project
$ dx find executions --executable app-deepvariant_germline
* DeepVariant Germline Variant Caller (deepvariant_germline:main) (running) job-xxxx
amy 2024-01-01 09:00:18 (running for 0:10:18)
# Find executions run on January 2, 2024
$ dx find executions --created-after=2024-01-01 --created-before=2024-01-03
# Find executions created in the last 2 hours
$ dx find executions --created-after=-2h
# Find analyses created in the last 5 days
$ dx find analyses --created-after=-5d
# Find failed jobs in the current project
$ dx find jobs --state failed
$ dx find jobs --delim
* Cloud Workstation (cloud_workstation:main) done job-xxxx amy 2024-01-07 09:00:00 (runtime 1:00:00)
* GATK3 Human Exome Pipeline (gatk3_human_exome_pipeline:main) done job-yyyy amy 2024-01-07 09:00:00 (runtime 0:21:16)
$ dx find jobs -n 3 --brief
job-xxxx
job-yyyy
job-zzzz
$ dx describe $(dx find jobs -n 1 --origin-jobs --brief)
Result 1:
ID job-xxxx
Class job
Job name BWA-MEM FASTQ Read Mapper
Executable name bwa_mem_fastq_read_mapper
Project context project-xxxx
Billed to amy
Workspace container-xxxx
Cache workspace container-yyyy
Resources container-zzzz
App app-xxxx
Instance Type mem1_ssd1_x8
Priority high
State done
Root execution job-zzzz
Origin job job-zzzz
Parent job -
Function main
Input genomeindex_targz = file-xxxx
reads_fastqgz = file-xxxx
[read_group_library = "1"]
[mark_as_secondary = true]
[read_group_platform = "ILLUMINA"]
[read_group_sample = "1"]
[add_read_group = true]
[read_group_id = {"$dnanexus_link": {"input": "reads_fastqgz", "metadata": "name"}}]
[read_group_platform_unit = "None"]
Output -
Output folder /
Launched by amy
Created Sun Jan 1 09:00:17 2024
Started running Sun Jan 1 09:00:10 2024
Stopped running Sun Jan 1 09:00:27 2024 (Runtime: 0:00:16)
Last modified Sun Jan 1 09:00:28 2024
Depends on -
Sys Requirements {"main": {"instanceType": "mem1_ssd1_x8"}}
Tags -
Properties -
# Find failed jobs in the current project from a time period
$ dx find jobs --state failed --created-after=2024-01-01 --created-before=2024-02-01
* BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
amy 2024-01-22 09:00:00 (runtime 0:02:12)
* BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (done) job-yyyy
amy 2024-01-07 06:00:00 (runtime 0:11:22)
# Find all failed executions of specified executable
$ dx find executions --state failed --executable app-bwa_mem_fastq_read_mapper
* BWA-MEM FASTQ Read Mapper (bwa_mem_fastq_read_mapper:main) (failed) job-xxxx
amy 2024-01-01 09:00:00 (runtime 0:02:12)
# Update the app and navigate to within app directory
$ dx build -a
INFO:dxpy:Archived app app-xxxx to project-xxxx:"/.App_archive/bwa_mem_fastq_read_mapper (Sun Jan 1 09:00:00 2024)"
{"id": "app-yyyy"}
# Rerun job with updated app
dx run bwa_mem_fastq_read_mapper --clone job-xxxx
dx find jobs --tag TAG
usage: dx run freebayes [-iINPUT_NAME=VALUE ...]
App: FreeBayes Variant Caller
Version: 3.0.1 (published)
Calls variants (SNPs, indels, and other events) using FreeBayes
See the app page for more information:
https://platform.dnanexus.com/app/freebayes
Inputs:
Sorted mappings: -isorted_bams=(file) [-isorted_bams=... [...]]
One or more coordinate-sorted BAM files containing mappings to call
variants for.
Genome: -igenome_fastagz=(file)
A file, in gzipped FASTA format, with the reference genome that the
reads were mapped against.
Suggestions:
project-BQpp3Y804Y0xbyG4GJPQ01xv://file-* (DNAnexus Reference Genomes: AWS US (East))
project-F3zxk7Q4F30Xp8fG69K1Vppj://file-* (DNAnexus Reference Genomes: AWS Germany)
project-F0yyz6j9Jz8YpxQV8B8Kk7Zy://file-* (DNAnexus Reference Genomes: Azure US (West))
project-F4gXb605fKQyBq5vJBG31KGG://file-* (DNAnexus Reference Genomes: AWS Sydney)
project-FGX8gVQB9X7K5f1pKfPvz9yG://file-* (DNAnexus Reference Genomes: Azure Amsterdam)
project-GvGXBbk36347jYPxP0j755KZ://file-* (DNAnexus Reference Genomes: Bahrain)
Target regions: [-itargets_bed=(file)]
(Optional) A BED file containing the coordinates of the genomic
regions to intersect results with. Supplying this will cause 'bcftools
view -R' to be used, to limit the results to that subset. This option
does not speed up the execution of FreeBayes.
Suggestions:
project-B6JG85Z2J35vb6Z7pQ9Q02j8:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS US (East))
project-F3zqGV04fXX5j7566869fjFq:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Germany)
project-F29g0xQ90fvQf5z1BX6b5106:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure US (West))
project-F4gYG1850p1JXzjp95PBqzY5:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): AWS Sydney)
project-FGXfq9QBy7Zv5BYQ9Yvqj9Xv:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Azure Amsterdam)
project-GvGXBZk3f624QVfBPjB8916j:/vendor_exomes/file-* (Vendor Exomes (GRCh37 and hg19): Bahrain)
Common
Output prefix: [-ioutput_prefix=(string)]
(Optional) The prefix to use when naming the output files (they will
be called prefix.vcf.gz, prefix.vcf.gz.tbi). If not provided, the
prefix will be the same as the first BAM file given.
Apply standard filters?: [-istandard_filters=(boolean, default=true)]
Select this to use stringent input base and mapping quality filters,
which may reduce false positives. This will supply the
'--standard-filters' option to FreeBayes.
Normalize variants representation?: [-inormalize_variants=(boolean, default=true)]
Select this to use 'bcftools norm' in order to normalize the variants
representation, which may help with downstream compatibility.
Perform parallelization?: [-iparallelized=(boolean, default=true)]
Select this to parallelize FreeBayes using multiple threads. This will
use the 'freebayes-parallel' script from the FreeBayes package, with a
granularity of 3 million base pairs. WARNING: This option may be
incompatible with certain advanced command-line options.
Advanced
Report genotype qualities?: [-igenotype_qualities=(boolean, default=false)]
Select this to have FreeBayes report genotype qualities.
Add RG tags to BAM files?: [-ibam_add_rg=(boolean, default=false)]
Select this to have FreeBayes add read group tags to the input BAM
files so each file will be treated as an individual sample. WARNING:
This may increase the memory requirements for FreeBayes.
Advanced command line options: [-iadvanced_options=(string)]
(Optional) Advanced command line options that will be supplied
directly to the FreeBayes program.
Outputs:
Variants: variants_vcfgz (file)
A bgzipped VCF file with the called variants.
Variants index: variants_tbi (file)
A tabix index (TBI) file with the associated variants index.
Tools in this section are created and maintained by their respective vendors and may require separate licenses.
Retrieve reads in FASTQ format from SRA
Short read alignment
gatk4_haplotypecaller_parallel
Variant calling, post-alignment QC
gatk4_genotypegvcfs_single_sample_parallel
Variant calling
picard_mark_duplicates
Variant Calling- remove duplicates, post-alignment
saige_gwas_gbat
saige_gwas_svat
saige_gwas_grm
saige_gwas_sparse_grm
plink_pipeline
Plink2
plato_pipeline
Plato, Plink2
locuszoom
LocusZoom
GWAS, visualization
Interactive radiology image analysis
Transcriptomics Expression Quantification
fastqc
FastQC
Building reference for BWA alignment
bwa_mem_fastq_read_mapper
BWA-MEM
Short read alignment
star_generate_genome_index
(Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)
RNA Seq- indexing
star_mapping
(Spliced Transcripts Alignment to a Reference)
RNA Seq- mapping
subread_feature_counts
featureCounts
Read summarization, RNAseq
salmon_index_builder
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
Read quality trimming, adapter trimming
RNA Seq- mapping
subread_feature_counts
featureCounts
Read summarization, RNAseq
star_generate_genome_index
(Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)
Transcriptomics Expression Quantification
star_mapping
(Spliced Transcripts Alignment to a Reference)
Transcriptomics Expression Quantification
salmon_index_builder
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
salmon_quant
Salmon
Transcriptomics Expression Quantification
Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb
GENIE3
Data processing tools
ttyd
N/A
Unix shell on a platform cloud worker in your browser. Use it for on-demand CLI operations and to launch https apps on 2 extra ports
Variant calling
picard_mark_duplicates
Variant Calling- remove duplicates, post-alignment
freebayes
Use for short variant calls
gatk4_mutect2_variant_caller_and_filter
Somatic variant calling and post calling filtering
gatk4_somatic_panel_of_normals_builder
Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2.
Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in spark databases and tables
dxjupyterlab
dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, nipype, freesurfer, FSL
Running imaging processing related analysis
dxjupyterlab
dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, keras, scikit-learn, TensorFlow, torch, monai, monailabel
Running image processing related analysis, building and testing models and algorithms in an interactive way
Data Extraction
Running analyses, visualizing data, building and testing models and algorithms in an interactive way
WGS, WES, accelerated analysis
sentieon-tnbam
Sentieon's BAM to VCF somatic analysis pipeline
WGS, WES, accelerated analysis
pbdeepvariant
Deepvariant
Variant calling, accelerated analysis
sentieon-umi
Sentieon's pre-processing and alignment pipeline for next-generation sequencing data
WGS, WES, accelerated analysis
sentieon-dnabam
Sentieon's BAM to VCF germline analysis pipeline
WGS, WES, accelerated analysis
sentieon-joint_genotyping
Sentieon GVCFtyper
WGS, WES, accelerated analysis
sentieon-ccdg
Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline
WGS, WES, accelerated analysis
sentieon-dnaseq
Sentieon's FASTQ to VCF germline analysis pipeline
WGS, WES, accelerated analysis
WGS, WES, accelerated analysis
pbdeepvariant
Deepvariant
Variant calling, accelerated analysis
sentieon-umi
Sentieon's pre-processing and alignment pipeline for next-generation sequencing data
WGS, WES, accelerated analysis
sentieon-ccdg
Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline
WGS, WES, accelerated analysis
sentieon-dnaseq
Sentieon's FASTQ to VCF germline analysis pipeline
WGS, WES, accelerated analysis
snpeff_annotate
SnpEff
Annotation
snpsift_annotate
SnpSift
Annotation
aws_platform_to_s3_file_transfer
AWS S3
aws_s3_to_platform_files
AWS S3
sra_fastq_importer
oqfe
A revision of Functionally Equivalent
WGS, WES- alignment and duplicate marking
gatk4_bqsr_parallel
Variant calling
bwa_mem_fastq_read_mapper
regenie
REGENIE
GWAS
plink_gwas
PLINK2
raremetal2
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus Platform
imaging_multitool_monai
Image processing
3d_slicer
Visualization and image analysis
3d_slicer_monai
sra_fastq_importer
Retrieve reads in FASTQ format from SRA
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus Platform
cloud_workstation
N/A
SSH-accessible unix shell on a platform cloud worker. Use it for on-demand analysis of platform data.
ttyd
N/A
Unix shell on a platform cloud worker in your browser. Use it for on-demand CLI operations and to launch https apps on 2 extra ports
glnexus
GLnexus
This app can also be used to create pVCF without running joint genotyping
samtools_index
SAMtools - SAMtools index
Building BAM index file
samtools_sort
SAMtools - SAMtools sort
Sort alignment result based on coordinates
phesant
PHESANT
PheWAS
prsice2
PRSice-2
Polygenic risk scores
multiqc
MultiQC
QC reporting
qualimap2_anlys
Qualimap2
QC
rnaseqc
salmon_quant
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
bowtie2_fasta_indexer
bowtie2: bowtie2-build
Building reference for Bowtie2 alignment
bowtie2_fastq_read_mapper
bowtie2, SAMtools view, SAMtools sort, SAMtools index
Short read alignment
bwa_fasta_indexer
gatk4_bqsr_parallel
Variant calling
flexbar_fastq_read_trimmer
QC
trimmomatic
rnaseqc
Transcriptomics Expression Quantification
star_generate_genome_index
STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)
RNA Seq- indexing
star_mapping
Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb
DESeq2
Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb
WebGestaltR
Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb
file_concatenator
N/A
gzip
gzip
swiss-army-knife
cnvkit_batch
Copy Number Variant
gatk4_haplotypecaller_parallel
Variant calling, post-alignment QC
gatk4_genotypegvcfs_single_sample_parallel
locuszoom
LocusZoom
GWAS, visualization
dxjupyterlab
dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, cntk, keras, scikit-learn, TensorFlow, torch
Running analyses, visualizing data, building and testing models and algorithms in an interactive way
dxjupyterlab
dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence
Running analyses, visualizing data, building and testing models and algorithms in an interactive way
dxjupyterlab
data_model_loader_v2
Dataset Creation
Dataset Creation
dataset-extender
Dataset Extension
Dataset Extension
csv-loader
N/A
Data Loading
spark-sql-runner
Spark SQL
Dynamic SQL Execution
table-exporter
dxjupyterlab_spark_cluster
dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, Glow
Running analyses, visualizing data, building and testing models and algorithms in an interactive way
dxjupyterlab_spark_cluster
dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, yprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL
Running analyses, visualizing data, building and testing models and algorithms in an interactive way
dxjupyterlab_spark_cluster
sentieon-tnseq
Sentieon's FASTQ to VCF somatic analysis pipeline
WGS, WES, accelerated analysis
sentieon-bwa
Sentieon's FASTQ to BAM/CRAM pipeline
WGS, WES, accelerated analysis
pbgermline
sentieon-joint_genotyping
Sentieon GVCFtyper
WGS, WES, accelerated analysis
sentieon-tnseq
Sentieon's FASTQ to VCF somatic analysis pipeline
WGS, WES, accelerated analysis
sentieon-bwa
Sentieon's FASTQ to BAM/CRAM pipeline
WGS, WES, accelerated analysis
pbgermline
BWA-MEM
RAREMETALWORKER, RAREMETAL
,
BWA- bwa index
(Spliced Transcripts Alignment to a Reference)
WGCNA, topGO
bcftools, bedtools, bgzip, plink, sambamba, SAMtools, seqtk, tabix, vcflib, Plato, QCTool, vcftools, plink2, Picard
dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization
N/A
dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow
BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration
BWA-Mem Alignment, Co-ordinate Sorting, Picard MarkDuplicates, Base Quality Score Recalibration
This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.
This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.
To run a Nextflow pipeline on the DNAnexus Platform:
Import the pipeline script from a remote repository or local disk.
Convert the script to an app or applet.
Run the app or applet.
You can do this via either the user interface (UI) or the command-line interface (CLI), using the .
A Nextflow pipeline is structured as a folder containing Nextflow scripts, along with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.
(Optional) A nextflow.config configuration file.
(Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. For more information on how the exposed parameters are used at run time, see
An nf-core flavored folder structure is encouraged but not required.
To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project's Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.
Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, . Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.
An example of the Import Pipeline/Workflow modal:
Click the Start Import button after providing the necessary information. This starts a pipeline import job in the project specified in the Import To field (default is the current project).
After launching the import job, a status message "External workflow import job started" appears.
Access information about the pipeline import job in the project's Monitor tab:
After the import finishes, the imported pipeline executable exists as an applet. This is the output of the pipeline import job:
The newly created Nextflow pipeline applet appears in the project, for example, hello.
To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository's URL. You can also provide optional information, such as a and an :
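A minimal sketch of such an import, using the public nextflow-io/hello repository as an example source:
$ dx build --nextflow --repository https://github.com/nextflow-io/hello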
For Nextflow pipelines stored in private repositories, access requires credentials provided via the --git-credentials option with a DNAnexus file containing your authentication details. The file should be specified using either its qualified ID or path on the Platform. See the section for more details on setting up and formatting these credentials.
Once the pipeline import job finishes, it generates a new Nextflow pipeline applet with an applet ID in the form applet-zzzz.
Use dx run -h to get more information about running the applet:
Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on a local disk. For example, you may have a copy of the nextflow-io/hello pipeline stored on your laptop in a directory named hello, which contains the following files:
Ensure that the folder structure is in the required format, as .
To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide , such as an import destination:
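A minimal sketch, assuming the local hello folder from the example above and the destination path described below:
$ dx build --nextflow ./hello --destination project-xxxx:/applets2/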
This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet in the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.
The dx run -h command can be used to see information about this applet, similar to the example above.
A Nextflow pipeline applet has a type nextflow under its metadata. This applet acts like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing the applet.
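One way to check this from the CLI is shown below; the applet ID is a placeholder, and the exact layout of the describe output may vary, but per the text above the nextflow type should appear in the applet's metadata.
# Inspect the applet's metadata (look for the nextflow type)
$ dx describe project-xxxx:applet-yyyy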
For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and find the Nextflow section, which lists all arguments supported for building a Nextflow pipeline applet.
You can also build a Nextflow pipeline app from an existing applet by running the command: dx build --app --from applet-xxxx.
You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab is displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.
To run the Nextflow pipeline applet or app, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your input values:
You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following commands:
Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob handles a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).
Monitor the detail log of the head job and the subjobs through each job's DNAnexus log via the UI or the CLI.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days are automatically terminated.
Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. The costs (when your account has permission) and resource usage of a job are also viewable.
An example of the log of a head job:
An example of the log of a subjob:
From the CLI, you can use the dx watch command to check the status and view the log of the head job or each subjob.
Monitoring the head job:
Monitoring a subjob:
The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor and multiple subjobs, each running a single Nextflow process. Throughout the pipeline's execution, the head job remains in the "running" state and supervises the job tree's execution.
When a Nextflow head job (job-xxxx) enters its terminal state, either "done" or "failed", the system writes a Nextflow log file named nextflow-<job-xxxx>.log to the output destination of the head job.
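For example, assuming the job's output destination was project-xxxx:/applets/ as in the log excerpt further down this page, you could locate the uploaded log like this (the job ID is a placeholder):
# Hypothetical example: find the Nextflow log written by head job job-bbbb
$ dx find data --name "nextflow-job-bbbb.log" --path project-xxxx:/applets/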
DNAnexus supports the Docker container engine for the Nextflow pipeline execution environment. The pipeline developer may refer to a public Docker repository or a private one. When the pipeline references a private Docker repository, provide your Docker credentials file via the docker_creds file input of the Nextflow pipeline executable when launching the job tree.
Syntax of a private Docker credential:
Store this credential file in a separate project with restricted access permissions for security.
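A sketch of this workflow is shown below. The project and file IDs are placeholders, and the credentials file must follow the syntax shown above.
# Hypothetical example: keep Docker credentials in a restricted project
# and pass them to the run via the docker_creds input
$ dx upload docker_credentials.json --brief     # prints the new file ID, e.g. file-dddd
$ dx run project-xxxx:applet-yyyy \
  -i docker_creds=project-creds:file-dddd \
  -i reads_fastqgz=project-xxxx:file-yyyy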
Below are all the means by which you can specify an input value at build time and run time. They are listed in order of precedence (items listed first have greater precedence and override items listed further down the list):
Executable (app or applet) run time
DNAnexus Platform app or applet input.
CLI example:
dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy
While you can specify a file input parameter's value in different places, as seen above, the valid PATH format referring to the same file differs. This depends on the level (DNAnexus API/CLI level or Nextflow script level) and the class (file object or string) of the executable's input parameter. Examples are given below.
When launching a DNAnexus job, you can specify a job-level output destination such as project-xxxx:/destination/ using the platform-level optional parameter on the or on the . For pipelines with publishDir settings, each output file is saved to <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned by the Nextflow script's process.
Read more detail about the output folder specification and . Find an example on how to construct output paths of an nf-core pipeline job tree at run time in the .
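As a sketch, assuming a process in the pipeline sets publishDir to qc/, the run below would place its published files under project-xxxx:/results/run01/qc/. The project, applet, and input IDs are placeholders.
# Hypothetical example: set the job-level output destination at run time
$ dx run project-xxxx:applet-yyyy \
  -i reads_fastqgz=project-xxxx:file-yyyy \
  --destination project-xxxx:/results/run01/ \
  --brief -y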
You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.
First, configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to note the value entered in the "Audience" field. This value is required in a configuration file used by your pipeline to enable pipeline runs to access the S3 bucket.
Next, configure an IAM role such that its permission and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket.
The following example shows how to structure an IAM role's permission policy, to enable the role to use an S3 bucket - accessible via the S3 URI s3://my-nextflow-s3-workdir - as the work directory of Nextflow pipeline runs:
In the above example:
The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.
The two entries in the "Resource" section enable the role to access the bucket itself (arn:aws:s3:::my-nextflow-s3-workdir) and all objects within it (arn:aws:s3:::my-nextflow-s3-workdir/*).
The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:
In the above example:
To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).
To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).
Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, as accessible at job-oidc.dnanexus.com.
Next, you need to configure your pipeline so that when it's run, it can access the S3 bucket. To do this, add, in a configuration file, a dnanexus config scope that includes the properties shown in this example:
In the above example:
workDir is the path to the bucket to be used as a work directory, in S3 URI format.
jobTokenAudience is the "Audience" value you defined in Step 1 above.
jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus custom subject claims - for example, project_id, launched_by.
When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:
Having included custom subject claims in the trust policy for the role, you then need, in the Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, entered in the same order in which you entered them in the trust policy.
For example, if you configured a role's trust policy per the example above, a job must present the custom subject claims project_id and launched_by, in that order, in order to assume the role. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, as follows:
Within the dnanexus config scope, you must also set the value of iamRoleArnToAssume to the ARN of the appropriate role:
By default, the Platform restricts executables' access to external networks and to projects. Nextflow pipeline apps and applets have the following capabilities that are exceptions to these limits:
External internet access ("network": ["*"]) - This is required for Nextflow pipeline apps and applets to be able to pull Docker images from external Docker registries at runtime.
UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - This is required in order for Nextflow pipeline jobs to record the progress of executions, and preserve the run cache, to enable resume functionality.
You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building the executable, using the --extra-args flag with dx build --nextflow. An example:
Here are the key points:
"network": [] prevents jobs from accessing the internet.
"allProjects":"VIEW" increases jobs' access permission level to VIEW. This means that each job has "read" access to projects that can be accessed by the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs - via a , for example, - from projects other than the one in which a job is being run.
Additional options exist for dx build --nextflow:
Use dx build --help for more information.
When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:
When building a Nextflow pipeline executable, you can replace any Docker image reference with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.
This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. It also strengthens data security by eliminating the need for internet access to external resources during pipeline execution.
Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in caching at build time, or manual conversion of Docker images.
This method initiates a building job that begins by taking the pipeline script, then identifies the Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling the tarballs as bundledDepends.
You can use built-in caching via the CLI by using the flag --cache-docker at build time. All cached Docker tarballs are stored as file objects, within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
If you need to access a Docker container that's stored in a private repository, you must provide, along with the flag --docker-secrets, a file object that contains the credentials needed to access the repository. This object must be in the following format:
You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball, in one of the following three ways:
Option A: Reference each tarball by its unique Platform ID such as dx://project-xxxx:file-yyyy. Use this approach if you want deterministic execution behavior.
You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:
Option B: Within a Nextflow pipeline script, you can also reference a Docker image by its image name. Use this name within a path in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included; latest is not an accepted tag in this context.
Here are examples of tarball file object paths and names, as constructed from image names and version tags:
Option C: You can also reference Docker image names in pipeline scripts by digest, for example, <Image_name>@sha256:XYZ123…. File extensions are not necessary, and project-xxxx is the project where the Nextflow pipeline executable was built and is executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. An exact <version> reference must be included; latest is not an accepted tag in this context. When referring to a tarball file on the Platform using this method, the file must have an object property image_digest assigned to it. A typical format would be "image_digest":"<IMAGE_DIGEST_HERE>".
An example:
Based on the input parameter's type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter is assigned to the corresponding DNAnexus class (file, string, and so on).
As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which is assigned to the "file" or "string" class, respectively. When running the executable, based on the class (file or string) of the executable's input parameter, you use a specific PATH format to specify the value. See the table below for the acceptable PATH format for each class.
When converting a file reference from a URL format to a String, use the method toUriString(). An example of a URL format would be dx://project-xxxx:/path/to/file for a DNAnexus URI. The method toURI().toString() does not give the same result, because toURI() removes the context ID, such as project-xxxx, and toString() removes the scheme, such as dx://. More information about these Nextflow methods is available in the Nextflow documentation.
workDir, output: block, and publishDir
All files generated by a Nextflow job tree are stored in its session's corresponding workDir, the path where temporary results are stored. On DNAnexus, when the Nextflow pipeline job is run with "preserve_cache=true", the workDir is set to the path project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project in which the job ran, and you can follow this path to access all preserved temporary results. Access to these results is useful for investigating detailed pipeline progress and for resuming job runs during pipeline development.
When the Nextflow pipeline job is run with "preserve_cache=false" (the default), temporary files are stored in the job's temporary workspace, which is deconstructed when the head job enters its terminal state: "done", "failed", or "terminated". Since many of these files are intermediate inputs and outputs passed between processes, and are expected to be cleaned up after the job completes, running with "preserve_cache=false" reduces project storage costs for files that are not of interest. It also saves you from having to remember to clean up all temporary files.
To save the final results of interest, and to display them as the Nextflow pipeline executable's output, declare output files under the script's output: block and use Nextflow's optional publishDir directive to publish them.
This makes the published output files available as the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, of class array:file. The files are then organized under the relative folder structure assigned via publishDir. This works for both "preserve_cache=true" and "preserve_cache=false". Only the "copy" publish mode is supported on DNAnexus.
publishDir
At pipeline development time, the valid value of publishDir can be:
A local path string, for example, "publishDir path: ./path/to/nf/publish_dir/",
A dynamic string value defined as a pipeline input parameter such as "params.outdir", where "outdir" is a string-class input. This allows pipeline users to determine parameter values at runtime. For example, "publishDir path: '${params.outdir}/some/dir/'" or './some/dir/${params.outdir}/' or './some/dir/${params.outdir}/some/dir/' .
Find an example on how to construct output paths for an nf-core pipeline job tree at run time in the .
The queueSize option is part of Nextflow's executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this represents the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration has a value assigned to queueSize, it overrides the default value. If the value exceeds the upper limit (1000) on DNAnexus, the root job errors out. See the Nextflow executor page for examples.
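As a sketch, and assuming the pipeline honors an executor scope supplied this way, you could override queueSize at run time through a soft configuration file passed via the nextflow_soft_confs input described later on this page. The file name, value, and IDs are placeholders.
# Hypothetical example: raise queueSize via a soft configuration override
$ cat > queue.config <<'EOF'
executor {
    queueSize = 20
}
EOF
$ dx upload queue.config --brief     # prints the new file ID, e.g. file-qqqq
$ dx run project-xxxx:applet-yyyy \
  -i nextflow_soft_confs=project-xxxx:file-qqqq \
  -i reads_fastqgz=project-xxxx:file-yyyy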
The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change to a different instance type than the default, but this is not recommended. The head job executes and monitors the subjobs. Changing the instance type for the head job does not affect the computing resources available for subjobs, where most of the heavy computation takes place (see below where to configure instance types for Nextflow processes). Changing the instance type for the head job may be necessary only if it runs out of memory or disk space when staging input files, collecting pipeline output files, or uploading pipeline output files to the project.
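Should the head job need more memory or disk, one way to override its instance type is the standard dx run option shown below; this is only a sketch, the instance type name is an example, and as noted above changing the head job's instance type is generally not recommended.
# Hypothetical example: run the head job on a larger instance than the default
$ dx run project-xxxx:applet-yyyy \
  -i reads_fastqgz=project-xxxx:file-yyyy \
  --instance-type mem2_ssd1_v2_x8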
Each subjob's instance type is determined based on the profile information provided in the Nextflow pipeline script. Specify required instances via Nextflow's machineType directive (example below). Alternatively, use a set of system requirements such as cpus, memory, disk, and other resource parameters, per the official Nextflow documentation. The executor matches instance types to the minimal requirements described in the Nextflow pipeline profile using this logic:
Choose the cheapest instance that satisfies the system requirements.
Use only SSD type instances.
When all else is equal (price and instance specifications), it prefers a v2 instance type.
Order of precedence for subjob instance type determination:
The value assigned to machineType directive.
Values assigned to cpus, memory, and disk directives in their .
An example command for specifying machineType by DNAnexus instance type name is provided below:
Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without starting from the beginning of the pipeline. By retrieving cached progress, Nextflow resume helps pipeline developers save both time and compute costs. It is helpful for testing and troubleshooting when building and developing a Nextflow pipeline.
Nextflow uses a scratch storage area for caching and preserving each task's temporary results. This directory is called the "working directory", and its path is determined by:
The session id, a universally unique identifier (UUID) associated with current execution
Each task's unique hash ID: a hash number composed of each task's input values, input files, command line strings, container ID such as Docker image, conda environment, environment modules, and executed scripts in the bin directory, when applicable.
You can use the Nextflow resume feature with the following Nextflow pipeline executable parameters:
preserve_cache Boolean type. Default value is false. When set to true, the run is cached in the current project for future resumes. For example:
This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID and each subjob's unique ID.
Below are four possible scenarios and the recommended use cases for -i resume:
workDir
To save on storage costs, clean up the workDir regularly. The maximum number of sessions that can be preserved in a DNAnexus project is 20. If you exceed this limit, the job generates an error with the following message:
"The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run. "
To clean up all preserved sessions under a project, delete the entire .nextflow_cache_db folder. To clean up a specific session's cache, delete the specific .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, you can follow the documentation on . To delete a folder via the CLI, you can run:
Be aware that deleting an object via the UI or with dx rm in the CLI cannot be undone. Once a session's work directory is deleted or moved, subsequent runs cannot resume from that session.
For each session, only one job at a time can both resume the session's cached results and preserve its own progress back to that session. Multiple jobs can resume and preserve different sessions without limitation, as long as each job preserves a different session. Similarly, multiple jobs can resume the same session without limitation, as long as at most one of them preserves progress back to that session.
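As a sketch, the run below resumes a previously preserved session and continues to preserve its cache; the session ID, project, applet, and file IDs are placeholders.
# Hypothetical example: resume a preserved session and keep preserving it
$ dx run project-xxxx:applet-yyyy \
  -i reads_fastqgz=project-xxxx:file-yyyy \
  -i resume="12345678-1234-1234-1234-123456789012" \
  -i preserve_cache=true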
errorStrategy
Nextflow's errorStrategy directive allows you to define how error conditions are managed by the Nextflow executor at the process level. By default (errorStrategy terminate), when an error status is returned, the process and other pending processes stop immediately, which forces the entire pipeline execution to be terminated.
Four error strategy options exist for the Nextflow executor: terminate, finish, ignore, and retry. The behaviors for each strategy are listed below. The "all other subjobs" referenced are subjobs that have not yet entered their terminal states.
When more than one errorStrategy directive is applied within a pipeline job tree, the following rules apply, depending on which errorStrategy is triggered first.
When terminate is the first errorStrategy directive to be triggered in a subjob, all the other ongoing subjobs result in the "failed" state immediately.
When finish is the first errorStrategy directive to be triggered in a subjob, any other errorStrategy that is reached in the remaining ongoing subjobs also applies the finish errorStrategy
Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as ExecutionError, UnresponsiveWorker, JMInternalError, AppInternalError, or JobTimeoutExceeded, the subjob follows the executionPolicy assigned to it and restarts itself. It does not restart from the head job.
A: Find the errored subjob's job ID from the head job's nextflow_errored_subjob and nextflow_errorStrategy properties to investigate which subjob failed and which errorStrategy was applied. To query these errorStrategy related properties in CLI, run the following command:
where job-xxxx is the head job's job ID.
After finding the errored subjob, investigate the job log using the Monitor page by accessing the URL https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>. In this URL, jobID is the subjob's ID such as job-yyyy. Alternatively, watch the job log in CLI using dx watch job-yyyy.
With the preserve_cache value set to true when starting the Nextflow pipeline executable, trace the cache workDir such as project-xxxx:/.nextflow_cache_db/<session_id>/work/ to investigate the intermediate results of this run.
A: Find the Nextflow version by reading the log of the head job. Each built Nextflow executable is locked down to the specific version of Nextflow executor.
A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.
A: There can be many possible causes for a head job becoming unresponsive. One known cause is the Nextflow trace file being written directly to a DNAnexus URI such as dx://project-xxxx:/path/to/file. To avoid this, pass -with-trace path/to/tracefile (using a local path string) to the Nextflow pipeline applet's nextflow_run_opts input parameter.
params.outdir, publishDir, and job-level destination?
A: Taking nf-core/sarek as an example, start by reading the pipeline's logic:
The pipeline's publishDir is constructed with the params.outdir variable as a prefix, followed by each task's name as a subfolder:
publishDir = [ path: { "${params.outdir}/${...}" }, ... ]
params.outdir is a required input parameter to the pipeline. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir to:
To specify a value of params.outdir for the Nextflow pipeline executable built from the nf-core/sarek pipeline script, you can use the following command:
You can also set a job tree's output destination using --destination:
This command constructs the final output paths as follows:
project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.
project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of all tasks/processes/subjobs of this pipeline.
project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.
(Optional) Subfolders and other configuration files (supported as of v0.378.0). Subfolders and other configuration files can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).
When the input parameter is expecting a file, you need to specify the value in a certain format based on the class of the input parameter. When the input is of the "file" class, use DNAnexus qualified ID, which is the absolute path to the file object such as "project-xxxx:file-yyyy". When the input is of the "string" class, use the DNAnexus URI ("dx://project-xxxx:/path/to/file"). See table below for full descriptions of the formatting of PATHs.
You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of the file class, while fasta_fai is an input parameter of the string class. You then use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.
The DNAnexus object class of each input parameter is based on the "type" and "format" specified in the pipeline's nextflow_schema.json, when it exists. See additional documentation in the Nextflow Input Parameter Type Conversion section to understand how Nextflow input parameter's type and format (when applicable) converts to an app or applet's input class.
It is recommended to always use the app/applet means for specifying input values. The platform validates the input class and existence before the job is created.
All inputs for a Nextflow pipeline executable are set as "optional" inputs. This allows users to have flexibility to specify input via other means.
Nextflow pipeline command-line input parameter, available as nextflow_pipeline_params. This is an optional "string" class input, available for any Nextflow pipeline executable when it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced in the Nextflow Configuration documentation.
If a pipeline parameter passed via nextflow_pipeline_params is of Nextflow's string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
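For illustration, the run below passes a hypothetical file-path style pipeline parameter named --input through nextflow_pipeline_params, using the DNAnexus URI format for a file stored on the Platform; all IDs and paths are placeholders.
# Hypothetical example: pass a file reference as a DNAnexus URI
$ dx run project-xxxx:applet-yyyy \
  -i nextflow_pipeline_params="--input=dx://project-xxxx:/data/samplesheet.csv"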
Nextflow options parameter nextflow_run_opts. This is an optional "string" class input, available for any Nextflow pipeline executable when it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter that corresponds to the Nextflow run options pattern, specifying a preset configuration profile.
Nextflow parameter file nextflow_params_file. This is an optional "file" class input, available for any Nextflow pipeline executable that is being built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to -params-file option of nextflow run.
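For illustration, the sketch below creates a small YAML params file, uploads it, and passes it to the executable; the parameter names, values, and IDs are placeholders.
# Hypothetical example: supply parameter values via a YAML params file
$ cat > params.yaml <<'EOF'
input: "dx://project-xxxx:/data/samplesheet.csv"
outdir: "./results"
EOF
$ dx upload params.yaml --brief      # prints the new file ID, e.g. file-pppp
$ dx run project-xxxx:applet-yyyy \
  -i nextflow_params_file=project-xxxx:file-pppp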
Nextflow soft configuration override file nextflow_soft_confs. This is an optional "array:file" class input, available for any Nextflow pipeline executable that is being built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the file being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to -c option of nextflow run, and the order specified for this array of file input is preserved when passing to the nextflow run execution.
The soft configuration file can be used for assigning default values of configuration scopes (such as ).
It is highly recommended to use nextflow_params_file instead of nextflow_soft_confs for specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this at .
Pipeline source code:
nextflow_schema.json
Pipeline developers may specify default values of inputs in the nextflow_schema.json file.
If an input parameter is of Nextflow's string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.
nextflow.config
Pipeline developers may specify default values of inputs in the nextflow.config file.
Pipeline developers may specify a default profile value using --profile <value> when building the executable, for example, dx build --nextflow --profile test
main.nf, sourcecode.nf
Pipeline developers may specify default values of inputs in the Nextflow source code file (*.nf).
If an input parameter is of Nextflow's string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
iamRoleArnToAssume is the Amazon Resource Name (ARN) for the role that you configured in Step 2 above, and that is assumed by jobs to access the bucket.
You need also to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter, within an aws config scope.
--cache-docker (flag): Stores a container image tarball in the selected project in /.cached_docker_images. Only the Docker engine is supported. Incompatible with --remote.
--nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS (string): Custom pipeline parameters to be referenced when collecting the Docker images.
--docker-secrets DOCKER_SECRETS (file): A DNAnexus file ID with credentials for a private Docker repository.
For pipelines featuring conditional process trees determined by input values, provide mocked input values for caching Docker containers used by processes affected by the condition.
A building job requires CONTRIBUTE or higher permission to the destination project, that is, the project in which tarballs created from Docker containers are placed.
Pipeline source code is saved at /.nf_source/<pipeline_folder_name>/ in the destination project. The user is responsible for cleaning up this folder after the executable has been built.
integer, no format (NA): int
number, no format (NA): float
boolean, no format (NA): boolean
object, no format (NA): hash
When publishDir is defined this way, the user who launches the Nextflow pipeline executable handles constructing the publishDir to be a valid relative path.
The actual selected instance type's resources (CPUs, memory, disk capacity) may differ from what is allocated by the task. Instance type selection follows the precedence rules described above, while task allocation uses the values assigned in the configuration file.
When using Docker as the runtime container, the Nextflow executor propagates task execution settings to the Docker run command. For example, when task.memory is specified, this becomes the maximum amount of memory allowed for the container: docker run --memory ${task.memory}
workDir: the session progress, job status, and configuration data are saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed. Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is the task's unique ID, and project-xxxx is the project where the job tree is executed.
resume: String type. Default value is an empty string, and the run begins without any cached data. When assigned a session ID, the run resumes from what is cached for that session ID in the project. When assigned "true" or "last", the run determines the session ID corresponding to the latest valid execution in the current project and resumes from it. For example, dx run applet-xxxx -i reads_fastqgz="project-xxxx:file-yyyy" -i resume="<session_id>"
When a job resumes a specific session (a session_id has a format like 12345678-1234-1234-1234-123456789012), the new job not only resumes from where the cache left off, but also shares the same session_id as the cached session it resumes. When a new job makes progress in a session and the job is being cached, it writes temporary results to the same session's workDir and generates a new cache directory (cache.tar) with the latest cache information. Many Nextflow job trees can share the same session ID, writing to the same workDir path and creating their own cache.tar, but only the latest job that ends in the "done" or "failed" state is preserved in the project.
When the head job enters a terminal state such as "failed" or "terminated" for a reason not caused by the executor, no cache directory is preserved, even when the job was run with preserve_cache=true, and subsequent new jobs cannot resume from this run. This can happen when a job tree fails due to exceeding a cost limit, or when a user terminates a job in the job tree.
Scenario 4: resume=<session_ID> | "true" | "last", and preserve_cache=true. Use case: pipeline development; typically only for the first few tests. Only one job with the same <session_ID> can run at any given time.
ignore
- Subjob with error: job properties set with "nextflow_errorStrategy":"ignore" and "nextflow_errored_subjob":"self"; ends in "done" state immediately.
- Head job: job properties set with "nextflow_errorStrategy":"ignore" and "nextflow_errored_subjob":"job-1xxx, job-2xxx"; shows "subjobs <job-1xxx>, <job-2xxx> runs into Nextflow process errors' ignore errorStrategy were applied" at the end of the job log; ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").
- All other subjobs: keep running until terminal state; if an error occurs, their own errorStrategy is applied.
When retry is the first errorStrategy directive to be triggered in a subjob, and any of the remaining subjobs trigger a terminate, finish, or ignore errorStrategy, that other errorStrategy directive is applied to the corresponding subjob.
When ignore is the first errorStrategy directive to be triggered in a subjob, and any of the terminate, finish, or retry errorStrategy directives applies to the remaining subjobs, that other errorStrategy is applied to the corresponding subjob.
Meet the input requirement for executing the pipeline.
Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.
• App or applet input parameter class as file object
• CLI/API level, such as dx run --destination PATH
DNAnexus qualified ID (absolute path to the file object).
• Example (file):
project-xxxx:file-yyyy
project-xxxx:/path/to/file
• Example (folder):
project-xxxx:/path/to/folder/
• App or applet input parameter class as string
• Nextflow configuration and source code files, such as nextflow_schema.json, nextflow.config, main.nf, and sourcecode.nf
DNAnexus URI.
• Example (file):
dx://project-xxxx:/path/to/file
• Example (folder):
dx://project-xxxx:/path/to/folder/
• Example (wildcard):
dx://project-xxxx:/path/to/wildcard_files
Value of StringEquals job-oidc.dnanexus.com/:sub, and which jobs can then assume the role that enables bucket access:
project_id;project-xxxx: any Nextflow pipeline job running in project-xxxx
launched_by;user-aaaa: any Nextflow pipeline job launched by user-aaaa
project_id;project-xxxx;launched_by;user-aaaa: any Nextflow pipeline job launched by user-aaaa in project-xxxx
bill_to;org-zzzz: any Nextflow pipeline job billed to org-zzzz
--profile PROFILE (string): Set the default profile for the Nextflow pipeline executable.
--repository REPOSITORY (string): Specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.
--repository-tag TAG (string): Specifies a tag for the Git repository. Can be used only with --repository.
--git-credentials GIT_CREDENTIALS (file): Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found in the Git repository credentials section.
Requires running a "building job" with external internet access?
Yes, if building an applet for the first time or if any image is going to be updated. No internet access required on rebuild.
No
Docker images packaged as bundledDepends?
Yes. For Docker images that are used in the execution, they are cached and bundled at build time.
No. Docker tarballs are resolved at runtime.
At runtime
Job first attempts to access Docker images cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the images from the external repository, via the internet.
Job attempts to find the Docker image based on the Docker cache path referenced. If this fails, the job attempts to pull from the external repository, via the internet.
Image quay.io/biocontainers/tabix, version tag 1.11--hdfd78af_0: project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
Image python, version tag 3.9-slim: project-xxxx:/.cached_docker_images/python/python_3.9-slim
Image python, version tag latest: the Nextflow pipeline job attempts to pull from the remote external registry
From: Nextflow input parameter type and format (defined in nextflow_schema.json). To: DNAnexus input parameter class.
string, format file-path: file
string, format directory-path: string
string, format path: string
string, no format (NA): string
Scenario 1 (default): resume="" (empty string) and preserve_cache=false. Use case: production data processing; most high-volume use cases.
Scenario 2: resume="" (empty string) and preserve_cache=true. Use case: pipeline development; typically only for the first few pipeline tests. During development, it is useful to see all intermediate results in workDir. Only up to 20 Nextflow sessions can be preserved per project.
Scenario 3: resume=<session_ID> | "true" | "last", and preserve_cache=false. Use case: pipeline development; pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh.
For each errorStrategy, the behavior of the subjob with the error, the head job, and all other subjobs is as follows:
terminate
- Subjob with error: job properties set with "nextflow_errorStrategy":"terminate" and "nextflow_errored_subjob":"self"; ends in "failed" state immediately.
- Head job: job properties set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"job-xxxx", and "nextflow_terminated_subjob":"job-yyyy, job-zzzz", where job-xxxx is the errored subjob, and job-yyyy, job-zzzz are other subjobs terminated due to this error; ends in "failed" state immediately, with error message: "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure."
- All other subjobs: end in "failed" state immediately.
finish
- Subjob with error: job properties set with "nextflow_errorStrategy":"finish" and "nextflow_errored_subjob":"self"; ends in "done" state immediately.
- Head job: job properties set with "nextflow_errorStrategy":"finish" and "nextflow_errored_subjob":"job-xxxx, job-2xxx", where job-xxxx and job-2xxx are errored subjobs; no new subjobs are created after the error; ends in "failed" state eventually, after other subjobs enter terminal states, with error message: "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure."
- All other subjobs: keep running until terminal state; if an error occurs in any, the finish errorStrategy is applied (ignoring other error strategies), per Nextflow default behavior.
retry
- Subjob with error: job properties set with "nextflow_errorStrategy":"retry" and "nextflow_errored_subjob":"self"; ends in "done" state immediately.
- Head job: spins off a new subjob to retry the errored job, named <name> (retry: <RetryCount>); ends in a terminal state depending on other subjobs (can be "done", "failed", or "terminated").
- All other subjobs: keep running until terminal state; if an error occurs, their own errorStrategy is applied.








dx build --nextflow --extra-args='{"runSpec": {"release": "20.04"}}'

$ dx build --nextflow \
--repository https://github.com/nextflow-io/hello \
--destination project-xxxx:/applets/hello
Started builder job job-aaaa
Created Nextflow pipeline applet-zzzz

$ dx run project-xxxx:/applets/hello -h
usage: dx run project-xxxx:/applets/hello [-iINPUT_NAME=VALUE ...]
Applet: hello
hello
Inputs:
Nextflow options
Nextflow Run Options: [-inextflow_run_opts=(string)]
Additional run arguments for Nextflow (e.g. -profile docker).
Nextflow Top-level Options: [-inextflow_top_level_opts=(string)]
Additional top-level options for Nextflow (e.g. -quiet).
Soft Configuration File: [-inextflow_soft_confs=(file) [-inextflow_soft_confs=... [...]]]
(Optional) One or more nextflow configuration files to be appended to the Nextflow pipeline
configuration set
Script Parameters File: [-inextflow_params_file=(file)]
(Optional) A file, in YAML or JSON format, for specifying input parameter values
Advanced Executable Development Options
Debug Mode: [-idebug=(boolean, default=false)]
Shows additional information in the job log. If true, the execution log messages from
Nextflow are also included.
Resume: [-iresume=(string)]
Unique ID of the previous session to be resumed. If 'true' or 'last' is provided instead of
the sessionID, resumes the latest resumable session run by an applet with the same name
in the current project in the last 6 months.
Preserve Cache: [-ipreserve_cache=(boolean, default=false)]
Enable storing pipeline cache and local working files to the current project. If true, local
working files and cache files are uploaded to the platform, so the current session could
be resumed in the future
Outputs:
Published files of Nextflow pipeline: [published_files (array:file)]
Output files published by current Nextflow pipeline and uploaded to the job output
destination.

$ pwd
/path/to/hello
$ ls
LICENSE README.md main.nf nextflow.config

$ dx build --nextflow /path/to/hello \
--destination project-xxxx:/applets2/hello
{"id": "applet-yyyy"}$ dx run project-yyyy:applet-xxxx \
-i debug=false \
--destination project-xxxx:/path/to/destination/ \
--brief -y
job-bbbb

# See subjobs in progress
$ dx find jobs --origin job-bbbb
* hello (done) job-bbbb
│ amy 2023-09-20 14:57:58 (runtime 0:02:03)
├── sayHello (3) (hello:nf_task_entry) (done) job-1111
│ amy 2023-09-20 14:58:57 (runtime 0:00:45)
├── sayHello (1) (hello:nf_task_entry) (done) job-2222
│ amy 2023-09-20 14:58:52 (runtime 0:00:52)
├── sayHello (2) (hello:nf_task_entry) (done) job-3333
│ amy 2023-09-20 14:58:48 (runtime 0:00:53)
└── sayHello (4) (hello:nf_task_entry) (done) job-4444
amy 2023-09-20 14:58:43 (runtime 0:00:50)

# Monitor job in progress
$ dx watch job-bbbb
Watching job job-bbbb. Press Ctrl+C to stop watching.
* hello (done) job-bbbb
amy 2023-09-20 14:57:58 (runtime 0:02:03)
... [deleted]
2023-09-20 14:58:29 hello STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:58:30 hello STDOUT bash running (job ID job-bbbb)
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:31 hello STDOUT === NF projectDir : /home/dnanexus/hello
2023-09-20 14:58:31 hello STDOUT === NF session ID : 0eac8f92-1216-4fce-99cf-dee6e6b04bc2
2023-09-20 14:58:31 hello STDOUT === NF log file : dx://project-xxxx:/applets/nextflow-job-bbbb.log
2023-09-20 14:58:31 hello STDOUT === NF command : nextflow -log nextflow-job-bbbb.log run /home/dnanexus/hello -name job-bbbb
2023-09-20 14:58:31 hello STDOUT === Built with dxpy : 0.358.0
2023-09-20 14:58:31 hello STDOUT =============================================================
2023-09-20 14:58:34 hello STDOUT N E X T F L O W ~ version 22.10.7
2023-09-20 14:58:35 hello STDOUT Launching `/home/dnanexus/hello/main.nf` [job-bbbb] DSL2 - revision: 1647aefcc7
2023-09-20 14:58:43 hello STDOUT [0a/6a81ca] Submitted process > sayHello (4)
2023-09-20 14:58:48 hello STDOUT [f5/87df8b] Submitted process > sayHello (2)
2023-09-20 14:58:53 hello STDOUT [4b/21374a] Submitted process > sayHello (1)
2023-09-20 14:58:57 hello STDOUT [f6/8c44f5] Submitted process > sayHello (3)
2023-09-20 14:59:51 hello STDOUT Hola world!
2023-09-20 14:59:51 hello STDOUT
2023-09-20 14:59:51 hello STDOUT Ciao world!
2023-09-20 14:59:51 hello STDOUT
2023-09-20 15:00:06 hello STDOUT Bonjour world!
2023-09-20 15:00:06 hello STDOUT
2023-09-20 15:00:06 hello STDOUT Hello world!
2023-09-20 15:00:06 hello STDOUT
2023-09-20 15:00:07 hello STDOUT === Execution completed — cache and working files will not be resumable
2023-09-20 15:00:07 hello STDOUT === Execution completed — upload nextflow log to job output destination project-xxxx:/applets/
2023-09-20 15:00:09 hello STDOUT Upload nextflow log as file: file-GZ5ffkj071zqZ9Qj22qv097J
2023-09-20 15:00:09 hello STDOUT === Execution succeeded — upload published files to job output destination project-xxxx:/applets/
* hello (done) job-bbbb
amy 2023-09-20 14:57:58 (runtime 0:02:03)
Output: -

# Monitor job in progress
$ dx watch job-cccc
Watching job job-cccc. Press Ctrl+C to stop watching.
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
... [deleted]
2023-09-20 14:59:28 sayHello (1) STDOUT dxpy/0.358.0 (Linux-5.15.0-1045-aws-x86_64-with-glibc2.29) Python/3.8.10
2023-09-20 14:59:30 sayHello (1) STDOUT bash running (job ID job-cccc)
2023-09-20 14:59:33 sayHello (1) STDOUT file-GZ5ffQj047j3Vq7QX220Q5vQ
2023-09-20 14:59:34 sayHello (1) STDOUT Bonjour world!
2023-09-20 14:59:36 sayHello (1) STDOUT file-GZ5ffVQ047j2QXZ2ZkFx4YxG
2023-09-20 14:59:38 sayHello (1) STDOUT file-GZ5ffX0047j2QXZ2ZkFx4YxK
2023-09-20 14:59:41 sayHello (1) STDOUT file-GZ5ffXQ047jGYZ91x6KG32Jp
2023-09-20 14:59:43 sayHello (1) STDOUT file-GZ5ffY8047jF2PY3609JPBKB
sayHello (1) (hello:nf_task_entry) (done) job-cccc
amy 2023-09-20 14:58:52 (runtime 0:00:52)
Output: exit_code = 0

{
"docker_registry": {
"registry": "url-to-registry",
"username": "name123",
"token": "12345678"
}
}

# Query for the class of each input parameter
$ dx run project-yyyy:applet-xxxx --help
usage: dx run project-yyyy:applet-xxxx [-iINPUT_NAME=VALUE ...]
Applet: example_applet
example_applet
Inputs:
…
fasta: [-ifasta=(file)]
…
fasta_fai: [-ifasta_fai=(string)]
…
# Assign values of the parameter based on the class of the parameter
$ dx run project-yyyy:applet-xxxx -ifasta="project-xxxx:file-yyyy" -ifasta_fai="dx://project-xxxx:/path/to/file"

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:DeleteObject",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my-nextflow-s3-workdir",
"arn:aws:s3:::my-nextflow-s3-workdir/*"
]
}
]
}

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRoleWithWebIdentity",
"Principal": {
"Federated": "arn:aws:iam::123456789012:oidc-provider/job-oidc.dnanexus.com/"
,
"Condition": {
"StringEquals": {
"job-oidc.dnanexus.com/:aud": "dx_nextflow_s3_scratch_token_aud"
},
"StringEquals": {
"job-oidc.dnanexus.com/:sub": "project_id;project-xxxx;launched_by;user-aaaa"
}
}
}
}
]
}

# In a nextflow configuration file:
aws { region = '<aws region>'}
dnanexus {
workDir = '<S3 URI path>'
jobTokenAudience = '<OIDC_audience_name>'
jobTokenSubjectClaims = '<list of claims separated by commas>'
iamRoleArnToAssume = '<arn of the role who is set with permission>'
}

# In a nextflow configuration file:
dnanexus {
...
jobTokenSubjectClaims = 'project_id,launched_by'
...
}

# In a nextflow configuration file:
dnanexus {
...
iamRoleArnToAssume = 'arn:aws:iam::123456789012:role/NextflowRunIdentityToken'
...
}

$ dx build --nextflow /path/to/hello --extra-args \
'{"access":{"network": [], "allProjects":"VIEW"}}'
...
{"id": "applet-yyyy"}providers {
github {
user = 'username'
password = 'ghp_xxxx'
}
}

# --nextflow-pipeline-params is needed only when the pipeline requires input values
$ dx build --nextflow /path/to/hello \
--cache-docker \
--nextflow-pipeline-params "--alpha=1 --beta=foo" \
--destination project-xxxx:/applets2/hello
...
{"id:"applet-yyyy"}
$ dx tree /.cached_docker_images/
/.cached_docker_images/
├── samtools
│ └── samtools_1.16.1--h6899075_1
├── multiqc
│ └── multiqc_1.18--pyhdfd78af_0
└── fastqc
└── fastqc_0.11.9--0

"docker_registry": {
"registry": "url-to-registry",
"username": "name123",
"token": "12345678"
}

# In a Nextflow pipeline script:
process foo {
container 'dx://project-xxxx:file-yyyy'
'''
do this
'''
}

# In nextflow.config, at the root folder of the Nextflow pipeline:
process {
withName:foo {
container = 'dx://project-xxxx:file-yyyy'
}
}

# In a Nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'
# In the Nextflow pipeline script:
process bar {
container 'quay.io/biocontainers/tabix:1.11--hdfd78af_0'
'''
do this
'''
}

# In a Nextflow configuration file:
docker.enabled = true
docker.registry = 'quay.io'
# In the Nextflow pipeline script:
process bar {
container 'quay.io/biocontainers/tabix@sha256:XYZ123…'
'''
do this
'''
}

process foo {
machineType 'mem1_ssd1_v2_x36'
"""
<your script here>
"""
}

dx run applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy -i preserve_cache=true

dx rm -r project-xxxx:/.nextflow_cache_db/ # cleanup ALL sessions caches
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/ # clean up a specific session's cache

$ dx describe job-xxxx --json | jq -r .properties.nextflow_errored_subjob
job-yyyy
$ dx describe job-xxxx --json | jq -r .properties.nextflow_errorStrategy
terminate

# assign "./local/to/outdir" to params.outdir
dx run project-xxxx:applet-zzzz \
-i outdir=./local/to/outdir \
--brief -y

# assign "./local/to/outdir" to params.outdir and set the job tree's output destination
dx run project-xxxx:applet-zzzz \
-i outdir=./local/to/outdir \
--destination project-xxxx:/path/to/jobtree/destination/ \
--brief -y