Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.
On the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.
Projects have a series of features designed to facilitate collaboration, help project members coordinate and organize their work, and ensure appropriate control over both data and tools.
A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.
Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.
The following are four common actions you can perform on objects from within the Manage screen.
File objects can be directly downloaded from the system. To download a file:
Select its row, then click the More Actions button - the "..." icon - at the end of the row showing the file's name.
Select "Download" from the list of available actions.
Follow the instructions in the modal window that opens.
To learn more about an object:
Click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen.
Select the row showing the name of the object about which you want to know more. An info panel will open on the right, displaying a range of information about the object, including its unique ID, as well as metadata such as its owner, time of creation, size, tags, and properties.
To delete an object:
Select its row, then click on the More Actions button - the "..." icon - at the end of the row.
Select "Delete" from the list of available actions.
Follow the instructions in the modal window that opens.
Note that deletion cannot be undone.
To copy an object or objects to another project:
Select the object or objects you want to copy, by clicking the box to the left of the name of each object in the objects list.
Click the Copy button in the upper right corner of the Manage screen. A modal window will open.
Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.
Click the Copy Selected button.
You can collaborate on the Platform by sharing a project with other DNAnexus users. When you share a project with a user, or with a group of users in an org, they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.
To remove a user or org from a project to which you have ADMINISTER access:
On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window will open, showing a list of project members.
Find the row showing the user you want to remove from the project.
Move your mouse over that row, then click the Remove from Members button at the right end of the row.
VIEW: Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD: Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE: Gives users UPLOAD access, plus the ability to run executions directly in the project.
ADMINISTER: Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
Suppose you have a set of samples sequenced at your lab, and a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.
Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project and upload your sequenced data to it. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You'll then both be able to use one another's data to perform downstream analyses.
A project admin can configure a project to allow project members to run only specific executables as root executions. The list of permitted executables is set by entering the following command, via the CLI:
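The original command is not reproduced here; a sketch of the call, with project-xxxx, applet-xxxx, and app-xxxx standing in for real IDs, would look like this:

```shell
# Restrict root executions in project-xxxx to the listed executables.
# The IDs below are placeholders; substitute your own project and executable IDs.
dx api project-xxxx update '{"allowedExecutables": ["applet-xxxx", "app-xxxx"]}'
```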
Note that by entering this command, you will overwrite any existing set of permitted executables.
To unset the list, and thus permit project members to run all available executables as root executions, enter the following command:
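A sketch of the unset call, again with a placeholder project ID:

```shell
# Clear the allowlist so that any executable may again be run as a root execution.
dx api project-xxxx update '{"allowedExecutables": null}'
```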
Note that executables that are called by a permitted executable are permitted to run, even if they are not included in the list.
Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false, and can be viewed and modified via the CLI and the platform API. The protected, restricted, downloadRestricted, externalUploadRestricted, and containsPHI settings can also be viewed and modified on the project's Settings web screen, as described below.
protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.
restricted: If set to true:
data in this project cannot be cloned to another project
data in this project cannot be used as input to a job or an analysis in another project
any running app or applet that reads from this project cannot write results to any other project
a job running in the project will have its singleContext flag set to true, irrespective of the singleContext value supplied to /job/new and /executable-xxxx/run, and will only be allowed to use the job's DNAnexus authentication token when issuing requests to the proxied DNAnexus API endpoint within the job. Use of any other authentication token will result in an error.
This flag corresponds to the Copy Access policy in the project's Settings web interface screen.
downloadRestricted: If set to true, data in this project cannot be downloaded outside of the platform. For database objects, users would not be able to access the data in the project from outside DNAnexus. This includes read and write operations. This flag corresponds to the Download Access policy in the project's Settings web interface screen.
databaseUIViewOnly: If set to true, project members with VIEW access will have their access to project databases restricted to the Cohort Browser only. This feature is only available to customers with an Apollo license. Contact DNAnexus Sales for more information.
containsPHI: If set to true, data in this project is treated as Protected Health Information (PHI): identifiable health information that can be linked to a specific person. PHI Data Protection safeguards the confidentiality and integrity of the project data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), by imposing the additional restrictions documented in the PHI Data Protection section. This flag corresponds to the PHI Data Protection setting in the Administration section of a project's Settings web interface screen.
displayDataProtectionNotice: If set to true, ADMIN users will be able to turn on/off the ability to show a Data Protection Notice to any users accessing the selected project. If the Data Protection Notice feature is enabled for a project, all users, when first accessing the project, will be required to review and confirm their acceptance of a requirement not to egress data from the project. Note that a license is required to use this feature. Contact DNAnexus Sales for more information.
externalUploadRestricted: If set to true, external file uploads to this project (from outside the job context) are rejected. The creation of Apollo databases, tables, and inserts of data into tables is disallowed from Thrift with a non-job token. This flag corresponds to the External Upload Access policy in the project's Settings web interface screen. A license is required to use this feature. Contact DNAnexus Sales for more information.
httpsAppIsolatedBrowsing: If set to true, httpsApp access to jobs launched in this project will be wrapped in Isolated Browsing, which will restrict data transfers through the httpsApp job interface. A license is required to use this limited-access feature. Contact DNAnexus Sales for more information.
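As an illustration, the flags above might be inspected and set from the CLI along these lines (a sketch; project-xxxx is a placeholder, and these calls require the appropriate access level):

```shell
# Show two of the project's current policy flags (output is JSON).
dx api project-xxxx describe '{"fields": {"protected": true, "restricted": true}}'

# Turn on the protected flag, so only ADMINISTER members can delete project data.
dx api project-xxxx update '{"protected": true}'
```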
Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:
Data in this project cannot be cloned to other projects that do not have containsPHI set to true
Any jobs that run in non-PHI projects will not be able to access any data that can only be found in PHI projects
Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging in to the Platform and opening the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.
Apollo database access is subject to additional restrictions
Once PHI Data Protection is activated for a project, it cannot be disabled.
On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account, to which invoices are sent, covering all billable activities carried out within the project.
If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:
On the project's Settings screen, scroll down to the Administration section.
Click the Transfer Billing button. A modal window will open.
Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.
Click Send Transfer Request.
The user will receive an email notification of your request. To finalize the transfer, he or she must log onto the Platform and formally accept it.
If you have billable activities access in the org to which you wish to transfer the project, you can change the project's billing account to the org. To do this, navigate to the project settings page by clicking the gear icon in the project header. On the project settings page, you can then select the billing account to which the project should be billed.
If you do not have billable activities access in the org you wish to transfer the project to, you will need to transfer the project to a user who does have this access. The recipient will then be able to follow the instructions below to accept a project transfer on behalf of an org.
You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the user in question. To do this:
Select All Projects from the Projects link in the main menu. Open the project in question. You'll see a Pending Project Ownership Transfer notification at the top of the screen.
Click the Cancel Transfer button to cancel the transfer.
When another user initiates a project transfer to you, you’ll receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.
Note that if you did not already have access to the project being transferred, you'll get VIEW access, and the project will appear in the list on the Projects screen.
To accept the transfer:
Open the project. You'll see a Pending Project Ownership Transfer notification in the project header.
Click the Accept Transfer button.
Select a new billing account for the project from the dropdown of eligible accounts.
If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. See this documentation for more on the auto-symlink feature.
If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.
Ownership of sponsored projects may not be transferred without the sponsorship first being terminated.
A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.
On setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.
Billing responsibility for sponsored projects may not be transferred.
Sponsored projects may not be deleted unless the project sponsor first ends the sponsorship, by changing its end date to a date in the past.
For more information about sponsorship, contact DNAnexus Support.
See the Org Management page for detailed information on projects that are billed to an org.
Learn about accessing and working with projects via the CLI:
Learn about working with projects as a developer:
By understanding projects, organizations, apps, and workflows, you'll improve your understanding of the DNAnexus Platform.
Learn to build an app that you can run on the Platform.
The steps below require the DNAnexus SDK. You must download and install it if you have not done so already.
In addition to this Quickstart, there are Developer Tutorials located in the sidebar that go over helpful tips for new users as well. A few of them include:
Every DNAnexus app starts with two files:
dxapp.json: a file containing the app's metadata: its inputs and outputs, how the app will be run, etc.
a script that will be executed in the cloud when the app is run
Let's start by creating a file called dxapp.json with the following text:
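The original file contents are not reproduced here; a minimal dxapp.json consistent with the fields described below would look something like this:

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0"
  }
}
```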
Above, we've specified the name of our app (coolapp), the interpreter (python3) used to run our script, and the path (code.py) to the script we'll create next. The "version": "0" field selects the version of the Ubuntu 24.04 application execution environment that supports the python3 interpreter.
Next, we create our script in a file called code.py with the following text:
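A minimal code.py along the lines the quickstart describes (a sketch; the greeting text is illustrative, and dxpy is the DNAnexus SDK module available in the execution environment):

```python
import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(**kwargs):
    # This function runs in the cloud when the applet is executed.
    print("Hello, DNAnexus!")
    return {}


dxpy.run()
```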
That's all we need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:
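The build command itself is not reproduced in this copy; a sketch:

```shell
# Build the applet from the current directory (the one containing
# dxapp.json and code.py).
dx build
```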
Now, run the app and watch the output:
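A sketch of the run command (the app name matches the dxapp.json above; --watch streams the job's logs, and -y skips the confirmation prompt):

```shell
dx run coolapp --watch -y
```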
That's it! You have just made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.
The app is now available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.
Next, we'll make our app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.
In the cloud, your app will run on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json, like this:
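A sketch of the updated file (only the execDepends entry is new relative to the earlier dxapp.json):

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0",
    "execDepends": [
      {"name": "ncbi-blast+"}
    ]
  }
}
```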
Next, let's update code.py to run BLAST:
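A sketch of what the updated script could look like, assuming two file inputs named seq1 and seq2 supplied at run time, and an output named report (the exact quickstart code is not reproduced in this copy):

```python
import subprocess

import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(seq1, seq2):
    # Fetch the two FASTA inputs onto the worker's local disk.
    dxpy.download_dxfile(seq1, "seq1.fasta")
    dxpy.download_dxfile(seq2, "seq2.fasta")
    # Build a BLAST database from the first sequence...
    subprocess.check_call(["makeblastdb", "-in", "seq1.fasta", "-dbtype", "nucl"])
    # ...then align the second sequence against it, saving the report.
    with open("report.txt", "w") as report:
        subprocess.check_call(
            ["blastn", "-query", "seq2.fasta", "-db", "seq1.fasta"],
            stdout=report)
    # Upload the report and return it as the job's output.
    return {"report": dxpy.dxlink(dxpy.upload_local_file("report.txt"))}


dxpy.run()
```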
We're now ready to rebuild the app and test it on some real data. You can use some demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is in the same region as the Demo Data project.
Rebuild the app with dx build -a, and run it like this:
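A sketch of the run command; file-xxxx and file-yyyy stand in for the IDs (or paths) of two FASTA files, such as the demo inputs mentioned above:

```shell
dx run coolapp -iseq1=file-xxxx -iseq2=file-yyyy --watch -y
```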
Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.
Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add our app to a workflow and be able to connect its inputs and/or outputs to other apps, our app will need both input and output specifications. Let's update our dxapp.json as follows:
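A sketch of the file with input and output specifications added (the field names match the seq1, seq2, and report names used elsewhere in this quickstart):

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0",
    "execDepends": [
      {"name": "ncbi-blast+"}
    ]
  },
  "inputSpec": [
    {"name": "seq1", "class": "file"},
    {"name": "seq2", "class": "file"}
  ],
  "outputSpec": [
    {"name": "report", "class": "file"}
  ]
}
```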
Rebuild the app with dx build -a. You can run it in the same way as before, but now we can add the applet to a workflow. Click "New Workflow" while looking at your project on the website, and click on coolapp once to add it to the workflow. You'll see inputs and outputs appear on the workflow stage, which can be connected to other stages in the workflow.
Also, if you now go back to the command line and run dx run coolapp with no input arguments, it will prompt you for the input values for seq1 and seq2.
In addition to specifying input files, the I/O specification can also be used to configure settings that we want the app to use. For example, we can configure the E-value setting and other BLAST settings with changes to code.py and dxapp.json:
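A sketch of how the main function might accept those settings, assuming evalue and blast_args are declared in inputSpec as string inputs (with a default for evalue, and blast_args as a free-form string of extra blastn arguments; both names are illustrative):

```python
import subprocess

import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(seq1, seq2, evalue="0.01", blast_args=""):
    dxpy.download_dxfile(seq1, "seq1.fasta")
    dxpy.download_dxfile(seq2, "seq2.fasta")
    subprocess.check_call(["makeblastdb", "-in", "seq1.fasta", "-dbtype", "nucl"])
    # Pass the configurable settings through to blastn.
    cmd = ["blastn", "-query", "seq2.fasta", "-db", "seq1.fasta", "-evalue", evalue]
    cmd += blast_args.split()
    with open("report.txt", "w") as report:
        subprocess.check_call(cmd, stdout=report)
    return {"report": dxpy.dxlink(dxpy.upload_local_file("report.txt"))}


dxpy.run()
```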
Rebuild the app again and add it in the workflow builder. You should now see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.
One of the utilities provided in the SDK is dx-app-wizard. This tool will prompt you with a series of questions, from which it creates the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Just run dx-app-wizard to try it out.
For additional information and examples of how to run jobs using the CLI, Chapter 5 of this reference guide may be useful. Note that this material is not a part of the official DNAnexus documentation and is for reference only.
Access developer tutorials and examples.
Developers new to the DNAnexus platform may find it easier to learn by doing. This page contains a collection of simple tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus platform. After reading through the tutorials and examples you should be able to develop app(let)s that:
Run efficiently: make use of cloud computing methodologies.
Are easy to debug: let developers understand and resolve issues.
Use the scale of the cloud: take advantage of the DNAnexus platform’s flexibility
Are easy to use: reduce support and enable collaboration.
If it’s your first time developing an app(let) be sure to read through the Getting started series. This series will introduce terms and concepts that tutorials and examples will build upon.
These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. Tutorials will showcase simple and varied implementations of the SAMtools view command on the DNAnexus platform.
Bash app(let)s use the command-line interface of dx-toolkit, our platform SDK, along with common bashisms, to create bioinformatic pipelines in the cloud.
Bash
Python app(let)s make use of dx-toolkit's Python implementation, along with common Python modules such as subprocess, to create bioinformatic pipelines in the cloud.
Python
To create a web applet, you will need access to Titan or Apollo features. Web applets can be written as either Python or bash applets; the only difference is that they launch some kind of web server and expose port 443 (for HTTPS), allowing a user to interact with that web application through a web browser.
Web_app
A bit of terminology before we start discussing parallel and distributed computing paradigms on the DNAnexus Platform.
There are many definitions and approaches to tackling the concept of parallelization and distributing workloads in the cloud (Here’s a particularly helpful Stack Exchange post on the subject). To help make our documentation easier to understand, when discussing concurrent computing paradigms we’ll refer to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in our case instances in the cloud) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus platform.
Parallel
Distributed
Learn to create a project, add members and data to the project, and run a simple workflow.
On the DNAnexus Platform, all data is stored within projects. So before you upload, browse, or analyze any data, you must create a project to house that data.
To create a project:
Select All Projects from the Projects link in the main menu. This will take you to the Projects page.
Click the New Project button in the top right corner of the Projects page. The New Project wizard will open in a modal window.
In the Project Name field, enter a name for your project.
In the More Info section, you can enter Tags or custom-defined Properties to make it easier to find this project later, and to organize it and other projects. For more on tags and properties, see this detailed explanation.
In the More Info section, you can also enter a Project Summary and/or a Project Description.
In the Billed To field of the Billing section, choose a billing account to which project charges will be billed. Follow these instructions to set up billing.
Next, in the Billing section, choose a cloud region in which project files will be stored and analyses will be run. A default region will be displayed here; it's fine to accept this default. For more on this topic, see this detailed explanation of cloud regions.
In the Access section, specify which types of users will be able to Copy Data, Delete Data, and Download Data. Default values will be shown here; it's fine to accept the defaults. For more on project access, see this detailed explanation of project access levels. For more on types of users, see this detailed rundown.
Click Create Project. You'll be taken to the Manage screen for the project. Once you've added data to your project, this is where you'll be able to see and get info on this data, and launch analyses that use it.
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
In the Access pulldown, choose the type of access the user or org will have to the project. For more on this, see this detailed explanation of project access levels.
If you don't want the user to receive an email notification on being added to the project, toggle Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5, for each user you want to add to the project.
Click Done when you're finished adding members.
To add data to your project, click the Add button in the top right corner of the project's Manage screen. You'll see three options for adding data:
Upload Data - Use your web browser to upload data from your computer. Note that if the upload takes a significant amount of time, you'll need to ensure that until it completes, you stay logged into the Platform, and keep your browser window open.
Add Data from Server - Specify a URL of an accessible server from which the file will be uploaded.
Copy Data from Project - Copy data from another project on the Platform.
To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:
From the project's Manage screen, click the Add button, then select Copy Data from Project.
In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.
Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.
Click the box next to the Name header, to select both files.
Click Copy to copy the files to your project.
Next, install the apps you'll need to analyze the data you added to the project in Step 3:
Select Tools Library from the Tools link in the main menu.
A list of available tools will open.
Find the BWA-MEM FASTQ Read Mapper in the list and click on its name.
A tool detail page will open, showing a full range of information about the tool and how to use it.
Click the Install button in the upper left part of the screen, under the name of the tool.
In the Install App modal, click the Agree and Install button.
After the tool has been installed, you'll be returned to the tool detail page.
Use your browser's "Back" button to return to the tools list page.
Repeat Steps 3-6 to install the FreeBayes Variant Caller.
Now build a workflow using the two apps you've just installed, and configure it to use the data you added to your project in Step 3.
A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:
Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.
Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder will open.
In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you'll see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.
Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.
Repeat Step 4 for the FreeBayes Variant Caller.
Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You'll return to the main Workflow Builder screen.
Set the required inputs for each step by doing the following:
To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file, then click the Select button.
Since the SRR100022 exome was sequenced using paired-end sequencing, you'll need to provide the right mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.
Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there will be a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.
Next, set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second-step input labeled "Sorted mappings [array]."
Click on the second-step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.
You're ready to launch your workflow, by doing the following:
Click the Start Analysis button at the upper right corner of the Workflow Builder.
In the modal window that opens, click the Run as Analysis button.
The BWA-MEM FASTQ Read Mapper will start executing immediately. Once it finishes, the FreeBayes Variant Caller will start, using the Read Mapper's output as an input.
Once you've launched your workflow, you'll be taken to your project's Monitor screen. Here, you'll see a list of both current and past analyses run within the project, along with key information about each run.
As your workflow runs, its status will be shown as "In Progress."
If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you'll see a red button labeled Terminate. Click the button to terminate the job. Note that this can take some time. While the job is being terminated, the job's status will show as "Terminating."
When your workflow completes, output files will be placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.
You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder in the "Demo Data" project. Note that because this entails working with a much larger file, running the workflow using the exome data will take longer.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a video intro to the Platform, watch this series of short, task-oriented tutorials.
For a more in-depth video intro to the Platform, watch this DNAnexus Platform Essentials video.
Learn to build an applet that performs a basic SAMtools count with the aid of bash helper variables.
View full source code on GitHub
Download input files using the dx-download-all-inputs command. The dx-download-all-inputs command will go through all inputs and download each into a folder following the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].
We create an output directory in preparation for the dx-upload-all-outputs command, covered in the Upload Results section below.
After executing the dx-download-all-inputs command, three helper variables are created to aid in scripting. For this applet, the input variable name mappings_bam with platform filename my_mappings.bam will have the following helper variables:
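A simulation of the naming convention (the values shown are what dx-download-all-inputs would produce for this input, per the path pattern above):

```shell
# Helper variables for the input mappings_bam, filename my_mappings.bam:
mappings_bam_name=my_mappings.bam                                  # the filename
mappings_bam_prefix=my_mappings                                    # filename minus its suffix
mappings_bam_path=/home/dnanexus/in/mappings_bam/my_mappings.bam   # full local path
echo "$mappings_bam_path"
```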
We use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.
We use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It will upload matching files and then associate them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. Earlier we created the folders, and we can now place the outputs there.
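As a minimal sketch, preparing that layout for the counts_txt output might look like this (the readcount.txt filename is a hypothetical stand-in, and the dx-upload-all-outputs call only works on a worker, so it is left as a comment):

```shell
# Create the layout dx-upload-all-outputs expects: /home/dnanexus/out/[VARIABLE]/*
mkdir -p "$HOME/out/counts_txt"

# Stand-in for the real samtools count result; filename is hypothetical.
echo "1971412" > "$HOME/out/counts_txt/readcount.txt"

# On a worker, this would upload the file and associate it with the
# "counts_txt" output declared in dxapp.json:
# dx-upload-all-outputs
```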
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends field.
For additional information, see the execDepends documentation.
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:
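A minimal sketch of how that block might look, using this tutorial's entry-point names (the instance type values are illustrative assumptions, not taken from the source):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "count_func": {"instanceType": "mem1_ssd1_x4"},
      "sum_reads": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```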
The main function slices the initial *.bam file, generating a *.bai index if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference, using the command dx-jobutil-add-output.
This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file as the entry point output.
View full source code on GitHub
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
The main function takes the initial *.bam, generates a *.bai index if needed, and obtains the list of regions from the *.bam file. Batches of 10 regions are sent as input to the count_func entry point using the dx-jobutil-new-job command.
Job outputs from the count_func entry point are referenced as Job Based Object References (JBORs) and used as inputs for the sum_reads entry point. Job outputs of the sum_reads entry point are used as the output of the main entry point via a JBOR reference in the dx-jobutil-add-output command.
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker; as a result, variables from other functions (e.g. main()) will not be accessible here.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command dx-jobutil-add-output.
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt
files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as job output.
In the main function, the output is referenced
View full source code on GitHub
This applet performs a basic samtools view -c {bam}
command, referred to as “SAMtools count”, on the DNAnexus platform.
For bash scripts, inputs to a job execution become environment variables. The inputs from our dxapp.json
file are formatted as shown below:
The object mappings_bam, a DNAnexus link containing the file ID of that file, will be available as an environment variable in the applet's execution. Use the command dx download to download the BAM file. By default, a downloaded file keeps the filename of the object on the platform.
Here, we use the bash helper variable mappings_bam_name
. For file inputs, the DNAnexus platform creates a bash variable [VARIABLE]_name
that holds a string representing the filename of the object on the platform; because we downloaded the file using default parameters, this will be the filename of the object on this worker as well. We use another helper variable, [VARIABLE]_prefix
, the filename of the object minus any suffixes specified in the input field patterns. From the input spec above, the only pattern present is '["*.bam"]'
, so the platform will remove the trailing “.bam” and create the helper variable [VARIABLE]_prefix
for our use.
Use the dx upload command to upload data to the platform. This will upload the file into the job container, a temporary project that holds files associated with the job. When run with the --brief flag, dx upload returns just the file ID.
Note: Job containers are an integral part of the execution process; to learn more, see Containers for Execution.
The output of an applet must be declared before the applet is even built. Looking back at the dxapp.json file, we see the following:
We declared a file-type output named counts_txt. In the applet script, we must tell the system what file should be associated with the output counts_txt. On job completion, usually at the end of the script, this file will be copied from the temporary job container to the project that launched the job.
View full source code on GitHub
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, our training script in TensorFlow needs to include code that saves various data to a log directory where TensorBoard can then find the data to display it.
This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the tensorflow website (external link).
The applet code runs a training script, which is placed in resources/home/dnanexus/
to make it available in the current working directory of the worker, and then it starts tensorboard on port 443 (HTTPS).
We run the training script in the background to start TensorBoard immediately, which will let us see the results while training is still running. This is particularly important for long-running training scripts.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
As with all web apps, the dxapp.json
must include "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose port 443.
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
View full source code on GitHub
This applet performs a basic SAMtools count of alignments present in an input BAM.
The app must have network access to the hostname where the git repository is located. In this example, access.network
is set to:
To learn more about access
and network
fields see Execution Environment Reference.
SAMtools is cloned and built from the SAMtools GitHub repository. Let’s take a closer look at the dxapp.json
file’s runSpec.execDepends
property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, we specify git fetch dependencies for htslib and SAMtools. Dependencies are resolved in the order they're specified; we must list htslib before samtools (and its build_commands), because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:
package_manager
- Specifies the type of dependency and how it is resolved.
url
- Must point to the server containing the repository; in this case, a GitHub URL.
tag
/branch
- Git tag/branch to fetch.
destdir
- Directory on worker to which the git repo is cloned.
build_commands
- If needed, build commands to execute. We know our first dependency, htslib, is built when we build SAMtools; as a result, we only specify “build_commands” for the SAMtools dependency.
Note: build_commands are executed from the destdir; use cd when appropriate.
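Putting those properties together, the execDepends array might look like this (the repository URLs are the real SAMtools/htslib GitHub repositories; the tag value and build command shown here are illustrative assumptions, not copied from the source):

```json
{
  "runSpec": {
    "execDepends": [
      {
        "name": "htslib",
        "package_manager": "git",
        "url": "https://github.com/samtools/htslib.git",
        "tag": "1.9",
        "destdir": "/home/dnanexus"
      },
      {
        "name": "samtools",
        "package_manager": "git",
        "url": "https://github.com/samtools/samtools.git",
        "tag": "1.9",
        "destdir": "/home/dnanexus",
        "build_commands": "cd samtools && make HTSDIR=../htslib"
      }
    ]
  }
}
```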
Because we set "destdir": "/home/dnanexus"
in our dxapp.json
, we know the git repo is cloned to the same directory from which our script will execute. Our example directory’s structure:
Our samtools command from the app script is samtools/samtools
.
Note: We could've built samtools in a destination within our $PATH or added the binary directory to our $PATH. Keep this in mind for your app(let) development.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
The command set -e -x -o pipefail
will assist you in debugging this applet:
-e
causes the shell to immediately exit if a command returns a non-zero exit code.
-x
prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.
-o pipefail
makes a pipeline's return code the first non-zero exit code. (Typically, the return code of a pipeline is the exit code of the last command, which can create difficult-to-debug problems.)
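A quick standalone sketch of what -o pipefail changes (not part of the applet script itself):

```shell
set -e -x -o pipefail

# `false | cat` fails at the first stage but succeeds at the last stage.
# With pipefail, the pipeline reports the first non-zero exit code (1).
# The || guard keeps `set -e` from aborting the script at this point.
pipe_status=0
(false | cat) || pipe_status=$?
echo "pipeline exit code: $pipe_status"
```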
The *.bai file is an optional job input. You can check for an empty or unset var using the bash built-in test [[ -z ${var} ]], then download or create a *.bai index as needed.
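A minimal sketch of that check (mappings_bai_path is a hypothetical helper variable for the optional index input; the samtools call is shown only as a comment):

```shell
# If the optional *.bai input was not provided, its helper variable is unset/empty.
if [[ -z "${mappings_bai_path:-}" ]]; then
    need_index=true    # here the applet would run, e.g.: samtools index "$mappings_bam_path"
else
    need_index=false   # an index was supplied; use it as-is
fi
echo "need to build index: $need_index"
```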
Bash's job control system allows for easy management of multiple processes. In this example, bash commands are run in the background while the maximum number of concurrent executions is controlled in the foreground. You can place a process in the background by appending the character & to a command.
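The pattern can be sketched with stand-in work (sleep/echo) in place of the per-chromosome samtools calls:

```shell
# Launch one background job per chromosome slice, capping concurrency.
max_jobs=2
for chr in chr1 chr2 chr3 chr4; do
    # Foreground control: wait while the maximum number of jobs are running.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        sleep 0.1
    done
    ( sleep 0.2; echo "$chr counted" ) &   # stand-in for a samtools count
done
wait   # block until every background job has finished
echo "all slices processed"
```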
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs, which requires the directory structure below:
In your applet, upload all outputs by:
This is an example web app made with Dash, which in turn uses Flask underneath.
View full source code on GitHub
After configuring an app
with Dash, we start the server on port 443.
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Source dx-toolkit and log in, then run dx-app-wizard with default options.
The dash-asset directory specifies all the packages and versions we need; we take these from the Dash installation guide (https://dash.plot.ly/installation).
We put these into dash-asset/dxasset.json:
Build the asset:
Add this asset to the applet’s dxapp.json:
Now build and run the applet itself:
You can always use dx ssh job-xxxx to ssh into the worker and inspect what's going on or experiment with quick changes. Then go to that job's special URL, https://job-xxxx.dnanexus.cloud/, and see the result!
The main code is in dash-web-app/resources/home/dnanexus/my_app.py
with a local launcher script called local_test.py
in the same folder. This allows us to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.
Install locally the same libraries listed above.
To launch the web app locally:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).
View full source code on GitHub
In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can utilize the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary will run on future workers.
See Cloud Workstation in the App library for more information.
The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory will be packaged, uploaded to the platform, and then unpackaged into the root directory / of the worker. In our case, the resources/ dir is structured as follows:
When this applet is run on a worker, the resources/
directory will be placed in the worker’s root directory /
:
We are able to access the SAMtools command because the respective binary is visible from the default $PATH
variable. The directory /usr/bin/
is part of the $PATH
variable, so in our script we can reference the samtools command directly:
View full source code on GitHub
This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipe) special files, run the command man fifo in your shell.
Warning: Named pipes require BOTH a stdin and a stdout, or they will block a process. In these examples, we place incomplete named pipes in background processes so the foreground script process does not block.
To approach this use case, let’s focus on what we want our applet to do:
Stream the BAM file from the platform to a worker.
As the BAM is streamed, count the number of reads present.
Output the result into a file.
Stream the result file to the platform.
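The four steps above can be sketched end-to-end with standard tools, using cat and wc -l as stand-ins for dx cat and samtools view -c (neither of which runs off-platform):

```shell
workdir=$(mktemp -d)
printf 'read1\nread2\nread3\n' > "$workdir/input.sam"   # toy stand-in for the BAM

# Named pipes for the download stream and the result stream.
mkfifo "$workdir/bam_fifo" "$workdir/count_fifo"

# Background: stream the "download" into the input FIFO
# (a real applet would use something like: dx cat "$bam_id" > bam_fifo).
cat "$workdir/input.sam" > "$workdir/bam_fifo" &

# Background: count the stream, writing into the output FIFO
# (stand-in for: samtools view -c bam_fifo > count_fifo).
wc -l < "$workdir/bam_fifo" > "$workdir/count_fifo" &

# Foreground "upload": reading the output FIFO unblocks the whole pipeline
# (a real applet would use: dx upload - < count_fifo).
count=$(cat "$workdir/count_fifo")
wait
echo "read count: $count"
```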
First, we establish a named pipe on the worker. Then, we stream to stdin of the named pipe and download the file as a stream from the platform using dx cat
.
FIFO        stdin   stdout
BAM file    YES     NO
Now that we have created our FIFO special file representing the streamed BAM, we can just call the samtools
command as we normally would. The samtools
command reading the BAM would provide our BAM FIFO file with a stdout. However, keep in mind that we want to stream the output back to the platform. We must create a named pipe representing our output file too.
FIFO          stdin   stdout
BAM file      YES     YES
output file   YES     NO
The directory structure created here (~/out/counts_txt) is required to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> will be uploaded to the corresponding <output name> specified in the dxapp.json.
So far, we've established a stream from the platform, piped the stream into a samtools command, and output the results to another named pipe. However, our background process is still blocked, since we lack a stdout for our output file. Creating an upload stream to the platform will resolve this.
We can upload as a stream to the platform using the commands dx-upload-all-outputs or dx upload -. Make sure to specify --buffer-size if needed.
FIFO          stdin   stdout
BAM file      YES     YES
output file   YES     YES
Note: Alternatively, dx upload - can upload directly from stdin; in that case, we would no longer need the directory structure required for dx-upload-all-outputs.
Warning: When uploading a file that exists on disk, dx upload knows the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not known in advance, and dx upload uses default parameters. While these defaults are fine for most use cases, you may need to specify the upload part size with the --buffer-size option.
Now that our background processes are no longer blocking the rest of the applet's execution, we simply wait in the foreground for those processes to finish.
Note: If we didn't wait, the app script running in the foreground would finish and terminate the job! We wouldn't want that.
The SAMtools compiled binary is placed directly in the <applet dir>/resources
directory. Any files found in the resources/
directory will be uploaded so that they will be present in the worker’s root directory. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so we can reference the samtools command directly in our script as samtools view -c ...
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation.
The dxpy.download_all_inputs()
function downloads all input files into the /home/dnanexus/in
directory. A folder will be created for each input and the file(s) will be downloaded to that directory. For convenience, the dxpy.download_all_inputs
function returns a dictionary containing the following keys:
<var>_path
(string): full absolute path to where the file was downloaded.
<var>_name
(string): name of the file, including extension.
<var>_prefix
(string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set our number of workers to 10
and set the workload per thread to 1
chromosome at a time. There are various ways to achieve multithreaded processing in python. For the sake of simplicity, we use multiprocessing.dummy
, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen
call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>)
function to call the helper function run_cmd
for each string in the iterable of view commands. Because we perform our multithreaded processing using subprocess.Popen
, we will not be alerted to any failed processes. We verify our closed workers in the verify_pool_status
helper function.
Important: In this example we use subprocess.Popen
to process and verify our results in verify_pool_status
. In general, it is considered good practice to use python’s built-in subprocess convenience functions. In this case, subprocess.check_call
would achieve the same goal.
Each worker returns a read count of just one region in the BAM file. We sum and output the results as the job output. We use the dx-toolkit python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our result file. For python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
This applet slices a BAM file by canonical chromosome, then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory will be uploaded so that they will be present in the root directory of the worker. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so in our script, we can reference the samtools command directly, as in samtools view -c ...
First, we download our BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.
In order to split a BAM by regions, we need to have a *.bai
index. You can either create an app(let) which takes the *.bai
as an input or generate a *.bai
in the applet. In this tutorial, we generate the *.bai
in the applet, sorting the BAM if necessary.
In the previous section, we recorded the name of each sliced BAM file into a record file. Now we will perform a samtools view -c
on each slice using the record file as input.
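The fan-out can be sketched with wc -l standing in for samtools view -c on each recorded slice (the file names, contents, and counts here are toy stand-ins):

```shell
workdir=$(mktemp -d)

# Create toy "sliced" files and record their names, one per line.
for chr in chr1 chr2 chr3; do
    printf 'read\nread\n' > "$workdir/$chr.txt"
    echo "$workdir/$chr.txt"
done > "$workdir/slices.txt"

# -P 3 runs up to three counts in parallel; -I {} substitutes each file name.
xargs -P 3 -I {} sh -c 'wc -l < "{}"' < "$workdir/slices.txt" > "$workdir/counts.txt"

# Sum the per-slice counts into the final total.
total=$(awk '{s += $1} END {print s}' "$workdir/counts.txt")
echo "total reads: $total"
```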
The results file is uploaded using the standard bash process:
Upload the file to the job execution's container.
Provide the DNAnexus link as the job's output using the script dx-jobutil-add-output <output name>.
This is an example web applet that demonstrates how to build and run an R Shiny application on DNAnexus.
View full source code on GitHub
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
R Shiny needs two scripts, server.R
and ui.R
, which should be under resources/home/dnanexus/my_app/
. When a job starts based on this applet, the resources
directory is copied onto the worker, and since the ~/
path on the worker is /home/dnanexus
, that means you now have ~/my_app
with those two scripts inside.
From the main applet script, code.sh, we simply start Shiny pointing to the ~/my_app directory, serving its mini-application on port 443.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
To make your own applet with R Shiny, simply copy the source code from this example and modify server.R
and ui.R
inside resources/home/dnanexus/my_app
.
To build the asset, run the dx build_asset
command and pass shiny-asset
, i.e. the name of the directory holding dxasset.json
:
This will output a record ID record-xxxx
that you can then put into the applet’s dxapp.json
in place of the existing one:
Now build and run the applet itself:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
field.
This applet downloads all inputs at once using dxpy.download_all_inputs
:
We process in parallel using the python multiprocessing
module using a rather simple pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For more detailed overview of the multiprocessing
module, visit the python docs.
We create several helpers in our applet script to manage our workload. One helper you may have seen before is run_cmd; we use this function to manage our subprocess calls:
Before we can split our workload, we need to know what regions are present in our BAM input file. We handle this initial parsing in the parse_sam_header_for_region
function:
Once our workload is split and we’ve started processing, we wait and review the status of each Pool
worker. Then, we merge and output our results.
Note: The run_cmd
function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. We parse these outputs from our workers to determine whether the run failed or passed.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation .
Distributed python-interpreter apps use python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, we split and merge our files on basic mem1_ssd1_x2 instances and perform our more intensive processing step on a mem1_ssd1_x4 instance. Instance types can be set in the dxapp.json file's runSpec.systemRequirements:
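A sketch of the corresponding block, using this tutorial's entry-point names and the instance types described above (exact values may differ from the source):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```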
The main function scatters by region bins based on user input. If no *.bai
file is present, the applet generates an index *.bai
.
Regions bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob
function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
This entry point downloads and creates a samtools view -c
command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs()
is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed; it is exaggerated here for tutorial purposes, since you can also pass types other than file, such as int.
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, keep in mind that it doesn't do any heavy lifting or processing itself. Notice in the runSpec JSON above that we start with a lightweight instance, scale up for the processing entry point, then scale down for the gathering step.
This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.
View full source code on GitHub
Pysam is provided through a pip3 install
using the pip3 package manager in the dxapp.json
’s runSpec.execDepends
property:
The execDepends
value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, we specify pip3
as our package manager and pysam version 0.15.4
as the dependency to resolve.
The fields mappings_sorted_bam
and mappings_sorted_bai
are passed to the main function as parameters for our job. These parameters are dictionary objects with key-value pair {"$dnanexus_link": "<file>-<xxxx>"}
. We handle file objects from the platform through DXFile handles. If an index file is not supplied, then a *.bai
index will be created.
Pysam provides several methods that mimic SAMtools commands. In our applet example, we want to focus only on canonical chromosomes. Pysam’s object representation of a BAM file is pysam.AlignmentFile
.
The helper function get_chr returns the list of canonical chromosomes. Once we establish this list, we can iterate over it and perform Pysam's version of samtools view -c, pysam.AlignmentFile.count.
Our summarized counts are returned as the job output. We use the dx-toolkit
python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our tabulated result file.
Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
This is an example web app made with Dash, which in turn uses Flask underneath.
View full source code on GitHub
After configuring an app
with Dash, we start the server on port 443.
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Source dx-toolkit
and log in, then run dx-app-wizard
with default options.
dash-asset specifies all the packages and versions we need. We take these from the Dash installation guide (https://dash.plot.ly/installation\)
We put these into dash-asset/dxasset.json:
Build the asset:
Add this asset to the applet’s dxapp.json:
Now build and run the applet itself:
You can always use dx ssh job-xxxx
to ssh into the worker and inspect what’s going on or experiment with quick changes Then go to that job’s special URL https://job-xxxx.dnanexus.cloud/ and see the result!
The main code is in dash-web-app/resources/home/dnanexus/my_app.py, with a local launcher script called local_test.py in the same folder. This allows us to launch the same core code in the applet locally to quickly iterate. This is optional, because you can also do all testing on the platform itself.
Install the same libraries listed above locally.
To launch the web app locally:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.
There are many definitions and approaches to tackling the concept of parallelization and distributing workloads in the cloud (Here’s a particularly helpful Stack Exchange post on the subject). To help make our documentation easier to understand, when discussing concurrent computing paradigms we’ll refer to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in our case instances in the cloud) that communicate to concurrently process a workload.
Keep these definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
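To make the "parallel" definition concrete, here is a small single-machine sketch that splits a workload across threads (the chunk size and worker function are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real per-chunk work (e.g. counting reads in a region).
    return sum(chunk)

data = list(range(100))
# Split the workload into chunks and process them concurrently on one machine.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_chunk, chunks))

total = sum(partial_results)  # combine the concurrent results
```

A distributed version of the same computation would instead send each chunk to a separate cloud instance (for example, a DNAnexus subjob) and merge the partial results in a final gathering stage.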
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
View full source code on GitHub
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, our training script in TensorFlow needs to include code that saves various data to a log directory where TensorBoard can then find the data to display it.
This example uses a script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website (external link).
The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).
We run the training script in the background to start TensorBoard immediately, which will let us see the results while training is still running. This is particularly important for long-running training scripts.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet.
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
Learn to use the dx client for command-line access to the full range of DNAnexus Platform features.
The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform; to upload, browse, and organize data; and to launch analyses.
All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.
As you work, you can use this index of dx commands as a reference.
At the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.
To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:
To use the command-line interface (CLI), make sure you've installed the DNAnexus Software Development Kit (SDK) available here.
To update your version of the command-line tool, you can run the command dx upgrade.
The first thing you'll need to do is to log in. If you haven't created a DNAnexus account yet, visit the website and sign up. User signup is not supported on the command line.
Your authentication token and your current project settings have now been saved in a local configuration file, and you're ready to start accessing your project.
Let's look inside some of the public projects that have already been set up. From the command line, enter the command:
By running the dx select command and picking a project, you've now done the command-line equivalent of going to the project page for Reference Genome Files: AWS US (East) (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular genomes for you to use when running analyses with your own data.
For more information about the dx select command, please see the Changing Your Current Project page.
Now you can list all of the data in the top-level directory of the project you've just selected by running the command dx ls. You can also see the contents of a folder by running the command dx ls <folder_name>.
You can avoid typing out the full name of the folder by typing dx ls C and then pressing <TAB>. The folder name will auto-complete from there.
You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, we list the contents of the publicly available project "Demo Data" using both its name and ID.
As shown above, you can use the -l flag in conjunction with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.
You can use the dx describe command to learn more about files and other objects on the platform. Given a DNAnexus object ID or name, dx describe will return detailed information about the object in question. dx describe will only return results for data objects to which you have access.
Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.
Describing a File
Below, we describe the reference genome file for C. elegans located in the "Reference Genome Files: AWS US (East)" project that we've been using (which should be accessible from other regions as well). Note that you need to add a colon (:) after the project name; here, that would be Reference Genome Files\: AWS US (East):
Describing a Project
Below, we describe the publicly available Reference Genome Files project that we've been using.
Now, we'll use the command dx new project to create a new project.
The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the Entity IDs page.
You're now ready to start uploading your data and running your own analyses.
If you have a sample you would like to analyze, you can use the dx upload command, or the Upload Agent if you have installed it. For the purposes of this tutorial, you can also download the file small-celegans-sample.fastq, which contains the first 25,000 C. elegans reads from SRR070372. We will use this file again later to run through a sample analysis.
For uploading multiple or large files, we strongly recommend that you use the Upload Agent; it will compress your files and upload them in parallel over multiple HTTP connections and boasts other features such as resumable uploads.
The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until it has finished uploading the data before returning the prompt and describing the result.
To take a quick look at the first few lines of the file you just uploaded, use the dx head command. By default, it prints the first 10 lines of the given file.
Let's run it on the file we just uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.
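The arithmetic behind "12 lines = 3 reads" is that each FASTQ record is exactly four lines: a header, the sequence, a "+" separator, and the quality string. A quick sketch with a made-up three-read FASTQ:

```python
# A FASTQ record is 4 lines: @header, sequence, "+", quality string.
# The read names and sequences below are made up for illustration.
fastq_text = """@read1
ACGTACGT
+
FFFFFFFF
@read2
TTGGCCAA
+
FFFF:FFF
@read3
GGCCTTAA
+
:FFFFFFF
"""

lines = fastq_text.strip().split("\n")
# Group every 4 consecutive lines into one read record,
# which is what `dx head -n 12` implicitly relies on.
records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
```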
If you'd like to download a file from the platform, just use the dx download command. This command will use the name of the file for the filename unless you specify your own with the -o/--output flag. In the example below, we download the same C. elegans file that we uploaded previously.
Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".
For the next few steps, if you would like to follow along, you will need a C. elegans FASTQ file. We will map the reads against the ce10 genome. If you haven't already, you can download and use the following FASTQ file, which contains the first 25,000 reads from SRR070372: small-celegans-sample.fastq.
The following walkthrough is helpful if you would like to understand what all the commands do and take a look at what apps you're running. If you're just interested in converting a gzipped FASTQ file to a VCF file via BWA and the FreeBayes variant caller, you can skip ahead to the Automate It section below, where you can see all the commands necessary for running apps.
If you have not yet done so, you can upload a FASTQ file for analysis.
For more information about using the command dx upload, please see the dx upload page.
Next, use the BWA-MEM app (platform login required to access this link) to map the uploaded reads file to a reference genome.
If you don't know the command-line name of the app you would like to run, you have two options:
You can navigate to its web page from the Apps page (platform login required to access this link) on the platform. The app's page will tell you how to run it from the command line. You can find more information about the app we're running on the BWA-MEM FASTQ Read Mapper page (platform login required to access this link).
Alternatively, you can search for apps from the command line by running the command dx find apps. You will find the name of the app that you can use on the command line in the parentheses (underlined below).
Now install the app using dx install and check that it has been installed. While you do not always need to install an app to run it, you may find it useful as a bookmarking tool.
We can now run the app using dx run. We will run it without any arguments; it will then prompt us for required and then optional arguments. Note that the reference file genomeindex_targz for the C. elegans sample we are using is in a .tar.gz format and can be found in the Reference Genome folder of the region your project is in.
You can use the command dx watch to monitor jobs. The command will print out the log file of the job, including the STDOUT, STDERR, and INFO printouts.
You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, you can use the command dx find jobs to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.
There are also additional options that you can use to restrict your search of previous jobs, such as by their names or when they were run.
If for some reason you need to terminate your job before it completes, use the command dx terminate.
You should now see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!
You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.
This time, we won't rely on the interactive mode to enter our inputs. Instead, we will provide them directly. But first, let's look up the app's spec so we know what the inputs are called. For this, let's run the command dx run freebayes -h.
Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. You'll notice that there are two required inputs that must be specified:
Sorted mappings (sorted_bams): A list of files with a .bam extension.
Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.
Running the App with a One-Liner Using a Job-Based Object Reference
It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. We will use the -i flag to specify inputs as suggested by the output of dx run freebayes -h:
sorted_bams: The output of the previous BWA step (see the Map Reads section for more information).
genome_fastagz: The ce10 genome in the Reference Genomes project.
To specify new job input using the output of a previous job, we'll use a job-based object reference (JBOR) via the job-xxxx:<output field> syntax we used earlier.
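For reference, in the API a JBOR is represented as a link dict naming the producing job and its output field; a hand-built sketch (the job ID stays a placeholder, and the field name sorted_bam is illustrative):

```python
# Hypothetical sketch of the dict form of a job-based object reference (JBOR).
# On the command line this is written job-xxxx:<output field>; in JSON input
# it expands to a $dnanexus_link naming the job and the output field.
def jbor(job_id, field):
    return {"$dnanexus_link": {"job": job_id, "field": field}}

run_input = {
    # Field name "sorted_bam" is illustrative; check the producing
    # app's outputSpec for the real name.
    "sorted_bams": [jbor("job-xxxx", "sorted_bam")],
}
```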
Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.
Automatically Running a Command After a Job Finishes
You can use the command dx wait to wait for a job to finish. If we run the following command right after running the FreeBayes app, it will show the recent jobs only after the job has finished, as shown in the example below.
Congratulations! You have now called variants on a reads sample, and you did it all on the command line. Now let's look at how you can automate this process.
The beauty of the CLI is the ability to automate processes. In fact, we can automate everything we just did. The following script assumes that you've already logged in; it is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.
You're now ready to start scripting using dx. As shown in some of the examples above, the --brief flag can come in handy for scripting. A list of all dx commands and flags is on the Index of dx Commands page.
For more detailed information about running apps and applets from the command line, see the Running Apps and Applets page.
For a comprehensive guide to the DNAnexus SDK, see the SDK documentation.
Want to start writing your own apps? Check out the Developer Portal for some useful tutorials.
Learn to upload data, create a project, run an analysis, and visualize results.
See these Key Concepts pages to learn more about how the DNAnexus Platform works, and how to get the most from it:
Get up and running quickly using the Platform via both its user interface (UI) and its command-line interface (CLI):
Learn the basics of developing for the Platform:
Get to know features you’ll use every day, in these short, task-oriented tutorials.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a step-by-step written tutorial to using the Platform via its UI, see this User Interface Quickstart.
For a step-by-step written tutorial to using the Platform via its CLI, see this Command Line Quickstart.
For a more in-depth video intro to the Platform, watch this DNAnexus Platform Essentials video.
As a developer, you may be interested in the following:
As a bioinformatician, you may be interested in our SAIGE GWAS walkthrough, and other content grouped in our Science Corner.
Get to know features you’ll use every day, in a series of short, task-oriented tutorials.
Learn to access and use the Platform via both its command-line interface and its user interface.
Learn to manage data, users, and work on the Platform, via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.
This section is targeted towards organizational leads who have the permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the platform.
Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.
Get details on new features, changes, and bug fixes for each Platform and toolkit release.
Every analysis in DNAnexus is run using apps. Apps can be linked together to create workflows. Learn the basics of using both.
The Tools Library provides a list of available apps and workflows. To see this list, select Tools Library from the Tools entry in the main Platform menu.
To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:
To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row will be highlighted in blue; the tool's inputs and outputs will be displayed in a pane to the right of the list:
To make sure you can find a tool later, "pin" it to the top of the list. Click the "..." icon at the far right end of the row showing the tool's name and key details about it. Then click Add Pin:
To learn more about a tool, click on its name in the list. The tool's detail page will open, showing a wide range of info, including guidance in how to use it, version history, pricing, and more:
You can quickly launch the latest version of any given tool from the Tools Library page. Or, navigate into the Details page of the tool and click the Run button:
From within a project, navigate to the Manage pane, then click the Start Analysis button.
A dialogue window will open, showing a list of tools. These will include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:
Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in their folder location, and click Run.
Confirm details of the tool you are about to run. Note that selection of a project location is required for any tool to be run. You will need at minimum Contributor access level to the project.
The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.
You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.
To configure instance type settings for a given tool or stage, click the Instance Type icon located on the top-right corner of the stage.
To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.
The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.
Once all required inputs have been configured, the page will indicate that the run is ready to start. Click on Start Analysis to proceed to the final step.
As the last step before launching the tool, you can review and confirm various runtime settings, including execution name, output location, priority, job rank, spending limit, etc. You can also review and modify instance type settings before starting the run.
Once you have confirmed final details, click Launch Analysis to start the run.
Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.
To enable batch run, start from any input that you wish to specify for batch run, and open its I/O Options menu on the right hand side. From the list of options available, select Enable Batch Run.
Input fields with batch run enabled will be highlighted with a Batch label. Click any of the batch enabled input fields to enter the batch run configuration page.
Batch run support by input class:
Files and other data objects: Yes
Files and other data objects (array): Partially supported; can accept entry of a single-value array
String: Yes
Integer: Yes
Float: Yes
Boolean: Yes
String (array): No
Integer (array): No
Float (array): No
Boolean (array): No
Hash: No
The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.
As with configuring inputs for non-batch runs, you will need to fill in all the required input fields to proceed to the next step. Optional inputs, or required inputs with a predefined default value, can be left empty.
Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.
Once you've finished setting up your tool, start your analysis by clicking the Start Analysis button. Follow these instructions to monitor the job as it runs.
Learn in depth about running apps and workflows, leveraging advanced techniques like Smart Reuse.
Learn how to build a simple app.
Learn more about building apps using Bash or Python.
Learn in depth about building and deploying apps, including Spark apps.
Learn in depth about importing, building, and running workflows.
Learn about organizations, which associate users, projects, and resources with one another, enabling fluid collaboration, and simplifying the management of access, sharing, and billing.
An organization (or "org") is a DNAnexus entity used to manage a group of users. Use orgs to group users, projects, and other resources together, in a way that models real-world collaborative structures.
In its simplest form, an org can be thought of as a group of users working on the same project. An org can be used to efficiently share projects and data with multiple users, and, if necessary, to revoke access.
Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting sales@dnanexus.com.
Orgs are referenced on the DNAnexus Platform by a unique org ID (e.g., org-dnanexus). Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.
Users may have one of two membership levels in an org: ADMIN or MEMBER.
An ADMIN-level user is granted all possible access in the org and may perform org administrative functions (e.g. adding/removing users or modifying org policies). A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses in the org and has no administrative power in the org.
A user with MEMBER level can be configured with a subset of the following org accesses. These access levels determine which actions each user can perform in an org.
Billable activities access: If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org. Options: Allowed or Not Allowed.
Shared apps access: If allowed, the org member will have access to view and run apps for which the org has been added as an "authorized user". Options: Allowed or Not Allowed.
Shared projects access: The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member will have at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project. Options: NONE, VIEW, UPLOAD, CONTRIBUTE, or ADMINISTER.
These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.
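The "at most" rule for shared projects access behaves like taking the weaker of two levels on an ordered scale. A small sketch of that logic (the function name and ordering list are illustrative, not a DNAnexus API):

```python
# Hypothetical sketch of how a member's shared-projects cap limits the
# access an org-level project grant confers. Levels are ordered weakest
# to strongest, matching the options listed above.
LEVELS = ["NONE", "VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

def effective_access(member_cap, org_project_access):
    """Effective access is the weaker of the member's cap and the org's grant."""
    return LEVELS[min(LEVELS.index(member_cap), LEVELS.index(org_project_access))]
```

For the example in the text, a member capped at UPLOAD in a project shared with the org at ADMINISTER ends up with UPLOAD.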
Org admins are granted all possible access in the org. More specifically, org admins receive the following set of accesses:
Billable activities access: Allowed
Shared apps access: Allowed
Shared projects access: ADMINISTER
In addition to the access listed above, org admins have the following additional abilities:
Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin will be able to see all projects billed to the org, including Project-P. The org admin can also invite themselves to Project-P at any time to get access to objects and jobs in the project.
Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A; however, any org admin may add themselves as a developer at any time.
In the diagram below, there are 3 examples of how organizations can be structured.
ORG-1
The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. There is one admin who manages ORG-1, represented here as user A.
ORG-2 and ORG-3
The second example shows ORG-2 and ORG-3, demonstrating a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.
In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.
ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).
In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.
You can create a non-billable org as an alias for a group of users. For example, you have a group of users who all need access to a shared dataset. You can make an org which represents all the users who need access to the dataset, for example an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org will have at least VIEW "shared project access" and "shared app access" so that they will all be given permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, and will then no longer have access to any projects or apps shared with org-dataset_access.
You can request sales@dnanexus.com to create a billable org where only one member, the org admin, can create new org projects. All other org members will not be granted the "billable activities access", and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER) and share every org project with the org with ADMINISTER access. The members' permissions to the projects will be restricted by their respective "shared project access."
For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group will only be given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you simply need to add them to the org as a new member and assign them the appropriate permission levels.
This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.
You can request sales@dnanexus.com to create a billable org where users work independently and bill all of their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources (e.g. incoming samples or reference datasets).
In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.
The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked simply by removing them from the org.
In general, it is possible to apply many different schemas to orgs as they were designed for many different real-life collaborative structures. If you have a type of collaboration you would like to support, contact support@dnanexus.com for more information about how orgs can work for you.
If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.
The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see all of your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.
Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.
You can find the org overview on the Settings tab. From here, you can:
View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).
View the organization ID, the unique ID used to reference a particular org on the CLI (e.g. org-demo_org).
View the number of org members, org projects, and org apps.
View the list of organization admins.
Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.
From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.
To add an existing DNAnexus user to your org, you can use the + Invite New Member button from the org's Members tab. This opens a screen where you can enter the user's username (e.g., smithj) or user ID (e.g., user-smithj). Then you can configure the user's access level in the org.
If you add a member to the org with billable activities access set to billing allowed, they will have the ability to create new projects billed to the org.
However, adding the member will not change their default billing account. If the user wishes to use the org as their default billing account, they will have to set their own default billing account.
Additionally, if the member has any pre-existing projects that are not billed to the org, the user will need to transfer the project to an org if they wish to have the project billed to the org.
The user will receive an email notification informing them that they have been added to the organization.
Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user will then receive an email with instructions to activate their account and set their password.
If this feature has already been turned on for an org you administer, you will see an option to Create New User when you go to invite a new member.
Here you can specify a username (e.g. alice
or smithj
), the new user's name, and their email address. The system will automatically create a new user account for the given email address and add them as a member in the org.
If you create a new user and set their Billable Activities Access to Billing Allowed, we recommend that you set the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.
From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.
When you edit multiple members, you can change a single access level while leaving the others unchanged.
From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.
Removing a member revokes the user's access to all projects and apps billed to or shared with the org.
The org's Projects tab lets you see the list of all projects billed to the org. This list includes all projects in which you have VIEW or higher permissions, as well as projects billed to the org in which you have no permissions (i.e., projects of which you are not a member).
You can view all project metadata (e.g. the list of members, data usage, creation date), as well as some other optional columns (e.g. project creator). To enable the optional columns, select the column from the dropdown menu to the right of the column names.
In addition to viewing the list of projects, org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you will still be able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.
You can also grant yourself ADMINISTER permissions if you are currently a member of a project billed to your org but you only have VIEW, CONTRIBUTE, or UPLOAD permissions.
To access your org's billing information:
From the global menu shown at the top of the screen, click Org Admin and then click the name of the org you want to view.
Click Billing from the org details screen to view billing information.
To set up or update the billing information for an org you administer, contact billing@dnanexus.com.
If you are an org admin, you can set or modify a spending limit for the org's usage charges. To do this:
Click on Org Admin from the global menu bar and then click on the name of the org for which you'd like to set or modify a spending limit.
Once on your org page, click on the Billing tab.
Click the Set / Update Spending Limit link in the Funds Left box to contact DNAnexus Support, requesting a spending limit change. Doing this only submits your request. DNAnexus Support may follow up with you via email with questions about the change, before approving the request.
The Funds Left section shows how much of the org's spending limit remains. If your org does not have a spending limit, spending is unlimited, which shows up as “N/A.”
The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide a wide range of detail on charges incurred by org members. See this documentation for more information.
Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:
Policy
Description
Options
Membership List Visibility
Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.
[ADMIN], [MEMBER], or [PUBLIC]
Project Transfer
Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).
[ADMIN] or [MEMBER]
Project Sharing
Dictates the minimum org membership level allowed for a user to invite that org to a project.
[ADMIN] or [MEMBER]
DNAnexus recommends, as a starting point, restricting both the Membership List Visibility policy and the Project Transfer policy to ADMIN. This ensures that only org admins can see the list of members and their access within the org, and that org projects always remain under the org's control.
You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here, you can both change the membership list visibility and restrict project transfer policies for the org and contact DNAnexus support to enable PHI data policies for org projects.
Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.
Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the app's billing account is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.
Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.
Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Note that ADMINISTER is a type of access level.
Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.
Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and will not have any org projects or org apps. Any user can share a project with a non-billable org.
Org access is granted to a user to determine which actions the user can perform in an org.
Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.
Org app is an app billed to an org.
Org ID is the unique ID used to reference a particular org on the DNAnexus Platform (e.g. org-dnanexus
).
Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.
Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.
Org project describes a project billed to an org.
Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.
Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.
Share with an org means to give the members of an org access to a project or app via giving the org access to the project or adding the org as an "authorized user" of an app.
Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."
Shared projects access is an org access level that can be granted to org members: the maximum access level a user can have in projects shared with an org.
Learn in depth about setting up and managing orgs as an administrator.
Learn about what you can do as an org member.
Learn about creating and managing orgs as a developer, via the DNAnexus API.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
file’s runSpec.execDepends
.
For additional information, see execDepends
.
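As a sketch, the relevant fragment of the dxapp.json might look like the following (only the execDepends portion is shown; by default, packages listed this way are installed with Apt-Get on the worker):

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
```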
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json
file’s runSpec.systemRequirements
:
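A sketch of what this might look like, with illustrative instance-type choices for each entry point (the key "*" can also be used to set a default for all entry points):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x4"},
      "count_func": {"instanceType": "mem1_ssd1_x4"},
      "sum_reads": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```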
The main function slices the initial *.bam
file and generates an index *.bai
if needed. The input *.bam
is then sliced into smaller *.bam
files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output
.
This entry point downloads and runs the command samtools view -c
on the sliced *.bam
. The generated counts_txt
output file is uploaded as the entry point’s job output via the command dx-jobutil-add-output
.
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file
as the entry point output.
Sliced *.bam
files are uploaded and their file IDs are passed to the count_func entry point.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation .
Distributed python-interpreter apps use python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, we split and merge our files on basic mem1_ssd1_x2 instances and perform our own, more intensive, processing step on a mem1_ssd1_x4 instance. Instance type can be set in the dxapp.json runSpec.systemRequirements
:
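A sketch of the systemRequirements fragment matching the description above — the split (main) and merge (combine_files) steps on mem1_ssd1_x2 instances, and the processing step (samtoolscount_bam) on a mem1_ssd1_x4 instance:

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```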
The main function scatters by region bins based on user input. If no *.bai
file is present, the applet generates an index *.bai
.
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob
function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
This entry point downloads and creates a samtools view -c
command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs()
is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}
, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed, deliberately exaggerating the pattern for tutorial purposes. You can also pass types other than file, such as int.
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, keep in mind the main entry point doesn’t do any heavy lifting or processing. Notice in the .runSpec
json above we start with a lightweight instance, scale up for the processing entry point, then finally scale down for the gathering step.
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts must be structured to parallelize the work. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation.
The dxpy.download_all_inputs()
function downloads all input files into the /home/dnanexus/in
directory. A folder will be created for each input and the file(s) will be downloaded to that directory. For convenience, the dxpy.download_all_inputs
function returns a dictionary containing the following keys:
<var>_path
(string): full absolute path to where the file was downloaded.
<var>_name
(string): name of the file, including extension.
<var>_prefix
(string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set our number of workers to 10
and set the workload per thread to 1
chromosome at a time. There are various ways to achieve multithreaded processing in python. For the sake of simplicity, we use multiprocessing.dummy
, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen
call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>)
function to call the helper function run_cmd
for each string in the iterable of view commands. Because we perform our multithreaded processing using subprocess.Popen
, we will not be alerted to any failed processes. We verify our closed workers in the verify_pool_status
helper function.
Important: In this example we use subprocess.Popen
to process and verify our results in verify_pool_status
. In general, it is considered good practice to use python’s built-in subprocess convenience functions. In this case, subprocess.check_call
would achieve the same goal.
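A minimal sketch of this pattern, with echo standing in for the samtools view -c calls so the sketch runs even where samtools is not installed (the run_cmd and verify_pool_status names follow the applet script described above; the commands and counts are illustrative):

```python
import subprocess
from multiprocessing.dummy import Pool  # thread-based wrapper around threading

def run_cmd(cmd):
    # Launch cmd in a subprocess and capture stdout, stderr, and exit code.
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

def verify_pool_status(results):
    # subprocess.Popen does not raise on failure, so check every exit code.
    if any(code != 0 for _, _, code in results):
        raise RuntimeError("one or more workers failed")

# echo stands in for `samtools view -c <bam> <region>` here.
view_cmds = ["echo %d" % n for n in (11, 22, 33)]
pool = Pool(3)
results = pool.map(run_cmd, view_cmds)  # map preserves input order
pool.close()
pool.join()
verify_pool_status(results)
counts = [int(out.decode().strip()) for out, _, _ in results]
```

In a real applet the iterable of commands would be built from the BAM's regions, and the summed counts would become the job output.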
Each worker returns a read count of just one region in the BAM file. We sum and output the results as the job output. We use the dx-toolkit python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our result file. For python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
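A sketch of just the output-dictionary shape (the file ID is a hypothetical placeholder for the DXFile that dxpy.upload_local_file would return inside a real job; dxpy.dxlink produces the same $dnanexus_link hash shown here):

```python
# Hypothetical ID standing in for the result of
# dxpy.upload_local_file("read_count.txt") inside a running job.
file_id = "file-xxxx"

# dxpy.dxlink(file_id) yields a hash of this form.
counts_txt_link = {"$dnanexus_link": file_id}

# Python applet outputs are a dict keyed by the output names
# declared in the dxapp.json output spec.
output = {"counts_txt": counts_txt_link}
```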
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts must be structured to parallelize the work. This applet tutorial will:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
field.
This applet downloads all inputs at once using dxpy.download_all_inputs
:
We process in parallel using the python multiprocessing
module using a rather simple pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For more detailed overview of the multiprocessing
module, visit the python docs.
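The pattern boils down to: create a pool, map the work over it, close the pool, and join. A minimal sketch, using the thread-backed ThreadPool from the multiprocessing package so it runs anywhere (multiprocessing.Pool follows the same create/map/close/join pattern; process_region is an illustrative stand-in for the real per-region work):

```python
from multiprocessing.pool import ThreadPool

def process_region(region):
    # Stand-in for the per-region work (e.g. a samtools count).
    return len(region)

regions = ["chr1", "chr2", "chrX"]
pool = ThreadPool(2)                          # create the pool
results = pool.map(process_region, regions)   # scatter the work
pool.close()                                  # no more tasks
pool.join()                                   # wait for all workers
```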
We create several helpers in our applet script to manage our workload. One helper you may have seen before is run_cmd
; we use this function to manage our subprocess calls:
Before we can split our workload, we need to know what regions are present in our BAM input file. We handle this initial parsing in the parse_sam_header_for_region
function:
Once our workload is split and we’ve started processing, we wait and review the status of each Pool
worker. Then, we merge and output our results.
Note: The run_cmd
function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. We parse these outputs from our workers to determine whether the run failed or passed.
This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources
directory. Any files found in the resources/
directory will be uploaded so that they will be present in the root directory of the worker. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so in our script, we can reference the samtools command directly, as in samtools view -c ...
First, we download our BAM file and slice it by canonical chromosome, writing the *.bam
file names to another file.
In order to split a BAM by regions, we need to have a *.bai
index. You can either create an app(let) which takes the *.bai
as an input or generate a *.bai
in the applet. In this tutorial, we generate the *.bai
in the applet, sorting the BAM if necessary.
In the previous section, we recorded the name of each sliced BAM file into a record file. Now we will perform a samtools view -c
on each slice using the record file as input.
The results file is uploaded using the standard bash process:
Upload a file to the job execution’s container.
Provide the DNAnexus link as a job’s output using the script dx-jobutil-add-output <output name>
In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).
To use the CLI, you'll need to download and install the dx
command-line client.
If you're not familiar with the dx
client, read the Command-Line Quickstart.
This section provides detailed instructions on using the dx
client to perform such common actions as logging in; selecting projects; listing, copying, moving, and deleting objects; and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
The command set -e -x -o pipefail
will assist you in debugging this applet:
-e
causes the shell to immediately exit if a command returns a non-zero exit code.
-x
prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.
-o pipefail
makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)
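A quick demonstration of the pipefail behavior (assuming bash is available; false | true is a stand-in for a failing command piped into a succeeding one):

```shell
# Without pipefail, a pipeline's exit code is that of its last
# command; with pipefail, it is the first non-zero exit code.
bash -c 'false | true'; echo "default: $?"
bash -c 'set -e -o pipefail; false | true'; echo "pipefail: $?"
```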
The *.bai
file was an optional job input. You can check for an empty or unset var
using the bash built-in test [[ -z ${var} ]]
. Then, you can download or create a *.bai
index as needed.
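A sketch of that check (index_file is a hypothetical variable name for the optional input; the :- expansion keeps the test safe even under set -u when the variable is unset):

```shell
index_file=""
if [ -z "${index_file:-}" ]; then
  echo "No *.bai provided; index will be generated"
else
  echo "Using provided *.bai"
fi
```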
Bash’s job control system allows for easy management of multiple processes. In this example, you can run bash commands in the background as you control maximum job executions in the foreground. Place processes in the background using the character &
after a command.
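A minimal sketch of this job-control pattern — launch several workers in the background, then block with the wait built-in until all of them finish (the echo-to-file workers are illustrative stand-ins for per-chromosome counting):

```shell
for region in chr1 chr2 chr3; do
  # Each background worker stands in for a per-region samtools count.
  echo "counted ${region}" > "count_${region}.txt" &
done
wait  # block until every background job has exited
echo "all workers finished"
```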
Once the input bam has been sliced, counted, and summed, the output counts_txt
is uploaded using the command dx-upload-all-outputs
. dx-upload-all-outputs requires the directory structure below:
In your applet, upload all outputs by:
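A sketch of preparing that layout (the file name and contents are illustrative; dx-upload-all-outputs itself is left as a comment since it only runs inside a job):

```shell
# dx-upload-all-outputs uploads everything under out/<output name>/,
# so the counts file goes in a folder named after the job output.
mkdir -p out/counts_txt
echo "1234567" > out/counts_txt/read_counts.txt
# Inside the job you would then run:
#   dx-upload-all-outputs
```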
Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.
Logging In and Out via the User Interface
To log in via the user interface (UI), open the login page and enter your username and password.
To log out via the UI, click on your avatar at the far right end of the main Platform menu, then select Sign Out:
To log in via the command-line interface (CLI), make sure you've installed the dx
command-line client. From the CLI, enter the command dx login
.
Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.
See below for directions on using a token to log in.
See the Index of dx Commands page for detail on optional arguments that can be used with dx login
.
When using the CLI, log out by entering the command dx logout
.
See the Index of dx Commands page for detail on optional arguments that can be used with dx logout
.
After fifteen minutes of inactivity, you will be automatically logged out, unless you logged in using an API token that specifies the length of time you can stay logged in, or are part of an org with a custom autoLogoutAfter policy.
You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.
Be very careful about giving a DNAnexus Platform token to someone else. Anyone in possession of that token can use it to access the Platform and impersonate you as a user. They will have the same access level as you for any projects to which the token has access, potentially allowing them to run jobs that incur charges to your account.
To generate a token, click on your avatar at the top right corner of the main Platform menu, then select My Profile from the dropdown menu.
Next, click on the API Tokens tab. Then click the New Token button:
The New Token form will open in a modal window:
While filling out the form, note the following:
The token will provide access to each project at the level at which you have access. See the Projects page for more on project access levels.
If the token provides access to a project within which you have PHI data access, it will enable access to that PHI data.
If you do not enter an expiration date when creating a token, it will be set to expire in one month.
Once you've completed the form, click Generate Token. A new 32-character token will be generated, and displayed along with a confirmation message.
To log in with a token via the CLI, enter the command dx login --token
, followed by a valid 32-character token.
Tokens are useful in a number of different scenarios. Examples include:
Logging in via the CLI when a single sign-on is enabled - If your organization uses single sign-on, you may not be able to log in via the CLI using a username and password. In this case, use a token to log in via the CLI.
Logging in via a script - You can incorporate a token into a script to allow the script to log into the Platform.
When incorporating a token into a script, take care to set the token's expiration date such that the script has Platform access for only as long as absolutely necessary. Ensure as well that the script only has access to that project or those projects to which it must have access, in order to function properly.
To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button:
In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token will be revoked, and its name will no longer appear in the list of tokens on the API Tokens screen.
Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.
Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.
As a rule, logging in requires interacting directly with the Platform, via the UI or the CLI. But it is possible to log in non-interactively. This is most commonly done via a script that automates both login and project selection.
Non-interactive login requires the use of dx login
with the --token
argument. Use the dx select
command to automate project selection. If you prefer not to automate project selection, add the --noprojects
argument to dx login
.
DNAnexus recommends adding two-factor authentication to your account, to provide an extra means of ensuring the security of all data to which you have access, on the Platform.
After enabling two-factor authentication, you will be required to enter a two-factor authentication code to log into the Platform, and to access certain other services. This code is a time-based one-time password that is valid for only a single session. It is generated by a third-party two-factor authenticator application, such as Google Authenticator.
With two-factor authentication protecting your account, your data will be protected even in the case that both your username and password are stolen. No attacker will be able to access your account without the two-factor authentication code.
To enable two-factor authentication, select Account Security from the dropdown menu accessible via your avatar, at the top right corner of the main menu.
In the Account Security screen, click the button labeled Enable 2FA. Then follow the instructions to select and set up a third-party authenticator application.
After enabling two-factor authentication, you will be redirected to a page containing back-up codes. These codes can be used in place of a two-factor authentication code, in the event that you lose access to your authenticator application.
Save the back-up codes in a secure place. Without them, if you lose access to your authenticator application, you will be unable to log into the Platform.
Contact DNAnexus Support if you lose both your codes and access to your authenticator application.
DNAnexus does not recommend disabling two-factor authentication once it has been enabled. If you do need to do so, navigate to the Account Security screen of your profile, then click the Turn Off button in the Two-Factor Authentication section. You will be required to enter your password and a two-factor authentication code to confirm your choice.
When using the command-line client, you may refer to objects either by their ID or by name.
The client also accepts names and paths as input, in a particular syntax.
There are three main types of paths recognized for referring to data objects: project paths, job-based object references (JBORs), and DNAnexus links.
To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" will be interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:
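Written out, that path is:

```
Genomes:/human/hg19.fq.gz
```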
Note that the folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.
To refer to the output of a particular job, you can use the syntax <job id>:<output name>
.
If you have the job ID handy, you can use it directly.
Or if you know it's the last analysis you ran:
You can also automatically download a file once the job producing it is done:
If the output is an array, you can extract a single element by specifying its array index (numbered 0, 1, etc.) as follows:
DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link
, and have as a value either
a string representing a data object ID
another hash with two keys:
project
a string representing a project or other data container ID
id
a string representing a data object ID
For example:
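The hash form with both keys might look like the following (the project and file IDs are placeholders; the simpler form is just {"$dnanexus_link": "file-xxxx"}):

```json
{
  "$dnanexus_link": {
    "project": "project-xxxx",
    "id": "file-xxxx"
  }
}
```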
Because of the use of :
to denote project names and of /
to separate folder names, they must be escaped with a preceding backslash \\
when they appear in a data object's name. The characters *
and ?
are also reserved for the use in wildcard patterns and must also be escaped. Depending on your terminal and whether you put the entire name or path in quotes, you may have to additionally escape spaces in order to pass them in as part of the string. The use of backslashes in names is discouraged, and the best way to interact with objects with such names will probably be to use their IDs directly. See the following table for some examples of the necessary representation for accessing existing objects with special characters in their names (assuming a bash shell).
The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.
For commands where the argument supplied involves naming or renaming something, the only escaping necessary is whatever is necessary for your shell or for setting it apart from a project or folder path.
It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you will be prompted to select the desired data object.
Some commands (like mv
here) will allow you to enter *
so that all matches will be used. Other commands may automatically apply the command to all of them (e.g. ls
and describe
), and others will require that exactly one object be chosen (e.g. run
).
If dx run
is run without specifying an input, interactive mode will be launched. You will then be prompted to enter each required input, after which you will be given the option to select from a list of optional parameters to modify. Optional parameters listed will include all those that can be modified for each stage of the workflow. The interface will then output a JSON file detailing the input specified and generate an analysis ID of the form analysis-xxxx
unique to this particular run of the workflow.
Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.
You can specify each input on the command-line using the -i
or --input
flags using the syntax -i<stage ID>.<input name>=<input value>
. <input-value>
must take the form of a DNAnexus object ID or the name of a file in the currently selected project. It is also possible to specify the number of a stage in place of the stage ID for a given workflow, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only to illustrate this point. Note that the parentheses around the <input-value>
in the help string are omitted when entering input.
Possible values for the input name field can be found by running the command dx run workflow-xxxx -h
, as shown below using the Exome Analysis Workflow.
This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message will first list the required inputs for that stage, specifying the requisite type in the <input-value>
field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it will list advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form
{"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}
.
Similarly, this link format can be used to specify output from any prior stage in the workflow as input for the current stage. We see that the Exome Analysis Workflow has one required file array input in addition to those already specified by default: -ibwa_mem_fastq_read_mapper.reads_fastqgzs
. As these inputs are for the first stage of the Exome Analysis Workflow, the bwa_mem_fastq_read_mapper
stage ID can be replaced with 0
.
The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.
Array inputs can be specified by providing multiple -i flags for a single parameter in a stage. For example, the following flags would add files 1 through 3 to the file_inputs
parameter for stage-xxxx
of the workflow
:
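As a dry-run sketch (the workflow, stage, and file IDs below are placeholders, so the command is echoed into a file rather than executed):

```shell
# Repeating the -i flag appends each file, in order, to the array input.
# workflow-xxxx, stage-xxxx, and the file IDs are placeholder values.
echo dx run workflow-xxxx \
  -istage-xxxx.file_inputs=file-aaaa \
  -istage-xxxx.file_inputs=file-bbbb \
  -istage-xxxx.file_inputs=file-cccc > array_cmd.txt
cat array_cmd.txt
```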
If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>
.
Using the --brief
flag at the end of a dx run
command will cause the command line to print the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.
To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]
. The [options]
parameters will override anything set by the --clone
flag, and they take the form of options passed as input from the command line.
For example, the command below redirects the output of the analysis to the outputs/
folder and reruns all stages.
When rerunning workflows, if a stage is run identically to how it was run in a previous analysis, the stage itself will not be rerun; the outputs of that stage will not be copied or rewritten in a new location. To rerun a specific stage, use the option --rerun-stage STAGE_ID
to force a stage to be run again, wherein STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). If you wish to rerun all stages of an analysis, you can use --rerun-stage "*"
, where the asterisk is enclosed in quotes to prevent the shell from expanding it, via globbing, into the names of all the files and folders in your current directory.
The command below reruns the third and final stage of analysis-xxxx:
The --destination
flag allows you to specify the output path for a workflow. By default, every output of every stage will be written to the specified destination.
You can use the --stage-output-folder <stage_ID> <folder>
command to specify the output destination of a particular stage in the analysis being run, wherein stage_ID
is the stage's name or the index of that stage (where the first stage of a workflow is indexed at 0), and folder
is the project and path to which you wish the stage to write using the syntax project-xxxx:/PATH
where PATH
is the path to the folder in project-xxxx
where you wish to write outputs.
The following command reruns all stages of analysis-xxxx
and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:
If you want to place a stage's output folder inside the overall output folder of the analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>
, where stage_id
is the stage's ID (stage-xxxx
), name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, you can specify a quoted path, relative to the output folder of the analysis, to which the output of that stage will be written.
The following command reruns all stages of analysis-xxxx
, setting the output destination of the analysis to /exome_run,
and the output destination of stage 0 to /exome_run/mappings
in the current project:
If you wish to specify the instance type of all stages in your analysis or a specific set of stages in your analysis, you can do so with the flag --instance-type
. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE
allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE
sets a single instance type for all stages. The two options can be combined; for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16
will set all stages' instance types to mem2_ssd1_x2
except for the stage my_stage_0
for which mem3_ssd1_x16
will be used.
Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).
The example below reruns all stages of analysis-xxxx
and specifies that the first and second stages should be run on mem1_ssd2_x8
and mem1_ssd2_x16
instances respectively:
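As a dry-run sketch of such a command (analysis-xxxx is a placeholder, so the command is echoed into a file rather than executed; note the quoted asterisk, and the stage indices used as STAGE_ID):

```shell
# Rerun all stages, pinning stage 0 and stage 1 to specific instance types.
# analysis-xxxx is a placeholder ID.
echo dx run --clone analysis-xxxx --rerun-stage '*' \
  --instance-type 0=mem1_ssd2_x8 \
  --instance-type 1=mem1_ssd2_x16 -y > itype_cmd.txt
cat itype_cmd.txt
```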
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
Note that, as when running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>
, where STAGE_ID is either an ID of the form stage-xxxx
or the index of that stage in the workflow (starting with the first stage at index 0).
You can treat dx
as an invocation command for navigating the data objects on the DNAnexus platform. By adding dx
in front of commonly used bash commands (e.g. dx ls
, dx cd
, dx mv
, and dx cp
), you can list objects, change folders, move objects, and copy objects stored on the platform, all from the command line.
To see more details about each object, you can run the command with the -l flag: dx ls -l
.
As in bash, you can list the contents of a path.
You can also list the contents of a different project. To specify a path that points to a different project, start with the project-ID, followed by a :
, then the path within the project where /
is the root folder of the project.
Note that we enclosed our path with quotes (" "
), so dx
interprets the spaces as part of the folder name rather than as argument separators.
You can also list only the objects which match a pattern. Here, we use a *
as a wildcard to represent all objects whose names contain .fasta
. This returns only a subset of the objects returned in the original query. Again we enclosed our path in " "
so dx
correctly interprets the asterisk and the spaces in the path.
To rename an object or a folder, simply "move" it to a new name in the same folder. Here we rename a file named ce10.fasta.gz
to C.elegans10.fasta.gz
.
If we want to move the renamed file into a folder, we can specify the path to the folder as the destination of the move command.
You can copy data objects or folders to another project by running the command dx cp
. Below we show an example to copy a human reference genome FASTA file (hs37d5.fa.gz
) from a public project, “Reference Genome Files”, to a project “Scratch Project” that the user has ADMINISTER permission to.
You can also copy folders between projects by running dx cp folder_name destination_path
. Folders will automatically be copied recursively.
To view and select from among all public projects, i.e. projects available to all DNAnexus users, you can run the command dx select --public
:
By default, dx select
will present a list of projects to which you have at least CONTRIBUTE permission. If you want to switch to a project to which you have only VIEW permission in order to view its data objects, you can run dx select --level VIEW
to list all the projects in which you have at least VIEW permission.
If you know the project ID or name, you can also give it directly to switch to the project as dx select [project-ID | project-name]
:
You can also specify each input parameter by name using the ‑i
or ‑‑input
flags with syntax ‑i<input name>=<input value>
. Names of data objects in your project will be resolved to the appropriate IDs and packaged correctly for the API method as shown below.
The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, we learn that the Swiss Army Knife app has two primary inputs: one or more files, and a command string to be executed; these are specified as -iin=file-xxxx
and -icmd=<string>
, respectively.
The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.
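As a sketch, such a command could take the following shape; the file ID and the samtools invocation are illustrative, and the command is echoed into a file (a dry run) rather than executed:

```shell
# Sort a BAM file with the Swiss Army Knife app. file-xxxx is a
# placeholder; in a real invocation, keep the quotes around the -icmd
# value so the shell passes the command string through intact.
echo dx run app-swiss-army-knife \
  -iin=file-xxxx \
  -icmd='samtools sort *.bam -o sorted.bam' > sak_cmd.txt
cat sak_cmd.txt
```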
Some examples of additional functionalities provided by dx run
are listed below.
Regardless of whether you run a job interactively or non-interactively, the command dx run
will always print the exact input JSON with which it is calling the applet or app. If you don't want to print this verbose output, you can use the --brief
flag which tells dx
to print out only the job ID instead. This job ID can then be saved.
TIP: When running jobs, you can use the
-y/--yes
option to bypass the prompts asking you to confirm running the job and whether or not you want to watch the job. This is useful for scripting jobs. If you want to confirm running the job and immediately start watching the job, you can use -y --watch
.
If you are debugging applet-xxxx and wish to rerun a job you previously ran with the same settings (destination project and folder, inputs, instance type requests) but with a new executable applet-yyyy, you can use the --clone
flag.
If you want to modify some but not all settings from the previous job, you can simply run dx run <executable> --clone job-xxxx [options]
. The command-line arguments you provide in [options]
will override the settings reused from --clone
. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.
The example shown below redirects the outputs of the job to the folder "outputs/".
In the above command, the flag --destination project-xxxx:/mappings
instructs the job to output all results into the "mappings" folder of project-xxxx.
The dx run --instance-type
command allows you to specify the instance type(s) to be used for the job. More information can be found by running the command dx run --instance-type-help
.
If you are running many jobs that have varying purposes, you can organize the jobs using metadata. There are two types of metadata on the DNAnexus platform: properties and tags.
Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property
flag allows you to attach a property to a job, and the --tag
flag allows you to tag a job.
Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs makes it easier for you to search for a particular job in your job history (e.g., you can tag all jobs run with a particular sample).
If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job by appending the app name with the version required, e.g. app-xxx/0.0.1 if the current version is app-xxx/1.0.0.
If you would like to keep an eye on your job as it runs, you can use the --watch
flag to ask the job to print its logs in your terminal window as it progresses.
If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json
followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.
If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file
followed by the name of the JSON file.
Entering the input JSON via stdin is done in much the same way as entering it from a file with the -f
flag, with the small substitution of using "-" as the filename. Below is an example that demonstrates how to echo the input JSON to stdin and pipe the output to the input of dx run
. As before, single quotes should be used to wrap the JSON input to avoid interfering with the double quotes used by the JSON itself.
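The quoting pattern can be sketched as follows; the file ID and input field are placeholders, and python3 -m json.tool stands in as a validator for what would otherwise be piped to dx run with "-" as the filename:

```shell
# Single quotes wrap the JSON so its inner double quotes survive the shell.
# file-xxxx and the "in"/"cmd" fields are placeholder values.
echo '{"in": [{"$dnanexus_link": "file-xxxx"}], "cmd": "ls -l"}' \
  | python3 -m json.tool > parsed.json
cat parsed.json
```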
The --cost-limit cost_limit
flag sets the maximum cost of the job before termination. In the case of workflows, the limit applies to the cost of the entire analysis. For batch runs, the limit is applied per job. See the dx run --help
command for more information.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the section.
In the DNAnexus platform, every data object has a unique ID starting with the class of the object (e.g. "record", "file", "project") followed by a hyphen ('-') and 24 alphanumeric characters, e.g. record-9zGPKyvvbJ3Q3P8J7bx00005
. A string matching this format will always be interpreted to be meant as the ID of such an object and will not be further resolved as a name.
Exceptions to this are commands that take in arbitrary names (e.g. app names, user IDs, etc.). In such cases, all possible interpretations will be attempted. However, a string will always be assumed not to be a project name unless it ends in ":".
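As a sketch, the ID shape described above can be checked with a regular expression; the class list here is illustrative, not exhaustive:

```shell
# Return success if the string looks like a DNAnexus object ID:
# a class name, a hyphen, and exactly 24 alphanumeric characters.
is_object_id() {
  printf '%s' "$1" | grep -Eq '^(record|file|project|applet|app|workflow|job|analysis)-[0-9A-Za-z]{24}$'
}

is_object_id "record-9zGPKyvvbJ3Q3P8J7bx00005" && echo "interpreted as an object ID"
is_object_id "my_reads.fastq.gz" || echo "resolved as a name instead"
```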
You can run workflows from the command-line using the command dx run. The inputs to these workflows can be from any project for which you have VIEW access.
The examples here use the publicly available (platform login required to access this link).
For information on how to run a Nextflow pipeline, see .
The -i
flag can also be used to specify (JBORs) with the syntax -i<stage ID or number>:<input name>=<job id>:<output name>
. The --brief
flag, when used with the command dx run
, will only output the execution's ID; we can also skip the interactive prompts confirming the execution using the -y
flag. Calling dx run
on a job with the --brief
flag allows the command to return just the job ID of that execution and we can skip being prompted to begin execution with the -y
flag.
The example below calls the app (platform login required to access this link) to produce the sorted_bam
output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h
. This output is then used as input to the first stage of the featured on the DNAnexus platform (platform login required to access this link).
Note that the --clone
flag will not copy the usage of the --allow-ssh
or --debug-on
flags, which must be set with the new execution; only the applet, instance type, and input spec are copied. See the page for more information on the usage of these flags.
This is identical to adding metadata to a job; see for details.
It is not possible to monitor an analysis from the command line. For information about monitoring a job from the command line, see .
This is identical to providing an input JSON to a job; for more information, see .
By default when you set your current project, you are placed in the root folder /
of the project. You can list the objects and folders in your current folder with dx ls.
To find out what folder you are currently in, you can use the dx pwd
command. You can switch contexts to a subfolder in a project using dx cd.
You can move and rename data objects and folders using the command dx mv.
NOTE: The platform does NOT allow copying a data object within the same project, since each data object can exist only once in a given project. Additionally, it is prohibited to copy data objects between projects located in different regions using dx cp
.
You can change to another project where you want to work by running the command dx select. It brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".
You can run apps and applets from the command-line using the command dx run. The inputs to these app(let)s can be from any project for which you have VIEW access.
If dx run
is run without specifying any inputs, interactive mode will be launched. When you run this command, the platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to be run with the inputs you have selected.
When specifying input parameters using the ‑i/‑‑input
flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h
, as shown below using the (platform login required to access this link).
To specify array inputs, reuse the ‑i/‑‑input
flag for each input in the array; each file specified will be appended to the array in the same order as it was entered on the command line. Below is an example of how to use the to index multiple BAM files (platform login required to access this link).
(JBORs) can also be provided using the -i
flag with syntax ‑i<input name>=<job id>:<output name>
. Combined with the --brief
flag (which allows dx run
to output just the job ID) and the -y
flag (to skip confirmation), you can string together two jobs using one command.
Below is an example of how to run the (platform login required to access this link), producing the output named "sorted_bam" as described in the app's helpstring by executing the command dx run app-bwa_mem_fastq_read_mapper -h
. The "sorted_bam" output will then be used as input for the (platform login required to access this link).
In the above command, the specified executable (platform login required to access this link) overrides the executable used by the job cloned with --clone job-xxxx
.
While the --clone job-xxxx
flag will copy the applet, instance type, and inputs, it will not copy usage of the --allow-ssh
or --debug-on
flags. These will have to be re-specified for each job run. For more information, see the page.
The --destination
flag allows you to specify the full project-ID:/folder/
path in which to output the results of the app(let). If this flag is unspecified, the output of the job will default to the present working directory, which can be determined by running dx pwd.
Some apps and applets have multiple , meaning that different instance types can be specified for different functions executed by the app(let). In the example below, we run the (platform login required to access this link) while specifying the instance types for the entry points "honey," "ssake," "ssake_insert," and "main." Specifying the instance types for each entry point requires a JSON-like string, meaning that the string should be wrapped in single quotes, as explained earlier, and demonstrated below.
You can also specify the input JSON in its entirety. To specify a data object, you must wrap it in a DNAnexus link (a key-value pair with a key of "$dnanexus_link" and a value of the data object's ID). Because you are already providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, you will not be prompted to confirm before the job is started. There are three methods for entering the full input JSON, which we discuss in separate sections below.
Executing the dx run --help
command will show all of the flags available to use in conjunction with dx run
. The message printed by this command is identical to the one displayed in the brief description of .
| Character | Escaped version (no quotes) | Escaped version (with quotes) |
| --- | --- | --- |
| (single space) | `\ ` | `' '` |
| `:` | `\\:` | `'\:'` |
| `/` | `\\/` | `'\/'` |
| `*` | `\\\\*` | `'\\\*'` |
| `?` | `\\\\?` | `'\\\?'` |
Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.
On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.
Each job has a timeout setting. This setting denotes the maximum amount of “wall clock time” that the job can spend in the “running” state, i.e. running on the DNAnexus Platform.
If the job is still running when this limit is reached, the job will be terminated.
The default job timeout setting is 30 days, though individual apps may have different timeout settings, as specified by the app’s creator. A job may be given a custom timeout setting.
As noted above, job timeouts only apply to the time a job spends in the "running" state.
Job timeouts do not apply to any time a job spends waiting to begin running - as, for example, when a job is waiting for inputs to become available.
Job timeouts also do not apply to the time a job may spend between exiting the “running” state, and entering the “done” state - as, for example, when it is waiting for subjobs to finish.
See this documentation to learn more about the job lifecycle and job states.
If a job fails to complete running before reaching its timeout limit, it will be terminated, with the Platform returning JobTimeoutExceeded
as the job's failure reason.
Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree’s root execution.
After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.
Note that if an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree’s root execution.
If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform will it fail. Depending on the access pattern, the Platform will return AppInternalError
, AppError
, or AuthError
as the job's failure reason.
Learn how to set job notification thresholds on the DNAnexus Platform.
Being notified of when a job may be stuck can help users to troubleshoot problems in a timely manner. On DNAnexus, users can set timeouts to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time via dx or dxCompiler.
When the threshold has been reached for a job tree, the user who launched the executable and the org admin will receive an email notification.
For a root execution, the turnaround time is the time between its creation and the time it reaches a terminal state (or the current time, if it is not yet in a terminal state). The terminal states of an execution are done, terminated, and failed. The job tree turnaround time threshold can be set from the dxapp.json
app metadata file using the treeTurnaroundTimeThreshold
supported field, where the threshold time is set in seconds. Note that when a user runs an executable that has a threshold, only the resulting root execution will have that threshold applied to it. See here for more details on the treeTurnaroundTimeThreshold
API.
Example of including the treeTurnaroundTimeThreshold
field in dxapp.json
:
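A minimal sketch of a dxapp.json carrying this field; the app name, title, version, and the 24-hour (86400-second) threshold are illustrative values:

```json
{
  "name": "my_app",
  "title": "My App",
  "version": "1.0.0",
  "treeTurnaroundTimeThreshold": 86400
}
```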
In the command-line interface (CLI), the dx build
and dx build --app
commands can accept the treeTurnaroundTimeThreshold
field from dxapp.json
, and the resulting app will be built with the job tree turnaround time threshold from the JSON file.
To check the treeTurnaroundTimeThreshold
value of an executable, users can run the dx describe
{app, applet, workflow, or global workflow ID} --json command.
Using the dx describe {execution_id} --json
command will display the selectedTreeTurnaroundTimeThreshold
, selectedTreeTurnaroundTimeThresholdFrom
, and treeTurnaroundTime
values of root executions.
For WDL workflows and tasks, dxCompiler allows users to specify tree turnaround time using the extras JSON file. dxCompiler uses the value of the treeTurnaroundTimeThreshold
field from the perWorkflowDxAttributes
and defaultWorkflowDxAttributes
sections in extras
and passes on this threshold to the new workflow generated from this file. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold
field to the perTaskDxAttributes
and defaultTaskDxAttributes
sections in the extras JSON file.
Example of including the treeTurnaroundTimeThreshold
field in perWorkflowDxAttributes
:
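A sketch of such an extras file, assuming a workflow named my_workflow; the name and the 48-hour (172800-second) threshold are illustrative:

```json
{
  "perWorkflowDxAttributes": {
    "my_workflow": {
      "treeTurnaroundTimeThreshold": 172800
    }
  }
}
```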
To launch a DNAnexus application or workflow on many files automatically, one may write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the DNAnexus SDK provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see these instructions.
In this tutorial, we'll batch process a series of sample FASTQs (forward and reverse reads). We'll use the dx generate_batch_inputs
command to generate a batch file -- a tab-delimited (TSV) file where each row corresponds to a single run in our batch. Then we'll process our batch using the dx run
command with the --batch-tsv
option.
In the project My Research Project we have the following files in our root directory:
We want to batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, we need to specify the following inputs:
reads_fastqgzs
- FASTQ containing the left mates
reads2_fastqgzs
- FASTQ containing the right mates
genomeindex_targz
- BWA reference genome index
We'll use the BWA reference genome index from the public Reference Genome (requires platform login) project for all runs; the forward and reverse read pairs, however, will vary from run to run. To generate a batch file that pairs our input reads:
The (.*)
are regular expression groups. You can provide arbitrary regular expressions as input. The first group's match is used as the pattern to group pairs in the batch; these matches are called batch identifiers (batch IDs). To explain this behavior in more detail, we will use the output of the dx generate_batch_inputs
command above:
The dx generate_batch_inputs
command creates the dx_batch.0000.tsv
that looks like:
Recall the regular expression was RP(.*)_R1_(.*).fastq.gz
. Although there are two grouped matches in this example, only the first one is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz
is 10B_S1
which corresponds to the first grouped match while the second one is ignored.
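The grouping behavior can be reproduced with any regular expression tool; here is a sketch that extracts the batch ID from the first file name using sed:

```shell
# The first capture group of RP(.*)_R1_(.*).fastq.gz labels the batch;
# the second group is ignored.
echo "RP10B_S1_R1_001.fastq.gz" \
  | sed -E 's/^RP(.*)_R1_(.*)\.fastq\.gz$/\1/'
# prints: 10B_S1
```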
Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.
Note that if an input for the app is an array, the input file IDs within the batch TSV file need to be enclosed in square brackets in order to work. The following bash command adds brackets to the file IDs in columns 4 and 5. You may need to change the variables in the command below ("$4" and "$5") to match the correct columns in your file. The command's output file, "new.tsv", is ready for the dx run --batch-tsv
command.
head -n 1 dx_batch.0000.tsv > temp.tsv && tail -n +2 dx_batch.0000.tsv | awk '{sub($4, "[&]"); print}' | awk '{sub($5, "[&]"); print}' >> temp.tsv; tr -d '\r' < temp.tsv > new.tsv; rm temp.tsv
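Applied to a minimal hypothetical batch TSV (one header row plus one sample row, with placeholder file IDs), the pipeline above behaves as follows:

```shell
# Build a tiny two-row batch TSV with placeholder file IDs.
printf 'batch ID\treads\treads2\treads ID\treads2 ID\n' > dx_batch.0000.tsv
printf '10B_S1\tRP10B_S1_R1_001.fastq.gz\tRP10B_S1_R2_001.fastq.gz\tfile-aaaa\tfile-bbbb\n' >> dx_batch.0000.tsv

# Keep the header, then wrap columns 4 and 5 of each data row in brackets.
head -n 1 dx_batch.0000.tsv > temp.tsv && tail -n +2 dx_batch.0000.tsv \
  | awk '{sub($4, "[&]"); print}' | awk '{sub($5, "[&]"); print}' >> temp.tsv
tr -d '\r' < temp.tsv > new.tsv && rm temp.tsv
cat new.tsv
```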
Note that the example above is for a case where all files have been paired properly. dx generate_batch_inputs
will create a TSV for all files that can be successfully matched for a particular batch ID. There are two classes of errors for batch IDs that are not successfully matched:
A particular input is missing (e.g. reads_fastqgzs
has a pattern but no corresponding match can be found for reads2_fastqgzs
)
More than one file ID matches the exact same name
For both of these cases, dx generate_batch_inputs
returns a description of these errors to STDERR.
We have our batch file so now we can execute our BWA-MEM batch process:
Here, genomeindex_targz
is a parameter set at execution time that is common to all groups in the batch and --batch-tsv
corresponds to the input file generated above.
To monitor a batch job, simply use the 'Monitor' tab like you normally would for jobs you launch.
In order to direct the output of each run into a separate folder, the --batch-folders
flag can be used, for example:
This will output the results for each sample in folders named after batch IDs, in our case the folders: "/10B_S1/", "/10T_S5/", "/15B_S4/", and "/15T_S8/". If the folders do not exist, they will be created.
The output folders are created under a path defined with --destination
, which by default is set to the current project and the "/" folder. For example, this command will output the result files in "/run_01/10B_S1/", "/run_01/10T_S5/", etc.:
dx generate_batch_inputs
is limited to starting runs that differ only in input fields of type file
. Use a more flexible for
loop construct if you want batch runs that differ in string, file array or other non-file type inputs.
Additionally, a for
loop allows you to specify other dx run
arguments such as name
for every run:
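A sketch of such a loop, using the sample names from this tutorial; the app and input field names are the ones discussed above, file-xxxx is a placeholder, and each command is echoed into a file (a dry run) rather than executed:

```shell
# Loop over samples, building one dx run command per read pair and giving
# each run its own name. Drop the echo to actually submit the jobs.
for sample in 10B_S1 10T_S5 15B_S4 15T_S8; do
  echo dx run app-bwa_mem_fastq_read_mapper \
    -ireads_fastqgzs="RP${sample}_R1_001.fastq.gz" \
    -ireads2_fastqgzs="RP${sample}_R2_001.fastq.gz" \
    -igenomeindex_targz=file-xxxx \
    --name "bwa_${sample}" --brief -y
done > batch_cmds.txt
cat batch_cmds.txt
```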
You can also use the dx run
command to run a workflow by its name, specifying inputs by stage_id
. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:
Note the \
that is needed to escape the :
in the workflow name.
Inputs to the workflow can be specified using dx run <workflow> --input stage_id.name=value
, where stage_id
is a numeric ID starting at 0. More help can be found by running the commands dx run --help
and dx run <workflow> --help
.
To batch multiple inputs then, do the following:
For additional information and examples of how to run batch jobs, Chapter 6 of this reference guide may be useful. Note that this material is not a part of the official DNAnexus documentation and is for reference only.
Learn how to get information on current and past executions, via both the UI and the CLI.
To get basic information on a job (the execution of an app or applet) or an analysis (the execution of a workflow):
Click on Projects in the main Platform menu.
On the Projects list page, find and click on the name of the project within which the execution was launched.
Click on the Monitor tab to open the Monitor screen.
On the Monitor screen, you'll see a list of executions launched within the project. By default, they're listed in reverse chronological order, with the most recently launched execution at the top.
Find the row displaying information on the execution.
Note that for an analysis (the execution of a workflow), you can click on the "+" icon to the left of the analysis name to expand the row to view information on its stages. If an execution has further descendants, you can click on the “+” icon next to its name to expand the row further.
To see additional information on an execution, click on its name to be taken to its details page.
Note that there are shortcuts you can use to view information that is found on the details page directly on the list page, or relaunch an execution:
To view the Info pane:
Click the Info icon, above the right edge of the executions list, if it’s not already selected, and then select the execution by clicking on the row, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select View Info in the flyout menu.
To view the log file for a job, do either of the following:
Select the execution by clicking on the row. A View Log button will appear in the header. Click the View Log button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select View Log in the flyout menu.
To re-launch a job, do either of the following:
Select the execution by clicking on the row. A Launch as New Job button will appear in the header. Click the Launch as New Job button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Launch as New Job in the flyout menu.
To re-launch an analysis, do either of the following:
Select the execution by clicking on the row. A Launch as New Analysis button will appear in the header. Click the Launch as New Analysis button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Launch as New Analysis in the flyout menu.
In the list on the Monitor screen, you'll see the following information for each of the executions that is running or has been run within the project in question:
Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either via the UI, or via the CLI. Note that the execution's name is used in Platform email alerts related to the execution. Note as well that clicking on a name in the executions list opens the execution details page, giving in-depth information on the execution.
State - This is the execution's state. State values include:
"Waiting" - When initially launched, the execution's state will be "Waiting" until the Platform has allocated the resources required to run it, and, in some cases, until other executions on which it depends have finished.
"Running" - Once a job has started to run, its state will change from "Waiting" to “Running.”
"In Progress" - Once an analysis has been launched, its state will change to "In Progress."
"Done" - If the execution completes with no errors, its state will change to "Done."
"Failed" - If the execution fails to complete due to errors, its state will change to "Failed." See Types of Errors for help in understanding why an execution failed.
"Partially Failed" - An analysis is in the "Partially Failed" state if one or more stages in the workflow have not finished successfully, while at least one stage has not yet transitioned to a terminal state (either "Done," "Failed," or "Terminated").
"Terminating" - If the worker has begun terminating an execution, but not yet finished doing so, the execution's state will be displayed as "Terminating."
"Terminated" - If the execution is terminated prior to completion, its state will change to "Terminated."
"Debug Hold" - If an execution has been run with debugging options, and has failed for an applicable reason, and is being held for debugging, its state will show as "Debug Hold."
Executable - The executable or executables run in the course of the execution. Note that if the execution is an analysis, each stage will be shown in a separate row, including the name of the executable run during the stage in question. Note as well that if there is an informational page giving details about the executable and how to configure and use it, the executable's name will be clickable, and clicking the name will display that page.
Tags - Tags are strings associated with objects on the platform. They are a type of metadata that can be added to an execution.
Launched By - The name of the user who launched the execution.
Launched On - The time at which the execution was launched. Note that for many executions, this will be earlier than the time displayed in the Started Running column, because many executions spend time waiting for resources to become available, before they start running.
Started Running - The time at which the execution started running, if it has done so. Note that this may be later than the launch time, as many executions must wait for resources to become available before they can start running.
Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.
Cost - A value is displayed in this column when the user has access to billing info for the execution. The figure shown represents either, for a running execution, an estimate of the charges it has incurred so far, or, for a completed execution, the total costs it incurred.
Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, either via the CLI or via the UI. This setting determines the scheduling priority of the execution, vis-a-vis other executions that are waiting to be launched.
Worker URL - If the execution is running an executable - such as DXJupyterLab - to which you can connect directly via a web URL, that URL will be shown here. Clicking the URL will open a connection to the executable, in a new browser tab.
Output Folder - For each execution, the value displayed represents a path relative to the project's root folder. Clicking the value will open the folder in which the execution's outputs have been or will be stored.
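As an illustration of the State column's values for analyses, here is a minimal sketch (not Platform code) of how an overall analysis state could be derived from its stage states, following the definitions above:

```python
# Hypothetical sketch: derive an analysis state from its stage states,
# following the state definitions above. Not actual Platform code.
TERMINAL = {"done", "failed", "terminated"}

def analysis_state(stage_states):
    """Derive an overall analysis state from a list of stage states."""
    if all(s == "done" for s in stage_states):
        return "done"
    if any(s == "failed" for s in stage_states):
        # A stage failed; if any stage is still non-terminal, the
        # analysis is only partially failed so far.
        if any(s not in TERMINAL for s in stage_states):
            return "partially_failed"
        return "failed"
    return "in_progress"

print(analysis_state(["done", "done"]))       # done
print(analysis_state(["failed", "running"]))  # partially_failed
print(analysis_state(["failed", "done"]))     # failed
```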
Additional basic information can be displayed for each execution. To do this:
Click on the "table" icon at the right edge of the table header row.
Select one or more of the entries in the list, to display an additional column or columns.
Available additional columns include:
Stopped Running - The time at which the execution stopped running.
Custom properties columns - If custom properties have been assigned to any of the listed executions, a column can be added to the table for each such property, showing the value assigned to each execution for that property.
To remove columns from the list, click on the "table" icon at the right edge of the table header row, then de-select one or more of the entries in the list, to hide the column or columns in question.
A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.
By default, pills are displayed that allow you to set search criteria that will filter executions by one or more of the following attributes:
Name - Execution name
State - Execution state
ID - An execution's job ID or analysis ID
Executable - A specific executable
Launched By - The user who launched an execution or executions
Launch Time - The time range within which executions were launched
Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.
Note that by default, filters are set to display only root executions that meet the criteria defined in the filter. If you want the display to include all executions, including those run during individual stages of workflows, click the button, above the left edge of the executions list, showing the default value "Root Executions Only." Then click "All Executions."
To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.
To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.
If you launched an execution and are a contributor to the project within which it's running, you can terminate the execution from the list on the Monitor screen while it's in a non-terminal state. If you have project admin status, you can also terminate executions launched by other project members.
To terminate an execution:
Find the execution in the list, then do either of the following:
Select the execution by clicking on the row. A red Terminate button will appear at the end of the header. Click the Terminate button; or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Terminate in the flyout menu.
A modal window will open, asking you to confirm that you want to terminate the execution. Click Terminate to confirm.
The execution's state will show as "Terminating" as it is being terminated. Then its state will change to "Terminated."
To get additional information on an execution, click on its name in the list on the Monitor screen. A new page will open.
On the details page for an execution, you'll see a range of information, including:
High-level details - In the Execution Tree section, at the top of the screen, you'll see high-level information, including:
For a standalone execution - such as a job without children - you'll see a single entry that includes details on the state of the execution, when it started and stopped running, and how long it spent in the running state.
For an execution with descendants - such as an analysis with multiple stages - you'll see a list, with each row containing details on the execution run at each stage of the analysis. If the execution has descendants, you can click on the “+” icon next to its name to expand the row to view information on its descendants. To see a page displaying detailed information on a stage, click on its name in the list. To navigate back to the workflow's details page, click on its name in the "breadcrumb" navigation menu in the top right corner of the screen.
Execution state - In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times in relation to each other. The colors include:
Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.
Green - A green bar indicates that the execution is in the "Done" state.
Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.
Grey - A grey bar indicates that the execution is in the "Terminated" state.
Execution start and stop times - Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").
Inputs - In this section, you'll see a list of the inputs to the execution. If a direct link to the input file is available, the input's name will be hyperlinked to the file; clicking the link will open the project location containing the file. If the input was provided by another execution in a workflow, the execution's name will be hyperlinked; clicking the link will open the details page for the execution in question.
Outputs - In this section, you'll see a list of the execution's outputs. If a direct link to the output file is available, the output's name will be hyperlinked to the file; clicking the link will open the folder containing the file.
Log files - An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files and, as needed, download them in .txt format:
To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.
To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage in question, in the Execution Tree section.
Basic info - The Info pane, on the right side of the screen, displays a range of basic information on the execution, along with additional detail such as the execution's unique ID, and custom properties and tags assigned to it.
Reused results - If an execution reuses results from another execution, this information will be shown in a blue pane, above the Execution Tree section. To see details on the execution that generated these results, click on its name.
If an execution failed, a Cause of Failure pane will display, above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:
Click the button labeled Send Failure Report to DNAnexus Support.
A form will open in a modal window, with both the Subject and Message fields pre-populated with information that DNAnexus Support will use in diagnosing and resolving the issue.
Click the button in the Grant Access section to give DNAnexus Support reps "View" access to the project in which the issue occurred. This will enable Support reps to diagnose and resolve the issue more quickly.
Click Send Report to send the report.
To re-launch a job from the execution details screen:
Click the Launch as New Job button in the upper right corner of the screen.
A new browser tab will open, displaying the Run App / Applet form.
Configure the run, then click Start Analysis.
To re-launch an analysis from the execution details screen:
Click the Launch as New Analysis button in the upper right corner of the screen.
A new browser tab will open, displaying the Run Analysis form.
Configure the run, then click Start Analysis.
If you want to save a copy of a workflow along with its input configurations under a new name, from the execution details screen:
Click the Save as New Workflow button in the upper right corner of the screen.
In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.
Click Save.
As described in this documentation, jobs can be configured to restart automatically upon certain types of failures.
If you want to view the execution details for the initial tries for a restarted job:
Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.
A modal window will open.
Click the name of the try for which you'd like to view execution details.
Note that you can only send a failure report for the most recent try, not for any previous tries.
You can use dx watch to view the log of a running job or of any past job, whether it finished successfully, failed, or was terminated. If you'd like to view a job's log stream while it runs, run dx watch against the job ID. The log stream includes stdout, stderr, and additional information the worker outputs as it executes the job.
If for some reason you need to terminate a job before it completes, use the command dx terminate.
If you'd like to view the log of a job that has finished running, you can likewise use the dx watch command. The log includes stdout, stderr, and additional information the worker output as it executed the job.
You can use dx find executions to return the ten most recent executions in your current project. You can specify the number of executions you wish to view by running dx find executions -n <specified number>. The output from dx find executions will be similar to the information shown in the "Monitor" tab on the DNAnexus web UI.
Below is an example of dx find executions; in this case, only two executions have been run in the current project. There is an individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).
The job running the DeepVariant Germline Variant Caller executable is running and has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of 2 stages, Freebayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.
By default, the dx find executions operation will search for jobs or analyses created when a user runs an app or applet. If a job is part of an analysis, the results will be returned in a tree representation linking all of the jobs in the analysis together.
By default, dx find executions will return up to ten of the most recent executions in your current project, in order of execution creation time.
However, a user can also filter the returned executions by job type. Using the flag --origin-jobs in conjunction with the dx find executions command returns only original jobs, whereas the flag --all-jobs will also include subjobs.
We can choose to monitor only analyses by running the command dx find analyses. Analyses are executions of workflows and consist of one or more app(let)s being run. When using dx find analyses, the command will return only the top-level analyses, not any of the jobs contained therein.
Below is an example of dx find analyses:
Jobs are runs of an individual app(let) and compose analyses. We can monitor jobs by running the command dx find jobs, which will return a flat list of jobs. If a job is in an analysis, all jobs within the analysis are also returned.
Below is an example of dx find jobs:
Searches for executions can be restricted to specific parameters.
To extract only stdout from this job, we can run the command dx watch job-xxxx --get-stdout.
To extract only stderr from this job, we can run the command dx watch job-xxxx --get-stderr.
To extract both stdout and stderr from this job, we can run the command dx watch job-xxxx --get-streams.
Below is an example of viewing stdout lines of a job log:
To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
To limit the log output to the most recent messages, use the -n option, e.g. dx watch job-xxxx -n 8 to view the last 8 messages. If the job has already run, its output is displayed as well.
In the example below, the app Sample Prints doesn’t have any output.
Jobs can be configured to restart automatically upon certain types of failures, as described in the Restartable Jobs section. To view initial tries of restarted jobs, along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.
By default, dx find will restrict your search to only your current project context. To search across all the projects to which you have access, use the --all-projects flag.
By default, dx find will only return up to ten of the most recently launched executions matching your search query. To change the number of executions returned, you can use the -n option.
A user can search for only executions of a specific app(let) or workflow based on its entity ID.
Users can also use the --created-before and --created-after options to search based on when the execution began.
Users can also restrict the search to a specific state, e.g. "done", "failed", "terminated".
The --delim flag will tab-delimit the output. This allows the output to be passed into other shell commands.
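As an illustrative sketch, tab-delimited output can be split into fields in a downstream script; the column layout shown here is hypothetical, not the actual dx output format:

```python
# Hypothetical example of consuming tab-delimited output, such as that
# produced by `dx find executions --delim`. The column layout below is
# illustrative only, not the actual dx output format.
line = "done\tBWA-MEM FASTQ Read Mapper\tjob-B4QzZv0000zzbKfqQkfQ0001"

# Tab-delimited fields split cleanly, even when names contain spaces.
state, name, exec_id = line.split("\t")
print(exec_id)  # job-B4QzZv0000zzbKfqQkfQ0001
```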
You can use the --brief flag to return only the object IDs for the objects returned by your search query. The --origin-jobs flag will omit the subjob information.
Below is an example usage of the --brief flag:
Below is an example of using the flags --origin-jobs and --brief. In the example below, we describe the last job run in the current default project.
See the Index of dx Commands documentation for more on using dx find jobs.
Job logs can be automatically forwarded to a customer's Splunk instance for analysis. See this documentation for more information on enabling and using this feature.
Learn key terms used to describe apps and workflows.
On the DNAnexus Platform, the following terms are used when discussing apps and workflows:
Execution: An analysis or job.
Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via an /executable-xxxx/run API call with the detach flag set to true are also root executions.
Execution tree: The set of all jobs and/or analyses that are created as a result of running a root execution.
Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).
Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.
Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.
Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.
Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.
Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.
Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.
Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.
Job tree: A set of all jobs that share the same origin job.
Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.
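The job-based object reference definition above can be sketched as follows; the job ID, output field name, and file ID are hypothetical:

```python
# Illustrative sketch of a job-based object reference (JBOR) and how it is
# resolved once the referenced job reaches the "done" state. The job ID,
# field name, and file ID below are hypothetical.
jbor = {"job": "job-xxxx", "field": "mapped_reads"}

# Outputs of the referenced job after it transitions to "done":
jobs = {
    "job-xxxx": {"state": "done", "output": {"mapped_reads": "file-yyyy"}},
}

def resolve(ref, jobs):
    """Replace a JBOR with the referenced job's output field value."""
    job = jobs[ref["job"]]
    if job["state"] != "done":
        raise RuntimeError("referenced job has not finished yet")
    return job["output"][ref["field"]]

print(resolve(jbor, jobs))  # file-yyyy
```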
Speed workflow development and reduce testing costs by reusing computational outputs.
DNAnexus allows organizations to optionally reuse outputs of jobs that share the same executable and input IDs, even if these outputs are across projects or entire organizations. This feature has two primary use cases.
For example, suppose you are developing a workflow, and at each stage, you end up debugging an issue. Let's assume that each stage takes approximately one hour to develop and run. If you do not reuse outputs as you develop, the process takes 1 + 2 + 3 + ... + n hours, since every time you fix a stage, you must recompute the results of all the stages before it. On the other hand, if you reuse results for stages that have matured and are no longer modified, your total development time is just the total time it takes to develop and run the pipeline (in this case, n hours). This is an order-of-magnitude difference in development time, and the improvement becomes more pronounced for longer workflows.
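The arithmetic above can be checked with a short sketch:

```python
# Sketch of the development-time arithmetic above: without output reuse,
# fixing stage i forces stages 1..i to be recomputed, so an n-stage
# pipeline costs 1 + 2 + ... + n hours; with reuse it costs n hours.
def dev_hours(n, reuse):
    """Total hours to develop an n-stage pipeline, one hour per stage."""
    return n if reuse else n * (n + 1) // 2

print(dev_hours(10, reuse=False))  # 55 hours without reuse
print(dev_hours(10, reuse=True))   # 10 hours with reuse
```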
This feature is also powerful for saving time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the stages before them, the time required for R&D of the forked version is only that of the final stages.
In production environments, it is important to test R&D modifications to a workflow at scale (e.g. a workflow for a clinical test). For example, suppose you are testing a workflow like the forked workflow discussed in the example above. This is a clinical workflow that needs to be tested on thousands of samples (let that number be represented by m) before being vetted to run in a production environment. Let's also suppose the whole workflow takes n hours but you only have modified the last k stages. You save (n-k)m total compute hours. This can add up to dramatic cost savings as m grows and if k is small.
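The savings estimate above can be sketched as follows; the figures are illustrative, and the text's implicit assumption that each stage costs roughly one compute hour is carried over:

```python
# Sketch of the (n - k) * m savings estimate above: testing m samples
# through an n-hour workflow in which only the last k stages changed
# saves (n - k) * m compute hours when unchanged stages are reused.
# Assumes, as the text does, roughly one compute hour per stage.
def compute_hours_saved(n_hours, k_modified, m_samples):
    return (n_hours - k_modified) * m_samples

print(compute_hours_saved(n_hours=12, k_modified=2, m_samples=1000))  # 10000
```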
To demonstrate Smart Reuse, we will use WDL syntax as supported by DNAnexus through our toolkit and by dxCompiler.
The workflow above is a two-step workflow that simply duplicates a file and takes the first 10 lines from the duplicate.
Now suppose the user has run the workflow above on some file and simply wants to tweak headfile to output the first 15 lines instead:
Here the only differences are that we renamed headfile and basic_reuse, and changed 10 to 15. The compilation process automatically detects that dupfile is the same, but that the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a different executable ID for headfile2.
When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results of the dupfile task are reused. Because a job on the DNAnexus Platform has already run that specific executable with the same input file, the system can reuse its output.
When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the --preserve-job-outputs option, in order to preserve the outputs of all the jobs in the execution tree in the project, and to increase the potential for subsequent Smart Reuse.
Smart Reuse:
only applies to jobs run in projects billed to an organization that has Smart Reuse enabled
is applied only to completed jobs executed after the policies are updated for an org
Jobs:
may only reuse results from another job if that job ran with the exact same executable and input IDs (including the function called within the applet). If an input is watermarked, the watermark and its version must match as well. Matching does not extend to other settings, such as the instance type on which the job was run.
if ignoreReuse is set to true, the job will not be considered a future candidate for job reuse
the job to be reused must have all outputs intact at the time of reuse. Partial output from the job (e.g. some of the output is missing or inaccessible) will prevent the reuse.
contain a field called outputReusedFrom that refers to the job ID that originally computed the requested outputs. This field never refers to another job that has itself been reused.
may only reuse results across projects if the corresponding application's dxapp.json contains "allProjects": "VIEW" in the "access" field
must have at least VIEW access to the original job's outputs, and those outputs must still exist on the Platform (i.e. they have not since been deleted)
are reported as having run for 0 seconds and correspondingly are billed at $0
are assumed to be deterministic in output
if the reused job or workflow is located in a different project or a different folder, the output data will not be cloned to the working project or the new destination folder, since the new jobs or workflows are not actually run.
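As a rough sketch of the matching rule described above (not the Platform's actual implementation), reuse can be thought of as a cache keyed on the executable ID, entry-point function, and input IDs; all names and IDs here are hypothetical:

```python
# Hypothetical sketch of the Smart Reuse matching rule: a completed job's
# outputs may be reused only when the executable ID, entry-point function,
# and input IDs all match exactly. Settings such as instance type are
# deliberately excluded from the key, mirroring the rule above.
completed = {}  # (executable, function, inputs) -> prior job record

def run_or_reuse(executable_id, function, input_ids):
    key = (executable_id, function, tuple(sorted(input_ids.items())))
    if key in completed:
        # Reused jobs carry outputReusedFrom, pointing at the original job.
        out = dict(completed[key])
        out["outputReusedFrom"] = out.pop("job_id")
        return out
    result = {"job_id": "job-xxxx", "output": {"result": "file-yyyy"}}
    completed[key] = result
    return result

first = run_or_reuse("applet-aaa", "main", {"reads": "file-123"})
second = run_or_reuse("applet-aaa", "main", {"reads": "file-123"})
print("outputReusedFrom" in second)  # True: same executable and inputs
```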
If you are an administrator of a licensed org and want to enable Smart Reuse, run this command:
Conversely, set the value to false to disable it. If you are a licensed customer and cannot run the command above, contact support@dnanexus.com. If you are interested in this feature and are not a licensed customer, reach out to sales@dnanexus.com or your account executive for more information.
Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.
A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.
An execution cost limit is an optional limit on the usage charges an execution tree can incur. This limit is set when a root execution is launched. Once this limit is reached, the DNAnexus Platform will terminate running executions in the affected execution tree.
When an execution is terminated in this fashion, the Platform will set CostLimitExceeded as the failure reason. This failure code will be displayed on the UI, on the relevant project’s Monitor page.
Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.
Billing account spending limits apply to cumulative charges incurred by projects billed to the account.
If cumulative charges reach this limit, the Platform will terminate running jobs in projects billed to the account, and will prevent new executions from being launched.
When a job is terminated in this fashion, the Platform will set SpendingLimitExceeded as the failure reason. This failure reason will be displayed on the UI, on the relevant project’s Monitor page.
Monthly project compute spending limits can be set by project admins, and can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.
If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.
For more information on these limits, see this overview, and this detailed explanation of setting org spending limit policies.
Monthly project compute limits do not apply to compute charges incurred by using relational database clusters.
There is a charge for using public IPv4 addresses for workers used for compute. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any compute spending limit that applies to the project in which the job is running.
See this documentation for information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions.
The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.
While an execution or execution tree is running, information will be displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.
Org spending limit information is available from the Billing page for each org.
If project-level monthly spending limits have been set for a project, detailed information is available via the CLI, using the command dx describe project-id.
Learn about the states through which a job or analysis may go, during its lifecycle.
In the following example, we have a workflow that has two stages, one of which is an applet, and the other of which is an app.
If the workflow is run, it will generate an analysis with an attached workspace for storing intermediate output from its stages. Jobs are also created to run the two stages. These jobs in turn can spawn more jobs, either to run another function in the same executable or to run an executable. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).
Note that the subjob or child job of stage 1's origin job shares the same temporary workspace as its parent job. Any calls to run a new applet or app (using the API methods /applet-xxxx/run or /app-xxxx/run) will launch a master job that has its own separate workspace and, by default, no visibility into its parent job's workspace.
Every successful job goes through at least the following four states:
1. idle: the initial state of every new job, regardless of what API call was made to create it.
2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.
3. running: the job has been assigned to, and is being run on, a worker in the cloud.
4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state; no job will transition to a different state after reaching done.
Jobs may also pass through the following transitional states as part of more complicated execution patterns:
waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:
it has an unresolved job-based object reference in its input
it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state
it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job); linked hidden objects are implicitly included in this list
waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:
it has a descendant job that has not yet entered the done state
it has an unresolved job-based object reference in its output
it is an origin or master job which has a data object (or linked hidden data object) output in the closing state
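The state progression described above can be sketched as a small state machine. The following is an illustrative model, not platform code; it encodes only the transitions of a successful job as listed here, with the two transitional states optional.

```python
# Illustrative sketch (not DNAnexus platform code) of the states a
# successful job passes through, per the description above.
SUCCESS_TRANSITIONS = {
    "idle": {"waiting_on_input", "runnable"},
    "waiting_on_input": {"runnable"},
    "runnable": {"running"},
    "running": {"waiting_on_output", "done"},
    "waiting_on_output": {"done"},
    "done": set(),  # terminal state: no further transitions
}

def is_valid_success_path(states):
    """Check that a sequence of states follows the allowed transitions."""
    if not states or states[0] != "idle" or states[-1] != "done":
        return False
    return all(b in SUCCESS_TRANSITIONS[a] for a, b in zip(states, states[1:]))

# The minimal successful path, and one including both transitional states:
print(is_valid_success_path(["idle", "runnable", "running", "done"]))
print(is_valid_success_path(
    ["idle", "waiting_on_input", "runnable", "running", "waiting_on_output", "done"]))
print(is_valid_success_path(["idle", "running", "done"]))  # skips runnable: invalid
```
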
There are two terminal job states other than the done state, terminated and failed, and a job can enter either of these states from any other state except another terminal state.
The terminated state is entered when a user has requested that the job (or another job sharing the same origin job) be terminated. For all terminated jobs, the failureReason in their describe hash is set to "Terminated", and the failureMessage indicates the user responsible for terminating the job. Only the user who launched the job, or administrators of the job's project context, can terminate the job.
Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job (i.e. one not in the same job tree) has a job-based object reference to, or otherwise depends on, a failed job, it will also fail. For more information about errors that jobs can encounter, see the Error Information page.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with a JobTimeoutExceeded error.
Jobs can automatically restart upon certain types of failures, such as UnresponsiveWorker, ExecutionError, AppInternalError, and JobTimeoutExceeded, as specified in the executionPolicy of an app(let) or workflow. If a job fails for a restartable reason, either its failure propagates to its nearest master job and that job is restarted (if the executable has the restartableEntryPoints flag set to its default value of master), or the job itself is restarted (if the executable has the restartableEntryPoints flag set to all). A job can be restarted up to the number of times given in the executionPolicy, after which the entire job tree fails.
Jobs belonging to root executions launched after July 12, 2023 00:13 UTC have an integer try attribute representing the different tries of restarted jobs. The first try of a job has try set to 0. The second job try (if the job was restarted) has its try attribute set to 1, and so on.
restartable: the job is ready to be restarted.
restarted: the job try has been restarted.
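The numbering of tries across restarts can be illustrated with a short sketch. This is not platform code: the attempt outcomes are hypothetical inputs, and max_restarts stands in for the restart limit configured in the executionPolicy.

```python
# Illustrative sketch: how the try attribute numbers restarts of a job.
# max_restarts stands for the restart limit in the executionPolicy;
# the per-attempt outcomes are hypothetical.
def run_with_restarts(outcomes, max_restarts):
    """Return (tries, final_state); tries maps try number -> outcome."""
    tries = {}
    for try_number, outcome in enumerate(outcomes):
        tries[try_number] = outcome
        if outcome == "done" or try_number == max_restarts:
            return tries, outcome
    return tries, outcomes[len(tries) - 1]

# First try (try=0) fails restartably, second try (try=1) succeeds:
tries, final = run_with_restarts(["failed:UnresponsiveWorker", "done"], max_restarts=1)
print(tries)   # {0: 'failed:UnresponsiveWorker', 1: 'done'}
print(final)   # done
```
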
Some API methods (e.g. /job-xxxx/describe, /job-xxxx/addTags, /job-xxxx/removeTags, /job-xxxx/setProperties, /system/findExecutions, /system/findJobs, and /system/findAnalyses) accept an optional job try input and may include the job's try attribute in their output. All API methods interpret a job ID input without a try argument as referring to the most recent try of that job.
For unsuccessful jobs, there are a few more states that a job may enter between the running state and its eventual terminal state of terminated or failed; unsuccessful jobs in any other non-terminal state transition directly to the appropriate terminal state.
terminating: the transitional state when the worker in the cloud has begun terminating the job and tearing down the execution environment. Once the worker in the cloud has reported that it has terminated the job or otherwise becomes unresponsive, then the job will transition to its terminal state.
debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.
All analyses start in the state in_progress, and, like jobs, will end up in one of the terminal states done, failed, or terminated. The following diagram shows the state transition for all successful analyses.
If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:
partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages will also be allowed to finish successfully if they can.
terminating: an analysis may enter this state either via an API call in which a user terminated the analysis, or under a failure condition in which the analysis is terminating its remaining stages. The latter may happen if the executionPolicy for the analysis (or for a stage of the analysis) has the onNonRestartableFailure value set to "failAllStages".
In general, compute and data storage costs due to jobs that fail because of user error (e.g. InputError, OutputError), as well as terminated jobs, are still charged to the project in which the jobs were run. Costs due to internal errors of the DNAnexus Platform are not billed.
The cost of each stage in an analysis is determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed and the second is not.
This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.
This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.
To run a Nextflow pipeline on the DNAnexus Platform:
Import the pipeline script from a remote repository or local disk
Convert the script to an app or applet
Run the app or applet
You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx command-line client.
A Nextflow pipeline script is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A main Nextflow file, with the extension .nf, containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.
(Optional) A nextflow.config file.
(Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. See this section for more information on how the exposed parameters are used at run time.
(Optional) Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
An nf-core flavored folder structure is encouraged but not required.
To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project’s Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.
Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, "https://github.com/nextflow-io/hello". Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.
An example of the Import Pipeline/Workflow modal:
Once you’ve provided the necessary information, click the Start Import button and the import process will start as a pipeline import job, in the project specified in the Import To field (default is the current project).
After you've launched the import job, you'll see a status message "External workflow import job started" appear.
You can access information about the pipeline import job in the project’s Monitor tab:
Once the import is complete, you can find the imported pipeline executable as an applet. This is the output of the pipeline import job you previously ran:
You can find the newly created Nextflow pipeline applet (e.g. hello) in the project:
To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository’s URL. Note that you can also provide optional information, such as a repository tag and an import destination:
If the Nextflow pipeline is in a private repository, use the --git-credentials option to provide the DNAnexus qualified ID or path of the credentials file on the Platform. Read more about this here.
Once the pipeline import job has finished, it generates a new Nextflow pipeline applet with an applet ID of the form applet-zzzz.
Use dx run -h to get more information about running the applet:
Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on local disk. For example, you may have a copy of the nextflow-io/hello pipeline from the Nextflow GitHub on your laptop, stored in a directory named hello, which contains the following files:
Ensure that the folder structure is in the required format, as described here.
To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:
This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet at the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.
The dx run -h command can be run to see information about this applet, similar to the example above.
A Nextflow pipeline applet has the type "nextflow" in its metadata. It behaves like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing it.
For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and see the Nextflow section for all arguments supported when building a Nextflow pipeline applet.
You can also build a Nextflow pipeline app from a Nextflow pipeline applet by running the command dx build --app --from applet-xxxx.
You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab will be displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.
To run the Nextflow pipeline executable, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:
You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:
Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob is responsible for a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).
To monitor the detailed logs of the head job and the subjobs, view each job's DNAnexus log via the UI or the CLI.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, you can view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. You can also view the costs (when your account has permission) and resource usage of a job.
An example of the log of a head job:
An example of the log of a subjob:
From the CLI, you can use the dx watch command to check the status and view the log of the head job or of each subjob.
Monitoring the head job:
Monitoring a subjob:
The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor, and multiple subjobs running a single process each. Throughout the pipeline’s execution, the head job remains in “running” state and supervises the job tree’s execution.
When a Nextflow head job (i.e. job-xxxx) enters a terminal state (i.e. "done" or "failed"), a Nextflow log file named nextflow-<job-xxxx>.log is written to the destination path of the head job.
DNAnexus supports the Docker container engine for the Nextflow pipeline execution environment. The pipeline developer may refer to a public or a private Docker repository. When the pipeline references a private Docker repository, provide your Docker credentials file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree.
Syntax of a private Docker credentials file:
For privacy reasons, it is encouraged to save this credentials file in a separate project that only a limited set of users has permission to access.
Below are all the means by which you can specify an input value at build time and at runtime. They are listed in order of precedence (items listed first have greater precedence and override items listed further down):
Executable (app or applet) run time
DNAnexus Platform app or applet input.
CLI example:
dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy
reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).
When an input parameter expects a file, you need to specify the value in the format corresponding to the class of the input parameter. When the input is of the "file" class, use a DNAnexus qualified ID (i.e. an absolute path to the file object, such as "project-xxxx:file-yyyy"); when the input is of the "string" class, use a DNAnexus URI ("dx://project-xxxx:/path/to/file"). See the table below for full descriptions of the PATH formats.
You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of the file class, while fasta_fai is an input parameter of the string class. You would then use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.
The DNAnexus object class of each input parameter is based on the "type" and "format" specified in the pipeline's nextflow_schema.json, when it exists. See the additional documentation here to understand how a Nextflow input parameter's type and format (when applicable) convert to an app or applet input class.
It is recommended to always specify input values via the app/applet inputs; the Platform validates the input class and existence before the job is created.
All inputs of a Nextflow pipeline executable are set as "optional" inputs. This gives users the flexibility to specify inputs via other means.
Nextflow pipeline command line input parameter (i.e. nextflow_pipeline_params). This is an optional "string" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced here.
Because nextflow_pipeline_params can carry parameters of string type with file-path format, use the DNAnexus URI format when a file is stored on DNAnexus.
Nextflow options parameter (i.e. nextflow_run_opts). This is an optional "string" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter that corresponds to the Nextflow run options pattern, specifying a preset input configuration.
Nextflow parameters file (i.e. nextflow_params_file). This is an optional "file" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to the -params-file option of nextflow run.
Nextflow soft configuration override files (i.e. nextflow_soft_confs). This is an optional "array:file" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the files being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to the -c option of nextflow run, and the order of this array of file inputs is preserved when they are passed to the nextflow run execution.
The soft configuration files can be used for assigning default values of configuration scopes (such as process).
It is highly recommended to use nextflow_params_file instead of nextflow_soft_confs for specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this in the nf-core documentation.
Pipeline source code:
nextflow_schema.json
Pipeline developers may specify default values of inputs in the nextflow_schema.json file.
If an input parameter is of Nextflow’s string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.
nextflow.config
Pipeline developers may specify default values of inputs in the nextflow.config file.
Pipeline developers may specify a default profile value using --profile <value> when building the executable, e.g. dx build --nextflow --profile test.
main.nf, sourcecode.nf
Pipeline developers may specify default values of inputs in the Nextflow source code files (*.nf).
If an input parameter is of Nextflow’s string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
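The precedence order above amounts to taking the first source, in priority order, that defines a value for a parameter. The following is an illustrative sketch, not platform code; the source labels follow the list above, and the parameter names and values are hypothetical.

```python
# Illustrative sketch of input-value precedence (highest first), per the
# ordered list above. Parameter names and values are hypothetical.
PRECEDENCE = [
    "applet_input",               # dx run -i name=value
    "nextflow_pipeline_params",   # -i nextflow_pipeline_params="--name value"
    "nextflow_run_opts",          # -i nextflow_run_opts="-profile test"
    "nextflow_params_file",       # -i nextflow_params_file=file-yyyy
    "nextflow_soft_confs",        # -i nextflow_soft_confs=file-1111
    "nextflow_schema.json",       # defaults in the pipeline schema
    "nextflow.config",            # defaults in the configuration file
    "source_code",                # defaults in main.nf / *.nf files
]

def resolve(param, sources):
    """Return (value, source) from the highest-precedence source defining param."""
    for source in PRECEDENCE:
        if param in sources.get(source, {}):
            return sources[source][param], source
    return None, None

# A default in nextflow.config is overridden by a dx run -i input:
sources = {
    "nextflow.config": {"genome": "GRCh37"},
    "applet_input": {"genome": "GRCh38"},
}
print(resolve("genome", sources))  # ('GRCh38', 'applet_input')
```
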
While you can specify a file input parameter's value in the different places described above, the valid PATH format referring to the same file differs depending on the level (DNAnexus API/CLI level or Nextflow script level) and on the class (file object or string) of the executable's input parameter. Examples are given below.
Scenario 1: the app or applet input parameter class is a file object, or the value is given at the CLI/API level (e.g. dx run --destination PATH).
Valid PATH format: DNAnexus qualified ID (i.e. absolute path to the file object).
E.g. (file): project-xxxx:file-yyyy, project-xxxx:/path/to/file
E.g. (folder): project-xxxx:/path/to/folder/
Scenario 2: the app or applet input parameter class is a string, or the value is given in Nextflow configuration and source code files (e.g. nextflow_schema.json, nextflow.config, main.nf, sourcecode.nf).
Valid PATH format: DNAnexus URI.
E.g. (file): dx://project-xxxx:/path/to/file
E.g. (folder): dx://project-xxxx:/path/to/folder/
E.g. (wildcard): dx://project-xxxx:/path/to/wildcard_files
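The two formats can be captured in a small helper. This is an illustrative sketch, not a DNAnexus library function; the project and path values are placeholders.

```python
# Illustrative sketch: format a Platform file reference according to the
# input parameter class, per the two scenarios above.
def format_path(project, path, input_class):
    """Return the PATH string expected for a 'file' or 'string' class input."""
    if input_class == "file":
        # DNAnexus qualified ID: absolute path to the file object
        return f"{project}:{path}"
    if input_class == "string":
        # DNAnexus URI, as used inside Nextflow scripts and config files
        return f"dx://{project}:{path}"
    raise ValueError(f"unsupported input class: {input_class}")

print(format_path("project-xxxx", "/path/to/file", "file"))    # project-xxxx:/path/to/file
print(format_path("project-xxxx", "/path/to/file", "string"))  # dx://project-xxxx:/path/to/file
```
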
When launching a DNAnexus job, you can specify a job-level output destination (e.g. project-xxxx:/destination/) using the platform-level optional parameter in the UI or the CLI. In addition, when publishDir is specified in the pipeline, each output file is located at <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned per the Nextflow script's process.
Read more detail about the output folder specification and publishDir here. Find an example of how to construct the output paths of an nf-core pipeline job tree at run time in our FAQ.
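The resulting output location can be sketched as a simple path join (illustrative only; the destination, publishDir value, and filename are placeholders following the pattern above).

```python
# Illustrative sketch of where a published output file lands, per the
# <dx_run_path>/<publishDir>/ pattern described above.
def output_location(dx_run_path, publish_dir, filename):
    """Compose the Platform path of an output file published by a process."""
    return f"{dx_run_path.rstrip('/')}/{publish_dir.strip('/')}/{filename}"

print(output_location("project-xxxx:/destination/", "results/alignments", "sample1.bam"))
# project-xxxx:/destination/results/alignments/sample1.bam
```
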
You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.
Follow the steps outlined here to configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to take note of the value you enter in the "Audience" field. You'll need to use this value in a configuration file used by your pipeline, to enable pipeline runs to access the S3 bucket in question.
Next, configure an AWS Identity and Access Management (IAM) role such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket in question.
The following example shows how to structure an IAM role's permission policy to enable the role to use an S3 bucket, accessible via the S3 URI s3://my-nextflow-s3-workdir, as the work directory of Nextflow pipeline runs:
Note in the above example:
The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.
The two entries in the "Resource" section enable the role to access all resources in the bucket accessible via the S3 URI s3://my-nextflow-s3-workdir.
The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:
Note in the above example:
To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).
To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).
Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, accessible at job-oidc.dnanexus.com.
Next, configure your pipeline so that when it runs, it can access the S3 bucket in question. To do this, add, in a configuration file, a dnanexus config scope that includes the properties shown in this example:
Note in the above example:
workDir is the path to the bucket to be used as a work directory, in S3 URI format.
jobTokenAudience is the "Audience" value you defined in Step 1 above.
jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus job identity token custom claims (for example, "project_id, launched_by") that the job must present in order to assume the role that enables bucket access.
iamRoleArnToAssume is the Amazon Resource Name (ARN) of the role that you configured in Step 2 above, which jobs will assume in order to access the bucket.
You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter within an aws config scope.
When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:
The list below shows, for each value of StringEquals:job-oidc.dnanexus.com/:sub, which jobs can assume the role that enables bucket access:
project_id;project-xxxx - any Nextflow pipeline job running in project-xxxx
launched_by;user-aaaa - any Nextflow pipeline job launched by user-aaaa
project_id;project-xxxx;launched_by;user-aaaa - any Nextflow pipeline job launched by user-aaaa in project-xxxx
bill_to;org-zzzz - any Nextflow pipeline job billed to org-zzzz
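The relationship between the ordered claims and the sub value matched by the trust policy can be sketched as follows. This is illustrative only: the claim names follow this documentation, and the job's claim values here are hypothetical.

```python
# Illustrative sketch: how the ordered custom subject claims combine into
# the sub value matched by StringEquals:job-oidc.dnanexus.com/:sub.
def compose_sub(claim_names, job_claims):
    """Interleave claim names with the job's values, joined by semicolons."""
    parts = []
    for name in claim_names:          # order must match the trust policy
        parts.extend([name, job_claims[name]])
    return ";".join(parts)

job_claims = {"project_id": "project-xxxx", "launched_by": "user-aaaa"}
print(compose_sub(["project_id", "launched_by"], job_claims))
# project_id;project-xxxx;launched_by;user-aaaa
```
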
Having included custom subject claims in the trust policy for the role in question, you then need, in the aforementioned Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, in the same order in which you entered them in the trust policy.
For example, if you configured a role's trust policy as in the above example, you are requiring a job, in order to assume the role, to present the custom subject claims project_id and launched_by, in that order. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, as follows:
Note that you must also, within the dnanexus config scope, set the value of iamRoleArnToAssume to that of the appropriate role:
By default, the Platform limits apps' and applets' ability to read and write data. Nextflow pipeline apps and applets have the following capabilities as exceptions to these limits:
External internet access ("network": ["*"]) - required for Nextflow pipeline apps and applets to pull Docker images from external Docker registries at runtime.
UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - required for Nextflow pipeline jobs to record the progress of executions and to preserve the run cache, which enables resume functionality.
You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building from local disk, using the --extra-args flag with dx build. An example:
In this example, note:
"network": [] prevents jobs from accessing the internet.
"allProjects": "VIEW" raises jobs' access permission level to VIEW. This means that each job has "read" access to all projects accessible to the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs (via a samplesheet.csv, for example) from projects other than the one in which a job is being run.
There are additional options for dx build --nextflow:
--profile PROFILE (string): sets the default profile for the Nextflow pipeline executable.
--repository REPOSITORY (string): specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.
--repository-tag TAG (string): specifies a tag for the Git repository. Can be used only with --repository.
--git-credentials GIT_CREDENTIALS (file): a file containing credentials for accessing a private Git repository.
--cache-docker (flag): stores a container image tarball in the currently selected project in /.cached_dockerImages. Currently only the Docker engine is supported. Incompatible with --remote.
--nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS (string): custom pipeline parameters to be referenced when collecting the Docker images.
--docker-secrets DOCKER_SECRETS (file): a dx file ID with credentials for a private Docker repository.
Use dx build --help
for more information.
When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:
When building a Nextflow pipeline executable, you can replace any Docker container with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.
This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. Additionally, it fortifies data security by eliminating the need for internet access to external resources, during pipeline execution.
Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching, or manually preparing the tarballs.
Requires running a "building job" with external internet access?
Built-in Docker image caching: yes, if building an applet for the first time or if any image is to be updated; no internet access is required upon rebuild.
Manually preparing the tarballs: no.
Docker images packaged as bundledDepends?
Built-in Docker image caching: yes; Docker images that will be used in the execution are cached and bundled at build time.
Manually preparing the tarballs: no; Docker tarballs are resolved at runtime.
At runtime
Built-in Docker image caching: the job attempts to access the Docker image cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the image from the external repository, via the internet.
Manually preparing the tarballs: the job attempts to locate the Docker image at the referenced Docker cache path. If this fails, the job attempts to pull from the external repository, via the internet.
This method initiates a building job that begins by taking the pipeline script, then identifying Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling in the tarballs as bundledDepends.
You can use built-in caching via the CLI by passing the --cache-docker flag at build time. All cached Docker tarballs are stored as file objects within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
If you need to access a Docker container stored in a private repository, you must provide, along with the --docker-secrets flag, a file object that contains the credentials needed to access the repository. This object must be in the following format:
You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball in one of the following three ways:
Reference each tarball by its unique Platform ID (e.g. dx://project-xxxx:file-yyyy). Use this approach if you want deterministic execution behavior. You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:
Within a Nextflow pipeline script, you can also reference a Docker image by using its full image name. Use this name within a path that's in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>
An example:
Note that no file extension is necessary, and that project-xxxx is the project where the Nextflow pipeline executable was built and will be executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. Note as well that an exact <version> reference must be included; latest is not an accepted tag in this context.
Here are several examples of tarball file object paths and names, as constructed from image names and version tags:
Image quay.io/biocontainers/tabix, version 1.11--hdfd78af_0: project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
Image python, version 3.9-slim: project-xxxx:/.cached_docker_images/python/python_3.9-slim
Image python, version latest: the Nextflow pipeline job will attempt to pull from the remote external registry
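The path construction in these examples can be sketched as follows. This is an illustrative helper, not a dx command, and it assumes the default cache folder name used in the examples above.

```python
# Illustrative sketch: build the expected tarball file object path from a
# Docker image name and version tag, per the examples above.
def cached_image_path(project, image, version, cache_folder=".cached_docker_images"):
    """Return the cache path for an image, or None for the 'latest' tag."""
    name = image.rsplit("/", 1)[-1]   # drop any registry/namespace prefix
    if version == "latest":
        return None  # not accepted: the job would pull from the remote registry
    return f"{project}:/{cache_folder}/{name}/{name}_{version}"

print(cached_image_path("project-xxxx", "quay.io/biocontainers/tabix", "1.11--hdfd78af_0"))
# project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
print(cached_image_path("project-xxxx", "python", "3.9-slim"))
# project-xxxx:/.cached_docker_images/python/python_3.9-slim
```
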
You can also reference Docker images in pipeline scripts by digest, for example <Image_name>@sha256:XYZ123…. To refer to a tarball file on the Platform in this way, the file object must have been assigned a property image_digest, for example "image_digest":"<IMAGE_DIGEST_HERE>".
An example:
Based on the input parameter's type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter is assigned the corresponding class (ref1, ref2).
The mapping from a Nextflow input parameter's type and format (as defined in nextflow_schema.json) to a DNAnexus input parameter class is:
type string, format file-path: class file
type string, format directory-path: class string
type string, format path: class string
type string, no format: class string
type integer: class int
type number: class float
type boolean: class boolean
type object: class hash
As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which will be assigned to the "file" or "string" class, respectively. When running the executable, based on the class (file or string) of the executable's input parameter, you will use a specific PATH format to specify the value. See the documentation here for the acceptable PATH format for each class.
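For illustration, a hypothetical nextflow_schema.json fragment (the parameter names are made up) declaring one file-class and one string-class input:

```json
{
  "properties": {
    "input_vcf":   { "type": "string", "format": "file-path" },
    "sample_name": { "type": "string" }
  }
}
```

Here input_vcf would surface as a file-class input on DNAnexus, while sample_name would surface as a plain string-class input.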
When converting a file reference from a URL format (e.g. dx://project-xxxx:/path/to/file, a DNAnexus URI) to a String, use the method toUriString(). The method toURI().toString() does not give the same result, as toURI() removes the context ID (e.g. project-xxxx), and toString() removes the scheme (e.g. dx://). More info about the Nextflow methods is available here.
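As an illustration (the variable name and path are placeholders), for a Nextflow path object resolved from a DNAnexus URI:

```nextflow
// myFile was resolved from dx://project-xxxx:/path/to/file
myFile.toUriString()      // keeps scheme and context: dx://project-xxxx:/path/to/file
myFile.toURI().toString() // loses the context ID (project-xxxx)
myFile.toString()         // loses the scheme (dx://)
```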
All files generated by a Nextflow job tree will be stored in its session's corresponding workDir (i.e. the path where the temporary results are stored). On DNAnexus, when the Nextflow pipeline job is run with preserve_cache=true, the workDir is set to the path project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job took place; you can follow this path to access all preserved temporary results. Access to these results is useful for investigating detailed pipeline progress, and for resuming job runs during pipeline development. More info about workDir is described here.
When the Nextflow pipeline job is run with preserve_cache=false (the default), temporary files are stored in the job's temporary workspace, which is deconstructed when the head job enters a terminal state (i.e. "done", "failed", or "terminated"). Since many of these files are intermediate inputs and outputs passed between processes, and are expected to be cleaned up once the job is completed, running with preserve_cache=false helps reduce project storage costs for files that are not of interest, and saves you from having to remember to clean up all temporary files.
To save the final results of interest, and to display them as the Nextflow pipeline executable’s output, you can declare output files matching the declaration under the script’s output:
block, and use Nextflow’s optional publishDir directive to publish
them.
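A minimal sketch (the process name, paths, and file names are illustrative) of declaring an output and publishing it with the copy mode, the only publish mode supported on DNAnexus:

```nextflow
process SUMMARIZE {
    // Publish declared outputs to a relative folder; only mode 'copy'
    // is supported on DNAnexus. The path here is illustrative.
    publishDir path: './results/summary/', mode: 'copy'

    output:
    path 'summary.txt'

    script:
    """
    echo "pipeline summary" > summary.txt
    """
}
```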
This makes the published output files the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, of class array:file. The files will then be organized under the relative folder structure assigned via publishDir. This works for both preserve_cache=true and preserve_cache=false. Only the "copy" publish mode is supported on DNAnexus.
At pipeline development time, the valid value of publishDir can be:
A local path string, e.g. publishDir path: './path/to/nf/publish_dir/'
A dynamic string value defined as a pipeline input parameter (e.g. params.outdir, where outdir is a string-class input), allowing pipeline users to determine parameter values at runtime. For example, publishDir path: '${params.outdir}/some/dir/', './some/dir/${params.outdir}/', or './some/dir/${params.outdir}/some/dir/'.
When publishDir
is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing the publishDir
to be a valid relative path.
Find an example on how to construct output paths for an nf-core pipeline job tree at run time from our FAQ.
The queueSize option is part of Nextflow's executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this represents the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration assigns a value to queueSize, it overrides the default. If the value exceeds the upper limit on DNAnexus (1000), the root job will error out. See the Nextflow executor configuration page for examples.
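A sketch of overriding the default in nextflow.config (the value 20 is illustrative; it must not exceed the DNAnexus limit of 1000):

```nextflow
// nextflow.config
executor {
    queueSize = 20  // create up to 20 subjobs at a time instead of the default 5
}
```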
The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change the head job to a different instance type, but this is not recommended. The head job executes and monitors the subjobs; changing its instance type does not affect the computing resources available to the subjobs, where most of the heavy computation takes place (see below for where to configure instance types for Nextflow processes). Changing the head job's instance type may be necessary only if it runs out of memory or disk space when staging input files, or when collecting and uploading pipeline output files to the project.
Each subjob's instance type is determined by the profile information provided in the Nextflow pipeline script. You can specify a required instance by instance type name via Nextflow's machineType directive (example below), or via a set of system requirements (e.g. cpus, memory, disk, etc.) according to the official Nextflow documentation. The executor chooses the instance type that matches the minimal requirements described in the Nextflow pipeline profile using the following logic:
Choose the cheapest instance that satisfies the system requirements.
Use only SSD instance types.
All else being equal (price and instance specifications), prefer a version 2 (v2) instance type.
Order of precedence for subjob instance type determination:
The value assigned to machineType
directive.
Values assigned to cpus
, memory
, and disk
directives in their configuration.
An example command for specifying machineType
by DNAnexus instance type name is provided below:
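A sketch of pinning a process to a DNAnexus instance type by name in the pipeline configuration (the process selector is hypothetical; mem2_ssd1_v2_x8 is one example of a DNAnexus instance type name):

```nextflow
// nextflow.config
process {
    withName: 'ALIGN_READS' {
        machineType = 'mem2_ssd1_v2_x8'  // DNAnexus instance type name
    }
}
```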
Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without starting from the beginning of the pipeline. By retrieving cached progress, Nextflow resume saves pipeline developers both time and compute costs. It is helpful for testing and troubleshooting while building and developing a Nextflow pipeline.
Nextflow uses a scratch storage area for caching and preserving each task's temporary results. This directory is called the "working directory", and its path is defined by:
The session id, a universally unique identifier (UUID) associated with the current execution
Each task's unique hash ID: a hash composed of the task's input values, input files, command line string, container ID (e.g. Docker image), conda environment, environment modules, and executed scripts in the bin directory, when applicable.
You can utilize the Nextflow resume feature with the following Nextflow pipeline executable parameters:
preserve_cache
Boolean type. Default value is false. When set to true, the run is cached in the current project for future resumes. For example:
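A sketch of a launch command (the applet name is a placeholder, and any other required pipeline inputs are omitted):

```shell
dx run applet-xxxx -i preserve_cache=true
```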
This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID
and each subjob’s unique ID.
The session's cache directory, containing information on the location of the workDir, the session progress, etc., is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.
Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is the task's unique hash ID, and project-xxxx is the project where the job tree is executed.
resume
String type. Default value is an empty string, in which case the run starts from scratch. When assigned a session id, the run resumes from what is cached for that session id in the project. When assigned "true" or "last", the run determines the session id corresponding to the latest valid execution in the current project and resumes from it. For example:
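Sketches of launch commands (the applet name and session ID are placeholders):

```shell
# Resume from a specific session's cache:
dx run applet-xxxx -i resume=<session_id>
# Or resume from the latest valid execution in the current project:
dx run applet-xxxx -i resume=last
```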
Below are four possible scenarios and the recommended use cases for -i resume:

| Scenario | Parameters | Use Cases | Note |
| --- | --- | --- | --- |
| 1 (default) | resume="" (empty string) and preserve_cache=false | Production data processing; most high-volume use cases | |
| 2 | resume="" (empty string) and preserve_cache=true | Pipeline development; only happens for the first few pipeline tests | During development, it is useful to see all intermediate results in workDir. Only up to 20 Nextflow sessions can be preserved per project. |
| 3 | resume=<session_ID>/"true"/"last" and preserve_cache=false | Pipeline development; pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh | |
| 4 | resume=<session_ID>/"true"/"last" and preserve_cache=true | Pipeline development; only happens for the first few tests | Only 1 job with the same <session_ID> can run at any given time. |
It is good practice to clean up the workDir frequently to save on storage costs. A maximum of 20 sessions can be preserved in a DNAnexus project. If you exceed the limit, the job will generate an error with the following message:
"The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Please remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run."
To clean up all preserved sessions under a project, you can delete the entire .nextflow_cache_db folder. To clean up a specific session's cached folder, you can delete that session's .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:
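For example, to delete a specific session's cached folder (the project and session IDs are placeholders):

```shell
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/
```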
Note that deleting an object in the UI or via the CLI dx rm cannot be undone. Once a session's work directory is deleted or moved, subsequent runs will not be able to resume from that session.
For each session, only one job at a time is allowed to resume the session's cached results and preserve its own progress to that session. There is no limit on multiple jobs resuming and preserving multiple different sessions, as long as each job preserves a different session. There is also no limit on multiple jobs resuming the same session, as long as at most one of them preserves its progress to that session.
Nextflow's errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default the process and any other pending processes stop immediately (i.e. errorStrategy terminate), and this in turn forces the entire pipeline execution to be terminated.
The Nextflow executor has four error strategy options: terminate, finish, ignore, and retry. Below is a summary of the behavior of each strategy for the errored subjob, the head job, and all other subjobs (i.e. subjobs that have not yet entered their terminal states).

terminate
Errored subjob: Job properties are set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"self". Ends in the "failed" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"job-xxxx", "nextflow_terminated_subjob":"job-yyyy, job-zzzz", where job-xxxx is the errored subjob, and job-yyyy and job-zzzz are the other subjobs that were terminated due to this error. Ends in the "failed" state immediately, with error message "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure".
All other subjobs: End in the "failed" state immediately.

finish
Errored subjob: Job properties are set with "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"job-xxxx, job-2xxx", where job-xxxx and job-2xxx are the errored subjobs. Creates no new subjobs after the time of the error. Ends in the "failed" state eventually, after other existing subjobs enter their terminal states, with error message "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure.".
All other subjobs: Keep running until they enter their terminal states. If an error occurs in any of these subjobs (e.g. job-2xxx), the finish errorStrategy is applied to that subjob, because a finish errorStrategy was hit first, ignoring any other error strategies set in the pipeline's source code or configuration, per Nextflow's default behavior.

retry
Errored subjob: Job properties are set with "nextflow_errorStrategy":"retry", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Spins off a new subjob that retries the errored job, with the job name <name> (retry: <RetryCount>), where <name> is the original subjob name and <RetryCount> is the order of this retry (e.g. retry: 1, retry: 2). Ends in a terminal state ("done", "failed", or "terminated") depending on the terminal states of the other currently existing subjobs.
All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob's corresponding Nextflow process is applied.

ignore
Errored subjob: Job properties are set with "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"job-1xxx, job-2xxx". Shows "subjob(s) <job-1xxxx>, <job-2xxxx> runs into Nextflow process errors' ignore errorStrategy were applied" at the end of the job log. Ends in a terminal state ("done", "failed", or "terminated") depending on the terminal states of the other currently existing subjobs.
All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob's corresponding Nextflow process is applied.
When more than one errorStrategy directive applies within a pipeline job tree, the following rules are applied, depending on which errorStrategy is triggered first.
When terminate is the first errorStrategy directive triggered in a subjob, all other ongoing subjobs enter the "failed" state immediately.
When finish is the first errorStrategy directive triggered in a subjob, any other errorStrategy reached in the remaining ongoing subjob(s) will also apply the finish errorStrategy, ignoring any other error strategies set in the pipeline's source code or configuration.
When retry is the first errorStrategy directive triggered in a subjob, any terminate, finish, or ignore errorStrategy triggered in the remaining subjobs is applied to the corresponding subjob.
When ignore is the first errorStrategy directive triggered in a subjob, any terminate, finish, or retry errorStrategy that applies in the remaining subjob(s) is applied to the corresponding subjob.
Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as "ExecutionError", "UnresponsiveWorker", "JMInternalError", "AppInternalError", or "JobTimeoutExceeded", the subjob follows the executionPolicy assigned to it and restarts itself. The restart does not start over from the head job.
A: You can find the errored subjob’s job ID from the head job’s nextflow_errored_subjob
and nextflow_errorStrategy
properties to investigate which subjob failed and which errorStrategy
was applied. To query these errorStrategy
related properties in CLI, you can run the following command:
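A sketch, where job-xxxx is the head job's ID (the jq variant assumes jq is installed):

```shell
# Human-readable description, including job properties:
dx describe job-xxxx
# Or, to extract only the properties (assumes jq is installed):
dx describe job-xxxx --json | jq '.properties'
```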
where job-xxxx
is the head job’s job ID.
Once you find the errored subjob, you can investigate the job log on the Monitor page by accessing the URL "https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>", where jobID is the subjob's ID (e.g. job-yyyy), or watch the job log in the CLI using dx watch job-yyyy.
If you had the preserve_cache value set to true when starting the Nextflow pipeline executable, you can trace the cached workDir (e.g. project-xxxx:/.nextflow_cache_db/<session_id>/work/) and investigate the intermediate results of the run.
A: You can find the Nextflow version used by reading the log of the head job. Each built Nextflow executable is locked to a specific version of the Nextflow executor.
A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true
in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.
A: There are many possible causes for the head job to hang. One known cause is the trace report file being written directly to a DNAnexus URI (e.g. dx://project-xxxx:/path/to/file). To avoid this, we suggest specifying -with-trace path/to/tracefile (using a local path string) in the Nextflow pipeline applet's nextflow_run_opts input parameter.
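A sketch (the applet name and trace path are placeholders):

```shell
dx run applet-xxxx -i nextflow_run_opts="-with-trace ./trace.txt"
```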
Taking nf-core/sarek (3.3.1) as an example, start with reading the pipeline's logic:
The pipeline's publishDir
is constructed with a prefix of the params.outdir
variable followed by each task's name for each subfolder:
publishDir = [ path: { "${params.outdir}/${...}" }, ... ]
params.outdir is a required input parameter to the pipeline, and the default value of params.outdir is null. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir, which will:
Meet the input requirement for executing the pipeline.
Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.
To specify a value of params.outdir
for the Nextflow pipeline executable built from the nf-core/sarek
pipeline script, you can use the following command:
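A sketch, assuming the executable exposes outdir as an input parameter (the applet name and path are placeholders):

```shell
dx run applet-xxxx -i outdir='local/to/outdir'
```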
You can also set a job tree's output destination using --destination:
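For example (the applet name and paths are placeholders, chosen to match the output paths listed below):

```shell
dx run applet-xxxx \
  -i outdir='local/to/outdir' \
  --destination project-xxxx:/path/to/jobtree/destination/
```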
The above command will construct the final output paths in the following manner:
project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.
project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of the all tasks/processes/subjobs of this pipeline.
project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.
Get an overview of the range of different charts you can build and use in the Cohort Browser.
While working in the Cohort Browser, you can visualize data using a variety of different types of charts.
The following single-variable chart types are available in the Cohort Browser:
The following multi-variable chart types are available in the Cohort Browser:
In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.
This figure is not always the same as the number of records in the cohort.
In a single-variable chart, if the field in question is empty or contains a null value for a given record, that record will not be included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon will appear next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.
The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record won't be included in the chart total count, as its data can't be visualized.
Learn to build and use histograms in the Cohort Browser.
Histograms can be used to visualize numerical, date, and datetime data.
In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or very similar values, in a particular field.
The Cohort Browser automatically groups records into bins, based on the distribution of values in the dataset, for the field in question. Values are distributed in a linear fashion, on the x axis.
Below is a sample histogram showing the distribution of values in a field Critical care total days. Note the label under the chart title, indicating the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you'll see the following informational message below the chart:
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Integer
Integer Sparse
Float
Float Sparse
Date
Date Sparse
Datetime
Datetime Sparse
Learn to build and use row charts in the Cohort Browser.
Row charts can be used to visualize categorical data.
When creating a row chart, note that:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values cannot be organized in a hierarchy
In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."
Below is a sample row chart showing the distribution of values in a field Salt added to food. Note that in the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.
String Categorical
String Categorical Sparse
String Categorical Multi-select
Integer Categorical
Integer Categorical Multi-select
An overview of the Cohort Browser's key features and how to use them.
DNAnexus Apollo builds on the technological foundation of the core DNAnexus Platform to offer scientists and bioinformaticians an environment to store and query large sets of genomic, phenotypic, multi-omic, and other structured data. Researchers can bring their data to the Platform and leverage DNAnexus apps to ingest the data into queryable databases.
The Cohort Browser dashboard can show up to three tabs based on the configuration of the dataset: Overview, Data Preview, and either Genomics (if the dataset contains germline genomic data) or Somatic Variants (if the dataset contains somatic variant data). Tabs are loaded as the user clicks on them, so if there is no change in filtering, the tabs will stay cached and will not need to reload.
From the project where a dataset is located, go to the Manage tab and select your dataset of interest. Click the Explore Data action to open the dataset in the Cohort Browser.
You can also access datasets via the Datasets page, which is located under the Projects menu. The Datasets page displays all datasets you have access to, and enables you to browse and find a specific dataset without navigating through projects.
You can use the optional information panel to view further information about a selected dataset, including creator, sponsorship, etc.
In the Cohort Browser's Overview tab, you'll see visualizations that provide an introduction to the dataset, and insights on the data it contains.
To create and view a chart visualizing data in a field, click the Add Tile button. The Add Tile dialog will open, showing a hierarchical view of all the data fields available in the dataset.
Browse the list or search an item by its title to narrow down the list.
Select a data field from the list. In the Data Field Details panel, you can see metadata on the selected data field, visualization preview, as well as options to customize chart types.
Confirm selection via the Add as Tile action. The new tile will appear on your dashboard.
Once you've selected a primary data field in the Add Tile dialogue, you can add a secondary data field by clicking on the + icon next to an eligible secondary data field.
This video provides a detailed overview of exploring new datasets using the Cohort Browser:
From the cohort which you wish to edit, click on Add Filter button.
Select a data field you want to filter by, confirm by clicking on Add as Filter.
Select operators and enter values to filter by. Click on Apply Filter to confirm.
Filters added are displayed in corresponding cohort panels. You can edit a specific filter at any time by clicking on it, which brings up the Edit Filter dialogue.
The default logical operator is 'AND'. To switch the operator to 'OR', click on the operator. For a filter group (a set of filters tied to 1 specific entity), all operators will be the same: all 'OR' or all 'AND'.
Once filters are added or edited, an updated cohort size will appear under the name of the affected cohort. The dashboard will also auto-refresh to fetch updated results based on the latest cohort selection.
If your dataset includes germline genomic data, then you will have the option to add a genomic filter to your cohort.
From the cohort you wish to edit, click on Add Filter button.
Toggle to Geno tab.
Edit filter in Edit Genomic Filter dialogue by one of the following criteria:
Filter by genes and variant effects: Filter your dataset by variants of certain types and consequences within specified genes and/or genomic ranges. A maximum of 5 genes/ranges can be entered.
Filter by a list of variant IDs. A maximum of 100 variants can be entered.
If more than 1 range, gene, or variant is added, the values should be comma separated or each value must be on a new line.
Confirm edit by clicking on Apply Geno Filter button.
Similarly to the other cohort filters, a genomic filter is applied to the main entity of your dataset (in most cases, patients or participants).
For datasets with canonical transcript information available, an additional toggle will appear in the Genomic Filter dialogue titled "Match effects for canonical transcript only" which may be set to YES initially in order to restrict the results only to variants that have canonical information available.
In the Add Filter / Edit Filter pane, you will see options enabling you to:
Filter by genes and variant effects.
Filter by a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol.
Filter by a list of variant IDs. A maximum of 10 variants can be entered.
Note that for each somatic variant filter, you can specify if matching variants are to be used as inclusion criteria or exclusion criteria for your cohort. By default, you will be selecting patients who have at least one detected variant that matches the specified criteria. To select patients or participants who do not have any matching variant, click the “WITH” dropdown button and change its value to “WITHOUT”.
You can create up to 10 somatic variant filters for each cohort.
When working with datasets that have multiple data entities, you can create a join filter by selecting data fields from a secondary entity and adding them as filters. An entity is a grouping of data around a unique item, event, or a concept: e.g. patient, visit, medication, laboratory tests.
Join Filters are displayed as subrows deriving from the main entity. Depending on the entity to which your selected data field belongs, a join filter that reflects the relationship between those entities will be automatically created. To create a new cohort criterion using join filters, click + Add filter or Filter > Add filter on a tile. To add additional criteria to an existing criterion in a join, click Add additional criteria inline on the row of the chosen filter.
You can choose between the 'AND' or 'OR' logical operators when creating a cohort and comparing join filters. To switch between them, click on the logical operator. For a specific level of join filtering, joins are either all 'AND' or all 'OR'. Note that even when using 'OR' for two join filters, the implication that "this criteria exists" precedes the join level, i.e. “where exists, join 1 or join 2”.
Once a join filter is created, you can further define the secondary entity by adding additional criteria to the branch, or adding more layers of join filters deriving from the current branch. As you add more layers, the field selector automatically hides fields that are ineligible to be added based on the join.
For an example of interpreting join filters, consider the following:
The First Example cohort identifies all patients with a "high" or "medium" risk level who have a first (visit instance = 1) hospital visit and who also have had a lab test that was a "nasal swab". This lab test does not necessarily have to be conducted at the time of the patient’s first hospital visit. In the Second Example, the cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed on the first visit.
This video provides an overview of setting up your dashboard as part of defining and refining a cohort:
You can create complex cohorts by combining existing cohorts from the same dataset. Cohort combine can be accessed via the “Compare / Combine Cohorts” menu located at the top of the page.
The Cohort Browser supports the following combination logic:
Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.
In the Data Preview tab, the Cohort Table shows records that are within your current cohort selection split up by the entity they are on. You can add or remove data fields as columns via the column customization menu, which is located in the top-right corner of the table. As you add fields, entities are automatically split out. You can have up to 5 different entities showing at a time. In the entity drop down, you can toggle between various entities you’ve added and remove them outright.
Click on table column headers to access more functionalities including sorting and searching in a specific column and the data field information. From the data field information, you can quickly add the field as a tile or a filter if it has not been added yet.
You can export table information either as a list of record IDs or a CSV file. Export options are available on the top-right corner of the table once you have selected a number of table rows.
The Variant Browser shows variants that are present in current cohort selection.
For datasets containing germline data, the Variant Browser appears in the Genomics tab. For datasets containing somatic variants data, it appears in the Somatic Variants tab.
For datasets containing germline data, the Variant Browser includes a lollipop plot displaying allele frequencies for variants in a specified genomic region.
The table below the lollipop plot displays a list of the same variants in tabular format, along with further annotation information including:
Type: whether the variant is a SNP, deletion, insertion, or mixed.
Population Allele Frequency: Allele frequency calculated across entire dataset from which the cohort is created.
Cohort Allele Frequency: Allele frequency calculated across current cohort selection.
If canonical transcript information is available, the following three columns with additional annotation information will appear in the Allele Table:
Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.
HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.
HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.
To view further annotation information, you can go to the detail page of a given variant by clicking on the link in the Location column.
Note the following:
The lollipop plot displays variant information in one gene / canonical protein at a time.
Each “lollipop” represents amino acid changes at a given location (e.g. “Thr322Ala”), with location information visualized as horizontal position (X axis) and affected sample frequency in the current cohort visualized as height (Y axis).
Each lollipop is color-coded by consequence according to the canonical transcript. Lollipops that cover more than one consequence type are color-coded as “Multiple Consequences”.
You can inspect variant statistics for each specific consequence by interacting with the legend panel and selecting one color at a time. Selecting a particular lollipop in the plot applies a filter to the variants table, such that only variants corresponding to the selected lollipop are displayed.
You can navigate to a different gene by entering the gene symbol in the Go to Gene field. This will also update the variants table by automatically navigating to the corresponding genomic region.
For datasets containing somatic variant data, the Variant Browser includes a chart illustrating the overall variant landscape across the top-mutated genes, for the current cohort.
Note the following:
The genes are sorted in descending order of percent of affected samples.
Samples are displayed, from left to right, from those that have the greatest number of mutated genes across all genes, to those that have the least.
A sample is considered affected if it has at least one detected mutation of high or moderate impact within the canonical transcript for that gene.
Samples are color-coded by consequences. Samples with two or more detected variants are color-coded as “Multi Hit”.
The variant frequency matrix plot displays up to 50 top mutated genes, for up to 500 samples for any given cohort.
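The affected/“Multi Hit” rules above can be sketched as a small classifier for one (sample, gene) cell of the matrix. The input shape is hypothetical; impact labels follow SnpEff's HIGH/MODERATE convention:

```python
HIGH_OR_MODERATE = {"HIGH", "MODERATE"}

def classify_cell(variants):
    """Return the color category for one (sample, gene) matrix cell.

    `variants`: list of (consequence, impact) pairs detected in the sample
    within the gene's canonical transcript (an assumed shape, for
    illustration only).
    """
    hits = [consequence for consequence, impact in variants
            if impact in HIGH_OR_MODERATE]
    if not hits:
        return None              # sample is not affected for this gene
    if len(hits) >= 2:
        return "Multi Hit"       # two or more detected variants
    return hits[0]               # color-coded by the single consequence
```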
You can inspect variant statistics under each specific consequence, by interacting with the legend panel and selecting one color at a time.
Hovering over a particular sample cell will open a popover window showing detailed information on the sample, including:
The Sample ID
The count of variants per consequence, within the respective gene
For datasets containing somatic variant data, the Variant Browser includes a Variants Table that provides, in tabular format, details on the variants found in a particular genomic region, in samples for the current cohort.
Information displayed in the Variants Table includes:
Location of variant
Reference allele of variant
Alternate allele of variant
Type of variant
Variant consequences, with entries color-coded by level of severity
HGVS cDNA
HGVS Protein
You can export selected variants in the table as a list of variant IDs or a CSV file. Export options will appear at the top-right corner of the table once you have items selected.
Cohorts will be saved with the filters applied, along with the latest set of visualizations and dashboard layout information. Similar to Dataset objects, Cohort objects can be found under the Manage tab in your selected projects, and can be re-opened via the Explore Data option.
You can export a list of main entity IDs in your current cohort selection as a CSV file. This action is located next to the Save Cohort button, at the top-right corner of the cohort panel.
Dashboard views contain layout and configuration information that can be re-used during cohort browsing. You can save or load a dashboard view via the Dashboard Actions menu located at the top-right corner of the header area. Dashboard views are saved as "Type: DashboardView" objects, which once saved also show up in selected project folders.
You can compare two cohorts by adding both cohorts into the Cohort Browser. In cohort compare mode, all visualizations are converted to show data from both cohorts.
The Compare Cohort action can be found in the header area next to the cohort title. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.
In compare mode, you can continue to edit both cohorts and visualize the results dynamically.
You can compare a cohort with its complement in the dataset by selecting the “Not In …” option in the Compare / Combine menu. Similar to combining cohorts, you must first save your current cohort before creating its not-in counterpart.
If a database is in a project that has a restricted downloadPolicy, then a Cohort Browser that shows a dataset, cohorts, or dashboards pointing to that database should not allow downloads, regardless of project.
If all copies of the dataset are in projects that have a restricted downloadPolicy, then a Cohort Browser that shows that dataset, or cohorts or dashboards pointing to it, should not allow downloads, regardless of project. Check all copies of the dataset: if at least one copy is in a "download allowed" project, then the download will be allowed.
If the cohort or dashboard you are launching has a restricted downloadPolicy, then a Cohort Browser that shows that cohort or dashboard should not allow downloads.
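Taken together, the rules above amount to: downloads are blocked if the backing database is restricted, if every dataset copy is restricted, or if the launched cohort/dashboard record is restricted. A sketch with booleans standing in for project download policies (an illustrative helper, not a DNAnexus API):

```python
def downloads_allowed(database_restricted, dataset_copy_policies, record_restricted):
    """Decide whether the Cohort Browser should allow downloads.

    database_restricted: True if the backing database's project restricts downloads.
    dataset_copy_policies: one boolean per dataset copy
        (True = that copy's project restricts downloads).
    record_restricted: True if the launched cohort/dashboard is restricted.
    """
    if database_restricted or record_restricted:
        return False
    # At least one copy in a "download allowed" project unlocks downloads.
    return any(not restricted for restricted in dataset_copy_policies)
```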
Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.
To generate a survival chart, select one numerical field representing time, and one categorical field, which will be transformed into the individual’s status.
The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": “living”, “alive”, “diseasefree”, “disease-free”
For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.
To calculate survival percent at the current event, the system evaluates the formula ST = (LT0 − D) / LT0, where:
ST: Survival at the current event
LT0: Number of subjects living at the start of the period or event
D: Number of subjects that died
For each time period the following values are generated:
Status
Each individual is considered Dead unless they qualify as Living
Number of Subjects Living at the Start (LT0)
For the initial event, this is the total number of records returned by the backend from survival data with a Living or Dead status.
For follow-up events, this is the number of subjects at the start of the previous event, minus the number of subjects that died in the previous event, minus the subjects that dropped out or were censored in the previous event.
Number of Subjects Who Died (D)
1 for each individual who at the event does not have a status of Living
Number of Subjects Dropped or Censored
1 for each individual who at the event has a status of Living
Survival Percent at the Current Event (ST)
Cumulative Survival (S)
S = ST × ST-1, where ST-1 is the survival value carried forward from the previous event (1 for the first event).
Note that this is the actual point drawn on the survival plot.
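Putting the definitions above together, the per-event bookkeeping is the standard Kaplan-Meier recurrence: ST = (LT0 − D) / LT0 at each event, multiplied into the running cumulative survival. A sketch, assuming input as (time, status) pairs (a hypothetical shape, not the actual backend format):

```python
from collections import defaultdict

# Status terms treated as "Living" (case-insensitive), per the list above.
LIVING_TERMS = {"living", "alive", "diseasefree", "disease-free"}

def kaplan_meier(records):
    """records: list of (time, status_string) pairs.
    Returns a list of (time, cumulative_survival) points.
    An individual is considered Dead unless their status matches LIVING_TERMS.
    """
    by_time = defaultdict(lambda: {"died": 0, "censored": 0})
    for t, status in records:
        if status.lower() in LIVING_TERMS:
            by_time[t]["censored"] += 1   # Living at this event: dropped/censored
        else:
            by_time[t]["died"] += 1       # Dead at this event
    at_risk = len(records)                # LT0 for the first event
    s = 1.0                               # cumulative survival S
    curve = []
    for t in sorted(by_time):
        d = by_time[t]["died"]
        st = (at_risk - d) / at_risk      # ST = (LT0 - D) / LT0
        s *= st                           # S = ST x S(previous event)
        curve.append((t, s))
        at_risk -= d + by_time[t]["censored"]  # LT0 for the next event
    return curve
```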
Learn to build and use grouped box plots in the Cohort Browser.
Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.
When creating a grouped box plot, note that:
The primary field must contain categorical or categorical multiple data
The primary field must contain no more than 15 distinct category values
The secondary field must contain numerical data
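The grouping step behind a grouped box plot can be sketched as follows: split one numeric field into per-category value lists, skipping non-numeric values. The record shape and helper name are illustrative, not DNAnexus code:

```python
from collections import defaultdict

def group_values(records, category_field, numeric_field, max_groups=15):
    """Split a numeric field into per-category value lists, one box per group.

    Non-numeric values are skipped, so they never enter a box plot.
    Raises if the categorical field exceeds the 15-category limit.
    """
    groups = defaultdict(list)
    for record in records:
        value = record.get(numeric_field)
        if isinstance(value, (int, float)):
            groups[record[category_field]].append(value)
    if len(groups) > max_groups:
        raise ValueError("grouped box plots support at most 15 distinct categories")
    return dict(groups)
```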
The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart just above for an example of the informational message that will show below the chart, in this scenario.
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:
This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. Note that in several groups, one member was recorded as consuming far more cups of coffee per day than others in the group.
In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.
In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.
Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group in question.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
String Categorical
String Categorical Multi-Select
String Categorical Sparse
Integer Categorical
Integer Categorical Multi-Select
Integer
Integer Sparse
Float
Float Sparse
Learn to build and use stacked row charts in the Cohort Browser.
Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.
When creating a stacked row chart, note that:
Both the primary and secondary fields must contain categorical data
Both the primary and secondary fields must contain no more than 20 distinct category values
In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."
The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, we see that 3,179 records contain the value "Out-patient" in the VisitType field.
Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, we see that 87 records in the first group share the value "specialist" in the DoctorType field.
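The counts behind a stacked row chart are two tallies over the same records: one per primary-field value (bar sizes) and one per (primary, secondary) pair (section sizes). A sketch using the field names from the example above, with made-up records:

```python
from collections import Counter

# Hypothetical records; field names mirror the example above.
records = [
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
    {"VisitType": "Out-patient", "DoctorType": "generalist"},
    {"VisitType": "In-patient",  "DoctorType": "specialist"},
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
]

# Bar size: total records per primary-field value.
bar_totals = Counter(r["VisitType"] for r in records)

# Section size: records per (primary, secondary) value pair.
sections = Counter((r["VisitType"], r["DoctorType"]) for r in records)
```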
String Categorical
String Categorical Sparse
Integer Categorical
Learn to build and use scatter plots in the Cohort Browser.
Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.
Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.
In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that relatively more records share a particular combination.
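The dots in such a plot are distinct (x, y) value pairs, shaded by how many records share each pair. A sketch with hypothetical (Insurance Billed, Cost) records:

```python
from collections import Counter

# Each record contributes one (Insurance Billed, Cost) pair. The chart
# draws one dot per distinct pair; the count determines its darkness.
records = [(100, 80), (100, 80), (250, 200), (100, 80), (250, 200), (400, 390)]

dots = Counter(records)   # distinct point -> number of records sharing it
```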
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a scatter plot. The message "This field contains non-numeric values" will appear below the scatter plot, as in this sample chart:
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears.
In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require that more data points be shown, you'll see this message above the chart:
Scatter plots are not supported in Cohort Compare.
Integer
Integer Sparse
Float
Float Sparse
Learn to build and use list views in the Cohort Browser.
List views can be used to visualize categorical data.
When creating a list view, note that:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values can be organized in a hierarchy
List views can be used to visualize categorical data from two different fields. The same restrictions apply to the fields whose values are displayed, as when creating a simple list view.
In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort - the "count" - that contain this value. Also shown is a figure labeled "freq." - this is the percentage of all cohort records, that contain the value.
Below is a sample list view showing the distribution of values in a field Episode type. Note that in the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.
To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.
Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:
Critical care record origin is the primary field, Critical care record format is the secondary field.
Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:
Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values, in the fields in question.
Below is an example of a list view used to visualize data in a categorical hierarchical field Home State/Province:
By default, only values in the category at the top level of the hierarchy are displayed.
Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values, in the fields in question. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category, and "British Columbia" for the second-level category.
The following example shows how "count" and "freq." are calculated, for list views based on fields containing categorical data organized into multiple levels of hierarchy:
For the bottommost row, "count" and "freq" refer to records having all of the following values:
"Yes" for the category at the top of the hierarchy
"9" for the category at the second level of the hierarchy
"8" for the category at the third level of the hierarchy
"7" for the category at the fourth level of the hierarchy
"3" for the category at the bottom level of the hierarchy
In cases where the field has categories at multiple levels, making it difficult to find a particular value, use the search box at the bottom of the list view to home in on the row or rows containing that value:
In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:
Note that in each column, count and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.
String Categorical
String Categorical Hierarchical
String Categorical Multi-Select
String Categorical Multi-Select Hierarchical
String Categorical Sparse
String Categorical Sparse Hierarchical
Integer Categorical
Integer Categorical Hierarchical
Integer Categorical Multi-Select
Integer Categorical Multi-Select Hierarchical
Learn to build and use box plots in the Cohort Browser.
Box plots can be used to visualize numerical data.
Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:
Max - The maximum, or highest value
Med - The median value
Min - The minimum, or lowest value
The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.
Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values; "Q3" is the highest value in the third quartile.
Also shown in this window is a number representing the total number of values covered by the box plot, along with the name of the entity to which the data relates.
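The hover-window statistics can be reproduced with the standard library. Note that quartile conventions vary between tools, so `statistics.quantiles` may differ slightly from the Cohort Browser's own Q1/Q3 for small samples:

```python
import statistics

def box_stats(values):
    """Min/Q1/Med/Q3/Max/count summary, like the box plot hover window.

    Uses statistics.quantiles (default 'exclusive' method); the browser's
    exact quartile convention is an assumption here.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "med": med,
            "q3": q3, "max": max(values), "count": len(values)}
```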
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a box plot. See the chart just above for an example of the informational message that will show below the chart, in this scenario.
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
Note as well that in this scenario, there will be a discrepancy between the "count" figure shown in the chart label, and that shown in the informational window that opens, when hovering over the middle of a box plot. The latter figure will be smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.
Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:
This box plot displays data on the number of cups of coffee consumed per day, by members of a particular cohort. One cohort member was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.
In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.
Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort in question.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Integer
Integer Sparse
Float
Float Sparse
Tools in this section are created and maintained by their respective vendors and may require separate licenses.
Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found .
For DNAnexus Platform users, an Apollo license is required to access the Cohort Browser. for more information.
To visualize data stored in a particular field, . Note that when you select a field, the Cohort Browser will suggest a chart type to use to visualize the type of data it contains. You can also , displaying data from two fields, to help clarify the relationship between the data stored in each.
See for more on using Cohort Compare mode.
When creating a histogram, note that the following data types can be visualized in histograms:
See if you need to visualize hierarchical categorical data.
Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a .
Row charts aren't supported in Cohort Compare mode. In Cohort Compare mode, row charts are converted to .
Simple row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a .
In some cases, not all records will have a value for the field in question. In this case, summing the "count" figures displayed will yield a figure smaller than the total cohort size, and summing the "freq." figures will not yield "100%." See for more information.
When creating a row chart, note that the following data types can be visualized in row charts, provided that category values are specified as such in the coding file used at ingestion:
These databases can then be explored using the Cohort Browser. Scientists can filter the data by any data field and save these filtered samples as cohorts. These cohorts can be shared with other scientists, and can also be used as inputs to analysis apps to perform such tasks as calculating allele frequencies or performing a GWAS analysis.
Bioinformaticians who wish to perform ad hoc statistical analysis are able to create environments backed by to directly query their data and create dataframes within a Python or R environment for further analysis.
Datasets need to be prepared and ingested in order to be accessible via the Cohort Browser. See the page for information on the ingestion process.
For each data field, the range of available chart types depends on the type of data stored in the field. See pages for more information on how each chart type can be built. Note that no more than 15 tiles can be added to the dashboard.
The “+” icon only appears when at least one chart type is supported for the specified combination. See for more details.
For certain chart types - such as and - you can re-order the primary and secondary data fields by dragging the data field in the Data Field Details section.
When you start exploration on a dataset, an empty cohort is created automatically in the Cohort Browser. You can further narrow down your cohort by adding cohort filters. Cohorts can be for later use.
Canonical transcripts, as defined by , will be indicated with a blue marker next to their "Ensembl Transcript ID" in the Transcript column in the Genes/Transcripts table.
If your dataset includes data on somatic variants, the workflow for creating a genomic filter is very similar to that for .
Note that “Entity” in the Cohort Browser can refer either to a data model object (examples given above), or to the specific input parameter in the Table Exporter app. See the Table Exporter documentation
You can also create a combined cohort based on cohorts that have already been saved.
The cohort table can display a maximum of 30,000 records. If your cohort is larger than this, the table may not show the full data. Tables larger than 30,000 records can be exported with the Table Exporter app.
Consequences: The impact of the variant according to . For variants with multiple gene annotations, this column displays the most severe consequence per gene.
GnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset .
Downloading genomic data via the visualization UI is not suitable for large datasets. You can use the to download data in a more efficient way.
For datasets containing somatic variant data, the Variant Browser includes a lollipop plot showing somatic variants for the canonical protein of a single gene, that occur in the cohort under examination. Interpreting and working with this chart is very similar to working with the .
Only consequences of high or moderate impact (as defined by ) are included in this visualization.
Note that the table displays detail on the same genomic region as the .
You can save your cohort selection to a project as a by clicking the Save icon in the top-right corner of the cohort panel.
Compare mode is supported only for cohorts created from the same dataset. Certain cohort browser sections and are not supported in compare mode.
The Cohort Browser adheres to a project's .
The dx command generates a new Cohort object on the platform from an existing Dataset or Cohort object and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object. When the input is a CohortBrowser record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined such that the resulting record is the intersection of the IDs present in the original input and the IDs passed through the CLI. For additional details, see the example notebooks in the public GitHub repo.
When creating a grouped box plot, note that the following data types can be visualized in grouped box plots:
Stacked row charts are not supported in Cohort Compare. Use a instead.
When creating a stacked row chart, note that the following data types can be visualized in stacked row charts:
In this scenario, to generate a scatter plot that shows data for all the members of a cohort.
When creating a scatter plot, note that the following data types can be visualized in scatter plots:
Note that list views, unlike , can be used to visualize categorical data with values that are organized in a hierarchical fashion.
In some cases, not all records will have a value for the field in question. In this case, summing the "count" figures displayed will yield a figure smaller than the total cohort size, and summing the "freq." figures will not yield "100%." See for more information.
When creating a list view, note that the following data types can be visualized in list views:
Numerical data can also be visualized using .
When creating a box plot, note that the following data types can be visualized in box plots:
A Titan license is required to access and use these tools. Please contact for more information.
An Apollo license is required to access and use these tools. Please contact for more information.
Intersection
Select members that are present in ALL selected cohorts.
Example: intersection of cohort A, B and C would be A∩B∩C
Up to 5 cohorts
Union
Select members that are present in ANY of the selected cohorts. Example: union of cohort A, B and C would be A∪B∪C
Up to 5 cohorts
Subtraction
Select members that are present only in the first selected cohort and not in the second.
Example: Subtraction of cohort A, B would be A-B
2 cohorts
Unique
Select members that appear in exactly one of the selected cohorts.
Example: Unique of cohort A, B would be (A-B) ∪ (B-A)
2 cohorts
Not In
Select members that are present in the dataset, but not in the current cohort.
Example: In dataset U, the result of “Not In” A would be U-A
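The five operations above are ordinary set algebra. A sketch with Python sets standing in for cohort membership lists:

```python
# Made-up cohorts (member IDs) and dataset, for illustration only.
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6}
U = {1, 2, 3, 4, 5, 6, 7}        # the full dataset

intersection = A & B & C          # members in ALL selected cohorts
union = A | B | C                 # members in ANY selected cohort
subtraction = A - B               # in the first cohort but not the second
unique = (A - B) | (B - A)        # in exactly one of the two cohorts
not_in = U - A                    # in the dataset but not the cohort
```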
1x numerical
1x categorical
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
snpeff_annotate
SnpEff
Annotation
snpsift_annotate
SnpSift
Annotation
Name of Tool
Name in CLI
Scientific Algorithms
Common uses
aws_platform_to_s3_file_transfer
AWS S3
aws_s3_to_platform_files
AWS S3
sra_fastq_importer
Retrieve reads in FASTQ format from SRA
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
oqfe
A revision of Functionally Equivalent
WGS, WES- alignment and duplicate marking
gatk4_bqsr_parallel
Variant calling
bwa_mem_fastq_read_mapper
BWA-MEM
Short read alignment
gatk4_haplotypecaller_parallel
Variant calling, post-alignment QC
gatk4_genotypegvcfs_single_sample_parallel
Variant calling
picard_mark_duplicates
Variant Calling- remove duplicates, post-alignment
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
regenie
REGENIE
GWAS
plink_gwas
PLINK2
raremetal2
RAREMETALWORKER, RAREMETAL
saige_gwas_gbat
saige_gwas_svat
saige_gwas_grm
saige_gwas_sparse_grm
plink_pipeline
Plink2
plato_pipeline
Plato, Plink2
locuszoom
LocusZoom
GWAS, visualization
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus platform
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
sra_fastq_importer
Retrieve reads in FASTQ format from SRA
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus platform
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
cloud_workstation
N/A
SSH-accessible unix shell on a platform cloud worker. Use it for ad hoc analysis of platform data.
ttyd
N/A
Unix shell on a platform cloud worker in your browser. Use it for ad hoc CLI operations and to launch HTTPS apps on 2 extra ports.
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
glnexus
GLnexus
This app can also be used to create pVCF without running joint genotyping
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
samtools_index
SAMtools- samtools index
Building bam index file
samtools_sort
SAMtools- samtools sort
Sort alignment result based on coordinates
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
phesant
PHESANT
PheWAS
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
prsice2
PRSice-2
Polygenic risk scores
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
multiqc
MultiQC
QC reporting
qualimap2_anlys
Qualimap2
QC
rnaseqc
Transcriptomics Expression Quantification
fastqc
FastQC
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
salmon_quant
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
bowtie2_fasta_indexer
Bowtie 2: bowtie2-build
Building reference for Bowtie 2 alignment
bowtie2_fastq_read_mapper
bowtie2, samtools view, samtools sort, samtools index
Short read alignment
bwa_fasta_indexer
BWA- bwa index
Building reference for BWA alignment
bwa_mem_fastq_read_mapper
BWA-MEM
Short read alignment
star_generate_genome_index
STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)
RNA Seq- indexing
star_mapping
STAR (Spliced Transcripts Alignment to a Reference)
RNA Seq- mapping
subread_feature_counts
featureCounts
Read summarization, RNAseq
salmon_index_builder
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
gatk4_bqsr_parallel
Variant calling
flexbar_fastq_read_trimmer
QC
trimmomatic
Read quality trimming, adapter trimming
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| rnaseqc | | Transcriptomics Expression Quantification |
| star_generate_genome_index | STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate) | RNA-seq indexing |
| star_mapping | STAR (Spliced Transcripts Alignment to a Reference) | RNA-seq mapping |
| subread_feature_counts | featureCounts | Read summarization, RNA-seq |
| star_generate_genome_index | STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate) | Transcriptomics Expression Quantification |
| star_mapping | STAR (Spliced Transcripts Alignment to a Reference) | Transcriptomics Expression Quantification |
| salmon_index_builder | Salmon | Transcriptomics Expression Quantification |
| salmon_mapping_quant | Salmon | Transcriptomics Expression Quantification |
| salmon_quant | Salmon | Transcriptomics Expression Quantification |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb | DESeq2 | |
| Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb | WebGestaltR | |
| Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb | WGCNA, topGO | |
| Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb | GENIE3 | |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| file_concatenator | N/A | |
| gzip | gzip | |
| swiss-army-knife | bcftools, bedtools, bgzip, plink, sambamba, samtools, seqtk, tabix, vcflib, Plato, QCTool, vcftools, plink2, Picard, REGENIE, BOLT-LMM, BGEN | Data processing tools |
| ttyd | N/A | Unix shell on a platform cloud worker in your browser. Use it for ad hoc CLI operations and to launch HTTPS apps on 2 extra ports |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| cnvkit_batch | | Copy number variants |
| gatk4_haplotypecaller_parallel | | Variant calling, post-alignment QC |
| gatk4_genotypegvcfs_single_sample_parallel | | Variant calling |
| picard_mark_duplicates | | Variant calling: remove duplicates, post-alignment |
| freebayes | | Short variant calls |
| gatk4_mutect2_variant_caller_and_filter | | Somatic variant calling and post-calling filtering |
| gatk4_somatic_panel_of_normals_builder | | Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2 |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| locuszoom | LocusZoom | GWAS, visualization |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, cntk, keras, scikit-learn, tensorflow, torch | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, Stata, stata_kernel | Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in Spark databases and tables |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, nipype, freesurfer, FSL | Running image processing-related analyses |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| data_model_loader_v2 | Dataset Creation | Dataset Creation |
| dataset-extender | Dataset Extension | Dataset Extension |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| csv-loader | N/A | Data Loading |
| spark-sql-runner | Spark SQL | Dynamic SQL Execution |
| table-exporter | N/A | Data Extraction |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, Glow | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL, Ensembl Variant Effect Predictor | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-bwa | Sentieon's FASTQ to BAM/CRAM pipeline | WGS, WES, accelerated analysis |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| sentieon-tnbam | Sentieon's BAM to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| pbdeepvariant | DeepVariant | Variant calling, accelerated analysis |
| sentieon-umi | Sentieon's pre-processing and alignment pipeline for next-generation sequencing | WGS, WES, accelerated analysis |
| sentieon-dnabam | Sentieon's BAM to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-bwa | Sentieon's FASTQ to BAM/CRAM pipeline | WGS, WES, accelerated analysis |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| pbdeepvariant | DeepVariant | Variant calling, accelerated analysis |
| sentieon-umi | Sentieon's pre-processing and alignment pipeline for next-generation sequencing | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis, somatic variant calling |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| sentieon-tnbam | Sentieon's BAM to VCF somatic analysis pipeline | WGS, WES, accelerated analysis, somatic variant calling |
| sentieon-dnabam | Sentieon's BAM to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| pbmutectcaller | GPU-accelerated version of Mutect2; supports both tumor-only and tumor-normal variant calling | |
The dx commands extract_dataset and extract_assay germline can either return the data dictionary of a dataset or retrieve the underlying data that the data dictionary describes. They can also return metadata about a dataset, such as the names and titles of entities and fields, or a list of all relevant assays in a dataset.

When retrieving data, you can choose whether to use a private Spark resource. In most scenarios, retrieving data without direct Spark usage suffices, and additional compute resources are not needed (see the example OpenBio notebooks). When no additional compute resources are used, data is returned via the DNAnexus Thrift Server; although the server is highly available, it enforces a fixed timeout, which may prevent a large number of queries from executing. In scenarios where the data model has many relationships, there is a high volume of stored data, and/or a high volume of data must be extracted and returned, it may be necessary to extract data using additional private compute resources. These resources are scaled accordingly, so that the timeouts enforced by the Thrift Server are avoided completely. If the --sql flag is provided, the command instead returns a SQL statement (a string) to use when querying from a standalone Spark-enabled app(let), such as JupyterLab.
The most common way to use Spark on the DNAnexus Platform is via a Spark enabled JupyterLab notebook.
After creating a Jupyter notebook within a project, enter the commands shown below to initiate a Spark session.
Python:
R:
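The code for these cells is not present in this copy; a minimal sketch of the usual Python pattern, assuming pyspark is available in the Spark-enabled JupyterLab environment:

```python
import pyspark

# Connect to the cluster's Spark master and open a SQL-capable session
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```

This requires a running Spark cluster, so it is only expected to work inside a Spark-enabled JupyterLab job.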
Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:
Python:
R:
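As a sketch of the Python form described above, assuming a Spark session named spark and hypothetical database and table names:

```python
# Run a SQL query against a database; results come back as a Spark DataFrame.
# "my_database" and "my_table" are placeholders for your own names.
df = spark.sql("SELECT * FROM my_database.my_table")
df.show(5)
```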
Python:
Where dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset.
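The Python snippet referenced here is missing from this copy; one common pattern in the OpenBio notebooks is to shell out to the dx CLI. The field name below is hypothetical and should be replaced with fields from your data dictionary:

```python
import subprocess

dataset = "record-abc123"
# Build the dx extract_dataset invocation; --fields takes entity.field names.
cmd = ["dx", "extract_dataset", dataset, "--fields", "participant.participant_id"]
print(" ".join(cmd))
# subprocess.check_call(cmd)  # uncomment when logged in to the platform
```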
R:
Where dataset is the record-id or the path to the dataset or cohort.
Python:
R:
In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele command. For more information, refer to the notebooks here.
Python:
R:
When querying large datasets, such as those containing genomic data, ensure that your Spark cluster is scaled up appropriately, with multiple nodes to parallelize across.
Ensure that your Spark session is initialized only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (e.g. by running notebook 1 and then notebook 2, or by running a notebook from start to finish multiple times), the Spark session will be corrupted, and you will need to restart the affected notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.
If you would like to use a database outside your project's scope, you must refer to it using its unique database name (typically something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data), as opposed to the database name (metabric_data in this case).
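To illustrate, a query against an out-of-project database can be composed like this (the table name is hypothetical):

```python
def qualified_query(unique_db_name: str, table: str) -> str:
    """Build a SQL string that references a database by its unique name."""
    return f"select * from {unique_db_name}.{table}"

print(qualified_query("database_fjf3y28066y5jxj2b0gz4g85__metabric_data", "expression"))
```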
Learn to launch a JupyterLab session on the DNAnexus Platform, via the DXJupyterLab app.
If you have used DXJupyterLab before, the page will display a list of your previous sessions run across different projects.
This will open a window from which you can start a new JupyterLab environment. In this window, you can configure your session, e.g. specify its name, select an instance type, and choose the project in which JupyterLab should be started.
If a snapshot file is provided, a previously saved DXJupyterLab environment will be loaded from that file. A snapshot tarball file can be created while running a JupyterLab session.
You can adjust the duration of the session, after which the environment will automatically shut down. Based on this duration and the instance type, an estimate of the price is shown in the bottom-left corner (if you have access to billing information for the selected project).
If you select Enable Spark Cluster, a JupyterLab environment with a standalone Spark cluster will be started. With this option, you can also set the number of nodes in the cluster. This number includes the master (one node) and the worker nodes.
The feature options available are PYTHON_R, ML, IMAGE_PROCESSING, and STATA. Selecting the PYTHON_R feature (the default) loads the environment with Python 3 and the R kernel and interpreter. Selecting the ML feature loads the environment with Python 3 and machine learning packages such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype, but it does not contain R. Selecting the IMAGE_PROCESSING feature loads the environment with Python 3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run; details about license creation and usage can be found here. The STATA feature requires a license to run. For a detailed list of the libraries included in each of these feature options, see the in-product documentation.
At first, the JupyterLab session will be in an "Initializing" state, while it waits for the worker to spin up and for the JupyterLab server to start. Clicking on the row corresponding to your session, and then on the i icon in the top right corner, will display more information about the JupyterLab job.
Once the JupyterLab server is running, the session state will change to Ready, and the name of the session will turn into a link. By clicking this link, you can open a JupyterLab environment page in your browser. You can access your job via the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the DXJupyterLab job.
You can start the JupyterLab environment directly from the command line by running the app:
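The command itself is missing from this copy; a minimal invocation might look like the following (options such as instance type or feature can be added with -i inputs):

```shell
dx run app-dxjupyterlab --priority high
```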
Once the app starts, you can check whether the JupyterLab server is ready to serve connections, which is indicated by the job's property httpsAppState being set to running. Once it is running, you can open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
In order to run the Spark version of the app, use the command:
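The command is missing here as well; a minimal sketch:

```shell
dx run app-dxjupyterlab_spark_cluster --priority high
```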
You can check the optional input parameters for the apps on the DNAnexus platform (platform login required to access the links):
From the CLI, you can learn more about dx run with the following command:
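The command referenced here did not survive extraction; it is presumably the standard help invocation:

```shell
dx run APP_NAME -h
```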
where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.
See the Quickstart and References pages for more details on how to use DXJupyterLab.
In this tutorial, you will learn how to create and run a notebook in JupyterLab on the platform, download data from the notebook, and upload results to the platform.
First, launch DXJupyterLab in the project of your choice, as described in the Running DxJupyterLab guide.
Once your JupyterLab session is running, click on the DNAnexus tab in the left sidebar to see all the files and folders in the project.
To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.
An untitled ipynb file will be created and will be viewable in the DNAnexus project browser, which refreshes every few seconds.
You can rename your file by right-clicking on its name and selecting Rename.
You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, simply hit Ctrl/Command + S or click on the save icon in the Toolbar (the area just below the tab bar at the top). A new notebook version will land in the project, and you should see in the "Last modified" column that the file was created recently.
Since DNAnexus files are immutable, whenever you save the notebook, the current version is uploaded to the project and replaces the previous version, i.e. the file of the same name. The previous version is moved to the .Notebook_archive folder, with a timestamp suffix added to its name. Saving notebooks directly in the project as new files ensures that your analyses won't be lost when the DXJupyterLab session ends.
To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).
You can download input data from a project for your notebook using dx download in a notebook cell:
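The cell contents are missing from this copy; a sketch, with a hypothetical project path:

```shell
# Fetch an input file from the project into the local execution environment.
dx download /data/input.csv
```

In a Python notebook cell, prefix the command with an exclamation mark (e.g. !dx download /data/input.csv).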
You can also use the terminal to execute the dx command.
If your notebook generates any data you'd like to keep, you should upload it to the project before the session ends, i.e. before the worker in which JupyterLab runs is terminated. You can do this directly in the notebook by running dx upload from a notebook cell or from the terminal:
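The example itself is missing here; a minimal sketch with a hypothetical filename:

```shell
# Persist a locally generated result file back to the project.
dx upload results.csv
```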
Check the References guide for tips on the most useful operations and features in the DNAnexus JupyterLab.
Learn how to use FreeSurfer in DXJupyterLab.
FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies. The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING flavor of DXJupyterLab.
To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license here.
To use the FreeSurfer license complete the following steps:
Upload the license text file to your project on the DNAnexus platform.
Launch the DXJupyterLab app using the IMAGE_PROCESSING feature.
Once DXJupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory, like so:
Python kernel:
Bash kernel:
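The snippets themselves are missing from this copy; a bash-kernel sketch, assuming the license file was uploaded to the project root as license.txt:

```shell
# Fetch the FreeSurfer license from the project into $FREESURFER_HOME
dx download /license.txt -o "$FREESURFER_HOME/license.txt"
```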
Learn how to run an older version of DXJupyterLab via the user interface or command-line interface.
The primary reason to run an older version of DXJupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.
From the main Platform menu, select Tools, then Tools Library.
Locate and select, from the list of tools, either DXJupyterLab with Python, R, Stata, ML, Image Processing or DXJupyterLab with Spark Cluster.
From the tool detail page, click on the Versions tab.
Select the version you'd like to run. Click the Run button.
Select the project in which you want to run DXJupyterLab.
Launch the version of DXJupyterLab you want to run, substituting the version number for x.y.z in the following commands:
For DXJupyterLab without Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.
For DXJupyterLab with Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high.
After launching DXJupyterLab, access the DXJupyterLab environment using your browser. To do this:
Get the job ID for the job created when you launched DXJupyterLab. See the Monitoring Executions page for details on how to get the job ID, via either the UI or the CLI.
Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.
You will see an error message "502 Bad Gateway" if DXJupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.
The Spark application is an extension of the current app(let) framework. Currently, app(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow for an additional optional cluster specification with type=dxspark.
Calling /app(let)-xxx/run for Spark apps creates a Spark cluster (+ master VM).
The master VM (where the app shell code runs) acts as the driver node for Spark.
Code in the master VM leverages the Spark infrastructure.
Job mechanisms (monitoring, termination, etc.) are the same for Spark apps as for any other regular app(let)s on the Platform.
Spark apps use the same platform "dx" communication between the master VM and DNAnexus API servers.
There's a new log collection mechanism to collect logs from all nodes.
You can use the Spark UI to monitor a running job using SSH tunneling.
Spark apps can be launched over a distributed Spark cluster.
Learn to use the DXJupyterLab Spark Cluster app.
The DXJupyterLab Spark Cluster app is a Spark application that runs a fully-managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.
In addition to the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:
Explore the available databases and get an overview of the available datasets
Perform analyses and visualizations directly on data available in the database
Create databases
Submit data analysis jobs to the Spark cluster
Check the general Overview for an introduction to DNAnexus JupyterLab products.
The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus platform. The References page has additional useful tips for using the environment.
Having created your notebook in the project, you can populate your first cells as shown below. It is good practice to instantiate your Spark context at the very beginning of your analyses.
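The initial cells are missing in this copy; a sketch of the usual pattern, assuming pyspark is available on the cluster:

```python
import pyspark

# Start the Spark context and session at the top of the notebook
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```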
To view the databases to which you have access in your current region and project context, run a cell with the following code:
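The cell contents did not survive in this copy; the usual Spark SQL form is:

```python
spark.sql("show databases").show(truncate=False)
```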
A sample output should be:
You can inspect one of the returned databases by running:
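The snippet is missing here; one way to inspect a database with Spark SQL (the database name is hypothetical):

```python
spark.sql("describe database my_database").show(truncate=False)
```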
which should return an output similar to:
To find a database in your current region that may be in a different project than your current context, run the following code:
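The snippet is missing here; a sketch using a Spark SQL pattern match over unique database names (the pattern is hypothetical):

```python
spark.sql("show databases like 'database_*'").show(truncate=False)
```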
A sample output should be:
You can inspect one of the returned databases by running (note that using the database name instead of unique database name here will only return the databases within the project scope):
See below for an example of how to create and populate your own database.
You may separate each line of code into different cells to view the outputs iteratively.
Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is available with DXJupyterLab Spark Cluster. It is included in the app and can be used when the app is run with the feature input set to HAIL (the default).
Initialize the context when beginning to use Hail. It's important to pass the previously started Spark context sc as an argument:
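The initialization cell is missing from this copy; the usual Hail pattern looks like:

```python
import hail as hl

# Reuse the existing Spark context rather than letting Hail create its own
hl.init(sc=sc)
```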
We recommend continuing your exploration of Hail with the GWAS using Hail tutorial. For example:
To use VEP (Ensembl Variant Effect Predictor) with Hail, select "Feature," then "HAIL" when launching Spark Cluster-Enabled DXJupyterLab via the CLI.
VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequence, and regulatory regions. The LoF plugin is included as well, and is used when VEP configuration includes LoF plugin as shown in the configuration file below.
The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.
The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.
The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:
The default user in the container is root.
The option --network host is used when starting Docker, in order to remove the network isolation between the host and the Docker container. This allows the container to bind to the host's network and access Spark's master port directly.
S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. Note that the s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.
To access public s3 buckets, you do not need to have s3 credentials. The example below shows how to access the public 1000Genomes bucket in a JupyterLab notebook:
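The notebook cell is missing from this copy; a sketch using anonymous S3A credentials (the object path is illustrative, not verified):

```python
# Allow unauthenticated access to a public bucket via the s3a connector
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)

# Read an object from the public 1000Genomes bucket (path is hypothetical)
df = spark.read.text("s3a://1000genomes/README.analysis_history")
df.show(5, truncate=False)
```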
When the above is run in a notebook, you will see the following:
To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
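A sketch for private buckets, with placeholder credentials and a hypothetical bucket path; it assumes a Spark session has already been created as shown above:

```python
# Placeholder credentials; never hard-code real keys in notebooks
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Read from the private bucket (path is hypothetical)
df = spark.read.csv("s3a://my-private-bucket/data.csv", header=True)
```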
Learn about data showcased on the Cohort Browser's locus details page.
When genomic information is ingested and made available in the Cohort Browser, variants can be annotated using NCBI's dbSNP and gnomAD. The specific versions of each are provided during the ingestion process, allowing the ingestion process to create a set of tables optimized for cohort creation through the Cohort Browser.
In the Cohort Browser, various annotation data (e.g. gene name, consequence, rsID) are already available both for cohort creation and for quick cohort interrogation. To access more detailed information, in the variant section of the Cohort Browser, simply click the location in the allele table and the locus details page will open in a new tab.
The page is split into three sections, all pre-calculated during ingestion of the dataset. The sections begin with a summary of the locus, including the genotype frequencies for the dataset at the locus, followed by a detailed breakdown of annotations by allele.
This pane shows at-a-glance summary information for the locus in relation to the dataset: the chromosome and starting position, the frequency of both the reference allele and the no-calls, and the total number of alleles available.
The genotypes section lets users see a detailed breakdown of the various genotypes in the dataset. Because allele order is not preserved, a C/A and an A/C count in the same bucket, so only half of the crosstab is populated. Note that these genotypes are for the dataset as a whole at the specific location, and are not specific to the selected cohort.
Each allele available in the dataset is split up to show various information that was pulled from dbSNP and gnomAD during ingestion. If an rsID or AffyID was available, it is shown along with a link to its NCBI dbSNP page. Additionally, the allele type, dataset frequency, and gnomAD frequency are shown for easy reference. Further information about the allele is broken down based on the different transcripts available. Columns in this section can vary based on the annotation resources used.
For canonical transcripts, a blue indicator will appear next to the ID of the transcript.
Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.
Jupyter notebooks are a popular way to track the work performed in computational experiments the way a lab notebook tracks the work done in a wet lab setting. DXJupyterLab, or JupyterLab, is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. DXJupyterLab allows users on the DNAnexus platform to collaborate on notebooks and extends JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.
DXJupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.
DXJupyterLab is a versatile application that can be used to:
Collaborate on exploratory analysis of data
Reproduce and fork work performed in computational analyses
Visualize and gain insights into data generated from biological experiments
Create figures and tables for scientific publications
Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows
Test and train machine/deep learning models
Interactively run commands on a terminal
There are two different DXJupyterLab apps available on the DNAnexus Platform. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the DNAnexus Apollo framework.
Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.
The DXJupyterLab Spark Cluster app contains all the features found in the general-purpose DXJupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.
DXJupyterLab 2.2 is the default version available on the DNAnexus Platform. Older versions are also available.
A step-by-step guide on how to start with DXJupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.
Creating a DXJupyterLab session requires the use of two different environments:
The DNAnexus project (accessible through the web platform and the CLI).
The worker execution environment.
From the JupyterLab session, you have direct access to the project in which the application is run. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal:
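The terminal command is missing in this copy; listing the project contents is typically done with the dx CLI:

```shell
dx ls
```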
The project is selected when the DXJupyterLab app is started and cannot be subsequently changed.
The project file browser displays all subfolders, all Jupyter notebooks, and files in the project (limited to the 1,000 most recently modified files). In the Spark-enabled app, databases in the project are also visible (limited to the 1,000 most recently modified databases). A list of all the objects in the project can be obtained programmatically or by using dx ls. The file listing is refreshed every 10 seconds, and you can force a refresh by clicking the circular arrow icon in the top right corner of the file browser.
When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and will be used to execute the notebook code. DNAnexus notebooks will have a [DX] prepended to the notebook name in the tab of all opened notebooks.
The execution environment file browser can be accessed from the left sidebar (notice the folder icon at the top) or from the terminal:
You can also create Jupyter notebooks in the worker execution environment through the File menu. These notebooks are stored on the local file system of the DxJupyterLab execution environment and have to be persisted in a DNAnexus project. More information about saving can be found in the next section.
You can create, edit, and save notebooks directly in the DNAnexus project as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are housed within the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus platform without being stored in the JupyterLab execution environment file system. These are referred to as “DNAnexus notebooks” and these notebooks persist in the DNAnexus project after the DXJupyterLab instance is terminated.
DNAnexus notebooks can be recognized by the [DX] that is prepended to the name in the tab of all opened notebooks.
DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears upon starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking “New notebook”. The Launcher tab can also be opened by clicking File and then selecting “New Launcher” from the upper menu.
To create a new local notebook, click the File tab in the upper menu and then select “New" and then “Notebook”. These non-DNAnexus notebooks can be saved to DNAnexus by simply dragging and dropping them in the DNAnexus file viewer in the left panel.
In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.
For reading the input file multiple times or for reading a large fraction of the file in random order:
Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from the Jupyter notebook.
For scanning the content of the input file once or for reading only a small fraction of the file's content:
The project in which the app is running is mounted read-only at the /mnt/project folder. Reading the content of files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment, but makes more API calls to fetch the content.
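The two access patterns can be sketched as follows. This is a minimal sketch assuming a running DXJupyterLab session; the folder and file names are hypothetical:

```shell
# Pattern 1: repeated or random-order reads -- download a local copy first.
dx download /data/cohort.csv
head cohort.csv

# Pattern 2: single sequential scan -- stream from the read-only project mount,
# which fetches content on demand and uses minimal local disk space.
head /mnt/project/data/cohort.csv
```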
Files such as local notebooks can be persisted in the DNAnexus project by using one of these options:
Run dx upload in the bash console.
Drag and drop the file onto the DNAnexus tab in the column of icons on the left side of the screen. This will upload the file into the currently selected DNAnexus folder.
Exporting DNAnexus notebooks (e.g. to HTML or PDF) is not supported, but it's possible to dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. Exporting a local notebook to certain formats may first require the following commands: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
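The workaround can be sketched as follows, with a hypothetical notebook name; nbconvert is part of the standard Jupyter tooling, and the target format is just an example:

```shell
# Fetch the notebook from the project into the local execution environment.
dx download analysis.ipynb

# Export the local copy; other formats (e.g. pdf) may need the
# texlive packages mentioned above.
jupyter nbconvert --to html analysis.ipynb
```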
A command can be executed in the DxJupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd
input and additional input files using the in
input file array to the DxJupyterLab app. The provided command will run in the /opt/notebooks/
directory and any output files generated in this directory will be uploaded to the project and returned in the out
output field of the job that ran DxJupyterLab app.
The cmd input makes it possible to use the papermill
command that is pre-installed in the DxJupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
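Based on the description of the cmd and in inputs, such an invocation might look like the following sketch; the file ID is a hypothetical placeholder for the uploaded notebook:

```shell
# Run papermill non-interactively inside the DxJupyterLab execution environment.
# file-xxxx is a placeholder for the notebook's file ID in the project.
dx run dxjupyterlab \
    -icmd="papermill notebook.ipynb output_notebook.ipynb" \
    -iin=file-xxxx
```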
Here notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which will contain the result of executing the input notebook and will be uploaded to the project at the end of the app's execution. See the DxJupyterLab app page for details.
Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.
If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.
It is still possible to create a duplicate to see what changes are being saved in the locked notebook, or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate; after a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.
Once the editing user closes the notebook, the lock will be released and anybody else with access to the project will be able to open it.
Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, i.e. the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name, and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses won't be lost when the DXJupyterLab session ends.
DXJupyterLab sessions begin with a set duration, after which they will shut down automatically. The timeout clock is displayed in the footer on the right side and it can also be adjusted there (using the Update duration
button). Even if the DxJupyterLab webpage is closed, the termination will be executed at the set timestamp. Job lengths have an upper limit of 30 days, which cannot be extended.
A session can be terminated immediately from the top menu (DNAnexus
> End Session
).
It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus
> Create Snapshot
).
A DXJupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit
and docker save
commands). Any installed packages and files created locally are saved to a snapshot file, with the exception of directories /home/dnanexus
and /mnt/
, which are not included. This file is then uploaded to the .Notebook_snapshots folder in the project and can be passed as input the next time the app is started.
Snapshots built with DXJupyterLab versions prior to 2.0.0 (released in mid-2023) are incompatible with the current version. These older snapshots incorporate versions of tools that could cause conflicts and unexpected behavior in the upgraded DXJupyterLab app execution environment, and so cannot be loaded within that environment.
If you want to use an older snapshot in the current version of DXJupyterLab, you'll need to recreate the snapshot as follows:
Create a tarball incorporating all the necessary data files and packages.
Save the tarball in a project.
Launch the current version of DXJupyterLab.
Import and unpack the tarball file.
If you don't want to have to recreate your older snapshot, you can run an older version of DXJupyterLab and access the snapshot therein.
Viewing other file types from your project, such as CSV, JSON, and PDF files, images, and scripts, is convenient because JupyterLab renders them accordingly. For example, JSON files are collapsible and easy to navigate, and CSV files are presented in tabular format. However, editing and saving any open files from the project other than IPython notebooks will result in an error.
The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.
When running the DXJupyterLab app, it is possible to view, but not update, other projects to which the user has access. This enhanced scope is required to read databases that may be located in different projects and are not cloneable.
It is possible to start new jobs with dx run from a notebook or the Terminal. If the billTo of the project in which the JupyterLab session is running is not licensed to start detached executions, the started jobs will be subjobs of the interactive JupyterLab session. In this case, the --project argument to dx run will be ignored, and the JupyterLab job's workspace, not the given project, is used to run the job. Also, if an attached subjob fails or is terminated on the DNAnexus Platform, the whole job tree will be terminated, including the interactive JupyterLab session.
Jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
The DXJupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud
, where job-xxxx
is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users will see a “403 Permission Forbidden” message under the JupyterLab session's URL.
When launching JupyterLab, the available feature options are PYTHON_R, ML, IMAGE_PROCESSING, and STATA.
Selecting the PYTHON_R feature (the default option) loads the environment with Python3 and the R kernel and interpreter.
Selecting the ML feature loads the environment with Python3 and machine learning packages such as TensorFlow, PyTorch, and CNTK, as well as the image processing package NiPype, but it does not include R.
Selecting the IMAGE_PROCESSING feature loads the environment with Python3 and image processing packages such as NiPype, FreeSurfer, and FSL, but it does not include R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found here.
The STATA
feature requires a license to run. See Stata in DXJupyterLab for more information about running Stata in JupyterLab.
See the in-product documentation for the full list of pre-installed packages. This list includes details on feature-specific packages available when running the PYTHON_R
, ML
, IMAGE_PROCESSING
, and STATA
features.
Additional packages can easily be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.
Create your first notebooks by following the instructions in this Quickstart guide.
See the DXJupyterLab Reference guide for tips and info on the most useful DXJupyterLab features.
Connect with Spark for database sharing, big data analytics, and rich visualizations.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: our access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.
DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC with a client such as beeline to run Spark SQL interactively. Refer to the Thrift Server page for more details.
You can launch a Spark application distributed across a cluster of workers. Since this is tightly integrated with the rest of the Platform, Spark jobs leverage the features of normal jobs: you have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the Platform web UI. You additionally have access to logs from workers and can monitor the job in the Spark UI.
With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts very quickly without the need to write complex SQL queries by hand.
A database is a data object on the Platform. A database object is stored in a project.
Databases can be shared with other users or organizations through project sharing. Access to a database can be revoked at any time by the project administrator revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.
Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context, when connecting to Thrift. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:
The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.
If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects will be available to add new data.
As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, data and database object in the project.
The following tables list the supported actions on a database and database object, with the lowest necessary access level for an open and a closed database.
Spark SQL Function                    Open Database     Closed Database
ALTER DATABASE SET DBPROPERTIES       CONTRIBUTE        N/A
ALTER TABLE RENAME                    CONTRIBUTE        N/A
ALTER TABLE DROP PARTITION            CONTRIBUTE (*)    N/A
ALTER TABLE RENAME PARTITION          CONTRIBUTE        N/A
ANALYZE TABLE COMPUTE STATISTICS      UPLOAD            N/A
CACHE TABLE, CLEAR CACHE              N/A               N/A
CREATE DATABASE                       UPLOAD            UPLOAD
CREATE FUNCTION                       N/A               N/A
CREATE TABLE                          UPLOAD            N/A
CREATE VIEW                           UPLOAD            UPLOAD
DESCRIBE DATABASE, TABLE, FUNCTION    VIEW              VIEW
DROP DATABASE                         CONTRIBUTE (*)    ADMINISTER
DROP FUNCTION                         N/A               N/A
DROP TABLE                            CONTRIBUTE (*)    N/A
EXPLAIN                               VIEW              VIEW
INSERT                                UPLOAD            N/A
REFRESH TABLE                         VIEW              VIEW
RESET                                 VIEW              VIEW
SELECT                                VIEW              VIEW
SET                                   VIEW              VIEW
SHOW COLUMNS                          VIEW              VIEW
SHOW DATABASES                        VIEW              VIEW
SHOW FUNCTIONS                        VIEW              VIEW
SHOW PARTITIONS                       VIEW              VIEW
SHOW TABLES                           VIEW              VIEW
TRUNCATE TABLE                        UPLOAD            N/A
UNCACHE TABLE                         N/A               N/A
Data Object Action    Open Database     Closed Database
Add Tags              UPLOAD            CONTRIBUTE
Add Types             UPLOAD            N/A
Close                 UPLOAD            N/A
Get Details           VIEW              VIEW
Remove                CONTRIBUTE (*)    ADMINISTER
Remove Tags           UPLOAD            CONTRIBUTE
Remove Types          UPLOAD            N/A
Rename                UPLOAD            CONTRIBUTE
Set Details           UPLOAD            N/A
Set Properties        UPLOAD            CONTRIBUTE
Set Visibility        UPLOAD            N/A
(*) If a project is protected, then ADMINISTER access is required.
When a user creates a database, the name the user provides is validated and downcased before it is stored as the databaseName attribute of the database object. In addition, a unique database name is generated by downcasing the database object ID, replacing its hyphen with an underscore, and concatenating it, with two underscores, to the downcased database name. The unique database name is stored as the uniqueDatabaseName attribute of the database object.
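The derivation can be sketched in shell; the object ID and database name below are hypothetical, chosen to mirror the example used later on this page:

```shell
# Hypothetical database object ID and user-provided name.
dbid="database-FJf3y28066y5Jxj2b0Gz4G85"
name="Metabric_Data"

# Downcase the object ID, swap its hyphen for an underscore, then join it
# to the downcased name with a double underscore.
unique="$(printf '%s' "$dbid" | tr '[:upper:]' '[:lower:]' | tr '-' '_')__$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
echo "$unique"   # database_fjf3y28066y5jxj2b0gz4g85__metabric_data
```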
When a database is created using the following SQL statement with a user-provided database name (referenced below as db_name):
The platform database object, database-xxxx, is created with all lowercase characters. However, when creating a database using dxpy, the Python module supported by the DNAnexus SDK dx-toolkit, the following case-sensitive command returns a database ID based on the user-provided database name, assigned here to the variable db_name:
With that in mind, it is suggested either to use lowercase characters in your db_name assignment or to apply a forcing function such as .lower() to the user-provided database name:
Using Stata via DXJupyterlab, working with project files, and creating datasets with Spark.
Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel in Jupyter notebooks.
On the DNAnexus Platform, use the DXJupyterLab app to create and edit Jupyter notebooks.
To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details according to the instructions below in a plain text file with the extension .json
, then upload this file to the project’s root directory. You only need to do this once per project.
Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username and <organization> is the org of which you're a member:
Save the file with a name in the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json
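The naming scheme can be sketched as follows, with a hypothetical username substituted for your own:

```shell
# Hypothetical DNAnexus username; substitute your own.
username="jsmith"

# Build the settings file name following the documented format.
fname=".stataSettings.user-${username}.json"
echo "$fname"   # .stataSettings.user-jsmith.json
```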
Open the project in which you want to use Stata. Upload the Stata license details file to the project’s root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.
When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.
Create a private project. Then create and save a Stata license details file in that project’s root directory, as per the instructions above.
Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the ID of the private project and file-xxxx is the ID of the license details file in that private project:
Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.
Select the app DXJupyterLab with Python, R, Stata, ML.
Click the Run Selected button. Note that if you haven't run this app before, you'll be prompted to install it. Next, you’ll be taken to the Run Analysis screen.
On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.
Add your Stata settings file as an input. This is the .json
file you created, containing your Stata license details.
In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.
Click the Start Analysis button at the top right corner of the screen. This will launch the DXJupyterLab app, and take you to the project's Monitor tab, where you can monitor the app's status as it loads.
Once the analysis starts, you’ll see the notification "Running" appear under the name of the app.
Click the Monitor tab heading. This will open a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you’ve just launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.
Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.
Open a new Stata notebook by clicking the Stata tile in the Notebooks section.
You can download DNAnexus data files to the DXJupyterLab container from a Stata notebook with:
Data files in the current project can also be accessed from a Stata notebook via the /mnt/project folder, as follows. To load a DTA file:
To load a CSV file:
To write a DTA file to the DXJupyterLab container:
To write a CSV file to the DXJupyterLab container:
To upload a data file from the DXJupyterLab container to the project, use the following command in a Stata notebook:
Alternatively, open a new Launcher tab, open Terminal, and run:
Note that the /mnt/project directory is read-only, so trying to write to it results in an error.
The DXJupyterLab Spark cluster app can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:
A pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:
To upload a data file from the JupyterLab container to the DNAnexus project in the DXJupyterLab Spark cluster app, use:
Once saved to the project, data files can be used in a DXJupyterLab Stata session using the instructions above.
Learn about the DNAnexus Thrift server, a service that allows JDBC and ODBC clients to run Spark SQL queries.
The DNAnexus Thrift server connects to a high availability Apache Spark cluster integrated with the platform. It leverages the same security, permissions, and sharing features built into DNAnexus.
To connect to the Thrift server, you need the following:
The JDBC url:
Note: Azure UK South (OFH) region does not support access to Thrift.
We support the following username format:
TOKEN__PROJECTID, where TOKEN is a DNAnexus user-generated token and PROJECTID is a DNAnexus project ID used as the project context (when you create databases). Note the double underscore between the token and the project ID.
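Constructing the username can be sketched as follows; both values below are hypothetical placeholders for your own token and project ID:

```shell
# Hypothetical placeholders -- substitute your API token and project ID.
token="abcdef1234"
project="project-xxxx"

# Join them with a double underscore, as the format requires.
user="${token}__${project}"
echo "$user"   # abcdef1234__project-xxxx
```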
Additionally, the Thrift server that the user wants to connect to and the project must be in the same region.
See the Authentication tokens page.
Navigate to https://platform.dnanexus.com and log in using your username and password.
Go to Projects -> your project -> Settings -> Project ID and click on Copy to Clipboard.
Beeline is a JDBC client bundled with Apache Spark that can be used to run interactive queries on the command line.
You can download Apache Spark 3.5.2 for Hadoop 3.x from here.
You need to have Java installed and on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If you already have beeline installed and all of the credentials, you can quickly connect with the following command:
In the following AWS example, note that some characters are escaped (; with \).
Note that the command for connecting to Thrift is different for Azure, as seen below:
The beeline client is located under $SPARK_HOME/bin/
.
Connect to beeline using the JDBC URL:
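A generic invocation might look like the following sketch. The Thrift host placeholder and the URL parameters are illustrative only; the exact JDBC URL differs by region (and between AWS and Azure, as noted above):

```shell
# <thrift-host> is a placeholder for your region's Thrift server host.
# The -n argument is the TOKEN__PROJECTID username described above.
$SPARK_HOME/bin/beeline \
    -u "jdbc:hive2://<thrift-host>:10000/;ssl=true" \
    -n "TOKEN__PROJECTID"
```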
Once successfully connected, you should see the message:
You are now connected to the Thrift server using your credentials and will be able to see all databases to which you have access within your current region.
You can query using the unique database name including the downcased database ID, for example database_fjf3y28066y5jxj2b0gz4g85__metabric_data.
If the database is within the same project context you used to connect to the Thrift server, you can query using only the database name, for example metabric_data. If the database is located outside the project, you will need to use the unique database name.
You may also find databases stored in other projects by specifying the project context in the LIKE option of SHOW DATABASES, using the format '<project-id>:<database pattern>', like so:
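For example, issued non-interactively through beeline; the project ID and pattern are hypothetical, and $JDBC_URL / $THRIFT_USER stand for the connection settings described above:

```shell
# List databases in another project by prefixing the pattern with its project ID.
$SPARK_HOME/bin/beeline -u "$JDBC_URL" -n "$THRIFT_USER" \
    -e "SHOW DATABASES LIKE 'project-xxxx:metabric*'"
```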
Now you can run SQL queries.
This page is a reference for the most useful operations and features in the DNAnexus JupyterLab environment.
You can download input data from a project using dx download
in a notebook cell:
The %%bash keyword converts the whole cell to a magic cell, which allows us to run bash code in that cell without exiting the Python kernel. See more examples of magic commands in the IPython documentation. The ! prefix achieves the same result:
Alternatively, the dx
command can be executed from the terminal.
To download data with Python in the notebook, you can use the download_dxfile
function:
Check dxpy helper functions for details on how to download files and folders.
Any files from the execution environment can be uploaded to the project using dx upload
:
To upload data using Python in the notebook, you can use the upload_local_file
function:
Check dxpy helper functions for details on how to upload files and folders.
By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer)
.
You may upload and download data to the local execution environment in a similar way, i.e. by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download
.
It is useful to have the terminal provided by JupyterLab at hand; it uses the bash shell by default and lets you execute shell scripts or interact with the Platform via the dx toolkit. For example, a command such as dx env will confirm what the current project context is.
Running pwd
will show you that the working directory of the execution environment is /opt/notebooks
. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.
To open a terminal window, go to File
> New
> Terminal
or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File
> New Launcher
.
You can install packages in the execution environment from the notebook using pip, conda, apt-get, and other package managers:
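For example, from the terminal (or from a notebook cell with a ! prefix); the package names below are arbitrary examples, not requirements of the environment:

```shell
# Python package via pip.
pip install seaborn

# Package from a conda channel.
conda install -y -c conda-forge samtools

# System package via apt (the session runs with root privileges).
apt-get update && apt-get install -y tree
```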
By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.
You can access public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private ssh key that's registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories using git clone and push any changes back to GitHub using git push from the JupyterLab terminal.
Below is a screenshot of a JupyterLab session with a terminal displaying a script that:
sets up ssh key to access a private github repository and clones it,
clones a public repository,
downloads a json file from the DNAnexus project,
modifies an open-source notebook to convert the json file to csv format,
saves the modified notebook to the private github repository,
and uploads the results of json to csv conversion back to the DNAnexus project.
This animation shows the first part of the script in action:
A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd
input and additional input files using the in
input file array. The command will run in the directory where the JupyterLab server is started and notebooks are run, i.e. /opt/notebooks/
. Any output files generated in this directory will be uploaded to the project and returned in the out
output.
The cmd input makes it possible to use the papermill tool, pre-installed in the JupyterLab environment, to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
where notebook.ipynb is the input notebook to papermill, which needs to be passed in the in input, and output_notebook.ipynb is the name of the output notebook, which will store the result of the cells' execution. The output will be uploaded to the project at the end of the app execution.
If the snapshot parameter is specified, execution of cmd will take place in the specified Docker container. The duration argument will be ignored when running the app with cmd. To limit the runtime, the app can be run from the command line with the --extra-args flag, e.g. dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.
If cmd is not specified, the in parameter will be ignored and the output of the app will consist of an empty array.
If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA GPU Driver kernel-mode driver nvidia.ko
that is installed outside of the DXJupyterLab environment does not support the newer CUDA version required by your application. You can install NVIDIA Forward Compatibility packages to use the newer CUDA version required by your application by following the steps below in a DXJupyterLab terminal.
If you are away from the JupyterLab browser tabs for 15 to 30 minutes, you will be automatically logged out of the JupyterLab session and the JupyterLab tabs will display a "Server Connection Error" message. You can re-enter the JupyterLab session by simply reloading the JupyterLab webpage and logging into the Platform, which will redirect you back to the JupyterLab session.
Learn about preprocessing VCF data before using it in an analysis.
It may be necessary to preprocess, or harmonize, the data before you load it.
The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
The VCF data can include variant annotations. Of particular interest are SnpEff annotations, which are included in VCFs as INFO/ANN tags; SnpEff annotations, if present, are loaded into databases. If desired, you may pre-annotate your VCF data to include SnpEff annotations after harmonizing your data -- just pass your pVCF to any standard SnpEff annotator. If your pVCF is especially large, it may be advantageous to rely on the internal annotation step in the VCF loader instead of annotating the pVCF yourself. The VCF loader annotation step annotates the pVCF in a distributed, massively parallel way.
Note that the VCF loader does not persist the intermediate, annotated pVCF as a file, so if you want to have access to the annotated file up front, you should annotate it yourself.
VCF annotation flows: in (a) the annotation step is external to the VCF loader, whereas in (b) it is internal. In either case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF loader.
The command-line client and the client bindings use a set of environment variables in order to communicate with the API server and to store state on the current default project and directory. These settings are usually set when you run dx login
and can be later changed through other dx commands. To display the currently used settings in human-readable format, you can use the dx env
command:
To print the bash commands for setting the environment variables to match what dx
is using, you can run the same command with the --bash
flag.
Running a dx
command from the command-line will not (and cannot) overwrite your shell's environment variables, so it stores them in a file at ~/.dnanexus_config/environment
.
DNAnexus utilities load configuration values from the following sources, in order of priority:
Command line options (if available)
Environment variables already set in the shell
~/.dnanexus_config/environment.json (dx configuration file)
Hardcoded defaults
Note that the dx command will always prioritize the environment variables that are currently set in the shell. This means that if you have set your DX_SECURITY_CONTEXT environment variable and subsequently use dx login to log in as a different user, dx will still use the original environment variable. When not run in a script, dx prints a warning to stderr whenever the environment variables and its stored state mismatch. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables directly is generally reserved for shell scripts or for configuring the job environment in the cloud.
In the interaction below, environment variables have already been set; the user then uses dx to log in, but the shell's environment variables still take precedence.
If you instead want to discard the values that dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.
Most dx commands have the following additional flags to temporarily override the values of the respective variables.
For example, you can temporarily override the current default project used:
The CSV Loader ingests CSV files into a database. The input CSV files are loaded into a Parquet-format database and tables that can be queried using Spark SQL.
You can load a single CSV file or many CSV files. When loading many files, all files must be syntactically consistent. For example:
For example:
All files must have the same separator (e.g. comma, tab)
All files must include a header line, or all files must exclude it
NOTE: Each CSV file is loaded into its own table within the specified database.
Input:
CSV (array of CSV files to load into the database)
Required Parameters:
database_name -> name of the database to load the CSV files into.
create_mode -> strict mode creates the database and tables from scratch; optimistic mode creates them if they do not already exist.
insert_mode -> append appends data to the end of tables; overwrite is equivalent to truncating the tables and then appending to them.
table_name -> array of table names, one for each corresponding CSV file by array index.
type -> the cluster type; "spark" for Spark apps.
Other Options:
spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.
spark_read_csv_sep -> default , -- the separator character used by each CSV.
spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
The following case creates a brand new database and loads data into two new tables:
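A command-line sketch of such a run follows. The app executable name and input file IDs are hypothetical placeholders; parameter names follow the list above:

```shell
# Create a new database (strict mode) and load two CSVs into two new tables.
dx run csv-loader \
    -icsv=file-aaaa \
    -icsv=file-bbbb \
    -idatabase_name=my_database \
    -icreate_mode=strict \
    -iinsert_mode=append \
    -itable_name=first_table \
    -itable_name=second_table \
    -itype=spark
```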
VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many files case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In any case, every input VCF file must be a syntactically correct, sorted VCF file.
Input:
vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned, and every specified partition must be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
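The manifest constraints described above (distinct names, each ending in .vcf.gz) can be checked locally before a run; this is a sketch with hypothetical file IDs and names:

```python
def validate_vcf_manifest(entries):
    """Check manifest rules: every referenced file name ends in .vcf.gz
    and all names are distinct.

    `entries` is a list of (file_id, file_name) pairs corresponding to the
    files referenced by the manifest.
    """
    names = [name for _, name in entries]
    if any(not name.endswith(".vcf.gz") for name in names):
        raise ValueError("every referenced file name must end in .vcf.gz")
    if len(set(names)) != len(names):
        raise ValueError("referenced file names must be distinct")
    return True

# Hypothetical manifest contents: one partition per chromosome.
manifest = [("file-aaaa", "chr1.vcf.gz"), ("file-bbbb", "chr2.vcf.gz")]
print(validate_vcf_manifest(manifest))  # → True
```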
Required Parameters:
database_name: (string) name of the database into which to load the VCF files.
create_mode: (string) strict mode creates the database and tables from scratch; optimistic mode creates them only if they do not already exist.
insert_mode: (string) append appends data to the end of the tables; overwrite is equivalent to truncating the tables and then appending to them.
run_mode: (string) site mode processes only the site-specific data; genotype mode processes genotype-specific data and other non-site-specific data; all mode processes both types of data.
etl_spec_id: (string) currently only the genomics-phenotype schema choice is supported.
is_sample_partitioned: (boolean) whether the raw VCF data is partitioned by sample.
Other Options:
snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing, with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome: (string) default GRCh38.92 -- ID of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.
calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. This step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions: (int) the number of partitions for the initial VCF-lines Spark RDD.
The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.
Input:
sqlfile: [Required] A SQL file containing an ordered list of SQL queries.
substitutions: A JSON file containing the variable substitutions.
user_config: A user configuration JSON file, in case you want to set or override certain Spark configurations.
Other Options:
export: (boolean) default false. Exports output files with the results for the queries in the sqlfile.
export_options: A JSON file containing the export configurations.
collect_logs: (boolean) default false. Collects cluster logs from all nodes.
executor_memory: (string) Amount of memory to use per executor process (e.g. 2g, 8g), in MiB unless otherwise specified. Passed as --executor-memory to spark-submit.
executor_cores: (integer) Number of cores to use per executor process. Passed as --executor-cores to spark-submit.
driver_memory: (string) Amount of memory to use for the driver process (e.g. 2g, 8g). Passed as --driver-memory to spark-submit.
log_level: (string) default INFO. Logging level for both driver and executors. One of [ALL, TRACE, DEBUG, INFO].
Output:
output_files: Output files include the report SQL file and the query export files.
The SQL runner extracts each command in sqlfile and runs them in sequential order. Every SQL command must be separated by a semicolon (;). Any line starting with -- is ignored (a comment). Any comment within a command should be written inside /*...*/. The following are examples of valid comments:
Variable substitution can be done by specifying the variables to replace in substitutions. In the above example, each reference to srcdb within ${...} in sqlfile will be substituted with sskrdemo1, for example select * from ${srcdb}.${patient_table};. The script adds the set command before executing any of the SQL commands in sqlfile, so select * from ${srcdb}.${patient_table}; would translate to:
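The substitution mechanism itself can be sketched in plain Python. The srcdb value mirrors the example above; the patient_table value is hypothetical:

```python
import re

def substitute(sql, variables):
    """Replace every ${name} occurrence in `sql` with its value from `variables`."""
    return re.sub(r"\$\{(\w+)\}", lambda m: variables[m.group(1)], sql)

# srcdb comes from the example above; the patient_table value is made up.
subs = {"srcdb": "sskrdemo1", "patient_table": "patient"}
query = "select * from ${srcdb}.${patient_table};"
print(substitute(query, subs))  # → select * from sskrdemo1.patient;
```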
If enabled, the results of the SQL commands will be exported to CSV files. export_options defines the export configuration.
num_files: default 1. Defines the maximum number of output files to generate. This generally depends on how many executors are running in the cluster and on how many partitions of the file exist in the system. Each output file corresponds to a part file in Parquet.
fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with the query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv.
header: default true. If true, a header is added to each exported file.
These values in spark-defaults.conf will override or add to the default Spark configuration.
Two files are generated in the export folder:
<JobId>-export.tar: contains all the query results.
<JobId>-outfile.sql: the SQL debug file.
Extracting the export tar file will look like:
In the above example, demo is the fileprefix used. There is one folder for each query; each folder has a .sql file containing the query executed and a .csv folder containing the result CSV.
Every SQL Runner execution generates a SQL runner debug report file (a .sql file). It lists all the queries executed and the status of each execution (Success or Fail), along with the name of the output file for each command and the time taken. If there are any failures, it reports the failing query and stops executing subsequent commands.
While executing the series of SQL commands, one of the commands could fail (an error, a syntax problem, etc.). In that case the app quits and uploads a SQL debug file to the project:
As you can see, it identifies the line with the SQL error and its response. You can then fix the query in the .sql file and even use this report file as the input for a subsequent run, picking up where it left off.
Numerical (Integer)
Numerical (Float)
Date
Datetime
Categorical (<=20 distinct category values)
Categorical Multi-Select (<=20 distinct category values)
Categorical or Categorical Multiple (<=15 categories)
Numerical (Integer) or Numerical (Float)
Categorical (<=20 distinct category values)
Categorical (<=20 distinct category values)
Numerical (Integer) or Numerical (Float)
Numerical (Integer) or Numerical (Float)
Categorical (<=20 distinct category values)
Categorical Multiple (<=20 distinct category values)
Categorical Hierarchical (<=20 distinct category values)
Categorical Hierarchical Multiple (<=20 distinct category values)
Numerical (Integer)
Numerical (Float)
GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.
Note: To learn more, see the GLnexus documentation.
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Although VCF data can be loaded into Apollo databases immediately after the variant-calling step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, you'll want to preprocess and harmonize your data before loading.
You can use the dx ls command to list the objects in your current project, and the dx pwd command to learn which project and folder you are currently in. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.
By listing objects in your current directory with the wildcard characters * and ?, you can search for objects whose filenames match a glob pattern. Here we take the folder "C. Elegans - Ce10/" in the public project "Reference Genome Files" (platform login required to access this link) and walk through these examples:
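The matching behavior of * and ? can be reproduced locally with Python's fnmatch module; the file names below are illustrative, not real project contents:

```python
from fnmatch import fnmatch

files = ["ce10.fa.gz", "ce10.fasta.gz", "ce10.bwa-index.tar.gz"]

# '*' matches zero or more characters, so every .gz file matches "*.gz".
print([f for f in files if fnmatch(f, "*.gz")])

# '?' matches exactly one character: "ce1?.fa.gz" matches "ce10.fa.gz" only.
print([f for f in files if fnmatch(f, "ce1?.fa.gz")])  # → ['ce10.fa.gz']
```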
If you wish to search the entire project with a filename pattern, you can use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data under the current project. Below, we use the command dx find data in the public project "Reference Genome Files" (platform login required to access this link) with the --name option to specify the filename of the objects we're searching for.
As described above, if a file's name contains special characters, they should be escaped when searching. Additionally, since a colon (:) is used to denote project names and a slash (/) is used to separate folder names on the platform, these two are also special characters, so we need to escape them as well when they appear in a data object's name. To escape any special character, use a preceding backslash (\).
Please note that while dx-toolkit itself requires only a single \ to escape a colon or a slash, the syntax conventions of some shells may require you to escape the \ character itself with an extra backslash, or to enclose the argument in single quotes.
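A minimal helper for producing the escaped form is sketched below; dx itself only requires the backslash before each special character, and the sample name is hypothetical:

```python
def dx_escape(name):
    """Prefix the platform-special characters (:, /, and \\) with a backslash."""
    out = []
    for ch in name:
        if ch in ":/\\":
            out.append("\\")
        out.append(ch)
    return "".join(out)

# Hypothetical object name containing both a colon and a slash.
print(dx_escape("sample:2021/reads.fastq.gz"))  # → sample\:2021\/reads.fastq.gz
```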
dx find data also allows you to search for data using metadata fields, such as when the data was created, the data's tags, or the project in which the data exists. You can use the flags --created-after and --created-before to search for data objects created within a period of time.
You can search for objects based on their metadata. An object's metadata can be set with the commands dx tag and dx set_properties, which respectively tag an object or set key-value pairs describing it. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags. To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.
You can search for an object that lives in a different project than your current working project by specifying a project and folder path with the flag --path. Below, we specify the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project "Exome Analysis Demo" (platform login required to access this link) as an example. If you would like to search for data objects in all projects in which you have VIEW or above permissions, use the --all-projects flag. Public projects are not shown in this search.
To describe data for small numbers of files (typically below 100), scope findDataObjects to a single project. The below is an example of code used to scope the search to a project:
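As a sketch, a system/findDataObjects request body scoped to a single (hypothetical) project might look like the following; the project ID is a placeholder:

```python
import json

# Hypothetical request body for system/findDataObjects, scoped to one project
# so the search does not fan out across every project you can view.
find_request = {
    "class": "file",
    "scope": {
        "project": "project-xxxx",  # placeholder project ID
        "folder": "/",
        "recurse": True,
    },
    "describe": True,  # also return a describe hash for each result
}

print(json.dumps(find_request, indent=2))
```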
See the API method system/findDataObjects for more information about usage.
The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.
The Relational Database Service is currently available via the application programming interface (API) in AWS regions only. See the DBClusters API page for details.
When describing a DNAnexus dbcluster, the status field can be any of the following:
creating: The database cluster is being created, but is not yet available for reading/writing.
available: The database cluster is created and all replicas are available for reading/writing.
stopping: The database cluster is currently being stopped.
stopped: The database cluster is stopped.
starting: The database cluster is restarting from a stopped state; it will transition to available when ready.
terminating: The database cluster is being terminated.
terminated: The database cluster has been terminated and all data deleted.
DB Clusters are not accessible from outside of the DNAnexus platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page on cloud workstations for one possible way to access a DB Cluster from within a job. Executions such as app/applets can access a DB Cluster as well.
The parameters needed for connecting to the database are:
host: use the endpoint returned from dbcluster-xxxx/describe
port: 3306 for MySQL engines or 5432 for PostgreSQL engines
user: root
password: use the adminPassword specified when creating the database via dbcluster/new
For MySQL: ssl-mode 'required'
For PostgreSQL: sslmode 'require'. Note: for connecting and verifying certs, see https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html
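Putting those parameters together, connection URLs for the two engine families can be sketched as below; the endpoint and password are placeholders, and the exact SSL option name depends on the client library you use:

```python
# Placeholder values: the real endpoint comes from dbcluster-xxxx/describe,
# and the password is the adminPassword chosen at dbcluster/new time.
endpoint = "mydb.cluster-xxxx.us-east-1.rds.amazonaws.com"
admin_password = "s3cret"

# MySQL engines listen on 3306 and require ssl-mode=required;
# PostgreSQL engines listen on 5432 and require sslmode=require.
mysql_url = f"mysql://root:{admin_password}@{endpoint}:3306/?ssl-mode=required"
postgres_url = f"postgresql://root:{admin_password}@{endpoint}:5432/?sslmode=require"

print(mysql_url)
print(postgres_url)
```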
The table below provides all the valid configurations of dxInstanceClass, database engine, and engine versions.

| dxInstanceClass | Engine versions | Memory (GB) | Cores |
| --- | --- | --- | --- |
| db_std1_x2 (*) | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 4 | 2 |
| db_mem1_x2 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 16 | 2 |
| db_mem1_x4 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 32 | 4 |
| db_mem1_x8 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 64 | 8 |
| db_mem1_x16 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 128 | 16 |
| db_mem1_x32 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 244 | 32 |
| db_mem1_x48 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 384 | 48 |
| db_mem1_x64 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 488 | 64 |
| db_mem1_x96 | aurora-postgresql: ["12.9", "13.9", "14.6"] | 768 | 96 |
* db_std1 instances may incur CPU Burst charges, similar to the AWS T3 DB instances described in this AWS documentation. Regular hourly charges for this instance type are based on 1 core; CPU Burst charges are based on 2 cores.
If a project contains a DBCluster, its ownership cannot be changed. A PermissionDenied error is returned on attempting to change the billTo of such a project.
You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.
The filter bar lets you specify different criteria on which to filter your data. You can combine several different filters for greater control over your results.
To use this feature, first choose the field you want to filter your data by, then enter your filter criteria (e.g. select the "Name" filter, then search for "NA12878"). The filter is usually activated when you press Enter or when you click outside of the filter bar.
The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.
Billed to: The user or org ID that the project is billed to, e.g. "user-xxxx" or "org-xxxx". NOTE: When you are viewing a partner organization's projects, the "Billed to" field will be fixed to the org ID.
Project Name: Search by case insensitive string or regex, e.g. "Example" or "exam$" will both match "Example Project"
ID: Search by project ID, e.g. "project-xxxx"
Created date: Search by projects created before, after, or between different dates
Modified date: Search by projects modified before, after, or between different dates
Creator: The user ID who created the project, e.g. "user-xxxx"
Shared with member: A user ID with whom the project is shared, e.g. "user-xxxx" or "org-xxxx"
Level: The minimum permission level for the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only", e.g. "Contributor+" filters for projects with CONTRIBUTOR or ADMINISTER access
Tags: Search by tag. The filter bar automatically populates with tags available on projects
Properties: Search by properties. The filter bar automatically provides properties available on projects
The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.
Object name: Search by case insensitive string or regex, e.g. "NA1" or "bam$" both match "NA12878.bam"
ID: Search by object ID, e.g. "file-xxxx" or "applet-xxxx"
Modified date: Search by objects modified before, after, or between different dates
Class: e.g. "File", "Applet", "Folder"
Types: e.g. "File" or custom Type
Created date: Search by objects created before, after, or between different dates
Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder
Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder
When you filter on anything other than the current folder, you will get results from many different places in the project. The folder paths are displayed in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.
The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead
State: e.g. Failed, Waiting, Done, Running, In Progress, Terminated
Name: Search by case-insensitive string or regex, e.g. "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.
ID: Search by job or analysis ID, e.g. "job-1234" or "analysis-5678"
Created date: Search by executions created before, after, or between different dates
Launched by: Search by the user ID of the user who launched the job. The filter bar automatically populates with users who have run the currently-shown jobs within the project
Tags: Search by tag. The filter bar automatically populates with tags available on the currently-shown executions
Properties: Search by properties. The filter bar automatically provides properties available on executions currently shown within the project
Executable: Search by the ID of the executable run by the executions in question (e.g. app-1234 or applet-5678)
Class: e.g. Analysis or Job
Origin Jobs: ID of origin job
Parent Jobs: ID of parent job
Parent Analysis: ID of parent analysis
Root Executions: ID of root execution
When filtering on a name, any spaces are expanded to include intermediate words. For example, filtering by "b37 build" also returns "b37 dbSNP build".
Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time period in minutes, hours, days, weeks, months, or an absolute time period by specifying a certain date.
If you select a relative time period, you are able to represent it as some amount of time before the current time. For example, if you select "Day" and type in 5, you are setting the datetime to the time 5 days prior to the current time.
Alternatively, you can use the calendar to represent an exact (absolute) datetime.
If you only set the beginning datetime ("From"), the range will automatically be set from the "From" time to the present moment. If you only set the end ("To") datetime, the range will be set from the beginning of time to the "To" time.
Note that if you have a filter saved with a time period defined in relative terms, it is recalculated from the current time every time you load the filter. For example, if you have a saved filter showing all items created within the last two hours and you load it at 11am, it shows you everything created since 9am. But if you load it again at 4pm, it shows you everything created since 2pm. If you want consistent results when using a saved filter, make sure to specify an absolute datetime using the calendar widget.
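The relative-time behavior is easy to sketch: a saved relative period is resolved against the current clock on every load. This covers the timedelta-friendly units (minutes, hours, days, weeks); the times below are illustrative:

```python
from datetime import datetime, timedelta

def resolve_relative(amount, unit, now=None):
    """Turn a relative period (e.g. 2 hours) into an absolute 'From' datetime.

    `unit` is a timedelta keyword: "minutes", "hours", "days", or "weeks".
    """
    now = now or datetime.now()
    return now - timedelta(**{unit: amount})

# The same saved relative filter resolves differently at different load times.
print(resolve_relative(2, "hours", now=datetime(2024, 1, 1, 11, 0)))  # → 2024-01-01 09:00:00
print(resolve_relative(2, "hours", now=datetime(2024, 1, 1, 16, 0)))  # → 2024-01-01 14:00:00
```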
To search by tag, simply enter (or select) the tag(s) you are searching for. For example, if you are searching for all objects with the tag "human", enter human into the filter query box and tick the checkbox next to the tag.
Note that while normal searches on project titles only require you to type in part of the title, searches using the above keywords require the entire value to be typed in, although casing doesn't matter (so, for example, searches for HUMAN and for human will both find a project with the tag "Human", but a search for Hum will not).
Properties have two parts: a key and a value. You'll be asked to enter each of these when you create a new property. Just like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly and easily. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.
To search for all items that possess a property, regardless of the value of that property, simply select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that possess a property with a certain value, enter that property's key and value.
Note that the keys and values must be entered in their entirety. For example, entering the key sample and the value NA will not find objects corresponding to {"sample_id": "NA12878"}!
Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you can choose whether to search for objects containing any of the selected tags or all of the selected tags.
Given the following set of objects:
Object 1 (tags: "human", "normal")
Object 2 (tags: "human", "tumor")
Object 3 (tags: "mouse", "tumor")
Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.
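The any/all distinction is just a set test, and the example above can be reproduced directly:

```python
# The three objects from the example above, with their tags.
objects = {
    "Object 1": {"human", "normal"},
    "Object 2": {"human", "tumor"},
    "Object 3": {"mouse", "tumor"},
}
selected = {"human", "tumor"}

# "Any tag": the object's tags intersect the selected tags.
any_match = [name for name, tags in objects.items() if tags & selected]
# "All tags": the selected tags are a subset of the object's tags.
all_match = [name for name, tags in objects.items() if selected <= tags]

print(any_match)  # → ['Object 1', 'Object 2', 'Object 3']
print(all_match)  # → ['Object 2']
```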
Click the "Clear All Filters" button on the filter bar to reset your filters.
If you wish to save your filters, active filters are saved in the URL of the filtered page. You can bookmark this URL in your browser to return to your filtered view in the future.
Note that the bookmarked URL is a saved set of search parameters, not a saved set of search results. Every time you click on the bookmarked filters, those parameters are loaded into the filter bar. For example, if your saved filters includes a filter for items created in the last thirty days, each time you click that search it shows you items created in the last thirty days from the current point in time, not thirty days before the URL was created. Thus, as time goes by, you may notice slightly different results every time you click on a saved search.
You can upload files to the DNAnexus platform using the command dx upload
. You can also upload data using the DNAnexus Upload Agent, a fast and convenient command-line client. For uploading multiple or large files (>50 MB), we recommend that you use Upload Agent, which allows you to upload up to 1000 files concurrently and resume uploads in case of network interruption.
You can use the dx upload command followed by a file path to upload one local file.
If you are trying to upload a file hosted at a publicly accessible URL rather than locally, you can use the URL Fetcher (platform login required to access this link) app.
You can also use the dx upload command followed by multiple file paths to upload multiple local files.
The following command will upload all of the files inside the /Users/alice/
directory. This directory should contain only files and not sub-directories.
You can specify the -r/--recursive parameter with the dx upload command to recursively upload one or more directories or folders and maintain their respective structures.
You can use the dx ls command to list the uploaded folders and files.
NOTE: If the local uploaded directory ends with the / character, only the contents of the directory will be uploaded, not the directory itself.
You can specify object metadata using the dx upload command.
You can use the --property KEY=VALUE parameter to add metadata to the file being uploaded. The parameter may be repeated as necessary, e.g. --property key1=val1 --property key2=val2, to attach multiple metadata fields as key-value pairs.
You can also add tags to uploaded files with the dx upload command using the --tag TAG parameter. The parameter may be repeated as necessary, e.g. --tag tag1 --tag tag2, to attach multiple tags to a file.
You can use the --path/--destination parameter to specify the DNAnexus destination path. If the path is not specified, dx upload defaults to the current project and folder. You can determine your current project and folder using the command dx pwd.
You can upload data directly from standard input. This is very useful when streaming the upload while the file is generated.
You can specify the --buffer-size parameter with the dx upload command to set the write buffer size in bytes. When uploading large files from standard input, you will need to set --buffer-size manually, because the dx command does not know the file size beforehand.
You can specify the --no-progress flag with the dx upload command to hide the progress bar.
You can describe objects (files, app(let)s, and workflows) on the DNAnexus platform using the command dx describe.
Objects can be described using their DNAnexus platform name via the command line interface (CLI) using a path.
Objects can be described relative to your current directory on the DNAnexus platform. In the following example, we describe the indexed reference genome file human_g1k_v37.bwa-index.tar.gz.
NOTE: The entire path is enclosed in quotes due to the space in the folder name "Original files". Instead of quotes, you can escape special characters with the \ character: dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.
Objects can be described using an absolute path. This allows us to describe objects outside the current project context. In the following example, we dx select the project "My Research Project" and dx describe the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.
Objects can be described using a unique object ID.
In this example, we describe workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.
Due to the amount of information contained in a workflow (including multiple app(let)s, inputs/outputs, and default parameters), the dx describe output can seem overwhelming.
The output from a dx describe command can be used for various purposes. The optional argument --json converts the output of dx describe into JSON format for advanced scripting and command-line use.
In this example, we will describe the publicly available workflow object "Exome Analysis Workflow" and return the output in JSON format.
We can parse, process, and query the JSON output using jq. Below, we process the dx describe --json output to generate a list of all stages in the aforementioned exome analysis pipeline. We can then output the "executable" value of each stage present in the "stages" value of the dx describe output using the command below.
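The same extraction can also be sketched in Python on a saved describe output. The JSON below is a trimmed, hypothetical stand-in for real dx describe --json output (the stage and app names are made up):

```python
import json

# Trimmed, hypothetical stand-in for `dx describe --json` output of a workflow.
describe_json = json.loads("""
{
  "id": "workflow-xxxx",
  "stages": [
    {"id": "stage-1", "executable": "app-mapper"},
    {"id": "stage-2", "executable": "app-variant-caller"}
  ]
}
""")

# Equivalent of `jq '.stages[].executable'`: pull each stage's executable.
executables = [stage["executable"] for stage in describe_json["stages"]]
print(executables)  # → ['app-mapper', 'app-variant-caller']
```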
| Field name | Objects | Description |
| --- | --- | --- |
| ID | All | Unique ID assigned to a DNAnexus object. |
| Class | All | DNAnexus object type. |
| Project | All | Container where the object is stored. |
| Folder | All | Objects inside a container (project) can be organized into folders. Objects can only exist in one path within a project. |
| Name | All | Object name on the platform. |
| State | All | Status of the object on the platform. |
| Visibility | All | Whether or not the file is visible to the user through the platform web interface. |
| Tags | All | Set of tags associated with an object. Tags are strings used to organize or annotate objects. |
| Properties | All | Key/value pairs attached to an object. |
| Links | All | JSON reference to another object on the platform. Linked objects will be copied along with the object if the object is cloned to another project. |
| Created | All | Date and time the object was created. |
| Created by | All | DNAnexus user who created the object. Contains the subfield "via the job" if the object was created as a result of an app or applet. |
| Last modified | All | Date and time the object was last modified. |
| Input Spec | App(let)s and Workflows | App(let) or workflow input names and classes. For workflows, the corresponding applet stage ID is also provided. |
| Output Spec | App(let)s and Workflows | App(let) or workflow output names and classes. For workflows, the corresponding applet stage ID is also provided. |
There are several different methods with which you can view your files and data on the DNAnexus platform.
DNAnexus allows users to preview and open the following file types directly on the platform:
TXT
PNG
HTML
To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, you will see the "Preview" and "Open in New Tab" options in the toolbar above.
Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.
"Preview" will open a fixed-sized box in your current tab for you to preview the file of interest. "Open in New Tab" will allow you to view the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may yield different results.
For files not listed in the section above, the DNAnexus platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.
A Viewer is simply an HTML file that you can give one or more DNAnexus URLs representing files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.
You can easily launch a viewer by clicking on the Visualize tab within a project.
This tab will open a window showing all Viewers available to you within your project. If you have created any Viewers yourself and saved them within your current project, these will show up in this list along with the DNAnexus-provided Viewers.
Clicking on a Viewer will open a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any of your other data.) From there, you can either create a Viewer Shortcut or launch the Viewer.
The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either viewer, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi files for each variant track you want to add. In addition, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.
For more information about BioDalliance, consult http://www.biodalliance.org/started.html. For IGV.js, see http://igv.org.
The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used.) When launching this viewer, tick one or more BAM files (*.bam).
The Jupyter notebook viewer shows *.ipynb notebook files, displaying notebook images, highlighting code blocks, and rendering Markdown blocks.
This viewer allows you to uncompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).
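As a concrete illustration of what this viewer does, here is the equivalent command-line peek, using a small gzipped file created on the spot (zcat is the GNU coreutils form; on macOS use gzip -dc):

```shell
# Create a small gzipped "reads" file, then peek at its first lines,
# mirroring what the gzip viewer shows on the Platform.
printf 'line1\nline2\nline3\n' > reads.txt
gzip -f reads.txt                 # produces reads.txt.gz
zcat reads.txt.gz | head -n 2     # prints line1 and line2
```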
If a viewer fails to load, please temporarily disable browser extensions such as AdBlock and Privacy Badger. Additionally, viewers are not supported in Incognito browser windows.
Developers comfortable with HTML and JavaScript can create custom viewers to visualize data on the platform.
Viewer Shortcuts are objects that, when opened, launch a data selector for choosing the inputs with which to launch a specified Viewer. A Viewer Shortcut consists of a Viewer and an array of inputs that will be selected by default.
The Viewer Shortcut will show up in your project as an object of type "Viewer Shortcut." You can change the name of the Viewer Shortcut and move it within your folders and projects as you would any other object in the DNAnexus platform.
Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.
The archiving feature is file-based. Users can archive individual files, folders, or entire projects to save on storage costs, and can easily unarchive one or more files, folders, or projects when they need to make the data available for further analyses.
The DNAnexus Archive Service is currently available via the application programming interface (API) in AWS and Microsoft Azure regions.
To understand the archival life cycle as well as which operations can be performed on files and how billing works, it’s helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:
| Archival state | Details |
| --- | --- |
| live | The file is in standard storage, such as AWS S3 or Azure Blob. |
| archival | Archival has been requested for this copy of the file, but other copies of the same file are still in the live state in other projects with the same billTo entity. The file remains in standard storage. |
| archived | The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE. |
| unarchiving | Unarchival has been requested for the file. The file is in transition from archival storage to standard storage. |
A file's archival state determines which operations can be performed on it. See the table below for the operations available in each state.
| Archival state | Download | Clone | Compute | Archive | Unarchive |
| --- | --- | --- | --- | --- | --- |
| live | Yes | Yes | Yes | Yes | No |
| archival | No | Yes* | No | No | Yes (Cancel archive) |
| archived | No | Yes | No | No | Yes |
| unarchiving | No | No | No | No | No |
* The clone operation will fail if the object is actively transitioning from archival to archived.
When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. Only when all copies of the file in all projects with the same billTo organization are in the archival state does the platform automatically transition the file to the archived state.
Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from archived to unarchiving. In the unarchiving state, the file is being restored by the third-party storage platform (e.g., AWS or Azure). This process may take a while, depending on the retrieval option selected for the specific platform. When the unarchival process completes and the file becomes available in standard storage, the file transitions to the live state.
The File-based Archive Service allows users who have CONTRIBUTE or ADMINISTER permissions on a project to archive or unarchive files that reside in the project. Via API calls, users can archive or unarchive files, folders, or entire projects, although the archival process itself happens at the file level. The API accepts a list of up to 1000 files for archival or unarchival. When archiving or unarchiving folders or projects, the API by default archives or unarchives all files at the root level and those in subfolders, recursively. If you archive a folder or project that includes files in different states, the Service will only archive files that are in the live state and skip files in other states. Likewise, if you unarchive a folder or project that includes files in different states, the Service will only unarchive files that are in the archived state, transition archival files back to the live state, and skip files in other states.
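As a hedged illustration of these calls, the request body is plain JSON (all project and file IDs below are placeholders):

```shell
# Placeholder IDs throughout. With the dx toolkit installed, the calls would be:
#
#   dx api project-xxxx archive   '{"files": ["file-aaaa", "file-bbbb"]}'
#   dx api project-xxxx archive   '{"folder": "/results"}'    # recurses into subfolders
#   dx api project-xxxx unarchive '{"folder": "/results"}'
#
# The request body is plain JSON, e.g. a list of up to 1000 file IDs:
body='{"files": ["file-aaaa", "file-bbbb"]}'
echo "$body" | python3 -m json.tool
```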
All fees associated with archiving a file are billed to the billTo organization of the project. There are several charges associated with archival:
Standard storage charge: The monthly charge for files located in standard storage on the platform. Files in the live and archival states incur this charge. The archival state indicates that the file is waiting to be archived, or that other copies of the same file in other projects are still in the live state, so the file remains in standard storage (such as AWS S3). The standard storage charge continues to accrue until all copies of the file have been requested for archival and the file is moved to archival storage, transitioning into the archived state.
Archival storage charge: The monthly charge for files located in archival storage on the platform. Files in the archived state incur this charge.
Retrieval fee: A one-time charge, applied at the time of unarchival, based on the data volume being unarchived. Retrieval fees for third-party services can be found at:
Amazon AWS: https://aws.amazon.com/glacier/pricing
Microsoft Azure: https://azure.microsoft.com/en-us/pricing/details/storage/blobs
Early retrieval fee: Because the Archive Service is designed for long-term storage of infrequently used data, a fee applies to data retrieved before the minimum storage period has been met. For AWS regions this period is 90 days; for Microsoft Azure regions it is 180 days. Data unarchived before the minimum period has elapsed incur a pro-rated early retrieval charge, equal to the archival storage charge for the remaining days.
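As a hedged, illustrative calculation (file size and per-GB rate below are made-up numbers; consult the pricing pages above for real figures): a 100 GB file in an AWS region retrieved after 30 days has 60 days remaining of the 90-day minimum, so the early retrieval fee equals the archival storage charge for those 60 days.

```shell
# Illustrative numbers only: AWS minimum storage period is 90 days.
minimum_days=90
days_archived=30
remaining_days=$((minimum_days - days_archived))
echo "$remaining_days"    # 60 days of archival charges still owed

# Hypothetical rate of $0.004 per GB-month for a 100 GB file:
# fee = size_gb * monthly_rate / 30 * remaining_days
awk -v d="$remaining_days" 'BEGIN { printf "%.2f\n", 100 * 0.004 / 30 * d }'   # 0.80
```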
When using the Archive Service, we recommend the following best practices.
The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.
If a file is shared in multiple projects, archiving the copy in one of the projects will only transition the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you must ensure that all copies of the file in all projects with the same billTo org are archived. When all copies of the file transition into the archival state, the Service automatically transitions the file from the archival state to the archived state. We recommend using the allCopies option of the API to force archiving of all copies of the file. You must be an org ADMIN of the billTo org of the current project to use the allCopies option.
Refer to the following example: file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, all of which share the same billTo org (org-xxxx). You have ADMINISTER access to project-xxxx and CONTRIBUTE access to project-yyyy, but no role in project-zzzz. You are an org ADMIN of the project's billTo org, and you try to archive all copies of the file in all projects with the same billTo org using /project-xxxx/archive. The platform will:
1. List all the copies of the file in org-xxxx.
2. Force archiving of all the copies of file-xxxx.
3. All copies of file-xxxx will be archived and transitioned into the archived state.
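A hedged sketch of the forcing call described above (all IDs are placeholders; per the text, this requires org ADMIN rights on the billTo org):

```shell
# Placeholder IDs. With the dx toolkit installed, the call would be:
#
#   dx api project-xxxx archive '{"files": ["file-xxxx"], "allCopies": true}'
#
# The request body adds the allCopies flag to a normal archive request:
body='{"files": ["file-xxxx"], "allCopies": true}'
echo "$body" | python3 -m json.tool
```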
Learn to use the Symlinks feature to access, work with, and modify files that are stored on an external cloud service.
The DNAnexus Symlinks feature enables users to link external data files on AWS S3 and Azure blob storage as objects on the platform, and to access such objects for any usage as though they were native DNAnexus file objects.
No storage costs are incurred when using symlinked files on the Platform. When used by jobs, symlinked files are downloaded to the Platform at runtime.
Symlinked files stored in AWS S3 or Azure blob storage are made accessible on DNAnexus via a Symlink Drive. The drive contains the necessary cloud storage credentials, and can be created by following Step 1 below.
Symlink Drives are set up via the CLI. To create one, you will need to provide:
A name for the Symlink Drive
The cloud service (AWS or Azure) where your files are stored
The access credentials required by the service
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "aws",
"credentials" : {
"accessKeyId" : "<my_aws_access_key>",
"secretAccessKey" : "<my_aws_secret_access_key>"
}
}'
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "azure",
"credentials" : {
"account" : "<my_azure_storage_account_name>",
"key" : "<my_azure_storage_access_key>"
}
}'
After you've entered the appropriate command, a new drive object will be created. You'll see a confirmation message that includes the id of the new Symlink Drive, in the format drive-xxxx.
By associating a DNAnexus Platform project with a Symlink Drive, you can both:
Have all new project files automatically uploaded to the AWS S3 bucket or Azure blob to which the Drive links
Enable project members to work with those files
Note that "new project files" includes all of the following:
Newly created files
File outputs from jobs
Files uploaded to the project
Note that non-symlinked files cloned into a symlinked project will not be uploaded to the linked AWS S3 bucket or Azure blob.
When creating a new project via the UI, you can link it with an existing Symlink Drive by toggling the Enable Auto-Symlink in This Project setting to "On":
Next:
In the Symlink Drive field, select the drive with which the project should be linked
In the Container field, enter the name of the AWS S3 bucket or Azure blob where newly created files should be stored
Optionally, in the Prefix field, enter the name of a folder within the AWS S3 bucket or Azure blob where these files should be stored
When creating a new project via the CLI, you can link it to a Symlink Drive by using the optional argument --default-symlink with dx new project. See the dx new project documentation for details on inputs and input format.
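A hedged sketch of such a call (the project name, drive ID, and container values are placeholders, and the JSON field names are an assumption here; confirm them against the dx new project documentation):

```shell
# Placeholder values. With the dx toolkit installed, the call would be:
#
#   dx new project MySymlinkProject --default-symlink "$symlink"
#
# where the argument names the drive, the bucket/blob container, and an
# optional folder prefix:
symlink='{"drive": "drive-xxxx", "container": "my-s3-bucket", "prefix": "/uploads"}'
echo "$symlink" | python3 -m json.tool
```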
In order to ensure that files can be saved to your AWS S3 bucket or Azure blob, you must enable CORS for that remote storage container.
See this AWS documentation for guidance in enabling CORS for an S3 bucket.
Use the following JSON object when configuring CORS for the bucket:
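A minimal S3 CORS policy of the general shape this step calls for looks like the following (a hedged sketch: the exact methods, headers, and origins DNAnexus requires may be stricter, so treat the wildcard origin as a placeholder to narrow for production use):

```json
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["ETag"]
    }
]
```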
See this documentation for general guidance on enabling CORS for an Azure blob.
Working with Symlinked files is largely the same as working with files that are stored on the Platform. These files can, for example, be used as inputs to apps, applets, or workflows.
If you rename a symlink on DNAnexus, this does not change the name of the file in S3 or Azure blob storage. Note that in this example, the symlink has been renamed from the original name file.txt to Example File. The remote filename, as shown in the Remote Path field in the right-side info pane, remains file.txt.
If you delete a symlink on the Platform, the file to which it points is not deleted.
If your cloud access credentials change, you must update the definition of each Symlink Drive to continue using the files to which it provides access.
To update a drive definition with new AWS access credentials, use the following command:
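A hedged sketch, assuming the drive object accepts an update call with a replacement credentials object (drive-xxxx and the key values are placeholders):

```shell
# Placeholder drive ID and keys. With the dx toolkit installed:
#
#   dx api drive-xxxx update "$creds"
#
# The body carries the new AWS credentials, mirroring the fields used
# when the drive was created:
creds='{"credentials": {"accessKeyId": "<new_key_id>", "secretAccessKey": "<new_secret>"}}'
echo "$creds" | python3 -m json.tool
```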
To update a drive definition with new Azure access credentials, use the following command:
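Analogously for Azure, a hedged sketch (again assuming an update call on the drive object; the drive ID and values are placeholders):

```shell
# Placeholder drive ID and keys. With the dx toolkit installed:
#
#   dx api drive-xxxx update "$creds"
#
# The body carries the new Azure storage account credentials, mirroring
# the fields used when the drive was created:
creds='{"credentials": {"account": "<storage_account_name>", "key": "<new_access_key>"}}'
echo "$creds" | python3 -m json.tool
```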
For more information, see this detailed guide to working with Symlink Drives.
No, the symlinked file will only move within the project. The change will not be mirrored in the linked S3 or Azure blob container.
The job will fail after it is unable to retrieve the source file.
Yes, you can copy a symlinked file from one project to another. This includes copying symlinked files from a symlink-enabled project to a project without this feature enabled.
Yes - egress charges will be incurred.
In this scenario, the uploaded file will overwrite, or "clobber," the file that shares its name, and only the newly uploaded file will be stored in the AWS S3 bucket or Azure blob.
This is true even if, within your project, you first renamed the symlinked file and then uploaded a new file with the prior name. For example, if you upload a file named file.txt to your DNAnexus project, the file will be automatically uploaded to the specified directory in your S3 or Azure blob storage. If you then rename the file on DNAnexus from file.txt to file.old.txt and upload a new file named file.txt to the project, the original file.txt that was uploaded to S3 or Azure blob storage will be overwritten. However, you will still be left with both file.txt and file.old.txt symlinks in your DNAnexus project. Trying to access the original file.old.txt symlink will likely result in a checksum error.
If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. Attempting to do so via API call will return a PermissionDenied error.