Learn to use projects to collaborate, organize your work, manage billing, and control access to files and executables.
On the DNAnexus Platform, a project is first and foremost a means of enabling users to collaborate, by providing them with shared access to specific data and tools.
Projects have a series of features designed to facilitate collaboration, help project members coordinate and organize their work, and ensure appropriate control over both data and tools.
A key function of each project is to serve as a shared storehouse of data objects used by project members as they collaborate.
Click on a project's Manage tab to see a list of all the data objects stored in the project. Within the Manage screen, you can browse and manage these objects, with the range of available actions for an object dependent on its type.
The following are four common actions you can perform on objects from within the Manage screen.
File objects can be directly downloaded from the system. To download a file:
Select its row, then click the More Actions button - the "..." icon - at the end of the row showing the file's name.
Select "Download" from the list of available actions.
Follow the instructions in the modal window that opens.
To learn more about an object:
Click the Show Info Panel button - the "i" icon - in the upper corner of the Manage screen.
Select the row showing the name of the object about which you want to know more. An info panel will open on the right, displaying a range of information about the object, including its unique ID, as well as metadata such as its owner, time of creation, size, tags, and properties.
To delete an object:
Select its row, then click on the More Actions button - the "..." icon - at the end of the row.
Select "Delete" from the list of available actions.
Follow the instructions in the modal window that opens.
Note that deletion cannot be undone.
To copy an object or objects to another project:
Select the object or objects you want to copy, by clicking the box to the left of the name of each object in the objects list.
Click the Copy button in the upper right corner of the Manage screen. A modal window will open.
Select the project to which you want to copy the object or objects, then select the location within the project to which the objects should be copied.
Click the Copy Selected button.
You can collaborate on the Platform by sharing a project with other DNAnexus users. When you share a project with a user, or with a group of users in an org, they become project members, with access at one of the levels described below. Project access can be revoked at any time by a project administrator.
To remove a user or org from a project to which you have ADMINISTER access:
On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window will open, showing a list of project members.
Find the row showing the user you want to remove from the project.
Move your mouse over that row, then click the Remove from Members button at the right end of the row.
VIEW: Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD: Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE: Gives users UPLOAD access, plus the ability to run executions directly in the project.
ADMINISTER: Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
Suppose you have a set of samples sequenced at your lab, and a collaborator who's interested in three of the samples. You can upload the data associated with those samples into a new project, then share that new project with your collaborator, granting them VIEW access.
Alternatively, suppose that you and your collaborator are working on the same tissue samples, but each of you wants to try a different sequencing process. You can create a new project and upload your sequenced data to it. Then grant your collaborator UPLOAD access to the project, allowing them to upload their data. You'll then both be able to use one another's data to perform downstream analyses.
A project admin can configure a project to allow project members to run only specific executables as root executions. The list of permitted executables is set by entering the following command, via the CLI:
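The original command is not reproduced here; a sketch of the call, with project-xxxx, applet-xxxx, and app-xxxx standing in for real IDs, would look like this:

```shell
# Restrict root executions in project-xxxx to the listed executables.
# The IDs below are placeholders; substitute your own project and executable IDs.
dx api project-xxxx update '{"allowedExecutables": ["applet-xxxx", "app-xxxx"]}'
```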
Note that by entering this command, you will overwrite any existing set of permitted executables.
To unset the list, and thus permit project members to run all available executables as root executions, enter the following command:
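A sketch of the unset call, again with a placeholder project ID:

```shell
# Clear the allowlist so that any executable may again be run as a root execution.
dx api project-xxxx update '{"allowedExecutables": null}'
```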
Note that executables that are called by a permitted executable are permitted to run, even if they are not included in the list.
Users with ADMINISTER access to a project can restrict the ability of project members to view, copy, delete, and download project data. The project-level boolean flags below provide fine-grained data access control. All data access control flags default to false, and can be viewed and modified via the CLI and the platform API. The protected, restricted, downloadRestricted, externalUploadRestricted, and containsPHI settings can also be viewed and modified on the project's Settings web screen, as described below.
protected: If set to true, only project members with ADMINISTER access to the project can delete project data. Otherwise, project members with ADMINISTER and CONTRIBUTE access can delete project data. This flag corresponds to the Delete Access policy in the project's Settings web interface screen.
restricted: If set to true:
data in this project cannot be cloned to another project
data in this project cannot be used as input to a job or an analysis in another project
any running app or applet that reads from this project cannot write results to any other project
a job running in the project will have its singleContext flag set to true, irrespective of the singleContext value supplied to /job/new and /executable-xxxx/run, and will only be allowed to use the job's DNAnexus authentication token when issuing requests to the proxied DNAnexus API endpoint within the job. Use of any other authentication token will result in an error.
This flag corresponds to the Copy Access policy in the project's Settings web interface screen.
downloadRestricted: If set to true, data in this project cannot be downloaded outside of the platform. For database objects, users would not be able to access the data in the project from outside DNAnexus. This includes read and write operations. This flag corresponds to the Download Access policy in the project's Settings web interface screen.
databaseUIViewOnly: If set to true, project members with VIEW access will have their access to project databases restricted to the Cohort Browser only. This feature is only available to customers with an Apollo license. Contact DNAnexus Sales for more information.
containsPHI: If set to true, data in this project is treated as Protected Health Information (PHI): identifiable health information that can be linked to a specific person. PHI Data Protection safeguards the confidentiality and integrity of the project data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), by imposing the additional restrictions documented in the PHI Data Protection section. This flag corresponds to the PHI Data Protection setting in the Administration section of a project's Settings web interface screen.
displayDataProtectionNotice: If set to true, ADMIN users will be able to turn on/off the ability to show a Data Protection Notice to any users accessing the selected project. If the Data Protection Notice feature is enabled for a project, all users, when first accessing the project, will be required to review and confirm their acceptance of a requirement not to egress data from the project. Note that a license is required to use this feature. Contact DNAnexus Sales for more information.
externalUploadRestricted: If set to true, external file uploads to this project (from outside the job context) are rejected. The creation of Apollo databases, tables, and inserts of data into tables is disallowed from Thrift with a non-job token. This flag corresponds to the External Upload Access policy in the project's Settings web interface screen. A license is required to use this feature. Contact DNAnexus Sales for more information.
httpsAppIsolatedBrowsing: If set to true, httpsApp access to jobs launched in this project will be wrapped in Isolated Browsing, which will restrict data transfers through the httpsApp job interface. A license is required to use this limited-access feature. Contact DNAnexus Sales for more information.
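As an illustration, the flags above might be inspected and set from the CLI along these lines (a sketch; project-xxxx is a placeholder, and these calls require the appropriate access level):

```shell
# Show two of the project's current policy flags (output is JSON).
dx api project-xxxx describe '{"fields": {"protected": true, "restricted": true}}'

# Turn on the protected flag, so only ADMINISTER members can delete project data.
dx api project-xxxx update '{"protected": true}'
```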
Protected Health Information, or PHI, is identifiable health information that can be linked to a specific person. On the DNAnexus Platform, PHI Data Protection safeguards the confidentiality and integrity of data in compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
When PHI Data Protection is enabled for a project, it is subject to the following protective restrictions:
Data in this project cannot be cloned to other projects that do not have containsPHI set to true
Any jobs that run in non-PHI projects will not be able to access any data that can only be found in PHI projects
Job email notifications sent from the project refer to objects by object ID instead of by name, and other information in the notification may be elided. If you receive such a notification, you can view the elided information by logging in to the Platform and opening the notification in the Notifications pane, accessible by clicking the "bell" icon at the far right end of the main menu.
Apollo database access is subject to additional restrictions
Once PHI Data Protection is activated for a project, it cannot be disabled.
On the DNAnexus Platform, running analyses, storing data, and egressing data are billable activities, and always take place within a specific project. Each project is associated with a billing account, to which invoices are sent, covering all billable activities carried out within the project.
If you have ADMINISTER access to a project, you can transfer project billing responsibility to another user, by doing the following:
On the project's Settings screen, scroll down to the Administration section.
Click the Transfer Billing button. A modal window will open.
Enter the email address or username of the user to whom you want to transfer billing responsibility for the project.
Click Send Transfer Request.
The user will receive an email notification of your request. To finalize the transfer, he or she must log onto the Platform and formally accept it.
If you have billable activities access in the org to which you wish to transfer the project, you can change the project's billing account to the org. To do this, navigate to the project settings page by clicking the gear icon in the project header. On the project settings page, you can then select the billing account to which the project should be billed.
If you do not have billable activities access in the org you wish to transfer the project to, you will need to transfer the project to a user who does have this access. The recipient will then be able to follow the instructions below to accept a project transfer on behalf of an org.
You can cancel a transfer of project billing responsibility, so long as it hasn't yet been formally accepted by the user in question. To do this:
Select All Projects from the Projects link in the main menu. Open the project in question. You'll see a Pending Project Ownership Transfer notification at the top of the screen.
Click the Cancel Transfer button to cancel the transfer.
When another user initiates a project transfer to you, you’ll receive a project transfer request, via both an email, and a notification accessible by clicking the Notifications button - the "bell" - at the far right end of the main menu.
Note that if you did not already have access to the project being transferred, you'll get VIEW access, and the project will appear in the list on the Projects screen.
To accept the transfer:
Open the project. You'll see a Pending Project Ownership Transfer notification in the project header.
Click the Accept Transfer button.
Select a new billing account for the project from the dropdown of eligible accounts.
If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. See this documentation for more on the auto-symlink feature.
If a project has PHI Data Protection enabled, it may only be transferred to an org billing account which also has PHI Data Protection enabled.
Ownership of sponsored projects may not be transferred without the sponsorship first being terminated.
A user or org can sponsor the cost of data storage in a project for a fixed term. During the sponsorship period, project members may copy this data to their own projects and store it there, without incurring storage charges.
On setting up the sponsorship, the sponsor sets its end date. The sponsor can change this end date at any time.
Billing responsibility for sponsored projects may not be transferred.
Sponsored projects may not be deleted unless the project sponsor first ends the sponsorship, by changing its end date to a date in the past.
For more information about sponsorship, contact DNAnexus Support.
See the Org Management page for detailed information on projects that are billed to an org.
Learn about accessing and working with projects via the CLI:
Learn about working with projects as a developer:
By understanding projects, organizations, apps, and workflows, you'll improve your understanding of the DNAnexus Platform.
Learn to build an app that you can run on the Platform.
The steps below require the DNAnexus SDK. You must download and install it if you have not done so already.
In addition to this Quickstart, there are Developer Tutorials located in the sidebar that go over helpful tips for new users as well. A few of them include:
Every DNAnexus app starts with two files:
dxapp.json: a file containing the app's metadata: its inputs and outputs, how the app will be run, etc.
a script that will be executed in the cloud when the app is run
Let's start by creating a file called dxapp.json with the following text:
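The original file contents are not reproduced here; a minimal dxapp.json consistent with the fields described below would look something like this:

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0"
  }
}
```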
Above, we've specified the name of our app (coolapp), the interpreter (python3) used to run our script, and the path (code.py) to the script we'll create next. The "version": "0" field selects the version of the Ubuntu 24.04 application execution environment that supports the python3 interpreter.
Next, we create our script in a file called code.py with the following text:
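A minimal code.py along the lines the quickstart describes (a sketch; the greeting text is illustrative, and dxpy is the DNAnexus SDK module available in the execution environment):

```python
import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(**kwargs):
    # This function runs in the cloud when the applet is executed.
    print("Hello, DNAnexus!")
    return {}


dxpy.run()
```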
That's all we need. To build the app, first log in to DNAnexus and start a project with dx login. In the directory with the two files above, run:
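The build command itself is not reproduced in this copy; a sketch:

```shell
# Build the applet from the current directory (the one containing
# dxapp.json and code.py).
dx build
```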
Now, run the app and watch the output:
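A sketch of the run command (the app name matches the dxapp.json above; --watch streams the job's logs, and -y skips the confirmation prompt):

```shell
dx run coolapp --watch -y
```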
That's it! You have just made and run your first DNAnexus applet. Applets are lightweight apps that live in your project, and are not visible in the App Library. When you typed dx run, the app ran on its own Linux instance in the cloud. You have exclusive, secure access to the CPU, storage, and memory on the instance. The DNAnexus API lets your app read and write data on the Platform, as well as launch other apps.
The app is now available in the DNAnexus web interface, as part of the project that you started. It can be configured and run in the Workflow Builder, or shared with other users by sharing the project.
Next, we'll make our app do something a bit more interesting: take in two files with FASTA-formatted DNA, run the BLAST tool to compare them, and output the result.
In the cloud, your app will run on Ubuntu Linux 24.04, where BLAST is available as an APT package, ncbi-blast+. You can request that the DNAnexus execution environment install it before your script is run by listing ncbi-blast+ in the execDepends field of your dxapp.json, like this:
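A sketch of the updated file (only the execDepends entry is new relative to the earlier dxapp.json):

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0",
    "execDepends": [
      {"name": "ncbi-blast+"}
    ]
  }
}
```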
Next, let's update code.py to run BLAST:
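A sketch of what the updated script could look like, assuming two file inputs named seq1 and seq2 supplied at run time, and an output named report (the exact quickstart code is not reproduced in this copy):

```python
import subprocess

import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(seq1, seq2):
    # Fetch the two FASTA inputs onto the worker's local disk.
    dxpy.download_dxfile(seq1, "seq1.fasta")
    dxpy.download_dxfile(seq2, "seq2.fasta")
    # Build a BLAST database from the first sequence...
    subprocess.check_call(["makeblastdb", "-in", "seq1.fasta", "-dbtype", "nucl"])
    # ...then align the second sequence against it, saving the report.
    with open("report.txt", "w") as report:
        subprocess.check_call(
            ["blastn", "-query", "seq2.fasta", "-db", "seq1.fasta"],
            stdout=report)
    # Upload the report and return it as the job's output.
    return {"report": dxpy.dxlink(dxpy.upload_local_file("report.txt"))}


dxpy.run()
```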
We're now ready to rebuild the app and test it on some real data. You can use some demo inputs available in the Demo Data project, or you can upload your own data with dx upload or via the website. If you use the Demo Data inputs, make sure the project you are running your app in is in the same region as the Demo Data project.
Rebuild the app with dx build -a, and run it like this:
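A sketch of the run command; file-xxxx and file-yyyy stand in for the IDs (or paths) of two FASTA files, such as the demo inputs mentioned above:

```shell
dx run coolapp -iseq1=file-xxxx -iseq2=file-yyyy --watch -y
```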
Once the job is done, you can examine the output with dx head report.txt, download it with dx download, or view it on the website.
Workflows are a powerful way to visually connect, configure, and run multiple apps in pipelines. To add our app to a workflow and be able to connect its inputs and/or outputs to other apps, our app will need both input and output specifications. Let's update our dxapp.json as follows:
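A sketch of the file with input and output specifications added (the field names match the seq1, seq2, and report names used elsewhere in this quickstart):

```json
{
  "name": "coolapp",
  "runSpec": {
    "interpreter": "python3",
    "file": "code.py",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0",
    "execDepends": [
      {"name": "ncbi-blast+"}
    ]
  },
  "inputSpec": [
    {"name": "seq1", "class": "file"},
    {"name": "seq2", "class": "file"}
  ],
  "outputSpec": [
    {"name": "report", "class": "file"}
  ]
}
```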
Rebuild the app with dx build -a. You can run it in the same way as before, but now we can add the applet to a workflow. Click "New Workflow" while looking at your project on the website, and click on coolapp once to add it to the workflow. You'll see inputs and outputs appear on the workflow stage, which can be connected to other stages in the workflow.
Also, if you now go back to the command line and run dx run coolapp with no input arguments, it will prompt you for the input values for seq1 and seq2.
In addition to specifying input files, the I/O specification can also be used to configure settings that we want the app to use. For example, we can configure the E-value setting and other BLAST settings with changes to code.py and dxapp.json:
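A sketch of how the main function might accept those settings, assuming evalue and blast_args are declared in inputSpec as string inputs (with a default for evalue, and blast_args as a free-form string of extra blastn arguments; both names are illustrative):

```python
import subprocess

import dxpy  # DNAnexus SDK, available in the app execution environment


@dxpy.entry_point('main')
def main(seq1, seq2, evalue="0.01", blast_args=""):
    dxpy.download_dxfile(seq1, "seq1.fasta")
    dxpy.download_dxfile(seq2, "seq2.fasta")
    subprocess.check_call(["makeblastdb", "-in", "seq1.fasta", "-dbtype", "nucl"])
    # Pass the configurable settings through to blastn.
    cmd = ["blastn", "-query", "seq2.fasta", "-db", "seq1.fasta", "-evalue", evalue]
    cmd += blast_args.split()
    with open("report.txt", "w") as report:
        subprocess.check_call(cmd, stdout=report)
    return {"report": dxpy.dxlink(dxpy.upload_local_file("report.txt"))}


dxpy.run()
```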
Rebuild the app again and add it in the workflow builder. You should now see the evalue and blast_args settings available when you click the gear button on the stage. After building and configuring a workflow, you can run the workflow itself with dx run workflowname.
One of the utilities provided in the SDK is dx-app-wizard. This tool will prompt you with a series of questions, from which it creates the basic files needed for a new app. It also gives you the option of writing your app as a bash shell script instead of Python. Just run dx-app-wizard to try it out.
For additional information and examples of how to run jobs using the CLI, Chapter 5 of this reference guide may be useful. Note that this material is not a part of the official DNAnexus documentation and is for reference only.
Access developer tutorials and examples.
Developers new to the DNAnexus platform may find it easier to learn by doing. This page contains a collection of simple tutorials and examples intended to showcase common tasks and methodologies when creating an app(let) on the DNAnexus platform. After reading through the tutorials and examples you should be able to develop app(let)s that:
Run efficiently: make use of cloud computing methodologies.
Are easy to debug: let developers understand and resolve issues.
Use the scale of the cloud: take advantage of the DNAnexus platform’s flexibility
Are easy to use: reduce support and enable collaboration.
If it’s your first time developing an app(let) be sure to read through the Getting started series. This series will introduce terms and concepts that tutorials and examples will build upon.
These tutorials are not meant to show realistic everyday examples, but rather provide a strong starting point for app(let) developers. Tutorials will showcase simple and varied implementations of the SAMtools view command on the DNAnexus platform.
Bash app(let)s use the command-line interface of dx-toolkit, our platform SDK, along with common bashisms, to create bioinformatic pipelines in the cloud.
Bash
Python app(let)s make use of dx-toolkit's Python implementation, along with common Python modules such as subprocess, to create bioinformatic pipelines in the cloud.
Python
To create a web applet, you will need access to Titan or Apollo features. Web applets can be written as either Python or bash applets; the only difference is that they launch some kind of web server and expose port 443 (for HTTPS), allowing a user to interact with that web application through a web browser.
Web_app
A bit of terminology before we start discussing parallel and distributed computing paradigms on the DNAnexus Platform.
There are many definitions and approaches to tackling the concept of parallelization and distributing workloads in the cloud (Here’s a particularly helpful Stack Exchange post on the subject). To help make our documentation easier to understand, when discussing concurrent computing paradigms we’ll refer to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in our case instances in the cloud) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus platform.
Parallel
Distributed
Learn to create a project, add members and data to the project, and run a simple workflow.
On the DNAnexus Platform, all data is stored within projects. So before you upload, browse, or analyze any data, you must create a project to house that data.
To create a project:
Select All Projects from the Projects link in the main menu. This will take you to the Projects page.
Click the New Project button in the top right corner of the Projects page. The New Project wizard will open in a modal window.
In the Project Name field, enter a name for your project.
In the More Info section, you can enter Tags or custom-defined Properties to make it easier to find this project later, and to organize it and other projects. For more on tags and properties, see this detailed explanation.
In the More Info section, you can also enter a Project Summary and/or a Project Description.
In the Billed To field of the Billing section, choose a billing account to which project charges will be billed. Follow these instructions to set up billing.
Next, in the Billing section, choose a cloud region in which project files will be stored and analyses will be run. A default region will be displayed here; it's fine to accept this default. For more on this topic, see this detailed explanation of cloud regions.
In the Access section, specify which types of users will be able to Copy Data, Delete Data, and Download Data. Default values will be shown here; it's fine to accept the defaults. For more on project access, see this detailed explanation of project access levels. For more on types of users, see this detailed rundown.
Click Create Project. You'll be taken to the Manage screen for the project. Once you've added data to your project, this is where you'll be able to see and get info on this data, and launch analyses that use it.
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
In the Access pulldown, choose the type of access the user or org will have to the project. For more on this, see this detailed explanation of project access levels.
If you don't want the user to receive an email notification on being added to the project, toggle Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5, for each user you want to add to the project.
Click Done when you're finished adding members.
To add data to your project, click the Add button in the top right corner of the project's Manage screen. You'll see three options for adding data:
Upload Data - Use your web browser to upload data from your computer. Note that if the upload takes a significant amount of time, you'll need to ensure that until it completes, you stay logged into the Platform, and keep your browser window open.
Add Data from Server - Specify a URL of an accessible server from which the file will be uploaded.
Copy Data from Project - Copy data from another project on the Platform.
To prepare for running your first analysis, as detailed in Steps 4-7, copy in data from the "Demo Data" project:
From the project's Manage screen, click the Add button, then select Copy Data from Project.
In the Copy Data from Project modal window, open the "Demo Data" project by clicking on its name.
Open the "Quickstart" folder. This folder contains two 1000 Genomes project files with the paired-end sequencing reads from chromosome 20 of exome SRR100022: SRR100022_20_1.fq.gz and SRR100022_20_2.fq.gz.
Click the box next to the Name header, to select both files.
Click Copy to copy the files to your project.
Next, install the apps you'll need to analyze the data you added to the project in Step 3:
Select Tools Library from the Tools link in the main menu.
A list of available tools will open.
Find the BWA-MEM FASTQ Read Mapper in the list and click on its name.
A tool detail page will open, showing a full range of information about the tool and how to use it.
Click the Install button in the upper left part of the screen, under the name of the tool.
In the Install App modal, click the Agree and Install button.
After the tool has been installed, you'll be returned to the tool detail page.
Use your browser's "Back" button to return to the tools list page.
Repeat Steps 3-6 to install the FreeBayes Variant Caller.
Now build a workflow using the two apps you've just installed, and configure it to use the data you added to your project in Step 3.
A workflow runs tools as part of a preconfigured series of steps. Start building your workflow by adding steps to it:
Return to your project's Manage screen. You can do this by using your browser's "Back" button, or by selecting All Projects from the Projects link in the main menu, then clicking on the name of your project in the projects list.
Click the Add button in the top right corner of the screen, then select New Workflow from the dropdown. The Workflow Builder will open.
In the Workflow Builder, give your new workflow a name. In the upper left corner of the screen, you'll see a field with a placeholder value that begins "Untitled Workflow." Click on the "pencil" icon next to this placeholder name, then enter a name of your choosing.
Click the Add a Step button. In the Select a Tool modal window, find the BWA-MEM FASTQ Read Mapper and click the "+" to the left of its name, to add it to your workflow.
Repeat Step 4 for the FreeBayes Variant Caller.
Close the Select a Tool modal window, by clicking either on the "x" in its upper right corner, or the Close button in its lower right corner. You'll return to the main Workflow Builder screen.
Set the required inputs for each step by doing the following:
To set the required inputs for the first step, start by clicking on the input labeled "Reads [array]" for the BWA-MEM FASTQ Read Mapper. In the Select Data for Reads Input modal window, click the box for the SRR100022_20_1.fq.gz file, then click the Select button.
Since the SRR100022 exome was sequenced using paired-end sequencing, you'll need to provide the right mates for the first set of reads. Click on the input labeled "Reads (right mates) [array]" for the BWA-MEM FASTQ Read Mapper. Select the SRR100022_20_2.fq.gz file.
Click on the input labeled "BWA reference genome index." At the bottom of the modal window that opens, there will be a Suggestions section that includes a link to a folder containing reference genome files. Click on this link, then open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.bwa-index.tar.gz file.
Next, set the "Sorted mappings [array]" required input for the second step. In the "Output" section for the first step, click on the blue pill labeled "Sorted mappings," then drag it to the second-step input labeled "Sorted mappings [array]."
Click on the second-step input labeled "Genome." In the modal that opens, find the reference genomes folder as in Step 3. Open the folder named H. Sapiens - GRCh37 - b37 (1000 Genomes Phase I). Select the human_g1k_v37.fa.gz file.
You're ready to launch your workflow, by doing the following:
Click the Start Analysis button at the upper right corner of the Workflow Builder.
In the modal window that opens, click the Run as Analysis button.
The BWA-MEM FASTQ Read Mapper will start executing immediately. Once it finishes, the FreeBayes Variant Caller will start, using the Read Mapper's output as an input.
Once you've launched your workflow, you'll be taken to your project's Monitor screen. Here, you'll see a list of both current and past analyses run within the project, along with key information about each run.
As your workflow runs, its status will be shown as "In Progress."
If for some reason you need to terminate the run before it completes, find its row in the list on the Monitor screen. In the last column on the right, you'll see a red button labeled Terminate. Click the button to terminate the job. Note that this can take some time. While the job is being terminated, the job's status will show as "Terminating."
When your workflow completes, output files will be placed into a new folder in your project, with the same name as the workflow. The folder is accessible by navigating to your project's Manage screen.
You can run this workflow using the full SRR100022 exome, which is available in the SRR100022 folder in the "Demo Data" project. Note that because this entails working with a much larger file, running the workflow using the exome data will take longer.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a video intro to the Platform, watch this series of short, task-oriented tutorials.
For a more in-depth video intro to the Platform, watch this DNAnexus Platform Essentials video.
Learn to build an applet that performs a basic SAMtools count with the aid of bash helper variables.
View full source code on GitHub
Download input files using the dx-download-all-inputs command. The dx-download-all-inputs command will go through all inputs and download each into a folder following the pattern /home/dnanexus/in/[VARIABLE]/[file or subfolder with files].
We create an output directory in preparation for the dx-upload-all-outputs command, covered in the Upload Results section below.
After executing the dx-download-all-inputs command, three helper variables are created to aid in scripting. For this applet, the input variable name mappings_bam with platform filename my_mappings.bam will have the following helper variables:
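A simulation of the naming convention (the values shown are what dx-download-all-inputs would produce for this input, per the path pattern above):

```shell
# Helper variables for the input mappings_bam, filename my_mappings.bam:
mappings_bam_name=my_mappings.bam                                  # the filename
mappings_bam_prefix=my_mappings                                    # filename minus its suffix
mappings_bam_path=/home/dnanexus/in/mappings_bam/my_mappings.bam   # full local path
echo "$mappings_bam_path"
```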
We use the bash helper variable mappings_bam_path to reference the location of a file after it has been downloaded using dx-download-all-inputs.
We use the dx-upload-all-outputs command to upload data to the platform and specify it as the job's output. The command expects to find file paths matching the pattern /home/dnanexus/out/[VARIABLE]/*. It will upload matching files and then associate them as the output corresponding to [VARIABLE]. In this case, the output is called counts_txt. Earlier we created the folders, and we can now place the outputs there.
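As a minimal sketch, preparing that layout for the counts_txt output might look like this (the readcount.txt filename is a hypothetical stand-in, and the dx-upload-all-outputs call only works on a worker, so it is left as a comment):

```shell
# Create the layout dx-upload-all-outputs expects: /home/dnanexus/out/[VARIABLE]/*
mkdir -p "$HOME/out/counts_txt"

# Stand-in for the real samtools count result; filename is hypothetical.
echo "1971412" > "$HOME/out/counts_txt/readcount.txt"

# On a worker, this would upload the file and associate it with the
# "counts_txt" output declared in dxapp.json:
# dx-upload-all-outputs
```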
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends field.
For additional information, see the execDepends documentation.
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file's runSpec.systemRequirements:
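A minimal sketch of how that block might look, using this tutorial's entry-point names (the instance type values are illustrative assumptions, not taken from the source):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "count_func": {"instanceType": "mem1_ssd1_x4"},
      "sum_reads": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```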
The main function slices the initial *.bam file, generating a *.bai index if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference, using the command dx-jobutil-add-output.
This entry point downloads and runs the command samtools view -c on the sliced *.bam. The generated counts_txt output file is uploaded as the entry point's job output via the command dx-jobutil-add-output.
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file as the entry point output.
View full source code on GitHub
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
The main function takes the initial *.bam, generates a *.bai index if needed, and obtains the list of regions from the *.bam file. Batches of 10 regions are sent as input to the count_func entry point using the dx-jobutil-new-job command.
Job outputs from the count_func entry point are referenced as Job Based Object References (JBORs) and used as inputs for the sum_reads entry point. Job outputs of the sum_reads entry point are used as the output of the main entry point via a JBOR reference in the dx-jobutil-add-output command.
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker; as a result, variables from other functions (e.g. main()) will not be accessible here.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point's job output counts_txt via the command dx-jobutil-add-output.
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt
files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as job output.
In the main function, the output is referenced
View full source code on GitHub
This applet performs a basic samtools view -c {bam}
command, referred to as “SAMtools count”, on the DNAnexus platform.
For bash scripts, inputs to a job execution become environment variables. The inputs from our dxapp.json
file are formatted as shown below:
The object mappings_bam, a DNAnexus link containing the file ID of that file, will be available as an environment variable in the applet's execution. Use the command dx download to download the BAM file. By default, a downloaded file keeps the filename of the object on the platform.
Here, we use the bash helper variable mappings_bam_name
. For file inputs, the DNAnexus platform creates a bash variable [VARIABLE]_name
that holds a string representing the filename of the object on the platform; because we downloaded the file using default parameters, this will be the filename of the object on this worker as well. We use another helper variable, [VARIABLE]_prefix
, the filename of the object minus any suffixes specified in the input field patterns. From the input spec above, the only pattern present is '["*.bam"]'
, so the platform will remove the trailing “.bam” and create the helper variable [VARIABLE]_prefix
for our use.
Use the dx upload command to upload data to the platform. This will upload the file into the job container, a temporary project that holds files associated with the job. When run with the --brief flag, dx upload returns just the file ID.
Note: Job containers are an integral part of the execution process; to learn more, see Containers for Execution.
The output of an applet must be declared before the applet is even built. Looking back at the dxapp.json file, we see the following:
We declared a file-type output named counts_txt. In the applet script, we must tell the system what file should be associated with the output counts_txt. On job completion, usually at the end of the script, this file will be copied from the temporary job container to the project that launched the job.
View full source code on GitHub
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, our training script in TensorFlow needs to include code that saves various data to a log directory where TensorBoard can then find the data to display it.
This example uses an example script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the tensorflow website (external link).
The applet code runs a training script, which is placed in resources/home/dnanexus/
to make it available in the current working directory of the worker, and then it starts tensorboard on port 443 (HTTPS).
We run the training script in the background to start TensorBoard immediately, which will let us see the results while training is still running. This is particularly important for long-running training scripts.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
As with all web apps, the dxapp.json
must include "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose port 443.
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
View full source code on GitHub
This applet performs a basic SAMtools count of alignments present in an input BAM.
The app must have network access to the hostname where the git repository is located. In this example, access.network
is set to:
To learn more about access
and network
fields see Execution Environment Reference.
SAMtools is cloned and built from the SAMtools GitHub repository. Let’s take a closer look at the dxapp.json
file’s runSpec.execDepends
property:
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, we specify git fetch dependencies for htslib and SAMtools. Dependencies are resolved in the order they're specified; we must list htslib before samtools (and its build_commands), because newer versions of SAMtools depend on htslib. An overview of each property in the git dependency:
package_manager
- Specifies the type of dependency and how it is resolved.
url
- Must point to the server containing the repository; in this case, a GitHub URL.
tag
/branch
- Git tag/branch to fetch.
destdir
- Directory on worker to which the git repo is cloned.
build_commands
- If needed, build commands to execute. We know our first dependency, htslib, is built when we build SAMtools; as a result, we only specify “build_commands” for the SAMtools dependency.
Note: build_commands are executed from the destdir; use cd when appropriate.
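Putting those properties together, the execDepends array might look like this (the repository URLs are the real SAMtools/htslib GitHub repositories; the tag value and build command shown here are illustrative assumptions, not copied from the source):

```json
{
  "runSpec": {
    "execDepends": [
      {
        "name": "htslib",
        "package_manager": "git",
        "url": "https://github.com/samtools/htslib.git",
        "tag": "1.9",
        "destdir": "/home/dnanexus"
      },
      {
        "name": "samtools",
        "package_manager": "git",
        "url": "https://github.com/samtools/samtools.git",
        "tag": "1.9",
        "destdir": "/home/dnanexus",
        "build_commands": "cd samtools && make HTSDIR=../htslib"
      }
    ]
  }
}
```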
Because we set "destdir": "/home/dnanexus"
in our dxapp.json
, we know the git repo is cloned to the same directory from which our script will execute. Our example directory’s structure:
Our samtools command from the app script is samtools/samtools
.
Note: We could've built samtools in a destination within our $PATH or added the binary directory to our $PATH. Keep this in mind for your app(let) development.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
The command set -e -x -o pipefail
will assist you in debugging this applet:
-e
causes the shell to immediately exit if a command returns a non-zero exit code.
-x
prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.
-o pipefail
makes a pipeline's return code the first non-zero exit code. (Typically, the return code of a pipeline is the exit code of the last command, which can create difficult-to-debug problems.)
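A quick standalone sketch of what -o pipefail changes (not part of the applet script itself):

```shell
set -e -x -o pipefail

# `false | cat` fails at the first stage but succeeds at the last stage.
# With pipefail, the pipeline reports the first non-zero exit code (1).
# The || guard keeps `set -e` from aborting the script at this point.
pipe_status=0
(false | cat) || pipe_status=$?
echo "pipeline exit code: $pipe_status"
```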
The *.bai file is an optional job input. You can check for an empty or unset var using the bash built-in test [[ -z ${var} ]], then download or create a *.bai index as needed.
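A minimal sketch of that check (mappings_bai_path is a hypothetical helper variable for the optional index input; the samtools call is shown only as a comment):

```shell
# If the optional *.bai input was not provided, its helper variable is unset/empty.
if [[ -z "${mappings_bai_path:-}" ]]; then
    need_index=true    # here the applet would run, e.g.: samtools index "$mappings_bam_path"
else
    need_index=false   # an index was supplied; use it as-is
fi
echo "need to build index: $need_index"
```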
Bash's job control system allows for easy management of multiple processes. In this example, bash commands are run in the background while the maximum number of concurrent executions is controlled in the foreground. You can place a process in the background by appending the character & to a command.
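The pattern can be sketched with stand-in work (sleep/echo) in place of the per-chromosome samtools calls:

```shell
# Launch one background job per chromosome slice, capping concurrency.
max_jobs=2
for chr in chr1 chr2 chr3 chr4; do
    # Foreground control: wait while the maximum number of jobs are running.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        sleep 0.1
    done
    ( sleep 0.2; echo "$chr counted" ) &   # stand-in for a samtools count
done
wait   # block until every background job has finished
echo "all slices processed"
```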
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs, which requires the directory structure below:
In your applet, upload all outputs by:
This is an example web app made with Dash, which in turn uses Flask underneath.
View full source code on GitHub
After configuring an app
with Dash, we start the server on port 443.
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Source dx-toolkit and log in, then run dx-app-wizard with default options.
The dash-asset directory specifies all the packages and versions we need; we take these from the Dash installation guide (https://dash.plot.ly/installation).
We put these into dash-asset/dxasset.json:
Build the asset:
Add this asset to the applet’s dxapp.json:
Now build and run the applet itself:
You can always use dx ssh job-xxxx to ssh into the worker and inspect what's going on or experiment with quick changes. Then go to that job's special URL, https://job-xxxx.dnanexus.cloud/, and see the result!
The main code is in dash-web-app/resources/home/dnanexus/my_app.py
with a local launcher script called local_test.py
in the same folder. This allows us to launch the same core code in the applet locally to quickly iterate. This is optional because you can also do all testing on the platform itself.
Install locally the same libraries listed above.
To launch the web app locally:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
This tutorial showcases packaging a precompiled binary in the resources/ directory of an app(let).
View full source code on GitHub
In this applet, the SAMtools binary was precompiled on an Ubuntu machine. A user can do this compilation on an Ubuntu machine of their own, or they can utilize the Cloud Workstation app to build and compile a binary. On the Cloud Workstation, the user can download the SAMtools source code and compile it in the worker environment, ensuring that the binary will run on future workers.
See Cloud Workstation in the App library for more information.
The SAMtools precompiled binary is placed in the <applet dir>/resources/ directory. Any files found in the resources/ directory will be packaged, uploaded to the platform, and then unpackaged into the root directory / of the worker. In our case, the resources/ dir is structured as follows:
When this applet is run on a worker, the resources/
directory will be placed in the worker’s root directory /
:
We are able to access the SAMtools command because the respective binary is visible from the default $PATH
variable. The directory /usr/bin/
is part of the $PATH
variable, so in our script we can reference the samtools command directly:
View full source code on GitHub
This applet performs a SAMtools count on an input file while minimizing disk usage. For additional details on using FIFO (named pipe) special files, run the command man fifo in your shell.
Warning: Named pipes require BOTH a stdin and a stdout, or they will block a process. In these examples, we place incomplete named pipes in background processes so the foreground script process does not block.
To approach this use case, let’s focus on what we want our applet to do:
Stream the BAM file from the platform to a worker.
As the BAM is streamed, count the number of reads present.
Output the result into a file.
Stream the result file to the platform.
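The four steps above can be sketched end-to-end with standard tools, using cat and wc -l as stand-ins for dx cat and samtools view -c (neither of which runs off-platform):

```shell
workdir=$(mktemp -d)
printf 'read1\nread2\nread3\n' > "$workdir/input.sam"   # toy stand-in for the BAM

# Named pipes for the download stream and the result stream.
mkfifo "$workdir/bam_fifo" "$workdir/count_fifo"

# Background: stream the "download" into the input FIFO
# (a real applet would use something like: dx cat "$bam_id" > bam_fifo).
cat "$workdir/input.sam" > "$workdir/bam_fifo" &

# Background: count the stream, writing into the output FIFO
# (stand-in for: samtools view -c bam_fifo > count_fifo).
wc -l < "$workdir/bam_fifo" > "$workdir/count_fifo" &

# Foreground "upload": reading the output FIFO unblocks the whole pipeline
# (a real applet would use: dx upload - < count_fifo).
count=$(cat "$workdir/count_fifo")
wait
echo "read count: $count"
```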
First, we establish a named pipe on the worker. Then, we stream to stdin of the named pipe and download the file as a stream from the platform using dx cat
.
FIFO        stdin   stdout
BAM file    YES     NO
Now that we have created our FIFO special file representing the streamed BAM, we can just call the samtools
command as we normally would. The samtools
command reading the BAM would provide our BAM FIFO file with a stdout. However, keep in mind that we want to stream the output back to the platform. We must create a named pipe representing our output file too.
FIFO          stdin   stdout
BAM file      YES     YES
output file   YES     NO
The directory structure created here (~/out/counts_txt) is required to use the dx-upload-all-outputs command in the next step. All files found in the path ~/out/<output name> will be uploaded to the corresponding <output name> specified in the dxapp.json.
So far, we've established a stream from the platform, piped the stream into a samtools command, and output the results to another named pipe. However, our background process is still blocked, since we lack a stdout for our output file. Creating an upload stream to the platform will resolve this.
We can upload as a stream to the platform using the commands dx-upload-all-outputs or dx upload -. Make sure to specify --buffer-size if needed.
FIFO          stdin   stdout
BAM file      YES     YES
output file   YES     YES
Note: Alternatively, dx upload - can upload directly from stdin; in that case, we would no longer need the directory structure required for dx-upload-all-outputs.
Warning: When uploading a file that exists on disk, dx upload knows the file size and automatically handles any cloud service provider upload chunk requirements. When uploading as a stream, the file size is not known in advance, and dx upload uses default parameters. While these defaults are fine for most use cases, you may need to specify the upload part size with the --buffer-size option.
Now that our background processes are no longer blocking the rest of the applet's execution, we simply wait in the foreground for those processes to finish.
Note: If we didn't wait, the app script running in the foreground would finish and terminate the job! We wouldn't want that.
The SAMtools compiled binary is placed directly in the <applet dir>/resources
directory. Any files found in the resources/
directory will be uploaded so that they will be present in the worker’s root directory. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so we can reference the samtools command directly in our script as samtools view -c ...
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation.
The dxpy.download_all_inputs()
function downloads all input files into the /home/dnanexus/in
directory. A folder will be created for each input and the file(s) will be downloaded to that directory. For convenience, the dxpy.download_all_inputs
function returns a dictionary containing the following keys:
<var>_path
(string): full absolute path to where the file was downloaded.
<var>_name
(string): name of the file, including extension.
<var>_prefix
(string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set our number of workers to 10
and set the workload per thread to 1
chromosome at a time. There are various ways to achieve multithreaded processing in python. For the sake of simplicity, we use multiprocessing.dummy
, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen
call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>)
function to call the helper function run_cmd
for each string in the iterable of view commands. Because we perform our multithreaded processing using subprocess.Popen
, we will not be alerted to any failed processes. We verify our closed workers in the verify_pool_status
helper function.
Important: In this example we use subprocess.Popen
to process and verify our results in verify_pool_status
. In general, it is considered good practice to use python’s built-in subprocess convenience functions. In this case, subprocess.check_call
would achieve the same goal.
Each worker returns a read count of just one region in the BAM file. We sum and output the results as the job output. We use the dx-toolkit python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our result file. For python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
This applet slices a BAM file by canonical chromosome, then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory will be uploaded so that they will be present in the root directory of the worker. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so in our script, we can reference the samtools command directly, as in samtools view -c ...
First, we download our BAM file and slice it by canonical chromosome, writing the *.bam file names to another file.
In order to split a BAM by regions, we need to have a *.bai
index. You can either create an app(let) which takes the *.bai
as an input or generate a *.bai
in the applet. In this tutorial, we generate the *.bai
in the applet, sorting the BAM if necessary.
In the previous section, we recorded the name of each sliced BAM file into a record file. Now we will perform a samtools view -c
on each slice using the record file as input.
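The fan-out can be sketched with wc -l standing in for samtools view -c on each recorded slice (the file names, contents, and counts here are toy stand-ins):

```shell
workdir=$(mktemp -d)

# Create toy "sliced" files and record their names, one per line.
for chr in chr1 chr2 chr3; do
    printf 'read\nread\n' > "$workdir/$chr.txt"
    echo "$workdir/$chr.txt"
done > "$workdir/slices.txt"

# -P 3 runs up to three counts in parallel; -I {} substitutes each file name.
xargs -P 3 -I {} sh -c 'wc -l < "{}"' < "$workdir/slices.txt" > "$workdir/counts.txt"

# Sum the per-slice counts into the final total.
total=$(awk '{s += $1} END {print s}' "$workdir/counts.txt")
echo "total reads: $total"
```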
The results file is uploaded using the standard bash process:
Upload the file to the job execution's container.
Provide the DNAnexus link as the job's output using the script dx-jobutil-add-output <output name>.
This is an example web applet that demonstrates how to build and run an R Shiny application on DNAnexus.
View full source code on GitHub
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
R Shiny needs two scripts, server.R
and ui.R
, which should be under resources/home/dnanexus/my_app/
. When a job starts based on this applet, the resources
directory is copied onto the worker, and since the ~/
path on the worker is /home/dnanexus
, that means you now have ~/my_app
with those two scripts inside.
From the main applet script, code.sh, we simply start Shiny pointing to the ~/my_app directory, serving its mini-application on port 443.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
To make your own applet with R Shiny, simply copy the source code from this example and modify server.R
and ui.R
inside resources/home/dnanexus/my_app
.
To build the asset, run the dx build_asset
command and pass shiny-asset
, i.e. the name of the directory holding dxasset.json
:
This will output a record ID record-xxxx
that you can then put into the applet’s dxapp.json
in place of the existing one:
Now build and run the applet itself:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
field.
This applet downloads all inputs at once using dxpy.download_all_inputs
:
We process in parallel using the python multiprocessing
module using a rather simple pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For more detailed overview of the multiprocessing
module, visit the python docs.
We create several helpers in our applet script to manage our workload. One helper you may have seen before is run_cmd; we use this function to manage our subprocess calls:
Before we can split our workload, we need to know what regions are present in our BAM input file. We handle this initial parsing in the parse_sam_header_for_region
function:
Once our workload is split and we’ve started processing, we wait and review the status of each Pool
worker. Then, we merge and output our results.
Note: The run_cmd
function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. We parse these outputs from our workers to determine whether the run failed or passed.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation .
Distributed python-interpreter apps use python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, we split and merge our files on basic mem1_ssd1_x2 instances and perform our more intensive processing step on a mem1_ssd1_x4 instance. Instance types can be set in the dxapp.json file's runSpec.systemRequirements:
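A sketch of the corresponding block, using this tutorial's entry-point names and the instance types described above (exact values may differ from the source):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```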
The main function scatters by region bins based on user input. If no *.bai
file is present, the applet generates an index *.bai
.
Regions bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob
function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
This entry point downloads and creates a samtools view -c
command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs()
is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed; it is exaggerated here for tutorial purposes, since you can also pass types other than file, such as int.
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, keep in mind that it doesn't do any heavy lifting or processing itself. Notice in the runSpec JSON above that we start with a lightweight instance, scale up for the processing entry point, then scale down for the gathering step.
This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.
View full source code on GitHub
Pysam is provided through a pip3 install
using the pip3 package manager in the dxapp.json
’s runSpec.execDepends
property:
The execDepends
value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, we specify pip3
as our package manager and pysam version 0.15.4
as the dependency to resolve.
The fields mappings_sorted_bam
and mappings_sorted_bai
are passed to the main function as parameters for our job. These parameters are dictionary objects with key-value pair {"$dnanexus_link": "<file>-<xxxx>"}
. We handle file objects from the platform through DXFile handles. If an index file is not supplied, then a *.bai
index will be created.
Pysam provides several methods that mimic SAMtools commands. In our applet example, we want to focus only on canonical chromosomes. Pysam’s object representation of a BAM file is pysam.AlignmentFile
.
The helper function get_chr returns the list of canonical chromosomes. Once we establish this list, we can iterate over it and perform Pysam's version of samtools view -c, pysam.AlignmentFile.count.
Our summarized counts are returned as the job output. We use the dx-toolkit
python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our tabulated result file.
Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
This is an example web app made with Dash, which in turn uses Flask underneath.
View full source code on GitHub
After configuring an app
with Dash, we start the server on port 443.
Inside the dxapp.json
, you would add "httpsApp": {"ports":[443], "shared_access": "VIEW"}
to tell the worker to expose this port.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
The rest of these instructions apply to building any applet with dependencies stored in an asset.
Source dx-toolkit
and log in, then run dx-app-wizard
with default options.
dash-asset specifies all the packages and versions we need. We take these from the Dash installation guide (https://dash.plot.ly/installation\)
We put these into dash-asset/dxasset.json:
Build the asset:
Add this asset to the applet’s dxapp.json:
Now build and run the applet itself:
You can always use dx ssh job-xxxx
to ssh into the worker and inspect what’s going on or experiment with quick changes Then go to that job’s special URL https://job-xxxx.dnanexus.cloud/ and see the result!
The main code is in dash-web-app/resources/home/dnanexus/my_app.py, with a local launcher script called local_test.py in the same folder. This allows us to launch the same core code in the applet locally to quickly iterate. This is optional, because you can also do all testing on the platform itself.
Install the same libraries listed above locally.
To launch the web app locally:
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.
There are many definitions and approaches to tackling the concept of parallelization and distributing workloads in the cloud (Here’s a particularly helpful Stack Exchange post on the subject). To help make our documentation easier to understand, when discussing concurrent computing paradigms we’ll refer to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in our case instances in the cloud) that communicate to concurrently process a workload.
Keep these definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
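To make the "parallel" definition concrete, here is a small single-machine sketch that splits a workload across threads (the chunk size and worker function are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real per-chunk work (e.g. counting reads in a region).
    return sum(chunk)

data = list(range(100))
# Split the workload into chunks and process them concurrently on one machine.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_chunk, chunks))

total = sum(partial_results)  # combine the concurrent results
```

A distributed version of the same computation would instead send each chunk to a separate cloud instance (for example, a DNAnexus subjob) and merge the partial results in a final gathering stage.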
This example demonstrates how to run TensorBoard inside a DNAnexus applet.
View full source code on GitHub
TensorBoard is a web application used to visualize and inspect what is going on inside TensorFlow training. To use TensorBoard, our training script in TensorFlow needs to include code that saves various data to a log directory where TensorBoard can then find the data to display it.
This example uses a script from the TensorBoard authors. For more guidance on how to use TensorBoard, check out the TensorFlow website (external link).
The applet code runs a training script, which is placed in resources/home/dnanexus/ to make it available in the current working directory of the worker, and then it starts TensorBoard on port 443 (HTTPS).
We run the training script in the background to start TensorBoard immediately, which will let us see the results while training is still running. This is particularly important for long-running training scripts.
Note that for all web apps, if everything is running smoothly and no errors are encountered (the ideal case), the line of code that starts the server will keep it running forever. The applet stops only when it is terminated. This also means that any lines of code after the server starts will not be executed.
As with all web apps, the dxapp.json must include "httpsApp": {"ports":[443], "shared_access": "VIEW"} to tell the worker to expose port 443.
Build the asset with the libraries first:
Take the record ID it outputs and add it to the dxapp.json for the applet.
Then build the applet.
Once it spins up, you can go to that job’s designated URL based on its job ID, https://job-xxxx.dnanexus.cloud/, to see the result.
Learn to use the dx client for command-line access to the full range of DNAnexus Platform features.
The dx command-line client is included in the DNAnexus SDK (dx-toolkit). You can use the dx client to log into the Platform; to upload, browse, and organize data; and to launch analyses.
All the projects and data referenced in this Quickstart are publicly available, so you can follow along step-by-step.
As you work, you can use this index of dx commands as a reference.
At the command line, you can also enter dx help to see a list of commands, broken down by category. To see a list of commands from a particular category, enter dx help <category>.
To learn what a particular command does, enter dx help <command>, dx <command> -h, or dx <command> --help. For example, enter dx help ls to learn about the command dx ls:
To use the command-line interface (CLI), make sure you've installed the DNAnexus Software Development Kit (SDK) available here.
To update your version of the command-line tool, you can run the command dx upgrade.
The first thing you'll need to do is to log in. If you haven't created a DNAnexus account yet, visit the website and sign up. User signup is not supported on the command line.
Your authentication token and your current project settings have now been saved in a local configuration file, and you're ready to start accessing your project.
Let's look inside some of the public projects that have already been set up. From the command line, enter the command:
By running the dx select command and picking a project, you've now done the command-line equivalent of going to the project page for Reference Genome Files: AWS US (East) (platform login required to access this link) on the website. This is a DNAnexus-sponsored project containing popular genomes for you to use when running analyses with your own data.
For more information about the dx select command, please see the Changing Your Current Project page.
Now you can list all of the data in the top-level directory of the project you've just selected by running the command dx ls. You can also see the contents of a folder by running the command dx ls <folder_name>.
You can avoid typing out the full name of the folder by typing dx ls C and then pressing <TAB>. The folder name will auto-complete from there.
You don't have to be in a project to inspect its contents. You can also look into another project, and a folder within the project, by giving the project name or ID, followed by a colon (:) and the folder path. Here, we list the contents of the publicly available project "Demo Data" using both its name and ID.
As shown above, you can use the -l flag in conjunction with dx ls to list more details about files, such as the time a file was last modified, its size (if applicable), and its full DNAnexus ID.
You can use the dx describe command to learn more about files and other objects on the platform. Given a DNAnexus object ID or name, dx describe will return detailed information about the object in question. dx describe will only return results for data objects to which you have access.
Besides describing data and projects (examples for which are shown below), you can also describe apps, jobs, and users.
Describing a File
Below, we describe the reference genome file for C. elegans located in the "Reference Genome Files: AWS US (East)" project that we've been using (which should be accessible from other regions as well). Note that you need to add a colon (:) after the project name; here, that would be Reference Genome Files\: AWS US (East):
Describing a Project
Below, we describe the publicly available Reference Genome Files project that we've been using.
Now, we'll use the command dx new project to create a new project.
The text project-xxxx denotes a placeholder for a unique, immutable project ID. For more information about object IDs, see the Entity IDs page.
You're now ready to start uploading your data and running your own analyses.
If you have a sample you would like to analyze, you can use the dx upload command, or the Upload Agent if you have installed it. For the purposes of this tutorial, you can also download the file small-celegans-sample.fastq, which contains the first 25,000 C. elegans reads from SRR070372. We will use this file again later to run through a sample analysis.
For uploading multiple or large files, we strongly recommend that you use the Upload Agent; it will compress your files and upload them in parallel over multiple HTTP connections and boasts other features such as resumable uploads.
The following command uploads the small-celegans-sample.fastq file into the current directory of the current project. The --wait flag tells dx upload to wait until it has finished uploading the data before returning the prompt and describing the result.
To take a quick look at the first few lines of the file you just uploaded, use the dx head command. By default, it prints the first 10 lines of the given file.
Let's run it on the file we just uploaded and use the -n flag to ask for the first 12 lines (the first 3 reads) of the FASTQ file.
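The arithmetic behind "12 lines = 3 reads" is that each FASTQ record is exactly four lines: a header, the sequence, a "+" separator, and the quality string. A quick sketch with a made-up three-read FASTQ:

```python
# A FASTQ record is 4 lines: @header, sequence, "+", quality string.
# The read names and sequences below are made up for illustration.
fastq_text = """@read1
ACGTACGT
+
FFFFFFFF
@read2
TTGGCCAA
+
FFFF:FFF
@read3
GGCCTTAA
+
:FFFFFFF
"""

lines = fastq_text.strip().split("\n")
# Group every 4 consecutive lines into one read record,
# which is what `dx head -n 12` implicitly relies on.
records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
```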
If you'd like to download a file from the platform, just use the dx download command. This command will use the name of the file for the filename unless you specify your own with the -o/--output flag. In the example below, we download the same C. elegans file that we uploaded previously.
Files have different available fields for metadata, such as "properties" (key-value pairs) and "tags".
For the next few steps, if you would like to follow along, you will need a C. elegans FASTQ file. We will map the reads against the ce10 genome. If you haven't already, you can download and use the following FASTQ file, which contains the first 25,000 reads from SRR070372: small-celegans-sample.fastq.
The following walkthrough is helpful if you would like to understand what all the commands do and take a look at what apps you're running. If you're just interested in converting a gzipped FASTQ file to a VCF file via BWA and the FreeBayes variant caller, you can skip ahead to the Automate It section below, where you can see all the commands necessary for running apps.
If you have not yet done so, you can upload a FASTQ file for analysis.
For more information about using the command dx upload, please see the dx upload page.
Next, use the BWA-MEM app (platform login required to access this link) to map the uploaded reads file to a reference genome.
If you don't know the command-line name of the app you would like to run, you have two options:
You can navigate to its web page from the Apps page (platform login required to access this link) on the platform. The app's page will tell you how to run it from the command line. You can find more information about the app we're running on the BWA-MEM FASTQ Read Mapper page (platform login required to access this link).
Alternatively, you can search for apps from the command line by running the command dx find apps. You will find the name of the app that you can use on the command line in the parentheses (underlined below).
Now install the app using dx install and check that it has been installed. While you do not always need to install an app to run it, you may find it useful as a bookmarking tool.
We can now run the app using dx run. We will run it without any arguments; it will then prompt us for required and then optional arguments. Note that the reference file genomeindex_targz for the C. elegans sample we are using is in a .tar.gz format and can be found in the Reference Genome folder of the region your project is in.
You can use the command dx watch to monitor jobs. The command will print out the log file of the job, including the STDOUT, STDERR, and INFO printouts.
You can also use the command dx describe job-xxxx to learn more about your job. If you don't know the job's ID, you can use the command dx find jobs to list all the jobs run in the current project, along with the user who ran them, their status, and when they began.
There are also additional options that you can use to restrict your search of previous jobs, such as by their names or when they were run.
If for some reason you need to terminate your job before it completes, use the command dx terminate.
You should now see two new files in your project: the mapped reads in a BAM file, and an index of that BAM file with a .bai extension. You can refer to the output file by name or by the job that produced it using the syntax job-xxxx:<output field>. Try it yourself with the job ID you got from calling the BWA-MEM app!
You can use the FreeBayes Variant Caller app (platform login required to access this link) to call variants on your BAM file.
This time, we won't rely on the interactive mode to enter our inputs. Instead, we will provide them directly. But first, let's look up the app's spec so we know what the inputs are called. For this, let's run the command dx run freebayes -h.
Optional inputs are shown using square brackets ([]) around the command-line syntax for each input. You'll notice that there are two required inputs that must be specified:
Sorted mappings (sorted_bams): A list of files with a .bam extension.
Genome (genome_fastagz): A reference genome in FASTA format that has been gzipped.
Running the App with a One-Liner Using a Job-Based Object Reference
It is sometimes more convenient to run apps using a single one-line command. You can do this by specifying all the necessary inputs either via the command line or in a prepared file. We will use the -i flag to specify inputs as suggested by the output of dx run freebayes -h:
sorted_bams: The output of the previous BWA step (see the Map Reads section for more information).
genome_fastagz: The ce10 genome in the Reference Genomes project.
To specify new job input using the output of a previous job, we'll use a job-based object reference (JBOR) via the job-xxxx:<output field> syntax we used earlier.
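For reference, in the API a JBOR is represented as a link dict naming the producing job and its output field; a hand-built sketch (the job ID stays a placeholder, and the field name sorted_bam is illustrative):

```python
# Hypothetical sketch of the dict form of a job-based object reference (JBOR).
# On the command line this is written job-xxxx:<output field>; in JSON input
# it expands to a $dnanexus_link naming the job and the output field.
def jbor(job_id, field):
    return {"$dnanexus_link": {"job": job_id, "field": field}}

run_input = {
    # Field name "sorted_bam" is illustrative; check the producing
    # app's outputSpec for the real name.
    "sorted_bams": [jbor("job-xxxx", "sorted_bam")],
}
```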
Replace the job ID below with that generated by the BWA app you ran earlier. The -y flag skips the input confirmation.
Automatically Running a Command After a Job Finishes
You can use the command dx wait to wait for a job to finish. If we run the following command right after running the FreeBayes app, it will show the recent jobs only after the job has finished, as shown in the example below.
Congratulations! You have now called variants on a reads sample, and you did it all on the command line. Now let's look at how you can automate this process.
The beauty of the CLI is the ability to automate processes. In fact, we can automate everything we just did. The following script assumes that you've already logged in; it is hardcoded to use the ce10 genome and takes a local gzipped FASTQ file as its command-line argument.
You're now ready to start scripting using dx. As shown in some of the examples above, the --brief flag can come in handy for scripting. A list of all dx commands and flags is on the Index of dx Commands page.
For more detailed information about running apps and applets from the command line, see the Running Apps and Applets page.
For a comprehensive guide to the DNAnexus SDK, see the SDK documentation.
Want to start writing your own apps? Check out the Developer Portal for some useful tutorials.
Learn to upload data, create a project, run an analysis, and visualize results.
See these Key Concepts pages to learn more about how the DNAnexus Platform works, and how to get the most from it:
Get up and running quickly using the Platform via both its user interface (UI) and its command-line interface (CLI):
Learn the basics of developing for the Platform:
Get to know features you’ll use every day, in these short, task-oriented tutorials.
See these Key Concepts pages for more in-depth treatments of topics that are covered briefly here:
For a step-by-step written tutorial to using the Platform via its UI, see this User Interface Quickstart.
For a step-by-step written tutorial to using the Platform via its CLI, see this Command Line Quickstart.
For a more in-depth video intro to the Platform, watch this DNAnexus Platform Essentials video.
As a developer, you may be interested in the following:
As a bioinformatician, you may be interested in our SAIGE GWAS walkthrough, and other content grouped in our Science Corner.
Get to know features you’ll use every day, in a series of short, task-oriented tutorials.
Learn to access and use the Platform via both its command-line interface and its user interface.
Learn to manage data, users, and work on the Platform, via its API. Create and share reusable pipelines, applications for analyzing data, custom viewers, and workflows.
This section is targeted towards organizational leads who have the permission to enable others to use DNAnexus for scientific purposes. Operations include managing organization permissions, billing, and authentication to the platform.
Download, install, and get started using the DNAnexus Platform SDK, the DNAnexus upload and download agents, and dxCompiler.
Get details on new features, changes, and bug fixes for each Platform and toolkit release.
Every analysis in DNAnexus is run using apps. Apps can be linked together to create workflows. Learn the basics of using both.
The Tools Library provides a list of available apps and workflows. To see this list, select Tools Library from the Tools entry in the main Platform menu.
To find the tool you're looking for in the Tools Library, you can use search filters. Filtering enables you to find tools with a specific name, in a specific category, or of a specific type:
To see what inputs a tool requires, and what outputs it generates, select that tool's row in the list. The row will be highlighted in blue; the tool's inputs and outputs will be displayed in a pane to the right of the list:
To make sure you can find a tool later, "pin" it to the top of the list. Click the "..." icon at the far right end of the row showing the tool's name and key details about it. Then click Add Pin:
To learn more about a tool, click on its name in the list. The tool's detail page will open, showing a wide range of info, including guidance in how to use it, version history, pricing, and more:
You can quickly launch the latest version of any given tool from the Tools Library page. Or, navigate into the Details page of the tool and click the Run button:
From within a project, navigate to the Manage pane, then click the Start Analysis button.
A dialogue window will open, showing a list of tools. These will include the same tools as shown in the Tools Library, as well as workflows and applets specifically available in the current project. Select the tool you want to run, then click Run Selected:
Workflows and applets can be launched directly from where they reside within a project. Select the workflow or applet in their folder location, and click Run.
Confirm details of the tool you are about to run. Note that selection of a project location is required for any tool to be run. You will need at minimum Contributor access level to the project.
The tool may require specific inputs to be filled in before starting the run. You can quickly identify the required inputs by looking for the highlighted areas that are marked Inputs Required on the page.
You can access help information about each input or output by inspecting the label of each item. If a detailed README is provided for the executable, you can click the View Documentation icon to open the app or workflow info pane.
To configure instance type settings for a given tool or stage, click the Instance Type icon located on the top-right corner of the stage.
To configure output location and view info regarding output items, go to the Outputs tab under each stage. For workflows, output location can be specified separately for each stage.
The I/O graph provides an overview of the input/output structure of the tool. The graph is available for any tool and can be accessed via the Actions/Workflow Actions menu.
Once all required inputs have been configured, the page will indicate that the run is ready to start. Click on Start Analysis to proceed to the final step.
As the last step before launching the tool, you can review and confirm various runtime settings, including execution name, output location, priority, job rank, spending limit, etc. You can also review and modify instance type settings before starting the run.
Once you have confirmed final details, click Launch Analysis to start the run.
Batch run allows users to run the same app or workflow multiple times, with specific inputs varying between runs.
To enable batch run, start from any input that you wish to specify for batch run, and open its I/O Options menu on the right hand side. From the list of options available, select Enable Batch Run.
Input fields with batch run enabled will be highlighted with a Batch label. Click any of the batch enabled input fields to enter the batch run configuration page.
Batch run support by input class:
Files and other data objects: Yes
Files and other data objects (array): Partially supported; can accept entry of a single-value array
String: Yes
Integer: Yes
Float: Yes
Boolean: Yes
String (array): No
Integer (array): No
Float (array): No
Boolean (array): No
Hash: No
The batch run configuration page allows specifying inputs across multiple runs. Interact with each table cell to fill in desired values for any run or field.
As with configuring inputs for non-batch runs, you will need to fill in all the required input fields to proceed to the next step. Optional inputs, or required inputs with a predefined default value, can be left empty.
Once all required fields (for both batch inputs and non-batch inputs) have been configured, you can proceed to start the run via the Start Analysis button.
Once you've finished setting up your tool, start your analysis by clicking the Start Analysis button. Follow these instructions to monitor the job as it runs.
Learn in depth about running apps and workflows, leveraging advanced techniques like Smart Reuse.
Learn how to build a simple app.
Learn more about building apps using Bash or Python.
Learn in depth about building and deploying apps, including Spark apps.
Learn in depth about importing, building, and running workflows.
Learn about organizations, which associate users, projects, and resources with one another, enabling fluid collaboration, and simplifying the management of access, sharing, and billing.
An organization (or "org") is a DNAnexus entity used to manage a group of users. Use orgs to group users, projects, and other resources together, in a way that models real-world collaborative structures.
In its simplest form, an org can be thought of as a group of users working on the same project. An org can be used to efficiently share projects and data with multiple users, and, if necessary, to revoke access.
Org admins can manage org membership, configure access and projects associated with the org, and oversee billing. All storage and compute costs associated with an org are invoiced to a single billing account designated by the org admin. You can create an org that is associated with a billing account by contacting sales@dnanexus.com.
Orgs are referenced on the DNAnexus Platform by a unique org ID (e.g., org-dnanexus). Org IDs are used when sharing projects with an org in the Platform user interface or when manipulating the org in the CLI.
Users may have one of two membership levels in an org: ADMIN or MEMBER.
An ADMIN-level user is granted all possible access in the org and may perform org administrative functions (e.g. adding/removing users or modifying org policies). A MEMBER-level user, on the other hand, is granted only a subset of the possible org accesses in the org and has no administrative power in the org.
A user with MEMBER level can be configured with a subset of the following org accesses. These access levels determine which actions each user can perform in an org.
Billable activities access: If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org. Options: Allowed or Not Allowed.
Shared apps access: If allowed, the org member will have access to view and run apps for which the org has been added as an "authorized user". Options: Allowed or Not Allowed.
Shared projects access: The maximum access level a user can have in projects shared with an org. For example, if this is set to UPLOAD for an org member, the member will have at most UPLOAD access in projects shared with the org, even if the org was given CONTRIBUTE or ADMINISTER access to the project. Options: NONE, VIEW, UPLOAD, CONTRIBUTE, or ADMINISTER.
These accesses allow you to have fine-grained control over what members of your orgs can do in the context of your org.
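The "at most" rule for shared projects access behaves like taking the weaker of two levels on an ordered scale. A small sketch of that logic (the function name and ordering list are illustrative, not a DNAnexus API):

```python
# Hypothetical sketch of how a member's shared-projects cap limits the
# access an org-level project grant confers. Levels are ordered weakest
# to strongest, matching the options listed above.
LEVELS = ["NONE", "VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

def effective_access(member_cap, org_project_access):
    """Effective access is the weaker of the member's cap and the org's grant."""
    return LEVELS[min(LEVELS.index(member_cap), LEVELS.index(org_project_access))]
```

For the example in the text, a member capped at UPLOAD in a project shared with the org at ADMINISTER ends up with UPLOAD.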
Org admins are granted all possible access in the org. More specifically, org admins receive the following set of accesses:
Billable activities access: Allowed
Shared apps access: Allowed
Shared projects access: ADMINISTER
In addition to the access listed above, org admins have the following additional abilities:
Org admins can list and view metadata for all org projects (projects billed to the org) even if the project is not explicitly shared with them. They can also give themselves access to any project billed to the org. For example, when a member creates a new project, Project-P, and bills it to the org, they are the only user with access to Project-P. The org admin will be able to see all projects billed to the org, including Project-P. The org admin can also invite themselves to Project-P at any time to get access to objects and jobs in the project.
Org admins can add themselves as developers to any app billed to the org. For example, when a member creates a new app, App-A, billed to the org, they are the only developer for App-A; however, any org admin may add themselves as a developer at any time.
In the diagram below, there are 3 examples of how organizations can be structured.
ORG-1
The simplest example, ORG-1, is represented by the leftmost circle. In this situation, ORG-1 is a billable org that has 3 members who share one billing account, so all 5 projects created by the members of ORG-1 are billed to that org. There is one admin who manages ORG-1, represented here as user A.
ORG-2 and ORG-3
The second example shows ORG-2 and ORG-3, demonstrating a more complicated organizational setup. Here users are grouped into two different billable orgs, with some users belonging to both orgs and others belonging to only one.
In this case, ORG-2 and ORG-3 bill their work against separate billing accounts. This separation of orgs can represent two different groups in one company working in different departments, each with their own budgets, two different labs that work closely together, or any other scenario in which two collaborators would share work.
ORG-2 has 5 members, 4 projects, and is managed by one org admin (user G). ORG-3 has 5 members and 3 projects, but is managed by 2 admins (users G and I).
In this example, admin G and member H belong to both ORG-2 and ORG-3. They can create new projects billed to either org, depending on the project they're working on. Admin G can manage users and projects in both ORG-2 and ORG-3.
You can create a non-billable org as an alias for a group of users. For example, you have a group of users who all need access to a shared dataset. You can make an org which represents all the users who need access to the dataset, for example an org named org-dataset_access, and share all the projects and apps related to the dataset with that org. All members of the org will have at least VIEW "shared project access" and "shared app access" so that they will all be given permission to view the dataset. If a member no longer needs access to the dataset, they can be removed from the org, and will then no longer have access to any projects or apps shared with org-dataset_access.
You can request sales@dnanexus.com to create a billable org where only one member, the org admin, can create new org projects. All other org members will not be granted the "billable activities access", and so cannot create new org projects. The org admin can then assign each org member a "shared projects access" (VIEW, UPLOAD, CONTRIBUTE, ADMINISTER) and share every org project with the org with ADMINISTER access. The members' permissions to the projects will be restricted by their respective "shared project access."
For example, in a given group, bioinformaticians can be given CONTRIBUTE access to the projects shared with the entire org, so they can run analyses and produce new data in any of the org projects. However, the sequencing center technicians only need UPLOAD permissions to add new data to the projects. Analysts in the group will only be given VIEW access to projects shared with the org. When you need to add a new member to your group and give them access to the projects shared with the org, you simply need to add them to the org as a new member and assign them the appropriate permission levels.
This membership structure allows the org admin to control the number of projects billed to the org. The org admin can also quickly share new projects with their org and revoke permissions from users who have been removed from the org.
You can request sales@dnanexus.com to create a billable org where users work independently and bill all of their activities to the org billing account (as specified by the org admin). All org members are granted "billable activities access." The org members also need to share common resources (e.g. incoming samples or reference datasets).
In this case, all members should be granted the "shared apps access" and assigned VIEW as their "shared projects access." The reference datasets that need to be shared with the org are stored in an "Org Resources" project that is shared with the org, which is granted VIEW access. The org can also have best-practice executables built as apps on the DNAnexus system.
The apps can be shared with the org so all members of the org have access to these (potentially proprietary) executables. If any user leaves your company or institution, their access to reference datasets and executables is revoked simply by removing them from the org.
In general, it is possible to apply many different schemas to orgs as they were designed for many different real-life collaborative structures. If you have a type of collaboration you would like to support, contact support@dnanexus.com for more information about how orgs can work for you.
If you are an admin of an org, you can access the org admin tools from the Org Admin link in the header of the DNAnexus Platform. From here, you can quickly navigate to the list of orgs you administer via All Orgs, or to a specific org.
The Organizations list shows you the list of all orgs to which you have admin access. On this page, you can quickly see all of your orgs, the org IDs, their Project Transfer setting, and the Member List Visibility setting.
Within an org, the Settings tab allows you to view and edit basic information, billing, and policies for your org.
You can find the org overview on the Settings tab. From here, you can:
View and edit the organization name (this is how the org is referred to in the Platform user interface and in email notifications).
View the organization ID, the unique ID used to reference a particular org on the CLI (e.g. org-demo_org).
View the number of org members, org projects, and org apps.
View the list of organization admins.
Within an org page, the Members tab allows you to view all the members of the org, invite new members, remove existing members, and update existing members' permission levels.
From the Members tab, you can quickly see the names and access levels for all org members. For more information about org membership, see the organization member guide.
To add an existing DNAnexus user to your org, you can use the + Invite New Member button from the org's Members tab. This opens a screen where you can enter the user's username (e.g., smithj) or user ID (e.g., user-smithj). Then you can configure the user's access level in the org.
If you add a member to the org with billable activities access set to billing allowed, they will have the ability to create new projects billed to the org.
However, adding the member will not change their default billing account. If the user wishes to use the org as their default billing account, they will have to set their own default billing account.
Additionally, if the member has any pre-existing projects that are not billed to the org, the user will need to transfer the project to an org if they wish to have the project billed to the org.
The user will receive an email notification informing them that they have been added to the organization.
Org admins have the ability to create new DNAnexus accounts on behalf of the org, provided the org is covered by a license that enables account provisioning. The user will then receive an email with instructions to activate their account and set their password.
If this feature has already been turned on for an org you administer, you will see an option to Create New User when you go to invite a new member.
Here you can specify a username (e.g. alice
or smithj
), the new user's name, and their email address. The system will automatically create a new user account for the given email address and add them as a member in the org.
If you create a new user and set their Billable Activities Access to Billing Allowed, we recommend that you set the org as the user's default billing account. This option is available as a checkbox under the Billable Activities Access dropdown.
From the org Members tab, you can edit the permissions for one or multiple members of the org. The option to Edit Access appears when you have one or more org members selected in the table.
When you edit multiple members, you can change a single access level while leaving the others unchanged.
From the org Members tab, you can remove one or more members from the org. The option to Remove appears when you have one or more org members selected on the Members tab.
Removing a member revokes the user's access to all projects and apps billed to or shared with the org.
The org's Projects tab lets you see the list of all projects billed to the org. This list includes all projects in which you have VIEW or higher permissions, as well as projects billed to the org in which you have no permissions (i.e., projects of which you are not a member).
You can view all project metadata (e.g. the list of members, data usage, creation date), as well as some other optional columns (e.g. project creator). To enable the optional columns, select the column from the dropdown menu to the right of the column names.
In addition to viewing the list of projects, org admins can give themselves access to any project billed to the org. If you select a project in which you are not a member, you will still be able to navigate into the project's settings page. On the project settings page, you can click a button to grant yourself ADMINISTER permissions to the project.
You can also grant yourself ADMINISTER permissions if you are currently a member of a project billed to your org but you only have VIEW, CONTRIBUTE, or UPLOAD permissions.
To access your org's billing information:
From the global menu shown at the top of the screen, click Org Admin and then click the name of the org you want to view.
Click Billing from the org details screen to view billing information.
To set up or update the billing information for an org you administer, contact billing@dnanexus.com.
If you are an org admin, you can set or modify a spending limit for the org's usage charges. To do this:
Click on Org Admin from the global menu bar and then click on the name of the org for which you'd like to set or modify a spending limit.
Once on your org page, click on the Billing tab.
Click the Set / Update Spending Limit link in the Funds Left box to contact DNAnexus Support, requesting a spending limit change. Doing this only submits your request. DNAnexus Support may follow up with you via email with questions about the change, before approving the request.
The Funds Left section shows how much of the org's spending limit remains. If your org does not have a spending limit, spending is unlimited, which shows up as “N/A.”
The Per-Project Usage Report and Root Execution Stats Report are monthly reports that provide a wide range of detail on charges incurred by org members. See this documentation for more information.
Org admins can also set configurable policies for the org. Org policies dictate many different behaviors when the org interacts with other entities. The following policies exist:
Policy
Description
Options
Membership List Visibility
Dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org. If PUBLIC, any DNAnexus user can view the list of org members.
[ADMIN], [MEMBER], or [PUBLIC]
Project Transfer
Dictates the minimum org membership level allowed to change the billing account of an org project (via the UI or project transfer).
[ADMIN] or [MEMBER]
Project Sharing
Dictates the minimum org membership level allowed for a user to invite that org to a project.
[ADMIN] or [MEMBER]
DNAnexus recommends, as a starting point, restricting both the Membership List Visibility policy and the Project Transfer policy to ADMIN. This ensures that only org admins can see the list of members and their access within the org, and that org projects always remain under the org's control.
You can update org policies for your org in the Policies and Administration section of the org Settings tab. Here, you can both change the membership list visibility and restrict project transfer policies for the org and contact DNAnexus support to enable PHI data policies for org projects.
Billable activities access is an access level that can be granted to org members. If allowed, the org member can create new projects and apps billed to the org, download data (incurring data egress charges against the org), and set their own default billing account to that of the org.
Billable org is an org that has confirmed billing information or a non-negative spending limit remaining. Users with billable activities access in a billable org are allowed to create new projects billed to the org. See the definition of a non-billable org for an org that is used for sharing.
Billed to an org (app context) sets the billing account of an app to an org. Apps require storage for their resources and assets, and the app's billing account is billed for that storage. The billing account of an app does not pay for invocations of the app unless the app is run in a project billed to the org.
Billed to an org (project context) sets the billing account of a project to an org. The org is invoiced the storage for all data stored in the project as well as compute charges for all jobs and analyses run in the project.
Membership level describes one of two membership levels available to users in an org, ADMIN or MEMBER. Note that ADMINISTER is a type of access level.
Membership list visibility policy dictates the minimum org membership level required to view the list of org members, their membership level, and access within the org.
Non-billable org describes an org only used as an alias for a group of users. Non-billable orgs do not have billing information and will not have any org projects or org apps. Any user can share a project with a non-billable org.
Org access is granted to a user to determine which actions the user can perform in an org.
Org admin describes administrators of an org who can manage org membership, configure access and projects associated with the org, and oversee billing.
Org app is an app billed to an org.
Org ID is the unique ID used to reference a particular org on the DNAnexus Platform (e.g. org-dnanexus
).
Org member is a DNAnexus user associated with an org. Org members can have variable membership levels in an org which define their role in the org. Admins are a type of org member as well.
Org policy is a configurable policy for the org. Org policies dictate many different behaviors when the org interacts with other entities.
Org project describes a project billed to an org.
Org (or "organization") is a DNAnexus entity that is used to associate a group of users. Orgs are referenced on the DNAnexus Platform by a unique org ID.
Project transfer policy dictates the minimum org membership level allowed to change the billing account of an org project.
Share with an org means to give the members of an org access to a project or app via giving the org access to the project or adding the org as an "authorized user" of an app.
Shared apps access is an org access level that can be granted to org members. If allowed, the org member can view and run apps in which the org has been added as an "authorized user."
Shared projects access is an org access level that can be granted to org members: the maximum access level a user can have in projects shared with an org.
Learn in depth about setting up and managing orgs as an administrator.
Learn about what you can do as an org member.
Learn about creating and managing orgs as a developer, via the DNAnexus API.
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
file’s runSpec.execDepends
.
For additional information, see execDepends
.
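As a sketch, the relevant fragment of the dxapp.json might look like the following (only the execDepends portion is shown; by default, packages listed this way are installed with Apt-Get on the worker):

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
```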
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Entry points are executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json
file’s runSpec.systemRequirements
:
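A sketch of what this might look like, with illustrative instance-type choices for each entry point (the key "*" can also be used to set a default for all entry points):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x4"},
      "count_func": {"instanceType": "mem1_ssd1_x4"},
      "sum_reads": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```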
The main function slices the initial *.bam
file and generates an index *.bai
if needed. The input *.bam
is then sliced into smaller *.bam
files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via JBOR reference using the command dx-jobutil-add-output
.
This entry point downloads and runs the command samtools view -c
on the sliced *.bam
. The generated counts_txt
output file is uploaded as the entry point’s job output via the command dx-jobutil-add-output
.
The main entry point triggers this sub job, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them.
This function returns read_sum_file
as the entry point output.
Sliced *.bam
files are uploaded and their file IDs are passed to the count_func entry point.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation .
Distributed python-interpreter apps use python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Entry points are executed on a new worker with their own system requirements. In this example, we split and merge our files on basic mem1_ssd1_x2 instances and perform our own, more intensive, processing step on a mem1_ssd1_x4 instance. Instance type can be set in the dxapp.json runSpec.systemRequirements
:
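A sketch of the systemRequirements fragment matching the description above — the split (main) and merge (combine_files) steps on mem1_ssd1_x2 instances, and the processing step (samtoolscount_bam) on a mem1_ssd1_x4 instance:

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```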
The main function scatters by region bins based on user input. If no *.bai
file is present, the applet generates an index *.bai
.
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob
function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
This entry point downloads and creates a samtools view -c
command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs()
is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}
, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed, deliberately exaggerating the pattern for tutorial purposes. You can also pass types other than file, such as int.
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, keep in mind the main entry point doesn’t do any heavy lifting or processing. Notice in the .runSpec
json above we start with a lightweight instance, scale up for the processing entry point, then finally scale down for the gathering step.
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts must be structured to parallelize the work. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
For additional information, please refer to the execDepends
documentation.
The dxpy.download_all_inputs()
function downloads all input files into the /home/dnanexus/in
directory. A folder will be created for each input and the file(s) will be downloaded to that directory. For convenience, the dxpy.download_all_inputs
function returns a dictionary containing the following keys:
<var>_path
(string): full absolute path to where the file was downloaded.
<var>_name
(string): name of the file, including extension.
<var>_prefix
(string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set our number of workers to 10
and set the workload per thread to 1
chromosome at a time. There are various ways to achieve multithreaded processing in python. For the sake of simplicity, we use multiprocessing.dummy
, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen
call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>)
function to call the helper function run_cmd
for each string in the iterable of view commands. Because we perform our multithreaded processing using subprocess.Popen
, we will not be alerted to any failed processes. We verify our closed workers in the verify_pool_status
helper function.
Important: In this example we use subprocess.Popen
to process and verify our results in verify_pool_status
. In general, it is considered good practice to use python’s built-in subprocess convenience functions. In this case, subprocess.check_call
would achieve the same goal.
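A minimal sketch of this pattern, with echo standing in for the samtools view -c calls so the sketch runs even where samtools is not installed (the run_cmd and verify_pool_status names follow the applet script described above; the commands and counts are illustrative):

```python
import subprocess
from multiprocessing.dummy import Pool  # thread-based wrapper around threading

def run_cmd(cmd):
    # Launch cmd in a subprocess and capture stdout, stderr, and exit code.
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

def verify_pool_status(results):
    # subprocess.Popen does not raise on failure, so check every exit code.
    if any(code != 0 for _, _, code in results):
        raise RuntimeError("one or more workers failed")

# echo stands in for `samtools view -c <bam> <region>` here.
view_cmds = ["echo %d" % n for n in (11, 22, 33)]
pool = Pool(3)
results = pool.map(run_cmd, view_cmds)  # map preserves input order
pool.close()
pool.join()
verify_pool_status(results)
counts = [int(out.decode().strip()) for out, _, _ in results]
```

In a real applet the iterable of commands would be built from the BAM's regions, and the summed counts would become the job output.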
Each worker returns a read count of just one region in the BAM file. We sum and output the results as the job output. We use the dx-toolkit python SDK’s dxpy.upload_local_file
function to upload and generate a DXFile corresponding to our result file. For python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json
and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink
function to generate the appropriate DXLink value.
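A sketch of just the output-dictionary shape (the file ID is a hypothetical placeholder for the DXFile that dxpy.upload_local_file would return inside a real job; dxpy.dxlink produces the same $dnanexus_link hash shown here):

```python
# Hypothetical ID standing in for the result of
# dxpy.upload_local_file("read_count.txt") inside a running job.
file_id = "file-xxxx"

# dxpy.dxlink(file_id) yields a hash of this form.
counts_txt_link = {"$dnanexus_link": file_id}

# Python applet outputs are a dict keyed by the output names
# declared in the dxapp.json output spec.
output = {"counts_txt": counts_txt_link}
```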
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts must be structured to parallelize the work. This applet tutorial will:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
field.
This applet downloads all inputs at once using dxpy.download_all_inputs
:
We process in parallel using the python multiprocessing
module using a rather simple pattern shown below:
This convenient pattern allows you to quickly orchestrate jobs on a worker. For more detailed overview of the multiprocessing
module, visit the python docs.
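The pattern boils down to: create a pool, map the work over it, close the pool, and join. A minimal sketch, using the thread-backed ThreadPool from the multiprocessing package so it runs anywhere (multiprocessing.Pool follows the same create/map/close/join pattern; process_region is an illustrative stand-in for the real per-region work):

```python
from multiprocessing.pool import ThreadPool

def process_region(region):
    # Stand-in for the per-region work (e.g. a samtools count).
    return len(region)

regions = ["chr1", "chr2", "chrX"]
pool = ThreadPool(2)                          # create the pool
results = pool.map(process_region, regions)   # scatter the work
pool.close()                                  # no more tasks
pool.join()                                   # wait for all workers
```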
We create several helpers in our applet script to manage our workload. One helper you may have seen before is run_cmd
; we use this function to manage our subprocess calls:
Before we can split our workload, we need to know what regions are present in our BAM input file. We handle this initial parsing in the parse_sam_header_for_region
function:
Once our workload is split and we’ve started processing, we wait and review the status of each Pool
worker. Then, we merge and output our results.
Note: The run_cmd
function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. We parse these outputs from our workers to determine whether the run failed or passed.
This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources
directory. Any files found in the resources/
directory will be uploaded so that they will be present in the root directory of the worker. In our case:
When this applet is run on a worker, the resources/
folder will be placed in the worker’s root directory /
:
/usr/bin
is part of the $PATH
variable, so in our script, we can reference the samtools command directly, as in samtools view -c ...
First, we download our BAM file and slice it by canonical chromosome, writing the *.bam
file names to another file.
In order to split a BAM by regions, we need to have a *.bai
index. You can either create an app(let) which takes the *.bai
as an input or generate a *.bai
in the applet. In this tutorial, we generate the *.bai
in the applet, sorting the BAM if necessary.
In the previous section, we recorded the name of each sliced BAM file into a record file. Now we will perform a samtools view -c
on each slice using the record file as input.
The results file is uploaded using the standard bash process:
Upload a file to the job execution’s container.
Provide the DNAnexus link as a job’s output using the script dx-jobutil-add-output <output name>
In this section, learn to access and use the Platform via both its command-line interface (CLI) and its user interface (UI).
To use the CLI, you'll need to download and install the dx
command-line client.
If you're not familiar with the dx
client, read the Command-Line Quickstart.
This section provides detailed instructions on using the dx
client to perform such common actions as logging in; selecting projects; listing, copying, moving, and deleting objects; and launching and monitoring jobs. Details on using the UI are included throughout, as applicable.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json
runSpec.execDepends
.
The command set -e -x -o pipefail
will assist you in debugging this applet:
-e
causes the shell to immediately exit if a command returns a non-zero exit code.
-x
prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.
-o pipefail
makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)
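A quick demonstration of the pipefail behavior (assuming bash is available; false | true is a stand-in for a failing command piped into a succeeding one):

```shell
# Without pipefail, a pipeline's exit code is that of its last
# command; with pipefail, it is the first non-zero exit code.
bash -c 'false | true'; echo "default: $?"
bash -c 'set -e -o pipefail; false | true'; echo "pipefail: $?"
```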
The *.bai
file was an optional job input. You can check for an empty or unset var
using the bash built-in test [[ -z ${var} ]]
. Then, you can download or create a *.bai
index as needed.
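A sketch of that check (index_file is a hypothetical variable name for the optional input; the :- expansion keeps the test safe even under set -u when the variable is unset):

```shell
index_file=""
if [ -z "${index_file:-}" ]; then
  echo "No *.bai provided; index will be generated"
else
  echo "Using provided *.bai"
fi
```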
Bash’s job control system allows for easy management of multiple processes. In this example, you can run bash commands in the background as you control maximum job executions in the foreground. Place processes in the background using the character &
after a command.
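A minimal sketch of this job-control pattern — launch several workers in the background, then block with the wait built-in until all of them finish (the echo-to-file workers are illustrative stand-ins for per-chromosome counting):

```shell
for region in chr1 chr2 chr3; do
  # Each background worker stands in for a per-region samtools count.
  echo "counted ${region}" > "count_${region}.txt" &
done
wait  # block until every background job has exited
echo "all workers finished"
```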
Once the input bam has been sliced, counted, and summed, the output counts_txt
is uploaded using the command dx-upload-all-outputs
. dx-upload-all-outputs requires the directory structure below:
In your applet, upload all outputs by:
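A sketch of preparing that layout (the file name and contents are illustrative; dx-upload-all-outputs itself is left as a comment since it only runs inside a job):

```shell
# dx-upload-all-outputs uploads everything under out/<output name>/,
# so the counts file goes in a folder named after the job output.
mkdir -p out/counts_txt
echo "1234567" > out/counts_txt/read_counts.txt
# Inside the job you would then run:
#   dx-upload-all-outputs
```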
Learn how to log into and out of the DNAnexus Platform, via both the user interface and the command-line interface. Learn how to use tokens to log in, and how to set up two-factor authentication.
Logging In and Out via the User Interface
To log in via the user interface (UI), open the login page and enter your username and password.
To log out via the UI, click on your avatar at the far right end of the main Platform menu, then select Sign Out:
To log in via the command-line interface (CLI), make sure you've installed the dx
command-line client. From the CLI, enter the command dx login
.
Next, enter your username, or, if you've logged in before on the same computer and your username is displayed, hit Return to confirm that you want to use it to log in. Then enter your password.
See below for directions on using a token to log in.
See the Index of dx Commands page for detail on optional arguments that can be used with dx login
.
When using the CLI, log out by entering the command dx logout
.
See the Index of dx Commands page for detail on optional arguments that can be used with dx logout
.
After fifteen minutes of inactivity, you will be automatically logged out, unless you logged in using an API token that specifies the length of time you can stay logged in, or are part of an org with a custom autoLogoutAfter policy.
You can log in via the CLI, and stay logged in for a fixed length of time, by using an API token, also called an authentication token.
Be very careful about giving a DNAnexus Platform token to someone else. Anyone in possession of that token can use it to access the Platform and impersonate you as a user. They will have the same access level as you for any projects to which the token has access, potentially allowing them to run jobs that incur charges to your account.
To generate a token, click on your avatar at the top right corner of the main Platform menu, then select My Profile from the dropdown menu.
Next, click on the API Tokens tab. Then click the New Token button:
The New Token form will open in a modal window:
While filling out the form, note the following:
The token will provide access to each project at the level at which you have access. See the Projects page for more on project access levels.
If the token provides access to a project within which you have PHI data access, it will enable access to that PHI data.
If you do not enter an expiration date when creating a token, it will be set to expire in one month.
Once you've completed the form, click Generate Token. A new 32-character token will be generated, and displayed along with a confirmation message.
To log in with a token via the CLI, enter the command dx login --token
, followed by a valid 32-character token.
Tokens are useful in a number of different scenarios. Examples include:
Logging in via the CLI when a single sign-on is enabled - If your organization uses single sign-on, you may not be able to log in via the CLI using a username and password. In this case, use a token to log in via the CLI.
Logging in via a script - You can incorporate a token into a script to allow the script to log into the Platform.
When incorporating a token into a script, take care to set the token's expiration date such that the script has Platform access for only as long as absolutely necessary. Ensure as well that the script only has access to that project or those projects to which it must have access, in order to function properly.
To revoke a token, navigate to the API Tokens screen within your profile on the UI. Select the token you want to revoke, then click the Revoke button:
In the Revoke Tokens Confirmation modal window, click the Yes, revoke it button. The token will be revoked, and its name will no longer appear in the list of tokens on the API Tokens screen.
Token shared too widely - Revoke a token if someone with whom you've shared the token should no longer be able to use it, or if you're not certain who has access to it.
Token no longer needed - Revoke a token if a script that uses it is no longer in use, or if a group that had been using it no longer needs access to the Platform, or in any other situation in which the token is no longer necessary.
As a rule, logging in requires interacting directly with the Platform, via the UI or the CLI. But it is possible to log in non-interactively. This is most commonly done via a script that automates both login and project selection.
Non-interactive login requires the use of dx login
with the --token
argument. Use the dx select
command to automate project selection. If you prefer not to automate project selection, add the --noprojects
argument to dx login
.
DNAnexus recommends adding two-factor authentication to your account, to provide an extra means of ensuring the security of all data to which you have access, on the Platform.
After enabling two-factor authentication, you will be required to enter a two-factor authentication code to log into the Platform, and to access certain other services. This code is a time-based one-time password that is valid for only a single session. It is generated by a third-party two-factor authenticator application, such as Google Authenticator.
With two-factor authentication protecting your account, your data will be protected even in the case that both your username and password are stolen. No attacker will be able to access your account without the two-factor authentication code.
To enable two-factor authentication, select Account Security from the dropdown menu accessible via your avatar, at the top right corner of the main menu.
In the Account Security screen, click the button labeled Enable 2FA. Then follow the instructions to select and set up a third-party authenticator application.
After enabling two-factor authentication, you will be redirected to a page containing back-up codes. These codes can be used in place of a two-factor authentication code, in the event that you lose access to your authenticator application.
Save the back-up codes in a secure place. Without them, if you lose access to your authenticator application, you will be unable to log into the Platform.
Contact DNAnexus Support if you lose both your codes and access to your authenticator application.
DNAnexus does not recommend disabling two-factor authentication once it has been enabled. If you do need to do so, navigate to the Account Security screen of your profile, then click the Turn Off button in the Two-Factor Authentication section. You will be required to enter your password and a two-factor authentication code to confirm your choice.
When using the command-line client, you may refer to objects either by their ID or by name.
The client also accepts names and paths as input, in a particular syntax.
There are three main types of paths recognized for referring to data objects: project paths, job-based object references (JBORs), and DNAnexus links.
To refer to a project by name, it must be suffixed with the colon character ":". Anything appearing after the ":" or without a ":" will be interpreted as a folder path to a named object. For example, to refer to a file called "hg19.fq.gz" in a folder called "human" in a project called "Genomes", the following path can be used in place of its object ID:
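Written out, that path is:

```
Genomes:/human/hg19.fq.gz
```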
Note that the folder path appearing after the ":" is assumed to be relative to the root folder "/" of the project.
To refer to the output of a particular job, you can use the syntax <job id>:<output name>
.
If you have the job ID handy, you can use it directly.
Or if you know it's the last analysis you ran:
You can also automatically download a file once the job producing it is done:
If the output is an array, you can extract a single element by specifying its array index (numbered 0, 1, etc.) as follows:
DNAnexus links are JSON hashes which are used for job input and output. They always contain one key, $dnanexus_link
, and have as a value either
a string representing a data object ID
another hash with two keys:
project
a string representing a project or other data container ID
id
a string representing a data object ID
For example:
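The hash form with both keys might look like the following (the project and file IDs are placeholders; the simpler form is just {"$dnanexus_link": "file-xxxx"}):

```json
{
  "$dnanexus_link": {
    "project": "project-xxxx",
    "id": "file-xxxx"
  }
}
```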
Because of the use of :
to denote project names and of /
to separate folder names, they must be escaped with a preceding backslash \\
when they appear in a data object's name. The characters *
and ?
are also reserved for the use in wildcard patterns and must also be escaped. Depending on your terminal and whether you put the entire name or path in quotes, you may have to additionally escape spaces in order to pass them in as part of the string. The use of backslashes in names is discouraged, and the best way to interact with objects with such names will probably be to use their IDs directly. See the following table for some examples of the necessary representation for accessing existing objects with special characters in their names (assuming a bash shell).
The following example illustrates how the special characters are escaped for use on the command line, with and without quotes.
For commands where the argument supplied involves naming or renaming something, the only escaping necessary is whatever is necessary for your shell or for setting it apart from a project or folder path.
It is possible to have multiple objects with the same name in the same folder. When an attempt is made to access or modify an object which shares the same name as another object, you will be prompted to select the desired data object.
Some commands (like mv
here) will allow you to enter *
so that all matches will be used. Other commands may automatically apply the command to all of them (e.g. ls
and describe
), and others will require that exactly one object be chosen (e.g. run
).
If dx run
is run without specifying an input, interactive mode will be launched. You will then be prompted to enter each required input, after which you will be given the option to select from a list of optional parameters to modify. Optional parameters listed will include all those that can be modified for each stage of the workflow. The interface will then output a JSON file detailing the input specified and generate an analysis ID of the form analysis-xxxx
unique to this particular run of the workflow.
Below is an example of running the Exome Analysis Workflow from the public "Exome Analysis Demo" project.
You can specify each input on the command-line using the -i
or --input
flags using the syntax -i<stage ID>.<input name>=<input value>
. <input-value>
must take the form of a DNAnexus object ID or the name of a file in the currently selected project. It is also possible to specify the number of a stage in place of the stage ID for a given workflow, where stages are indexed starting at zero. The inputs in the following example are specified for the first stage of the workflow only to illustrate this point. Note that the parentheses around the <input-value>
in the help string are omitted when entering input.
Possible values for the input name field can be found by running the command dx run workflow-xxxx -h
, as shown below using the Exome Analysis Workflow.
This help message describes the inputs for each stage of the workflow in the order they are specified. For each stage of the workflow, the help message will first list the required inputs for that stage, specifying the requisite type in the <input-value>
field. Next, the message describes common options for that stage (as seen in that stage's corresponding UI on the platform). Lastly, it will list advanced command-line options for that stage. If any stage's input is linked to the output of a prior stage, the help message shows the default value for that stage as a DNAnexus link of the form
{"$dnanexus_link": {"outputField": "<prior stage output name>", "stage": "stage-xxxx" }}
.
Similarly, this link format can be used to specify output from any prior stage in the workflow as input for the current stage. We see that the Exome Analysis Workflow has one required file array input in addition to those already specified by default: -ibwa_mem_fastq_read_mapper.reads_fastqgzs
. As these inputs are for the first stage of the Exome Analysis Workflow, the bwa_mem_fastq_read_mapper
stage ID can be replaced with 0
.
The example below shows how to run the same Exome Analysis Workflow on a FASTQ file containing reads, as well as a BWA reference genome, using the default parameters for each subsequent stage.
Array inputs can be specified by providing multiple -i flags for a single parameter in a stage. For example, the following flags would add files 1 through 3 to the file_inputs
parameter for stage-xxxx
of the workflow
:
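As a dry-run sketch (the workflow, stage, and file IDs below are placeholders, so the command is echoed into a file rather than executed):

```shell
# Repeating the -i flag appends each file, in order, to the array input.
# workflow-xxxx, stage-xxxx, and the file IDs are placeholder values.
echo dx run workflow-xxxx \
  -istage-xxxx.file_inputs=file-aaaa \
  -istage-xxxx.file_inputs=file-bbbb \
  -istage-xxxx.file_inputs=file-cccc > array_cmd.txt
cat array_cmd.txt
```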
If no project is selected, or if the file is in another project, the project containing the files you wish to use must be specified as follows: -i<stage ID>.<input name>=<project id>:<file id>
.
Using the --brief
flag at the end of a dx run
command will cause the command line to print the execution's analysis ID ("analysis-xxxx") instead of the input JSON for the execution. This ID can be saved for later reference.
To modify specific settings from the previous analysis, you can run the command dx run --clone analysis-xxxx [options]
. The [options]
parameters will override anything set by the --clone
flag, and they take the form of options passed as input from the command line.
For example, the command below redirects the output of the analysis to the outputs/
folder and reruns all stages.
When rerunning workflows, if a stage is run identically to how it was run in a previous analysis, the stage itself will not be rerun; the outputs of that stage will not be copied or rewritten in a new location. To rerun a specific stage, use the option --rerun-stage STAGE_ID
to force a stage to be run again, wherein STAGE_ID is an ID of the form stage-xxxx, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0). If you wish to rerun all stages of an analysis, you can use --rerun-stage "*"
, where the asterisk is enclosed in quotes to prevent the shell from expanding it, via globbing, into the names of all the files and folders in your current directory.
The command below reruns the third and final stage of analysis-xxxx:
The --destination
flag allows you to specify the output path for a workflow. By default, every output of every stage will be written to the specified destination.
You can use the --stage-output-folder <stage_ID> <folder>
command to specify the output destination of a particular stage in the analysis being run, wherein stage_ID
is the stage's name or the index of that stage (where the first stage of a workflow is indexed at 0), and folder
is the project and path to which you wish the stage to write using the syntax project-xxxx:/PATH
where PATH
is the path to the folder in project-xxxx
where you wish to write outputs.
The following command reruns all stages of analysis-xxxx
and sets the output destination of the first step of the workflow (BWA) to "mappings" in the current project:
If you want to place a stage's output folder inside the overall output folder of the analysis, you can use the flag --stage-relative-output-folder <stage_id> <folder>
, where stage_id
is the stage's ID (stage-xxxx
), name, or the index of that stage (where the first stage of a workflow is indexed at 0). For the folder argument, you can specify a quoted path, relative to the output folder of the analysis, to which the output of that stage will be written.
The following command reruns all stages of analysis-xxxx
, setting the output destination of the analysis to /exome_run,
and the output destination of stage 0 to /exome_run/mappings
in the current project:
If you wish to specify the instance type of all stages in your analysis or a specific set of stages in your analysis, you can do so with the flag --instance-type
. Specifically, the format --instance-type STAGE_ID=INSTANCE_TYPE
allows you to set the instance type of a specific stage, while --instance-type INSTANCE_TYPE
sets a single instance type for all stages. The two options can be combined; for example, --instance-type mem2_ssd1_x2 --instance-type my_stage_0=mem3_ssd1_x16
will set all stages' instance types to mem2_ssd1_x2
except for the stage my_stage_0
for which mem3_ssd1_x16
will be used.
Here STAGE_ID is an ID of a stage, the stage's name, or the index of that stage (where the first stage of a workflow is indexed at 0).
The example below reruns all stages of analysis-xxxx
and specifies that the first and second stages should be run on mem1_ssd2_x8
and mem1_ssd2_x16
instances respectively:
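As a dry-run sketch of such a command (analysis-xxxx is a placeholder, so the command is echoed into a file rather than executed; note the quoted asterisk, and the stage indices used as STAGE_ID):

```shell
# Rerun all stages, pinning stage 0 and stage 1 to specific instance types.
# analysis-xxxx is a placeholder ID.
echo dx run --clone analysis-xxxx --rerun-stage '*' \
  --instance-type 0=mem1_ssd2_x8 \
  --instance-type 1=mem1_ssd2_x16 -y > itype_cmd.txt
cat itype_cmd.txt
```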
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
Note that, as when running a workflow in non-interactive mode, inputs to a workflow must be specified as STAGE_ID.<input>
, where STAGE_ID is either an ID of the form stage-xxxx
or the index of that stage in the workflow (starting with the first stage at index 0).
You can treat dx
as an invocation command for navigating the data objects on the DNAnexus platform. By adding dx
in front of commonly used bash commands (e.g. dx ls
, dx cd
, dx mv
, and dx cp
), you can list objects, change folders, move objects, and copy objects stored on the platform, all from the command line.
To see more details about each object, you can run the command with the -l flag: dx ls -l
.
As in bash, you can list the contents of a path.
You can also list the contents of a different project. To specify a path that points to a different project, start with the project-ID, followed by a :
, then the path within the project where /
is the root folder of the project.
Note that we enclosed our path with quotes (" "
), so dx
interprets the spaces as part of the folder name rather than as argument separators.
You can also list only the objects which match a pattern. Here, we use a *
as a wildcard to represent all objects whose names contain .fasta
. This returns only a subset of the objects returned in the original query. Again we enclosed our path in " "
so dx
correctly interprets the asterisk and the spaces in the path.
To rename an object or a folder, simply "move" it to a new name in the same folder. Here we rename a file named ce10.fasta.gz
to C.elegans10.fasta.gz
.
If we want to move the renamed file into a folder, we can specify the path to the folder as the destination of the move command.
You can copy data objects or folders to another project by running the command dx cp
. Below we show an example to copy a human reference genome FASTA file (hs37d5.fa.gz
) from a public project, “Reference Genome Files”, to a project “Scratch Project” that the user has ADMINISTER permission to.
You can also copy folders between projects by running dx cp folder_name destination_path
. Folders will automatically be copied recursively.
To view and select from among all public projects, i.e. projects available to all DNAnexus users, you can run the command dx select --public
:
By default, dx select
will present a list of projects to which you have at least CONTRIBUTE permission. If you want to switch to a project to which you have only VIEW permission in order to view its data objects, you can run dx select --level VIEW
to list all the projects in which you have at least VIEW permission.
If you know the project ID or name, you can also give it directly to switch to the project as dx select [project-ID | project-name]
:
You can also specify each input parameter by name using the ‑i
or ‑‑input
flags with syntax ‑i<input name>=<input value>
. Names of data objects in your project will be resolved to the appropriate IDs and packaged correctly for the API method as shown below.
The help message describes the inputs and outputs of the app, their types, and how to identify them when running the app from the command line. For example, from the above help message, we learn that the Swiss Army Knife app has two primary inputs: one or more files, and a command string to be executed; these are specified as -iin=file-xxxx
and -icmd=<string>
, respectively.
The example below shows you how to run the same Swiss Army Knife app to sort a small BAM file using these inputs.
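As a sketch, such a command could take the following shape; the file ID and the samtools invocation are illustrative, and the command is echoed into a file (a dry run) rather than executed:

```shell
# Sort a BAM file with the Swiss Army Knife app. file-xxxx is a
# placeholder; in a real invocation, keep the quotes around the -icmd
# value so the shell passes the command string through intact.
echo dx run app-swiss-army-knife \
  -iin=file-xxxx \
  -icmd='samtools sort *.bam -o sorted.bam' > sak_cmd.txt
cat sak_cmd.txt
```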
Some examples of additional functionalities provided by dx run
are listed below.
Regardless of whether you run a job interactively or non-interactively, the command dx run
will always print the exact input JSON with which it is calling the applet or app. If you don't want to print this verbose output, you can use the --brief
flag which tells dx
to print out only the job ID instead. This job ID can then be saved.
TIP: When running jobs, you can use the
-y/--yes
option to bypass the prompts asking you to confirm running the job and whether or not you want to watch the job. This is useful for scripting jobs. If you want to confirm running the job and immediately start watching the job, you can use -y --watch
.
If you are debugging applet-xxxx and wish to rerun a job you previously ran with the same settings (destination project and folder, inputs, instance type requests) but with a new executable applet-yyyy, you can use the --clone
flag.
If you want to modify some but not all settings from the previous job, you can simply run dx run <executable> --clone job-xxxx [options]
. The command-line arguments you provide in [options]
will override the settings reused from --clone
. For example, this is useful if you want to rerun a job with the same executable and inputs but a different instance type, or if you want to run an executable with the same settings but slightly different inputs.
The example shown below redirects the outputs of the job to the folder "outputs/".
In the above command, the flag --destination project-xxxx:/mappings
instructs the job to output all results into the "mappings" folder of project-xxxx.
The dx run --instance-type
command allows you to specify the instance type(s) to be used for the job. More information can be found by running the command dx run --instance-type-help
.
If you are running many jobs that have varying purposes, you can organize the jobs using metadata. There are two types of metadata on the DNAnexus platform: properties and tags.
Properties are key-value pairs that can be attached to any object on the platform, whereas tags are strings associated with objects on the platform. The --property
flag allows you to attach a property to a job, and the --tag
flag allows you to tag a job.
Adding metadata to executions does not affect the metadata of the executions' output files. Metadata on jobs makes it easier for you to search for a particular job in your job history (e.g., you can tag all jobs run with a particular sample).
If your current workflow is not using the most up-to-date version of an app, you can specify an older version when running your job by appending the app name with the version required, e.g. app-xxx/0.0.1 if the current version is app-xxx/1.0.0.
If you would like to keep an eye on your job as it runs, you can use the --watch
flag to ask the job to print its logs in your terminal window as it progresses.
If using the CLI to enter the full input JSON, you must use the flag ‑j/‑‑input‑json
followed by the JSON in single quotes. Only single quotes should be used to wrap the JSON to avoid interfering with the double quotes used by the JSON itself.
If using a file to enter the input JSON, you must use the flag ‑f/‑‑input‑json‑file
followed by the name of the JSON file.
Entering the input JSON via stdin is done in much the same way as entering it from a file with the -f
flag, with the small substitution of using "-" as the filename. Below is an example that demonstrates how to echo the input JSON to stdin and pipe the output to the input of dx run
. As before, single quotes should be used to wrap the JSON input to avoid interfering with the double quotes used by the JSON itself.
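The quoting pattern can be sketched as follows; the file ID and input field are placeholders, and python3 -m json.tool stands in as a validator for what would otherwise be piped to dx run with "-" as the filename:

```shell
# Single quotes wrap the JSON so its inner double quotes survive the shell.
# file-xxxx and the "in"/"cmd" fields are placeholder values.
echo '{"in": [{"$dnanexus_link": "file-xxxx"}], "cmd": "ls -l"}' \
  | python3 -m json.tool > parsed.json
cat parsed.json
```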
The --cost-limit cost_limit
flag sets the maximum cost of the job before termination. In the case of workflows, the limit applies to the cost of the entire analysis. For batch runs, the limit is applied per job. See the dx run --help
command for more information.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
A project is a collaborative workspace on the DNAnexus Platform where you can store objects such as files, applets, and workflows. Within projects, you can run apps and workflows. You can also share a project with other users by giving them access to it. Read about projects in the section.
In the DNAnexus platform, every data object has a unique ID starting with the class of the object (e.g. "record", "file", "project") followed by a hyphen ('-') and 24 alphanumeric characters, e.g. record-9zGPKyvvbJ3Q3P8J7bx00005
. A string matching this format will always be interpreted to be meant as the ID of such an object and will not be further resolved as a name.
Exceptions to this are commands that take in arbitrary names (e.g. app names, user IDs, etc.). In such cases, all possible interpretations will be attempted. However, a string will always be assumed not to be a project name unless it ends in ":".
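As a sketch, the ID shape described above can be checked with a regular expression; the class list here is illustrative, not exhaustive:

```shell
# Return success if the string looks like a DNAnexus object ID:
# a class name, a hyphen, and exactly 24 alphanumeric characters.
is_object_id() {
  printf '%s' "$1" | grep -Eq '^(record|file|project|applet|app|workflow|job|analysis)-[0-9A-Za-z]{24}$'
}

is_object_id "record-9zGPKyvvbJ3Q3P8J7bx00005" && echo "interpreted as an object ID"
is_object_id "my_reads.fastq.gz" || echo "resolved as a name instead"
```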
You can run workflows from the command-line using the command dx run. The inputs to these workflows can be from any project for which you have VIEW access.
The examples here use the publicly available (platform login required to access this link).
For information on how to run a Nextflow pipeline, see .
The -i
flag can also be used to specify (JBORs) with the syntax -i<stage ID or number>:<input name>=<job id>:<output name>
. The --brief
flag, when used with the command dx run
, will only output the execution's ID; we can also skip the interactive prompts confirming the execution using the -y
flag. Calling dx run
on a job with the --brief
flag allows the command to return just the job ID of that execution and we can skip being prompted to begin execution with the -y
flag.
The example below calls the app (platform login required to access this link) to produce the sorted_bam
output described in the help string produced by running dx run app-bwa_mem_fastq_read_mapper -h
. This output is then used as input to the first stage of the featured on the DNAnexus platform (platform login required to access this link).
Note that the --clone
flag will not copy the usage of the --allow-ssh
or --debug-on
flags, which must be set with the new execution; only the applet, instance type, and input spec are copied. See the page for more information on the usage of these flags.
This is identical to adding metadata to a job; see for details.
It is not possible to monitor an analysis from the command line. For information about monitoring a job from the command line, see .
This is identical to providing an input JSON to a job; for more information, see .
By default when you set your current project, you are placed in the root folder /
of the project. You can list the objects and folders in your current folder with dx ls.
To find out what folder you are currently in, you can use the dx pwd
command. You can switch contexts to a subfolder in a project using dx cd.
You can move and rename data objects and folders using the command dx mv.
NOTE: The platform does NOT allow copying a data object within the same project, since each data object can exist only once in a given project. Additionally, it is prohibited to copy data objects between projects located in different regions using dx cp
.
You can change to another project where you want to work by running the command dx select. It brings up a prompt with a list of projects for you to select from. In the following example, the user has entered option 2 to select the project named "Mouse".
You can run apps and applets from the command-line using the command dx run. The inputs to these app(let)s can be from any project for which you have VIEW access.
If dx run
is run without specifying any inputs, interactive mode will be launched. When you run this command, the platform prompts you for each required input, followed by a prompt to set any optional parameters. As shown below using the (platform login required to access this link), after you are done entering inputs, you must confirm that you want the applet/app to be run with the inputs you have selected.
When specifying input parameters using the ‑i/‑‑input
flag, you must use the input field names (not to be confused with their human-readable labels). To look up the input field names for an app, applet, or workflow, you can run the command dx run app(let)-xxxx -h
, as shown below using the (platform login required to access this link).
To specify array inputs, reuse the ‑i/‑‑input
flag for each input in the array; each file specified will be appended to the array in the same order as it was entered on the command line. Below is an example of how to use the to index multiple BAM files (platform login required to access this link).
(JBORs) can also be provided using the -i
flag with syntax ‑i<input name>=<job id>:<output name>
. Combined with the --brief
flag (which allows dx run
to output just the job ID) and the -y
flag (to skip confirmation), you can string together two jobs using one command.
Below is an example of how to run the (platform login required to access this link), producing the output named "sorted_bam" as described in the app's helpstring by executing the command dx run app-bwa_mem_fastq_read_mapper -h
. The "sorted_bam" output will then be used as input for the (platform login required to access this link).
In the above command, the specified executable (platform login required to access this link) overrides the executable used by the job cloned with --clone job-xxxx
.
While the --clone job-xxxx
flag will copy the applet, instance type, and inputs, it will not copy usage of the --allow-ssh
or --debug-on
flags. These will have to be re-specified for each job run. For more information, see the page.
The --destination
flag allows you to specify the full project-ID:/folder/
path in which to output the results of the app(let). If this flag is unspecified, the output of the job will default to the present working directory, which can be determined by running dx pwd.
Some apps and applets have multiple , meaning that different instance types can be specified for different functions executed by the app(let). In the example below, we run the (platform login required to access this link) while specifying the instance types for the entry points "honey," "ssake," "ssake_insert," and "main." Specifying the instance types for each entry point requires a JSON-like string, meaning that the string should be wrapped in single quotes, as explained earlier, and demonstrated below.
You can also specify the input JSON in its entirety. To specify a data object, you must wrap it in a DNAnexus link (a key-value pair with a key of "$dnanexus_link" and a value of the data object's ID). Because you are already providing the JSON in its entirety, as long as the applet/app ID can be resolved and the JSON can be parsed, you will not be prompted to confirm before the job is started. There are three methods for entering the full input JSON, which we discuss in separate sections below.
Executing the dx run --help
command will show all of the flags available to use in conjunction with dx run
. The message printed by this command is identical to the one displayed in the brief description of .
| Character | Escaped version (no quotes) | Escaped version (with quotes) |
| --- | --- | --- |
| (single space) | `\ ` | `' '` |
| `:` | `\\:` | `'\:'` |
| `/` | `\\/` | `'\/'` |
| `*` | `\\\\*` | `'\\\*'` |
| `?` | `\\\\?` | `'\\\?'` |
Learn about different types of time limits on executions, and how they can affect your executions on the DNAnexus Platform.
On the DNAnexus Platform, executions are subject to two independent time limits: job timeouts, and execution tree expirations.
Each job has a timeout setting. This setting denotes the maximum amount of “wall clock time” that the job can spend in the “running” state, i.e. running on the DNAnexus Platform.
If the job is still running when this limit is reached, the job will be terminated.
The default job timeout setting is 30 days, though individual apps may have different timeout settings, as specified by the app’s creator. A job may be given a custom timeout setting.
As noted above, job timeouts only apply to the time a job spends in the "running" state.
Job timeouts do not apply to any time a job spends waiting to begin running - as, for example, when a job is waiting for inputs to become available.
Job timeouts also do not apply to the time a job may spend between exiting the “running” state, and entering the “done” state - as, for example, when it is waiting for subjobs to finish.
See this documentation to learn more about the job lifecycle and job states.
If a job fails to complete running before reaching its timeout limit, it will be terminated, with the Platform returning JobTimeoutExceeded
as the job's failure reason.
Each job is part of an execution tree. All jobs in an execution tree must complete running within 30 days of the launch of the tree’s root execution.
After this limit has been reached, all jobs within the execution tree lose the ability to access the Platform.
Note that if an execution tree is restarted, its timeout setting is not reset. Jobs in the tree lose Platform access 30 days after the initial launch (the first try) of the tree’s root execution.
If an execution tree reaches its time limit, jobs in the tree may not fail right away. If such a job is waiting for inputs or outputs, or if it is running without accessing the Platform, it may remain in that state. Only when the job tries to access the Platform will it fail. Depending on the access pattern, the Platform will return AppInternalError
, AppError
, or AuthError
as the job's failure reason.
Learn how to set job notification thresholds on the DNAnexus Platform.
Being notified of when a job may be stuck can help users to troubleshoot problems in a timely manner. On DNAnexus, users can set timeouts to limit the amount of time their jobs can run, or set a threshold on how long a job can take to run before the user is notified. The notification threshold can be specified in the executable at compile time via dx or dxCompiler.
When the threshold has been reached for a job tree, the user who launched the executable and the org admin will receive an email notification.
For a root execution, the turnaround time is the time between its creation and the time it reaches a terminal state (or the current time, if it is not yet in a terminal state). The terminal states of an execution are done, terminated, and failed. The job tree turnaround time threshold can be set from the dxapp.json
app metadata file using the treeTurnaroundTimeThreshold
supported field, where the threshold time is set in seconds. Note that when a user runs an executable that has a threshold, only the resulting root execution will have that threshold applied to it. See here for more details on the treeTurnaroundTimeThreshold
API.
Example of including the treeTurnaroundTimeThreshold
field in dxapp.json
:
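A minimal sketch of a dxapp.json carrying this field; the app name, title, version, and the 24-hour (86400-second) threshold are illustrative values:

```json
{
  "name": "my_app",
  "title": "My App",
  "version": "1.0.0",
  "treeTurnaroundTimeThreshold": 86400
}
```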
In the command-line interface (CLI), the dx build
and dx build --app
commands can accept the treeTurnaroundTimeThreshold
field from dxapp.json
, and the resulting app will be built with the job tree turnaround time threshold from the JSON file.
To check the treeTurnaroundTimeThreshold
value of an executable, users can run the dx describe
{app, applet, workflow, or global workflow ID} --json command.
Using the dx describe {execution_id} --json
command will display the selectedTreeTurnaroundTimeThreshold
, selectedTreeTurnaroundTimeThresholdFrom
, and treeTurnaroundTime
values of root executions.
For WDL workflows and tasks, dxCompiler allows users to specify tree turnaround time using the extras JSON file. dxCompiler uses the value of the treeTurnaroundTimeThreshold
field from the perWorkflowDxAttributes
and defaultWorkflowDxAttributes
sections in extras
and passes on this threshold to the new workflow generated from this file. To set a job tree turnaround time threshold for an applet using dxCompiler, add the treeTurnaroundTimeThreshold
field to the perTaskDxAttributes
and defaultTaskDxAttributes
sections in the extras JSON file.
Example of including the treeTurnaroundTimeThreshold
field in perWorkflowDxAttributes
:
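A sketch of such an extras file, assuming a workflow named my_workflow; the name and the 48-hour (172800-second) threshold are illustrative:

```json
{
  "perWorkflowDxAttributes": {
    "my_workflow": {
      "treeTurnaroundTimeThreshold": 172800
    }
  }
}
```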
To launch a DNAnexus application or workflow on many files automatically, one may write a short script to loop over the desired files in a project and launch jobs or analyses. Alternatively, the DNAnexus SDK provides a few handy utilities for batch processing. To use the GUI to run in batch mode, see these instructions.
In this tutorial, we'll batch process a series of sample FASTQs (forward and reverse reads). We'll use the dx generate_batch_inputs
command to generate a batch file -- a tab-delimited (TSV) file where each row corresponds to a single run in our batch. Then we'll process our batch using the dx run
command with the --batch-tsv
option.
In the project My Research Project we have the following files in our root directory:
We want to batch process these read pairs using BWA-MEM (link requires platform login). For a single execution of the BWA-MEM app, we need to specify the following inputs:
reads_fastqgzs
- FASTQ containing the left mates
reads2_fastqgzs
- FASTQ containing the right mates
genomeindex_targz
- BWA reference genome index
We'll use the BWA reference genome index from the public Reference Genome (requires platform login) project for all runs; the forward and reverse read pairs, however, will vary from run to run. To generate a batch file that pairs our input reads:
The (.*)
are regular expression groups. You can provide arbitrary regular expressions as input. The first group's match is used as the pattern to group pairs in the batch; these matches are called batch identifiers (batch IDs). To explain this behavior in more detail, we will use the output of the dx generate_batch_inputs
command above:
The dx generate_batch_inputs
command creates the dx_batch.0000.tsv
that looks like:
Recall the regular expression was RP(.*)_R1_(.*).fastq.gz
. Although there are two grouped matches in this example, only the first one is used as the pattern for the batch ID. For example, the pattern identified for RP10B_S1_R1_001.fastq.gz
is 10B_S1
which corresponds to the first grouped match while the second one is ignored.
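The grouping behavior can be reproduced with any regular expression tool; here is a sketch that extracts the batch ID from the first file name using sed:

```shell
# The first capture group of RP(.*)_R1_(.*).fastq.gz labels the batch;
# the second group is ignored.
echo "RP10B_S1_R1_001.fastq.gz" \
  | sed -E 's/^RP(.*)_R1_(.*)\.fastq\.gz$/\1/'
# prints: 10B_S1
```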
Examining the TSV file above, the files are grouped as expected, with the first match labeling the identifier of the group within the batch. The next two columns show the file names. The last two columns contain the IDs of the files on the DNAnexus platform. You can either edit this file directly or import it into a spreadsheet to make any subsequent changes.
Note that if an input for the app is an array, the input file IDs within the batch TSV file need to be enclosed in square brackets in order to work. The following bash command adds brackets to the file IDs in columns 4 and 5. You may need to change the variables in the command below ("$4" and "$5") to match the correct columns in your file. The command's output file, "new.tsv", is ready for the dx run --batch-tsv
command.
head -n 1 dx_batch.0000.tsv > temp.tsv && tail -n +2 dx_batch.0000.tsv | awk '{sub($4, "[&]"); print}' | awk '{sub($5, "[&]"); print}' >> temp.tsv; tr -d '\r' < temp.tsv > new.tsv; rm temp.tsv
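Applied to a minimal hypothetical batch TSV (one header row plus one sample row, with placeholder file IDs), the pipeline above behaves as follows:

```shell
# Build a tiny two-row batch TSV with placeholder file IDs.
printf 'batch ID\treads\treads2\treads ID\treads2 ID\n' > dx_batch.0000.tsv
printf '10B_S1\tRP10B_S1_R1_001.fastq.gz\tRP10B_S1_R2_001.fastq.gz\tfile-aaaa\tfile-bbbb\n' >> dx_batch.0000.tsv

# Keep the header, then wrap columns 4 and 5 of each data row in brackets.
head -n 1 dx_batch.0000.tsv > temp.tsv && tail -n +2 dx_batch.0000.tsv \
  | awk '{sub($4, "[&]"); print}' | awk '{sub($5, "[&]"); print}' >> temp.tsv
tr -d '\r' < temp.tsv > new.tsv && rm temp.tsv
cat new.tsv
```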
Note that the example above is for a case where all files have been paired properly. dx generate_batch_inputs
will create a TSV for all files that can be successfully matched for a particular batch ID. There are two classes of errors for batch IDs that are not successfully matched:
A particular input is missing (e.g. reads_fastqgzs
has a pattern but no corresponding match can be found for reads2_fastqgzs
)
More than one file ID matches the exact same name
For both of these cases, dx generate_batch_inputs
returns a description of these errors to STDERR.
We have our batch file so now we can execute our BWA-MEM batch process:
Here, genomeindex_targz
is a parameter set at execution time that is common to all groups in the batch and --batch-tsv
corresponds to the input file generated above.
To monitor a batch job, simply use the 'Monitor' tab like you normally would for jobs you launch.
In order to direct the output of each run into a separate folder, the --batch-folders
flag can be used, for example:
This will output the results for each sample in folders named after batch IDs, in our case the folders: "/10B_S1/", "/10T_S5/", "/15B_S4/", and "/15T_S8/". If the folders do not exist, they will be created.
The output folders are created under a path defined with --destination
, which by default is set to the current project and the "/" folder. For example, this command will output the result files in "/run_01/10B_S1/", "/run_01/10T_S5/", etc.:
dx generate_batch_inputs
is limited to starting runs that differ only in input fields of type file
. Use a more flexible for
loop construct if you want batch runs that differ in string, file array or other non-file type inputs.
Additionally, a for
loop allows you to specify other dx run
arguments such as name
for every run:
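A sketch of such a loop, using the sample names from this tutorial; the app and input field names are the ones discussed above, file-xxxx is a placeholder, and each command is echoed into a file (a dry run) rather than executed:

```shell
# Loop over samples, building one dx run command per read pair and giving
# each run its own name. Drop the echo to actually submit the jobs.
for sample in 10B_S1 10T_S5 15B_S4 15T_S8; do
  echo dx run app-bwa_mem_fastq_read_mapper \
    -ireads_fastqgzs="RP${sample}_R1_001.fastq.gz" \
    -ireads2_fastqgzs="RP${sample}_R2_001.fastq.gz" \
    -igenomeindex_targz=file-xxxx \
    --name "bwa_${sample}" --brief -y
done > batch_cmds.txt
cat batch_cmds.txt
```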
You can also use the dx run
command to run a workflow by its name, specifying inputs by stage_id
. For example, if you create a workflow called "Trio Exome Workflow - Jan 1st 2020 9:00am" in your project, you can run it from the command line:
Note the \
that is needed to escape the :
in the workflow name.
Inputs to the workflow can be specified using dx run <workflow> --input stage_id.name=value
, where stage_id
is a numeric ID starting at 0. More help can be found by running the commands dx run --help
and dx run <workflow> --help
.
To batch multiple inputs then, do the following:
For additional information and examples of how to run batch jobs, Chapter 6 of this reference guide may be useful. Note that this material is not a part of the official DNAnexus documentation and is for reference only.
Learn how to get information on current and past executions, via both the UI and the CLI.
To get basic information on a job (the execution of an app or applet) or an analysis (the execution of a workflow):
Click on Projects in the main Platform menu.
On the Projects list page, find and click on the name of the project within which the execution was launched.
Click on the Monitor tab to open the Monitor screen.
On the Monitor screen, you'll see a list of executions launched within the project. By default, they're listed in reverse chronological order, with the most recently launched execution at the top.
Find the row displaying information on the execution.
Note that for an analysis (the execution of a workflow), you can click on the "+" icon to the left of the analysis name to expand the row to view information on its stages. If an execution has further descendants, you can click on the “+” icon next to its name to expand the row further.
To see additional information on an execution, click on its name to be taken to its details page.
Note that there are shortcuts you can use to view information that is found on the details page directly on the list page, or relaunch an execution:
To view the Info pane:
Click the Info icon, above the right edge of the executions list, if it’s not already selected, and then select the execution by clicking on the row, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select View Info in the flyout menu.
To view the log file for a job, do either of the following:
Select the execution by clicking on the row. A View Log button will appear in the header. Click the View Log button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select View Log in the flyout menu.
To re-launch a job, do either of the following:
Select the execution by clicking on the row. A Launch as New Job button will appear in the header. Click the Launch as New Job button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Launch as New Job in the flyout menu.
To re-launch an analysis, do either of the following:
Select the execution by clicking on the row. A Launch as New Analysis button will appear in the header. Click the Launch as New Analysis button, or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Launch as New Analysis in the flyout menu.
In the list on the Monitor screen, you'll see the following information for each of the executions that is running or has been run within the project in question:
Name - The default name for an execution is the name of the app, applet, or workflow being run. When configuring an execution, you can give it a custom name, either via the UI, or via the CLI. Note that the execution's name is used in Platform email alerts related to the execution. Note as well that clicking on a name in the executions list opens the execution details page, giving in-depth information on the execution.
State - This is the execution's state. State values include:
"Waiting" - When initially launched, the execution's state will be "Waiting" until the Platform has allocated the resources required to run it, and, in some cases, until other executions on which it depends have finished.
"Running" - Once a job has started to run, its state will change from "Waiting" to “Running.”
"In Progress" - Once an analysis has been launched, its state will change to "In Progress."
"Done" - If the execution completes with no errors, its state will change to "Done."
"Failed" - If the execution fails to complete due to errors, its state will change to "Failed." See Types of Errors for help in understanding why an execution failed.
"Partially Failed" - An analysis is in the "Partially Failed" state if one or more stages in the workflow have not finished successfully, while at least one stage has not yet transitioned to a terminal state (either "Done," "Failed," or "Terminated").
"Terminating" - If the worker has begun terminating an execution, but not yet finished doing so, the execution's state will be displayed as "Terminating."
"Terminated" - If the execution is terminated prior to completion, its state will change to "Terminated."
"Debug Hold" - If an execution has been run with debugging options, and has failed for an applicable reason, and is being held for debugging, its state will show as "Debug Hold."
Executable - The executable or executables run in the course of the execution. Note that if the execution is an analysis, each stage will be shown in a separate row, including the name of the executable run during the stage in question. Note as well that if there is an informational page giving details about the executable and how to configure and use it, the executable's name will be clickable, and clicking the name will display that page.
Tags - Tags are strings associated with objects on the platform. They are a type of metadata that can be added to an execution.
Launched By - The name of the user who launched the execution.
Launched On - The time at which the execution was launched. Note that for many executions, this will be earlier than the time displayed in the Started Running column, because many executions spend time waiting for resources to become available, before they start running.
Started Running - The time at which the execution started running, if it has done so. Note that this may be later than the launch time, as many executions must wait for resources to become available before they can start running.
Duration - For jobs, this figure represents the time elapsed since the job entered the running state. For analyses, it represents the time elapsed since the analysis was created.
Cost - A value is displayed in this column when the user has access to billing info for the execution. The figure shown represents either, for a running execution, an estimate of the charges it has incurred so far, or, for a completed execution, the total costs it incurred.
Priority - The priority assigned to the execution - either "low," "normal," or "high" - when it was configured, either via the CLI or via the UI. This setting determines the scheduling priority of the execution, vis-a-vis other executions that are waiting to be launched.
Worker URL - If the execution is running an executable - such as DXJupyterLab - to which you can connect directly via a web URL, that URL will be shown here. Clicking the URL will open a connection to the executable, in a new browser tab.
Output Folder - For each execution, the value displayed represents a path relative to the project's root folder. Clicking the value will open the folder in which the execution's outputs have been or will be stored.
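As an illustration of the State column's values for analyses, here is a minimal sketch (not Platform code) of how an overall analysis state could be derived from its stage states, following the definitions above:

```python
# Hypothetical sketch: derive an analysis state from its stage states,
# following the state definitions above. Not actual Platform code.
TERMINAL = {"done", "failed", "terminated"}

def analysis_state(stage_states):
    """Derive an overall analysis state from a list of stage states."""
    if all(s == "done" for s in stage_states):
        return "done"
    if any(s == "failed" for s in stage_states):
        # A stage failed; if any stage is still non-terminal, the
        # analysis is only partially failed so far.
        if any(s not in TERMINAL for s in stage_states):
            return "partially_failed"
        return "failed"
    return "in_progress"

print(analysis_state(["done", "done"]))       # done
print(analysis_state(["failed", "running"]))  # partially_failed
print(analysis_state(["failed", "done"]))     # failed
```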
Additional basic information can be displayed for each execution. To do this:
Click on the "table" icon at the right edge of the table header row.
Select one or more of the entries in the list, to display an additional column or columns.
Available additional columns include:
Stopped Running - The time at which the execution stopped running.
Custom properties columns - If custom properties have been assigned to any of the listed executions, a column can be added to the table for each such property, showing the value assigned to each execution for that property.
To remove columns from the list, click on the "table" icon at the right edge of the table header row, then de-select one or more of the entries in the list, to hide the column or columns in question.
A filter menu above the executions list allows you to run a search that refines the list to display only executions meeting specific criteria.
By default, pills are displayed that allow you to set search criteria that will filter executions by one or more of the following attributes:
Name - Execution name
State - Execution state
ID - An execution's job ID or analysis ID
Executable - A specific executable
Launched By - The user who launched an execution or executions
Launch Time - The time range within which executions were launched
Click the List icon, above the right edge of the executions list, to display pills that allow filtering by additional execution attributes.
Note that by default, filters are set to display only root executions that meet the criteria defined in the filter. If you want the display to include all executions, including those run during individual stages of workflows, click the button, above the left edge of the executions list, showing the default value "Root Executions Only." Then click "All Executions."
To save a particular filter, click the Bookmark icon, above the right edge of the executions list, assign your filter a name, then click Save.
To apply a saved filter to the executions list, click the Bookmark icon, then select the filter from the list.
If you launched an execution and are a contributor to the project within which it's running, you can terminate the execution from the list on the Monitor screen while it's in a non-terminal state. If you have project admin status, you can also terminate executions launched by other project members.
To terminate an execution:
Find the execution in the list, then do either of the following:
Select the execution by clicking on the row. A red Terminate button will appear at the end of the header. Click the Terminate button; or
Hover over the row and click on the “More Actions” button that looks like three vertical dots at the end of the row, then select Terminate in the flyout menu.
A modal window will open, asking you to confirm that you want to terminate the execution. Click Terminate to confirm.
The execution's state will show as "Terminating" as it is being terminated. Then its state will change to "Terminated."
To get additional information on an execution, click on its name in the list on the Monitor screen. A new page will open.
On the details page for an execution, you'll see a range of information, including:
High-level details - In the Execution Tree section, at the top of the screen, you'll see high-level information, including:
For a standalone execution - such as a job without children - you'll see a single entry that includes details on the state of the execution, when it started and stopped running, and how long it spent in the running state.
For an execution with descendants - such as an analysis with multiple stages - you'll see a list, with each row containing details on the execution run at each stage of the analysis. If the execution has descendants, you can click on the “+” icon next to its name to expand the row to view information on its descendants. To see a page displaying detailed information on a stage, click on its name in the list. To navigate back to the workflow's details page, click on its name in the "breadcrumb" navigation menu in the top right corner of the screen.
Execution state - In the Execution Tree section, each execution row includes a color bar that represents the execution's current state. For descendants within the same execution tree, the time visualizations are staggered, indicating their different start and stop times in relation to each other. The colors include:
Blue - A blue bar indicates that the execution is in the "Running" or "In Progress" state.
Green - A green bar indicates that the execution is in the "Done" state.
Red - A red bar indicates that the execution is in the "Failed" or "Partially Failed" state.
Grey - A grey bar indicates that the execution is in the "Terminated" state.
Execution start and stop times - Times are displayed in the header bar at the top of the Execution Tree section. These times run, from left to right, from the time at which the job started running, or when the analysis was created, to either the current time, or the time at which the execution entered a terminal state ("Done," "Failed," or "Terminated").
Inputs - In this section, you'll see a list of the inputs to the execution. If a direct link to the input file is available, the input's name will be hyperlinked to the file; clicking the link will open the project location containing the file. If the input was provided by another execution in a workflow, the execution's name will be hyperlinked; clicking the link will open the details page for the execution in question.
Outputs - In this section, you'll see a list of the execution's outputs. If a direct link to the output file is available, the output's name will be hyperlinked to the file; clicking the link will open the folder containing the file.
Log files - An execution's log file is useful in understanding details about, for example, the resources used by an execution, the costs it incurred, and the source of any delays it encountered. To access log files and, as needed, download them in .txt format:
To access the log file for a job, click either the View Log button in the top right corner of the screen, or the View Log link in the Execution Tree section.
To access the log file for each stage in an analysis, click the View Log link next to the row displaying information on the stage in question, in the Execution Tree section.
Basic info - The Info pane, on the right side of the screen, displays a range of basic information on the execution, along with additional detail such as the execution's unique ID, and custom properties and tags assigned to it.
Reused results - If an execution reuses results from another execution, this information will be shown in a blue pane, above the Execution Tree section. To see details on the execution that generated these results, click on its name.
If an execution failed, a Cause of Failure pane will display, above the Execution Tree section. The cause of failure is a system-generated error message. For assistance in diagnosing the failure and any related issues:
Click the button labeled Send Failure Report to DNAnexus Support.
A form will open in a modal window, with both the Subject and Message fields pre-populated with information that DNAnexus Support will use in diagnosing and resolving the issue.
Click the button in the Grant Access section to give DNAnexus Support reps "View" access to the project in which the issue occurred. This will enable Support reps to diagnose and resolve the issue more quickly.
Click Send Report to send the report.
To re-launch a job from the execution details screen:
Click the Launch as New Job button in the upper right corner of the screen.
A new browser tab will open, displaying the Run App / Applet form.
Configure the run, then click Start Analysis.
To re-launch an analysis from the execution details screen:
Click the Launch as New Analysis button in the upper right corner of the screen.
A new browser tab will open, displaying the Run Analysis form.
Configure the run, then click Start Analysis.
If you want to save a copy of a workflow along with its input configurations under a new name, from the execution details screen:
Click the Save as New Workflow button in the upper right corner of the screen.
In the Save as New Workflow modal window, give the workflow a name, and select the project in which you'd like to save it.
Click Save.
As described in this documentation, jobs can be configured to restart automatically upon certain types of failures.
If you want to view the execution details for the initial tries for a restarted job:
Click on the "Tries" link below the job name in the summary banner, or the "Tries" link next to the job name in the execution tree.
A modal window will open.
Click the name of the try for which you'd like to view execution details.
Note that you can only send a failure report for the most recent try, not for any previous tries.
You can use dx watch to view the log of a running job or of any past job, whether it finished successfully, failed, or was terminated. If you'd like to view a job's log stream while it runs, run dx watch against the job ID. The log stream includes stdout, stderr, and additional information the worker outputs as it executes the job.
If for some reason you need to terminate a job before it completes, use the command dx terminate.
If you'd like to view the log of a job that has finished running, you can likewise use the dx watch command. The log includes stdout, stderr, and additional information the worker output as it executed the job.
You can use dx find executions to return the ten most recent executions in your current project. You can specify the number of executions you wish to view by running dx find executions -n <specified number>. The output from dx find executions will be similar to the information shown in the "Monitor" tab on the DNAnexus web UI.
Below is an example of dx find executions; in this case, only two executions have been run in the current project. There is an individual job, DeepVariant Germline Variant Caller, and a workflow consisting of two stages, Variant Calling Workflow. A stage is represented by either another analysis (if running a workflow) or a job (if running an app(let)).
The job running the DeepVariant Germline Variant Caller executable is running and has been running for 10 minutes and 28 seconds. The analysis running the Variant Calling Workflow consists of 2 stages, Freebayes Variant Caller, which is waiting on input, and BWA-MEM FASTQ Read Mapper, which has been running for 10 minutes and 18 seconds.
By default, the dx find executions operation will search for jobs or analyses created when a user runs an app or applet. If a job is part of an analysis, the results will be returned in a tree representation linking all of the jobs in the analysis together.
By default, dx find executions will return up to ten of the most recent executions in your current project, in order of execution creation time.
However, a user can also filter the returned executions by job type. Using the flag --origin-jobs in conjunction with the dx find executions command returns only original jobs, whereas the flag --all-jobs will also include subjobs.
We can choose to monitor only analyses by running the command dx find analyses. Analyses are executions of workflows and consist of one or more app(let)s being run. When using dx find analyses, the command will return only the top-level analyses, not any of the jobs contained therein.
Below is an example of dx find analyses:
Jobs are runs of an individual app(let) and compose analyses. We can monitor jobs by running the command dx find jobs, which will return a flat list of jobs. If a job is in an analysis, all jobs within the analysis are also returned.
Below is an example of dx find jobs:
Searches for executions can be restricted to specific parameters.
To extract only stdout from this job, we can run the command dx watch job-xxxx --get-stdout.
To extract only stderr from this job, we can run the command dx watch job-xxxx --get-stderr.
To extract both stdout and stderr from this job, we can run the command dx watch job-xxxx --get-streams.
Below is an example of viewing stdout lines of a job log:
To view the entire job tree, including both main jobs and subjobs, use the command dx watch job-xxxx --tree.
To limit the log output to the most recent messages, use the -n option, e.g. dx watch job-xxxx -n 8 to view the last 8 messages. If the job has already run, its output is displayed as well.
In the example below, the app Sample Prints doesn’t have any output.
Jobs can be configured to restart automatically upon certain types of failures, as described in the Restartable Jobs section. To view initial tries of restarted jobs, along with execution subtrees rooted in those initial tries, use dx find executions --include-restarted. To examine job logs for initial tries, use dx watch job-xxxx --try X. An example of these commands is shown below.
By default, dx find will restrict your search to only your current project context. To search across all the projects to which you have access, use the --all-projects flag.
By default, dx find will only return up to ten of the most recently launched executions matching your search query. To change the number of executions returned, you can use the -n option.
A user can search for only executions of a specific app(let) or workflow based on its entity ID.
Users can also use the --created-before and --created-after options to search based on when the execution began.
Users can also restrict the search to a specific state, e.g. "done", "failed", "terminated".
The --delim flag will tab-delimit the output. This allows the output to be passed into other shell commands.
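As an illustrative sketch, tab-delimited output can be split into fields in a downstream script; the column layout shown here is hypothetical, not the actual dx output format:

```python
# Hypothetical example of consuming tab-delimited output, such as that
# produced by `dx find executions --delim`. The column layout below is
# illustrative only, not the actual dx output format.
line = "done\tBWA-MEM FASTQ Read Mapper\tjob-B4QzZv0000zzbKfqQkfQ0001"

# Tab-delimited fields split cleanly, even when names contain spaces.
state, name, exec_id = line.split("\t")
print(exec_id)  # job-B4QzZv0000zzbKfqQkfQ0001
```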
You can use the --brief flag to return only the object IDs for the objects returned by your search query. The --origin-jobs flag will omit the subjob information.
Below is an example usage of the --brief flag:
Below is an example of using the flags --origin-jobs and --brief. In the example below, we describe the last job run in the current default project.
See the Index of dx Commands documentation for more on using dx find jobs.
Job logs can be automatically forwarded to a customer's Splunk instance for analysis. See this documentation for more information on enabling and using this feature.
Learn key terms used to describe apps and workflows.
On the DNAnexus Platform, the following terms are used when discussing apps and workflows:
Execution: An analysis or job.
Root execution: The initial analysis or job that's created when a user makes an API call to run a workflow, app, or applet. Analyses and jobs created from a job via an /executable-xxxx/run API call with the detach flag set to true are also root executions.
Execution tree: The set of all jobs and/or analyses that are created as a result of running a root execution.
Analysis: An analysis is created when a workflow is run. It consists of some number of stages, each of which consists of either another analysis (if running a workflow) or a job (if running an app or applet).
Parent analysis: Each analysis is the parent analysis to each of the jobs that are created to run its stages.
Job: A job is a unit of execution that is run on a worker in the cloud. A job is created when an app or applet is run, or when a job spawns another job.
Origin job: The job created when an app or applet is run by either a user or an analysis. An origin job always executes the "main" entry point.
Master job: The job created when an app or applet is run by a user, job, or analysis. A master job always executes the "main" entry point. All origin jobs are also master jobs.
Parent job: A job that creates another job or analysis via an /executable-xxxx/run or /job/new API call.
Child job: A job created from a parent job via an /app[let]-xxxx/run or /job/new API call.
Subjob: A job created from a job via a /job/new API call. A subjob runs the same executable as its parent, and executes the entry point specified in the API call that created it.
Job tree: A set of all jobs that share the same origin job.
Job-based object reference: A hash containing a job ID and an output field name. This hash is given in the input or output of a job. Once the specified job has transitioned to the "done" state, it is replaced with the specified job's output field.
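The job-based object reference definition above can be sketched as follows; the job ID, output field name, and file ID are hypothetical:

```python
# Illustrative sketch of a job-based object reference (JBOR) and how it is
# resolved once the referenced job reaches the "done" state. The job ID,
# field name, and file ID below are hypothetical.
jbor = {"job": "job-xxxx", "field": "mapped_reads"}

# Outputs of the referenced job after it transitions to "done":
jobs = {
    "job-xxxx": {"state": "done", "output": {"mapped_reads": "file-yyyy"}},
}

def resolve(ref, jobs):
    """Replace a JBOR with the referenced job's output field value."""
    job = jobs[ref["job"]]
    if job["state"] != "done":
        raise RuntimeError("referenced job has not finished yet")
    return job["output"][ref["field"]]

print(resolve(jbor, jobs))  # file-yyyy
```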
Speed workflow development and reduce testing costs by reusing computational outputs.
DNAnexus allows organizations to optionally reuse outputs of jobs that share the same executable and input IDs, even if these outputs are across projects or entire organizations. This feature has two primary use cases.
For example, suppose you are developing a workflow, and at each stage, you end up debugging an issue. Let's assume that each stage takes approximately one hour to develop and run. If you do not reuse outputs as you develop, the process takes 1 + 2 + 3 + ... + n hours, since every time you fix a stage, you must recompute the results of all the stages before it. On the other hand, if you reuse results for stages that have matured and are no longer modified, your total development time is just the total time it takes to develop and run the pipeline (in this case, n hours). This is an order-of-magnitude difference in development time, and the improvement becomes more pronounced for longer workflows.
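The arithmetic above can be checked with a short sketch:

```python
# Sketch of the development-time arithmetic above: without output reuse,
# fixing stage i forces stages 1..i to be recomputed, so an n-stage
# pipeline costs 1 + 2 + ... + n hours; with reuse it costs n hours.
def dev_hours(n, reuse):
    """Total hours to develop an n-stage pipeline, one hour per stage."""
    return n if reuse else n * (n + 1) // 2

print(dev_hours(10, reuse=False))  # 55 hours without reuse
print(dev_hours(10, reuse=True))   # 10 hours with reuse
```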
This feature is also powerful for saving time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the stages before them, the time required for R&D of the forked version is only that of the final stages.
In production environments, it is important to test R&D modifications to a workflow at scale (e.g. a workflow for a clinical test). For example, suppose you are testing a workflow like the forked workflow discussed in the example above. This is a clinical workflow that needs to be tested on thousands of samples (let that number be represented by m) before being vetted to run in a production environment. Let's also suppose the whole workflow takes n hours but you only have modified the last k stages. You save (n-k)m total compute hours. This can add up to dramatic cost savings as m grows and if k is small.
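The savings estimate above can be sketched as follows; the figures are illustrative, and the text's implicit assumption that each stage costs roughly one compute hour is carried over:

```python
# Sketch of the (n - k) * m savings estimate above: testing m samples
# through an n-hour workflow in which only the last k stages changed
# saves (n - k) * m compute hours when unchanged stages are reused.
# Assumes, as the text does, roughly one compute hour per stage.
def compute_hours_saved(n_hours, k_modified, m_samples):
    return (n_hours - k_modified) * m_samples

print(compute_hours_saved(n_hours=12, k_modified=2, m_samples=1000))  # 10000
```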
To demonstrate Smart Reuse, we will use WDL syntax as supported by DNAnexus through our toolkit and by dxCompiler.
The workflow above is a two-step workflow that simply duplicates a file and takes the first 10 lines from the duplicate.
Now suppose the user has run the workflow above on some file and simply wants to tweak headfile to output the first 15 lines instead:
Here the only differences are that we renamed headfile and basic_reuse, and changed 10 to 15. The compilation process automatically detects that dupfile is the same, but that the second stage differs. The generated workflow therefore uses the original executable ID for dupfile but a different executable ID for headfile2.
When executing basic_reuse_tweaked on the same input file with Smart Reuse enabled, the results of the dupfile task are reused. Because a job on the DNAnexus Platform has already run that specific executable with the same input file, the system can reuse its output.
When using Smart Reuse with complex WDL workflows involving WDL expressions in input arguments, scatters, and nested sub-workflows, we recommend launching workflows using the --preserve-job-outputs option, in order to preserve the outputs of all the jobs in the execution tree in the project, and to increase the potential for subsequent Smart Reuse.
Smart Reuse:
only applies to jobs run in projects billed to an organization that has Smart Reuse enabled
is applied only to completed jobs executed after the policies are updated for an org
Jobs:
may only reuse results from another job if that job ran with the exact same executable and input IDs (including the function called within the applet). If an input is watermarked, the watermark and its version must match as well. Matching does not extend to other settings, such as the instance type on which the job was run.
if ignoreReuse is set to true, the job will not be considered a future candidate for job reuse
the job to be reused must have all outputs intact at the time of reuse. Partial output from the job (e.g. some of the output is missing or inaccessible) will prevent the reuse.
contain a field called outputReusedFrom that refers to the job ID that originally computed the requested outputs. This field never refers to another job that has itself been reused.
may only reuse results across projects if the corresponding application's dxapp.json contains "allProjects": "VIEW" in the "access" field
must have at least VIEW access to the original job's outputs, and those outputs must still exist on the Platform (i.e. they have not since been deleted)
are reported as having run for 0 seconds and correspondingly are billed at $0
are assumed to be deterministic in output
if the reused job or workflow is located in a different project or a different folder, the output data will not be cloned to the working project or the new destination folder, since the new jobs or workflows are not actually run.
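As a rough sketch of the matching rule described above (not the Platform's actual implementation), reuse can be thought of as a cache keyed on the executable ID, entry-point function, and input IDs; all names and IDs here are hypothetical:

```python
# Hypothetical sketch of the Smart Reuse matching rule: a completed job's
# outputs may be reused only when the executable ID, entry-point function,
# and input IDs all match exactly. Settings such as instance type are
# deliberately excluded from the key, mirroring the rule above.
completed = {}  # (executable, function, inputs) -> prior job record

def run_or_reuse(executable_id, function, input_ids):
    key = (executable_id, function, tuple(sorted(input_ids.items())))
    if key in completed:
        # Reused jobs carry outputReusedFrom, pointing at the original job.
        out = dict(completed[key])
        out["outputReusedFrom"] = out.pop("job_id")
        return out
    result = {"job_id": "job-xxxx", "output": {"result": "file-yyyy"}}
    completed[key] = result
    return result

first = run_or_reuse("applet-aaa", "main", {"reads": "file-123"})
second = run_or_reuse("applet-aaa", "main", {"reads": "file-123"})
print("outputReusedFrom" in second)  # True: same executable and inputs
```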
If you are an administrator of a licensed org and want to enable Smart Reuse, run this command:
Conversely, set the value to false to disable it. If you are a licensed customer and cannot run the command above, contact support@dnanexus.com. If you are interested in this feature and are not a licensed customer, reach out to sales@dnanexus.com or your account executive for more information.
Learn about limits on the costs executions can incur, and how these limits can affect executions on the DNAnexus Platform.
A running execution can be terminated when it incurs charges that cause a cost or spending limit to be reached. When a spending limit is reached, this can also prevent new executions from being launched.
An execution cost limit is an optional limit on the usage charges an execution tree can incur. This limit is set when a root execution is launched. Once this limit is reached, the DNAnexus Platform will terminate running executions in the affected execution tree.
When an execution is terminated in this fashion, the Platform will set CostLimitExceeded as the failure reason. This failure code will be displayed on the UI, on the relevant project’s Monitor page.
Billing account spending limits are managed by billing administrators, and can impact executions in projects billed to the account.
Billing account spending limits apply to cumulative charges incurred by projects billed to the account.
If cumulative charges reach this limit, the Platform will terminate running jobs in projects billed to the account, and will prevent new executions from being launched.
When a job is terminated in this fashion, the Platform will set SpendingLimitExceeded as the failure reason. This failure reason will be displayed on the UI, on the relevant project’s Monitor page.
Monthly project compute spending limits can be set by project admins, and can impact executions run within the project. Project admins can also set a separate monthly project-level egress spending limit, which can impact data egress from the project.
If the compute spending limit is reached, the Platform may terminate running jobs launched by project members, and prevent new executions from being launched. If the egress spending limit is reached, the Platform may prevent data egress from the project. The exact behavior depends on the policies of the org to which the project is billed.
For more information on these limits, see this overview, and this detailed explanation of setting org spending limit policies.
Monthly project compute limits do not apply to compute charges incurred by using relational database clusters.
There is a charge for using public IPv4 addresses for workers used for compute. When a job uses such a worker, IPv4 charges are included in the total cost figure shown for the job on the UI. These charges also count toward any compute spending limit that applies to the project in which the job is running.
See this documentation for information on how to find the per-hour charge for using IPv4 addresses, in each cloud region in which org members can run executions.
The UI displays information on costs and cost limits for both individual executions and execution trees. Navigate to the project in which the execution or execution tree is being run, then click the Monitor tab. Click on the name of the execution or execution tree to open a page showing detailed information about it.
While an execution or execution tree is running, information will be displayed on the charges it has incurred so far, and on additional charges it can incur, before an applicable cost limit is reached.
Org spending limit information is available from the Billing page for each org.
If project-level monthly spending limits have been set for a project, detailed information is available via the CLI, using the command dx describe project-id.
Learn about the states through which a job or analysis may go, during its lifecycle.
In the following example, we have a workflow that has two stages, one of which is an applet, and the other of which is an app.
If the workflow is run, it will generate an analysis with an attached workspace for storing intermediate output from its stages. Jobs are also created to run the two stages. These jobs in turn can spawn more jobs, either to run another function in the same executable or to run an executable. The blue labels indicate which jobs or analyses can be described using a particular term (as defined above).
Note that the subjob or child job of stage 1's origin job shares the same temporary workspace as its parent job. Any calls to run a new applet or app (using the API methods /applet-xxxx/run or /app-xxxx/run) will launch a master job that has its own separate workspace and, by default, no visibility into its parent job's workspace.
Every successful job goes through at least the following four states:
1. idle: the initial state of every new job, regardless of what API call was made to create it.
2. runnable: the job's inputs are ready, and it is not waiting for any other job to finish or data object to finish closing.
3. running: the job has been assigned to, and is being run on, a worker in the cloud.
4. done: the job has completed, and it is not waiting for any descendant job to finish or data object to finish closing. This is a terminal state; no job will transition to a different state after reaching done.
Jobs may also pass through the following transitional states as part of more complicated execution patterns:
waiting_on_input (between idle and runnable): a job enters and stays in this state if at least one of the following is true:
it has an unresolved job-based object reference in its input
it has a data object input that cannot be cloned yet because it is not in the closed state or a linked hidden object is not in the closed state
it was created to wait on a list of jobs or data objects that must enter the done or closed states, respectively (see the dependsOn field of any API call that creates a job); linked hidden objects are implicitly included in this list
waiting_on_output (between running and done): a job enters and stays in this state if at least one of the following is true:
it has a descendant job that has not yet entered the done state
it has an unresolved job-based object reference in its output
it is an origin or master job which has a data object (or linked hidden data object) output in the closing state
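The state progression described above can be sketched as a small state machine. The following is an illustrative model, not platform code; it encodes only the transitions of a successful job as listed here, with the two transitional states optional.

```python
# Illustrative sketch (not DNAnexus platform code) of the states a
# successful job passes through, per the description above.
SUCCESS_TRANSITIONS = {
    "idle": {"waiting_on_input", "runnable"},
    "waiting_on_input": {"runnable"},
    "runnable": {"running"},
    "running": {"waiting_on_output", "done"},
    "waiting_on_output": {"done"},
    "done": set(),  # terminal state: no further transitions
}

def is_valid_success_path(states):
    """Check that a sequence of states follows the allowed transitions."""
    if not states or states[0] != "idle" or states[-1] != "done":
        return False
    return all(b in SUCCESS_TRANSITIONS[a] for a, b in zip(states, states[1:]))

# The minimal successful path, and one including both transitional states:
print(is_valid_success_path(["idle", "runnable", "running", "done"]))
print(is_valid_success_path(
    ["idle", "waiting_on_input", "runnable", "running", "waiting_on_output", "done"]))
print(is_valid_success_path(["idle", "running", "done"]))  # skips runnable: invalid
```
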
There are two terminal job states other than the done state, terminated and failed, and a job can enter either of these states from any other state except another terminal state.
The terminated state is entered when a user has requested that the job (or another job sharing the same origin job) be terminated. For all terminated jobs, the failureReason in their describe hash is set to "Terminated", and the failureMessage indicates the user responsible for terminating the job. Only the user who launched the job, or administrators of the job's project context, can terminate the job.
Jobs can fail for a variety of reasons, and once a job fails, this triggers failure for all other jobs that share the same origin job. If an unrelated job (i.e. one not in the same job tree) has a job-based object reference to, or otherwise depends on, a failed job, it will also fail. For more information about errors that jobs can encounter, see the Error Information page.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days fail with a JobTimeoutExceeded error.
Jobs can automatically restart upon certain types of failures, such as UnresponsiveWorker, ExecutionError, AppInternalError, and JobTimeoutExceeded, as specified in the executionPolicy of an app(let) or workflow. If a job fails for a restartable reason, either its failure propagates to its nearest master job and that job is restarted (if the executable has the restartableEntryPoints flag set to its default value of master), or the job itself is restarted (if the executable has the restartableEntryPoints flag set to all). A job can be restarted up to the number of times given in the executionPolicy, after which the entire job tree fails.
Jobs belonging to root executions launched after July 12, 2023 00:13 UTC have an integer try attribute representing the different tries of restarted jobs. The first try of a job has try set to 0. The second job try (if the job was restarted) has its try attribute set to 1, and so on.
restartable: the job is ready to be restarted.
restarted: the job try has been restarted.
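The numbering of tries across restarts can be illustrated with a short sketch. This is not platform code: the attempt outcomes are hypothetical inputs, and max_restarts stands in for the restart limit configured in the executionPolicy.

```python
# Illustrative sketch: how the try attribute numbers restarts of a job.
# max_restarts stands for the restart limit in the executionPolicy;
# the per-attempt outcomes are hypothetical.
def run_with_restarts(outcomes, max_restarts):
    """Return (tries, final_state); tries maps try number -> outcome."""
    tries = {}
    for try_number, outcome in enumerate(outcomes):
        tries[try_number] = outcome
        if outcome == "done" or try_number == max_restarts:
            return tries, outcome
    return tries, outcomes[len(tries) - 1]

# First try (try=0) fails restartably, second try (try=1) succeeds:
tries, final = run_with_restarts(["failed:UnresponsiveWorker", "done"], max_restarts=1)
print(tries)   # {0: 'failed:UnresponsiveWorker', 1: 'done'}
print(final)   # done
```
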
Some API methods (e.g. /job-xxxx/describe, /job-xxxx/addTags, /job-xxxx/removeTags, /job-xxxx/setProperties, /system/findExecutions, /system/findJobs, and /system/findAnalyses) accept an optional job try input and may include the job's try attribute in their output. All API methods interpret a job ID input without a try argument as referring to the most recent try of that job.
For unsuccessful jobs, there are a few more states that a job may enter between the running state and its eventual terminal state of terminated or failed; unsuccessful jobs in any other non-terminal state transition directly to the appropriate terminal state.
terminating: the transitional state when the worker in the cloud has begun terminating the job and tearing down the execution environment. Once the worker in the cloud has reported that it has terminated the job or otherwise becomes unresponsive, then the job will transition to its terminal state.
debug_hold: a job has been run with debugging options and has failed for an applicable reason, and is being held for debugging by the user. For more information about triggering this state, see the Connecting to Jobs page.
All analyses start in the state in_progress, and, like jobs, will end up in one of the terminal states done, failed, or terminated. The following diagram shows the state transition for all successful analyses.
If an analysis is unsuccessful, it may transition through one or more intermediate states before it reaches its terminal state:
partially_failed: this state indicates that one or more stages in the analysis have not finished successfully, and there is at least one stage which has not transitioned to a terminal state. In this state, some stages may have already finished successfully (and entered the done state), and the remaining stages will also be allowed to finish successfully if they can.
terminating: an analysis may enter this state either via an API call in which a user terminated the analysis, or under a failure condition in which the analysis is terminating its remaining stages. The latter may happen if the executionPolicy for the analysis (or for a stage of the analysis) has the onNonRestartableFailure value set to "failAllStages".
In general, compute and data storage costs due to jobs that fail because of user error (e.g. InputError, OutputError), as well as terminated jobs, are still charged to the project in which the jobs were run. Costs due to internal errors of the DNAnexus Platform are not billed.
The cost of each stage in an analysis is determined independently. If the first stage finishes successfully while a second stage fails due to a system error, the first stage is still billed and the second is not.
This tutorial demonstrates how to use Nextflow pipelines on the DNAnexus Platform by importing a Nextflow pipeline from a remote repository or building from local disk space.
This documentation assumes you already have a basic understanding of how to develop and run a Nextflow pipeline. To learn more about Nextflow, consult the official Nextflow Documentation.
To run a Nextflow pipeline on the DNAnexus Platform:
Import the pipeline script from a remote repository or local disk
Convert the script to an app or applet
Run the app or applet
You can do this via either the user interface (UI) or the command-line interface (CLI), using the dx command-line client.
A Nextflow pipeline script is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A main Nextflow file, with the extension .nf, containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file.
(Optional) A nextflow.config file.
(Optional, recommended) A nextflow_schema.json file. If this file is present at the root folder of the Nextflow script when importing or building the executable, the input parameters described in the file are exposed as the built Nextflow pipeline applet's input parameters. See this section for more information on how the exposed parameters are used at run time.
(Optional) Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
An nf-core flavored folder structure is encouraged but not required.
To import a Nextflow pipeline via the UI, click on the Add button on the top-right corner of the project’s Manage tab, then expand the dropdown menu. Select the Import Pipeline/Workflow option.
Once the Import Pipeline/Workflow modal appears, enter the repository URL where the Nextflow pipeline source code resides, for example, "https://github.com/nextflow-io/hello". Then choose the desired project import location. If the repository is private, provide the credentials necessary for accessing it.
An example of the Import Pipeline/Workflow modal:
Once you’ve provided the necessary information, click the Start Import button and the import process will start as a pipeline import job, in the project specified in the Import To field (default is the current project).
After you've launched the import job, you'll see a status message "External workflow import job started" appear.
You can access information about the pipeline import job in the project’s Monitor tab:
Once the import is complete, you can find the imported pipeline executable as an applet. This is the output of the pipeline import job you previously ran:
You can find the newly created Nextflow pipeline applet (e.g. hello) in the project:
To import a Nextflow pipeline from a remote repository via the CLI, run the following command to specify the repository’s URL. Note that you can also provide optional information, such as a repository tag and an import destination:
If the Nextflow pipeline is in a private repository, use the --git-credentials option to provide the DNAnexus qualified ID or path of the credentials file on the Platform. Read more about this here.
Once the pipeline import job has finished, it generates a new Nextflow pipeline applet with an applet ID of the form applet-zzzz.
Use dx run -h to get more information about running the applet:
Through the CLI, you can also build a Nextflow pipeline applet from a pipeline script folder stored on local disk. For example, you may have a copy of the nextflow-io/hello pipeline from the Nextflow GitHub on your laptop, stored in a directory named hello, which contains the following files:
Ensure that the folder structure is in the required format, as described here.
To build a Nextflow pipeline applet using a locally stored pipeline script, run the following command and specify the path to the folder containing the Nextflow pipeline scripts. You can also provide optional information, such as an import destination:
This command packages the Nextflow pipeline script folder as an applet named hello with ID applet-yyyy, and stores the applet at the destination project and path project-xxxx:/applets2/hello. If an import destination is not provided, the current working directory is used.
The dx run -h command can be run to see information about this applet, similar to the example above.
A Nextflow pipeline applet has the type "nextflow" in its metadata. It behaves like a regular DNAnexus applet object, and can be shared with other DNAnexus users who have access to the project containing it.
For advanced information regarding the parameters of dx build --nextflow, run dx build --help in the CLI and see the Nextflow section for all arguments supported when building a Nextflow pipeline applet.
You can also build a Nextflow pipeline app from a Nextflow pipeline applet by running the command dx build --app --from applet-xxxx.
You can access a Nextflow pipeline applet from the Manage tab in your project, while the Nextflow pipeline app that you built can be accessed by clicking on the Tools Library option from the Tools tab. Once you click on the applet or app, the Run Analysis tab will be displayed. Fill out the required inputs/outputs and click the Start Analysis button to launch the job.
To run the Nextflow pipeline executable, use the dx run applet-xxxx or dx run app-xxxx command in the CLI and specify your inputs:
You can list and see the progress of the Nextflow pipeline job tree, which is structured as a head job with many subjobs, using the following command:
Each Nextflow pipeline executable run is represented as a job tree with one head job and many subjobs. The head job launches and supervises the entire pipeline execution. Each subjob is responsible for a process in the Nextflow pipeline. You can monitor the progress of the entire pipeline job tree by viewing the status of the subjobs (see example above).
To monitor the detailed logs of the head job and the subjobs, view each job's DNAnexus log via the UI or the CLI.
On the DNAnexus Platform, jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
Once your job tree is running, you can go to the Monitor tab to view the status of your job tree. From the Monitor tab, you can view the job log of the head job as well as the subjobs by clicking on the Log link in the row of the desired job. You can also view the costs (when your account has permission) and resource usage of a job.
An example of the log of a head job:
An example of the log of a subjob:
From the CLI, you can use the dx watch command to check the status and view the log of the head job or of each subjob.
Monitoring the head job:
Monitoring a subjob:
The Nextflow pipeline executable is launched as a job tree, with one head job running the Nextflow executor, and multiple subjobs running a single process each. Throughout the pipeline’s execution, the head job remains in “running” state and supervises the job tree’s execution.
When a Nextflow head job (i.e. job-xxxx) enters a terminal state (i.e. "done" or "failed"), a Nextflow log file named nextflow-<job-xxxx>.log is written to the destination path of the head job.
DNAnexus supports the Docker container engine for the Nextflow pipeline execution environment. The pipeline developer may refer to a public or a private Docker repository. When the pipeline references a private Docker repository, provide your Docker credentials file as the docker_creds file input to the Nextflow pipeline executable when launching the job tree.
Syntax of a private Docker credentials file:
For privacy reasons, it is encouraged to save this credentials file in a separate project that only a limited set of users has permission to access.
Below are all the means by which you can specify an input value at build time and at runtime. They are listed in order of precedence (items listed first have greater precedence and override items listed further down):
Executable (app or applet) run time
DNAnexus Platform app or applet input.
CLI example:
dx run project-xxxx:applet-xxxx -i reads_fastqgz=project-xxxx:file-yyyy
reads_fastqgz is an example of an executable input parameter name. All Nextflow pipeline inputs can be configured and exposed by the pipeline developer using an nf-core flavored pipeline schema file (nextflow_schema.json).
When an input parameter expects a file, you need to specify the value in the format corresponding to the class of the input parameter. When the input is of the "file" class, use a DNAnexus qualified ID (i.e. an absolute path to the file object, such as "project-xxxx:file-yyyy"); when the input is of the "string" class, use a DNAnexus URI ("dx://project-xxxx:/path/to/file"). See the table below for full descriptions of the PATH formats.
You can use dx run <app(let)> --help to query the class of each input parameter at the app(let) level. In the example code block below, fasta is an input parameter of the file class, while fasta_fai is an input parameter of the string class. You would then use the DNAnexus qualified ID format for fasta, and the DNAnexus URI format for fasta_fai.
The DNAnexus object class of each input parameter is based on the "type" and "format" specified in the pipeline's nextflow_schema.json, when it exists. See the additional documentation here to understand how a Nextflow input parameter's type and format (when applicable) convert to an app or applet input class.
It is recommended to always specify input values via the app/applet inputs; the Platform validates the input class and existence before the job is created.
All inputs of a Nextflow pipeline executable are set as "optional" inputs. This gives users the flexibility to specify inputs via other means.
Nextflow pipeline command line input parameter (i.e. nextflow_pipeline_params). This is an optional "string" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_pipeline_params="--foo=xxxx --bar=yyyy", where "--foo=xxxx --bar=yyyy" corresponds to the "--something value" pattern of Nextflow input specification referenced here.
Because nextflow_pipeline_params can carry parameters of string type with file-path format, use the DNAnexus URI format when a file is stored on DNAnexus.
Nextflow options parameter (i.e. nextflow_run_opts). This is an optional "string" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_run_opts="-profile test", where -profile is a single-dash-prefixed parameter that corresponds to the Nextflow run options pattern, specifying a preset input configuration.
Nextflow parameters file (i.e. nextflow_params_file). This is an optional "file" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_params_file=project-xxxx:file-yyyy, where project-xxxx:file-yyyy is the DNAnexus qualified ID of the file being passed to nextflow run -params-file <file>. This corresponds to the -params-file option of nextflow run.
Nextflow soft configuration override files (i.e. nextflow_soft_confs). This is an optional "array:file" class input, available on any Nextflow pipeline executable once it is built.
CLI example:
dx run project-xxxx:applet-xxxx -i nextflow_soft_confs=project-xxxx:file-1111 -i nextflow_soft_confs=project-xxxx:file-2222, where project-xxxx:file-1111 and project-xxxx:file-2222 are the DNAnexus qualified IDs of the files being passed to nextflow run -c <config-file1> -c <config-file2>. This corresponds to the -c option of nextflow run, and the order of this array of file inputs is preserved when they are passed to the nextflow run execution.
The soft configuration files can be used for assigning default values of configuration scopes (such as process).
It is highly recommended to use nextflow_params_file instead of nextflow_soft_confs for specifying parameter values, especially when running Nextflow DSL2 nf-core pipelines. Read more about this in the nf-core documentation.
Pipeline source code:
nextflow_schema.json
Pipeline developers may specify default values of inputs in the nextflow_schema.json file.
If an input parameter is of Nextflow’s string type with file-path format, use DNAnexus URI format when the file is stored on DNAnexus.
nextflow.config
Pipeline developers may specify default values of inputs in the nextflow.config file.
Pipeline developers may specify a default profile value using --profile <value> when building the executable, e.g. dx build --nextflow --profile test.
main.nf, sourcecode.nf
Pipeline developers may specify default values of inputs in the Nextflow source code files (*.nf).
If an input parameter is of Nextflow’s string type with file-path format, use the DNAnexus URI format when the file is stored on DNAnexus.
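The precedence order above amounts to taking the first source, in priority order, that defines a value for a parameter. The following is an illustrative sketch, not platform code; the source labels follow the list above, and the parameter names and values are hypothetical.

```python
# Illustrative sketch of input-value precedence (highest first), per the
# ordered list above. Parameter names and values are hypothetical.
PRECEDENCE = [
    "applet_input",               # dx run -i name=value
    "nextflow_pipeline_params",   # -i nextflow_pipeline_params="--name value"
    "nextflow_run_opts",          # -i nextflow_run_opts="-profile test"
    "nextflow_params_file",       # -i nextflow_params_file=file-yyyy
    "nextflow_soft_confs",        # -i nextflow_soft_confs=file-1111
    "nextflow_schema.json",       # defaults in the pipeline schema
    "nextflow.config",            # defaults in the configuration file
    "source_code",                # defaults in main.nf / *.nf files
]

def resolve(param, sources):
    """Return (value, source) from the highest-precedence source defining param."""
    for source in PRECEDENCE:
        if param in sources.get(source, {}):
            return sources[source][param], source
    return None, None

# A default in nextflow.config is overridden by a dx run -i input:
sources = {
    "nextflow.config": {"genome": "GRCh37"},
    "applet_input": {"genome": "GRCh38"},
}
print(resolve("genome", sources))  # ('GRCh38', 'applet_input')
```
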
While you can specify a file input parameter's value in the different places described above, the valid PATH format referring to the same file differs depending on the level (DNAnexus API/CLI level or Nextflow script level) and on the class (file object or string) of the executable's input parameter. Examples are given below.
Scenario 1: the app or applet input parameter class is a file object, or the value is given at the CLI/API level (e.g. dx run --destination PATH).
Valid PATH format: DNAnexus qualified ID (i.e. absolute path to the file object).
E.g. (file): project-xxxx:file-yyyy, project-xxxx:/path/to/file
E.g. (folder): project-xxxx:/path/to/folder/
Scenario 2: the app or applet input parameter class is a string, or the value is given in Nextflow configuration and source code files (e.g. nextflow_schema.json, nextflow.config, main.nf, sourcecode.nf).
Valid PATH format: DNAnexus URI.
E.g. (file): dx://project-xxxx:/path/to/file
E.g. (folder): dx://project-xxxx:/path/to/folder/
E.g. (wildcard): dx://project-xxxx:/path/to/wildcard_files
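The two formats can be captured in a small helper. This is an illustrative sketch, not a DNAnexus library function; the project and path values are placeholders.

```python
# Illustrative sketch: format a Platform file reference according to the
# input parameter class, per the two scenarios above.
def format_path(project, path, input_class):
    """Return the PATH string expected for a 'file' or 'string' class input."""
    if input_class == "file":
        # DNAnexus qualified ID: absolute path to the file object
        return f"{project}:{path}"
    if input_class == "string":
        # DNAnexus URI, as used inside Nextflow scripts and config files
        return f"dx://{project}:{path}"
    raise ValueError(f"unsupported input class: {input_class}")

print(format_path("project-xxxx", "/path/to/file", "file"))    # project-xxxx:/path/to/file
print(format_path("project-xxxx", "/path/to/file", "string"))  # dx://project-xxxx:/path/to/file
```
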
When launching a DNAnexus job, you can specify a job-level output destination (e.g. project-xxxx:/destination/) using the platform-level optional parameter in the UI or the CLI. In addition, when publishDir is specified in the pipeline, each output file is located at <dx_run_path>/<publishDir>/, where <dx_run_path> is the job-level output destination and <publishDir> is the path assigned per the Nextflow script's process.
Read more detail about the output folder specification and publishDir here. Find an example of how to construct the output paths of an nf-core pipeline job tree at run time in our FAQ.
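The resulting output location can be sketched as a simple path join (illustrative only; the destination, publishDir value, and filename are placeholders following the pattern above).

```python
# Illustrative sketch of where a published output file lands, per the
# <dx_run_path>/<publishDir>/ pattern described above.
def output_location(dx_run_path, publish_dir, filename):
    """Compose the Platform path of an output file published by a process."""
    return f"{dx_run_path.rstrip('/')}/{publish_dir.strip('/')}/{filename}"

print(output_location("project-xxxx:/destination/", "results/alignments", "sample1.bam"))
# project-xxxx:/destination/results/alignments/sample1.bam
```
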
You can have your Nextflow pipeline runs use an Amazon Web Services (AWS) S3 bucket as a work directory. To do this, follow the steps outlined below.
Follow the steps outlined here to configure your AWS account to trust the Platform as an OIDC identity provider. Be sure to take note of the value you enter in the "Audience" field. You'll need to use this value in a configuration file used by your pipeline, to enable pipeline runs to access the S3 bucket in question.
Next, configure an AWS Identity and Access Management (IAM) role such that its permissions and trust policies allow Platform jobs that assume this role to access and use resources in the S3 bucket in question.
The following example shows how to structure an IAM role's permission policy to enable the role to use an S3 bucket, accessible via the S3 URI s3://my-nextflow-s3-workdir, as the work directory of Nextflow pipeline runs:
Note in the above example:
The "Action" section contains a list of the actions the role is allowed to perform, including deleting, getting, listing, and putting objects.
The two entries in the "Resource" section enable the role to access all resources in the bucket accessible via the S3 URI s3://my-nextflow-s3-workdir.
The following example shows how to configure an IAM role's trust policy, to allow only properly configured Platform jobs to assume the role:
Note in the above example:
To assume the role, a job must be launched from within a specific Platform project (in this case, project-xxxx).
To assume the role, a job must be launched by a specific Platform user (in this case, user-aaaa).
Via the "Federated" setting in the "Principal" section, the policy configures the role to trust the Platform as an OIDC identity provider, accessible at job-oidc.dnanexus.com.
Next, configure your pipeline so that when it runs, it can access the S3 bucket in question. To do this, add, in a configuration file, a dnanexus config scope that includes the properties shown in this example:
Note in the above example:
workDir is the path to the bucket to be used as a work directory, in S3 URI format.
jobTokenAudience is the "Audience" value you defined in Step 1 above.
jobTokenSubjectClaims is an ordered, comma-separated list of DNAnexus job identity token custom claims (for example, "project_id, launched_by") that the job must present in order to assume the role that enables bucket access.
iamRoleArnToAssume is the Amazon Resource Name (ARN) of the role that you configured in Step 2 above, which jobs will assume in order to access the bucket.
You also need to configure your pipeline to access the bucket within the appropriate AWS region, which you specify via the region parameter within an aws config scope.
When configuring the trust policy for the role that allows access to the S3 bucket, use custom subject claims to control which jobs can assume this role. Here are some typical combinations that we recommend, with their implications:
The list below shows, for each value of StringEquals:job-oidc.dnanexus.com/:sub, which jobs can assume the role that enables bucket access:
project_id;project-xxxx - any Nextflow pipeline job running in project-xxxx
launched_by;user-aaaa - any Nextflow pipeline job launched by user-aaaa
project_id;project-xxxx;launched_by;user-aaaa - any Nextflow pipeline job launched by user-aaaa in project-xxxx
bill_to;org-zzzz - any Nextflow pipeline job billed to org-zzzz
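The relationship between the ordered claims and the sub value matched by the trust policy can be sketched as follows. This is illustrative only: the claim names follow this documentation, and the job's claim values here are hypothetical.

```python
# Illustrative sketch: how the ordered custom subject claims combine into
# the sub value matched by StringEquals:job-oidc.dnanexus.com/:sub.
def compose_sub(claim_names, job_claims):
    """Interleave claim names with the job's values, joined by semicolons."""
    parts = []
    for name in claim_names:          # order must match the trust policy
        parts.extend([name, job_claims[name]])
    return ";".join(parts)

job_claims = {"project_id": "project-xxxx", "launched_by": "user-aaaa"}
print(compose_sub(["project_id", "launched_by"], job_claims))
# project_id;project-xxxx;launched_by;user-aaaa
```
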
Having included custom subject claims in the trust policy for the role in question, you then need, in the aforementioned Nextflow configuration file, to set the value of jobTokenSubjectClaims to a comma-separated list of claims, in the same order in which you entered them in the trust policy.
For example, if you configured a role's trust policy as in the above example, you are requiring a job, in order to assume the role, to present the custom subject claims project_id and launched_by, in that order. In your Nextflow configuration file, set the value of jobTokenSubjectClaims, within the dnanexus config scope, as follows:
Note that you must also, within the dnanexus config scope, set the value of iamRoleArnToAssume to that of the appropriate role:
By default, the Platform limits apps' and applets' ability to read and write data. Nextflow pipeline apps and applets have the following capabilities as exceptions to these limits:
External internet access ("network": ["*"]) - required for Nextflow pipeline apps and applets to pull Docker images from external Docker registries at runtime.
UPLOAD access to the project in which a Nextflow pipeline job is run ("project": "UPLOAD") - required for Nextflow pipeline jobs to record the progress of executions and to preserve the run cache, which enables resume functionality.
You can modify a Nextflow pipeline app or applet's permissions by overriding the default values when building from local disk, using the --extra-args flag with dx build. An example:
In this example, note:
"network": [] prevents jobs from accessing the internet.
"allProjects": "VIEW" raises jobs' access permission level to VIEW. This means that each job has "read" access to all projects accessible to the user running the job. Use this carefully. This permission setting can be useful when expected input file PATHs are provided as DNAnexus URIs (via a samplesheet.csv, for example) from projects other than the one in which a job is being run.
There are additional options for dx build --nextflow:
--profile PROFILE (string): sets the default profile for the Nextflow pipeline executable.
--repository REPOSITORY (string): specifies a Git repository of a Nextflow pipeline. Incompatible with --remote.
--repository-tag TAG (string): specifies a tag for the Git repository. Can be used only with --repository.
--git-credentials GIT_CREDENTIALS (file): a file containing credentials for accessing a private Git repository.
--cache-docker (flag): stores a container image tarball in the currently selected project in /.cached_dockerImages. Currently only the Docker engine is supported. Incompatible with --remote.
--nextflow-pipeline-params NEXTFLOW_PIPELINE_PARAMS (string): custom pipeline parameters to be referenced when collecting the Docker images.
--docker-secrets DOCKER_SECRETS (file): a dx file ID with credentials for a private Docker repository.
Use dx build --help
for more information.
When the Nextflow pipeline to be imported is from a private repository, you must provide a file object that contains the credentials needed to access the repository. Via the CLI, use the --git-credentials flag, and format the object as follows:
When building a Nextflow pipeline executable, you can replace any Docker container with a Platform file object in tarball format. These Docker tarball objects serve as substitutes for referencing external Docker repositories.
This approach enhances the provenance and reproducibility of the pipeline by minimizing reliance on external dependencies, thereby reducing associated risks. Additionally, it fortifies data security by eliminating the need for internet access to external resources, during pipeline execution.
Two methods are available for preparing Docker images as tarball file objects on the Platform: built-in Docker image caching, or manually preparing the tarballs.
Requires running a "building job" with external internet access?
Built-in Docker image caching: yes, if building an applet for the first time or if any image is to be updated; no internet access is required upon rebuild.
Manually preparing the tarballs: no.
Docker images packaged as bundledDepends?
Built-in Docker image caching: yes; Docker images that will be used in the execution are cached and bundled at build time.
Manually preparing the tarballs: no; Docker tarballs are resolved at runtime.
At runtime
Built-in Docker image caching: the job attempts to access the Docker image cached as bundledDepends. If this fails, the job attempts to find the image on the Platform. If this fails, the job tries to pull the image from the external repository, via the internet.
Manually preparing the tarballs: the job attempts to locate the Docker image at the referenced Docker cache path. If this fails, the job attempts to pull from the external repository, via the internet.
This method initiates a building job that begins by taking the pipeline script, then identifying Docker containers by scanning the script's source code based on the final execution tree. Next, the job converts the containers to tarballs and saves those tarballs to the project in which the job is running. Finally, the job builds the Nextflow pipeline executable, bundling in the tarballs as bundledDepends.
You can use built-in caching via the CLI by passing the --cache-docker flag at build time. All cached Docker tarballs are stored as file objects within the Docker cache path, at project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>.
An example:
If you need to access a Docker container stored in a private repository, you must provide, along with the --docker-secrets flag, a file object that contains the credentials needed to access the repository. This object must be in the following format:
You can manually convert Docker images to tarball file objects. Within Nextflow pipeline scripts, you must then reference the location of each such tarball in one of the following three ways:
Reference each tarball by its unique Platform ID (e.g. dx://project-xxxx:file-yyyy). Use this approach if you want deterministic execution behavior. You can use Platform IDs in Nextflow pipeline scripts (*.nf) or configuration files (*.config), as follows:
Within a Nextflow pipeline script, you can also reference a Docker image by using its full image name. Use this name within a path that's in the following format: project-xxxx:/.cached_docker_images/<image_name>/<image_name>_<version>
An example:
Note that no file extension is necessary, and that project-xxxx is the project where the Nextflow pipeline executable was built and will be executed. For .cached_docker_images, substitute the name of the folder in which these images have been stored. Note as well that an exact <version> reference must be included; latest is not an accepted tag in this context.
Here are several examples of tarball file object paths and names, as constructed from image names and version tags:
Image quay.io/biocontainers/tabix, version 1.11--hdfd78af_0: project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
Image python, version 3.9-slim: project-xxxx:/.cached_docker_images/python/python_3.9-slim
Image python, version latest: the Nextflow pipeline job will attempt to pull from the remote external registry
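The path construction in these examples can be sketched as follows. This is an illustrative helper, not a dx command, and it assumes the default cache folder name used in the examples above.

```python
# Illustrative sketch: build the expected tarball file object path from a
# Docker image name and version tag, per the examples above.
def cached_image_path(project, image, version, cache_folder=".cached_docker_images"):
    """Return the cache path for an image, or None for the 'latest' tag."""
    name = image.rsplit("/", 1)[-1]   # drop any registry/namespace prefix
    if version == "latest":
        return None  # not accepted: the job would pull from the remote registry
    return f"{project}:/{cache_folder}/{name}/{name}_{version}"

print(cached_image_path("project-xxxx", "quay.io/biocontainers/tabix", "1.11--hdfd78af_0"))
# project-xxxx:/.cached_docker_images/tabix/tabix_1.11--hdfd78af_0
print(cached_image_path("project-xxxx", "python", "3.9-slim"))
# project-xxxx:/.cached_docker_images/python/python_3.9-slim
```
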
You can also reference Docker images in pipeline scripts by digest, for example <Image_name>@sha256:XYZ123…. To refer to a tarball file on the Platform in this way, the file object must have been assigned a property image_digest, for example "image_digest":"<IMAGE_DIGEST_HERE>".
An example:
Based on the input parameter's type and format (when applicable) defined in the corresponding nextflow_schema.json file, each parameter is assigned the corresponding class (ref1, ref2).
The mapping from a Nextflow input parameter's type and format (as defined in nextflow_schema.json) to a DNAnexus input parameter class is:
type string, format file-path: class file
type string, format directory-path: class string
type string, format path: class string
type string, no format: class string
type integer: class int
type number: class float
type boolean: class boolean
type object: class hash
As a pipeline developer, you can specify a file input variable as {"type":"string", "format":"file-path"} or {"type":"string", "format":"path"}, which will be assigned to the "file" or "string" class, respectively. When running the executable, based on the class (file or string) of the executable's input parameter, you will use a specific PATH format to specify the value. See the documentation here for the acceptable PATH format for each class.
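For illustration, a hypothetical nextflow_schema.json fragment (the parameter names are made up) declaring one file-class and one string-class input:

```json
{
  "properties": {
    "input_vcf":   { "type": "string", "format": "file-path" },
    "sample_name": { "type": "string" }
  }
}
```

Here input_vcf would surface as a file-class input on DNAnexus, while sample_name would surface as a plain string-class input.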
When converting a file reference from a URL format (e.g. dx://project-xxxx:/path/to/file, a DNAnexus URI) to a String, use the method toUriString(). The method toURI().toString() does not give the same result, as toURI() removes the context ID (e.g. project-xxxx), and toString() removes the scheme (e.g. dx://). More info about the Nextflow methods is available here.
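As an illustration (the variable name and path are placeholders), for a Nextflow path object resolved from a DNAnexus URI:

```nextflow
// myFile was resolved from dx://project-xxxx:/path/to/file
myFile.toUriString()      // keeps scheme and context: dx://project-xxxx:/path/to/file
myFile.toURI().toString() // loses the context ID (project-xxxx)
myFile.toString()         // loses the scheme (dx://)
```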
All files generated by a Nextflow job tree will be stored in its session's corresponding workDir (i.e. the path where the temporary results are stored). On DNAnexus, when the Nextflow pipeline job is run with preserve_cache=true, the workDir is set to the path project-xxxx:/.nextflow_cache_db/<session_id>/work/, where project-xxxx is the project where the job took place; you can follow this path to access all preserved temporary results. Access to these results is useful for investigating detailed pipeline progress, and for resuming job runs during pipeline development. More info about workDir is described here.
When the Nextflow pipeline job is run with preserve_cache=false (the default), temporary files are stored in the job's temporary workspace, which is deconstructed when the head job enters a terminal state (i.e. "done", "failed", or "terminated"). Since many of these files are intermediate inputs and outputs passed between processes, and are expected to be cleaned up once the job is completed, running with preserve_cache=false helps reduce project storage costs for files that are not of interest, and saves you from having to remember to clean up all temporary files.
To save the final results of interest, and to display them as the Nextflow pipeline executable’s output, you can declare output files matching the declaration under the script’s output:
block, and use Nextflow’s optional publishDir directive to publish
them.
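A minimal sketch (the process name, paths, and file names are illustrative) of declaring an output and publishing it with the copy mode, the only publish mode supported on DNAnexus:

```nextflow
process SUMMARIZE {
    // Publish declared outputs to a relative folder; only mode 'copy'
    // is supported on DNAnexus. The path here is illustrative.
    publishDir path: './results/summary/', mode: 'copy'

    output:
    path 'summary.txt'

    script:
    """
    echo "pipeline summary" > summary.txt
    """
}
```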
This makes the published output files the Nextflow pipeline head job's output, under the executable's formally defined placeholder output parameter, published_files, of class array:file. The files will then be organized under the relative folder structure assigned via publishDir. This works for both preserve_cache=true and preserve_cache=false. Only the "copy" publish mode is supported on DNAnexus.
At pipeline development time, the valid value of publishDir can be:
A local path string, e.g. publishDir path: './path/to/nf/publish_dir/'
A dynamic string value defined as a pipeline input parameter (e.g. params.outdir, where outdir is a string-class input), allowing pipeline users to determine parameter values at runtime. For example, publishDir path: '${params.outdir}/some/dir/', './some/dir/${params.outdir}/', or './some/dir/${params.outdir}/some/dir/'.
When publishDir
is defined this way, the user who launches the Nextflow pipeline executable is responsible for constructing the publishDir
to be a valid relative path.
Find an example on how to construct output paths for an nf-core pipeline job tree at run time from our FAQ.
The queueSize option is part of Nextflow's executor configuration. It defines how many tasks the executor handles in parallel. On DNAnexus, this represents the number of subjobs created at a time (5 by default) by the Nextflow pipeline executable's head job. If the pipeline's executor configuration assigns a value to queueSize, it overrides the default. If the value exceeds the upper limit on DNAnexus (1000), the root job will error out. See the Nextflow executor configuration page for examples.
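A sketch of overriding the default in nextflow.config (the value 20 is illustrative; it must not exceed the DNAnexus limit of 1000):

```nextflow
// nextflow.config
executor {
    queueSize = 20  // create up to 20 subjobs at a time instead of the default 5
}
```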
The head job of the job tree defaults to running on instance type mem2_ssd1_v2_x4 in AWS regions and azure:mem2_ssd1_x4 in Azure regions. Users can change the head job to a different instance type, but this is not recommended. The head job executes and monitors the subjobs; changing its instance type does not affect the computing resources available to the subjobs, where most of the heavy computation takes place (see below for where to configure instance types for Nextflow processes). Changing the head job's instance type may be necessary only if it runs out of memory or disk space when staging input files, or when collecting and uploading pipeline output files to the project.
Each subjob's instance type is determined by the profile information provided in the Nextflow pipeline script. You can specify a required instance by instance type name via Nextflow's machineType directive (example below), or via a set of system requirements (e.g. cpus, memory, disk, etc.) according to the official Nextflow documentation. The executor chooses the instance type that matches the minimal requirements described in the Nextflow pipeline profile using the following logic:
Choose the cheapest instance that satisfies the system requirements.
Use only SSD instance types.
All else being equal (price and instance specifications), prefer a version 2 (v2) instance type.
Order of precedence for subjob instance type determination:
The value assigned to machineType
directive.
Values assigned to cpus
, memory
, and disk
directives in their configuration.
An example command for specifying machineType
by DNAnexus instance type name is provided below:
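A sketch of pinning a process to a DNAnexus instance type by name in the pipeline configuration (the process selector is hypothetical; mem2_ssd1_v2_x8 is one example of a DNAnexus instance type name):

```nextflow
// nextflow.config
process {
    withName: 'ALIGN_READS' {
        machineType = 'mem2_ssd1_v2_x8'  // DNAnexus instance type name
    }
}
```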
Nextflow's resume feature enables skipping processes that finished successfully and were cached in previous runs. The new run can jump directly to downstream processes without starting from the beginning of the pipeline. By retrieving cached progress, Nextflow resume saves pipeline developers both time and compute costs. It is helpful for testing and troubleshooting while building and developing a Nextflow pipeline.
Nextflow uses a scratch storage area for caching and preserving each task's temporary results. This directory is called the "working directory", and its path is defined by:
The session id, a universally unique identifier (UUID) associated with the current execution
Each task's unique hash ID: a hash composed of the task's input values, input files, command line string, container ID (e.g. Docker image), conda environment, environment modules, and executed scripts in the bin directory, when applicable.
You can utilize the Nextflow resume feature with the following Nextflow pipeline executable parameters:
preserve_cache
Boolean type. Default value is false. When set to true, the run is cached in the current project for future resumes. For example:
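A sketch of a launch command (the applet name is a placeholder, and any other required pipeline inputs are omitted):

```shell
dx run applet-xxxx -i preserve_cache=true
```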
This enables the Nextflow job tree to preserve cached information as well as all temporary results in the project where it is executed under the following paths, based on its session ID
and each subjob’s unique ID.
The session's cache directory, containing information on the location of the workDir, the session progress, etc., is saved to project-xxxx:/.nextflow_cache_db/<session_id>/cache.tar, where project-xxxx is the project where the job tree is executed.
Each task's working directory is saved to project-xxxx:/.nextflow_cache_db/<session_id>/work/<2digit>/<30characters>/, where <2digit>/<30characters>/ is the task's unique hash ID, and project-xxxx is the project where the job tree is executed.
resume
String type. Default value is an empty string, in which case the run starts from scratch. When assigned a session id, the run resumes from what is cached for that session id in the project. When assigned "true" or "last", the run determines the session id corresponding to the latest valid execution in the current project and resumes from it. For example:
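Sketches of launch commands (the applet name and session ID are placeholders):

```shell
# Resume from a specific session's cache:
dx run applet-xxxx -i resume=<session_id>
# Or resume from the latest valid execution in the current project:
dx run applet-xxxx -i resume=last
```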
Below are four possible scenarios and the recommended use cases for -i resume:

| Scenario | Parameters | Use Cases | Note |
| --- | --- | --- | --- |
| 1 (default) | resume="" (empty string) and preserve_cache=false | Production data processing; most high-volume use cases | |
| 2 | resume="" (empty string) and preserve_cache=true | Pipeline development; only happens for the first few pipeline tests | During development, it is useful to see all intermediate results in workDir. Only up to 20 Nextflow sessions can be preserved per project. |
| 3 | resume=<session_ID>/"true"/"last" and preserve_cache=false | Pipeline development; pipeline developers can investigate the job workspace with --delay_workspace_destruction and --ssh | |
| 4 | resume=<session_ID>/"true"/"last" and preserve_cache=true | Pipeline development; only happens for the first few tests | Only 1 job with the same <session_ID> can run at any given time. |
It is good practice to clean up the workDir frequently to save on storage costs. A maximum of 20 sessions can be preserved in a DNAnexus project. If you exceed the limit, the job will generate an error with the following message:
"The number of preserved sessions is already at the limit (N=20) and preserve_cache is true. Please remove the folders in <project-id>:/.nextflow_cache_db/ to be under the limit, if you want to preserve the cache of this run."
To clean up all preserved sessions under a project, you can delete the entire .nextflow_cache_db folder. To clean up a specific session's cached folder, you can delete that session's .nextflow_cache_db/<session_id>/ folder. To delete a folder in the UI, follow the documentation on deleting objects. To delete a folder in the CLI, you can run:
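For example, to delete a specific session's cached folder (the project and session IDs are placeholders):

```shell
dx rm -r project-xxxx:/.nextflow_cache_db/<session_id>/
```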
Note that deleting an object in the UI or via the CLI dx rm cannot be undone. Once a session's work directory is deleted or moved, subsequent runs will not be able to resume from that session.
For each session, only one job at a time is allowed to resume the session's cached results and preserve its own progress to that session. There is no limit on multiple jobs resuming and preserving multiple different sessions, as long as each job preserves a different session. There is also no limit on multiple jobs resuming the same session, as long as at most one of them preserves its progress to that session.
Nextflow's errorStrategy directive allows you to define how an error condition is managed by the Nextflow executor at the process level. When an error status is returned, by default the process and any other pending processes stop immediately (i.e. errorStrategy terminate), and this in turn forces the entire pipeline execution to be terminated.
The Nextflow executor has four error strategy options: terminate, finish, ignore, and retry. Below is a summary of the behavior of each strategy for the errored subjob, the head job, and all other subjobs (i.e. subjobs that have not yet entered their terminal states).

terminate
Errored subjob: Job properties are set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"self". Ends in the "failed" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"terminate", "nextflow_errored_subjob":"job-xxxx", "nextflow_terminated_subjob":"job-yyyy, job-zzzz", where job-xxxx is the errored subjob, and job-yyyy and job-zzzz are the other subjobs that were terminated due to this error. Ends in the "failed" state immediately, with error message "Job was terminated by Nextflow with terminate errorStrategy for job-xxxx, check the job log to find the failure".
All other subjobs: End in the "failed" state immediately.

finish
Errored subjob: Job properties are set with "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"finish", "nextflow_errored_subjob":"job-xxxx, job-2xxx", where job-xxxx and job-2xxx are the errored subjobs. Creates no new subjobs after the time of the error. Ends in the "failed" state eventually, after other existing subjobs enter their terminal states, with error message "Job was ended with finish errorStrategy for job-xxxx, check the job log to find the failure.".
All other subjobs: Keep running until they enter their terminal states. If an error occurs in any of these subjobs (e.g. job-2xxx), the finish errorStrategy is applied to that subjob, because a finish errorStrategy was hit first, ignoring any other error strategies set in the pipeline's source code or configuration, per Nextflow's default behavior.

retry
Errored subjob: Job properties are set with "nextflow_errorStrategy":"retry", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Spins off a new subjob that retries the errored job, with the job name <name> (retry: <RetryCount>), where <name> is the original subjob name and <RetryCount> is the order of this retry (e.g. retry: 1, retry: 2). Ends in a terminal state ("done", "failed", or "terminated") depending on the terminal states of the other currently existing subjobs.
All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob's corresponding Nextflow process is applied.

ignore
Errored subjob: Job properties are set with "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"self". Ends in the "done" state immediately.
Head job: Job properties are set with "nextflow_errorStrategy":"ignore", "nextflow_errored_subjob":"job-1xxx, job-2xxx". Shows "subjob(s) <job-1xxxx>, <job-2xxxx> runs into Nextflow process errors' ignore errorStrategy were applied" at the end of the job log. Ends in a terminal state ("done", "failed", or "terminated") depending on the terminal states of the other currently existing subjobs.
All other subjobs: Keep running until they enter their terminal states. If an error occurs in one of these subjobs, the errorStrategy set in that subjob's corresponding Nextflow process is applied.
When more than one errorStrategy directive applies within a pipeline job tree, the following rules are applied, depending on which errorStrategy is triggered first.
When terminate is the first errorStrategy directive triggered in a subjob, all other ongoing subjobs enter the "failed" state immediately.
When finish is the first errorStrategy directive triggered in a subjob, any other errorStrategy reached in the remaining ongoing subjob(s) will also apply the finish errorStrategy, ignoring any other error strategies set in the pipeline's source code or configuration.
When retry is the first errorStrategy directive triggered in a subjob, any terminate, finish, or ignore errorStrategy triggered in the remaining subjobs is applied to the corresponding subjob.
When ignore is the first errorStrategy directive triggered in a subjob, any terminate, finish, or retry errorStrategy that applies in the remaining subjob(s) is applied to the corresponding subjob.
Independent of Nextflow process-level error conditions, when a Nextflow subjob encounters a platform-related restartable error, such as "ExecutionError", "UnresponsiveWorker", "JMInternalError", "AppInternalError", or "JobTimeoutExceeded", the subjob follows the executionPolicy assigned to it and restarts itself. The restart does not start over from the head job.
A: You can find the errored subjob’s job ID from the head job’s nextflow_errored_subjob
and nextflow_errorStrategy
properties to investigate which subjob failed and which errorStrategy
was applied. To query these errorStrategy
related properties in CLI, you can run the following command:
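A sketch, where job-xxxx is the head job's ID (the jq variant assumes jq is installed):

```shell
# Human-readable description, including job properties:
dx describe job-xxxx
# Or, to extract only the properties (assumes jq is installed):
dx describe job-xxxx --json | jq '.properties'
```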
where job-xxxx
is the head job’s job ID.
Once you find the errored subjob, you can investigate the job log on the Monitor page by accessing the URL "https://platform.dnanexus.com/projects/<projectID>/monitor/job/<jobID>", where jobID is the subjob's ID (e.g. job-yyyy), or watch the job log in the CLI using dx watch job-yyyy.
If you had the preserve_cache value set to true when starting the Nextflow pipeline executable, you can trace the cached workDir (e.g. project-xxxx:/.nextflow_cache_db/<session_id>/work/) and investigate the intermediate results of the run.
A: You can find the Nextflow version used by reading the log of the head job. Each built Nextflow executable is locked to a specific version of the Nextflow executor.
A: DNAnexus supports Docker as the container runtime for Nextflow pipeline applets. It is recommended to set docker.enabled=true
in the Nextflow pipeline configuration, which enables the built Nextflow pipeline applet to execute the pipeline using Docker.
A: There are many possible causes for the head job to hang. One known cause is the trace report file being written directly to a DNAnexus URI (e.g. dx://project-xxxx:/path/to/file). To avoid this, we suggest specifying -with-trace path/to/tracefile (using a local path string) in the Nextflow pipeline applet's nextflow_run_opts input parameter.
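A sketch (the applet name and trace path are placeholders):

```shell
dx run applet-xxxx -i nextflow_run_opts="-with-trace ./trace.txt"
```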
Taking nf-core/sarek (3.3.1) as an example, start with reading the pipeline's logic:
The pipeline's publishDir
is constructed with a prefix of the params.outdir
variable followed by each task's name for each subfolder:
publishDir = [ path: { "${params.outdir}/${...}" }, ... ]
params.outdir is a required input parameter to the pipeline, and the default value of params.outdir is null. The user running the corresponding Nextflow pipeline executable must specify a value for params.outdir, which will:
Meet the input requirement for executing the pipeline.
Resolve the value of publishDir, with outdir as the leading path and each task's name as the subfolder name.
To specify a value of params.outdir
for the Nextflow pipeline executable built from the nf-core/sarek
pipeline script, you can use the following command:
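A sketch, assuming the executable exposes outdir as an input parameter (the applet name and path are placeholders):

```shell
dx run applet-xxxx -i outdir='local/to/outdir'
```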
You can also set a job tree's output destination using --destination:
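For example (the applet name and paths are placeholders, chosen to match the output paths listed below):

```shell
dx run applet-xxxx \
  -i outdir='local/to/outdir' \
  --destination project-xxxx:/path/to/jobtree/destination/
```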
The above command will construct the final output paths in the following manner:
project-xxxx:/path/to/jobtree/destination/ as the destination of the job tree's shared output folder.
project-xxxx:/path/to/jobtree/destination/local/to/outdir as the shared output folder of the all tasks/processes/subjobs of this pipeline.
project-xxxx:/path/to/jobtree/destination/local/to/outdir/<task_name> as the output folder of each specific task/process/subjob of this pipeline.
Get an overview of the range of different charts you can build and use in the Cohort Browser.
While working in the Cohort Browser, you can visualize data using a variety of different types of charts.
The following single-variable chart types are available in the Cohort Browser:
The following multi-variable chart types are available in the Cohort Browser:
In all charts used in the Cohort Browser, a chart total count is displayed under the chart's title. This figure represents the number of records for which data is displayed in the chart. The label - "Participants" in the chart shown below - indicates the entity to which the data relates.
This figure is not always the same as the number of records in the cohort.
In a single-variable chart, if the field in question is empty or contains a null value for a given record, that record will not be included in the total, as its data can't be visualized. If any such records exist in the cohort, an "i" warning icon will appear next to the chart total figure. Hover over the icon to show a tooltip with information about records that aren't included in the total.
The same holds for multi-variable charts. If any record contains a null value in either of the selected fields, or if either field is empty, that record won't be included in the chart total count, as its data can't be visualized.
Learn to build and use histograms in the Cohort Browser.
Histograms can be used to visualize numerical, date, and datetime data.
In a histogram in the Cohort Browser, each vertical bar represents the count of records in a particular "bin." Each bin groups records that share the same value or very similar values, in a particular field.
The Cohort Browser automatically groups records into bins, based on the distribution of values in the dataset, for the field in question. Values are distributed in a linear fashion, on the x axis.
Below is a sample histogram showing the distribution of values in a field Critical care total days. Note the label under the chart title, indicating the number of records (203) for which values are shown, and the name of the entity ("RNAseq Notes") to which the data relates.
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a histogram. In such cases, you'll see the following informational message below the chart:
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
In Cohort Compare mode, histograms can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the distributions are overlaid one atop another. Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Integer
Integer Sparse
Float
Float Sparse
Date
Date Sparse
Datetime
Datetime Sparse
Learn to build and use row charts in the Cohort Browser.
Row charts can be used to visualize categorical data.
When creating a row chart, note that:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values cannot be organized in a hierarchy
In a row chart, each row shows a single category value, along with the number of records - the "count" - in which that value appears in the selected field. Also shown is the percentage of total cohort records in which it appears - its "freq." or "frequency."
Below is a sample row chart showing the distribution of values in a field Salt added to food. Note that in the current cohort selection of 100,000 participants, 27,979 records contain the value "Sometimes", which represents 27.98% of the current cohort size.
String Categorical
String Categorical Sparse
String Categorical Multi-select
Integer Categorical
Integer Categorical Multi-select
An overview of the Cohort Browser's key features and how to use them.
DNAnexus Apollo builds on the technological foundation of the core DNAnexus Platform to offer scientists and bioinformaticians an environment to store and query large sets of genomic, phenotypic, multi-omic, and other structured data. Researchers can bring their data to the Platform and leverage DNAnexus apps to ingest the data into queryable databases.
The Cohort Browser dashboard can show up to three tabs based on the configuration of the dataset: Overview, Data Preview, and either Genomics (if the dataset contains germline genomic data) or Somatic Variants (if the dataset contains somatic variant data). Tabs are loaded as the user clicks on them, so if there is no change in filtering, the tabs will stay cached and will not need to reload.
From the project where a dataset is located, go to the Manage tab and select your dataset of interest. Click the Explore Data action to open the dataset in the Cohort Browser.
You can also access datasets via the Datasets page, which is located under the Projects menu. The Datasets page displays all datasets you have access to, and enables you to browse and find a specific dataset without navigating through projects.
You can use the optional information panel to view further information about a selected dataset, including creator, sponsorship, etc.
In the Cohort Browser's Overview tab, you'll see visualizations that provide an introduction to the dataset, and insights on the data it contains.
To create and view a chart visualizing data in a field, click the Add Tile button. The Add Tile dialog will open, showing a hierarchical view of all the data fields available in the dataset.
Browse the list or search an item by its title to narrow down the list.
Select a data field from the list. In the Data Field Details panel, you can see metadata on the selected data field, visualization preview, as well as options to customize chart types.
Confirm selection via the Add as Tile action. The new tile will appear on your dashboard.
Once you've selected a primary data field in the Add Tile dialogue, you can add a secondary data field by clicking on the + icon next to an eligible secondary data field.
This video provides a detailed overview of exploring new datasets using the Cohort Browser:
From the cohort which you wish to edit, click on Add Filter button.
Select a data field you want to filter by, confirm by clicking on Add as Filter.
Select operators and enter values to filter by. Click on Apply Filter to confirm.
Filters added are displayed in corresponding cohort panels. You can edit a specific filter at any time by clicking on it, which brings up the Edit Filter dialogue.
The default logical operator is 'AND'. To switch the operator to 'OR', click on the operator. For a filter group (a set of filters tied to 1 specific entity), all operators will be the same: all 'OR' or all 'AND'.
Once filters are added or edited, an updated cohort size will appear under the name of the affected cohort. The dashboard will also auto-refresh to fetch updated results based on the latest cohort selection.
If your dataset includes germline genomic data, then you will have the option to add a genomic filter to your cohort.
From the cohort you wish to edit, click on Add Filter button.
Toggle to Geno tab.
Edit filter in Edit Genomic Filter dialogue by one of the following criteria:
Filter by genes and variant effects: Filter your dataset by variants of certain types and consequences within specified genes and/or genomic ranges. A maximum of 5 genes/ranges can be entered.
Filter by a list of variant IDs. A maximum of 100 variants can be entered.
If more than 1 range, gene, or variant is added, the values should be comma separated or each value must be on a new line.
Confirm edit by clicking on Apply Geno Filter button.
Similarly to the other cohort filters, a genomic filter is applied to the main entity of your dataset (in most cases, patients or participants).
For datasets with canonical transcript information available, an additional toggle will appear in the Genomic Filter dialogue titled "Match effects for canonical transcript only" which may be set to YES initially in order to restrict the results only to variants that have canonical information available.
In the Add Filter / Edit Filter pane, you will see options enabling you to:
Filter by genes and variant effects.
Filter by a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol.
Filter by a list of variant IDs. A maximum of 10 variants can be entered.
Note that for each somatic variant filter, you can specify if matching variants are to be used as inclusion criteria or exclusion criteria for your cohort. By default, you will be selecting patients who have at least one detected variant that matches the specified criteria. To select patients or participants who do not have any matching variant, click the “WITH” dropdown button and change its value to “WITHOUT”.
You can create up to 10 somatic variant filters for each cohort.
When working with datasets that have multiple data entities, you can create a join filter by selecting data fields from a secondary entity and adding them as filters. An entity is a grouping of data around a unique item, event, or a concept: e.g. patient, visit, medication, laboratory tests.
Join Filters are displayed as subrows deriving from the main entity. Depending on the entity to which your selected data field belongs, a join filter that reflects the relationship between those entities will be automatically created. To create a new cohort criterion using join filters, click + Add filter or Filter > Add filter on a tile. To add additional criteria to an existing criterion in a join, click Add additional criteria inline on the row of the chosen filter.
You can choose between the 'AND' or 'OR' logical operators when creating a cohort and comparing join filters. To switch between them, click on the logical operator. For a specific level of join filtering, joins are either all 'AND' or all 'OR'. Note that even when using 'OR' for two join filters, the implication that "this criteria exists" precedes the join level, i.e. “where exists, join 1 or join 2”.
Once a join filter is created, you can further define the secondary entity by adding additional criteria to the branch, or adding more layers of join filters deriving from the current branch. As you add more layers, the field selector automatically hides fields that are ineligible to be added based on the join.
For an example of interpreting join filters, consider the following:
The First Example cohort identifies all patients with a "high" or "medium" risk level who have a first (visit instance = 1) hospital visit and who also have had a lab test that was a "nasal swab". This lab test does not necessarily have to be conducted at the time of the patient’s first hospital visit. In the Second Example, the cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed on the first visit.
This video provides an overview of setting up your dashboard as part of defining and refining a cohort:
You can create complex cohorts by combining existing cohorts from the same dataset. Cohort combine can be accessed via the “Compare / Combine Cohorts” menu located at the top of the page.
The Cohort Browser supports the following combination logic:
Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.
In the Data Preview tab, the Cohort Table shows records that are within your current cohort selection split up by the entity they are on. You can add or remove data fields as columns via the column customization menu, which is located in the top-right corner of the table. As you add fields, entities are automatically split out. You can have up to 5 different entities showing at a time. In the entity drop down, you can toggle between various entities you’ve added and remove them outright.
Click on table column headers to access more functionalities including sorting and searching in a specific column and the data field information. From the data field information, you can quickly add the field as a tile or a filter if it has not been added yet.
You can export table information either as a list of record IDs or a CSV file. Export options are available on the top-right corner of the table once you have selected a number of table rows.
The Variant Browser shows variants that are present in current cohort selection.
For datasets containing germline data, the Variant Browser appears in the Genomics tab. For datasets containing somatic variants data, it appears in the Somatic Variants tab.
For datasets containing germline data, the Variant Browser includes a lollipop plot displaying allele frequencies for variants in a specified genomic region.
The table below the lollipop plot displays a list of the same variants in tabular format, along with further annotation information including:
Type: whether the variant is a SNP, deletion, insertion, or mixed.
Population Allele Frequency: Allele frequency calculated across entire dataset from which the cohort is created.
Cohort Allele Frequency: Allele frequency calculated across current cohort selection.
If canonical transcript information is available, the following three columns with additional annotation information will appear in the Allele Table:
Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.
HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.
HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.
To view further annotation information, you can go to the detail page of a given variant by clicking on the link in the Location column.
Note the following:
The lollipop plot displays variant information in one gene / canonical protein at a time.
Each “lollipop” represents amino acid changes at a given location (e.g. “Thr322Ala”), with location information visualized as horizontal position (X axis) and affected sample frequency in the current cohort visualized as height (Y axis).
Each lollipop is color-coded by consequence according to the canonical transcript. Lollipops that cover more than one consequence type are color-coded as “Multiple Consequences”.
You can inspect variant statistics for each specific consequence by interacting with the legend panel and selecting one color at a time. Selecting a particular lollipop in the plot applies a filter to the variants table, such that only variants corresponding to the selected lollipop are displayed.
You can navigate to a different gene by entering the gene symbol in the Go to Gene field. This will also update the variants table by automatically navigating to the corresponding genomic region.
For datasets containing somatic variant data, the Variant Browser includes a chart illustrating the overall variant landscape across the top-mutated genes, for the current cohort.
Note the following:
The genes are sorted in descending order of percent of affected samples.
Samples are displayed, from left to right, from those that have the greatest number of mutated genes across all genes, to those that have the least.
A sample is considered affected if it has at least one detected mutation of high or moderate impact within the canonical transcript for that gene.
Samples are color-coded by consequences. Samples with two or more detected variants are color-coded as “Multi Hit”.
The variant frequency matrix plot displays up to 50 top mutated genes, for up to 500 samples for any given cohort.
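The affected/“Multi Hit” rules above can be sketched as a small classifier for one (sample, gene) cell of the matrix. The input shape is hypothetical; impact labels follow SnpEff's HIGH/MODERATE convention:

```python
HIGH_OR_MODERATE = {"HIGH", "MODERATE"}

def classify_cell(variants):
    """Return the color category for one (sample, gene) matrix cell.

    `variants`: list of (consequence, impact) pairs detected in the sample
    within the gene's canonical transcript (an assumed shape, for
    illustration only).
    """
    hits = [consequence for consequence, impact in variants
            if impact in HIGH_OR_MODERATE]
    if not hits:
        return None              # sample is not affected for this gene
    if len(hits) >= 2:
        return "Multi Hit"       # two or more detected variants
    return hits[0]               # color-coded by the single consequence
```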
You can inspect variant statistics under each specific consequence, by interacting with the legend panel and selecting one color at a time.
Hovering over a particular sample cell will open a popover window showing detailed information on the sample, including:
The Sample ID
The count of variants per consequence, within the respective gene
For datasets containing somatic variant data, the Variant Browser includes a Variants Table that provides, in tabular format, details on the variants found in a particular genomic region, in samples for the current cohort.
Information displayed in the Variants Table includes:
Location of variant
Reference allele of variant
Alternate allele of variant
Type of variant
Variant consequences, with entries color-coded by level of severity
HGVS cDNA
HGVS Protein
You can export selected variants in the table as a list of variant IDs or a CSV file. Export options will appear at the top-right corner of the table once you have items selected.
Cohorts will be saved with the filters applied, along with the latest set of visualizations and dashboard layout information. Similar to Dataset objects, Cohort objects can be found under the Manage tab in your selected projects, and can be re-opened via the Explore Data option.
You can export a list of main entity IDs in your current cohort selection as a CSV file. This action is located next to the Save Cohort button, at the top-right corner of the cohort panel.
Dashboard views contain layout and configuration information that can be re-used during cohort browsing. You can save or load a dashboard view via the Dashboard Actions menu located at the top-right corner of the header area. Dashboard views are saved as "Type: DashboardView" objects, which once saved also show up in selected project folders.
You can compare two cohorts by adding both cohorts into the Cohort Browser. In cohort compare mode, all visualizations are converted to show data from both cohorts.
The Compare Cohort action can be found in the header area next to the cohort title. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.
In compare mode, you can continue to edit both cohorts and visualize the results dynamically.
You can compare a cohort with its complement in the dataset by selecting the “Not In …” option in the Compare / Combine menu. Similar to combining cohorts, you must first save your current cohort before creating its not-in counterpart.
If a database is in a project that has a restricted downloadPolicy, then a Cohort Browser that shows a dataset, cohorts, or dashboards pointing to that database should not allow downloads, regardless of project.
If all copies of the dataset are in projects that have a restricted downloadPolicy, then a Cohort Browser that shows that dataset, or cohorts or dashboards pointing to it, should not allow downloads, regardless of project. Check all copies of the dataset: if at least one copy is in a "download allowed" project, then the download will be allowed.
If the cohort or dashboard you are launching has a restricted downloadPolicy, then a Cohort Browser that shows that cohort or dashboard should not allow downloads.
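Taken together, the rules above amount to: downloads are blocked if the backing database is restricted, if every dataset copy is restricted, or if the launched cohort/dashboard record is restricted. A sketch with booleans standing in for project download policies (an illustrative helper, not a DNAnexus API):

```python
def downloads_allowed(database_restricted, dataset_copy_policies, record_restricted):
    """Decide whether the Cohort Browser should allow downloads.

    database_restricted: True if the backing database's project restricts downloads.
    dataset_copy_policies: one boolean per dataset copy
        (True = that copy's project restricts downloads).
    record_restricted: True if the launched cohort/dashboard is restricted.
    """
    if database_restricted or record_restricted:
        return False
    # At least one copy in a "download allowed" project unlocks downloads.
    return any(not restricted for restricted in dataset_copy_policies)
```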
Learn to build and use Kaplan-Meier Survival Curve charts in the Cohort Browser.
To generate a survival chart, select one numerical field representing time, and one categorical field, which will be transformed into the individual’s status.
The categorical field should use one of the following 4 terms (case-insensitive) to indicate a status of "Living": “living”, “alive”, “diseasefree”, “disease-free”
For multi-entity datasets, survival curve charts only support data fields from the main entity, or entities with 1:1 relation to the main entity.
To calculate survival percent at the current event, the system evaluates the formula ST = (LT0 − D) / LT0, where:
ST: Survival at the current event
LT0: Number of subjects living at the start of the period or event
D: Number of subjects that died
For each time period the following values are generated:
Status
Each individual is considered Dead unless they qualify as Living
Number of Subjects Living at the Start (LT0)
For the initial event, this is the total number of records returned by the backend from survival data with a Living or Dead status.
For follow-up events, this is the number of subjects at the start of the previous event, minus the number of subjects that died in the previous event, minus the subjects that dropped out or were censored in the previous event.
Number of Subjects Who Died (D)
1 for each individual who at the event does not have a status of Living
Number of Subjects Dropped or Censored
1 for each individual who at the event has a status of Living
Survival Percent at the Current Event (ST)
Cumulative Survival (S)
S = ST × ST-1, where ST-1 is the survival value carried forward from the previous event (1 for the first event).
Note that this is the actual point drawn on the survival plot.
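Putting the definitions above together, the per-event bookkeeping is the standard Kaplan-Meier recurrence: ST = (LT0 − D) / LT0 at each event, multiplied into the running cumulative survival. A sketch, assuming input as (time, status) pairs (a hypothetical shape, not the actual backend format):

```python
from collections import defaultdict

# Status terms treated as "Living" (case-insensitive), per the list above.
LIVING_TERMS = {"living", "alive", "diseasefree", "disease-free"}

def kaplan_meier(records):
    """records: list of (time, status_string) pairs.
    Returns a list of (time, cumulative_survival) points.
    An individual is considered Dead unless their status matches LIVING_TERMS.
    """
    by_time = defaultdict(lambda: {"died": 0, "censored": 0})
    for t, status in records:
        if status.lower() in LIVING_TERMS:
            by_time[t]["censored"] += 1   # Living at this event: dropped/censored
        else:
            by_time[t]["died"] += 1       # Dead at this event
    at_risk = len(records)                # LT0 for the first event
    s = 1.0                               # cumulative survival S
    curve = []
    for t in sorted(by_time):
        d = by_time[t]["died"]
        st = (at_risk - d) / at_risk      # ST = (LT0 - D) / LT0
        s *= st                           # S = ST x S(previous event)
        curve.append((t, s))
        at_risk -= d + by_time[t]["censored"]  # LT0 for the next event
    return curve
```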
Learn to build and use grouped box plots in the Cohort Browser.
Grouped box plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a grouped box plot, each such group is defined by its members sharing the same value in another field that contains categorical data.
When creating a grouped box plot, note that:
The primary field must contain categorical or categorical multiple data
The primary field must contain no more than 15 distinct category values
The secondary field must contain numerical data
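The grouping step behind a grouped box plot can be sketched as follows: split one numeric field into per-category value lists, skipping non-numeric values. The record shape and helper name are illustrative, not DNAnexus code:

```python
from collections import defaultdict

def group_values(records, category_field, numeric_field, max_groups=15):
    """Split a numeric field into per-category value lists, one box per group.

    Non-numeric values are skipped, so they never enter a box plot.
    Raises if the categorical field exceeds the 15-category limit.
    """
    groups = defaultdict(list)
    for record in records:
        value = record.get(numeric_field)
        if isinstance(value, (int, float)):
            groups[record[category_field]].append(value)
    if len(groups) > max_groups:
        raise ValueError("grouped box plots support at most 15 distinct categories")
    return dict(groups)
```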
The grouped box plot below shows a cohort that has been broken down into groups, according to the value in a field Doctor. For each group, a box plot provides detail on the reported Visit Feeling, for cohort members who share a doctor:
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a grouped box plot. See the chart just above for an example of the informational message that will show below the chart, in this scenario.
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
Cohort Browser grouped box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a grouped box plot that looks like this:
This grouped box plot displays data on the number of cups of coffee consumed per day, by members of different groups in a particular cohort, with groups defined by shared value in a field Coffee type. Note that in several groups, one member was recorded as consuming far more cups of coffee per day than others in the group.
In Cohort Compare mode, a grouped box plot can be used to compare the distribution of values in a field that's common to both cohorts, across groups defined using values in a categorical field that is also common to both cohorts.
In this scenario, a separate, color-coded box plot is displayed for each group in each cohort.
Hovering over one of these box plots opens an informational window showing detail on the distribution of values for the group in question.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
String Categorical
String Categorical Multi-Select
String Categorical Sparse
Integer Categorical
Integer Categorical Multi-Select
Integer
Integer Sparse
Float
Float Sparse
Learn to build and use stacked row charts in the Cohort Browser.
Stacked row charts can be used to compare the distribution of values in a field containing categorical data, across different groups in a cohort. In a stacked row chart, each such group is defined by its members sharing the same value in another field that also contains categorical data.
When creating a stacked row chart, note that:
Both the primary and secondary fields must contain categorical data
Both the primary and secondary fields must contain no more than 20 distinct category values
In the stacked row chart below, the primary field is VisitType, while DoctorType is the secondary field. In this chart, a cohort has been broken down into two groups, with the first sharing the value "Out-patient" in the VisitType field, while the second shares the value "In-patient."
The size of each bar, and the number to its right, indicate the total number of records in each group. In the chart below, for example, we see that 3,179 records contain the value "Out-patient" in the VisitType field.
Each bar contains a color-coded section indicating how many of the group's records contain a specific value in the secondary field. Hovering over one of these sections reveals how many records, within a particular group, share a particular value in the secondary field. In the chart below, for example, we see that 87 records in the first group share the value "specialist" in the DoctorType field.
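The counts behind a stacked row chart are two tallies over the same records: one per primary-field value (bar sizes) and one per (primary, secondary) pair (section sizes). A sketch using the field names from the example above, with made-up records:

```python
from collections import Counter

# Hypothetical records; field names mirror the example above.
records = [
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
    {"VisitType": "Out-patient", "DoctorType": "generalist"},
    {"VisitType": "In-patient",  "DoctorType": "specialist"},
    {"VisitType": "Out-patient", "DoctorType": "specialist"},
]

# Bar size: total records per primary-field value.
bar_totals = Counter(r["VisitType"] for r in records)

# Section size: records per (primary, secondary) value pair.
sections = Counter((r["VisitType"], r["DoctorType"]) for r in records)
```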
String Categorical
String Categorical Sparse
Integer Categorical
Learn to build and use scatter plots in the Cohort Browser.
Scatter plots can be used to compare the distribution of values in a field containing numerical data, across different groups in a cohort. In a scatter plot, each such group is defined by its members sharing the same value in another field that also contains numerical data.
Primary field values are plotted on the x axis. Secondary field values are plotted on the y axis.
In the scatter plot below, each dot represents a particular combination of values, found in one or more records in a cohort, in fields Insurance Billed and Cost. The lighter the dot at a particular point, the fewer the records that share that combination. Darker dots, meanwhile, indicate that relatively more records share a particular combination.
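The dots in such a plot are distinct (x, y) value pairs, shaded by how many records share each pair. A sketch with hypothetical (Insurance Billed, Cost) records:

```python
from collections import Counter

# Each record contributes one (Insurance Billed, Cost) pair. The chart
# draws one dot per distinct pair; the count determines its darkness.
records = [(100, 80), (100, 80), (250, 200), (100, 80), (250, 200), (400, 390)]

dots = Counter(records)   # distinct point -> number of records sharing it
```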
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a scatter plot. The message "This field contains non-numeric values" will appear below the scatter plot, as in this sample chart:
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears.
In the Cohort Browser, scatter plots can show up to 30,000 distinct data points. If you create a scatter plot that would require that more data points be shown, you'll see this message above the chart:
Scatter plots are not supported in Cohort Compare.
Integer
Integer Sparse
Float
Float Sparse
Learn to build and use list views in the Cohort Browser.
List views can be used to visualize categorical data.
When creating a list view, note that:
The data must be from a field that contains either categorical or categorical multi-select data
This field must contain no more than 20 distinct category values
The values can be organized in a hierarchy
List views can be used to visualize categorical data from two different fields. The same restrictions apply to the fields whose values are displayed, as when creating a simple list view.
In a list view in the Cohort Browser showing data from one field, each row displays a value, along with the number of records in the current cohort - the "count" - that contain this value. Also shown is a figure labeled "freq." - this is the percentage of all cohort records, that contain the value.
Below is a sample list view showing the distribution of values in a field Episode type. Note that in the current cohort selection of 80 participants, 13 records contain the value "Delivery episode", which represents 16.25% of the current cohort size.
To visualize data from two fields, select a categorical field, then select "List View" as your visualization type. In the field list, select a second categorical field as a secondary field.
Below is the default view of a sample list view visualizing data from two fields: Critical care record origin and Critical care record format:
Critical care record origin is the primary field, Critical care record format is the secondary field.
Here, the user has clicked the ">" icon next to "Originating from Scotland" to display additional rows with detail on records that contain that value in the field Critical care record origin:
Each of these additional rows shows the number of records that contain a particular value for Critical care record format, along with the value "Originating from Scotland" for Critical care record origin.
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values, in the fields in question.
Below is an example of a list view used to visualize data in a categorical hierarchical field Home State/Province:
By default, only values in the category at the top level of the hierarchy are displayed.
Here, the user has clicked ">" next to one of these values, revealing additional rows that show how many records have the value "Canada" for the top-level category, in combination with different values in the category at the next level down:
In these additional rows, "count" and "freq." figures refer to records having a particular combination of values, in the fields in question. In the list view above, for example, a single record, representing 10% of the cohort, has both the value "Canada" for the top-level category, and "British Columbia" for the second-level category.
The following example shows how "count" and "freq." are calculated, for list views based on fields containing categorical data organized into multiple levels of hierarchy:
For the bottommost row, "count" and "freq" refer to records having all of the following values:
"Yes" for the category at the top of the hierarchy
"9" for the category at the second level of the hierarchy
"8" for the category at the third level of the hierarchy
"7" for the category at the fourth level of the hierarchy
"3" for the category at the bottom level of the hierarchy
In cases where the field has categories at multiple levels, making it difficult to find a particular value, use the search box at the bottom of the list view to home in on the row or rows containing that value:
In Cohort Compare mode, a list view can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, the list includes a color-coded column for each cohort, as well as color-coded "count" figures for each, as in this example:
Note that in each column, count and "freq." figures refer to the occurrence of values in the individual cohort, not across both cohorts.
String Categorical
String Categorical Hierarchical
String Categorical Multi-Select
String Categorical Multi-Select Hierarchical
String Categorical Sparse
String Categorical Sparse Hierarchical
Integer Categorical
Integer Categorical Hierarchical
Integer Categorical Multi-Select
Integer Categorical Multi-Select Hierarchical
Learn to build and use box plots in the Cohort Browser.
Box plots can be used to visualize numerical data.
Box plots provide a range of detail on the distribution of values in a field containing numerical data. Each box plot includes three thin blue horizontal lines, indicating, from top to bottom:
Max - The maximum, or highest value
Med - The median value
Min - The minimum, or lowest value
The blue box straddling the median value line represents the span covered by the middle 50% of values. Of the total number of values, 25% sit above the box, and 25% lie below it.
Hovering over the middle of a box plot opens a window displaying detail on the maximum, median, and minimum values. Also shown are the values at the "top" ("Q3") and "bottom" ("Q1") of the box. "Q1" is the highest value in the first, or lowest, quartile of values; "Q3" is the highest value in the third quartile.
Also shown in this window is a number representing the total number of values covered by the box plot, along with the name of the entity to which the data relates.
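The hover-window statistics can be reproduced with the standard library. Note that quartile conventions vary between tools, so `statistics.quantiles` may differ slightly from the Cohort Browser's own Q1/Q3 for small samples:

```python
import statistics

def box_stats(values):
    """Min/Q1/Med/Q3/Max/count summary, like the box plot hover window.

    Uses statistics.quantiles (default 'exclusive' method); the browser's
    exact quartile convention is an assumption here.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "med": med,
            "q3": q3, "max": max(values), "count": len(values)}
```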
In some cases, a field containing numeric data may also contain some non-numeric values. These values cannot be represented in a box plot. See the chart just above for an example of the informational message that will show below the chart, in this scenario.
Clicking the "non-numeric values" link will display detail on those values, and the number of records in which each appears:
Note as well that in this scenario, there will be a discrepancy between the "count" figure shown in the chart label, and that shown in the informational window that opens, when hovering over the middle of a box plot. The latter figure will be smaller, with the discrepancy determined by the number of records for which values can't be displayed in the box plot.
Cohort Browser box plots represent all non-null numeric values. When a field contains an outlier value or values - that is, values that are unusually high or low - this can result in a box plot that looks like this:
This box plot displays data on the number of cups of coffee consumed per day, by members of a particular cohort. One cohort member was recorded as consuming 42 cups of coffee per day, much higher than the value (2 cups/day) at the "top" of the third quartile, and far higher than the median value of 2 cups/day.
In Cohort Compare mode, a box plot chart can be used to compare the distribution of values in a field that's common to both cohorts. In this scenario, a separate, color-coded box plot is displayed for each cohort.
Hovering over either of the plots opens an informational window showing detail on the distribution of values for the cohort in question.
Clicking the "ˇ" icon, in the lower right corner of the tile containing the chart, opens a tooltip showing the cohort names and the colors used to represent data in each.
Integer
Integer Sparse
Float
Float Sparse
Tools in this section are created and maintained by their respective vendors and may require separate licenses.
Git credentials used to access Nextflow pipelines from private Git repositories. Can be used only with --repository. More information about the file syntax can be found .
For DNAnexus Platform users, an Apollo license is required to access the Cohort Browser. for more information.
To visualize data stored in a particular field, . Note that when you select a field, the Cohort Browser will suggest a chart type to use to visualize the type of data it contains. You can also , displaying data from two fields, to help clarify the relationship between the data stored in each.
See for more on using Cohort Compare mode.
When creating a histogram, note that the following data types can be visualized in histograms:
See if you need to visualize hierarchical categorical data.
Row charts can't be used to visualize data in categorical fields that have a hierarchical structure. For this type of data, use a .
Row charts aren't supported in Cohort Compare mode. In Cohort Compare mode, row charts are converted to .
Simple row charts can't be used to visualize data from more than one field. To visualize categorical data from two fields, you can use a .
In some cases, not all records will have a value for the field in question. In this case, summing the "count" figures displayed will yield a figure smaller than the total cohort size, and summing the "freq." figures will not yield "100%." See for more information.
When creating a row chart, note that the following data types can be visualized in row charts, provided that category values are specified as such in the coding file used at ingestion:
These databases can then be explored using the Cohort Browser. Scientists can filter the data by any data field and save these filtered samples as cohorts. These cohorts can be shared with other scientists, and can also be used as inputs to analysis apps to perform such tasks as calculating allele frequencies or performing a GWAS analysis.
Bioinformaticians who wish to perform ad hoc statistical analysis are able to create environments backed by to directly query their data and create dataframes within a Python or R environment for further analysis.
Datasets need to be prepared and ingested in order to be accessible via the Cohort Browser. See the page for information on the ingestion process.
For each data field, the range of available chart types depends on the type of data stored in the field. See pages for more information on how each chart type can be built. Note that no more than 15 tiles can be added to the dashboard.
The “+” icon only appears when at least one chart type is supported for the specified combination. See for more details.
For certain chart types - such as and - you can re-order the primary and secondary data fields by dragging the data field in the Data Field Details section.
When you start exploration on a dataset, an empty cohort is created automatically in the Cohort Browser. You can further narrow down your cohort by adding cohort filters. Cohorts can be for later use.
Canonical transcripts, as defined by , will be indicated with a blue marker next to their "Ensembl Transcript ID" in the Transcript column in the Genes/Transcripts table.
If your dataset includes data on somatic variants, the workflow for creating a genomic filter is very similar to that for .
Note that “Entity” in the Cohort Browser can refer either to a data model object (examples given above), or to the specific input parameter in the Table Exporter app. See the Table Exporter documentation
You can also create a combined cohort based on cohorts that have already been saved.
The cohort table can display a maximum of 30,000 records. If your cohort is larger than this, the table may not show the full data. Tables larger than 30,000 records can be exported with the Table Exporter app.
Consequences: The impact of the variant according to . For variants with multiple gene annotations, this column displays the most severe consequence per gene.
GnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset .
Downloading genomic data via the visualization UI is not suitable for large datasets. You can use the to download data in a more efficient way.
For datasets containing somatic variant data, the Variant Browser includes a lollipop plot showing somatic variants for the canonical protein of a single gene, that occur in the cohort under examination. Interpreting and working with this chart is very similar to working with the .
Only consequences of high or moderate impact (as defined by ) are included in this visualization.
Note that the table displays detail on the same genomic region as the .
You can save your cohort selection to a project as a by clicking the Save icon in the top-right corner of the cohort panel.
Compare mode is supported only for cohorts created from the same dataset. Certain cohort browser sections and are not supported in compare mode.
The Cohort Browser adheres to a project's .
The dx command generates a new Cohort object on the platform from an existing Dataset or Cohort object and a list of primary IDs. The filters are applied to the global primary key of the dataset/cohort object. When the input is a CohortBrowser record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined such that the resulting record is the intersection of the IDs present in the original input and the IDs passed through the CLI. For additional details, see the example notebooks in the public GitHub repo.
When creating a grouped box plot, note that the following data types can be visualized in grouped box plots:
Stacked row charts are not supported in Cohort Compare. Use a instead.
When creating a stacked row chart, note that the following data types can be visualized in stacked row charts:
In this scenario, to generate a scatter plot that shows data for all the members of a cohort.
When creating a scatter plot, note that the following data types can be visualized in scatter plots:
Note that list views, unlike , can be used to visualize categorical data with values that are organized in a hierarchical fashion.
In some cases, not all records will have a value for the field in question. In this case, summing the "count" figures displayed will yield a figure smaller than the total cohort size, and summing the "freq." figures will not yield "100%." See for more information.
When creating a list view, note that the following data types can be visualized in list views:
Numerical data can also be visualized using .
When creating a box plot, note that the following data types can be visualized in box plots:
A Titan license is required to access and use these tools. Please contact for more information.
An Apollo license is required to access and use these tools. Please contact for more information.
Intersection
Select members that are present in ALL selected cohorts.
Example: intersection of cohort A, B and C would be A∩B∩C
Up to 5 cohorts
Union
Select members that are present in ANY of the selected cohorts. Example: union of cohort A, B and C would be A∪B∪C
Up to 5 cohorts
Subtraction
Select members that are present only in the first selected cohort and not in the second.
Example: Subtraction of cohort A, B would be A-B
2 cohorts
Unique
Select members that appear in exactly one of the selected cohorts.
Example: Unique of cohort A, B would be (A-B) ∪ (B-A)
2 cohorts
Not In
Select members that are present in the dataset, but not in the current cohort.
Example: In dataset U, the result of “Not In” A would be U-A
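The five operations above are ordinary set algebra. A sketch with Python sets standing in for cohort membership lists:

```python
# Made-up cohorts (member IDs) and dataset, for illustration only.
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6}
U = {1, 2, 3, 4, 5, 6, 7}        # the full dataset

intersection = A & B & C          # members in ALL selected cohorts
union = A | B | C                 # members in ANY selected cohort
subtraction = A - B               # in the first cohort but not the second
unique = (A - B) | (B - A)        # in exactly one of the two cohorts
not_in = U - A                    # in the dataset but not the cohort
```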
1x numerical
1x categorical
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
snpeff_annotate
SnpEff
Annotation
snpsift_annotate
SnpSift
Annotation
Name of Tool
Name in CLI
Scientific Algorithms
Common uses
aws_platform_to_s3_file_transfer
AWS S3
aws_s3_to_platform_files
AWS S3
sra_fastq_importer
Retrieve reads in FASTQ format from SRA
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
oqfe
A revision of Functionally Equivalent
WGS, WES- alignment and duplicate marking
gatk4_bqsr_parallel
Variant calling
bwa_mem_fastq_read_mapper
BWA-MEM
Short read alignment
gatk4_haplotypecaller_parallel
Variant calling, post-alignment QC
gatk4_genotypegvcfs_single_sample_parallel
Variant calling
picard_mark_duplicates
Variant Calling- remove duplicates, post-alignment
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
regenie
REGENIE
GWAS
plink_gwas
PLINK2
raremetal2
RAREMETALWORKER, RAREMETAL
saige_gwas_gbat
saige_gwas_svat
saige_gwas_grm
saige_gwas_sparse_grm
plink_pipeline
Plink2
plato_pipeline
Plato, Plink2
locuszoom
LocusZoom
GWAS, visualization
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus platform
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
sra_fastq_importer
Retrieve reads in FASTQ format from SRA
url_fetcher
N/A
Fetches a file from a URL onto the DNAnexus platform
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
cloud_workstation
N/A
SSH-accessible unix shell on a platform cloud worker. Use it for ad hoc analysis of platform data.
ttyd
N/A
Unix shell on a platform cloud worker in your browser. Use it for ad hoc CLI operations and to launch HTTPS apps on 2 extra ports.
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
glnexus
GLnexus
This app can also be used to create pVCF without running joint genotyping
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
samtools_index
SAMtools- samtools index
Building bam index file
samtools_sort
SAMtools- samtools sort
Sort alignment result based on coordinates
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
phesant
PHESANT
PheWAS
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
prsice2
PRSice-2
Polygenic risk scores
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
multiqc
MultiQC
QC reporting
qualimap2_anlys
Qualimap2
QC
rnaseqc
Transcriptomics Expression Quantification
fastqc
FastQC
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
salmon_quant
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
bowtie2_fasta_indexer
Bowtie 2: bowtie2-build
Building reference for Bowtie 2 alignment
bowtie2_fastq_read_mapper
bowtie2, samtools view, samtools sort, samtools index
Short read alignment
bwa_fasta_indexer
BWA- bwa index
Building reference for BWA alignment
bwa_mem_fastq_read_mapper
BWA-MEM
Short read alignment
star_generate_genome_index
STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate)
RNA Seq- indexing
star_mapping
STAR (Spliced Transcripts Alignment to a Reference)
RNA Seq- mapping
subread_feature_counts
featureCounts
Read summarization, RNAseq
salmon_index_builder
Salmon
Transcriptomics Expression Quantification
salmon_mapping_quant
Salmon
Transcriptomics Expression Quantification
Name of Tool
Name in CLI
Scientific Algorithms
Common Uses
gatk4_bqsr_parallel
Variant calling
flexbar_fastq_read_trimmer
QC
trimmomatic
Read quality trimming, adapter trimming
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| rnaseqc | | Transcriptomics Expression Quantification |
| star_generate_genome_index | STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate) | RNA-seq indexing |
| star_mapping | STAR (Spliced Transcripts Alignment to a Reference) | RNA-seq mapping |
| subread_feature_counts | featureCounts | Read summarization, RNA-seq |
| star_generate_genome_index | STAR (Spliced Transcripts Alignment to a Reference) (--runMode genomeGenerate) | Transcriptomics Expression Quantification |
| star_mapping | STAR (Spliced Transcripts Alignment to a Reference) | Transcriptomics Expression Quantification |
| salmon_index_builder | Salmon | Transcriptomics Expression Quantification |
| salmon_mapping_quant | Salmon | Transcriptomics Expression Quantification |
| salmon_quant | Salmon | Transcriptomics Expression Quantification |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| Transcript_Expression_Part-02_Analysis-diff-exp_R.ipynb | DESeq2 | |
| Transcript_Expression_Part-03_Analysis-GSEA_R.ipynb | WebGestaltR | |
| Transcript_Expression_Part-04_Analysis-CoEx-Network_R.ipynb | WGCNA, topGO | |
| Transcript_Expression_Part-05_Analysis-Regulatory-Network_R.ipynb | GENIE3 | |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| file_concatenator | N/A | |
| gzip | gzip | |
| swiss-army-knife | bcftools, bedtools, bgzip, plink, sambamba, samtools, seqtk, tabix, vcflib, Plato, QCTool, vcftools, plink2, Picard, REGENIE, BOLT-LMM, BGEN | Data processing tools |
| ttyd | N/A | Unix shell on a platform cloud worker in your browser. Use it for ad hoc CLI operations and to launch HTTPS apps on 2 extra ports |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| cnvkit_batch | | Copy number variants |
| gatk4_haplotypecaller_parallel | | Variant calling, post-alignment QC |
| gatk4_genotypegvcfs_single_sample_parallel | | Variant calling |
| picard_mark_duplicates | | Variant calling: remove duplicates, post-alignment |
| freebayes | | Short variant calls |
| gatk4_mutect2_variant_caller_and_filter | | Somatic variant calling and post-calling filtering |
| gatk4_somatic_panel_of_normals_builder | | Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2 |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| locuszoom | LocusZoom | GWAS, visualization |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, cntk, keras, scikit-learn, tensorflow, torch | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, Stata, stata_kernel | Running analyses, visualizing data, building and testing models and algorithms in an interactive way, accessing and manipulating data in Spark databases and tables |
| dxjupyterlab | dxpy, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, nipype, freesurfer, FSL | Running image processing-related analyses |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| data_model_loader_v2 | Dataset Creation | Dataset Creation |
| dataset-extender | Dataset Extension | Dataset Extension |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| csv-loader | N/A | Data Loading |
| spark-sql-runner | Spark SQL | Dynamic SQL Execution |
| table-exporter | N/A | Data Extraction |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, Glow | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| dxjupyterlab_spark_cluster | dxpy, hail, matplotlib, numpy, pandas, papermill, scipy, seaborn, dxdata, pyOpenSSL, glow.py, pypandoc, koalas, pyarrow, bokeh, vep, BiocManager, coloc, epiR, hyprcoloc, incidence, MendelianRandomization, outbreaks, prevalence, sparklyr, HAIL, Ensembl Variant Effect Predictor | Running analyses, visualizing data, building and testing models and algorithms in an interactive way |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-bwa | Sentieon's FASTQ to BAM/CRAM pipeline | WGS, WES, accelerated analysis |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| sentieon-tnbam | Sentieon's BAM to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| pbdeepvariant | DeepVariant | Variant calling, accelerated analysis |
| sentieon-umi | Sentieon's pre-processing and alignment pipeline for next-generation sequencing | WGS, WES, accelerated analysis |
| sentieon-dnabam | Sentieon's BAM to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-bwa | Sentieon's FASTQ to BAM/CRAM pipeline | WGS, WES, accelerated analysis |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| pbdeepvariant | DeepVariant | Variant calling, accelerated analysis |
| sentieon-umi | Sentieon's pre-processing and alignment pipeline for next-generation sequencing | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| Name in CLI | Scientific Algorithms | Common Uses |
| --- | --- | --- |
| sentieon-tnseq | Sentieon's FASTQ to VCF somatic analysis pipeline | WGS, WES, accelerated analysis, somatic variant calling |
| pbgermline | BWA-MEM alignment, coordinate sorting, Picard MarkDuplicates, Base Quality Score Recalibration | WGS, WES, accelerated analysis |
| sentieon-tnbam | Sentieon's BAM to VCF somatic analysis pipeline | WGS, WES, accelerated analysis, somatic variant calling |
| sentieon-dnabam | Sentieon's BAM to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| sentieon-joint_genotyping | Sentieon GVCFtyper | WGS, WES, accelerated analysis |
| sentieon-ccdg | Sentieon's FASTQ to CRAM pipeline, Functional Equivalent Pipeline | WGS, WES, accelerated analysis |
| sentieon-dnaseq | Sentieon's FASTQ to VCF germline analysis pipeline | WGS, WES, accelerated analysis |
| pbmutectcaller | GPU-accelerated version of Mutect2; supports both tumor-only and tumor-normal variant calling | |
The dx commands extract_dataset and extract_assay germline can either return the data dictionary of a dataset or retrieve the underlying data that the data dictionary describes. They can also return metadata about a dataset, such as the names and titles of entities and fields, or a list of all relevant assays in a dataset.

When retrieving data, you can choose whether to use a private Spark resource. In most scenarios, retrieving data without direct Spark usage suffices, and additional compute resources are not needed (see the example OpenBio notebooks). When no additional compute resources are used, data is returned via the DNAnexus Thrift Server; although the server is highly available, it enforces a fixed timeout, which may prevent a large number of queries from executing. In scenarios where the data model has many relationships, there is a high volume of stored data, and/or a high volume of data must be extracted and returned, it may be necessary to extract data using additional private compute resources. These resources are scaled accordingly, so that the timeouts enforced by the Thrift Server are avoided completely. If the --sql flag is provided, the command instead returns a SQL statement (a string) to use when querying from a standalone Spark-enabled app(let), such as JupyterLab.
The most common way to use Spark on the DNAnexus Platform is via a Spark enabled JupyterLab notebook.
After creating a Jupyter notebook within a project, enter the commands shown below to initiate a Spark session.
Python:
R:
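The code for these cells is not present in this copy; a minimal sketch of the usual Python pattern, assuming pyspark is available in the Spark-enabled JupyterLab environment:

```python
import pyspark

# Connect to the cluster's Spark master and open a SQL-capable session
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```

This requires a running Spark cluster, so it is only expected to work inside a Spark-enabled JupyterLab job.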
Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:
Python:
R:
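As a sketch of the Python form described above, assuming a Spark session named spark and hypothetical database and table names:

```python
# Run a SQL query against a database; results come back as a Spark DataFrame.
# "my_database" and "my_table" are placeholders for your own names.
df = spark.sql("SELECT * FROM my_database.my_table")
df.show(5)
```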
Python:
Where dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset.
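The Python snippet referenced here is missing from this copy; one common pattern in the OpenBio notebooks is to shell out to the dx CLI. The field name below is hypothetical and should be replaced with fields from your data dictionary:

```python
import subprocess

dataset = "record-abc123"
# Build the dx extract_dataset invocation; --fields takes entity.field names.
cmd = ["dx", "extract_dataset", dataset, "--fields", "participant.participant_id"]
print(" ".join(cmd))
# subprocess.check_call(cmd)  # uncomment when logged in to the platform
```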
R:
Where dataset is the record-id or the path to the dataset or cohort.
Python:
R:
In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele command. For more information, refer to the notebooks here.
Python:
R:
When querying large datasets, such as those containing genomic data, ensure that your Spark cluster is scaled up appropriately, with multiple nodes to parallelize across.
Ensure that your Spark session is initialized only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (e.g. by running notebook 1 and then notebook 2, or by running a notebook from start to finish multiple times), the Spark session will be corrupted, and you will need to restart the affected notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.
If you would like to use a database outside your project's scope, you must refer to it using its unique database name (typically something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data), as opposed to the database name (metabric_data in this case).
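To illustrate, a query against an out-of-project database can be composed like this (the table name is hypothetical):

```python
def qualified_query(unique_db_name: str, table: str) -> str:
    """Build a SQL string that references a database by its unique name."""
    return f"select * from {unique_db_name}.{table}"

print(qualified_query("database_fjf3y28066y5jxj2b0gz4g85__metabric_data", "expression"))
```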
Learn to launch a JupyterLab session on the DNAnexus Platform, via the DXJupyterLab app.
If you have used DXJupyterLab before, the page will display a list of your previous sessions run across different projects.
This will open a window from which you can start a new JupyterLab environment. In this window, you can configure your session, e.g. specify its name, select an instance type, and choose the project in which JupyterLab should be started.
If a snapshot file is provided, a previously saved DXJupyterLab environment will be loaded from that file. A snapshot tarball file can be created while running a JupyterLab session.
You can adjust the duration of the session, after which the environment will automatically shut down. Based on this duration and the instance type, an estimate of the price is shown in the bottom-left corner (if you have access to billing information for the selected project).
If you select Enable Spark Cluster, a JupyterLab environment with a standalone Spark cluster will be started. With this option, you can also set the number of nodes in the cluster. This number includes the master (one node) and the worker nodes.
The feature options available are PYTHON_R, ML, IMAGE_PROCESSING, and STATA. Selecting the PYTHON_R feature (the default) loads the environment with Python 3 and the R kernel and interpreter. Selecting the ML feature loads the environment with Python 3 and machine learning packages such as TensorFlow, PyTorch, and CNTK, as well as the image processing package Nipype, but it does not contain R. Selecting the IMAGE_PROCESSING feature loads the environment with Python 3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run; details about license creation and usage can be found here. The STATA feature requires a license to run. For a detailed list of the libraries included in each of these feature options, see the in-product documentation.
At first, the JupyterLab session will be in an "Initializing" state, while it waits for the worker to spin up and for the JupyterLab server to start. Clicking on the row corresponding to your session, and then on the i icon in the top right corner, will display more information about the JupyterLab job.
Once the JupyterLab server is running, the session state will change to Ready, and the name of the session will turn into a link. By clicking this link, you can open a JupyterLab environment page in your browser. You can access your job via the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the DXJupyterLab job.
You can start the JupyterLab environment directly from the command line by running the app:
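The command itself is missing from this copy; a minimal invocation might look like the following (options such as instance type or feature can be added with -i inputs):

```shell
dx run app-dxjupyterlab --priority high
```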
Once the app starts, you can check whether the JupyterLab server is ready to serve connections, which is indicated by the job's property httpsAppState being set to running. Once it is running, you can open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.
In order to run the Spark version of the app, use the command:
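The command is missing here as well; a minimal sketch:

```shell
dx run app-dxjupyterlab_spark_cluster --priority high
```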
You can check the optional input parameters for the apps on the DNAnexus platform (platform login required to access the links):
From the CLI, you can learn more about dx run with the following command:
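The command referenced here did not survive extraction; it is presumably the standard help invocation:

```shell
dx run APP_NAME -h
```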
where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.
See the Quickstart and References pages for more details on how to use DXJupyterLab.
In this tutorial, you will learn how to create and run a notebook in JupyterLab on the platform, download data from the notebook, and upload results to the platform.
First, launch DXJupyterLab in the project of your choice, as described in the Running DxJupyterLab guide.
Once your JupyterLab session is running, click on the DNAnexus tab in the left sidebar to see all the files and folders in the project.
To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.
An untitled ipynb file will be created and will be viewable in the DNAnexus project browser, which refreshes every few seconds.
You can rename your file by right-clicking on its name and selecting Rename.
You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, simply hit Ctrl/Command + S or click on the save icon in the Toolbar (the area just below the tab bar at the top). A new notebook version will land in the project, and you should see in the "Last modified" column that the file was created recently.
Since DNAnexus files are immutable, whenever you save the notebook, the current version is uploaded to the project and replaces the previous version, i.e. the file of the same name. The previous version is moved to the .Notebook_archive folder, with a timestamp suffix added to its name. Saving notebooks directly in the project as new files ensures that your analyses won't be lost when the DXJupyterLab session ends.
To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).
You can download input data from a project for your notebook using dx download in a notebook cell:
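The cell contents are missing from this copy; a sketch, with a hypothetical project path:

```shell
# Fetch an input file from the project into the local execution environment.
dx download /data/input.csv
```

In a Python notebook cell, prefix the command with an exclamation mark (e.g. !dx download /data/input.csv).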
You can also use the terminal to execute the dx command.
If your notebook generates any data you'd like to keep, you should upload it to the project before the session ends, i.e. before the worker in which JupyterLab runs is terminated. You can do this directly in the notebook by running dx upload from a notebook cell or from the terminal:
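The example itself is missing here; a minimal sketch with a hypothetical filename:

```shell
# Persist a locally generated result file back to the project.
dx upload results.csv
```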
Check the References guide for tips on the most useful operations and features in the DNAnexus JupyterLab.
Learn how to use FreeSurfer in DXJupyterLab.
FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies. The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING flavor of DXJupyterLab.
To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license here.
To use the FreeSurfer license complete the following steps:
Upload the license text file to your project on the DNAnexus platform.
Launch the DXJupyterLab app using the IMAGE_PROCESSING feature.
Once DXJupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory, like so:
Python kernel:
Bash kernel:
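The snippets themselves are missing from this copy; a bash-kernel sketch, assuming the license file was uploaded to the project root as license.txt:

```shell
# Fetch the FreeSurfer license from the project into $FREESURFER_HOME
dx download /license.txt -o "$FREESURFER_HOME/license.txt"
```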
Learn how to run an older version of DXJupyterLab via the user interface or command-line interface.
The primary reason to run an older version of DXJupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.
From the main Platform menu, select Tools, then Tools Library.
Locate and select, from the list of tools, either DXJupyterLab with Python, R, Stata, ML, Image Processing or DXJupyterLab with Spark Cluster.
From the tool detail page, click on the Versions tab.
Select the version you'd like to run. Click the Run button.
Select the project in which you want to run DXJupyterLab.
Launch the version of DXJupyterLab you want to run, substituting the version number for x.y.z in the following commands:
For DXJupyterLab without Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.
For DXJupyterLab with Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high.
After launching DXJupyterLab, access the DXJupyterLab environment using your browser. To do this:
Get the job ID for the job created when you launched DXJupyterLab. See the Monitoring Executions page for details on how to get the job ID, via either the UI or the CLI.
Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.
You will see an error message "502 Bad Gateway" if DXJupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.
The Spark application is an extension of the current app(let) framework. Currently, app(let)s have a specification for their VM (instance type, OS, packages). This has been extended to allow for an additional optional cluster specification with type=dxspark.
Calling /app(let)-xxx/run for Spark apps creates a Spark cluster (+ master VM).
The master VM (where the app shell code runs) acts as the driver node for Spark.
Code in the master VM leverages the Spark infrastructure.
Job mechanisms (monitoring, termination, etc.) are the same for Spark apps as for any other regular app(let)s on the Platform.
Spark apps use the same platform "dx" communication between the master VM and DNAnexus API servers.
There's a new log collection mechanism to collect logs from all nodes.
You can use the Spark UI to monitor a running job using SSH tunneling.
Spark apps can be launched over a distributed Spark cluster.
Learn to use the DXJupyterLab Spark Cluster app.
The DXJupyterLab Spark Cluster app is a Spark application that runs a fully-managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.
In addition to the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:
Explore the available databases and get an overview of the available datasets
Perform analyses and visualizations directly on data available in the database
Create databases
Submit data analysis jobs to the Spark cluster
Check the general Overview for an introduction to DNAnexus JupyterLab products.
The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus platform. The References page has additional useful tips for using the environment.
Having created your notebook in the project, you can populate your first cells as shown below. It is good practice to instantiate your Spark context at the very beginning of your analyses.
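The initial cells are missing in this copy; a sketch of the usual pattern, assuming pyspark is available on the cluster:

```python
import pyspark

# Start the Spark context and session at the top of the notebook
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```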
To view the databases to which you have access in your current region and project context, run a cell with the following code:
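The cell contents did not survive in this copy; the usual Spark SQL form is:

```python
spark.sql("show databases").show(truncate=False)
```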
A sample output should be:
You can inspect one of the returned databases by running:
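The snippet is missing here; one way to inspect a database with Spark SQL (the database name is hypothetical):

```python
spark.sql("describe database my_database").show(truncate=False)
```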
which should return an output similar to:
To find a database in your current region that may be in a different project than your current context, run the following code:
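The snippet is missing here; a sketch using a Spark SQL pattern match over unique database names (the pattern is hypothetical):

```python
spark.sql("show databases like 'database_*'").show(truncate=False)
```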
A sample output should be:
You can inspect one of the returned databases by running (note that using the database name instead of unique database name here will only return the databases within the project scope):
See below for an example of how to create and populate your own database.
You may separate each line of code into different cells to view the outputs iteratively.
Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is available with DXJupyterLab Spark Cluster. It is included in the app and can be used when the app is run with the feature input set to HAIL (the default).
Initialize the context when beginning to use Hail. It's important to pass the previously started Spark context sc as an argument:
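The initialization cell is missing from this copy; the usual Hail pattern looks like:

```python
import hail as hl

# Reuse the existing Spark context rather than letting Hail create its own
hl.init(sc=sc)
```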
We recommend continuing your exploration of Hail with the GWAS using Hail tutorial. For example:
To use VEP (Ensembl Variant Effect Predictor) with Hail, select "Feature," then "HAIL" when launching Spark Cluster-Enabled DXJupyterLab via the CLI.
VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequence, and regulatory regions. The LoF plugin is included as well, and is used when VEP configuration includes LoF plugin as shown in the configuration file below.
The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.
The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.
The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:
The default user in the container is root.
The option --network host is used when starting Docker, in order to remove the network isolation between the host and the Docker container. This allows the container to bind to the host's network and access Spark's master port directly.
S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. Note that the s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.
To access public s3 buckets, you do not need to have s3 credentials. The example below shows how to access the public 1000Genomes bucket in a JupyterLab notebook:
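The notebook cell is missing from this copy; a sketch using anonymous S3A credentials (the object path is illustrative, not verified):

```python
# Allow unauthenticated access to a public bucket via the s3a connector
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)

# Read an object from the public 1000Genomes bucket (path is hypothetical)
df = spark.read.text("s3a://1000genomes/README.analysis_history")
df.show(5, truncate=False)
```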
When the above is run in a notebook, you will see the following:
To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.
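A sketch for private buckets, with placeholder credentials and a hypothetical bucket path; it assumes a Spark session has already been created as shown above:

```python
# Placeholder credentials; never hard-code real keys in notebooks
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Read from the private bucket (path is hypothetical)
df = spark.read.csv("s3a://my-private-bucket/data.csv", header=True)
```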
Learn about data showcased on the Cohort Browser's locus details page.
When genomic information is ingested and made available in the Cohort Browser, variants can be annotated using NCBI's dbSNP and gnomAD. The specific versions of each are provided during the ingestion process, allowing the ingestion process to create a set of tables optimized for cohort creation through the Cohort Browser.
In the Cohort Browser, various annotation data (e.g. gene name, consequence, rsID) are already available both for cohort creation and for quick cohort interrogation. To access more detailed information, in the variant section of the Cohort Browser, simply click the location in the allele table and the locus details page will open in a new tab.
The page is split into three sections, all pre-calculated during ingestion of the dataset. The sections begin with a summary of the locus, including the genotype frequencies for the dataset at the locus, followed by a detailed breakdown of annotations by allele.
This pane shows at-a-glance summary information for the locus in relation to the dataset: the chromosome and starting position, the frequency of both the reference allele and the no-calls, and the total number of alleles available.
The genotypes section lets users see a detailed breakdown of the various genotypes in the dataset. Because allele order is not preserved, a C/A and an A/C count in the same bucket, so only half of the crosstab is populated. Note that these genotypes are for the dataset as a whole at the specific location, and are not specific to the selected cohort.
Each allele available in the dataset is split up to show various information that was pulled from dbSNP and gnomAD during ingestion. If an rsID or AffyID was available, it is shown along with a link to its NCBI dbSNP page. Additionally, the allele type, dataset frequency, and gnomAD frequency are shown for easy reference. Further information about the allele is broken down based on the different transcripts available. Columns in this section can vary based on the annotation resources used.
For canonical transcripts, a blue indicator will appear next to the ID of the transcript.
Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.
Jupyter notebooks are a popular way to track the work performed in computational experiments the way a lab notebook tracks the work done in a wet lab setting. DXJupyterLab, or JupyterLab, is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. DXJupyterLab allows users on the DNAnexus platform to collaborate on notebooks and extends JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.
DXJupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.
DXJupyterLab is a versatile application that can be used to:
Collaborate on exploratory analysis of data
Reproduce and fork work performed in computational analyses
Visualize and gain insights into data generated from biological experiments
Create figures and tables for scientific publications
Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows
Test and train machine/deep learning models
Interactively run commands on a terminal
There are two different DXJupyterLab apps available on the DNAnexus Platform. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the DNAnexus Apollo framework.
Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.
The DXJupyterLab Spark Cluster app contains all the features found in the general-purpose DXJupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.
DXJupyterLab 2.2 is the default version available on the DNAnexus Platform. Older versions are also available.
A step-by-step guide on how to start with DXJupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.
Creating a DXJupyterLab session requires the use of two different environments:
The DNAnexus project (accessible through the web platform and the CLI).
The worker execution environment.
From the JupyterLab session, you have direct access to the project in which the application is run. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal:
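The terminal command is missing in this copy; listing the project contents is typically done with the dx CLI:

```shell
dx ls
```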
The project is selected when the DXJupyterLab app is started and cannot be subsequently changed.
The project file browser displays all subfolders, all Jupyter notebooks, and files in the project (limited to the 1,000 most recently modified files). In the Spark-enabled app, databases in the project are also visible (limited to the 1,000 most recently modified databases). A list of all the objects in the project can be obtained programmatically or by using dx ls. The file listing is refreshed every 10 seconds, and you can force a refresh by clicking the circular arrow icon in the top right corner of the file browser.
When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and will be used to execute the notebook code. DNAnexus notebooks will have a [DX] prepended to the notebook name in the tab of all opened notebooks.
The execution environment file browser can be accessed from the left sidebar (notice the folder icon at the top) or from the terminal:
You can also create Jupyter notebooks in the worker execution environment through the File menu. These notebooks are stored on the local file system of the DxJupyterLab execution environment and have to be persisted in a DNAnexus project. More information about saving can be found in the next section.
You can create, edit, and save notebooks directly in the DNAnexus project as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are housed within the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus platform without being stored in the JupyterLab execution environment file system. These are referred to as “DNAnexus notebooks” and these notebooks persist in the DNAnexus project after the DXJupyterLab instance is terminated.
DNAnexus notebooks can be recognized by the [DX] that is prepended to the name in the tab of all opened notebooks.
DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears upon starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking “New notebook”. The Launcher tab can also be opened by clicking File and then selecting “New Launcher” from the upper menu.
To create a new local notebook, click the File tab in the upper menu and then select “New" and then “Notebook”. These non-DNAnexus notebooks can be saved to DNAnexus by simply dragging and dropping them in the DNAnexus file viewer in the left panel.
In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.
For reading the input file multiple times or for reading a large fraction of the file in random order:
Download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from the Jupyter notebook.
For scanning the content of the input file once or for reading only a small fraction of the file's content:
The project in which the app is running is mounted read-only at the /mnt/project folder. Reading the content of files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment, but makes more API calls to fetch the content.
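The two access patterns can be sketched as follows. This is a minimal sketch assuming a running DXJupyterLab session; the folder and file names are hypothetical:

```shell
# Pattern 1: repeated or random-order reads -- download a local copy first.
dx download /data/cohort.csv
head cohort.csv

# Pattern 2: single sequential scan -- stream from the read-only project mount,
# which fetches content on demand and uses minimal local disk space.
head /mnt/project/data/cohort.csv
```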
Files such as local notebooks can be persisted in the DNAnexus project by using one of these options:
Run dx upload in the bash console.
Drag and drop the file onto the DNAnexus tab in the column of icons on the left side of the screen. This will upload the file into the currently selected DNAnexus folder.
Exporting DNAnexus notebooks (e.g. to HTML or PDF) is not supported, but it's possible to dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. Exporting a local notebook to certain formats may first require the following commands: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
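The workaround can be sketched as follows, with a hypothetical notebook name; nbconvert is part of the standard Jupyter tooling, and the target format is just an example:

```shell
# Fetch the notebook from the project into the local execution environment.
dx download analysis.ipynb

# Export the local copy; other formats (e.g. pdf) may need the
# texlive packages mentioned above.
jupyter nbconvert --to html analysis.ipynb
```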
A command can be executed in the DxJupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd
input and additional input files using the in
input file array to the DxJupyterLab app. The provided command will run in the /opt/notebooks/
directory and any output files generated in this directory will be uploaded to the project and returned in the out
output field of the job that ran DxJupyterLab app.
The cmd input makes it possible to use the papermill
command that is pre-installed in the DxJupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
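Based on the description of the cmd and in inputs, such an invocation might look like the following sketch; the file ID is a hypothetical placeholder for the uploaded notebook:

```shell
# Run papermill non-interactively inside the DxJupyterLab execution environment.
# file-xxxx is a placeholder for the notebook's file ID in the project.
dx run dxjupyterlab \
    -icmd="papermill notebook.ipynb output_notebook.ipynb" \
    -iin=file-xxxx
```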
Here notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which will contain the result of executing the input notebook and will be uploaded to the project at the end of the app's execution. See the DxJupyterLab app page for details.
Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.
If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.
It is still possible to create a duplicate to see what changes are being saved in the locked notebook, or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate; after a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.
Once the editing user closes the notebook, the lock will be released and anybody else with access to the project will be able to open it.
Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, i.e. the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name, and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses won't be lost when the DXJupyterLab session ends.
DXJupyterLab sessions begin with a set duration, after which they will shut down automatically. The timeout clock is displayed in the footer on the right side and it can also be adjusted there (using the Update duration
button). Even if the DxJupyterLab webpage is closed, the termination will be executed at the set timestamp. Job lengths have an upper limit of 30 days, which cannot be extended.
A session can be terminated immediately from the top menu (DNAnexus
> End Session
).
It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus
> Create Snapshot
).
A DXJupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit
and docker save
commands). Any installed packages and files created locally are saved to a snapshot file, with the exception of directories /home/dnanexus
and /mnt/
, which are not included. This file is then uploaded to the .Notebook_snapshots folder in the project and can be passed as input the next time the app is started.
Snapshots built with DXJupyterLab versions prior to 2.0.0 (released in mid-2023) are incompatible with the current version. These older snapshots incorporate versions of tools that could cause conflicts and unexpected behavior in the upgraded DXJupyterLab app execution environment, and so cannot be loaded within that environment.
If you want to use an older snapshot in the current version of DXJupyterLab, you'll need to recreate the snapshot as follows:
Create a tarball incorporating all the necessary data files and packages.
Save the tarball in a project.
Launch the current version of DXJupyterLab.
Import and unpack the tarball file.
If you don't want to have to recreate your older snapshot, you can run an older version of DXJupyterLab and access the snapshot therein.
Viewing other file types from your project, such as CSV, JSON, and PDF files, images, and scripts, is convenient because JupyterLab renders them accordingly. For example, JSON files are collapsible and easy to navigate, and CSV files are presented in tabular format. However, editing and saving any open files from the project other than IPython notebooks will result in an error.
The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.
When running the DXJupyterLab app, it is possible to view, but not update, other projects to which the user has access. This enhanced scope is required to read databases that may be located in different projects and are not cloneable.
It is possible to start new jobs with dx run from a notebook or the Terminal. If the billTo of the project in which the JupyterLab session is running is not licensed to start detached executions, the started jobs will be subjobs of the interactive JupyterLab session. In this case, the --project argument to dx run will be ignored, and the JupyterLab job's workspace, not the given project, is used to run the job. Also, if an attached subjob fails or is terminated on the DNAnexus Platform, the whole job tree will be terminated, including the interactive JupyterLab session.
Jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.
The DXJupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud
, where job-xxxx
is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users will see a “403 Permission Forbidden” message under the JupyterLab session's URL.
When launching JupyterLab, the available feature options are PYTHON_R, ML, IMAGE_PROCESSING, and STATA.
Selecting the PYTHON_R feature (the default option) loads the environment with Python3 and the R kernel and interpreter.
Selecting the ML feature loads the environment with Python3 and machine learning packages such as TensorFlow, PyTorch, and CNTK, as well as the image processing package NiPype, but it does not include R.
Selecting the IMAGE_PROCESSING feature loads the environment with Python3 and image processing packages such as NiPype, FreeSurfer, and FSL, but it does not include R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found here.
The STATA
feature requires a license to run. See Stata in DXJupyterLab for more information about running Stata in JupyterLab.
See the in-product documentation for the full list of pre-installed packages. This list includes details on feature-specific packages available when running the PYTHON_R
, ML
, IMAGE_PROCESSING
, and STATA
features.
Additional packages can easily be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.
Create your first notebooks by following the instructions in this Quickstart guide.
See the DXJupyterLab Reference guide for tips and info on the most useful DXJupyterLab features.
Connect with Spark for database sharing, big data analytics, and rich visualizations.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: our access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.
DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC with a client such as beeline to run Spark SQL interactively. Refer to the Thrift Server page for more details.
You can launch a Spark application distributed across a cluster of workers. Since this is tightly integrated with the rest of the Platform, Spark jobs leverage the features of normal jobs: you have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the Platform web UI. You additionally have access to logs from workers and can monitor the job in the Spark UI.
With Spark, you can visualize your results in real time. You can save those queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can create charts and shareable dashboards. The filter view allows you to build cohorts very quickly without the need to write complex SQL queries by hand.
A database is a data object on the Platform. A database object is stored in a project.
Databases can be shared with other users or organizations through project sharing. Access to a database can be revoked at any time by the project administrator revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.
Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context, when connecting to Thrift. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:
The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.
If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects will be available to add new data.
As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, data and database object in the project.
The following tables list the supported actions on a database and database object, with the lowest necessary access level for an open and a closed database.
Spark SQL Function                    Open Database     Closed Database
ALTER DATABASE SET DBPROPERTIES       CONTRIBUTE        N/A
ALTER TABLE RENAME                    CONTRIBUTE        N/A
ALTER TABLE DROP PARTITION            CONTRIBUTE (*)    N/A
ALTER TABLE RENAME PARTITION          CONTRIBUTE        N/A
ANALYZE TABLE COMPUTE STATISTICS      UPLOAD            N/A
CACHE TABLE, CLEAR CACHE              N/A               N/A
CREATE DATABASE                       UPLOAD            UPLOAD
CREATE FUNCTION                       N/A               N/A
CREATE TABLE                          UPLOAD            N/A
CREATE VIEW                           UPLOAD            UPLOAD
DESCRIBE DATABASE, TABLE, FUNCTION    VIEW              VIEW
DROP DATABASE                         CONTRIBUTE (*)    ADMINISTER
DROP FUNCTION                         N/A               N/A
DROP TABLE                            CONTRIBUTE (*)    N/A
EXPLAIN                               VIEW              VIEW
INSERT                                UPLOAD            N/A
REFRESH TABLE                         VIEW              VIEW
RESET                                 VIEW              VIEW
SELECT                                VIEW              VIEW
SET                                   VIEW              VIEW
SHOW COLUMNS                          VIEW              VIEW
SHOW DATABASES                        VIEW              VIEW
SHOW FUNCTIONS                        VIEW              VIEW
SHOW PARTITIONS                       VIEW              VIEW
SHOW TABLES                           VIEW              VIEW
TRUNCATE TABLE                        UPLOAD            N/A
UNCACHE TABLE                         N/A               N/A
Data Object Action    Open Database     Closed Database
Add Tags              UPLOAD            CONTRIBUTE
Add Types             UPLOAD            N/A
Close                 UPLOAD            N/A
Get Details           VIEW              VIEW
Remove                CONTRIBUTE (*)    ADMINISTER
Remove Tags           UPLOAD            CONTRIBUTE
Remove Types          UPLOAD            N/A
Rename                UPLOAD            CONTRIBUTE
Set Details           UPLOAD            N/A
Set Properties        UPLOAD            CONTRIBUTE
Set Visibility        UPLOAD            N/A
(*) If a project is protected, then ADMINISTER access is required.
When a user creates a database, the name the user provides is validated and downcased before it is stored as the databaseName attribute of the database object. In addition, a unique database name is generated by downcasing the database object ID, replacing its hyphen with an underscore, and concatenating it, with two underscores, to the downcased database name. The unique database name is stored as the uniqueDatabaseName attribute of the database object.
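The derivation can be sketched in shell; the object ID and database name below are hypothetical, chosen to mirror the example used later on this page:

```shell
# Hypothetical database object ID and user-provided name.
dbid="database-FJf3y28066y5Jxj2b0Gz4G85"
name="Metabric_Data"

# Downcase the object ID, swap its hyphen for an underscore, then join it
# to the downcased name with a double underscore.
unique="$(printf '%s' "$dbid" | tr '[:upper:]' '[:lower:]' | tr '-' '_')__$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
echo "$unique"   # database_fjf3y28066y5jxj2b0gz4g85__metabric_data
```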
When a database is created using the following SQL statement with a user-provided database name (referenced below as db_name):
The platform database object, database-xxxx, is created with all lowercase characters. However, when creating a database using dxpy, the Python module supported by the DNAnexus SDK dx-toolkit, the following case-sensitive command returns a database ID based on the user-provided database name, assigned here to the variable db_name:
With that in mind, it is suggested either to use lowercase characters in your db_name assignment or to apply a forcing function such as .lower() to the user-provided database name:
Using Stata via DXJupyterlab, working with project files, and creating datasets with Spark.
Stata is a powerful statistics package for data science. Stata commands and functionality can be accessed on the DNAnexus Platform via stata_kernel in Jupyter notebooks.
On the DNAnexus Platform, use the DXJupyterLab app to create and edit Jupyter notebooks.
To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details according to the instructions below in a plain text file with the extension .json
, then upload this file to the project’s root directory. You only need to do this once per project.
Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username and <organization> is the org of which you're a member:
Save the file with a name in the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json
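The naming scheme can be sketched as follows, with a hypothetical username substituted for your own:

```shell
# Hypothetical DNAnexus username; substitute your own.
username="jsmith"

# Build the settings file name following the documented format.
fname=".stataSettings.user-${username}.json"
echo "$fname"   # .stataSettings.user-jsmith.json
```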
Open the project in which you want to use Stata. Upload the Stata license details file to the project’s root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.
When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.
Create a private project. Then create and save a Stata license details file in that project’s root directory, as per the instructions above.
Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the ID of the private project and file-xxxx is the ID of the license details file in that private project:
Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.
Select the app DXJupyterLab with Python, R, Stata, ML.
Click the Run Selected button. Note that if you haven't run this app before, you'll be prompted to install it. Next, you’ll be taken to the Run Analysis screen.
On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.
Add your Stata settings file as an input. This is the .json
file you created, containing your Stata license details.
In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.
Click the Start Analysis button at the top right corner of the screen. This will launch the DXJupyterLab app, and take you to the project's Monitor tab, where you can monitor the app's status as it loads.
Once the analysis starts, you’ll see the notification "Running" appear under the name of the app.
Click the Monitor tab heading. This will open a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you’ve just launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.
Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.
Open a new Stata notebook by clicking the Stata tile in the Notebooks section.
You can download DNAnexus data files to the DXJupyterLab container from a Stata notebook with:
Data files in the current project can also be accessed from a Stata notebook via the /mnt/project folder, as follows. To load a DTA file:
To load a CSV file:
To write a DTA file to the DXJupyterLab container:
To write a CSV file to the DXJupyterLab container:
To upload a data file from the DXJupyterLab container to the project, use the following command in a Stata notebook:
Alternatively, open a new Launcher tab, open Terminal, and run:
Note that the /mnt/project directory is read-only, so trying to write to it results in an error.
The DXJupyterLab Spark cluster app can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:
A pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:
To upload a data file from the JupyterLab container to the DNAnexus project in the DXJupyterLab Spark cluster app, use:
Once saved to the project, data files can be used in a DXJupyterLab Stata session using the instructions above.
Learn about the DNAnexus Thrift server, a service that allows JDBC and ODBC clients to run Spark SQL queries.
The DNAnexus Thrift server connects to a high availability Apache Spark cluster integrated with the platform. It leverages the same security, permissions, and sharing features built into DNAnexus.
To connect to the Thrift server, you need the following:
The JDBC url:
Note: Azure UK South (OFH) region does not support access to Thrift.
We support the following username format:
TOKEN__PROJECTID, where TOKEN is a DNAnexus user-generated token and PROJECTID is a DNAnexus project ID used as the project context (when you create databases). Note the double underscore between the token and the project ID.
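Constructing the username can be sketched as follows; both values below are hypothetical placeholders for your own token and project ID:

```shell
# Hypothetical placeholders -- substitute your API token and project ID.
token="abcdef1234"
project="project-xxxx"

# Join them with a double underscore, as the format requires.
user="${token}__${project}"
echo "$user"   # abcdef1234__project-xxxx
```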
Additionally, the Thrift server that the user wants to connect to and the project must be in the same region.
See the Authentication tokens page.
Navigate to https://platform.dnanexus.com and log in using your username and password.
Go to Projects -> your project -> Settings -> Project ID and click on Copy to Clipboard.
Beeline is a JDBC client bundled with Apache Spark that can be used to run interactive queries on the command line.
You can download Apache Spark 3.5.2 for Hadoop 3.x from here.
You need to have Java installed and on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If you already have beeline installed and all of the credentials, you can quickly connect with the following command:
In the following AWS example, note that some characters are escaped (; with \).
Note that the command for connecting to Thrift is different for Azure, as seen below:
The beeline client is located under $SPARK_HOME/bin/
.
Connect to beeline using the JDBC URL:
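A generic invocation might look like the following sketch. The Thrift host placeholder and the URL parameters are illustrative only; the exact JDBC URL differs by region (and between AWS and Azure, as noted above):

```shell
# <thrift-host> is a placeholder for your region's Thrift server host.
# The -n argument is the TOKEN__PROJECTID username described above.
$SPARK_HOME/bin/beeline \
    -u "jdbc:hive2://<thrift-host>:10000/;ssl=true" \
    -n "TOKEN__PROJECTID"
```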
Once successfully connected, you should see the message:
You are now connected to the Thrift server using your credentials and will be able to see all databases to which you have access within your current region.
You can query using the unique database name including the downcased database ID, for example database_fjf3y28066y5jxj2b0gz4g85__metabric_data.
If the database is within the same project context you used to connect to the Thrift server, you can query using only the database name, for example metabric_data. If the database is located outside the project, you will need to use the unique database name.
You may also find databases stored in other projects by specifying the project context in the LIKE option of SHOW DATABASES, using the format '<project-id>:<database pattern>', like so:
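For example, issued non-interactively through beeline; the project ID and pattern are hypothetical, and $JDBC_URL / $THRIFT_USER stand for the connection settings described above:

```shell
# List databases in another project by prefixing the pattern with its project ID.
$SPARK_HOME/bin/beeline -u "$JDBC_URL" -n "$THRIFT_USER" \
    -e "SHOW DATABASES LIKE 'project-xxxx:metabric*'"
```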
Now you can run SQL queries.
This page is a reference for the most useful operations and features in the DNAnexus JupyterLab environment.
You can download input data from a project using dx download
in a notebook cell:
The %%bash keyword converts the whole cell to a magic cell, which allows us to run bash code in that cell without exiting the Python kernel. See more examples of magic commands in the IPython documentation. The ! prefix achieves the same result:
Alternatively, the dx
command can be executed from the terminal.
To download data with Python in the notebook, you can use the download_dxfile
function:
Check dxpy helper functions for details on how to download files and folders.
Any files from the execution environment can be uploaded to the project using dx upload
:
To upload data using Python in the notebook, you can use the upload_local_file
function:
Check dxpy helper functions for details on how to upload files and folders.
By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer)
.
You may upload and download data to the local execution environment in a similar way, i.e. by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download
.
It is useful to have the terminal provided by JupyterLab at hand; it uses the bash shell by default and lets you execute shell scripts or interact with the Platform via the dx toolkit. For example, a command such as dx env will confirm what the current project context is.
Running pwd
will show you that the working directory of the execution environment is /opt/notebooks
. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.
To open a terminal window, go to File
> New
> Terminal
or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File
> New Launcher
.
You can install packages in the execution environment from the notebook using pip, conda, apt-get, and other package managers:
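For example, from the terminal (or from a notebook cell with a ! prefix); the package names below are arbitrary examples, not requirements of the environment:

```shell
# Python package via pip.
pip install seaborn

# Package from a conda channel.
conda install -y -c conda-forge samtools

# System package via apt (the session runs with root privileges).
apt-get update && apt-get install -y tree
```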
By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.
You can access public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private ssh key that's registered with your GitHub account in /root/.ssh/id_rsa, you can clone private GitHub repositories using git clone and push any changes back to GitHub using git push from the JupyterLab terminal.
Below is a screenshot of a JupyterLab session with a terminal displaying a script that:
sets up ssh key to access a private github repository and clones it,
clones a public repository,
downloads a json file from the DNAnexus project,
modifies an open-source notebook to convert the json file to csv format,
saves the modified notebook to the private github repository,
and uploads the results of json to csv conversion back to the DNAnexus project.
This animation shows the first part of the script in action:
A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd
input and additional input files using the in
input file array. The command will run in the directory where the JupyterLab server is started and notebooks are run, i.e. /opt/notebooks/
. Any output files generated in this directory will be uploaded to the project and returned in the out
output.
The cmd input makes it possible to use the papermill tool, pre-installed in the JupyterLab environment, to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:
where notebook.ipynb is the input notebook to papermill, which needs to be passed in the in input, and output_notebook.ipynb is the name of the output notebook, which will store the result of the cells' execution. The output will be uploaded to the project at the end of the app execution.
If the snapshot parameter is specified, execution of cmd will take place in the specified Docker container. The duration argument will be ignored when running the app with cmd. To limit the runtime, the app can be run from the command line with the --extra-args flag, e.g. dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.
If cmd is not specified, the in parameter will be ignored and the output of the app will consist of an empty array.
If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the NVIDIA GPU Driver kernel-mode driver nvidia.ko
that is installed outside of the DXJupyterLab environment does not support the newer CUDA version required by your application. You can install NVIDIA Forward Compatibility packages to use the newer CUDA version required by your application by following the steps below in a DXJupyterLab terminal.
If you are away from the JupyterLab browser tabs for 15 to 30 minutes, you will be automatically logged out of the JupyterLab session and the JupyterLab tabs will display a "Server Connection Error" message. You can re-enter the JupyterLab session by simply reloading the JupyterLab webpage and logging into the Platform, which will redirect you back to the JupyterLab session.
Learn about preprocessing VCF data before using it in an analysis.
It may be necessary to preprocess, or harmonize, the data before you load it.
The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
The VCF data can include variant annotations. Of particular interest are SnpEff annotations, which are included in VCFs as INFO/ANN tags; SnpEff annotations, if present, are loaded into databases. If desired, you may pre-annotate your VCF data to include SnpEff annotations after harmonizing your data -- just pass your pVCF to any standard SnpEff annotator. If your pVCF is especially large, it may be advantageous to rely on the internal annotation step in the VCF loader instead of annotating the pVCF yourself. The VCF loader annotation step annotates the pVCF in a distributed, massively parallel way.
Note that the VCF loader does not persist the intermediate, annotated pVCF as a file, so if you want to have access to the annotated file up front, you should annotate it yourself.
VCF annotation flows: in (a) the annotation step is external to the VCF loader, whereas in (b) it is internal. In either case, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF loader.
The command-line client and the client bindings use a set of environment variables in order to communicate with the API server and to store state on the current default project and directory. These settings are usually set when you run dx login
and can be later changed through other dx commands. To display the currently used settings in human-readable format, you can use the dx env
command:
To print the bash commands for setting the environment variables to match what dx
is using, you can run the same command with the --bash
flag.
Running a dx
command from the command-line will not (and cannot) overwrite your shell's environment variables, so it stores them in a file at ~/.dnanexus_config/environment
.
DNAnexus utilities load configuration values from the following sources, in order of priority:
Command line options (if available)
Environment variables already set in the shell
~/.dnanexus_config/environment.json (dx configuration file)
Hardcoded defaults
Note that the dx command will always prioritize the environment variables that are currently set in the shell. This means that if you have set your DX_SECURITY_CONTEXT environment variable and subsequently use dx login to log in as a different user, dx will still use the original environment variable. When not run in a script, dx prints a warning to stderr whenever the environment variables and its stored state mismatch. To get out of this situation, the best approach is often to run source ~/.dnanexus_config/unsetenv. Setting environment variables directly is generally reserved for shell scripts or for configuring the job environment in the cloud.
In the interaction below, environment variables have already been set; the user then uses dx to log in, but the shell's environment variables still take precedence.
If you instead want to discard the values that dx has stored, the command dx clearenv removes the dx-generated configuration file ~/.dnanexus_config/environment.json for you.
Most dx commands have the following additional flags to temporarily override the values of the respective variables.
For example, you can temporarily override the current default project used:
The CSV Loader ingests CSV files into a database. The input CSV files are loaded into a Parquet-format database and tables that can be queried using Spark SQL.
You can load a single CSV file or many CSV files. When loading many files, all files must be syntactically consistent. For example:
For example:
All files must have the same separator (e.g. comma, tab)
All files must include a header line, or all files must exclude it
NOTE: Each CSV file is loaded into its own table within the specified database.
Input:
CSV (array of CSV files to load into the database)
Required Parameters:
database_name -> name of the database to load the CSV files into.
create_mode -> strict mode creates the database and tables from scratch; optimistic mode creates them if they do not already exist.
insert_mode -> append appends data to the end of tables; overwrite is equivalent to truncating the tables and then appending to them.
table_name -> array of table names, one for each corresponding CSV file by array index.
type -> the cluster type; "spark" for Spark apps.
Other Options:
spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.
spark_read_csv_sep -> default , -- the separator character used by each CSV.
spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
The following case creates a brand new database and loads data into two new tables:
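A command-line sketch of such a run follows. The app executable name and input file IDs are hypothetical placeholders; parameter names follow the list above:

```shell
# Create a new database (strict mode) and load two CSVs into two new tables.
dx run csv-loader \
    -icsv=file-aaaa \
    -icsv=file-bbbb \
    -idatabase_name=my_database \
    -icreate_mode=strict \
    -iinsert_mode=append \
    -itable_name=first_table \
    -itable_name=second_table \
    -itype=spark
```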
VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many files case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In any case, every input VCF file must be a syntactically correct, sorted VCF file.
Input:
vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned, and every specified partition must be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
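The manifest constraints described above (distinct names, each ending in .vcf.gz) can be checked locally before a run; this is a sketch with hypothetical file IDs and names:

```python
def validate_vcf_manifest(entries):
    """Check manifest rules: every referenced file name ends in .vcf.gz
    and all names are distinct.

    `entries` is a list of (file_id, file_name) pairs corresponding to the
    files referenced by the manifest.
    """
    names = [name for _, name in entries]
    if any(not name.endswith(".vcf.gz") for name in names):
        raise ValueError("every referenced file name must end in .vcf.gz")
    if len(set(names)) != len(names):
        raise ValueError("referenced file names must be distinct")
    return True

# Hypothetical manifest contents: one partition per chromosome.
manifest = [("file-aaaa", "chr1.vcf.gz"), ("file-bbbb", "chr2.vcf.gz")]
print(validate_vcf_manifest(manifest))  # → True
```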
Required Parameters:
database_name: (string) name of the database into which to load the VCF files.
create_mode: (string) strict mode creates the database and tables from scratch; optimistic mode creates them only if they do not already exist.
insert_mode: (string) append appends data to the end of the tables; overwrite is equivalent to truncating the tables and then appending to them.
run_mode: (string) site mode processes only the site-specific data; genotype mode processes genotype-specific data and other non-site-specific data; all mode processes both types of data.
etl_spec_id: (string) currently only the genomics-phenotype schema choice is supported.
is_sample_partitioned: (boolean) whether the raw VCF data is partitioned by sample.
Other Options:
snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing, with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome: (string) default GRCh38.92 -- ID of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.
calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. This step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions: (int) the number of partitions for the initial VCF-lines Spark RDD.
The Spark SQL Runner application brings up a Spark cluster and executes your provided list of SQL queries. This is especially useful if you need to perform a sequence repeatedly or if you need to run a complex set of queries. You can vary the size of your cluster to speed up your tasks.
Input:
sqlfile: [Required] A SQL file containing an ordered list of SQL queries.
substitutions: A JSON file containing the variable substitutions.
user_config: A user configuration JSON file, in case you want to set or override certain Spark configurations.
Other Options:
export: (boolean) default false. Exports output files with the results for the queries in the sqlfile.
export_options: A JSON file containing the export configurations.
collect_logs: (boolean) default false. Collects cluster logs from all nodes.
executor_memory: (string) Amount of memory to use per executor process (e.g. 2g, 8g), in MiB unless otherwise specified. Passed as --executor-memory to spark-submit.
executor_cores: (integer) Number of cores to use per executor process. Passed as --executor-cores to spark-submit.
driver_memory: (string) Amount of memory to use for the driver process (e.g. 2g, 8g). Passed as --driver-memory to spark-submit.
log_level: (string) default INFO. Logging level for both driver and executors. One of [ALL, TRACE, DEBUG, INFO].
Output:
output_files: Output files include the report SQL file and the query export files.
The SQL runner extracts each command in sqlfile and runs them in sequential order. Every SQL command must be separated by a semicolon (;). Any line starting with -- is ignored (a comment). Any comment within a command should be written inside /*...*/. The following are examples of valid comments:
Variable substitution can be done by specifying the variables to replace in substitutions. In the above example, each reference to srcdb within ${...} in sqlfile will be substituted with sskrdemo1, for example select * from ${srcdb}.${patient_table};. The script adds the set command before executing any of the SQL commands in sqlfile, so select * from ${srcdb}.${patient_table}; would translate to:
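The substitution mechanism itself can be sketched in plain Python. The srcdb value mirrors the example above; the patient_table value is hypothetical:

```python
import re

def substitute(sql, variables):
    """Replace every ${name} occurrence in `sql` with its value from `variables`."""
    return re.sub(r"\$\{(\w+)\}", lambda m: variables[m.group(1)], sql)

# srcdb comes from the example above; the patient_table value is made up.
subs = {"srcdb": "sskrdemo1", "patient_table": "patient"}
query = "select * from ${srcdb}.${patient_table};"
print(substitute(query, subs))  # → select * from sskrdemo1.patient;
```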
If enabled, the results of the SQL commands will be exported to CSV files. export_options defines the export configuration.
num_files: default 1. Defines the maximum number of output files to generate. This generally depends on how many executors are running in the cluster and on how many partitions of the file exist in the system. Each output file corresponds to a part file in Parquet.
fileprefix: The filename prefix for every SQL output file. By default, output files are prefixed with the query_id, which is the order in which the queries are listed in sqlfile (starting with 1), for example 1-out.csv. If a prefix is specified, output files are named like <prefix>-1-out.csv.
header: default true. If true, a header is added to each exported file.
These values in spark-defaults.conf will override or add to the default Spark configuration.
Two files are generated in the export folder:
<JobId>-export.tar: contains all the query results.
<JobId>-outfile.sql: the SQL debug file.
Extracting the export tar file will look like:
In the above example, demo is the fileprefix used. There is one folder for each query; each folder has a .sql file containing the query executed and a .csv folder containing the result CSV.
Every SQL Runner execution generates a SQL runner debug report file (a .sql file). It lists all the queries executed and the status of each execution (Success or Fail), along with the name of the output file for each command and the time taken. If there are any failures, it reports the failing query and stops executing subsequent commands.
While executing the series of SQL commands, one of the commands could fail (an error, a syntax problem, etc.). In that case the app quits and uploads a SQL debug file to the project:
As you can see, it identifies the line with the SQL error and its response. You can then fix the query in the .sql file and even use this report file as the input for a subsequent run, picking up where it left off.
Numerical (Integer)
Numerical (Float)
Date
Datetime
Categorical (<=20 distinct category values)
Categorical Multi-Select (<=20 distinct category values)
Categorical or Categorical Multiple (<=15 categories)
Numerical (Integer) or Numerical (Float)
Categorical (<=20 distinct category values)
Categorical (<=20 distinct category values)
Numerical (Integer) or Numerical (Float)
Numerical (Integer) or Numerical (Float)
Categorical (<=20 distinct category values)
Categorical Multiple (<=20 distinct category values)
Categorical Hierarchical (<=20 distinct category values)
Categorical Hierarchical Multiple (<=20 distinct category values)
Numerical (Integer)
Numerical (Float)
GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.
Note: To learn more, see the GLnexus documentation.
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Although VCF data can be loaded into Apollo databases immediately after the variant-calling step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, you'll want to preprocess and harmonize your data before loading.
You can use the dx ls command to list the objects in your current project, and the dx pwd command to learn which project and folder you are currently in. Using glob patterns, you can broaden your search for objects by specifying filenames with wildcard characters such as * and ?. An asterisk (*) represents zero or more characters in a string, and a question mark (?) represents exactly one character.
By listing objects in your current directory with the wildcard characters * and ?, you can search for objects whose filenames match a glob pattern. Here we take the folder "C. Elegans - Ce10/" in the public project "Reference Genome Files" (platform login required to access this link) and walk through these examples:
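The matching behavior of * and ? can be reproduced locally with Python's fnmatch module; the file names below are illustrative, not real project contents:

```python
from fnmatch import fnmatch

files = ["ce10.fa.gz", "ce10.fasta.gz", "ce10.bwa-index.tar.gz"]

# '*' matches zero or more characters, so every .gz file matches "*.gz".
print([f for f in files if fnmatch(f, "*.gz")])

# '?' matches exactly one character: "ce1?.fa.gz" matches "ce10.fa.gz" only.
print([f for f in files if fnmatch(f, "ce1?.fa.gz")])  # → ['ce10.fa.gz']
```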
If you wish to search the entire project with a filename pattern, you can use the command dx find data --name with the wildcard characters. Unless --path or --all-projects is specified, dx find data searches data under the current project. Below, we use the command dx find data in the public project "Reference Genome Files" (platform login required to access this link) with the --name option to specify the filename of the objects we're searching for.
As described above, if a file's name contains special characters, they should be escaped when searching. Additionally, since a colon (:) is used to denote project names and a slash (/) is used to separate folder names on the platform, these two are also special characters, so we need to escape them as well when they appear in a data object's name. To escape any special character, use a preceding backslash (\).
Please note that while dx-toolkit itself requires only a single \ to escape a colon or a slash, the syntax conventions of some shells may require you to escape the \ character itself with an extra backslash, or to enclose the argument in single quotes.
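A minimal helper for producing the escaped form is sketched below; dx itself only requires the backslash before each special character, and the sample name is hypothetical:

```python
def dx_escape(name):
    """Prefix the platform-special characters (:, /, and \\) with a backslash."""
    out = []
    for ch in name:
        if ch in ":/\\":
            out.append("\\")
        out.append(ch)
    return "".join(out)

# Hypothetical object name containing both a colon and a slash.
print(dx_escape("sample:2021/reads.fastq.gz"))  # → sample\:2021\/reads.fastq.gz
```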
dx find data also allows you to search for data using metadata fields, such as when the data was created, the data's tags, or the project in which the data exists. You can use the flags --created-after and --created-before to search for data objects created within a period of time.
You can search for objects based on their metadata. An object's metadata can be set with the commands dx tag and dx set_properties, which respectively tag an object or set key-value pairs describing it. You can also set metadata while uploading data to the platform. To search by object tags, use the option --tag. This option can be repeated if the search requires multiple tags. To search by object properties, use the option --property. This option can be repeated if the search requires multiple properties.
You can search for an object that lives in a different project than your current working project by specifying a project and folder path with the flag --path. Below, we specify the project ID (project-BQfgzV80bZ46kf6pBGy00J38) of the public project "Exome Analysis Demo" (platform login required to access this link) as an example. If you would like to search for data objects in all projects in which you have VIEW or above permissions, use the --all-projects flag. Public projects are not shown in this search.
To describe data for small numbers of files (typically below 100), scope findDataObjects to a single project. The below is an example of code used to scope the search to a project:
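As a sketch, a system/findDataObjects request body scoped to a single (hypothetical) project might look like the following; the project ID is a placeholder:

```python
import json

# Hypothetical request body for system/findDataObjects, scoped to one project
# so the search does not fan out across every project you can view.
find_request = {
    "class": "file",
    "scope": {
        "project": "project-xxxx",  # placeholder project ID
        "folder": "/",
        "recurse": True,
    },
    "describe": True,  # also return a describe hash for each result
}

print(json.dumps(find_request, indent=2))
```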
See the API method system/findDataObjects for more information about usage.
The DNAnexus Relational Database Service provides users with a way to create and manage cloud database clusters (referred to as dbcluster objects on the platform). These databases can then be securely accessed from within DNAnexus jobs/workers.
The Relational Database Service is currently available via the application programming interface (API) in AWS regions only. See the DBClusters API page for details.
When describing a DNAnexus dbcluster, the status field can be any of the following:
creating: The database cluster is being created, but is not yet available for reading/writing.
available: The database cluster is created and all replicas are available for reading/writing.
stopping: The database cluster is currently being stopped.
stopped: The database cluster is stopped.
starting: The database cluster is restarting from a stopped state; it will transition to available when ready.
terminating: The database cluster is being terminated.
terminated: The database cluster has been terminated and all data deleted.
DB Clusters are not accessible from outside of the DNAnexus platform. Any access to these databases must occur from within a DNAnexus job. Refer to this page on cloud workstations for one possible way to access a DB Cluster from within a job. Executions such as app/applets can access a DB Cluster as well.
The parameters needed for connecting to the database are:
host: use the endpoint returned from dbcluster-xxxx/describe
port: 3306 for MySQL engines or 5432 for PostgreSQL engines
user: root
password: use the adminPassword specified when creating the database via dbcluster/new
For MySQL: ssl-mode 'required'
For PostgreSQL: sslmode 'require'. Note: for connecting and verifying certs, see https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html
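Putting those parameters together, connection URLs for the two engine families can be sketched as below; the endpoint and password are placeholders, and the exact SSL option name depends on the client library you use:

```python
# Placeholder values: the real endpoint comes from dbcluster-xxxx/describe,
# and the password is the adminPassword chosen at dbcluster/new time.
endpoint = "mydb.cluster-xxxx.us-east-1.rds.amazonaws.com"
admin_password = "s3cret"

# MySQL engines listen on 3306 and require ssl-mode=required;
# PostgreSQL engines listen on 5432 and require sslmode=require.
mysql_url = f"mysql://root:{admin_password}@{endpoint}:3306/?ssl-mode=required"
postgres_url = f"postgresql://root:{admin_password}@{endpoint}:5432/?sslmode=require"

print(mysql_url)
print(postgres_url)
```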
The table below provides all the valid configurations of dxInstanceClass, database engine, and engine versions.

| dxInstanceClass | Engine versions | Memory (GB) | Cores |
| --- | --- | --- | --- |
| db_std1_x2 (*) | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 4 | 2 |
| db_mem1_x2 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 16 | 2 |
| db_mem1_x4 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 32 | 4 |
| db_mem1_x8 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 64 | 8 |
| db_mem1_x16 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 128 | 16 |
| db_mem1_x32 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 244 | 32 |
| db_mem1_x48 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 384 | 48 |
| db_mem1_x64 | aurora-mysql: ["8.0.mysql_aurora.3.04.1"], aurora-postgresql: ["12.9", "13.9", "14.6"] | 488 | 64 |
| db_mem1_x96 | aurora-postgresql: ["12.9", "13.9", "14.6"] | 768 | 96 |
* db_std1 instances may incur CPU Burst charges, similar to the AWS T3 DB instances described in this AWS documentation. Regular hourly charges for this instance type are based on 1 core; CPU Burst charges are based on 2 cores.
If a project contains a DBCluster, its ownership cannot be changed. A PermissionDenied error is returned on attempting to change the billTo of such a project.
You can perform advanced filtering on projects, data objects, and jobs using the filter bars above the table of results. This feature is displayed at the top of the Monitor tab but is hidden by default on the Manage tab and Projects page. You can display or hide the filter bar by toggling the filters icon in the top right corner.
The filter bar lets you specify different criteria on which to filter your data. You can combine several different filters for greater control over your results.
To use this feature, first choose the field you want to filter your data by, then enter your filter criteria (e.g. select the "Name" filter, then search for "NA12878"). The filter is usually activated when you press Enter or when you click outside of the filter bar.
The following filters are available for projects, and can be added by selecting them from the "Filters" dropdown menu.
Billed to: The user or org ID that the project is billed to, e.g. "user-xxxx" or "org-xxxx". NOTE: When you are viewing a partner organization's projects, the "Billed to" field will be fixed to the org ID.
Project Name: Search by case insensitive string or regex, e.g. "Example" or "exam$" will both match "Example Project"
ID: Search by project ID, e.g. "project-xxxx"
Created date: Search by projects created before, after, or between different dates
Modified date: Search by projects modified before, after, or between different dates
Creator: The user ID who created the project, e.g. "user-xxxx"
Shared with member: A user ID with whom the project is shared, e.g. "user-xxxx" or "org-xxxx"
Level: The minimum permission level for the project. The dropdown has the options "Viewer+", "Uploader+", "Contributor+", and "Admin only", e.g. "Contributor+" filters for projects with CONTRIBUTOR or ADMINISTER access
Tags: Search by tag. The filter bar automatically populates with tags available on projects
Properties: Search by properties. The filter bar automatically provides properties available on projects
The following filters are available for objects. Filters listed in italics are not displayed in the filter bar by default but can be added by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default scope is "Entire project", but if you know the location of the object you are looking for, limiting your search scope to "Current Folder" allows you to search more efficiently.
Object name: Search by case insensitive string or regex, e.g. "NA1" or "bam$" both match "NA12878.bam"
ID: Search by object ID, e.g. "file-xxxx" or "applet-xxxx"
Modified date: Search by objects modified before, after, or between different dates
Class: e.g. "File", "Applet", "Folder"
Types: e.g. "File" or custom Type
Created date: Search by objects created before, after, or between different dates
Tags: Search by tag. The filter bar automatically populates with tags available on objects within the current folder
Properties: Search by properties. The filter bar automatically provides properties available on objects within the current folder
When you filter on anything other than the current folder, you will get results from many different places in the project. The folder paths are displayed in a lighter gray font and some actions are unavailable (such as creating a new workflow or folder), but otherwise functionality remains the same as in the normal data view.
The following filters are available for executions. Filters listed in italics are not displayed in the filter bar by default but can be added to the bar by selecting them from the "Filters" dropdown menu on the right.
Search scope: The default displays root executions only, but you can choose to view all executions (root and subjobs) instead
State: e.g. Failed, Waiting, Done, Running, In Progress, Terminated
Name: Search by case-insensitive string or regex, e.g. "BWA" or "MEM$" both match "BWA-MEM". This only matches the name of the job or analysis, not the executable name.
ID: Search by job or analysis ID, e.g. "job-1234" or "analysis-5678"
Created date: Search by executions created before, after, or between different dates
Launched by: Search by the user ID of the user who launched the job. The filter bar automatically populates with users who have run the currently-shown jobs within the project
Tags: Search by tag. The filter bar automatically populates with tags available on the currently-shown executions
Properties: Search by properties. The filter bar automatically provides properties available on executions currently shown within the project
Executable: Search by the ID of the executable run by the executions in question (e.g. app-1234 or applet-5678)
Class: e.g. Analysis or Job
Origin Jobs: ID of origin job
Parent Jobs: ID of parent job
Parent Analysis: ID of parent analysis
Root Executions: ID of root execution
When filtering on a name, any spaces are expanded to include intermediate words. For example, filtering by "b37 build" also returns "b37 dbSNP build".
Some filters allow you to specify a date range for your query. For example, the "Created date" filter allows you to specify a beginning time ("From") and/or an end time ("To"). Clicking on the date box opens a calendar widget which allows you to specify a relative time period in minutes, hours, days, weeks, months, or an absolute time period by specifying a certain date.
If you select a relative time period, you are able to represent it as some amount of time before the current time. For example, if you select "Day" and type in 5, you are setting the datetime to the time 5 days prior to the current time.
Alternatively, you can use the calendar to represent an exact (absolute) datetime.
If you only set the beginning datetime ("From"), the range will automatically be set from the "From" time to the present moment. If you only set the end ("To") datetime, the range will be set from the beginning of time to the "To" time.
Note that if you have a filter saved with a time period defined in relative terms, it is recalculated from the current time every time you load the filter. For example, if you have a saved filter showing all items created within the last two hours and you load it at 11am, it shows you everything created since 9am. But if you load it again at 4pm, it shows you everything created since 2pm. If you want consistent results when using a saved filter, make sure to specify an absolute datetime using the calendar widget.
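The relative-time behavior is easy to sketch: a saved relative period is resolved against the current clock on every load. This covers the timedelta-friendly units (minutes, hours, days, weeks); the times below are illustrative:

```python
from datetime import datetime, timedelta

def resolve_relative(amount, unit, now=None):
    """Turn a relative period (e.g. 2 hours) into an absolute 'From' datetime.

    `unit` is a timedelta keyword: "minutes", "hours", "days", or "weeks".
    """
    now = now or datetime.now()
    return now - timedelta(**{unit: amount})

# The same saved relative filter resolves differently at different load times.
print(resolve_relative(2, "hours", now=datetime(2024, 1, 1, 11, 0)))  # → 2024-01-01 09:00:00
print(resolve_relative(2, "hours", now=datetime(2024, 1, 1, 16, 0)))  # → 2024-01-01 14:00:00
```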
To search by tag, simply enter (or select) the tag(s) you are searching for. For example, if you are searching for all objects with the tag "human", enter human into the filter query box and tick the checkbox next to the tag.
Note that while normal searches on project titles only require you to type in part of the title, searches using the above keywords require the entire value to be typed in, although casing doesn't matter (so, for example, searches for HUMAN and for human will both find a project with the tag "Human", but a search for Hum will not).
Properties have two parts: a key and a value. You'll be asked to enter each of these when you create a new property. Just like tags, properties allow you to create your own common attributes across multiple projects or items and find them quickly and easily. When searching for a property, you can either search for all items that have that property, or items that have a property with a certain value.
To search for all items that possess a property, regardless of the value of that property, simply select the "Properties" filter (not displayed by default), enter the property key, and click Apply. To search for items that possess a property with a certain value, enter that property's key and value.
Note that the keys and values must be entered in their entirety. For example, entering the key sample and the value NA will not find objects corresponding to {"sample_id": "NA12878"}!
Some filters allow you to select multiple values. For example, the "Tag" filter allows you to specify multiple tags in the dialog. When you have selected multiple tags, you can choose whether to search for objects containing any of the selected tags or all of the selected tags.
Given the following set of objects:
Object 1 (tags: "human", "normal")
Object 2 (tags: "human", "tumor")
Object 3 (tags: "mouse", "tumor")
Selecting both "human" and "tumor" tags, and choosing to filter by any tag returns all 3 objects. Choosing to filter by all tags returns only Object 2.
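The any/all distinction is just a set test, and the example above can be reproduced directly:

```python
# The three objects from the example above, with their tags.
objects = {
    "Object 1": {"human", "normal"},
    "Object 2": {"human", "tumor"},
    "Object 3": {"mouse", "tumor"},
}
selected = {"human", "tumor"}

# "Any tag": the object's tags intersect the selected tags.
any_match = [name for name, tags in objects.items() if tags & selected]
# "All tags": the selected tags are a subset of the object's tags.
all_match = [name for name, tags in objects.items() if selected <= tags]

print(any_match)  # → ['Object 1', 'Object 2', 'Object 3']
print(all_match)  # → ['Object 2']
```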
Click the "Clear All Filters" button on the filter bar to reset your filters.
If you wish to save your filters, active filters are saved in the URL of the filtered page. You can bookmark this URL in your browser to return to your filtered view in the future.
Note that the bookmarked URL is a saved set of search parameters, not a saved set of search results. Every time you click on the bookmarked filters, those parameters are loaded into the filter bar. For example, if your saved filters includes a filter for items created in the last thirty days, each time you click that search it shows you items created in the last thirty days from the current point in time, not thirty days before the URL was created. Thus, as time goes by, you may notice slightly different results every time you click on a saved search.
You can upload files to the DNAnexus platform using the command dx upload
. You can also upload data using the DNAnexus Upload Agent, a fast and convenient command-line client. For uploading multiple or large files (>50 MB), we recommend that you use Upload Agent, which allows you to upload up to 1000 files concurrently and resume uploads in case of network interruption.
You can use the dx upload command followed by a file path to upload one local file.
If you are trying to upload a file hosted at a publicly accessible URL rather than locally, you can use the URL Fetcher (platform login required to access this link) app.
You can also use the dx upload command followed by multiple file paths to upload multiple local files.
The following command will upload all of the files inside the /Users/alice/
directory. This directory should contain only files and not sub-directories.
You can specify the -r/--recursive parameter with the dx upload command to recursively upload one or more directories or folders and maintain their respective structures.
You can use the dx ls command to list the uploaded folders and files.
NOTE: If the local uploaded directory ends with the / character, only the contents of the directory will be uploaded, not the directory itself.
You can specify object metadata using the dx upload command.
You can use the --property KEY=VALUE parameter to add metadata to the file being uploaded. The parameter may be repeated as necessary, e.g. --property key1=val1 --property key2=val2, to attach multiple metadata fields as key-value pairs.
You can also add tags to uploaded files with the dx upload command using the --tag TAG parameter. The parameter may be repeated as necessary, e.g. --tag tag1 --tag tag2, to attach multiple tags to a file.
You can use the --path/--destination parameter to specify the DNAnexus destination path. If the path is not specified, dx upload defaults to the current project and folder. You can determine your current project and folder using the command dx pwd.
You can upload data directly from standard input. This is very useful when streaming the upload while the file is generated.
You can specify the --buffer-size parameter with the dx upload command to set the write buffer size in bytes. When uploading large files from standard input, you will need to set --buffer-size manually, because the dx command does not know the file size beforehand.
You can specify the --no-progress flag with the dx upload command to hide the progress bar.
You can describe objects (files, app(let)s, and workflows) on the DNAnexus platform using the command dx describe.
Objects can be described using their DNAnexus platform name via the command line interface (CLI) using a path.
Objects can be described relative to your current directory on the DNAnexus platform. In the following example, we describe the indexed reference genome file human_g1k_v37.bwa-index.tar.gz.
NOTE: The entire path is enclosed in quotes due to the space in the folder name "Original files". Instead of quotes, you can escape special characters with the \ character: dx describe Original\ files/human_g1k_v37.bwa-index.tar.gz.
Objects can be described using an absolute path. This allows us to describe objects outside the current project context. In the following example, we dx select the project "My Research Project" and dx describe the file human_g1k_v37.fa.gz in the "Reference Genome Files" project.
Objects can be described using a unique object ID.
In this example, we describe workflow object "Exome Analysis Workflow" using its ID. This workflow is publicly available in the "Exome Analysis Demo" project.
Due to the amount of information contained in a workflow (including multiple app(let)s, inputs/outputs, and default parameters), the dx describe output can seem overwhelming.
The output from a dx describe command can be used for various purposes. The optional argument --json converts the output of dx describe into JSON format for advanced scripting and command-line use.
In this example, we will describe the publicly available workflow object "Exome Analysis Workflow" and return the output in JSON format.
We can parse, process, and query the JSON output using jq. Below, we process the dx describe --json output to generate a list of all stages in the aforementioned exome analysis pipeline. We can then output the "executable" value of each stage present in the "stages" value of the dx describe output using the command below.
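The same extraction can also be sketched in Python on a saved describe output. The JSON below is a trimmed, hypothetical stand-in for real dx describe --json output (the stage and app names are made up):

```python
import json

# Trimmed, hypothetical stand-in for `dx describe --json` output of a workflow.
describe_json = json.loads("""
{
  "id": "workflow-xxxx",
  "stages": [
    {"id": "stage-1", "executable": "app-mapper"},
    {"id": "stage-2", "executable": "app-variant-caller"}
  ]
}
""")

# Equivalent of `jq '.stages[].executable'`: pull each stage's executable.
executables = [stage["executable"] for stage in describe_json["stages"]]
print(executables)  # → ['app-mapper', 'app-variant-caller']
```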
| Field name | Objects | Description |
| --- | --- | --- |
| ID | All | Unique ID assigned to a DNAnexus object. |
| Class | All | DNAnexus object type. |
| Project | All | Container where the object is stored. |
| Folder | All | Objects inside a container (project) can be organized into folders. Objects can only exist in one path within a project. |
| Name | All | Object name on the platform. |
| State | All | Status of the object on the platform. |
| Visibility | All | Whether or not the file is visible to the user through the platform web interface. |
| Tags | All | Set of tags associated with an object. Tags are strings used to organize or annotate objects. |
| Properties | All | Key/value pairs attached to an object. |
| Links | All | JSON reference to another object on the platform. Linked objects will be copied along with the object if the object is cloned to another project. |
| Created | All | Date and time the object was created. |
| Created by | All | DNAnexus user who created the object. Contains the subfield "via the job" if the object was created as a result of an app or applet. |
| Last modified | All | Date and time the object was last modified. |
| Input Spec | App(let)s and Workflows | App(let) or workflow input names and classes. For workflows, the corresponding applet stage ID is also provided. |
| Output Spec | App(let)s and Workflows | App(let) or workflow output names and classes. For workflows, the corresponding applet stage ID is also provided. |
There are several different methods with which you can view your files and data on the DNAnexus platform.
DNAnexus allows users to preview and open the following file types directly on the platform:
TXT
PNG
HTML
To preview these files, select the file you wish to view by either clicking on its name in the Manage tab or selecting the checkbox next to the file. If the file is one of the file types listed above, you will see the "Preview" and "Open in New Tab" options in the toolbar above.
Alternatively, you can click on the three dots on the far right and choose the "Preview" or "Open in New Tab" options from the dropdown menu.
"Preview" will open a fixed-sized box in your current tab for you to preview the file of interest. "Open in New Tab" will allow you to view the file in a separate tab. Due to limitations in web browser technologies, "Preview" and "Open in New Tab" may yield different results.
For files not listed in the section above, the DNAnexus platform also provides a lightweight framework called Viewers, which allows users to view their data using new or existing web-based tools.
A Viewer is simply an HTML file that you can give one or more DNAnexus URLs representing files to be viewed. Viewers generally integrate third-party technologies, such as HTML-based genome browsers.
You can easily launch a viewer by clicking on the Visualize tab within a project.
This tab will open a window showing all Viewers available to you within your project. If you have created any Viewers yourself and saved them within your current project, these will show up in this list along with the DNAnexus-provided Viewers.
Clicking on a Viewer will open a data selector for you to choose the files you wish to visualize. Tick one or more files that you want to provide to the Viewer. (The Viewer does not have access to any of your other data.) From there, you can either create a Viewer Shortcut or launch the Viewer.
The BioDalliance and IGV.js viewers provide HTML-based human genome browsers which you can use to visualize mappings and variants. When launching either viewer, tick a pair of *.bam + *.bai files for each mappings track you would like to visualize, and a pair of *.vcf.gz + *.vcf.gz.tbi files for each variant track you want to add. In addition, the BioDalliance browser supports bigBed (*.bb) and bigWig (*.bw) tracks.
For more information about BioDalliance, consult http://www.biodalliance.org/started.html. For IGV.js, see http://igv.org.
The BAM Header Viewer allows you to peek inside a BAM header, similar to what you would get if you were to run samtools view -H on the BAM file. (BAM headers include information about the reference genome sequences, read groups, and programs used.) When launching this viewer, tick one or more BAM files (*.bam).
The Jupyter notebook viewer shows *.ipynb notebook files, displaying notebook images, highlighting code blocks, and rendering Markdown blocks.
This viewer allows you to uncompress and see the first few kilobytes of a gzipped file. It is conceptually similar to what you would get if you were to run zcat <file> | head. Use this viewer to peek inside compressed reads files (*.fastq.gz) or compressed variants files (*.vcf.gz). When launching this viewer, tick one or more gzipped files (*.gz).
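As a concrete illustration of what this viewer does, here is the equivalent command-line peek, using a small gzipped file created on the spot (zcat is the GNU coreutils form; on macOS use gzip -dc):

```shell
# Create a small gzipped "reads" file, then peek at its first lines,
# mirroring what the gzip viewer shows on the Platform.
printf 'line1\nline2\nline3\n' > reads.txt
gzip -f reads.txt                 # produces reads.txt.gz
zcat reads.txt.gz | head -n 2     # prints line1 and line2
```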
If a viewer fails to load, please temporarily disable browser extensions such as AdBlock and Privacy Badger. Additionally, viewers are not supported in Incognito browser windows.
Developers comfortable with HTML and JavaScript can create custom viewers to visualize data on the platform.
Viewer Shortcuts are objects that, when opened, launch a data selector for choosing the inputs with which to launch a specified Viewer. A Viewer Shortcut consists of a Viewer and an array of inputs that will be selected by default.
The Viewer Shortcut will show up in your project as an object of type "Viewer Shortcut." You can change the name of the Viewer Shortcut and move it within your folders and projects as you would any other object in the DNAnexus platform.
Learn how to archive files, a cost-effective way to retain files in accord with data-retention policies, while keeping them secure and accessible, and preserving file provenance and metadata.
The archiving feature is file-based. Users can archive individual files, folders, or entire projects to save on storage costs, and can easily unarchive one or more files, folders, or projects when they need to make the data available for further analyses.
The DNAnexus Archive Service is currently available via the application programming interface (API) in AWS and Microsoft Azure regions.
To understand the archival life cycle as well as which operations can be performed on files and how billing works, it’s helpful to understand the different file states associated with archival. A file in a project can assume one of four archival states:
| Archival state | Details |
| --- | --- |
| live | The file is in standard storage, such as AWS S3 or Azure Blob. |
| archival | Archival has been requested for this copy of the file, but other copies of the same file are still in the live state in other projects with the same billTo entity. The file remains in standard storage. |
| archived | The file is in archival storage, such as AWS S3 Glacier or Azure Blob ARCHIVE. |
| unarchiving | Unarchival has been requested for the file. The file is in transition from archival storage to standard storage. |
A file's archival state determines which operations can be performed on it. See the table below for the operations available in each state.
| Archival state | Download | Clone | Compute | Archive | Unarchive |
| --- | --- | --- | --- | --- | --- |
| live | Yes | Yes | Yes | Yes | No |
| archival | No | Yes* | No | No | Yes (Cancel archive) |
| archived | No | Yes | No | No | Yes |
| unarchiving | No | No | No | No | No |
* The clone operation will fail if the object is actively transitioning from archival to archived.
When the project-xxxx/archive API is called on a file object, the file transitions from the live state to the archival state. Only when all copies of the file in all projects with the same billTo organization are in the archival state does the platform automatically transition the file to the archived state.
Likewise, when the project-xxxx/unarchive API is called on a file in the archived state, the file transitions from archived to unarchiving. In the unarchiving state, the file is being restored by the third-party storage platform (e.g., AWS or Azure). This process may take a while, depending on the retrieval option selected for the specific platform. When the unarchival process completes and the file becomes available in standard storage, the file transitions to the live state.
The File-based Archive Service allows users who have CONTRIBUTE or ADMINISTER permissions on a project to archive or unarchive files that reside in the project. Via API calls, users can archive or unarchive files, folders, or entire projects, although the archival process itself happens at the file level. The API accepts a list of up to 1000 files for archival or unarchival. When archiving or unarchiving folders or projects, the API by default archives or unarchives all files at the root level and those in subfolders, recursively. If you archive a folder or project that includes files in different states, the Service will only archive files that are in the live state and skip files in other states. Likewise, if you unarchive a folder or project that includes files in different states, the Service will only unarchive files that are in the archived state, transition archival files back to the live state, and skip files in other states.
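As a hedged illustration of these calls, the request body is plain JSON (all project and file IDs below are placeholders):

```shell
# Placeholder IDs throughout. With the dx toolkit installed, the calls would be:
#
#   dx api project-xxxx archive   '{"files": ["file-aaaa", "file-bbbb"]}'
#   dx api project-xxxx archive   '{"folder": "/results"}'    # recurses into subfolders
#   dx api project-xxxx unarchive '{"folder": "/results"}'
#
# The request body is plain JSON, e.g. a list of up to 1000 file IDs:
body='{"files": ["file-aaaa", "file-bbbb"]}'
echo "$body" | python3 -m json.tool
```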
All fees associated with archiving a file are billed to the billTo organization of the project. There are several charges associated with archival:
Standard storage charge: The monthly charge for files located in standard storage on the platform. Files in the live and archival states incur this charge. The archival state indicates that the file is waiting to be archived, or that other copies of the same file in other projects are still in the live state, so the file remains in standard storage (such as AWS S3). The standard storage charge continues to accrue until all copies of the file have been requested for archival and the file is moved to archival storage, transitioning into the archived state.
Archival storage charge: The monthly charge for files located in archival storage on the platform. Files in the archived state incur this charge.
Retrieval fee: A one-time charge, applied at the time of unarchival, based on the data volume being unarchived. Retrieval fees for third-party services can be found at:
Amazon AWS: https://aws.amazon.com/glacier/pricing
Microsoft Azure: https://azure.microsoft.com/en-us/pricing/details/storage/blobs
Early retrieval fee: Because the Archive Service is designed for long-term storage of infrequently used data, a fee applies to data retrieved before the minimum storage period has been met. For AWS regions this period is 90 days; for Microsoft Azure regions it is 180 days. Data unarchived before the minimum period has elapsed incur a pro-rated early retrieval charge, equal to the archival storage charge for the remaining days.
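As a hedged, illustrative calculation (file size and per-GB rate below are made-up numbers; consult the pricing pages above for real figures): a 100 GB file in an AWS region retrieved after 30 days has 60 days remaining of the 90-day minimum, so the early retrieval fee equals the archival storage charge for those 60 days.

```shell
# Illustrative numbers only: AWS minimum storage period is 90 days.
minimum_days=90
days_archived=30
remaining_days=$((minimum_days - days_archived))
echo "$remaining_days"    # 60 days of archival charges still owed

# Hypothetical rate of $0.004 per GB-month for a 100 GB file:
# fee = size_gb * monthly_rate / 30 * remaining_days
awk -v d="$remaining_days" 'BEGIN { printf "%.2f\n", 100 * 0.004 / 30 * d }'   # 0.80
```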
When using the Archive Service, we recommend the following best practices.
The Archive Service does not work on sponsored projects. If you want to archive files within a sponsored project, then you must move files into a different project or end the project sponsorship before archival.
If a file is shared in multiple projects, archiving the copy in one of the projects will only transition the file into the archival state, which still incurs the standard storage cost. To achieve the lower archival storage cost, you must ensure that all copies of the file in all projects with the same billTo org are archived. When all copies of the file transition into the archival state, the Service automatically transitions the file from the archival state to the archived state. We recommend using the allCopies option of the API to force archiving of all copies of the file. You must be an org ADMIN of the billTo org of the current project to use the allCopies option.
Refer to the following example: file-xxxx has copies in project-xxxx, project-yyyy, and project-zzzz, all of which share the same billTo org (org-xxxx). You have ADMINISTER access to project-xxxx and CONTRIBUTE access to project-yyyy, but no role in project-zzzz. You are an org ADMIN of the project's billTo org, and you try to archive all copies of the file in all projects with the same billTo org using /project-xxxx/archive. The platform will:
1. List all the copies of the file in org-xxxx.
2. Force archiving of all the copies of file-xxxx.
3. All copies of file-xxxx will be archived and transitioned into the archived state.
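A hedged sketch of the forcing call described above (all IDs are placeholders; per the text, this requires org ADMIN rights on the billTo org):

```shell
# Placeholder IDs. With the dx toolkit installed, the call would be:
#
#   dx api project-xxxx archive '{"files": ["file-xxxx"], "allCopies": true}'
#
# The request body adds the allCopies flag to a normal archive request:
body='{"files": ["file-xxxx"], "allCopies": true}'
echo "$body" | python3 -m json.tool
```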
Learn to use the Symlinks feature to access, work with, and modify files that are stored on an external cloud service.
The DNAnexus Symlinks feature enables users to link external data files on AWS S3 and Azure blob storage as objects on the platform, and to access such objects for any usage as though they were native DNAnexus file objects.
No storage costs are incurred when using symlinked files on the Platform. When used by jobs, symlinked files are downloaded to the Platform at runtime.
Symlinked files stored in AWS S3 or Azure blob storage are made accessible on DNAnexus via a Symlink Drive. The drive contains the necessary cloud storage credentials, and can be created by following Step 1 below.
Symlink Drives are set up via the CLI. To create one, you will need to provide:
A name for the Symlink Drive
The cloud service (AWS or Azure) where your files are stored
The access credentials required by the service
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "aws",
"credentials" : {
"accessKeyId" : "<my_aws_access_key>",
"secretAccessKey" : "<my_aws_secret_access_key>"
}
}'
dx api drive new '{
"name" : "<drive_name>",
"cloud" : "azure",
"credentials" : {
"account" : "<my_azure_storage_account_name>",
"key" : "<my_azure_storage_access_key>"
}
}'
After you've entered the appropriate command, a new drive object will be created. You'll see a confirmation message that includes the id of the new Symlink Drive, in the format drive-xxxx.
By associating a DNAnexus Platform project with a Symlink Drive, you can both:
Have all new project files automatically uploaded to the AWS S3 bucket or Azure blob to which the Drive links
Enable project members to work with those files
Note that "new project files" includes all of the following:
Newly created files
File outputs from jobs
Files uploaded to the project
Note that non-symlinked files cloned into a symlinked project will not be uploaded to the linked AWS S3 bucket or Azure blob.
When creating a new project via the UI, you can link it with an existing Symlink Drive by toggling the Enable Auto-Symlink in This Project setting to "On":
Next:
In the Symlink Drive field, select the drive with which the project should be linked
In the Container field, enter the name of the AWS S3 bucket or Azure blob where newly created files should be stored
Optionally, in the Prefix field, enter the name of a folder within the AWS S3 bucket or Azure blob where these files should be stored
When creating a new project via the CLI, you can link it to a Symlink Drive by using the optional argument --default-symlink with dx new project. See the dx new project documentation for details on inputs and input format.
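A hedged sketch of such a call (the project name, drive ID, and container values are placeholders, and the JSON field names are an assumption here; confirm them against the dx new project documentation):

```shell
# Placeholder values. With the dx toolkit installed, the call would be:
#
#   dx new project MySymlinkProject --default-symlink "$symlink"
#
# where the argument names the drive, the bucket/blob container, and an
# optional folder prefix:
symlink='{"drive": "drive-xxxx", "container": "my-s3-bucket", "prefix": "/uploads"}'
echo "$symlink" | python3 -m json.tool
```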
In order to ensure that files can be saved to your AWS S3 bucket or Azure blob, you must enable CORS for that remote storage container.
See this AWS documentation for guidance in enabling CORS for an S3 bucket.
Use the following JSON object when configuring CORS for the bucket:
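A minimal S3 CORS policy of the general shape this step calls for looks like the following (a hedged sketch: the exact methods, headers, and origins DNAnexus requires may be stricter, so treat the wildcard origin as a placeholder to narrow for production use):

```json
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["ETag"]
    }
]
```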
See this documentation for general guidance on enabling CORS for an Azure blob.
Working with Symlinked files is largely the same as working with files that are stored on the Platform. These files can, for example, be used as inputs to apps, applets, or workflows.
If you rename a symlink on DNAnexus, this does not change the name of the file in S3 or Azure blob storage. Note that in this example, the symlink has been renamed from the original name file.txt to Example File. The remote filename, as shown in the Remote Path field in the right-side info pane, remains file.txt.
If you delete a symlink on the Platform, the file to which it points is not deleted.
If your cloud access credentials change, you must update the definition of each Symlink Drive to continue using the files to which it provides access.
To update a drive definition with new AWS access credentials, use the following command:
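A hedged sketch, assuming the drive object accepts an update call with a replacement credentials object (drive-xxxx and the key values are placeholders):

```shell
# Placeholder drive ID and keys. With the dx toolkit installed:
#
#   dx api drive-xxxx update "$creds"
#
# The body carries the new AWS credentials, mirroring the fields used
# when the drive was created:
creds='{"credentials": {"accessKeyId": "<new_key_id>", "secretAccessKey": "<new_secret>"}}'
echo "$creds" | python3 -m json.tool
```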
To update a drive definition with new Azure access credentials, use the following command:
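Analogously for Azure, a hedged sketch (again assuming an update call on the drive object; the drive ID and values are placeholders):

```shell
# Placeholder drive ID and keys. With the dx toolkit installed:
#
#   dx api drive-xxxx update "$creds"
#
# The body carries the new Azure storage account credentials, mirroring
# the fields used when the drive was created:
creds='{"credentials": {"account": "<storage_account_name>", "key": "<new_access_key>"}}'
echo "$creds" | python3 -m json.tool
```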
For more information, see this detailed guide to working with Symlink Drives.
No, the symlinked file will only move within the project. The change will not be mirrored in the linked S3 or Azure blob container.
The job will fail after it is unable to retrieve the source file.
Yes, you can copy a symlinked file from one project to another. This includes copying symlinked files from a symlink-enabled project to a project without this feature enabled.
Yes - egress charges will be incurred.
In this scenario, the uploaded file will overwrite, or "clobber," the file that shares its name, and only the newly uploaded file will be stored in the AWS S3 bucket or Azure blob.
This is true even if, within your project, you first renamed the symlinked file and then uploaded a new file with the prior name. For example, if you upload a file named file.txt to your DNAnexus project, the file will be automatically uploaded to the specified directory in your S3 or Azure blob storage. If you then rename the file on DNAnexus from file.txt to file.old.txt and upload a new file named file.txt to the project, the original file.txt that was uploaded to S3 or Azure blob storage will be overwritten. However, you will still be left with both file.txt and file.old.txt symlinks in your DNAnexus project. Trying to access the original file.old.txt symlink will likely result in a checksum error.
If the auto-symlink feature has been enabled for a project, billing responsibility for the project cannot be transferred. Attempting to do so via API call will return a PermissionDenied error.