DNAnexus Documentation

Copyright 2025 DNAnexus

Somatic Small Variant and CNV Discovery Workflow Walkthrough

Learn how to use this workflow to detect somatic small variants and CNVs.

Last updated 2 days ago

The Somatic Small Variant and CNV Discovery Workflow, a global Workflow Description Language (WDL) workflow on DNAnexus, enables detection of somatic small variants and copy number variations (CNVs) using tools and processing steps described in GATK’s Best Practices for Somatic small variant discovery and CNVkit. Starting with a pair of tumor/normal FASTQ files as input, the workflow outputs a set of somatic variants that may be used for further downstream analysis (e.g. investigating variant association with a specific type of cancer).

[Figure: flowchart showing a simplified view of all the applications used within the workflow]

The workflow is compatible with files generated from whole genome sequencing (WGS), whole exome sequencing (WES), and targeted next-generation sequencing panels (covering a specific set of variants or regions of interest). The workflow also allows variant filtering based on allele frequency, contamination, and orientation bias.

Preparing Input Files

This workflow uses several input files, some of which must be prepared separately before running the workflow. The apps used to prepare the input files can be run from the user interface (UI) or the command-line interface (CLI).

BWA Reference and Indexes

CNV References

Panel of Normal

Resource Bundles

| Location | Available Resource |
| --- | --- |
| Project: “Reference Genome Files”; Directory: “gatk.resources.b37” or “gatk.resources.GRCh38” | Common germline variant sites VCF; Germline population VCF; Known variants; Panel of Normals; Panel of Normals Index |
| Project: “Reference Genome Files”; Directory: “H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)” or “H. Sapiens - GRCh38 with alt contigs - hs38DH” | BWA reference genome index; Reference sequence; Reference sequence dictionary |

Instructions on how to use these files as inputs to the workflow are described in the next section.

Launching the Workflow

Launching From the UI

Some reference-genome-related input files, such as the BWA reference genome index, are available in public projects such as “Reference Genome Files” and can be selected as inputs under “Suggested Items” in the top left corner.

Launching From the CLI

The Somatic Small Variant and CNV Discovery Workflow can also be run non-interactively if file IDs are already known.

Example:

dx run somatic_small_variant_and_cnv_discovery \
  -icommon_variant_sites_vcfgz=file-GFz5xgQ0Bv264YzX4p6P8331 \
  -igenomeindex_targz=file-FFJPKp0034KY8f20F6V9yYkk \
  -igermline_resource_vcfgz=file-GFz5y7Q0v822QxJy4q7kZ3x7 \
  -inormal_reads_fastqgzs=[normal_reads_1.fastq.gz] \
  -ipanel_of_normals_vcfgz=file-GGQ6X5Q0j92jq3XXJYJv30g6 \
  -ipanel_of_normals_vcfgz_tbi=file-GGQ6X7Q0Jj589z6XFy1J2KP8 \
  -ireference_dict=file-GFz5xf00Bqx2j79G4q4F5jXV \
  -ireference_faigz=file-FFJx1P80XJyP87xzF632jqqQ \
  -ireference_fastagz=file-FF2vqv007JZyg5vFFBYb0gJZ \
  -itumor_reads_fastqgzs=[tumor_reads_1.fastq.gz] \
  -icn_reference_profile_cnngz=refprofile.cnn \
  -ifilter_contamination=true \
  -ifilter_orientation_bias=false \
  -iinterval_list=interval.bed \
  -iknown_variants_vcfgzs=file-GFz5xgj0K5zFb72j4pkGF768 \
  -imin_allele_fraction=0.05 \
  -imin_reads_required=0 \
  -imutect_memory_per_process_gb=10 \
  -imutect_scatter_worker_ratio=2 \
  -inormal_reads2_fastqgzs=[normal_reads_2.fastq.gz] \
  -ioutput_prefix='output' \
  -iperform_bqsr=true \
  -irg_info_csvgz=readgroup.csv.gz \
  -itumor_reads2_fastqgzs=[tumor_reads_2.fastq.gz]
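
Because the non-interactive form depends on file IDs being known, it can help to check which inputs are IDs and which are plain names before launching. The sketch below is only an illustration: `is_file_id` and `report_inputs` are hypothetical helpers, and the ID pattern (`file-` followed by 24 alphanumeric characters) is an assumption inferred from the IDs in the example above. It never calls `dx`.

```shell
# is_file_id VALUE: succeed if VALUE looks like a DNAnexus file ID.
# Assumption: "file-" followed by 24 alphanumeric characters,
# matching the IDs used in the example command.
is_file_id() {
  printf '%s\n' "$1" | grep -Eq '^file-[0-9A-Za-z]{24}$'
}

# report_inputs NAME=VALUE...: label each input as a file ID or something else.
report_inputs() {
  for kv in "$@"; do
    name=${kv%%=*}     # text before the first "="
    value=${kv#*=}     # text after the first "="
    if is_file_id "$value"; then
      printf '%s\tfile ID\n' "$name"
    else
      printf '%s\tname or literal value\n' "$name"
    fi
  done
}

# Two inputs taken from the example command.
report_inputs \
  "reference_dict=file-GFz5xf00Bqx2j79G4q4F5jXV" \
  "interval_list=interval.bed"
```

Inputs reported as names, such as `interval.bed`, are not pinned to a specific file ID, so it is worth confirming they refer to the intended files before running.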

Helpful Tips

Workflow Names by Region

Depending on what region the execution project is in, the Somatic Small Variant and CNV Discovery Workflow will have a different name and ID:

| Region | Workflow Name | Workflow ID |
| --- | --- | --- |
| AWS US (East) | somatic_small_variant_and_cnv_discovery | globalworkflow-GGy3fyj0XbybQxzV4gy8V085 |
| AWS Asia Pacific - Sydney | somatic_small_variant_and_cnv_discovery_sydney | globalworkflow-GGy4kfj5f18KQf524kJ1V4QP |
| AWS Europe - Frankfurt | somatic_small_variant_and_cnv_discovery_frankfurt | globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7 |
| AWS Europe - London | somatic_small_variant_and_cnv_discovery_london_g | globalworkflow-GGy4K0BKQ3Q8YBYF19gxP2Xj |
| Azure Amsterdam | somatic_small_variant_and_cnv_discovery_azure_eu | globalworkflow-GGy5038BQX5PQ8PK6pggk4bg |
| Azure US | somatic_small_variant_and_cnv_discovery_azure_us | globalworkflow-GGy4x009x22GgqJ34gb6YJJf |
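
When scripting launches across regions, the name lookup from the table can be captured in a small function. This is a minimal sketch: `workflow_name_for_region` is a hypothetical helper keyed on the region labels exactly as they appear in the table, and translating the platform's own region identifiers into those labels is left out.

```shell
# workflow_name_for_region REGION_LABEL: print the workflow name for a region.
# Labels and names are copied from the table; unknown labels return status 1.
workflow_name_for_region() {
  case "$1" in
    "AWS US (East)")             echo "somatic_small_variant_and_cnv_discovery" ;;
    "AWS Asia Pacific - Sydney") echo "somatic_small_variant_and_cnv_discovery_sydney" ;;
    "AWS Europe - Frankfurt")    echo "somatic_small_variant_and_cnv_discovery_frankfurt" ;;
    "AWS Europe - London")       echo "somatic_small_variant_and_cnv_discovery_london_g" ;;
    "Azure Amsterdam")           echo "somatic_small_variant_and_cnv_discovery_azure_eu" ;;
    "Azure US")                  echo "somatic_small_variant_and_cnv_discovery_azure_us" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Example: build the first part of a launch command for Frankfurt.
echo "dx run $(workflow_name_for_region "AWS Europe - Frankfurt")"
```

Failing on an unknown label, rather than falling back to the AWS US (East) name, avoids silently launching a workflow that does not match the project's region.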

Panel of Normal (PON)

GATK Best Practices suggest that a PON helps Mutect2 detect additional problematic sites in sequencing data: technical artifacts that may arise from sequencing, data processing, and/or mapping.
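
A PON is typically assembled by combining per-normal calls under a criterion such as "keep only sites present in at least two normals". The toy sketch below illustrates that criterion alone. It is not how the GATK Somatic Panel of Normals Builder app works internally: `sites_in_min_normals` is a hypothetical helper, and it reads plain-text site keys (one per line, in a made-up `chrom:pos:ref:alt` form) rather than VCF files.

```shell
# sites_in_min_normals MIN FILE...: print site keys present in at least MIN files.
# Each FILE lists the sites called in one normal sample, one key per line
# (hypothetical plain-text format such as chr1:100:A:T, not a VCF).
sites_in_min_normals() (
  min="$1"; shift
  for f in "$@"; do
    sort -u "$f"            # count each site at most once per normal
  done | sort | uniq -c | awk -v min="$min" '$1 >= min { print $2 }'
)

# Demo with three tiny made-up normals.
demo_dir=$(mktemp -d)
printf 'chr1:100:A:T\nchr2:200:G:C\n' > "$demo_dir/normal1.txt"
printf 'chr1:100:A:T\nchr3:300:T:G\n' > "$demo_dir/normal2.txt"
printf 'chr3:300:T:G\n'               > "$demo_dir/normal3.txt"
sites_in_min_normals 2 "$demo_dir"/normal*.txt
rm -r "$demo_dir"
```

In the demo, `chr1:100:A:T` and `chr3:300:T:G` are printed because each appears in two normals, while `chr2:200:G:C` appears in only one and is dropped.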

Using a Copy Number Reference Profile

dx run app-cnvkit_batch \
  -inormal_bams=[normal.bam] \
  -ibaits=interval.bed \
  -ifasta=grch38.fa.gz

If a copy number reference profile is not provided as an input, this workflow will build the .cnn file using the normal samples. The .cnn will be one of the output files of the workflow.

If a copy number reference profile from a previous CNVkit analysis (with the same normal samples) is available, it may be reused for subsequent processing of further tumor samples by using it as an input. File reuse will likely save time and cost as the workflow will not need to build the reference profile each time from the same set of normal samples.
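
In a launch script, reusing a profile amounts to adding one optional input. The sketch below shows one way to express that, assuming a hypothetical helper `cnn_args`; the input name `cn_reference_profile_cnngz` and the placeholder file name `refprofile.cnn` come from the CLI example in this guide. It only prints text and does not call `dx`.

```shell
# cnn_args [CNN_FILE]: print the extra workflow input only when a profile is given.
cnn_args() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "-icn_reference_profile_cnngz=$1"
  fi
}

# With a profile from an earlier run (placeholder file name from the example):
echo "extra inputs with profile:    $(cnn_args refprofile.cnn)"
# Without one, nothing is added and the workflow builds the .cnn itself:
echo "extra inputs without profile: $(cnn_args)"
```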

Base Quality Score Recalibration

The workflow gives users the option to perform Base Quality Score Recalibration (BQSR). Although GATK’s Best Practices suggest performing BQSR, omitting this step can save time and resources. For data from the latest sequencers (generated after 2015), this step can be omitted.

Large Scale Analysis

The Somatic Small Variant and CNV Discovery Workflow can be run on large-scale datasets by executing it simultaneously on multiple tumor/normal pairs. See Running Batch Jobs.

The BWA-MEM genome index can be generated using the BWA FASTA Indexer application on the platform.

The copy number reference profile can be built using the CNVkit application on the platform (additional instructions are in the Helpful Tips section).

The panel of normals (PON) is a VCF file of sites observed in normal samples. The file can be created using the GATK Somatic Panel of Normals Builder application on the platform prior to running this workflow. Public GATK panels of normals can be used in the absence of a custom PON (additional information is in the Helpful Tips section).

The GATK resource bundle page describes the standard files for working with human resequencing data with GATK. Additionally, commonly used reference files are provided in public projects on the DNAnexus platform; they are listed under Resource Bundles.

The workflow detailed in this tutorial can be found in the Tools Library section of the platform UI, accessible by clicking the Tools tab in the top left menu. Filter for “globalworkflow” under the Any Type filter and select “Somatic Small Variant and CNV Discovery.” To find the workflow by name instead, search for “Somatic Small Variant and CNV Discovery” using the Any Name filter.

The workflow can be run from the CLI using dx-toolkit. It is deployed under a different name in each region; the example under Launching From the CLI uses the workflow from the AWS US (East) region. The corresponding workflow name for each region is listed in the table in the Helpful Tips section.

If using reference data available in the public “Reference Genome Files” project, running the workflow in interactive mode allows selection of the relevant file.

GATK Best Practices for small variant discovery advises creating the PON by first running the variant caller, Mutect2, individually on a set of normal samples, and then combining the resulting variant calls using the desired criteria (e.g. excluding any sites that are not present in at least two normals). The result is a sites-only VCF file that may be reused as a PON for subsequent processing, again with Mutect2.

The CNVkit application on the DNAnexus platform may be used separately to construct a new copy number reference profile. To build one, run the application with normal sample BAM files, a reference FASTA file, and a baited (tiled, targeted) genomic regions file in BED or GATK/Picard-style interval list format. The output is a .cnn file that can be used as input to this workflow. An example using the CLI and dx-toolkit appears under Using a Copy Number Reference Profile.
The Somatic Small Variant and CNV Discovery workflow is region-specific, so select the workflow matching your account region.
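
For runs over many tumor/normal pairs, one simple pattern is to generate one launch command per pair from a sample sheet and review the commands before executing them. The sketch below assumes a hypothetical tab-separated sheet (pair name, tumor R1, tumor R2, normal R1, normal R2, no header) and a hypothetical helper `print_batch_cmds`. The input names and the AWS US (East) workflow name come from the CLI example in this guide, and `-y` is the `dx run` flag that skips the confirmation prompt. The function only prints commands; shared inputs such as reference files are passed in as a second argument.

```shell
# print_batch_cmds SAMPLE_SHEET [COMMON_INPUTS]: print one `dx run` line per pair.
# Assumed sheet format (tab-separated, no header, newline-terminated lines):
#   pair_name  tumor_R1  tumor_R2  normal_R1  normal_R2
print_batch_cmds() {
  common="${2:-}"
  while IFS="$(printf '\t')" read -r pair t1 t2 n1 n2; do
    [ -n "$pair" ] || continue   # skip blank lines
    printf 'dx run somatic_small_variant_and_cnv_discovery -y'
    printf ' -itumor_reads_fastqgzs=%s -itumor_reads2_fastqgzs=%s' "$t1" "$t2"
    printf ' -inormal_reads_fastqgzs=%s -inormal_reads2_fastqgzs=%s' "$n1" "$n2"
    printf ' -ioutput_prefix=%s %s\n' "$pair" "$common"
  done < "$1"
}

# Demo with a made-up two-pair sheet and one shared input.
sheet=$(mktemp)
printf 'pairA\ttA_1.fastq.gz\ttA_2.fastq.gz\tnA_1.fastq.gz\tnA_2.fastq.gz\n' >  "$sheet"
printf 'pairB\ttB_1.fastq.gz\ttB_2.fastq.gz\tnB_1.fastq.gz\tnB_2.fastq.gz\n' >> "$sheet"
print_batch_cmds "$sheet" "-iperform_bqsr=true"
rm "$sheet"
```

For a project in another region, the workflow name in the printed command would need to be replaced with the regional name. The platform's own batch tooling, described in Running Batch Jobs, remains the documented route for large batches.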