Somatic Small Variant and CNV Discovery Workflow Walkthrough
Learn how to use this workflow to detect somatic small variants and CNVs.
Copyright 2024 DNAnexus
The Somatic Small Variant and CNV Discovery Workflow, a global Workflow Description Language (WDL) workflow on DNAnexus, detects somatic small variants and copy number variations (CNVs) using the tools and processing steps described in GATK’s Best Practices for Somatic small variant discovery and in CNVkit. Starting from a pair of tumor/normal FASTQ files, the workflow outputs a set of somatic variants that can be used for further downstream analysis (e.g. investigating variant association with a specific type of cancer). The flowchart below shows a simplified view of all the applications used within the workflow:
The workflow is compatible with somatic files generated from whole genome sequencing (WGS), whole exome sequencing (WES), and targeted next-generation sequencing panels (coverage of a specific set of variants or regions of interest). The workflow also supports variant filtering based on allele frequency, contamination, and orientation bias.
This workflow uses several input files, some of which need to be prepared separately before running the workflow. The apps used to prepare the input files can be run from the user interface (UI) or the command-line interface (CLI).
The BWA-MEM genome index can be generated using the BWA FASTA Indexer application on the platform.
The copy number reference profile can be built using the CNVkit application on the platform (additional instructions described in the Helpful Tips section).
The panel of normals (PON) is a VCF file of sites observed in normal samples. The file can be created using the GATK Somatic Panel of Normals Builder application on the platform prior to running this workflow. Public GATK panels of normals can be used in the absence of a custom PON (additional information described in the Helpful Tips section).
The GATK resource bundle page describes the standard files used for working with human resequencing data with GATK. Additionally, the following commonly used reference files are available in public projects on the DNAnexus platform:
Location: Project “Reference Genome Files”, Directory “gatk.resources.b37” or “gatk.resources.GRCh38”
Available Resource:
• Common germline variant sites VCF
• Germline population VCF
• Known variants
• Panel of Normals
• Panel of Normals Index

Location: Project “Reference Genome Files”, Directory “H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)” or “H. Sapiens - GRCh38 with alt contigs - hs38DH”
Available Resource:
• BWA reference genome index
• Reference sequence
• Reference sequence dictionary
Instructions on how to use these files as inputs to the workflow are described in the next section.
The workflow detailed in this tutorial can be found in the Tools Library section of the UI on the platform, which is accessible by clicking the Tools tab in the top left menu of the screen. Filter for “globalworkflow” under the Any Type filter and select “Somatic Small Variant and CNV Discovery.” To find the workflow by name instead, use the Any Name filter.
Some reference-genome-related input files, such as the BWA reference genome index, are available in public projects such as “Reference Genome Files” and can be selected as inputs under “Suggested Items” in the top left corner.
Command Line Interface
Below are the commands to run this analysis from the CLI using dx-toolkit. The workflow is deployed under a different name in each region; the examples below use the workflow from the AWS US (East) region. The corresponding workflow name for each region can be found in the table under the Helpful Tips section.
If using reference data available in the public “Reference Genome Files” project, running the workflow in interactive mode will allow for selection of the relevant file.
The Somatic Small Variant and CNV Discovery Workflow can also be run non-interactively if file IDs are already known.
Example:
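A minimal sketch of a non-interactive launch, under stated assumptions. The US (East) workflow name is not listed in the region table in this walkthrough, so the sketch uses the Frankfurt workflow ID from that table instead. The input field names (tumor_fastq_r1 and so on), the file-xxxx and project-xxxx placeholders, and the run_or_print helper with its SUBMIT switch are hypothetical and only illustrate the shape of the command; running dx run with -h on the workflow lists its actual input names.

```shell
#!/bin/sh
# Workflow ID for AWS Europe - Frankfurt, taken from the region table in this
# walkthrough; use the entry that matches the region of your project.
WORKFLOW="globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7"

# Hypothetical helper: print the command for review and execute it only
# when SUBMIT=1 is set in the environment.
run_or_print() {
  printf '%s\n' "$*"
  if [ "${SUBMIT:-0}" = "1" ]; then
    "$@"
  fi
}

# ASSUMPTION: every input field name and every file/project ID below is a
# placeholder. Check the real input names with:  dx run "$WORKFLOW" -h
run_or_print dx run "$WORKFLOW" \
  -itumor_fastq_r1="file-xxxx" \
  -itumor_fastq_r2="file-xxxx" \
  -inormal_fastq_r1="file-xxxx" \
  -inormal_fastq_r2="file-xxxx" \
  -ibwa_genome_index="file-xxxx" \
  --destination "project-xxxx:/somatic_results" \
  -y
```

Here -i supplies one input per option, --destination sets the output project and folder, and -y skips the confirmation prompt. Leaving out the -i options and -y in a terminal session makes dx run prompt for each input instead, which corresponds to the interactive mode described above.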
Depending on what region the execution project is in, the Somatic Small Variant and CNV Discovery Workflow will have a different name and ID:
Region | Workflow Name | Workflow ID
AWS Asia Pacific - Sydney | somatic_small_variant_and_cnv_discovery_sydney | globalworkflow-GGy4kfj5f18KQf524kJ1V4QP
AWS Europe - Frankfurt | somatic_small_variant_and_cnv_discovery_frankfurt | globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7
AWS Europe - London | somatic_small_variant_and_cnv_discovery_london_g | globalworkflow-GGy4K0BKQ3Q8YBYF19gxP2Xj
Azure Amsterdam | somatic_small_variant_and_cnv_discovery_azure_eu | globalworkflow-GGy5038BQX5PQ8PK6pggk4bg
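For scripts that run in projects across several regions, the table above can be encoded as a small lookup. This is only a sketch: the function name workflow_name_for_region is hypothetical, the region labels are the ones used in the table rather than platform region identifiers, and only the four regions listed here are covered.

```shell
#!/bin/sh
# Map a region label from the table to the workflow name deployed there.
# Returns a non-zero status for regions that are not listed in the table.
workflow_name_for_region() {
  case "$1" in
    "AWS Asia Pacific - Sydney")
      echo "somatic_small_variant_and_cnv_discovery_sydney" ;;
    "AWS Europe - Frankfurt")
      echo "somatic_small_variant_and_cnv_discovery_frankfurt" ;;
    "AWS Europe - London")
      echo "somatic_small_variant_and_cnv_discovery_london_g" ;;
    "Azure Amsterdam")
      echo "somatic_small_variant_and_cnv_discovery_azure_eu" ;;
    *)
      echo "no workflow name listed for region: $1" >&2
      return 1 ;;
  esac
}

# Example: resolve the name for a project hosted in London.
workflow_name_for_region "AWS Europe - London"
```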
GATK Best Practices for small variant discovery advise creating the PON by first running the variant caller, Mutect2, individually on a set of normal samples, and then combining the resulting variant calls using the desired criteria (e.g. excluding any sites that are not present in at least two normals). The result is a sites-only VCF file that can be reused as a PON for subsequent processing, again with Mutect2.
GATK Best Practices also suggest that a PON helps Mutect2 detect additional complicated sites in sequencing data, namely technical artifacts that may arise from sequencing, data processing, and/or mapping.
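The "present in at least two normals" criterion mentioned above can be illustrated on toy data. The sketch below is not how the GATK Somatic Panel of Normals Builder is implemented; it only shows the counting logic, using made-up calls in a simplified "sample chrom pos ref alt" layout rather than real VCF files.

```shell
#!/bin/sh
# Toy input: one line per call as "normal_sample chrom pos ref alt".
# All values are invented for illustration.
calls='normal1 chr1 1000 A T
normal1 chr1 2000 G C
normal2 chr1 1000 A T
normal3 chr1 1000 A T
normal3 chr2 500 C G
normal2 chr1 2000 G C
normal1 chr3 42 T A
normal1 chr3 42 T A'

# Minimum number of distinct normal samples that must carry a site.
MIN_NORMALS=2

# Count each site once per sample, then keep sites seen in >= MIN_NORMALS samples.
pon_sites=$(printf '%s\n' "$calls" | awk -v min="$MIN_NORMALS" '
  {
    site = $2 "\t" $3 "\t" $4 "\t" $5
    key = $1 SUBSEP site
    if (!(key in seen)) { seen[key] = 1; count[site]++ }
  }
  END { for (s in count) if (count[s] >= min) print s }
' | sort -k1,1 -k2,2n)

# Sites-only result: chrom, pos, ref, alt (tab-separated).
printf '%s\n' "$pon_sites"
```

With this toy input, the chr1:1000 A>T site (three normals) and the chr1:2000 G>C site (two normals) are kept. The chr2 site and the chr3 site are dropped because each occurs in only one normal; the duplicated chr3 line counts once.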
The CNVkit application on the DNAnexus platform can be used separately to construct a new copy number reference profile. To build one, run the application with normal sample BAM files, a reference FASTA file, and a baited (tiled, targeted) genomic regions file in BED or GATK/Picard-style interval list format. The output is a .cnn file that can be used as input to this workflow. For example, using the CLI and dx-toolkit:
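A hedged sketch of what such a launch could look like. The app identifier app-xxxx, the input field names (normal_bams, reference_fasta, targets), the file IDs, and the run_or_print helper with its SUBMIT switch are all placeholders assumed for illustration; the CNVkit app's entry in the Tools Library, or dx run with -h on the app, gives the actual names.

```shell
#!/bin/sh
# ASSUMPTION: placeholder app identifier; replace it with the CNVkit app's
# actual name or ID from the Tools Library.
CNVKIT_APP="app-xxxx"

# Hypothetical helper: print the command for review and execute it only
# when SUBMIT=1 is set in the environment.
run_or_print() {
  printf '%s\n' "$*"
  if [ "${SUBMIT:-0}" = "1" ]; then
    "$@"
  fi
}

# ASSUMPTION: input field names and file IDs are placeholders.
# Repeating -i with the same field name passes several files to an array input.
run_or_print dx run "$CNVKIT_APP" \
  -inormal_bams="file-xxxx" \
  -inormal_bams="file-yyyy" \
  -ireference_fasta="file-xxxx" \
  -itargets="file-xxxx" \
  -y
```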
If a copy number reference profile is not provided as an input, this workflow will build the .cnn file using the normal samples. The .cnn file will be one of the output files of the workflow.
If a copy number reference profile from a previous CNVkit analysis (with the same normal samples) is available, it may be reused for subsequent processing of further tumor samples by using it as an input. File reuse will likely save time and cost as the workflow will not need to build the reference profile each time from the same set of normal samples.
The workflow gives users the option to perform Base Quality Score Recalibration (BQSR). Although GATK’s Best Practices suggest performing BQSR, omitting this step can save time and resources. For data generated on recent sequencers (after 2015), this step can be omitted.
The Somatic Small Variant and CNV Discovery Workflow can also be run on large-scale datasets, processing multiple tumor/normal pairs simultaneously. See Running Batch Jobs.
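As a rough sketch of a batch launch with dx-toolkit, the two steps below first generate a batch file that pairs inputs by a shared name prefix and then submit one workflow run per row. The input field names, the file naming scheme (for example PAIR01_tumor_R1.fastq.gz), the file-xxxx placeholder, and the run_or_print helper are assumptions for illustration; the batch file name is the one dx generate_batch_inputs typically writes by default.

```shell
#!/bin/sh
# Workflow ID for AWS Europe - Frankfurt, from the region table in this
# walkthrough; use the entry that matches the region of your project.
WORKFLOW="globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7"
BATCH_TSV="dx_batch.0000.tsv"

# Hypothetical helper: print the command for review and execute it only
# when SUBMIT=1 is set in the environment.
run_or_print() {
  printf '%s\n' "$*"
  if [ "${SUBMIT:-0}" = "1" ]; then
    "$@"
  fi
}

# Step 1: build the batch file. The captured group (.*) is the pair ID that
# ties the files of one tumor/normal pair together.
# ASSUMPTION: input field names and file naming scheme are placeholders.
run_or_print dx generate_batch_inputs \
  -itumor_fastq_r1='(.*)_tumor_R1\.fastq\.gz' \
  -itumor_fastq_r2='(.*)_tumor_R2\.fastq\.gz' \
  -inormal_fastq_r1='(.*)_normal_R1\.fastq\.gz' \
  -inormal_fastq_r2='(.*)_normal_R2\.fastq\.gz'

# Step 2: launch one run per row of the batch file; inputs shared by all
# pairs (here a placeholder BWA index) are passed with -i as usual.
run_or_print dx run "$WORKFLOW" \
  --batch-tsv "$BATCH_TSV" \
  -ibwa_genome_index="file-xxxx" \
  -y
```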