# Somatic Small Variant and CNV Discovery Workflow Walkthrough

The Somatic Small Variant and CNV Discovery Workflow, a global Workflow Description Language (WDL) workflow on the DNAnexus Platform, detects somatic small variants and copy number variations (CNVs) using the tools and processing steps described in [Genome Analysis Toolkit (GATK)](https://gatk.broadinstitute.org/hc/en-us)'s [Best Practices for Somatic small variant discovery](https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-) and [CNVkit](https://cnvkit.readthedocs.io/en/stable/). This workflow takes a pair of tumor/normal FASTQ files as input and generates a set of somatic variants suitable for further downstream analysis. These variants can be used for investigating variant association with a specific type of cancer, among other applications. The flowchart below shows a simplified view of all the applications used within the workflow:

![](/files/vDjRqoupHuoZcDSZjpDM)

The workflow processes tumor/normal data generated from whole genome sequencing (WGS), whole exome sequencing (WES), and targeted next-generation sequencing panels (which cover a specific set of variants or a region of interest). This workflow supports variant filtering based on allele frequency, contamination, and orientation bias.

## Preparing Input Files

This workflow uses specific input files that must be prepared separately before execution. The apps used to prepare the input files can be run from the user interface (UI) or the command-line interface (CLI).

### BWA Reference and Indexes

You can generate the **BWA-MEM genome index** using the [BWA FASTA Indexer](https://platform.dnanexus.com/app/bwa_fasta_indexer) application on the platform.

### CNV References

You can build the **copy number reference profile** using the [CNVkit](https://platform.dnanexus.com/panx/tools/run/app-GFZ3kp00QFYyBV0pBBbx968b) application on the platform (see the [Helpful Tips](#using-a-copy-number-reference-profile) section for additional instructions).

### Panel of Normals

The **panel of normals** (PON) is a VCF file of sites observed in normal samples. You can create this file using the [GATK Somatic Panel of Normals Builder](https://platform.dnanexus.com/app/gatk4_somatic_panel_of_normals_builder) application on the platform before running this workflow. Public GATK panels of normals can be used as alternatives to a custom PON (see the [Helpful Tips](#panel-of-normals-pon) section for additional information).

### Resource Bundles

The [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) page describes the standard files used when working with human resequencing data in GATK.

The following commonly used reference files are available in public projects on the DNAnexus Platform:

| Location                                                                                                                                                                             | Available Resource                                                                                                                            |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>Project: Reference Genome Files<br>Directory: <code>gatk.resources.b37</code> or <code>gatk.resources.GRCh38</code></p>                                                           | <p>• Common germline variant sites VCF<br>• Germline population VCF<br>• Known variants<br>• Panel of Normals<br>• Panel of Normals Index</p> |
| <p>Project: Reference Genome Files<br>Directory: <code>H. Sapiens - GRCh37 - hs37d5</code> (1000 Genomes Phase II) or <code>H. Sapiens - GRCh38 with alt contigs - hs38DH</code></p> | <p>• BWA reference genome index<br>• Reference sequence<br>• Reference sequence dictionary</p>                                                |

Instructions on how to use these files as inputs to the workflow are described in the next section.

## Launching the Workflow

### Launching From the UI

Find the workflow detailed in this tutorial in the **Tools Library** section of the UI on the platform, accessible by clicking on the [**Tools**](https://platform.dnanexus.com/panx/tools) tab on the top left menu of the screen. Filter for "GlobalWorkflow" under the **Any Type** filter and select "Somatic Small Variant and CNV Discovery." To search for "Somatic Small Variant and CNV Discovery" by name, use the **Any Name** filter.

![The Somatic Small Variant and CNV Discovery workflow is region-specific, so select the workflow matching your account region.](/files/i8hU1GzvpNSWZ9SvD2oo)

Some reference genome-related input files, such as the BWA reference genome index (2), are available in public projects such as "Reference Genome Files" (1). You can select them as inputs under "Suggested Items" in the top left corner:

![](/files/3qswlcJjX8rulS8c7aX1)

### Launching From the CLI

Below are the commands to run this analysis from the CLI using `dx-toolkit`. The workflow is deployed under a different name in each region; the examples below use the workflow from the AWS US (East) region. The corresponding workflow name for each region is listed in the table under the [**Helpful Tips**](#workflow-names-by-region) section.

If using reference data available in the public "Reference Genome Files" project, running the workflow in [interactive mode](/user/running-apps-and-workflows/running-workflows.md#running-in-interactive-mode) enables selection of the relevant file.

The Somatic Small Variant and CNV Discovery Workflow can also be run non-interactively if file IDs are already known.

Example:

```shell
dx run somatic_small_variant_and_cnv_discovery \
  -icommon_variant_sites_vcfgz=file-GFz5xgQ0Bv264YzX4p6P8331 \
  -igenomeindex_targz=file-FFJPKp0034KY8f20F6V9yYkk \
  -igermline_resource_vcfgz=file-GFz5y7Q0v822QxJy4q7kZ3x7 \
  -inormal_reads_fastqgzs=[normal_reads_1.fastq.gz] \
  -ipanel_of_normals_vcfgz=file-GGQ6X5Q0j92jq3XXJYJv30g6 \
  -ipanel_of_normals_vcfgz_tbi=file-GGQ6X7Q0Jj589z6XFy1J2KP8 \
  -ireference_dict=file-GFz5xf00Bqx2j79G4q4F5jXV \
  -ireference_faigz=file-FFJx1P80XJyP87xzF632jqqQ \
  -ireference_fastagz=file-FF2vqv007JZyg5vFFBYb0gJZ \
  -itumor_reads_fastqgzs=[tumor_reads_1.fastq.gz] \
  -icn_reference_profile_cnngz=refprofile.cnn \
  -ifilter_contamination=true \
  -ifilter_orientation_bias=false \
  -iinterval_list=interval.bed \
  -iknown_variants_vcfgzs=file-GFz5xgj0K5zFb72j4pkGF768 \
  -imin_allele_fraction=0.05 \
  -imin_reads_required=0 \
  -imutect_memory_per_process_gb=10 \
  -imutect_scatter_worker_ratio=2 \
  -inormal_reads2_fastqgzs=[normal_reads_2.fastq.gz] \
  -ioutput_prefix='output' \
  -iperform_bqsr=true \
  -irg_info_csvgz=readgroup.csv.gz \
  -itumor_reads2_fastqgzs=[tumor_reads_2.fastq.gz]
```
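
When a sample is split across several lanes, the array inputs (`tumor_reads_fastqgzs`, `tumor_reads2_fastqgzs`, and their normal counterparts) take more than one file. The sketch below assumes that `dx run` accepts an array input by repeating the same `-i` flag once per element. The helper `array_flags` and the lane file names are hypothetical and used only for illustration. The script prints the resulting command for review instead of launching it, and it omits the other inputs shown in the example above.

```shell
#!/bin/sh
# Build repeated -i flags for an array input: one flag per file.
# Usage: array_flags <input_name> <file>...
array_flags() {
  name=$1
  shift
  for f in "$@"; do
    printf ' -i%s=%s' "$name" "$f"
  done
}

# Hypothetical lane-split FASTQ files for the tumor sample.
tumor_r1=$(array_flags tumor_reads_fastqgzs tumor_L001_R1.fastq.gz tumor_L002_R1.fastq.gz)
tumor_r2=$(array_flags tumor_reads2_fastqgzs tumor_L001_R2.fastq.gz tumor_L002_R2.fastq.gz)

# Print the command for review (dry run); remaining inputs are omitted here.
echo "dx run somatic_small_variant_and_cnv_discovery${tumor_r1}${tumor_r2}"
```

Keeping forward and reverse files in the same lane order in both arrays avoids mismatched read pairs.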

## Helpful Tips

### Workflow Names by Region

The Somatic Small Variant and CNV Discovery Workflow has different names and IDs depending on the execution project region:

| Region                    | Workflow Name                                       | Workflow ID                               | URL                                                                                                                                                   |
| ------------------------- | --------------------------------------------------- | ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| AWS US (East)             | `somatic_small_variant_and_cnv_discovery`           | `globalworkflow-GGy3fyj0XbybQxzV4gy8V085` | [AWS US (East) workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy3fyj0XbybQxzV4gy8V085/execution-settings/select-output-modal) |
| AWS Asia Pacific - Sydney | `somatic_small_variant_and_cnv_discovery_sydney`    | `globalworkflow-GGy4kfj5f18KQf524kJ1V4QP` | [Sydney workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy4kfj5f18KQf524kJ1V4QP/execution-settings/select-output-modal)        |
| AWS Europe - Frankfurt    | `somatic_small_variant_and_cnv_discovery_frankfurt` | `globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7` | [Frankfurt workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7/execution-settings/select-output-modal)     |
| AWS Europe - London       | `somatic_small_variant_and_cnv_discovery_london_g`  | `globalworkflow-GGy4K0BKQ3Q8YBYF19gxP2Xj` | [London workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy4K0BKQ3Q8YBYF19gxP2Xj/execution-settings/select-output-modal)        |
| Azure Amsterdam           | `somatic_small_variant_and_cnv_discovery_azure_eu`  | `globalworkflow-GGy5038BQX5PQ8PK6pggk4bg` | [Amsterdam workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy5038BQX5PQ8PK6pggk4bg/execution-settings/select-output-modal)     |
| Azure US                  | `somatic_small_variant_and_cnv_discovery_azure_us`  | `globalworkflow-GGy4x009x22GgqJ34gb6YJJf` | [Azure US workflow](https://platform.dnanexus.com/panx/tools/run/globalworkflow-GGy4x009x22GgqJ34gb6YJJf/execution-settings/select-output-modal)      |
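
For scripts that run in more than one region, the table above can be encoded as a small lookup. This is a minimal sketch: the function name `workflow_for_region` is hypothetical, and the keys are the region labels exactly as written in the table rather than any platform region identifier. It prints the command for review instead of launching it.

```shell
#!/bin/sh
# Map a region label (as written in the table above) to its workflow name.
workflow_for_region() {
  case "$1" in
    "AWS US (East)")             echo "somatic_small_variant_and_cnv_discovery" ;;
    "AWS Asia Pacific - Sydney") echo "somatic_small_variant_and_cnv_discovery_sydney" ;;
    "AWS Europe - Frankfurt")    echo "somatic_small_variant_and_cnv_discovery_frankfurt" ;;
    "AWS Europe - London")       echo "somatic_small_variant_and_cnv_discovery_london_g" ;;
    "Azure Amsterdam")           echo "somatic_small_variant_and_cnv_discovery_azure_eu" ;;
    "Azure US")                  echo "somatic_small_variant_and_cnv_discovery_azure_us" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

region="AWS Europe - Frankfurt"
workflow=$(workflow_for_region "$region") || exit 1

# Print the command for review (dry run); inputs are omitted here.
echo "dx run $workflow"
```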

### Panel of Normals (PON)

[GATK Best Practices](https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-) for small variant discovery advise creating the PON by first running the variant caller, Mutect2, individually on a set of normal samples, and then combining the resulting variant calls according to your own criteria. For example, you might exclude any sites that are not present in at least two normals. The result is a sites-only VCF file that can be reused as a PON in subsequent Mutect2 runs.

GATK Best Practices also note that a PON helps Mutect2 identify additional problematic sites in sequencing data, namely technical artifacts that may arise from sequencing, data processing, or mapping.
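
To make the "present in at least two normals" criterion concrete, the sketch below counts how many per-normal call sets contain each site and keeps the sites seen in two or more. The three input files, their `CHROM POS REF ALT` layout, and the `pon_sites` helper are hypothetical toy stand-ins for per-normal Mutect2 calls. This only illustrates the counting rule; an actual PON should be built with the GATK Somatic Panel of Normals Builder application, not with `awk`.

```shell
#!/bin/sh
# Toy per-normal call sets (hypothetical data): CHROM POS REF ALT, one site per line.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'chr1 100 A T\nchr1 250 G C\n' > "$workdir/normal1.sites"
printf 'chr1 100 A T\nchr2 300 T G\n' > "$workdir/normal2.sites"
printf 'chr1 100 A T\nchr1 250 G C\n' > "$workdir/normal3.sites"

# Keep sites that appear in at least <min_normals> of the given files.
# Usage: pon_sites <min_normals> <file>...
pon_sites() {
  min_normals=$1
  shift
  awk -v min="$min_normals" '
    !seen[FILENAME, $0]++ { count[$0]++ }   # count each site once per file
    END { for (site in count) if (count[site] >= min) print site }
  ' "$@" | sort -k1,1 -k2,2n
}

# Sites observed in two or more normals.
pon_sites 2 "$workdir"/normal*.sites
```

Raising the threshold makes the panel stricter: with the toy data, only the site shared by all three normals survives a threshold of 3.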

### Using a Copy Number Reference Profile

The [CNVkit](https://platform.dnanexus.com/panx/tools/run/app-GFZ3kp00QFYyBV0pBBbx968b) application on the DNAnexus Platform can be used to construct a new copy number reference profile. To build one, run the application with normal sample BAM files, a reference FASTA file, and a baited (tiled, targeted) genomic regions file in BED or GATK/Picard-style interval list format. The output is a `.cnn` file that can be used as input to this workflow. For example, using the CLI and `dx-toolkit`:

```shell
dx run app-cnvkit_batch \
 -inormal_bams=[normal.bam] \
 -ibaits=interval.bed \
 -ifasta=grch38.fa.gz
```

If a copy number reference profile is not provided as an input, this workflow builds the `.cnn` file from the normal samples. The `.cnn` file is then included among the workflow's output files.

If a copy number reference profile from a previous CNVkit analysis (with the same normal samples) is available, reuse it for subsequent processing of further tumor samples by providing it as an input. File reuse saves time and cost by removing the need to regenerate the reference profile from the same set of normal samples.

### Base Quality Score Recalibration

The workflow includes an option to perform Base Quality Score Recalibration. Though GATK Best Practices suggest performing Base Quality Score Recalibration, omitting this step saves time and resources. For data from modern sequencers (generated after 2015), this step is optional.

### Large Scale Analysis

The Somatic Small Variant and CNV Discovery Workflow supports parallel execution on multiple tumor/normal pairs in large scale datasets. See [Running Batch Jobs](/user/running-apps-and-workflows/running-batch-jobs.md).
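
As a rough illustration of launching many pairs, the sketch below reads a sample sheet and prints one `dx run` command per tumor/normal pair. The sample sheet layout, the file names, and the `batch_commands` helper are hypothetical. The `--name`, `-y`, and `--brief` options are standard `dx run` options, and the shared reference inputs are omitted for brevity. This is a plain loop shown for clarity, not the batch mechanism described on the linked page.

```shell
#!/bin/sh
# Hypothetical sample sheet (tab-separated):
# pair name, tumor R1, tumor R2, normal R1, normal R2
sheet=$(mktemp)
trap 'rm -f "$sheet"' EXIT
printf 'pair1\tt1_R1.fastq.gz\tt1_R2.fastq.gz\tn1_R1.fastq.gz\tn1_R2.fastq.gz\n' > "$sheet"
printf 'pair2\tt2_R1.fastq.gz\tt2_R2.fastq.gz\tn2_R1.fastq.gz\tn2_R2.fastq.gz\n' >> "$sheet"

# Print one launch command per tumor/normal pair (dry run).
# Usage: batch_commands <sample_sheet.tsv>
batch_commands() {
  while IFS="$(printf '\t')" read -r pair t1 t2 n1 n2; do
    echo "dx run somatic_small_variant_and_cnv_discovery" \
      "-itumor_reads_fastqgzs=$t1 -itumor_reads2_fastqgzs=$t2" \
      "-inormal_reads_fastqgzs=$n1 -inormal_reads2_fastqgzs=$n2" \
      "-ioutput_prefix=$pair --name $pair -y --brief"
  done < "$1"
}

batch_commands "$sheet"
```

Reviewing the printed commands before piping them to a shell makes it easier to catch a mismatched tumor/normal pairing in the sample sheet.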

