Somatic Variant Assay Loader

Ingest and annotate somatic variant data from VCF files into an Apollo Dataset.

An Apollo license is required to use the Somatic Variant Assay Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

The Somatic Variant Assay Loader lets you ingest somatic variant assay data into an Apollo Dataset. You can use it on its own or as part of a workflow with existing datasets and downstream tools such as JupyterLab.

The somatic variant model provides access to allelic-level tumor and/or tumor-normal variation, including Short Variants (SNV & Indel), Copy Number Variations (CNV), Structural Variants (SV), and Fusions. The loader reads sets of tumor-only or tumor-normal somatic variants from individual VCF files, validates the input data, ingests it into a database, annotates the data, and returns a dataset containing a "Somatic Variant" assay object.

You can use somatic variant data to answer questions such as:

Does an individual have a tumor sample that differs from a paired normal sample at a specific locus and contains a specific allele?
Does an individual have any variation in a specific gene where the tumor sample differs from their normal sample?
Does an individual have a tumor sample that differs from their normal sample in BRCA1, BRCA2, HER2, or TP53 and has been diagnosed with any cancer, regardless of tissue type?

For large datasets (over 100,000 samples × 10,000 variants per sample), contact DNAnexus Professional Services for guidance on optimal performance and tuning.

How to Use the App

Using the UI

To use the Somatic Variant Assay Loader within the DNAnexus Platform:

In the DNAnexus Platform, go to Tools > Tools Library.
For the Somatic Variant Assay Loader - Orchestrator app, click Run Latest Version.
In Output to, select a project and output location for the app's outputs.
Click Next.
In the Inputs tab, specify the required inputs.
Click Start Analysis.

Using the CLI

To use the Somatic Variant Assay Loader from the command-line interface, install the DNAnexus Platform SDK.

When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the dx command right away.

Use the following command format, customizing the input parameters for your specific data.

dx run app-somatic_variant_assay_loader_orchestrator \
  -i sample_manifest=example_manifest.csv \
  -i reference="GRCh38" \
  -i assay_title="my_somatic_assay" \
  -i database="my_somatic_database"
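If you launch the loader from a script, it can help to assemble the command in one place and check the reference value before submitting. The following is a minimal sketch, not part of the app: the helper name build_loader_command is hypothetical, the input names are copied from the command above, and the function only builds the argument list without running it.

```python
import shlex

# Reference genomes listed under Required Parameters.
ALLOWED_REFERENCES = {"GRCh38", "GRCh37"}


def build_loader_command(sample_manifest, reference, assay_title, database):
    """Return the dx run argument list for the loader (does not execute it)."""
    if reference not in ALLOWED_REFERENCES:
        raise ValueError(f"reference must be one of {sorted(ALLOWED_REFERENCES)}")
    return [
        "dx", "run", "app-somatic_variant_assay_loader_orchestrator",
        "-i", f"sample_manifest={sample_manifest}",
        "-i", f"reference={reference}",
        "-i", f"assay_title={assay_title}",
        "-i", f"database={database}",
    ]


cmd = build_loader_command(
    "example_manifest.csv", "GRCh38", "my_somatic_assay", "my_somatic_database"
)
# Print a shell-safe rendering for review before running it yourself.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to subprocess.run without shell quoting concerns.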

Inputs

Required Parameters

Sample Manifest - CSV file that lists the individual-level VCF files to ingest. Each VCF file must follow the standard VCF version 4.3 specifications. The manifest may reference tumor-only or tumor-normal VCFs.
Reference - The reference genome used for annotation. Options: GRCh38 or GRCh37.
Assay Title - A short, human-readable name for the somatic variant assay being ingested.
Database - The database ID or name for the ingested data.

Optionally, to optimize performance for large datasets, you can provide JSON configuration files for the sub-jobs (Validation, Ingestion, Annotation, Optimization). Adjust these only for datasets larger than 100,000 samples × 10,000 variants each. For details, see the in-app documentation for Somatic Variant Assay Loader - Orchestrator.

Supported Data Types

The Somatic Variant Assay Loader requires two types of input for data ingestion:

Source Data - Your tumor VCF files (either tumor-only or tumor-normal)
Sample Manifest - A file describing your source data

VCF File Formats

All files must comply with the standard VCF version 4.3 specifications. The somatic model retains all data and metadata from the original VCF file, allowing for complete retrieval.

Tumor-Only VCF

When ingesting tumor-only data, you must provide one VCF file for each individual. The sample name, located in the 10th column of the VCF header line, must correspond directly to a single individual.

example_tumor.vcf

##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##tumor_sample=ts_001
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ts_001
chr1	1049980	.	G	C	.	PASS	DP=493	GT:AF	0/1:0.991
chr1	1212740	.	A	C	.	PASS	DP=283	GT:AF	0/1:0.988
chr1	1460088	.	TGAGA	T	.	PASS	DP=43	GT:AF	0|1:0.246

Tumor-Normal VCF

When ingesting tumor-normal data, provide a tumor-normal VCF file for each individual. Each file must contain two uniquely named samples, located in the 10th and 11th columns. By default, the 10th column is specified as the normal sample and the 11th as the tumor sample.

However, if your samples are not in this order, or if you want to state explicitly which sample is normal and which is tumor, you can use a sample manifest with explicit sample mapping to link the sample IDs from the VCF to the corresponding normal and tumor types.

example_tumor-normal.vcf

##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	normal_sample001	tumor_sample001
chr1	1049980	.	G	C	.	PASS	DP=493	GT:AF	0/0:0.11	0/1:0.991
chr1	1212740	.	A	C	.	PASS	DP=283	GT:AF	0/0:0.01	0/1:0.988
chr1	1460088	.	TGAGA	T	.	PASS	DP=43	GT:AF	0|0:0.06	0|1:0.246
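The default column convention described above can be sketched as a small header check. This is an illustration only, assuming tab-delimited VCF text; the function name sample_roles and the sample name tumor_sample001 are hypothetical, and the loader's own validation may behave differently.

```python
def sample_roles(vcf_lines):
    """Map the sample columns of a VCF header to roles by position."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            samples = columns[9:]  # sample names start at the 10th column
            break
    else:
        raise ValueError("no #CHROM header line found")

    if len(samples) == 1:
        # Tumor-only VCF: the single sample is the tumor.
        return {"tumor": samples[0]}
    if len(samples) == 2:
        if samples[0] == samples[1]:
            raise ValueError("the two sample names must be unique")
        # Default order: normal in the 10th column, tumor in the 11th.
        return {"normal": samples[0], "tumor": samples[1]}
    raise ValueError(f"expected 1 or 2 sample columns, found {len(samples)}")


header = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tnormal_sample001\ttumor_sample001"
)
print(sample_roles([header]))
```

If the roles printed for one of your files are reversed, that file needs an explicit mapping in the sample manifest rather than the default order.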

Sample Manifest Formats

The sample manifest is a comma-delimited file (*.csv) that maps multiple individual-level VCF files to the data ingestion process. Store the manifest and all VCF files within the same project.

Explicit Sample Mapping

To explicitly map source files to referenced samples and individuals, provide a four-column file with the following headers:

file_id - The DNAnexus Platform file ID for each VCF file object
sample_id - A unique identifier for each set of samples in each VCF. Tumor and normal IDs must map to a sample. If not supplied, the tumor_id is used.
tumor_id - The unique identifier of the tumor sample
normal_id - If present, the unique identifier for the normal sample

Each line must contain one unique file ID. All tumor and normal IDs must be unique across the entire file set. Sample IDs must also be unique across all files, though they may differ from the tumor ID. If you do not include a header, columns must follow the default order (file_id, sample_id, tumor_id, and normal_id). If a header is included, you can arrange columns in any order.

example_manifest_explicit_mapping.csv

file_id,sample_id,tumor_id,normal_id
file-GY6G4Yj0VBvpZ1zY3ZFXX4zy,sample_001,tumor_001,normal_001
file-GY6G4f80VBvf53x3v4Xp0zGg,sample_002,tumor_002,normal_002
file-FJGNEf80VBvf53x3v40Kj8jn,sample_003,tumor_003,
file-Hi84n3lK9lkj98dh3Ksk83n2,sample_004,tumor_004,
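Before submitting a job, you may want to check a manifest against the rules above locally. The sketch below is one possible pre-check, not the loader's validation logic: the function name read_explicit_manifest is hypothetical, and it assumes the first row is a header only when it names exactly the four documented columns.

```python
import csv
import io

DEFAULT_COLUMNS = ["file_id", "sample_id", "tumor_id", "normal_id"]


def read_explicit_manifest(text):
    """Parse a four-column manifest and check the uniqueness rules."""
    rows = [[cell.strip() for cell in r] for r in csv.reader(io.StringIO(text)) if r]
    if not rows:
        raise ValueError("manifest is empty")
    # Assumption: a first row naming the four columns is a header (any order).
    if set(rows[0]) == set(DEFAULT_COLUMNS):
        columns, rows = rows[0], rows[1:]
    else:
        columns = DEFAULT_COLUMNS

    records, seen_files, seen_samples, seen_ids = [], set(), set(), set()
    for row in rows:
        rec = dict(zip(columns, row))
        if not rec.get("file_id") or not rec.get("tumor_id"):
            raise ValueError(f"row needs a file_id and a tumor_id: {row}")
        rec["normal_id"] = rec.get("normal_id") or None
        # If sample_id is not supplied, the tumor_id is used.
        rec["sample_id"] = rec.get("sample_id") or rec["tumor_id"]
        if rec["file_id"] in seen_files:
            raise ValueError(f"duplicate file_id: {rec['file_id']}")
        if rec["sample_id"] in seen_samples:
            raise ValueError(f"duplicate sample_id: {rec['sample_id']}")
        for key in ("tumor_id", "normal_id"):
            value = rec[key]
            if value is None:
                continue
            if value in seen_ids:
                raise ValueError(f"duplicate {key}: {value}")
            seen_ids.add(value)
        seen_files.add(rec["file_id"])
        seen_samples.add(rec["sample_id"])
        records.append(rec)
    return records


manifest = """file_id,sample_id,tumor_id,normal_id
file-GY6G4Yj0VBvpZ1zY3ZFXX4zy,sample_001,tumor_001,normal_001
file-FJGNEf80VBvf53x3v40Kj8jn,sample_003,tumor_003,
"""
for record in read_explicit_manifest(manifest):
    print(record)
```

Tumor-only rows leave normal_id empty, which the sketch stores as None so that empty cells are never compared for uniqueness.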

Automatic Sample Mapping

You can also use a minimal sample manifest file that includes only a single file_id column (with or without a header), as long as the source VCF files meet the following conditions:

All tumor and normal IDs must be unique across all files.
For tumor-normal VCFs, the normal sample must be in the 10th column and the tumor sample in the 11th column.

example_manifest_automatic_mapping.csv

file-GY6G4Yj0VBvpZ1zY3ZFXX4zy
file-GY6G4f80VBvf53x3v4Xp0zGg
file-FJGNEf80VBvf53x3v40Kj8jn
file-Hi84n3lK9lkj98dh3Ksk83n2
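The uniqueness condition for automatic mapping can be checked before writing the single-column manifest. The following is a sketch under stated assumptions: the function name automatic_manifest is hypothetical, and the caller supplies the sample names already read from each VCF header. It cannot confirm that the 10th column really holds the normal sample; only you know that about your data.

```python
from collections import Counter


def automatic_manifest(samples_by_file):
    """Return single-column manifest text, or raise if the IDs are not unique.

    samples_by_file maps a file ID to the sample names found in that
    VCF's header, in column order (10th column first).
    """
    counts = Counter(name for names in samples_by_file.values() for name in names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"sample IDs repeated across files: {duplicates}")
    for file_id, names in samples_by_file.items():
        if len(names) not in (1, 2):
            raise ValueError(f"{file_id}: expected 1 or 2 samples, found {len(names)}")
    # One file ID per line, no header.
    return "\n".join(samples_by_file) + "\n"


print(
    automatic_manifest(
        {
            "file-GY6G4Yj0VBvpZ1zY3ZFXX4zy": ["normal_001", "tumor_001"],
            "file-FJGNEf80VBvf53x3v40Kj8jn": ["tumor_003"],
        }
    ),
    end="",
)
```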

Outputs

Database - The identifier of the database containing the somatic variant data.
Dataset - A dataset containing the SomaticVariantAssay object.
Cluster Logs - Log files from all nodes. These logs are present only when the Collect Cluster Logs input is set to true.
Reports - Structured logs from the parent job (Orchestration) and its subjobs (Validation, Ingestion, Annotation, and Optimization).

You can use the generated dataset in the Cohort Browser by clicking on the dataset name, or selecting the dataset record and clicking Explore Data. For more details, see Analyzing Somatic Variants.

Best Practices & Troubleshooting

Data Preparation Guidelines

Ensure all VCF files comply with VCF version 4.3 specifications.
For tumor-normal VCFs using automatic mapping, place the normal sample in the 10th column and tumor sample in the 11th column.
All tumor and normal IDs must be unique across the entire file set.
Store the manifest and all VCF files within the same project.

Performance Considerations

The somatic variant data model is designed for tumor-normal related variation and may not be efficient for population genomics analysis.
For optimal performance, limit your dataset to no more than 100,000 samples, each with between 1 and 30,000 variants. Larger datasets may require additional tuning to prevent query timeouts.
For large datasets, contact DNAnexus Professional Services for guidance.
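As a rough pre-check against the limits stated in this section, you could count variants per sample before ingestion. This is a minimal sketch: the function name size_warnings is hypothetical, the thresholds are copied from the guidance above, and how you obtain the per-sample counts is up to you.

```python
MAX_SAMPLES = 100_000
MAX_VARIANTS_PER_SAMPLE = 30_000


def size_warnings(variant_counts):
    """Return human-readable warnings for a list of per-sample variant counts."""
    counts = list(variant_counts)
    warnings = []
    if len(counts) > MAX_SAMPLES:
        warnings.append(f"{len(counts)} samples exceeds {MAX_SAMPLES}")
    too_large = sum(1 for c in counts if c > MAX_VARIANTS_PER_SAMPLE)
    if too_large:
        warnings.append(
            f"{too_large} sample(s) exceed {MAX_VARIANTS_PER_SAMPLE} variants"
        )
    empty = sum(1 for c in counts if c < 1)
    if empty:
        warnings.append(f"{empty} sample(s) have no variants")
    return warnings


for message in size_warnings([493, 45_000, 0]):
    print(message)
```

An empty result means the dataset falls within the stated range; any warning suggests the job may need additional tuning.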

Common Issues and Solutions

Check the generated reporting files. Validation reports provide a summary of errors and warnings that can help with troubleshooting.
If a job completes the validation step but fails due to insufficient resources, you can recover by:
- Selecting a more powerful instance type for the next run.
- Enabling Recovery Mode.
- Ensuring the new job uses the same sample manifest, database name, and geno bin size as the failed job.
Annotation is supported as a comparison between a sample (either tumor or normal) and a reference genome. Direct tumor-normal comparisons are not made, but a tumor and normal are mapped to each other for additional comparative annotation.

Next Steps

For details on analyzing somatic variants in the Cohort Browser, see Analyzing Somatic Variants.
Create multi-assay datasets by combining with other dataset types (clinical, germline variant, molecular expression).
Use dx extract_assay somatic to access somatic variant data from the command line.
Parse assay metadata and access somatic variant data in Spark-enabled JupyterLab instances.

See tutorial notebooks in OpenBio for:

