Somatic Variant Assay Loader
Ingest and annotate somatic variant data from VCF files into an Apollo Dataset.
Overview
The Somatic Variant Assay Loader lets you ingest somatic variant assay data into an Apollo Dataset. You can use it on its own or as part of a workflow with existing datasets and downstream tools such as JupyterLab.
The somatic variant model provides access to allelic-level tumor and/or tumor-normal variation, including Short Variants (SNV & Indel), Copy Number Variations (CNV), Structural Variants (SV), and Fusions. The loader reads sets of tumor-only or tumor-normal somatic variants from individual VCF files, validates the input data, ingests it into a database, annotates the data, and returns a dataset containing a "Somatic Variant" assay object.
You can use somatic variant data to answer questions such as:
Does an individual have a tumor sample that differs from a paired normal sample at a specific locus and contains a specific allele?
Does an individual have any variation in a specific gene where the tumor sample differs from their normal sample?
Does an individual have a tumor sample that differs from their normal sample in BRCA1, BRCA2, HER2, or TP53 and has been diagnosed with any cancer, regardless of tissue type?
For large datasets (over 100,000 samples × 10,000 variants per sample), contact DNAnexus Professional Services for guidance on optimal performance and tuning.

How to Use the App
Using the UI
To use the Somatic Variant Assay Loader within the DNAnexus Platform:
In the DNAnexus Platform, go to Tools > Tools Library.
For the Somatic Variant Assay Loader - Orchestrator app, click Run Latest Version.
In Output to, select a project and output location for the app's outputs.
Click Next.
In the Inputs tab, specify the required inputs.
Click Start Analysis.

Using the CLI
To use the Somatic Variant Assay Loader from the command-line interface, install the DNAnexus Platform SDK.
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the dx command right away.
Use the following command format, customizing the input parameters for your specific data.
dx run app-somatic_variant_assay_loader_orchestrator \
-i sample_manifest=example_manifest.csv \
-i reference="GRCh38" \
-i assay_title="my_somatic_assay" \
-i database="my_somatic_database"
Inputs
Required Parameters
Sample Manifest - CSV file that lists the individual-level VCF files to ingest. Each VCF file must follow the standard VCF version 4.3 specifications. The manifest may reference tumor-only or tumor-normal VCFs.
Reference - The reference genome used for annotation. Options: GRCh38 or GRCh37.
Assay Title - A short, human-readable name for the somatic variant assay being ingested.
Database - The database ID or name for the ingested data.
Optionally, to optimize performance for large datasets, you can provide JSON configuration files for the sub-jobs (Validation, Ingestion, Annotation, Optimization). Adjust these only for datasets larger than 100,000 samples × 10,000 variants each. For details, see the in-app documentation for the Somatic Variant Assay Loader - Orchestrator app.
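When launching the loader from a script or notebook, it can help to assemble the CLI call programmatically. The following is a minimal Python sketch, not part of the app itself: `build_loader_command` is a hypothetical helper, and only the app name and the four input names from the CLI example on this page are assumed. It checks the reference value and returns the argument list without running anything.

```python
import shlex

# App name and input names are copied from the CLI example on this page.
APP_NAME = "app-somatic_variant_assay_loader_orchestrator"
SUPPORTED_REFERENCES = ("GRCh38", "GRCh37")


def build_loader_command(sample_manifest, reference, assay_title, database):
    """Return the `dx run` argument list for the loader (hypothetical helper)."""
    if reference not in SUPPORTED_REFERENCES:
        raise ValueError(
            f"reference must be one of {SUPPORTED_REFERENCES}, got {reference!r}"
        )
    inputs = {
        "sample_manifest": sample_manifest,
        "reference": reference,
        "assay_title": assay_title,
        "database": database,
    }
    # All four inputs are listed as required, so reject empty values early.
    for name, value in inputs.items():
        if not value:
            raise ValueError(f"required input {name!r} is empty")
    command = ["dx", "run", APP_NAME]
    for name, value in inputs.items():
        command += ["-i", f"{name}={value}"]
    return command


if __name__ == "__main__":
    cmd = build_loader_command(
        "example_manifest.csv", "GRCh38", "my_somatic_assay", "my_somatic_database"
    )
    # Print a copy-pasteable command line instead of launching the job.
    print(shlex.join(cmd))
```

To launch the job rather than print the command, the returned list could be passed to `subprocess.run`, provided the `dx` command is installed and you are logged in.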
Supported Data Types
The Somatic Variant Assay Loader requires two types of input for data ingestion:
Source Data - Your tumor VCF files (either tumor-only or tumor-normal)
Sample Manifest - A file describing your source data
VCF File Formats
All files must comply with the standard VCF version 4.3 specifications. The somatic model retains all data and metadata from the original VCF file, allowing for complete retrieval.
Tumor-Only VCF
When ingesting tumor-only data, you must provide one VCF file for each individual. The sample name, located in the 10th field of the VCF, must correspond directly to a single individual.
##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##tumor_sample=ts_001
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ts_001
chr1 1049980 . G C . PASS DP=493 GT:AF 0/1:0.991
chr1 1212740 . A C . PASS DP=283 GT:AF 0/1:0.988
chr1 1460088 . TGAGA T . PASS DP=43 GT:AF 0|1:0.246
Tumor-Normal VCF
When ingesting tumor-normal data, provide a tumor-normal VCF file for each individual. Each file must contain two uniquely named samples, located in the 10th and 11th columns. By default, the 10th column is specified as the normal sample and the 11th as the tumor sample.
However, if your samples are in a different order, you can use the sample manifest (see Explicit Sample Mapping) to link each sample ID in the VCF to its normal or tumor role.
##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal_sample001 tumor_sample001
chr1 1049980 . G C . PASS DP=493 GT:AF 0/0:0.11 0/1:0.991
chr1 1212740 . A C . PASS DP=283 GT:AF 0/0:0.01 0/1:0.988
chr1 1460088 . TGAGA T . PASS DP=43 GT:AF 0|0:0.06 0|1:0.246
Sample Manifest Formats
The sample manifest is a comma-delimited file (*.csv) that lists the individual-level VCF files to ingest. Store the manifest and all VCF files within the same project.
Explicit Sample Mapping
To explicitly map source files to referenced samples and individuals, provide a four-column file with the following headers:
file_id - The DNAnexus Platform file ID for each VCF file object
sample_id - A unique identifier for each set of samples in each VCF. Tumor and normal IDs must map to a sample. If not supplied, the tumor_id is used
tumor_id - The unique identifier of the tumor sample
normal_id - If present, the unique identifier for the normal sample
Each line must contain one unique file ID. All tumor and normal IDs must be unique across the entire file set. Sample IDs must also be unique across all files, though they may differ from the tumor ID. If you do not include a header, columns must follow the default order (file_id, sample_id, tumor_id, and normal_id). If a header is included, you can arrange columns in any order.
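The uniqueness rules above can be checked locally before you upload a manifest. The following Python sketch is an illustration only, not the loader's own validation: `read_manifest` and `validate_manifest` are hypothetical helpers, and the header is detected by assuming that a header row contains the literal column name `file_id`.

```python
import csv
import io

# Default column order used when the manifest has no header row.
DEFAULT_COLUMNS = ["file_id", "sample_id", "tumor_id", "normal_id"]


def read_manifest(text):
    """Parse manifest text into a list of dicts keyed by column name."""
    rows = [r for r in csv.reader(io.StringIO(text)) if any(c.strip() for c in r)]
    if not rows:
        return []
    first = [cell.strip() for cell in rows[0]]
    # Assumption: a header row is recognized by the presence of "file_id".
    if "file_id" in first:
        columns, rows = first, rows[1:]
    else:
        columns = DEFAULT_COLUMNS
    records = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        cells += [""] * (len(columns) - len(cells))  # pad short rows
        records.append(dict(zip(columns, cells)))
    return records


def validate_manifest(records):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    seen_files, seen_samples, seen_ids = set(), set(), set()
    for number, rec in enumerate(records, start=1):
        tumor_id = rec.get("tumor_id", "")
        checks = [
            ("file_id", rec.get("file_id", ""), seen_files),
            # sample_id falls back to tumor_id when it is not supplied.
            ("sample_id", rec.get("sample_id", "") or tumor_id, seen_samples),
            # Tumor and normal IDs share one pool: all must be unique.
            ("tumor_id", tumor_id, seen_ids),
            ("normal_id", rec.get("normal_id", ""), seen_ids),
        ]
        for column, value, seen in checks:
            if not value:
                continue
            if value in seen:
                problems.append(f"record {number}: duplicate {column} {value!r}")
            seen.add(value)
        if not rec.get("file_id"):
            problems.append(f"record {number}: missing file_id")
    return problems


if __name__ == "__main__":
    manifest = (
        "file_id,sample_id,tumor_id,normal_id\n"
        "file-GY6G4Yj0VBvpZ1zY3ZFXX4zy,sample_001,tumor_001,normal_001\n"
        "file-FJGNEf80VBvf53x3v40Kj8jn,sample_003,tumor_003,\n"
    )
    for problem in validate_manifest(read_manifest(manifest)):
        print(problem)
```

Record numbers in the reported problems count data rows only and exclude the header row. A manifest in the expected layout looks like this: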
file_id,sample_id,tumor_id,normal_id
file-GY6G4Yj0VBvpZ1zY3ZFXX4zy,sample_001,tumor_001,normal_001
file-GY6G4f80VBvf53x3v4Xp0zGg,sample_002,tumor_002,normal_002
file-FJGNEf80VBvf53x3v40Kj8jn,sample_003,tumor_003,
file-Hi84n3lK9lkj98dh3Ksk83n2,sample_004,tumor_004,
Automatic Sample Mapping
You can also use a minimal sample manifest file that includes only a single file_id column (with or without a header), as long as the source VCF files meet the following conditions:
All tumor and normal IDs must be unique across all files.
For tumor-normal VCFs, the normal sample must be in the 10th column and the tumor sample in the 11th column.
file-GY6G4Yj0VBvpZ1zY3ZFXX4zy
file-GY6G4f80VBvf53x3v4Xp0zGg
file-FJGNEf80VBvf53x3v40Kj8jn
file-Hi84n3lK9lkj98dh3Ksk83n2
Outputs
Database - The identifier of the database containing the somatic variant data.
Dataset - A dataset containing the SomaticVariantAssay object.
Cluster Logs - Log files from all nodes. These logs are present only when the Collect Cluster Logs input is set to true.
Reports - Structured logs from the parent job (Orchestration) and its subjobs (Validation, Ingestion, Annotation, and Optimization).
You can use the generated dataset in the Cohort Browser by clicking on the dataset name, or selecting the dataset record and clicking Explore Data. For more details, see Analyzing Somatic Variants.
Best Practices & Troubleshooting
Data Preparation Guidelines
Ensure all VCF files comply with VCF version 4.3 specifications.
For tumor-normal VCFs using automatic mapping, place the normal sample in the 10th column and tumor sample in the 11th column.
All tumor and normal IDs must be unique across the entire file set.
Store the manifest and all VCF files within the same project.
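A quick local check of the sample columns can catch ordering mistakes before ingestion. The sketch below is a hedged illustration in Python: `sample_names` and `default_roles` are hypothetical helpers, the code inspects only the `#CHROM` header line, and it does not verify full VCF 4.3 compliance. It assumes tab-separated header fields, as the VCF specification requires.

```python
def sample_names(lines):
    """Return the sample names (10th column onward) from the #CHROM header line."""
    for line in lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\r\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached data lines without seeing the header
    raise ValueError("no #CHROM header line found")


def default_roles(samples):
    """Apply the default convention: 10th column normal, 11th column tumor."""
    if len(samples) == 1:
        # Tumor-only VCF: the single sample is the tumor.
        return {"normal": None, "tumor": samples[0]}
    if len(samples) == 2:
        if samples[0] == samples[1]:
            raise ValueError("tumor and normal sample names must differ")
        return {"normal": samples[0], "tumor": samples[1]}
    raise ValueError(f"expected 1 or 2 sample columns, found {len(samples)}")


if __name__ == "__main__":
    header = "\t".join(
        ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
         "normal_001", "tumor_001"]
    )
    example = ["##fileformat=VCFv4.3\n", header + "\n"]
    print(default_roles(sample_names(example)))
```

For a file on disk, pass an open text file handle to `sample_names`. If the reported roles do not match your data, use explicit sample mapping in the manifest instead of relying on column order.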
Performance Considerations
The somatic variant data model is designed for tumor-normal related variation and may not be efficient for population genomics analysis.
For optimal performance, limit your dataset to no more than 100,000 samples, each with between 1 and 30,000 variants. Larger datasets may require additional tuning to prevent query timeouts.
For large datasets, contact DNAnexus Professional Services for guidance.
Common Issues and Solutions
Check the generated reporting files. Validation reports provide a summary of errors and warnings that can help with troubleshooting.
If a job completes the validation step but fails due to insufficient resources, you can recover by:
Selecting a more powerful instance type for the next run.
Enabling Recovery Mode.
Ensuring the new job uses the same sample manifest, database name, and geno bin size as the failed job.
Annotation compares each sample (either tumor or normal) against the reference genome. The loader does not compare tumor and normal samples directly, but it maps each tumor to its normal sample for additional comparative annotation.
Next Steps
For details on analyzing somatic variants in the Cohort Browser, see Analyzing Somatic Variants.
Create multi-assay datasets by combining with other dataset types (clinical, germline variant, molecular expression).
Use dx extract_assay somatic to access somatic variant data from the command line.
Parse assay metadata and access somatic variant data in Spark-enabled JupyterLab instances.
See tutorial notebooks in OpenBio for: