VCF Preprocessing

Learn about preprocessing VCF data before using it in an analysis.

Overview

You may need to preprocess, or harmonize, your VCF data before you load it.

Harmonizing Data

  • The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.

  • GLnexus harmonizes sites across all gVCFs and generates a single pVCF file that contains all harmonized sites and the genotypes of every sample.
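Because the input is expected to be one gVCF per sample, a quick local check of each file can catch multi-sample files before harmonization. The sketch below is an illustration, not part of GLnexus or the platform tooling: `count_samples` is a hypothetical helper name. It assumes standard VCF layout, where the `#CHROM` header line has nine fixed columns (through `FORMAT`) followed by one column per sample.

```shell
# Hypothetical helper (not part of GLnexus or dx): print the number of
# sample columns declared in the #CHROM header line of a VCF/gVCF.
# Files ending in .gz are decompressed on the fly; others are read as-is.
count_samples() {
    case "$1" in
        *.gz) gzip -cd -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk -F'\t' '
        /^#CHROM/ {
            n = NF - 9          # 9 fixed columns: CHROM ... FORMAT
            if (n < 0) n = 0    # header without FORMAT/sample columns
            print n
            exit                # stop at the header line
        }'
}
```

For example, `count_samples sample1.g.vcf.gz` would be expected to print `1` for a single-sample gVCF; any other value suggests the file does not match the one-file-per-sample expectation.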

Apollo GLnexus

Basic Run

dx run app-glnexus \
    -i common.gvcf_manifest=<manifest_file_id> \
    -i common.config=gatk_unfiltered \
    -i common.targets_bed=<bed_target_ranges>

Advanced Run

dx run workflow-glnexus \
    -i common.gvcf_manifest=<manifest_file_id> \
    -i common.config=gatk_unfiltered \
    -i common.targets_bed=<bed_target_ranges> \
    -i unify.shards_bed=<bed_genomic_partition_ranges> \
    -i etl.shards=<num_sample_partitions>

Annotating Variants

VCF files can include variant annotations. SnpEff annotations provided as INFO/ANN tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.

The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.
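If you annotate the pVCF yourself, it can help to confirm that the file actually declares the INFO/ANN tag before you load it. The following is a minimal sketch under stated assumptions: `has_ann_header` is a hypothetical helper name, not a VCF Loader or SnpEff command. It only checks for an `##INFO=<ID=ANN,...>` line in the VCF header and does not validate the annotation contents.

```shell
# Hypothetical helper (not part of the VCF Loader or SnpEff): succeed
# (exit status 0) if the VCF header declares an INFO/ANN tag, fail otherwise.
# Files ending in .gz are decompressed on the fly; others are read as-is.
has_ann_header() {
    case "$1" in
        *.gz) gzip -cd -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk '
        /^##INFO=<ID=ANN,/ { found = 1 }   # ANN declared in the header
        !/^#/              { exit }        # first data line: header is over
        END                { exit !found } # status 0 only if ANN was seen
    '
}
```

A possible use is `has_ann_header cohort.pvcf.vcf.gz && echo "ANN present"`, where the file name is a placeholder for your own harmonized pVCF.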

Figure: VCF annotation flows. In (a) the annotation step is external to the VCF Loader, whereas in (b) the annotation step is internal. In both cases, SnpEff annotations present as INFO/ANN tags are loaded into the database by the VCF Loader.
