VCF Loader

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

Overview

VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.

The input for each run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. When many files are supplied, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In every case, each input VCF file must be a syntactically correct, sorted VCF file.

VCF Preprocessing

Although VCF data can be loaded into Apollo databases immediately after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, you'll want to preprocess and harmonize your data before loading. To learn more, see VCF Preprocessing.

How to Run VCF Loader

Input:

  • vcf_manifest : (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, the complete VCF file to load is considered partitioned, and each specified partition must be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
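For illustration, a manifest is just a plain text file with one file ID per line. The file IDs below are placeholders, not real platform objects:

```shell
# Build a manifest listing the VCF partitions to load (placeholder file IDs).
cat > vcf_manifest.txt <<'EOF'
file-AAAA000000000000000000a1
file-AAAA000000000000000000a2
file-AAAA000000000000000000a3
EOF

# One file ID per line; the referenced files must have distinct
# names ending in .vcf.gz.
wc -l < vcf_manifest.txt

# The manifest itself is then uploaded to the platform, e.g.:
#   dx upload vcf_manifest.txt
```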

Required Parameters:

  • database_name : (string) name of the database into which to load the VCF files.

  • create_mode : (string) strict mode creates the database and tables from scratch; optimistic mode creates the database and tables only if they do not already exist.

  • insert_mode : (string) append appends data to the end of tables; overwrite is equivalent to truncating the tables and then appending to them.

  • run_mode : (string) site mode processes only site-specific data; genotype mode processes genotype-specific data and other non-site-specific data; all mode processes both types of data.

  • etl_spec_id : (string) currently genomics-phenotype is the only supported schema choice.

  • is_sample_partitioned : (boolean) whether the raw VCF data is partitioned by sample.

Other Options:

  • snpeff : (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.

  • snpeff_human_genome : (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.

  • snpeff_opt_no_upstream : (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.

  • snpeff_opt_no_downstream : (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.

  • calculate_worst_effects : (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.

  • calculate_locus_frequencies : (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.

  • snpsift : (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.

  • num_init_partitions : (int) the number of partitions for the initial VCF lines Spark RDD.
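As a sketch of when the snpsift step can be skipped: if the raw VCF header already declares an INFO/RSID tag, the dbSNP annotation is presumably already present. The file below is a synthetic stand-in for a real raw VCF:

```shell
# Synthetic stand-in for a pre-annotated raw VCF (illustrative only).
printf '##fileformat=VCFv4.2\n##INFO=<ID=RSID,Number=A,Type=String,Description="dbSNP IDs">\n' \
  | gzip > sample.vcf.gz

# Check whether the header already declares INFO/RSID.
if gunzip -c sample.vcf.gz | grep -q '^##INFO=<ID=RSID,'; then
  echo "INFO/RSID present -- the snpsift step may be unnecessary"
else
  echo "INFO/RSID absent -- keep snpsift=true"
fi
```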

Basic Run

dx run vcf-loader \
   -i vcf_manifest=file-xxxx \
   -i is_sample_partitioned=false \
   -i database_name=<my_favorite_db> \
   -i etl_spec_id=genomics-phenotype \
   -i create_mode=strict \
   -i insert_mode=append \
   -i run_mode=genotype
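
For a sample-partitioned input loaded into a database that may already exist, a run might look like the following sketch. The manifest file ID and database name are placeholders; note that overwrite replaces any rows already in the tables:

```shell
# Hypothetical run for sample-partitioned input (placeholder ID and name).
# optimistic create_mode reuses the database/tables if they already exist;
# overwrite truncates the tables before inserting.
dx run vcf-loader \
   -i vcf_manifest=file-yyyy \
   -i is_sample_partitioned=true \
   -i database_name=<my_favorite_db> \
   -i etl_spec_id=genomics-phenotype \
   -i create_mode=optimistic \
   -i insert_mode=overwrite \
   -i run_mode=all
```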
