VCF Loader
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Overview
VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input for each run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the multi-file case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In every case, each input VCF file must be syntactically correct and sorted.
VCF Preprocessing
Although VCF data can be loaded into Apollo databases immediately after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, you'll want to preprocess and harmonize your data before loading. To learn more, see VCF Preprocessing.
How to Run VCF Loader
Input:
vcf_manifest
: (file) a text file listing the file IDs of the VCF files to load, one per line. The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, the complete VCF file to load is considered to be partitioned, and every specified partition must be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
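The manifest constraints above can be checked before launching a job. The following is a minimal sketch, not part of VCF Loader: `validate_manifest` is a hypothetical helper, and it assumes the file names have already been looked up elsewhere and are passed in as a dictionary keyed by file ID. The ID pattern used here (`file-` followed by 24 alphanumeric characters) is likewise an assumption about the ID format. The sketch does not check that each file is a valid, sorted VCF.

```python
import re

# Assumed shape of a file ID: "file-" plus 24 alphanumeric characters.
FILE_ID_PATTERN = re.compile(r"^file-[0-9A-Za-z]{24}$")


def validate_manifest(manifest_text, names_by_id):
    """Return a list of problems found in a vcf_manifest (empty if none).

    manifest_text: contents of the manifest, one file ID per line.
    names_by_id:   hypothetical lookup of file ID -> file name, built elsewhere.
    """
    problems = []
    # Ignore blank lines and surrounding whitespace.
    file_ids = [line.strip() for line in manifest_text.splitlines() if line.strip()]
    if not file_ids:
        problems.append("manifest lists no files")

    seen_names = {}  # file name -> first file ID that used it
    for file_id in file_ids:
        if not FILE_ID_PATTERN.match(file_id):
            problems.append(f"{file_id}: does not look like a file ID")
            continue
        name = names_by_id.get(file_id)
        if name is None:
            problems.append(f"{file_id}: file name unknown")
            continue
        if not name.endswith(".vcf.gz"):
            problems.append(f"{file_id}: name {name!r} does not end in .vcf.gz")
        if name in seen_names:
            problems.append(f"{file_id}: name {name!r} duplicates {seen_names[name]}")
        else:
            seen_names[name] = file_id
    return problems
```

An empty return value means the manifest passed these checks.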
Required Parameters:
database_name
: (string) name of the database into which to load the VCF files.
create_mode
: (string) strict mode creates the database and tables from scratch, and optimistic mode creates databases and tables if they do not already exist.
insert_mode
: (string) append appends data to the end of tables, and overwrite is equivalent to truncating the tables and then appending to them.
run_mode
: (string) site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data, and all mode processes both types of data.
etl_spec_id
: (string) currently only the genomics-phenotype schema choice is supported.
is_sample_partitioned
: (boolean) whether the raw VCF data is partitioned.
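The allowed values listed above lend themselves to a quick check before a job is submitted. This is a minimal sketch under stated assumptions: `check_required_parameters` is a hypothetical helper, not something VCF Loader provides, and the value sets are transcribed from the parameter descriptions in this section.

```python
# Allowed values, transcribed from the parameter descriptions above.
ALLOWED_VALUES = {
    "create_mode": {"strict", "optimistic"},
    "insert_mode": {"append", "overwrite"},
    "run_mode": {"site", "genotype", "all"},
    "etl_spec_id": {"genomics-phenotype"},
}

REQUIRED = (
    "database_name",
    "create_mode",
    "insert_mode",
    "run_mode",
    "etl_spec_id",
    "is_sample_partitioned",
)


def check_required_parameters(params):
    """Return a list of error messages for a dict of required parameters."""
    errors = []
    for name in REQUIRED:
        if name not in params:
            errors.append(f"missing required parameter: {name}")

    for name, allowed in ALLOWED_VALUES.items():
        if name in params and params[name] not in allowed:
            choices = ", ".join(sorted(allowed))
            errors.append(f"{name}={params[name]!r} is not one of: {choices}")

    if "database_name" in params:
        value = params["database_name"]
        if not isinstance(value, str) or not value:
            errors.append("database_name must be a non-empty string")

    if "is_sample_partitioned" in params:
        if not isinstance(params["is_sample_partitioned"], bool):
            errors.append("is_sample_partitioned must be a boolean")
    return errors
```

For example, passing `run_mode="sites"` would be reported as an error, because only `site`, `genotype`, and `all` are described.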
Other Options:
snpeff
: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome
: (string) default GRCh38.92 -- ID of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
snpeff_opt_no_upstream
: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream
: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects
: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.
calculate_locus_frequencies
: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift
: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions
: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.
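To make the filter described under calculate_worst_effects concrete, the sketch below applies the same four exclusion rules to a raw INFO/ANN value. It is an illustration only, not VCF Loader's implementation: the function names are hypothetical, the field positions assume the standard pipe-separated SnpEff ANN layout (effect at index 1, feature type at index 5, transcript biotype at index 7), and the effect is compared as a whole string, so how combined effects joined with `&` are treated is an assumption.

```python
# Assumed positions within one pipe-separated SnpEff ANN entry.
EFFECT, FEATURE_TYPE, TRANSCRIPT_BIOTYPE = 1, 5, 7
EXCLUDED_EFFECTS = {"upstream_gene_variant", "downstream_gene_variant"}


def keep_annotation(ann_entry):
    """Return True if one ANN entry survives the exclusion rules."""
    fields = ann_entry.split("|")
    if len(fields) <= TRANSCRIPT_BIOTYPE:
        return False  # assumption: entries too short to inspect are dropped
    return (
        fields[FEATURE_TYPE] == "transcript"
        and fields[TRANSCRIPT_BIOTYPE] == "protein_coding"
        and fields[EFFECT] not in EXCLUDED_EFFECTS
    )


def filter_ann_value(ann_value):
    """Filter a comma-separated INFO/ANN value and return the kept entries."""
    return [entry for entry in ann_value.split(",") if keep_annotation(entry)]
```

Only entries that describe a protein-coding transcript and are neither upstream nor downstream gene variants are kept. Selecting the worst effect per alternate-allele--gene combination is a separate step that this sketch does not attempt.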
Basic Run
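As an illustration of how the inputs described above fit together, the following sketch assembles a `dx run` command line using the CLI's `-i name=value` input syntax. Several things here are assumptions rather than facts from this page: the app name `vcf-loader`, the helper `build_dx_run_command`, the database name, and the placeholder manifest ID `file-xxxx`, which must be replaced with a real file ID.

```python
import shlex


def build_dx_run_command(app_name, inputs):
    """Assemble a `dx run` argument list from a dictionary of inputs."""
    args = ["dx", "run", app_name]
    for name, value in inputs.items():
        # Render Python booleans as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["-i", f"{name}={value}"]
    return args


command = build_dx_run_command(
    "vcf-loader",  # assumed app name
    {
        "vcf_manifest": "file-xxxx",  # placeholder, not a real file ID
        "is_sample_partitioned": False,
        "database_name": "my_database",  # hypothetical database name
        "etl_spec_id": "genomics-phenotype",
        "create_mode": "strict",
        "insert_mode": "append",
        "run_mode": "genotype",
    },
)

# Print the command for review; it could also be passed to subprocess.run.
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere and makes it easy to review the inputs before submitting a job.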