VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.
The input for each run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the multi-file case, the logical VCF may be partitioned by chromosome, by genomic region, and/or by sample. In either case, every input VCF file must be a syntactically correct, sorted VCF file.
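As a rough pre-flight check for the "sorted" requirement, the sketch below scans VCF records and reports the first one that is out of order. It assumes that "sorted" means records are grouped by contig with non-decreasing POS within each contig; the loader's exact definition is not stated here, and the function name `first_unsorted_record` is hypothetical.

```python
from typing import Iterable, Optional


def first_unsorted_record(lines: Iterable[str]) -> Optional[int]:
    """Return the 1-based line number of the first out-of-order record, or None.

    Assumption: a sorted VCF keeps each contig's records contiguous and
    orders them by non-decreasing POS. This is not the loader's own check.
    """
    seen_contigs = set()
    current_contig = None
    last_pos = 0
    for line_no, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # header and blank lines carry no position
        chrom, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)
        if chrom != current_contig:
            if chrom in seen_contigs:
                return line_no  # contig reappears after a different contig
            seen_contigs.add(chrom)
            current_contig = chrom
        elif pos < last_pos:
            return line_no  # position went backwards within the contig
        last_pos = pos
    return None
```

For a compressed input, the function can be fed an open text handle, for example `gzip.open(path, "rt")` from the standard library.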
Although VCF data can be loaded into Apollo databases immediately after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, you'll want to preprocess and harmonize your data before loading. To learn more, see VCF Preprocessing.
vcf_manifest: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, then the complete VCF file to load is considered to be partitioned, and every specified partition must be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
database_name: (string) name of the database into which to load the VCF files.
create_mode: (string) strict mode creates the database and tables from scratch, and optimistic mode creates the database and tables if they do not already exist.
insert_mode: (string) append appends data to the end of tables, and overwrite is equivalent to truncating the tables and then appending to them.
site mode processes only the site-specific data,
genotype mode processes genotype-specific data and other non-site-specific data, and
all mode processes both types of data.
etl_spec_id: (string) currently only the genomics-phenotype schema choice is supported.
is_sample_partitioned: (boolean) whether the raw VCF data is partitioned by sample.
snpeff: (boolean) default true -- whether to include the SnpEff annotation step in preprocessing with INFO/ANN tags. If SnpEff annotations are desired in the database, then either pre-annotate the raw VCF separately, or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome: (string) default GRCh38.92 -- id of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
snpeff_opt_no_upstream: (boolean) default true -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream: (boolean) default true -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects: (boolean) default true -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.
calculate_locus_frequencies: (boolean) default true -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
snpsift: (boolean) default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions: (int) integer defining the number of partitions for the initial VCF lines Spark RDD.
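To illustrate the annotation filter described under calculate_worst_effects, the following sketch parses an INFO/ANN value and keeps only the entries that survive the four exclusion rules listed there. It covers the filtering only, not how the worst effect is then chosen, and it is not the loader's implementation. The helper names `parse_ann` and `keep_for_worst_effects` are hypothetical, and the subfield order is assumed to follow the standard SnpEff ANN layout.

```python
from typing import Dict, List

# Leading subfields of one ANN entry, in SnpEff's documented order.
# Later subfields (rank, HGVS, positions, ...) are not needed here.
ANN_FIELDS = [
    "allele", "effect", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype",
]

EXCLUDED_EFFECTS = {"upstream_gene_variant", "downstream_gene_variant"}


def parse_ann(ann_value: str) -> List[Dict[str, str]]:
    """Split an ANN value into entries (comma) and subfields (pipe)."""
    entries = []
    for raw in ann_value.split(","):
        parts = raw.split("|")
        entries.append(dict(zip(ANN_FIELDS, parts)))  # zip drops extra subfields
    return entries


def keep_for_worst_effects(entry: Dict[str, str]) -> bool:
    """Apply the exclusions named in the calculate_worst_effects description."""
    if entry.get("feature_type") != "transcript":
        return False
    if entry.get("transcript_biotype") != "protein_coding":
        return False
    return entry.get("effect") not in EXCLUDED_EFFECTS


# Made-up example value with three annotations for one alternate allele.
example_ann = (
    "G|missense_variant|MODERATE|GENE1|ID1|transcript|TX1|protein_coding|2/5,"
    "G|upstream_gene_variant|MODIFIER|GENE2|ID2|transcript|TX2|protein_coding|,"
    "G|intron_variant|MODIFIER|GENE3|ID3|transcript|TX3|lncRNA|1/2"
)
kept = [e for e in parse_ann(example_ann) if keep_for_worst_effects(e)]
```

In this made-up example only the GENE1 entry remains: the second entry is an upstream gene variant and the third is not protein coding.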
dx run vcf-loader \
-i vcf_manifest=file-xxxx \
-i is_sample_partitioned=false \
-i database_name=<my_favorite_db> \
-i etl_spec_id=genomics-phenotype \
-i create_mode=strict \
-i insert_mode=append \