
# VCF Loader

{% hint style="info" %}
A license is required to access Spark functionality on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Overview

VCF Loader ingests Variant Call Format (VCF) files into a database. The input VCF files are loaded into a Parquet-format database that can be queried using Spark SQL.

The input VCF for every run can be a single VCF file or many VCF files, but the merged input must represent a single logical VCF file. In the many-file case, the logical VCF may be partitioned by chromosome, genomic region, sample, or a combination of these strategies. Every input VCF file must be a syntactically correct, sorted VCF file. If the files are partitioned by sample, set `is_sample_partitioned` to `true`. If you run VCF Loader through another workflow, confirm the partitioning rules supported by that workflow because they can be more restrictive than the standalone app.

### VCF Preprocessing

Although VCF data can be loaded into Apollo databases after the variant call step, the dataset may not be normalized for downstream analyses across large cohorts. In that case, we recommend you complete preprocessing before loading so the data is harmonized for downstream use. To learn more, see [VCF Preprocessing](/user/spark/vcf-preprocessing.md).

## How to Run VCF Loader

Input:

* `vcf_manifest`: (file) a text file containing a list of file IDs of the VCF files to load (one per line). The referenced file names must be distinct and end in `.vcf.gz`. If more than one file is specified, then the complete VCF file to load is considered to be partitioned and every specified partition must be a valid VCF file. After the partition-merge step in preprocessing, the complete VCF file must still be valid. If the partitions split the sample set for the same loci, set `is_sample_partitioned` to `true`.
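The naming rule above (distinct names, each ending in `.vcf.gz`) can be checked before the manifest is built. The following is a minimal sketch, not part of the VCF Loader app: `check_vcf_names` is a hypothetical helper, and it assumes you already have the file names of the referenced VCF files, one per line.

```shell
# Hypothetical pre-flight check: reads file names (one per line) on stdin and
# fails if any name lacks the .vcf.gz suffix or appears more than once.
# Empty input is treated as invalid.
check_vcf_names() {
  local names bad dupes
  names=$(cat)
  # Names that do NOT end in .vcf.gz (grep exits non-zero when none are found).
  bad=$(printf '%s\n' "$names" | grep -v '\.vcf\.gz$' || true)
  if [ -n "$bad" ]; then
    echo "names not ending in .vcf.gz:" >&2
    printf '%s\n' "$bad" >&2
    return 1
  fi
  # Names that occur more than once.
  dupes=$(printf '%s\n' "$names" | sort | uniq -d)
  if [ -n "$dupes" ]; then
    echo "duplicate names:" >&2
    printf '%s\n' "$dupes" >&2
    return 1
  fi
}

# Example with two illustrative names.
printf '%s\n' chr1.vcf.gz chr2.vcf.gz | check_vcf_names && echo "names OK"
```

Once the names pass, the manifest itself (the corresponding file IDs, one per line) can be uploaded to the project with `dx upload`.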

Required Parameters:

* `database_name`: (string) name of the database into which to load the VCF files.
* `create_mode`: (string) `strict` mode creates the database and tables from scratch, and `optimistic` mode creates the database and tables only if they do not already exist.
* `insert_mode`: (string) `append` appends data to the end of tables and `overwrite` is equivalent to truncating the tables and then appending to them.
* `run_mode`: (string) `site` mode processes only the site-specific data, `genotype` mode processes genotype-specific data and other non-site-specific data, and `all` mode processes both types of data.
* `etl_spec_id`: (string) Only the `genomics-phenotype` schema choice is supported.
* `is_sample_partitioned`: (boolean) set to `true` when the input files are partitioned across different sample subsets for the same logical VCF. Leave this as `false` when the files are partitioned only by chromosome or genomic region.
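Because the three mode parameters accept only the values listed above, a wrapper script can reject a typo before submitting a job. This is a sketch under that assumption; `validate_modes` is a hypothetical helper, not something the app provides.

```shell
# Hypothetical pre-flight check. The allowed values are copied from the
# parameter list above; arguments are create_mode, insert_mode, run_mode.
validate_modes() {
  local create_mode=$1 insert_mode=$2 run_mode=$3
  case "$create_mode" in
    strict|optimistic) ;;
    *) echo "invalid create_mode: $create_mode" >&2; return 1 ;;
  esac
  case "$insert_mode" in
    append|overwrite) ;;
    *) echo "invalid insert_mode: $insert_mode" >&2; return 1 ;;
  esac
  case "$run_mode" in
    site|genotype|all) ;;
    *) echo "invalid run_mode: $run_mode" >&2; return 1 ;;
  esac
}

validate_modes optimistic append all && echo "modes OK"
```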

Other Options:

* `snpeff`: (boolean) default `true` -- whether to include the SnpEff annotation step in preprocessing with `INFO/ANN` tags. If you want SnpEff annotations in the database, either pre-annotate the raw VCF or include this SnpEff annotation step; it is not necessary to do both.
* `snpeff_human_genome`: (string) default `GRCh38.92` -- ID of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
* `snpeff_opt_no_upstream`: (boolean) default `true` -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's `-no-upstream` option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
* `snpeff_opt_no_downstream`: (boolean) default `true` -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's `-no-downstream` option). This option does not filter pre-calculated annotations outside of the SnpEff annotation step.
* `calculate_worst_effects`: (boolean) default `true` -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst-effects for every alternate-allele--gene combination as `INFO/ANN_WORST` tags (Number "A"). This option automatically filters SnpEff annotations to exclude `feature_type!=transcript`, `transcript_biotype!=protein_coding`, `effect=upstream_gene_variant` and `effect=downstream_gene_variant`.
* `calculate_locus_frequencies`: (boolean) default `true` -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
* `snpsift`: (boolean) default `true` -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the `INFO/RSID` tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
* `num_init_partitions`: (integer) Number of partitions for the initial VCF-line Spark RDD.
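For a raw VCF that was already annotated before loading, the SnpEff and SnpSift steps described above are not needed. The sketch below collects the corresponding overrides in one place; `preannotated_args` is a hypothetical helper, and whether the other annotation options should also change depends on how your VCF was annotated.

```shell
# Hypothetical helper: prints the `-i` overrides that switch off the
# annotation steps for a VCF annotated before loading. Only option names
# from the list above are used.
preannotated_args() {
  # snpeff=false:  INFO/ANN tags are assumed to be present already.
  # snpsift=false: INFO/RSID tags are assumed to be present already.
  printf '%s\n' '-i snpeff=false' '-i snpsift=false'
}

# Review the overrides before adding them to a `dx run vcf-loader` command.
preannotated_args
```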

### Basic Run

```shell
dx run vcf-loader \
   -i vcf_manifest=file-xxxx \
   -i is_sample_partitioned=false \
   -i database_name=my_favorite_db \
   -i etl_spec_id=genomics-phenotype \
   -i create_mode=strict \
   -i insert_mode=append \
   -i run_mode=genotype
```
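As a variation on the basic run, the sketch below prints (rather than executes) a command for sample-partitioned input that is appended to a database which may already exist. `vcf_loader_cmd` is a hypothetical dry-run helper, and the choice of `optimistic`, `append`, and `all` is one illustrative combination of the documented values, not a recommendation.

```shell
# Hypothetical dry-run helper: prints a dx run command for sample-partitioned
# input instead of executing it. Arguments: manifest file ID, database name.
vcf_loader_cmd() {
  local manifest_id=$1 db_name=$2
  printf 'dx run vcf-loader -i vcf_manifest=%s -i is_sample_partitioned=true -i database_name=%s -i etl_spec_id=genomics-phenotype -i create_mode=optimistic -i insert_mode=append -i run_mode=all\n' \
    "$manifest_id" "$db_name"
}

# file-xxxx and my_favorite_db are placeholders, as in the basic run.
vcf_loader_cmd file-xxxx my_favorite_db
```

Printing the command first makes it easy to review the parameter values before pasting the line into a terminal.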

