# VCF Preprocessing

## Overview

You may need to preprocess, or harmonize, your VCF data before loading it.

## Harmonizing Data

* The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
* [GLnexus](https://github.com/dnanexus-rnd/GLnexus) is used to harmonize sites across all gVCFs and generate a single pVCF file containing all harmonized sites and all genotypes for all samples.

![Apollo GLnexus](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-48d9df553567bb963aadf23598b696e33113b0dc%2Fapollo-glnexus.png?alt=media)
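Before harmonizing, it can help to confirm that the inputs match the one-file-per-sample expectation described above. The following is a minimal sketch, not part of GLnexus or the `dx` toolkit: the function names `vcf_samples` and `check_one_sample_per_gvcf` are hypothetical, and the check assumes local copies of the gVCF files (plain text or gzip/bgzip-compressed) with a standard `#CHROM` header line.

```shell
# List the sample names in one VCF/gVCF (plain or gzip/bgzip-compressed),
# one per line. Sample names start at column 10 of the #CHROM header line.
vcf_samples() {
    gzip -cdf -- "$1" | awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Return 0 only if every file holds exactly one sample and no sample name
# appears in more than one file. Problems are reported on stderr.
check_one_sample_per_gvcf() {
    local f n dups status=0
    for f in "$@"; do
        n=$(vcf_samples "$f" | awk 'END { print NR }')
        if [ "$n" -ne 1 ]; then
            echo "ERROR: $f has $n samples, expected 1" >&2
            status=1
        fi
    done
    # Any name printed by uniq -d occurs more than once across the inputs.
    dups=$(for f in "$@"; do vcf_samples "$f"; done | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "ERROR: sample names found more than once: $dups" >&2
        status=1
    fi
    return "$status"
}
```

For example, `check_one_sample_per_gvcf cohort/*.g.vcf.gz` (a hypothetical directory) returns a non-zero status if any file is multi-sample or if two files share a sample name.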

### Basic Run

```shell
dx run app-glnexus \
    -i common.gvcf_manifest=<manifest_file_id> \
    -i common.config=gatk_unfiltered \
    -i common.targets_bed=<bed_target_ranges>
```

### Advanced Run

```shell
dx run workflow-glnexus \
    -i common.gvcf_manifest=<manifest_file_id> \
    -i common.config=gatk_unfiltered \
    -i common.targets_bed=<bed_target_ranges> \
    -i unify.shards_bed=<bed_genomic_partition_ranges> \
    -i etl.shards=<num_sample_partitions>
```
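The `unify.shards_bed` input above takes a BED file of genomic partition ranges. As an illustration only, the sketch below derives such ranges by cutting each contig in a reference `.fai` index into fixed-size windows. It assumes the workflow accepts a plain three-column BED (contig, 0-based start, end), which this page does not confirm. The function name `make_shards_bed` and the window size are hypothetical.

```shell
# Split each contig listed in a FASTA .fai index into fixed-size BED ranges.
# Usage: make_shards_bed <reference.fa.fai> <shard_size_bp>
make_shards_bed() {
    awk -F'\t' -v size="$2" '
        {
            # $1 = contig name, $2 = contig length.
            # BED coordinates are 0-based and half-open.
            for (start = 0; start < $2; start += size) {
                end = start + size
                if (end > $2) end = $2   # clip the last window to the contig end
                printf "%s\t%d\t%d\n", $1, start, end
            }
        }' "$1"
}
```

For instance, `make_shards_bed reference.fa.fai 50000000 > shards.bed` would write 50 Mb windows. The resulting file could then be uploaded with `dx upload` and its file ID passed as `<bed_genomic_partition_ranges>`.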

{% hint style="success" %}
To learn more about GLnexus, see [GLnexus](https://github.com/dnanexus-rnd/GLnexus) or [Getting started with GLnexus](https://github.com/dnanexus-rnd/GLnexus/wiki/Getting-Started).
{% endhint %}

## Annotating Variants

VCF files can include variant annotations. SnpEff annotations provided as `INFO/ANN` tags are loaded into the database. You can annotate the harmonized pVCF yourself by running any standard SnpEff annotator before loading it. For large pVCFs, rely on the internal annotation step in the VCF Loader instead of generating an annotated intermediate file. The VCF Loader performs annotation in a distributed, massively parallel process.

The VCF Loader does not persist the intermediate, annotated pVCF as a file. If you want to have access to the annotated file up front, you should annotate it yourself.
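If you annotate the pVCF yourself, the steps might look like the sketch below. The function names `annotate_with_snpeff` and `vcf_has_ann` are hypothetical helpers, and the path to `snpEff.jar`, the genome database name, and the memory setting are placeholders you would replace with your own. The second function checks that the VCF header declares an `INFO/ANN` field, which is a reasonable sanity check before loading but not a requirement stated by the VCF Loader.

```shell
# Annotate a VCF with SnpEff and write the result to a new file.
# Usage: annotate_with_snpeff <path/to/snpEff.jar> <genome_db> <in.vcf> <out.vcf>
# The jar path and genome database name are placeholders for your own setup.
annotate_with_snpeff() {
    java -Xmx8g -jar "$1" "$2" "$3" > "$4"
}

# Return 0 if the VCF header (plain or gzip/bgzip-compressed) declares an
# INFO/ANN field, 1 otherwise. Only the header is inspected.
vcf_has_ann() {
    gzip -cdf -- "$1" | awk '
        /^##INFO=<ID=ANN,/ { found = 1 }
        !/^#/ { exit }              # stop at the first data record
        END { exit !found }'
}
```

A possible usage: `annotate_with_snpeff snpEff.jar <genome_db> cohort.pvcf.vcf cohort.ann.vcf && vcf_has_ann cohort.ann.vcf`, where the file names are illustrative.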

![Annotation flow](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-305d2823866fbea405afe6b85c115530fcd83552%2Fapollo-annotation-flows.png?alt=media)

VCF annotation flows. In (a), the annotation step is external to the VCF Loader, whereas in (b), the annotation step is internal. In either case, the VCF Loader loads SnpEff annotations present as `INFO/ANN` tags into the database.

