Using Hail to Analyze Genomic Data

Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is a DNAnexus app that lets you run Hail in JupyterLab notebooks.

A license is required to access the functionality described on this page. Contact DNAnexus Sales for more information.

Launching JupyterLab with Hail

To launch a JupyterLab notebook with Hail on the platform, follow the same instructions for launching the DXJupyterLab Spark Cluster app while also selecting either:

feature=HAIL-0.2.78, where the Hail Java library is pre-installed on all cluster nodes
feature=HAIL-0.2.78-VEP-1.0.3, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes.

If feature=HAIL-0.2.78-VEP-1.0.3 is selected, the JupyterLab notebook takes a few minutes longer to start than with feature=HAIL-0.2.78, because both Hail and VEP are installed on all cluster nodes at start-up. Only VEP version 103 for GRCh38 is available on DNAnexus, and custom annotations are not available.

The instance type and number of nodes determine the compute and memory capacity of the Spark cluster. The default settings are generally sufficient for casual exploration of the data. If you are running simple queries on a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need a larger instance type. To increase parallelization and reduce processing time, you may need more nodes.
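
The launch options described above correspond to flags on the dx run command. As a minimal sketch, the helper below assembles such a command as a string. The function name is illustrative and not part of any DNAnexus tool; the flags mirror the example commands shown later on this page.

```python
import shlex


def build_launch_command(instance_type, num_nodes, feature="HAIL-0.2.78"):
    """Assemble a dx run command for the DXJupyterLab Spark Cluster app.

    Illustrative helper only: it builds the command string, it does not run it.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    args = [
        "dx", "run", "dxjupyterlab_spark_cluster",
        f"--instance-type={instance_type}",
        f"--instance-count={num_nodes}",
        f"-ifeature={feature}",
    ]
    # Quote each argument so the string is safe to paste into a shell
    return " ".join(shlex.quote(arg) for arg in args)


print(build_launch_command("mem1_ssd1_v2_x4", 2))
print(build_launch_command("mem1_ssd1_v2_x8", 2, feature="HAIL-0.2.78-VEP-1.0.3"))
```

Keeping the instance type and node count as parameters makes it easy to compare launch specs for small and large datasets.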

Example Notebooks

The following sections provide a guide for common Hail operations. Each section includes:

A description of the operation
Launch specs for different datasets
- Instance type
- Number of nodes
- Example dx run command
- Approximate wall time taken to run the notebook (these values may vary in practice)
A link to an example notebook in OpenBio that can be used as a starting point

The notebooks are as follows:

Import pVCF genomic data into a Hail MatrixTable (MT)
Import BGEN genomic data into a Hail MT
Filter by chr and pos
Filter by variant IDs
Filter by sample IDs
Replace sample IDs
Annotate genomic data using Hail VEP
Annotate genomic data using Hail DB
Pre-GWAS QC: locus QC
Pre-GWAS QC: sample QC
GWAS
GWAS visualizations
Annotate GWAS results using Hail VEP
Annotate GWAS results using Hail DB
Export genomic data as BGEN

The information in these sections can help you choose launch specs and know roughly what to expect when working with your own data.

These metrics are specific to the data used. Treat them as estimates; they may not scale linearly with data size.

Import pVCF genomic data into a Hail MatrixTable (MT)

The pVCF import notebook shows how to import genomic data from VCF with multiple samples, termed project-VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: 1 file in total, 1 file per chromosome, and multiple files per chromosome.

| Data Used | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes for Spark config |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | 35 min | N/A |
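
The import step can be sketched as follows. This is not the notebook's code: the file names, the directory under /mnt/project, the GRCh38 reference, and the use of block-gzipped input are all assumptions, and mt_url stands for whatever storage location you write the MatrixTable to.

```python
def vcf_inputs(layout, vcf_dir="file:///mnt/project/vcf"):
    """Map the three input layouts to path patterns.

    The file names are hypothetical; adjust them to match your project.
    """
    patterns = {
        "single": [f"{vcf_dir}/cohort.vcf.gz"],
        "per_chromosome": [f"{vcf_dir}/chr*.vcf.gz"],
        "multi_per_chromosome": [f"{vcf_dir}/chr*_b*.vcf.gz"],
    }
    return patterns[layout]


def import_pvcf(paths, mt_url, reference_genome="GRCh38"):
    """Import one or more pVCF files and write them out as a MatrixTable."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    # force_bgz treats .gz files as block-gzipped so they can be read in parallel
    mt = hl.import_vcf(paths, force_bgz=True, reference_genome=reference_genome)
    mt.write(mt_url)
    # Read the MT back from storage so later steps use the written copy
    return hl.read_matrix_table(mt_url)


print(vcf_inputs("per_chromosome"))
```

Hail accepts either a single path or a list of paths, and glob patterns are expanded, so the same import call covers all three layouts.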

Import BGEN genomic data into a Hail MatrixTable (MT)

The BGEN import notebook shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: 1 file in total, 1 file per chromosome, and multiple files per chromosome.

When importing BGEN files with Hail:

A Hail-specific index file is required for each BGEN file.
Only biallelic variants are supported.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples Chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Creating index files: < 30 sec; import and write MT: < 30 sec | N/A |
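
A rough sketch of the index-then-import sequence is shown below. The paths, the index directory, the entry fields, and the GRCh38 reference are assumptions rather than values taken from the notebook. Index files are mapped to a separate writable location here on the assumption that the mounted project directory cannot be written to.

```python
import posixpath


def bgen_index_map(bgen_paths, index_dir):
    """Map each BGEN file to a Hail index location (must end in .idx2)."""
    return {
        path: posixpath.join(index_dir, posixpath.basename(path) + ".idx2")
        for path in bgen_paths
    }


def import_bgen_files(bgen_paths, sample_file, index_dir, mt_url,
                      reference_genome="GRCh38"):
    """Index the BGEN files, import them, and write a MatrixTable."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    index_map = bgen_index_map(bgen_paths, index_dir)
    # Create the Hail-specific index file for each BGEN file
    hl.index_bgen(bgen_paths, index_file_map=index_map,
                  reference_genome=reference_genome)
    mt = hl.import_bgen(bgen_paths, entry_fields=["GT", "GP"],
                        sample_file=sample_file, index_file_map=index_map)
    mt.write(mt_url)
    return hl.read_matrix_table(mt_url)


print(bgen_index_map(["file:///mnt/project/bgen/chr1.bgen"], "hdfs:///tmp/idx"))
```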

Filter by chromosome (chr) and position (pos)

The chromosome and position filtering notebook shows how to retrieve a Hail MT from DNAnexus, filter chromosomes and positions, and store the results as a Hail Table in DNAnexus.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
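
One way to express a chromosome and position filter in Hail is with locus intervals. The sketch below assumes a GRCh38 dataset and inclusive region bounds; the helper names are illustrative and the notebook may use a different approach.

```python
def interval_string(contig, start, end):
    """Format an inclusive interval string such as [chr1:100-200].

    Without the brackets Hail treats the end position as exclusive.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return f"[{contig}:{start}-{end}]"


def filter_to_regions(mt, regions, reference_genome="GRCh38"):
    """Keep only the rows whose locus falls inside one of the regions.

    regions is a list of (contig, start, end) tuples.
    """
    import hail as hl  # imported lazily; assumes hl.init() was already called

    intervals = [
        hl.parse_locus_interval(interval_string(contig, start, end),
                                reference_genome=reference_genome)
        for contig, start, end in regions
    ]
    return hl.filter_intervals(mt, intervals, keep=True)


print(interval_string("chr1", 100, 200))
```

Filtering by interval lets Hail skip partitions that cannot contain matching loci, which is usually cheaper than evaluating a row expression over the whole dataset.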

Filter by variant IDs

The variant ID filtering notebook shows how to retrieve a Hail MT from DNAnexus, filter variant IDs, and then store results as a Hail Table in DNAnexus.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Filter by sample IDs

The sample ID filtering notebook shows how to retrieve a Hail MT from DNAnexus, filter sample IDs, and then store results as a Hail Table in DNAnexus.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Replace sample IDs

The sample ID replacement notebook shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store results as a Hail Table in DNAnexus.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
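
The replacement step can be sketched as below. It assumes the mapping is a headerless two-column TSV (old ID, new ID), that the mapping is not empty, and that the MT's column key is the field s, which is Hail's default when importing VCF. None of these details are confirmed by the notebook.

```python
import csv


def load_id_mapping(tsv_path):
    """Read old_id<TAB>new_id pairs from a headerless TSV file."""
    with open(tsv_path, newline="") as handle:
        return {row[0]: row[1]
                for row in csv.reader(handle, delimiter="\t") if row}


def replace_sample_ids(mt, mapping):
    """Rename samples using the mapping; unmapped IDs are left unchanged."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    lookup = hl.literal(mapping)
    # A key field cannot be overwritten, so drop the column key first
    mt = mt.key_cols_by()
    mt = mt.annotate_cols(s=lookup.get(mt.s, mt.s))
    return mt.key_cols_by("s")
```

Falling back to the original ID for unmapped samples keeps the column key free of missing values.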

Annotate genomic data using Hail VEP

The VEP annotation notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail's Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. Selecting feature=HAIL-0.2.78-VEP-1.0.3 when launching the DXJupyterLab Spark Cluster app pre-installs the VEP data files (version 103 GRCh38) on every node.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 6 min | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 46 min | N/A |

Annotate genomic data using Hail DB

The Hail DB annotation notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail's Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail's Annotation DB is a curated collection of variant annotations. Hail's annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region. They can nevertheless be accessed from a DNAnexus project in any region.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | < 5 min | N/A |
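
A minimal sketch of annotating rows from the Annotation DB follows. It assumes the datasets are read from the AWS copy in the US region, as described above, and the dataset name in the usage comment is only an example of a name listed in Hail's Annotation DB; the notebook may use other datasets.

```python
def annotate_with_db(mt, dataset_names):
    """Annotate the rows of an MT with datasets from Hail's Annotation DB."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    # Point the DB client at the AWS-hosted copy in the US region
    db = hl.experimental.DB(region="us", cloud="aws")
    # Each dataset becomes a new row field named after the dataset
    return db.annotate_rows_db(mt, *dataset_names)


# Example usage inside a notebook (dataset name chosen for illustration):
# annotated = annotate_with_db(mt, ["gnomad_lof_metrics"])
# annotated.rows().show(5)
```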

Pre-GWAS QC: locus QC

The locus QC notebook shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail's variant_qc() method, and then store results as a Hail Table in DNAnexus. The variant_qc() method computes variant statistics from the genotype data and creates a new field in the MT with this information.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 24 min | N/A |

Pre-GWAS QC: sample QC

The sample QC notebook shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail's sample_qc() method, and then store results as a Hail Table in DNAnexus. The sample_qc() method computes per-sample metrics and creates a new field in the MT with this information.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | 8 min | N/A |

GWAS

The GWAS notebook shows how to perform a genome-wide association study (GWAS) for one case–control trait using Firth logistic regression, and then save results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables for analysis preparation, and then creates a phenotype Hail Table containing the case–control trait. Hail's logistic_regression_rows() method is used to run the analysis.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | 1 hr 30 min | spark.hadoop.dnanexus.fs.output.upload.chunk.size: 67108864; spark.hadoop.dnanexus.fs.cache.filestatus.maxsize: 44330; spark.hadoop.dnanexus.fs.cache.filestatus.expiretime: 28800; spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize: 44330 |
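
The Spark settings noted for the larger dataset can be applied to the session builder before Hail is initialized, and the regression itself is a single call. The sketch below is an illustration under assumptions: the helper names are hypothetical, the phenotype is taken to be a column field on the MT, and the model uses only an intercept as covariate.

```python
# Spark settings copied from the notes for the 100K-sample GWAS run
GWAS_SPARK_CONF = {
    "spark.hadoop.dnanexus.fs.output.upload.chunk.size": "67108864",
    "spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": "44330",
    "spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": "28800",
    "spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": "44330",
}


def apply_spark_conf(builder, conf):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def run_firth_gwas(mt, phenotype_field):
    """Run Firth logistic regression per variant for one case-control trait."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    return hl.logistic_regression_rows(
        test="firth",
        y=mt[phenotype_field],       # assumed to be a column field on the MT
        x=mt.GT.n_alt_alleles(),     # additive genotype coding
        covariates=[1.0],            # the intercept must be given explicitly
    )
```

In a notebook, the builder would be configured first, for example apply_spark_conf(SparkSession.builder, GWAS_SPARK_CONF).enableHiveSupport(), followed by getOrCreate() and hl.init(sc=spark.sparkContext).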

GWAS visualizations

The GWAS visualization notebook shows how to retrieve a Table of GWAS results from DNAnexus and visualize them with Hail as a Q-Q plot and a Manhattan plot. The visualizations require the Python library bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Q-Q plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | Q-Q plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
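
To make the Q-Q plot concrete, the sketch below computes the plotted points in plain Python and then shows how the two Hail plots might be rendered. It assumes the GWAS results Table is keyed by locus and has a p_value field; the function names are illustrative.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot."""
    observed = sorted(-math.log10(p) for p in p_values if 0 < p <= 1)
    n = len(observed)
    # Under the null, the i-th smallest of n p-values is expected near i / (n + 1)
    expected = sorted(-math.log10((i + 1) / (n + 1)) for i in range(n))
    return list(zip(expected, observed))


def show_gwas_plots(gwas):
    """Render Q-Q and Manhattan plots in a notebook using Hail and bokeh."""
    import hail as hl  # imported lazily; assumes hl.init() was already called
    from bokeh.io import output_notebook, show

    output_notebook()
    show(hl.plot.qq(gwas.p_value))
    # The locus is taken from the Table key when it is not passed explicitly
    show(hl.plot.manhattan(gwas.p_value))


for expected, observed in qq_points([0.5, 0.05]):
    print(round(expected, 3), round(observed, 3))
```

Points that rise well above the diagonal at the upper end suggest more small p-values than expected by chance.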

Annotate GWAS results using Hail VEP

The GWAS VEP annotation notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail's Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. Selecting feature=HAIL-0.2.78-VEP-1.0.3 when launching the DXJupyterLab Spark Cluster app pre-installs the VEP data files (version 103 GRCh38) on every node.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 8 min | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 9 min | N/A |

Annotate GWAS results using Hail DB

The GWAS Hail DB annotation notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail's Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail's Annotation DB is a curated collection of variant annotations. Hail's annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region. They can nevertheless be accessed from a DNAnexus project in any region.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Export genomic data as BGEN

The BGEN export notebook shows how to export a Hail MT of genomic data as a single BGEN file for each chromosome in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis uses REGENIE. An additional step recommended for REGENIE, not shown in the notebook, is to create a .bgi index file for each BGEN file. The .bgi files can be created with bgenix, a tool that is part of the BGEN library.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES 100 samples 10K variants Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES 100K samples 1.4M variants Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 33 min | N/A |
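
A per-chromosome export might look like the sketch below. It assumes the MT already has a GP entry field holding genotype probabilities and that output_dir is a writable location; both are assumptions, and the notebook may prepare the data differently.

```python
def bgen_output_root(output_dir, contig):
    """Return the output root for one chromosome.

    Hail appends .bgen and .sample to this root when exporting.
    """
    return f"{output_dir.rstrip('/')}/{contig}"


def export_bgen_per_chromosome(mt, output_dir):
    """Write one BGEN file per chromosome found in the MT."""
    import hail as hl  # imported lazily; assumes hl.init() was already called

    # Collect the distinct contigs present in the dataset
    contigs = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))
    for contig in sorted(contigs):
        per_chrom = mt.filter_rows(mt.locus.contig == contig)
        hl.export_bgen(per_chrom, bgen_output_root(output_dir, contig),
                       gp=per_chrom.GP)


print(bgen_output_root("hdfs:///tmp/bgen/", "chr1"))
```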

General Advice

When accessing input data files in the project for use with Hail, it is recommended to read them through /mnt/project. This requires that the input data files are uploaded to the project before the DXJupyterLab Spark Cluster app is started. All example notebooks that use input data files show how to access mounted input files from the project.
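
As a small illustration, a project path can be translated to its location under the mount. The sketch assumes Hail is given a file:// URL so that the path is read from the local mount rather than from the cluster's default file system; the helper name and the example path are hypothetical.

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/mnt/project")  # project files as seen from the cluster


def mounted_file_url(project_path, mount_root=MOUNT_ROOT):
    """Return a file:// URL for a project file as seen through the mount."""
    relative = str(project_path).lstrip("/")
    return "file://" + str(PurePosixPath(mount_root) / relative)


print(mounted_file_url("/data/vcf/chr1.vcf.gz"))
```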

Guidance on scaling with Hail

Check Hail's official general advice for running Hail on the cloud.
The Hail functions describe() and n_partitions() report information about MatrixTables and Tables without being computationally expensive, and they execute immediately. Functions such as count() may be computationally expensive and take longer to run, especially when the data is large.
It is recommended to use fewer nodes when running certain Hail operations, especially import into MT and annotation with VEP. This is particularly important when the Spark UI shows nodes sitting idle while a task is running.
The data in a MT or Table are divided into chunks called partitions, where each partition can be read and processed in parallel by the available cores (see Spark's RDD Programming Guide). When considering how the data is partitioned for Hail operations:
- More partitions means more parallelization
- Fewer partitions means each partition is larger, so more memory is required to process it
- There should be fewer than 100K partitions
When considering memory management for Hail operations:
- On the DNAnexus Platform, the Spark cluster is allocated roughly 70% of the available memory of the selected instance type by default. For details, see Developing Spark Apps.
- To check memory usage and availability, see the Executors tab in the Spark UI
- The easiest way to manage memory availability is to choose a different instance type
- Another method for memory management is to use CPU reservation by fine-tuning Spark configurations. The following example shows how to fine-tune Spark configurations within the notebook before initializing Hail:

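from pyspark.sql import SparkSession
import hail as hl

# Set executor cores and memory before the Spark session is created
builder = (
    SparkSession
    .builder
    .config("spark.executor.cores", 2)
    .config("spark.executor.memory", "18g")
    .enableHiveSupport()
)

# Initialize Hail on top of the configured Spark session
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)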
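
The partition and memory guidance above can be turned into a quick check. This is a sketch only: the helper names and the wording of the advice are illustrative, the 100K limit and the 70% figure come from the guidance above, and total_cores is whatever your cluster provides.

```python
def partition_advice(n_partitions, total_cores, max_partitions=100_000):
    """Give rough advice on a partition count for a cluster of a given size."""
    if n_partitions >= max_partitions:
        return "too many partitions: consider reducing the partition count"
    if n_partitions < total_cores:
        return "fewer partitions than cores: some cores will sit idle"
    return "ok"


def approx_spark_memory_gb(instance_memory_gb, fraction=0.7):
    """Estimate the memory given to Spark by default (roughly 70%)."""
    return instance_memory_gb * fraction


def inspect_dataset(mt, total_cores):
    """Print the schema and check the partition count without a full scan."""
    mt.describe()  # cheap: prints the schema only
    return partition_advice(mt.n_partitions(), total_cores)


print(partition_advice(64, 16))
print(approx_spark_memory_gb(32))
```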

The metrics presented for the notebooks can be used as a guide for extrapolating or interpolating run time and cost, but remember that most systems do not scale linearly.

