Using Hail to Analyze Genomic Data

Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is an app provided by DNAnexus that lets you run Hail within JupyterLab notebooks.

A license is required to access the functionality described on this page. Contact DNAnexus Sales for more information.

Launching JupyterLab with Hail

To launch a JupyterLab notebook with Hail on the platform, follow the same instructions for launching the DXJupyterLab Spark Cluster app while also selecting either:

  • feature=HAIL-0.2.78, where the Hail Java library is pre-installed on all cluster nodes

  • feature=HAIL-0.2.78-VEP-1.0.3, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes.

If feature=HAIL-0.2.78-VEP-1.0.3 is selected, the JupyterLab notebook takes a few minutes longer to start than with feature=HAIL-0.2.78, because both Hail and VEP are installed on all cluster nodes at start-up. Note that only VEP version 103 for GRCh38 is available on DNAnexus, and custom annotations are not available.

The instance type and number of nodes selected will affect how powerful the Spark cluster will be. Generally, the default settings allow for casual interrogation of the data. If you are running simple queries with a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
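The example commands on this page all follow one pattern, so the launch settings can be captured in a small helper. The sketch below is only an illustration: build_dx_run_command is a hypothetical helper name, and it merely assembles the same dx run command line used in the examples on this page. It does not call any DNAnexus API.

```python
import shlex


def build_dx_run_command(instance_type, instance_count, feature="HAIL-0.2.78"):
    """Return the dx run command line for the DXJupyterLab Spark Cluster app."""
    args = [
        "dx", "run", "dxjupyterlab_spark_cluster",
        f"--instance-type={instance_type}",    # size of each node
        f"--instance-count={instance_count}",  # number of nodes in the cluster
        f"-ifeature={feature}",                # Hail only, or Hail with VEP
    ]
    return shlex.join(args)


# A larger cluster with VEP pre-installed on every node.
print(build_dx_run_command("mem1_ssd1_v2_x8", 5, feature="HAIL-0.2.78-VEP-1.0.3"))
```

Changing only the instance type and count in one place makes it easier to compare launch specs for the same notebook.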

Example Notebooks

The following sections provide a guide for common Hail operations. Each section includes:

  • A description of the operation

  • Launch specs for different datasets

    • Instance type

    • Number of nodes

    • Example dx run command

    • Approximate wall time taken to run the notebook (these values may vary in practice)

  • A link to an example notebook in OpenBio that can be used as a starting point

The notebooks are as follows:

  • Import pVCF genomic data into a Hail MatrixTable (MT)

  • Import BGEN genomic data into a Hail MT

  • Filter by chr and pos

  • Filter by variant IDs

  • Filter by sample IDs

  • Replace sample IDs

  • Annotate genomic data using Hail VEP

  • Annotate genomic data using Hail DB

  • Pre-GWAS QC: locus QC

  • Pre-GWAS QC: sample QC

  • GWAS

  • GWAS visualizations

  • Annotate GWAS results using Hail VEP

  • Annotate GWAS results using Hail DB

  • Export genomic data as BGEN

The information in these sections can help you choose suitable launch specs and know what to expect, in general, when working with your data.

These metrics are specific to the data used. They should be interpreted as estimates and may not scale linearly with the data size.

Import pVCF genomic data into a Hail MatrixTable (MT)

This notebook shows how to import genomic data from a multi-sample VCF, termed a project VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome.
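As a rough sketch of the import step, the three input layouts map onto the path argument of Hail's import_vcf(), which accepts a single path, a list of paths, or a glob pattern. The helper names, the file:///mnt/project paths, the force_bgz=True setting, and the GRCh38 reference genome are illustrative assumptions, not values taken from the example notebook; mt_url stands for whatever storage location your notebook writes to.

```python
def per_chromosome_paths(template, chromosomes):
    """Expand a path template containing '{chrom}' into one path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def import_pvcf(paths, mt_url, reference_genome="GRCh38"):
    """Import one or more pVCF files into a MatrixTable and write it to mt_url."""
    import hail as hl  # imported here so the path helper above works without Hail

    mt = hl.import_vcf(
        paths,                 # a single path, a list of paths, or a glob pattern
        force_bgz=True,        # assumption: the .vcf.gz inputs are block-gzipped
        reference_genome=reference_genome,
    )
    mt.write(mt_url)
    return hl.read_matrix_table(mt_url)


# Hypothetical inputs for the three layouts:
one_file = "file:///mnt/project/vcf/all.vcf.gz"
one_per_chrom = per_chromosome_paths("file:///mnt/project/vcf/chr{chrom}.vcf.gz", range(1, 23))
many_per_chrom = "file:///mnt/project/vcf/chr1_*.vcf.gz"
```

Any of the three values can be passed as paths to import_pvcf().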

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | 35 min | N/A |

Import BGEN genomic data into a Hail MatrixTable (MT)

This notebook shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome. When importing BGEN files with Hail, note that:

  • a Hail-specific index file is required for each BGEN file

  • only biallelic variants are supported
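The indexing requirement can be sketched as follows: Hail's index_bgen() creates the Hail-specific index, and import_bgen() then reads the data using the same index locations. The helper names, the example paths, the choice of GT and GP entry fields, and the GRCh38 reference genome are assumptions made for illustration. Passing an index_file_map is one way to keep the index files outside the folder that holds the BGEN files.

```python
import posixpath


def bgen_index_map(bgen_paths, index_dir):
    """Map each BGEN path to an index path ending in .idx2 under index_dir."""
    return {
        path: posixpath.join(index_dir, posixpath.basename(path) + ".idx2")
        for path in bgen_paths
    }


def import_bgen_to_mt(bgen_paths, sample_path, index_dir, mt_url,
                      reference_genome="GRCh38"):
    """Index the BGEN files, import them into a MatrixTable, and write it."""
    import hail as hl  # imported here so bgen_index_map works without Hail

    index_map = bgen_index_map(bgen_paths, index_dir)
    # Step 1: create one Hail-specific index per BGEN file.
    hl.index_bgen(bgen_paths, index_file_map=index_map,
                  reference_genome=reference_genome)
    # Step 2: import using the same index locations.
    mt = hl.import_bgen(bgen_paths, entry_fields=["GT", "GP"],
                        sample_file=sample_path, index_file_map=index_map)
    mt.write(mt_url)
    return mt
```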

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; Chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Creating index files: < 30 sec; import and write MT: < 30 sec | N/A |

Filter by chromosome (chr) and position (pos)

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by chromosome and position, and store the results as a Hail Table in DNAnexus.
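One way to express a chromosome and position filter in Hail is with locus intervals, sketched below. The helper names and the GRCh38 reference genome are assumptions; the square brackets make both interval ends inclusive, whereas Hail's default for an interval string is left-inclusive and right-exclusive.

```python
def locus_interval(chrom, start, end):
    """Format a closed interval string such as '[chr1:100-200]'."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"[{chrom}:{start}-{end}]"


def filter_by_regions(mt, regions, reference_genome="GRCh38"):
    """Keep only the rows of mt whose locus falls in one of the regions.

    regions: iterable of (chrom, start, end) tuples.
    """
    import hail as hl  # imported here so locus_interval works without Hail

    intervals = [
        hl.parse_locus_interval(locus_interval(*region),
                                reference_genome=reference_genome)
        for region in regions
    ]
    return hl.filter_intervals(mt, intervals)
```

For example, filter_by_regions(mt, [("chr1", 100000, 200000)]) returns an MT restricted to that region; calling rows() on the result gives a Hail Table of the remaining variants.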

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |

Filter by variant IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by variant IDs, and then store results as a Hail Table in DNAnexus.
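A minimal sketch of a variant ID filter is shown below. It assumes the variant IDs are stored in the rsid row field, which is where import_vcf() places the VCF ID column; your MT may hold its IDs in a different field. The helper names are hypothetical.

```python
def read_ids(lines):
    """Return the unique, non-empty IDs from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_by_variant_ids(mt, variant_ids):
    """Keep only the rows of mt whose rsid is in variant_ids."""
    import hail as hl  # imported here so read_ids works without Hail

    # An explicit type lets the literal be built even when the set is empty.
    wanted = hl.literal(set(variant_ids), dtype=hl.tset(hl.tstr))
    return mt.filter_rows(wanted.contains(mt.rsid))
```

read_ids() can be applied to an open text file with one ID per line, and its result passed directly to filter_by_variant_ids().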

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Filter by sample IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by sample IDs, and then store results as a Hail Table in DNAnexus.
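The sample filter can be sketched in the same spirit. The code assumes the MT is keyed by the column field s, which is the default for MTs created by import_vcf() and import_bgen(); the helper names are hypothetical. Checking for requested IDs that are absent from the MT before filtering is an optional extra, not something the notebook is known to do.

```python
def missing_ids(requested, present):
    """Return the requested IDs that do not occur in present, sorted."""
    return sorted(set(requested) - set(present))


def filter_by_sample_ids(mt, sample_ids, keep=True):
    """Keep (or, with keep=False, remove) the columns whose s is in sample_ids."""
    import hail as hl  # imported here so missing_ids works without Hail

    absent = missing_ids(sample_ids, mt.s.collect())
    if absent:
        print(f"{len(absent)} requested sample IDs are not in the MT")
    wanted = hl.literal(set(sample_ids), dtype=hl.tset(hl.tstr))
    return mt.filter_cols(wanted.contains(mt.s), keep=keep)
```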

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Replace sample IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store results as a Hail Table in DNAnexus.
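A possible shape for the ID replacement is sketched below. It assumes the mapping is a two-column, tab-separated list of old and new IDs and that the MT is keyed by the column field s; both are assumptions, as are the helper names. Because a key field cannot be overwritten directly, the sketch removes the column key, rewrites s, and then restores the key.

```python
def parse_id_mapping(lines, sep="\t"):
    """Parse 'old_id<sep>new_id' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        old_id, new_id = line.split(sep)
        mapping[old_id] = new_id
    return mapping


def replace_sample_ids(mt, mapping):
    """Return mt with each sample ID replaced according to mapping."""
    import hail as hl  # imported here so parse_id_mapping works without Hail

    lookup = hl.literal(mapping, dtype=hl.tdict(hl.tstr, hl.tstr))
    mt = mt.key_cols_by()  # remove the column key so that s can be rewritten
    # Samples without an entry in the mapping keep their original ID.
    mt = mt.annotate_cols(s=lookup.get(mt.s, mt.s))
    return mt.key_cols_by("s")
```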

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Annotate genomic data using Hail VEP

This notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. When feature=HAIL-0.2.78-VEP-1.0.3 is selected at launch of the DXJupyterLab Spark Cluster app, VEP data files (version 103, GRCh38) are pre-installed on every node.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 6 min | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 46 min | N/A |

Annotate genomic data using Hail DB

This notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region; however, they may be accessed via a DNAnexus project of any region.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | < 5 min | N/A |

Pre-GWAS QC: locus QC

This notebook shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail’s variant_qc() method, and then store results as a Hail Table in DNAnexus. The variant_qc() method computes variant statistics from the genotype data and creates a new field in the MT with this information.
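To make the kind of statistics concrete, the plain-Python sketch below works through call rate, allele number, allele count, and allele frequency for a single biallelic variant. It is a simplified illustration with a hypothetical function name and diploid calls assumed, not Hail's implementation. In Hail itself the corresponding step is mt = hl.variant_qc(mt), which by default adds a row field named variant_qc.

```python
def variant_qc_metrics(genotypes):
    """Compute basic QC statistics for one biallelic variant.

    genotypes: one entry per sample, either None (missing call) or the
    number of alternate alleles in a diploid call (0, 1, or 2).
    """
    n_samples = len(genotypes)
    called = [g for g in genotypes if g is not None]
    n_called = len(called)
    allele_number = 2 * n_called  # alleles observed across called samples
    alt_count = sum(called)       # copies of the alternate allele
    return {
        "n_called": n_called,
        "call_rate": n_called / n_samples if n_samples else None,
        "AN": allele_number,
        "alt_AC": alt_count,
        "alt_AF": alt_count / allele_number if allele_number else None,
    }


# Four samples: hom-ref, het, missing, hom-alt.
print(variant_qc_metrics([0, 1, None, 2]))
```

With three of four samples called, the call rate is 0.75; three alternate alleles out of six observed alleles give an alternate allele frequency of 0.5.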

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 24 min | N/A |

Pre-GWAS QC: sample QC

This notebook shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail’s sample_qc() method, and then store results as a Hail Table in DNAnexus. The sample_qc() method computes per-sample metrics and creates a new field in the MT with this information.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | 8 min | N/A |

GWAS

This notebook shows how to perform a genome-wide association study (GWAS) for one case–control trait using Firth logistic regression, and then save results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables for analysis preparation, and then creates a phenotype Hail Table containing the case–control trait. Hail’s logistic_regression_rows() method is used to run the analysis.
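The regression call itself might look like the sketch below. Hail's logistic_regression_rows() does not add an intercept automatically, so the constant 1.0 has to be listed among the covariates. The helper names and the phenotype field names in the usage note are hypothetical, and the genotype is assumed to be stored in the entry field GT.

```python
def with_intercept(covariates):
    """Prepend the constant 1.0 that serves as the explicit intercept term."""
    return [1.0, *covariates]


def run_firth_gwas(mt, case_expr, covariates=()):
    """Run a Firth logistic regression per variant and return a results Table.

    case_expr: a column-indexed Boolean or numeric expression for the trait.
    covariates: column-indexed numeric expressions, without the intercept.
    """
    import hail as hl  # imported here so with_intercept works without Hail

    return hl.logistic_regression_rows(
        test="firth",
        y=case_expr,
        x=mt.GT.n_alt_alleles(),  # additive coding: 0, 1, or 2 alternate alleles
        covariates=with_intercept(covariates),
    )
```

A call such as run_firth_gwas(mt, mt.pheno.is_case, [mt.pheno.age]) assumes the phenotype Table has already been joined onto the MT columns under the field pheno.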

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | 1 hr 30 min | "spark.hadoop.dnanexus.fs.output.upload.chunk.size": 67108864, "spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330, "spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": 28800, "spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": 44330 |

GWAS visualizations

This notebook shows how to retrieve a Table of GWAS results from DNAnexus and then visualize the results with Hail using a Q-Q plot and a Manhattan plot. The visualizations require the Python library Bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |

Annotate GWAS results using Hail VEP

This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. When feature=HAIL-0.2.78-VEP-1.0.3 is selected at launch of the DXJupyterLab Spark Cluster app, VEP data files (version 103, GRCh38) are pre-installed on every node.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 8 min | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 9 min | N/A |

Annotate GWAS results using Hail DB

This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region; however, they may be accessed via a DNAnexus project of any region.

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |

Export genomic data

This notebook shows how to export a Hail MT of genomic data as a single BGEN file for each chromosome in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis includes using regenie. An additional step (not shown in the notebook) recommended for regenie is to create a .bgi index file for each BGEN file. The .bgi files can be created using bgenix, a tool that is part of the BGEN library.
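A per-chromosome export could be sketched as follows, using Hail's export_bgen(), which writes a .bgen file and a .sample file for the output root it is given. The helper names and the output location are hypothetical, and the MT is assumed to carry genotype probabilities in an entry field named GP.

```python
def bgen_output_root(output_dir, chrom):
    """Return the output root for one chromosome (no file extension)."""
    return f"{output_dir.rstrip('/')}/{chrom}"


def export_bgen_per_chromosome(mt, chromosomes, output_dir):
    """Write one BGEN file (plus .sample file) per chromosome."""
    import hail as hl  # imported here so bgen_output_root works without Hail

    for chrom in chromosomes:
        # Restrict the MT to the variants on this chromosome.
        chrom_mt = mt.filter_rows(mt.locus.contig == chrom)
        hl.export_bgen(chrom_mt, bgen_output_root(output_dir, chrom),
                       gp=chrom_mt.GP)
```

The .bgi index files mentioned above are created afterwards with bgenix, outside of Hail.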

| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES; 100 samples; 10K variants; Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES; 100K samples; 1.4M variants; Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 33 min | N/A |

General Advice

  • When accessing input data files on the project for use with Hail, it is recommended to use /mnt/project to read the content of the files. This requires that the input data files are already uploaded to the project before starting the DXJupyterLab Spark Cluster app. All example notebooks that use input data files show how to access mounted input data files from the project.

Guidance on scaling with Hail

  • See https://hail.is/docs/0.2/cloud/general_advice.html for general advice when using Hail.

  • The Hail functions describe() and n_partitions() report information about MTs and Tables without being computationally expensive, and they execute immediately. Functions such as count() may be computationally expensive and take longer to run, especially if the data is large.

  • It is recommended to use fewer nodes when running certain Hail operations (e.g. import into MT, annotation with VEP), especially when the Spark UI shows nodes sitting idle while a task is running.

  • The data in a MT or Table are divided into chunks called partitions, where each partition can be read and processed in parallel by the available cores (see https://spark.apache.org/docs/latest/rdd-programming-guide.html). When considering how the data is partitioned for Hail operations:

    • More partitions means more parallelization

    • Fewer partitions means more memory would be required

    • There should be fewer than 100K partitions

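  • When considering memory management for Hail operations, Spark executor settings such as the number of cores and the amount of memory per executor can be set when the Spark session is created, before Hail is initialized:

from pyspark.sql import SparkSession
import hail as hl

# Set the cores and memory for each Spark executor before the session is created.
builder = (
    SparkSession
    .builder
    .config("spark.executor.cores", 2)
    .config("spark.executor.memory", "18g")
    .enableHiveSupport()
)

# Start (or reuse) the Spark session, then initialize Hail on its Spark context.
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)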
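The partitioning guidance above can be made concrete with a little arithmetic. The sketch below estimates how many rounds of tasks a cluster needs to process every partition once, under the simplifying assumption that every core on every node runs one task at a time; the cores actually available depend on the Spark executor settings. Both function names are hypothetical.

```python
import math


def task_waves(n_partitions, n_nodes, cores_per_node):
    """Return how many rounds of tasks are needed to process every partition once."""
    total_cores = n_nodes * cores_per_node
    return math.ceil(n_partitions / total_cores)


def summarize(mt):
    """Print the schema and partition count of a MatrixTable or Table."""
    mt.describe()  # prints the schema without scanning the data
    print("partitions:", mt.n_partitions())


# 1,000 partitions on 2 nodes with 16 cores each, then on 5 nodes with 8 cores each.
print(task_waves(1000, 2, 16))
print(task_waves(1000, 5, 8))
```

If the number of partitions is smaller than the total number of cores, some cores have nothing to do, which is one reason nodes can appear idle in the Spark UI.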

While the metrics presented for the notebooks can be used as an estimated guide for extrapolating (and interpolating) run time (and cost), keep in mind that most systems do not scale linearly.
