Using Hail to Analyze Genomic Data
Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is an app provided by DNAnexus that allows users to use Hail within JupyterLab notebooks with ease.
To launch a JupyterLab notebook with Hail on the platform, follow the same instructions for launching the DXJupyterLab Spark Cluster app, while also selecting either:

- `feature=HAIL-0.2.78`, where the Hail Java library is pre-installed on all cluster nodes, or
- `feature=HAIL-0.2.78-VEP-1.0.3`, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes.

If `feature=HAIL-0.2.78-VEP-1.0.3` is selected, the spin-up time for the JupyterLab notebook is a few minutes longer (compared to `feature=HAIL-0.2.78`), as both Hail and VEP are installed on all cluster nodes at start-up. Note that only VEP version 103 GRCh38 is available on DNAnexus (custom annotations are not available).
The instance type and number of nodes selected will affect how powerful the Spark cluster will be. Generally, the default settings allow for casual interrogation of the data. If you are running simple queries with a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
The following sections provide a guide to common Hail operations. Each section includes:

- A description of the operation
- Launch specs for different datasets:
  - Instance type
  - Number of nodes
  - Example `dx run` command
  - Approximate wall time taken to run the notebook (these values may vary in practice)
- A link to an example notebook in OpenBio that can be used as a starting point
The notebooks are described in the sections below. The information provided may help guide you in choosing the best launch specs and in knowing what to (generally) expect when working with your own data.
**Import pVCF genomic data into a Hail MT**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | 35 min | N/A |
When importing BGEN files with Hail, note that:

- a Hail-specific index file is required for each BGEN file
- only biallelic variants are supported
**Import BGEN genomic data into a Hail MT**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | Creating index files: < 30 sec; import and write MT: < 30 sec | N/A |
**Filter chromosomes and positions**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
**Filter variant IDs**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
**Filter sample IDs**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
**Replace sample IDs**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
**Annotate genomic data with VEP**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 6 min | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 46 min | N/A |
**Annotate genomic data with Hail’s Annotation DB**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78` | < 5 min | N/A |
**Compute locus (variant) QC metrics**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | 24 min | N/A |
**Compute sample QC metrics**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78` | 8 min | N/A |
**Run a GWAS**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78` | 1 hr 30 min | Custom Spark config (see below) |

Spark configuration used for the 100K-sample run:

```json
"spark.hadoop.dnanexus.fs.output.upload.chunk.size": 67108864,
"spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330,
"spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": 28800,
"spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": 44330
```
**Visualize GWAS results**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
**Annotate GWAS results with VEP**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 8 min | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 9 min | N/A |
**Annotate GWAS results with Hail’s Annotation DB**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
**Export genomic data to BGEN**

| Data | Instance type | Num nodes | Example `dx run` command | Time (approx.) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | 33 min | N/A |
When accessing input data files in the project for use with Hail, it is recommended to use `/mnt/project` to read the content of the files. This requires that the input data files are uploaded to the project before starting the DXJupyterLab Spark Cluster app. All example notebooks that use input data files show how to access mounted input data files from the project.
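For example, a minimal sketch of reading a mounted file (the path below is hypothetical):

```python
import hail as hl

# Files uploaded to the project are visible read-only under /mnt/project;
# prefix the path with file:// so Hail reads it as a local file.
ht = hl.import_table(
    "file:///mnt/project/data/phenotypes.tsv",  # hypothetical path
    impute=True,
)
ht.show(5)
```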
The Hail functions `describe()` and `n_partitions()` each print information about MTs and Tables without being computationally expensive, and will execute immediately. Functions such as `count()` may be computationally expensive and take longer to run, especially if the dataset is large.
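For example, assuming `mt` is a MatrixTable that has already been read:

```python
# Cheap metadata operations: return immediately
mt.describe()              # prints the schema of the MT
print(mt.n_partitions())   # number of partitions

# Potentially expensive: scans the data, so it can take a while on large MTs
n_variants, n_samples = mt.count()
```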
It is recommended to use a lower number of nodes when running certain Hail operations (e.g., import into an MT, annotation with VEP), especially when the Spark UI shows nodes sitting idle during a task.
The data in an MT or Table is divided into chunks called partitions, where each partition can be read and processed in parallel by the available cores. When considering how the data is partitioned for Hail operations:

- More partitions means more parallelization
- Fewer partitions means more memory is required per partition
- There should be fewer than 100K partitions
When considering memory management for Hail operations:

- On the DNAnexus platform, the Spark cluster is allocated roughly 70% of the available memory of the selected instance type by default
- Memory usage and availability can be viewed on the Executors tab in the Spark UI
- The easiest way to manage memory availability is to choose a different instance type

Another method for memory management is CPU reservation, achieved by fine-tuning the Spark configuration within the notebook before initializing Hail, as in the sketch below.
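A minimal sketch of this pattern; the specific value of `spark.task.cpus` is illustrative, not prescriptive:

```python
import pyspark
import hail as hl

# Reserve two cores per Spark task so each task has more memory headroom;
# tune this to the instance type selected at launch.
conf = pyspark.SparkConf().set("spark.task.cpus", "2")
sc = pyspark.SparkContext(conf=conf)
hl.init(sc=sc)
```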
This shows how to import genomic data from a VCF with multiple samples, termed a project VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome.
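A sketch of the core import step (the input path and DNAX database URL are hypothetical):

```python
import hail as hl

hl.init()

# Import a multi-sample (project) VCF from the project mount
mt = hl.import_vcf(
    "file:///mnt/project/data/example.vcf.gz",  # hypothetical path
    force_bgz=True,               # .vcf.gz pVCFs are typically block-gzipped
    reference_genome="GRCh38",
)

# Persist the MT to a Spark database on DNAnexus
mt.write("dnax://database-xxxx/geno.mt")        # hypothetical database ID
```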
This shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome. When importing BGEN files, note that a Hail-specific index file is required for each BGEN file and that only biallelic variants are supported.
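A sketch of the index-then-import pattern (paths are hypothetical; the index is written to a writable location, since `/mnt/project` is read-only):

```python
import hail as hl

bgen = "file:///mnt/project/data/chr21.bgen"     # hypothetical path
idx = {bgen: "hdfs:///tmp/chr21.idx2"}           # Hail-specific index location

# Each BGEN file needs a Hail-specific index before import
hl.index_bgen(bgen, index_file_map=idx, reference_genome="GRCh38")

# Note: only biallelic variants are supported
mt = hl.import_bgen(
    bgen,
    entry_fields=["GT", "GP"],
    sample_file="file:///mnt/project/data/chr21.sample",  # hypothetical path
    index_file_map=idx,
)
```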
This shows how to retrieve a Hail MT from DNAnexus, filter chromosomes and positions, and store the results as a Hail Table in DNAnexus.
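A sketch using `hl.filter_intervals()` (the interval and database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Keep only loci that fall within a region of chr21
interval = hl.parse_locus_interval("chr21:1-10000000",
                                   reference_genome="GRCh38")
filtered = hl.filter_intervals(mt, [interval])

# Store the filtered rows as a Hail Table
filtered.rows().write("dnax://database-xxxx/chr21_region.ht")
```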
This shows how to retrieve a Hail MT from DNAnexus, filter variant IDs, and then store results as a Hail Table in DNAnexus.
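A sketch of one way to filter on variant IDs (the IDs and database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

keep = hl.literal({"chr21:10000:A:T"})  # hypothetical variant IDs to keep

# Build a chr:pos:ref:alt identifier per row and keep only the listed ones
mt = mt.annotate_rows(varid=hl.variant_str(mt.locus, mt.alleles))
mt = mt.filter_rows(keep.contains(mt.varid))

mt.rows().write("dnax://database-xxxx/kept_variants.ht")
```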
This shows how to retrieve a Hail MT from DNAnexus, filter sample IDs, and then store results as a Hail Table in DNAnexus.
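A sketch of filtering on sample IDs (the IDs and database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

keep = hl.literal({"sample_1", "sample_2"})  # hypothetical sample IDs
mt = mt.filter_cols(keep.contains(mt.s))

mt.cols().write("dnax://database-xxxx/kept_samples.ht")
```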
This shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store results as a Hail Table in DNAnexus.
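A sketch of re-keying samples with a two-column mapping table (the path and column names are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Mapping table with columns old_id and new_id, keyed by old_id
mapping = hl.import_table("file:///mnt/project/data/id_map.tsv",
                          key="old_id")

# Look up each sample's new ID and re-key the columns by it
mt = mt.annotate_cols(new_id=mapping[mt.s].new_id)
mt = mt.key_cols_by(s=mt.new_id)
```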
This shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting `feature=HAIL-0.2.78-VEP-1.0.3` when starting up the DXJupyterLab Spark Cluster app, VEP data files (version 103 GRCh38) are pre-installed on every node.
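A sketch of the annotation step; the VEP JSON config path is an assumption and depends on where the feature installs it:

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Runs VEP in parallel on every variant; requires the VEP feature at launch
annotated = hl.vep(mt, "file:///path/to/vep-GRCh38.json")  # assumed config path

annotated.rows().write("dnax://database-xxxx/vep.ht")
```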
This shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project of any region.
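A sketch of annotating from the Annotation DB (the dataset name is one example; the database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Connect to Hail's Annotation DB hosted in Open Data on AWS (US region)
db = hl.experimental.DB(region="us", cloud="aws")
annotated = db.annotate_rows_db(mt, "gnomad_lof_metrics")  # example dataset

annotated.rows().write("dnax://database-xxxx/annotated.ht")
```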
This shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail’s `variant_qc()` method, and then store results as a Hail Table in DNAnexus. The `variant_qc()` method computes variant statistics from the genotype data and creates a new field in the MT with this information.
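A sketch of the QC step (the database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Adds a `variant_qc` row field (call rate, allele frequencies, HWE, ...)
mt = hl.variant_qc(mt)

mt.rows().select("variant_qc").write("dnax://database-xxxx/variant_qc.ht")
```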
This shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail’s `sample_qc()` method, and then store results as a Hail Table in DNAnexus. The `sample_qc()` method computes per-sample metrics and creates a new field in the MT with this information.
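A sketch of the QC step (the database URLs are hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# Adds a `sample_qc` column field (call rate, mean depth, ...)
mt = hl.sample_qc(mt)

mt.cols().select("sample_qc").write("dnax://database-xxxx/sample_qc.ht")
```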
This shows how to perform a genome-wide association study (GWAS) for one case–control trait using Firth logistic regression, and then save results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables to prepare the analysis, and then creates a phenotype Hail Table containing the case–control trait. Hail’s `logistic_regression_rows()` method is used to run the analysis.
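A sketch of the regression call; it assumes the phenotype has already been joined onto the columns as a boolean `mt.pheno.is_case` (an assumed annotation), and the database URL is hypothetical:

```python
import hail as hl

# Firth logistic regression on every variant; covariates must include
# the intercept term (1.0)
gwas = hl.logistic_regression_rows(
    test="firth",
    y=mt.pheno.is_case,           # assumed case-control phenotype
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0],
)

gwas.write("dnax://database-xxxx/gwas.ht")
```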
This shows how to retrieve a Table of GWAS results from DNAnexus and then visualize it with a Q-Q plot and a Manhattan plot using Hail. The visualizations require the Python library Bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.
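A sketch of both plots (the database URL is hypothetical):

```python
import hail as hl
from bokeh.io import output_notebook, show

output_notebook()  # render Bokeh plots inline in the notebook

gwas = hl.read_table("dnax://database-xxxx/gwas.ht")  # hypothetical

show(hl.plot.qq(gwas.p_value))
show(hl.plot.manhattan(gwas.p_value))
```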
This shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting `feature=HAIL-0.2.78-VEP-1.0.3` when starting up the DXJupyterLab Spark Cluster app, VEP data files (version 103 GRCh38) are pre-installed on every node.
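A sketch of annotating the results Table; `hl.vep()` accepts Tables as well as MTs, and the config path is an assumption:

```python
import hail as hl

gwas = hl.read_table("dnax://database-xxxx/gwas.ht")         # hypothetical

annotated = hl.vep(gwas, "file:///path/to/vep-GRCh38.json")  # assumed config path

annotated.write("dnax://database-xxxx/gwas_vep.ht")
```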
This shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project of any region.
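A sketch of annotating the results Table from the Annotation DB (the dataset name is one example; the database URLs are hypothetical):

```python
import hail as hl

gwas = hl.read_table("dnax://database-xxxx/gwas.ht")  # hypothetical

db = hl.experimental.DB(region="us", cloud="aws")
annotated = db.annotate_rows_db(gwas, "CADD")         # example dataset

annotated.write("dnax://database-xxxx/gwas_cadd.ht")
```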
This shows how to export a Hail MT of genomic data as a single BGEN file for each chromosome, in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis includes using regenie. An additional step (not shown in the notebook) recommended for regenie is to create a .bgi index file for each BGEN file. The .bgi files can be created using the bgenix tool, which is part of the BGEN library.
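A sketch of the export loop; the contig list, paths, and the GP derivation from hard calls are assumptions about the input MT:

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/geno.mt")  # hypothetical

# export_bgen writes BGEN v1.2 with 8 bits per probability and needs
# genotype probabilities; derive GP from hard calls if it is absent
mt = mt.annotate_entries(
    GP=hl.or_missing(
        hl.is_defined(mt.GT),
        [hl.if_else(mt.GT.n_alt_alleles() == i, 1.0, 0.0) for i in range(3)],
    )
)

# One BGEN file per chromosome (output location is hypothetical; copy the
# files to the project afterwards)
for contig in ["chr1", "chr20", "chr21"]:
    mt_c = mt.filter_rows(mt.locus.contig == contig)
    hl.export_bgen(mt_c, f"hdfs:///tmp/bgen/{contig}")
```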
See the Hail documentation for general advice on using Hail.