Using Hail to Analyze Genomic Data
Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is an app provided by DNAnexus that lets users run Hail in notebooks.
A license is required to access the functionality described on this page. Contact DNAnexus Sales for more information.
Launching JupyterLab with Hail
To launch a JupyterLab notebook with Hail on the platform, follow the instructions for launching the DXJupyterLab Spark Cluster app and select one of the following features:
feature=HAIL-0.2.78, where the Hail Java library is pre-installed on all cluster nodes
feature=HAIL-0.2.78-VEP-1.0.3, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes
If feature=HAIL-0.2.78-VEP-1.0.3 is selected, the JupyterLab notebook takes a few minutes longer to spin up than with feature=HAIL-0.2.78, because both Hail and VEP are installed on all cluster nodes at start-up. Note that only VEP version 103 for GRCh38 is available on DNAnexus (custom annotations are not available).
The instance type and number of nodes selected will affect how powerful the Spark cluster will be. Generally, the default settings allow for casual interrogation of the data. If you are running simple queries with a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
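The launch options can also be assembled programmatically. The following is a minimal sketch that builds the same `dx run` command line used in the launch specs on this page; the helper name `build_dx_run_command` is hypothetical, and the flags are the ones shown in the example commands.

```python
import shlex


def build_dx_run_command(instance_type, instance_count, feature="HAIL-0.2.78"):
    """Return a dx run command line for the DXJupyterLab Spark Cluster app."""
    args = [
        "dx", "run", "dxjupyterlab_spark_cluster",
        f"--instance-type={instance_type}",    # size of each node
        f"--instance-count={instance_count}",  # number of nodes in the cluster
        f"-ifeature={feature}",                # Hail only, or Hail with VEP
    ]
    return shlex.join(args)


print(build_dx_run_command("mem1_ssd1_v2_x4", 2))
print(build_dx_run_command("mem1_ssd1_v2_x8", 2, feature="HAIL-0.2.78-VEP-1.0.3"))
```

Keeping the instance type and node count as parameters makes it easy to try a smaller or larger cluster for the same notebook.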
Example Notebooks
The following sections provide a guide for common Hail operations. Each section includes:
A description of the operation
Launch specs for different datasets
Instance type
Number of nodes
Example dx run command
Approximate wall time taken to run the notebook (these values may vary in practice)
A link to an example notebook in OpenBio that can be used as a starting point
The notebooks are as follows:
Import pVCF genomic data into a Hail MatrixTable (MT)
Import BGEN genomic data into a Hail MT
Annotate genomic data using Hail VEP
Annotate genomic data using Hail DB
Annotate GWAS results using Hail VEP
Annotate GWAS results using Hail DB
The information in these sections can help you choose launch specs and know what to (generally) expect when working with your data.
These metrics are specific to the data used. Interpret them as estimates; they may not scale linearly with data size.
Import pVCF genomic data into a Hail MatrixTable (MT)
This notebook shows how to import genomic data from VCF with multiple samples, termed project-VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: 1 file in total, 1 file per chromosome, and multiple files per chromosome.
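As a rough sketch of the import step (not the notebook's actual code), the three input layouts can be expressed as a path or glob pattern passed to Hail's `import_vcf()`. The directory, the `cohort...` file names, the `GRCh38` reference, and the `output_url` argument are all assumptions for illustration; Hail is assumed to be initialized already.

```python
def vcf_paths(layout, vcf_dir="file:///mnt/project/vcfs"):
    """Return a path or glob pattern for one of the three input layouts.

    The directory and file names are hypothetical placeholders.
    """
    patterns = {
        "single_file": f"{vcf_dir}/cohort.vcf.gz",
        "per_chromosome": f"{vcf_dir}/cohort_chr*.vcf.gz",
        "multiple_per_chromosome": f"{vcf_dir}/cohort_chr*_b*.vcf.gz",
    }
    return patterns[layout]


def import_pvcf(layout, output_url):
    """Import pVCF file(s) into a MatrixTable and write it to output_url."""
    import hail as hl  # imported here so vcf_paths() works without Hail

    mt = hl.import_vcf(
        vcf_paths(layout),
        force_bgz=True,             # assumes the .vcf.gz files are block-gzipped
        reference_genome="GRCh38",  # assumption: adjust to your data
    )
    mt.write(output_url)            # output_url is a caller-supplied location
    return mt
```

Because `import_vcf()` accepts glob patterns, the same call covers one file, one file per chromosome, or several files per chromosome.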
| Data Used | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | 35 min | N/A |
Import BGEN genomic data into a Hail MatrixTable (MT)
This notebook shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: 1 file in total, 1 file per chromosome, and multiple files per chromosome. When importing BGEN files with Hail, note that:
a Hail-specific index file is required for each BGEN file
only biallelic variants are supported
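The indexing requirement can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the helper names, the separate index directory, and the `sample_path` and `output_url` arguments are hypothetical. Hail's `index_bgen()` writes the index and `import_bgen()` reads it; when indexes are not stored next to the BGEN files, both take an `index_file_map`, and index paths must end in `.idx2`.

```python
import posixpath


def bgen_index_map(bgen_paths, index_dir):
    """Map each BGEN path to an index path under index_dir (suffix .idx2)."""
    return {
        path: posixpath.join(index_dir, posixpath.basename(path) + ".idx2")
        for path in bgen_paths
    }


def import_bgen_files(bgen_paths, sample_path, index_dir, output_url):
    """Index the BGEN files, import them into a MatrixTable, and write it."""
    import hail as hl  # imported here so bgen_index_map() works without Hail

    index_map = bgen_index_map(bgen_paths, index_dir)
    # The reference genome (and any contig recoding) depends on your data;
    # Hail's defaults are used here.
    hl.index_bgen(bgen_paths, index_file_map=index_map)
    mt = hl.import_bgen(
        bgen_paths,
        entry_fields=["GT", "GP"],
        sample_file=sample_path,
        index_file_map=index_map,
    )
    mt.write(output_url)
    return mt
```

Writing the indexes to a separate directory is a design choice for cases where the location holding the BGEN files is not writable.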
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, Chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Creating index files: < 30 sec; import and write MT: < 30 sec | N/A |
Filter by chromosome (chr) and position (pos)
This notebook shows how to retrieve a Hail MT from DNAnexus, filter chromosomes and positions, and store the results as a Hail Table in DNAnexus.
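A minimal sketch of this kind of filter, assuming a `GRCh38` dataset and using Hail's `parse_locus_interval()` and `filter_intervals()`. The helper names and the choice to store only the row (variant) fields as the result Table are assumptions, not necessarily what the notebook does.

```python
def interval_string(contig, start="START", end="END"):
    """Build an interval string; the defaults cover the whole chromosome."""
    return f"{contig}:{start}-{end}"


def filter_by_intervals(mt, intervals, output_url, reference_genome="GRCh38"):
    """Keep variants inside the given intervals and write them as a Table."""
    import hail as hl  # imported here so interval_string() works without Hail

    parsed = [
        hl.parse_locus_interval(s, reference_genome=reference_genome)
        for s in intervals
    ]
    filtered = hl.filter_intervals(mt, parsed)
    filtered.rows().write(output_url)  # row (variant) fields only
    return filtered


# Example intervals: all of chr1 plus a region of chr20.
print([interval_string("chr1"), interval_string("chr20", 100000, 200000)])
```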
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
Filter by variant IDs
This notebook shows how to retrieve a Hail MT from DNAnexus, filter variant IDs, and then store results as a Hail Table in DNAnexus.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
Filter by sample IDs
This notebook shows how to retrieve a Hail MT from DNAnexus, filter sample IDs, and then store results as a Hail Table in DNAnexus.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
Replace sample IDs
This notebook shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store results as a Hail Table in DNAnexus.
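One possible way to do the replacement is sketched below. It assumes the MT's column key is the field `s` (Hail's default after a VCF import) and, for brevity, takes the mapping as a Python dict rather than a mapping table; both helper names are hypothetical.

```python
def check_mapping(mapping):
    """Return mapping unchanged, or raise if two old IDs share a new ID."""
    new_ids = list(mapping.values())
    duplicates = sorted({i for i in new_ids if new_ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate new sample IDs: {duplicates}")
    return mapping


def replace_sample_ids(mt, mapping):
    """Replace sample IDs in column field 's'; unmapped IDs are kept."""
    import hail as hl  # imported here so check_mapping() works without Hail

    lookup = hl.literal(check_mapping(mapping))   # old ID -> new ID
    mt = mt.key_cols_by()                         # un-key so 's' can be rewritten
    mt = mt.annotate_cols(s=lookup.get(mt.s, mt.s))
    return mt.key_cols_by("s")                    # restore the column key
```

Checking for duplicate new IDs up front avoids ending up with two columns that share the same key.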
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
Annotate genomic data using Hail VEP
This notebook shows how to retrieve a Hail MT from DNAnexus, annotate it using Hail’s Variant Effect Predictor (VEP), and then store the results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. Selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting the DXJupyterLab Spark Cluster app pre-installs the VEP data files (version 103, GRCh38) on every node.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 6 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 46 min | N/A |
Annotate genomic data using Hail DB
This notebook shows how to retrieve a Hail MT from DNAnexus, annotate it using Hail’s Annotation Database (DB), and then store the results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project in any region.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | < 5 min | N/A |
Pre-GWAS QC: locus QC
This notebook shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail’s variant_qc() method, and then store the results as a Hail Table in DNAnexus. The variant_qc() method computes variant statistics from the genotype data and creates a new field in the MT with this information.
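A minimal sketch of the locus QC step, assuming an MT with a `GT` entry field. The function name `locus_qc_table` is hypothetical; `variant_qc()` adds a row field named `variant_qc` by default, and keeping only that field in the result Table is a choice made here for illustration.

```python
def locus_qc_table(mt):
    """Compute per-variant QC metrics and return them as a Hail Table."""
    import hail as hl  # requires a running Hail/Spark session

    mt = hl.variant_qc(mt)      # adds the row field 'variant_qc'
    qc = mt.rows()              # Table keyed by locus and alleles
    return qc.select("variant_qc")


# Usage (inside a notebook, with mt already loaded):
#   qc_ht = locus_qc_table(mt)
#   qc_ht.describe()            # lists metrics such as call_rate and AF
#   qc_ht.write(output_url)     # output_url is a caller-supplied location
```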
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 24 min | N/A |
Pre-GWAS QC: sample QC
This notebook shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail’s sample_qc() method, and then store the results as a Hail Table in DNAnexus. The sample_qc() method computes per-sample metrics and creates a new field in the MT with this information.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | 8 min | N/A |
GWAS
This notebook shows how to perform a genome-wide association study (GWAS) for one case–control trait using Firth logistic regression, and then save results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables for analysis preparation, and then creates a phenotype Hail Table containing the case–control trait. Hail’s logistic_regression_rows() method is used to run the analysis.
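The regression call itself can be sketched as follows, assuming the QC filtering has already been applied, the MT's column key is `s`, and the phenotype Table is keyed by sample ID with a 0/1 (or boolean) field. The function name and the field name `is_case` are hypothetical.

```python
def run_firth_gwas(mt, pheno_ht, trait="is_case"):
    """Run a Firth logistic regression per variant; returns a Hail Table."""
    import hail as hl  # requires a running Hail/Spark session

    # Join the phenotype Table onto the MT columns by sample ID.
    mt = mt.annotate_cols(pheno=pheno_ht[mt.s])
    return hl.logistic_regression_rows(
        test="firth",
        y=mt.pheno[trait],          # case-control status (0/1 or boolean)
        x=mt.GT.n_alt_alleles(),    # additive genotype coding
        covariates=[1.0],           # the intercept must be given explicitly
    )


# Usage (inside a notebook):
#   gwas_ht = run_firth_gwas(mt, pheno_ht)
#   gwas_ht.write(output_url)       # output_url is a caller-supplied location
```

Further covariates, such as age or principal components, would be appended to the `covariates` list after the intercept.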
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | 1 hr 30 min | "spark.hadoop.dnanexus.fs.output.upload.chunk.size": 67108864, "spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330, "spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": 28800, "spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": 44330 |
GWAS visualizations
This notebook shows how to retrieve a Table of GWAS results from DNAnexus and then visualize them with a Q-Q plot and a Manhattan plot in Hail. The visualizations require the Python library Bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Q-Q plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | Q-Q plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
Annotate GWAS results using Hail VEP
This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate it using Hail’s Variant Effect Predictor (VEP), and then store the results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. Selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting the DXJupyterLab Spark Cluster app pre-installs the VEP data files (version 103, GRCh38) on every node.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 8 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 9 min | N/A |
Annotate GWAS results using Hail DB
This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate it using Hail’s Annotation Database (DB), and then store the results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they are stored in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project in any region.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A |
Export genomic data
This notebook shows how to export a Hail MT of genomic data as a single BGEN file for each chromosome in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis includes using regenie. An additional step (not shown in the notebook) recommended for regenie is to create a .bgi index file for each BGEN file. The .bgi files can be created using the tool, bgenix, which is part of the BGEN library.
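A rough sketch of a per-chromosome export, assuming the MT has a `GP` entry field (which Hail's `export_bgen()` uses by default). The helper names and the output directory layout are hypothetical. `export_bgen()` writes `<prefix>.bgen` and `<prefix>.sample` for each prefix.

```python
def bgen_output_prefix(output_dir, contig):
    """Return the output prefix for one chromosome, e.g. <output_dir>/chr1."""
    return f"{output_dir.rstrip('/')}/{contig}"


def export_bgen_per_chromosome(mt, contigs, output_dir):
    """Write one BGEN file (plus .sample file) per chromosome."""
    import hail as hl  # imported here so bgen_output_prefix() works without Hail

    for contig in contigs:
        chrom_mt = mt.filter_rows(mt.locus.contig == contig)
        hl.export_bgen(chrom_mt, bgen_output_prefix(output_dir, contig))


# Usage (inside a notebook):
#   export_bgen_per_chromosome(mt, ["chr1", "chr20", "chr21"], output_dir)
```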
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 33 min | N/A |
General Advice
When accessing input data files on the project for use with Hail, it is recommended to use /mnt/project to read the content of the files. This requires that the input data files are already uploaded to the project before starting the DXJupyterLab Spark Cluster app. All example notebooks that use input data files show how to access mounted input data files from the project.
Guidance on scaling with Hail
See https://hail.is/docs/0.2/cloud/general_advice.html for general advice when using Hail.
The Hail functions describe() and n_partitions() each print information about MTs and Tables without being computationally expensive, and execute immediately. Functions such as count() may be computationally expensive and take longer to run, especially if the data is large.
It is recommended to use a lower number of nodes when running certain Hail operations (e.g. import into MT, annotation with VEP), especially when the Spark UI shows nodes sitting idle while a task runs.
The data in a MT or Table are divided into chunks called partitions, where each partition can be read and processed in parallel by the available cores (see https://spark.apache.org/docs/latest/rdd-programming-guide.html). When considering how the data is partitioned for Hail operations:
More partitions means more parallelization
Fewer partitions means each partition is larger, so more memory is required to process it
There should be fewer than 100K partitions
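These rules of thumb can be turned into a quick check. The sketch below is only an illustration: the function name and messages are hypothetical, and the "fewer partitions than cores" note follows from each partition being processed by one core at a time.

```python
MAX_PARTITIONS = 100_000  # upper bound suggested above


def partition_advice(n_partitions, total_cores):
    """Return a list of notes about a partition count for a given cluster."""
    notes = []
    if n_partitions >= MAX_PARTITIONS:
        notes.append("too many partitions: keep the count below 100K")
    if n_partitions < total_cores:
        notes.append("fewer partitions than cores: some cores will sit idle")
    return notes


# Example: an MT with 8 partitions on a cluster with 32 cores in total.
print(partition_advice(8, 32))
```

In a notebook, the first argument would typically come from `mt.n_partitions()`, and the count can be changed with `mt.repartition(n)`.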
When considering memory management for Hail operations:
On the DNAnexus platform, the Spark cluster is allocated roughly 70% of the available memory of the selected instance type by default (see https://documentation.dnanexus.com/developer/apps/developing-spark-apps#default-instance-configurations)
Memory usage and availability can be viewed on the Executors Tab in the Spark UI (see https://spark.apache.org/docs/latest/web-ui.html#executors-tab)
The easiest way to manage memory availability is to choose a different instance type (see https://documentation.dnanexus.com/developer/api/running-analyses/instance-types)
Another method for memory management is CPU reservation, which is done by fine-tuning the Spark configuration within the notebook before initializing Hail
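A minimal sketch of setting Spark configuration before initializing Hail is shown here. It assumes that CPU reservation is expressed through Spark's `spark.task.cpus` property (more CPUs reserved per task means fewer concurrent tasks per executor, and so more memory per task); the helper names and the value 2 are hypothetical, and this may differ from the configuration used in the example notebooks.

```python
def spark_conf_pairs(task_cpus, extra=None):
    """Return (key, value) string pairs suitable for SparkConf.setAll()."""
    pairs = {"spark.task.cpus": str(task_cpus)}  # assumption: CPUs reserved per task
    pairs.update({key: str(value) for key, value in (extra or {}).items()})
    return list(pairs.items())


def init_hail(task_cpus, extra=None):
    """Create a SparkContext with the custom config, then initialize Hail.

    Run this before any other SparkContext exists in the notebook.
    """
    import pyspark
    import hail as hl

    conf = pyspark.SparkConf().setAll(spark_conf_pairs(task_cpus, extra))
    sc = pyspark.SparkContext(conf=conf)
    hl.init(sc=sc)
    return sc


# Example: reserve 2 CPUs per task and raise one of the DNAnexus file system
# cache settings listed in the GWAS launch specs.
print(spark_conf_pairs(2, {"spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330}))
```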
While the metrics presented for the notebooks can be used as an estimated guide for extrapolating (and interpolating) run time (and cost), keep in mind that most systems do not scale linearly.