Using Hail to Analyze Genomic Data

Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is an app provided by DNAnexus that allows users to use Hail within notebooks with ease.

A license is required to access the functionality described on this page. Contact DNAnexus Sales for more information.

Launching JupyterLab with Hail

To launch a JupyterLab notebook with Hail on the platform, follow the same instructions for launching the DXJupyterLab Spark Cluster app while also selecting either:

  • feature=HAIL-0.2.78, where the Hail Java library is pre-installed on all cluster nodes

  • feature=HAIL-0.2.78-VEP-1.0.3, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes.

If feature=HAIL-0.2.78-VEP-1.0.3 is selected, the spin-up time for the JupyterLab notebook will take a few minutes longer (compared to feature=HAIL-0.2.78), as both Hail and VEP are installed on all cluster nodes at start-up. Note that only VEP version 103 GRCh38 is available on DNAnexus (custom annotations are not available).

The instance type and number of nodes selected will affect how powerful the Spark cluster will be. Generally, the default settings allow for casual interrogation of the data. If you are running simple queries with a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.

Example Notebooks

The following sections provide a guide for common Hail operations. Each section includes:

  • A description of the operation

  • Launch specs for different datasets

    • Instance type

    • Number of nodes

    • Example dx run command

  • Approximate wall time to run the notebook (these values may vary in practice)

  • A link to an example notebook in OpenBio that can be used as a starting point

The notebooks are as follows:

  • Import pVCF genomic data into a Hail MatrixTable (MT)

  • Import BGEN genomic data into a Hail MT

  • Filter by chr and pos

  • Filter by variant IDs

  • Filter by sample IDs

  • Replace sample IDs

  • Annotate genomic data using Hail VEP

  • Annotate genomic data using Hail DB

  • Pre-GWAS QC: locus QC

  • Pre-GWAS QC: sample QC

  • GWAS

  • GWAS visualizations

  • Annotate GWAS results using Hail VEP

  • Annotate GWAS results using Hail DB

  • Export genomic data as BGEN

The information provided in these sections may help guide you in choosing the best launch specs and knowing what to (generally) expect when working with your data.

These metrics are specific to the data used; they should be interpreted as estimates and may not scale linearly with the data size.

Import pVCF genomic data into a Hail MatrixTable (MT)

This notebook shows how to import genomic data from a VCF with multiple samples, termed project VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | 35 min | N/A
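As a rough illustration, the core of such an import inside the notebook might look like the sketch below. This is not the linked OpenBio notebook; the input path and output URL are placeholders, and Hail is assumed to be initialized from the cluster's Spark context (as shown in the Spark configuration example later on this page).

import hail as hl

# Assumes hl.init(sc=...) has already been run against the cluster's Spark context.
# Read bgzipped pVCF shards mounted from the project (placeholder path).
mt = hl.import_vcf(
    "file:///mnt/project/pvcf/chr*.vcf.gz",   # hypothetical location of the pVCF files
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False,
)

# Persist the MatrixTable; the destination URL depends on where you keep Hail data
# (for example a dnax:// database URL on DNAnexus).
mt.write("dnax://mydb/geno.mt", overwrite=True)   # placeholder URL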

Import BGEN genomic data into a Hail MatrixTable (MT)

This notebook shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three ways: one file in total, one file per chromosome, or multiple files per chromosome. When importing BGEN files with Hail, note that:

  • a Hail-specific index file is required for each BGEN file

  • only biallelic variants are supported

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | Creating index files: < 30 sec; Import and write MT: < 30 sec | N/A
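A minimal sketch of the two steps (index, then import) follows; the paths are placeholders, and the index files are written to a separate writable location since the mounted project is read-only.

import hail as hl

bgen_path = "file:///mnt/project/bgen/chr21.bgen"        # hypothetical input BGEN
sample_path = "file:///mnt/project/bgen/chr21.sample"    # hypothetical .sample file
idx_path = "hdfs:///tmp/chr21.bgen.idx2"                 # writable location for the Hail index

# Step 1: create the Hail-specific index file for the BGEN file.
hl.index_bgen(bgen_path, index_file_map={bgen_path: idx_path}, reference_genome="GRCh38")

# Step 2: import the BGEN into a MatrixTable (only biallelic variants are supported).
mt = hl.import_bgen(
    bgen_path,
    entry_fields=["GT", "GP"],
    sample_file=sample_path,
    index_file_map={bgen_path: idx_path},
)

mt.write("dnax://mydb/geno_bgen.mt", overwrite=True)     # placeholder URL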

Filter by chromosome (chr) and position (pos)

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by chromosome and position, and store the results as a Hail Table in DNAnexus.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A
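Interval filtering in Hail might look roughly like this (a minimal sketch; the URLs and the interval are placeholders):

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Keep only variants on chr1 between positions 1,000,000 and 2,000,000 (GRCh38 coordinates).
interval = hl.parse_locus_interval("chr1:1000000-2000000", reference_genome="GRCh38")
mt = hl.filter_intervals(mt, [interval])

# Store the filtered variants as a Hail Table of row fields.
mt.rows().write("dnax://mydb/filtered_loci.ht", overwrite=True)   # placeholder URL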

Filter by variant IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by variant IDs, and then store the results as a Hail Table in DNAnexus.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
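One way to express this kind of filter is sketched below, assuming variant IDs in "chr:pos:ref:alt" form; the IDs and URLs are placeholders, and the linked notebook may structure this differently.

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Hypothetical set of variant IDs to keep, in "chr:pos:ref:alt" form.
keep_ids = hl.literal({"chr1:10177:A:AC", "chr1:10352:T:TA"})

# Build a matching ID string from the MT's locus and alleles, then keep matching rows.
mt = mt.annotate_rows(varid=hl.variant_str(mt.locus, mt.alleles))
mt = mt.filter_rows(keep_ids.contains(mt.varid))

mt.rows().write("dnax://mydb/filtered_variants.ht", overwrite=True)   # placeholder URL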

Filter by sample IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, filter by sample IDs, and then store the results as a Hail Table in DNAnexus.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
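A minimal sketch of filtering columns (samples) by ID; the sample IDs and URLs are placeholders:

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Hypothetical set of sample IDs to keep; mt.s is the sample ID column key.
keep_samples = hl.literal({"sample_001", "sample_002"})
mt = mt.filter_cols(keep_samples.contains(mt.s))

# Store the remaining samples' column fields as a Hail Table.
mt.cols().write("dnax://mydb/filtered_samples.ht", overwrite=True)   # placeholder URL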

Replace sample IDs

This notebook shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store the results as a Hail Table in DNAnexus.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
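Re-keying samples with a mapping table might look like the following sketch. The mapping file, its column names (old_id, new_id), and the URLs are placeholders.

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Hypothetical two-column TSV mapping old sample IDs to new ones, keyed by old_id.
mapping = hl.import_table("file:///mnt/project/sample_map.tsv", key="old_id")

# Look up each sample's new ID, then re-key the columns by it.
mt = mt.annotate_cols(new_id=mapping[mt.s].new_id)
mt = mt.key_cols_by(s=mt.new_id).drop("new_id")

mt.cols().write("dnax://mydb/remapped_samples.ht", overwrite=True)   # placeholder URL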

Annotate genomic data using Hail VEP

This notebook shows how to retrieve a Hail MT from DNAnexus, annotate it using Hail's Variant Effect Predictor (VEP), and then store the results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting the DXJupyterLab Spark Cluster app, VEP data files (version 103, GRCh38) are pre-installed on every node.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 6 min | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 46 min | N/A
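In outline, VEP annotation is a single call; the sketch below uses a placeholder path for the VEP configuration JSON (the HAIL-0.2.78-VEP-1.0.3 feature installs VEP and its data files on the nodes, but check the linked notebook for the actual config location).

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Run VEP on every variant in parallel; config path is a placeholder.
annotated = hl.vep(mt, "file:///path/to/vep_config.json")

# Keep the row-level VEP annotations as a Hail Table.
annotated.rows().select("vep").write("dnax://mydb/vep_annotations.ht", overwrite=True)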

Annotate genomic data using Hail DB

This notebook shows how to retrieve a Hail MT from DNAnexus, annotate it using Hail's Annotation Database (DB), and then store the results as a Hail Table in DNAnexus. Hail's Annotation DB is a curated collection of variant annotations. Hail's annotation datasets are available in Open Data on AWS in a US-region S3 bucket; however, they may be accessed from a DNAnexus project in any region.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | < 5 min | N/A
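A minimal sketch of annotating from the Annotation DB; "DANN" is used purely as an example dataset name (check Hail's Annotation DB documentation for the datasets you need), and the URLs are placeholders.

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Hail's Annotation DB datasets are hosted in a US-region S3 bucket on AWS.
db = hl.experimental.DB(region="us", cloud="aws")

# Annotate rows with one or more named annotation datasets ("DANN" is an example name).
mt = db.annotate_rows_db(mt, "DANN")

mt.rows().write("dnax://mydb/db_annotations.ht", overwrite=True)   # placeholder URL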

Pre-GWAS QC: locus QC

This notebook shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail's variant_qc() method, and then store the results as a Hail Table in DNAnexus. The variant_qc() method computes variant statistics from the genotype data and creates a new field in the MT with this information.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 24 min | N/A
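For orientation, locus QC might look like the sketch below; the thresholds and URLs are placeholders, not the values used in the linked notebook.

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Compute per-variant QC metrics; results land in a new row field, mt.variant_qc.
mt = hl.variant_qc(mt)

# Example thresholds (placeholders): keep variants with call rate >= 0.9 and MAF >= 0.01.
mt = mt.filter_rows(
    (mt.variant_qc.call_rate >= 0.9) & (hl.min(mt.variant_qc.AF) >= 0.01)
)

mt.rows().select("variant_qc").write("dnax://mydb/locus_qc.ht", overwrite=True)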

Pre-GWAS QC: sample QC

This notebook shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail's sample_qc() method, and then store the results as a Hail Table in DNAnexus. The sample_qc() method computes per-sample metrics and creates a new field in the MT with this information.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78 | 8 min | N/A
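The analogous sample-level step, as a minimal sketch with placeholder threshold and URLs:

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Compute per-sample QC metrics; results land in a new column field, mt.sample_qc.
mt = hl.sample_qc(mt)

# Example threshold (placeholder): keep samples with call rate >= 0.95.
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)

mt.cols().select("sample_qc").write("dnax://mydb/sample_qc.ht", overwrite=True)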

GWAS

This notebook shows how to perform a genome-wide association study (GWAS) for one case-control trait using Firth logistic regression, and then save the results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables to prepare the analysis, and then creates a phenotype Hail Table containing the case-control trait. Hail's logistic_regression_rows() method is used to run the analysis.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78 | 1 hr 30 min | "spark.hadoop.dnanexus.fs.output.upload.chunk.size": 67108864, "spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330, "spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": 28800, "spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": 44330

GWAS visualizations

This notebook shows how to retrieve a Table of GWAS results from DNAnexus and visualize it with a Q-Q plot and a Manhattan plot using Hail. The visualizations require the Python library Bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A
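A minimal sketch of both plots inside the notebook (the results URL is a placeholder, and the p_value field name assumes results produced by logistic_regression_rows):

from bokeh.io import output_notebook, show
import hail as hl

gwas = hl.read_table("dnax://mydb/gwas_results.ht")   # placeholder URL

output_notebook()   # render Bokeh plots inline in the JupyterLab notebook

# Q-Q plot and Manhattan plot of the GWAS p-values.
show(hl.plot.qq(gwas.p_value))
show(hl.plot.manhattan(gwas.p_value))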

Annotate GWAS results using Hail VEP

This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate it using Hail's Variant Effect Predictor (VEP), and then store the results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting the DXJupyterLab Spark Cluster app, VEP data files (version 103, GRCh38) are pre-installed on every node.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 8 min | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3 | 9 min | N/A

Annotate GWAS results using Hail DB

This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate it using Hail's Annotation Database (DB), and then store the results as a Hail Table in DNAnexus. Hail's Annotation DB is a curated collection of variant annotations. Hail's annotation datasets are available in Open Data on AWS in a US-region S3 bucket; however, they may be accessed from a DNAnexus project in any region.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | < 30 sec | N/A

Export genomic data

This notebook shows how to export a Hail MT of genomic data as a single BGEN file per chromosome in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis includes using regenie. An additional step (not shown in the notebook) recommended for regenie is to create a .bgi index file for each BGEN file. The .bgi files can be created using the bgenix tool, which is part of the BGEN library.

Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config)
--- | --- | --- | --- | --- | ---
WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78 | < 1 min | N/A
WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78 | 33 min | N/A
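A minimal per-chromosome export sketch follows; the URLs and output prefix are placeholders, and it assumes the MT has a GP (genotype probabilities) entry field, which hl.export_bgen uses by default.

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

# Export one BGEN file (v1.2, 8 bits per probability) per chromosome.
contigs = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))
for contig in contigs:
    per_chr = mt.filter_rows(mt.locus.contig == contig)
    # Writes <prefix>.bgen and <prefix>.sample; placeholder output prefix.
    hl.export_bgen(per_chr, f"hdfs:///tmp/export/{contig}")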

General Advice

  • When accessing input data files in the project for use with Hail, it is recommended to use /mnt/project to read the content of the files. This requires that the input data files are already uploaded to the project before starting the DXJupyterLab Spark Cluster app. All example notebooks that use input data files show how to access mounted input data files from the project.

  • See https://hail.is/docs/0.2/cloud/general_advice.html for general advice on using Hail.

Guidance on scaling with Hail

  • The Hail functions describe() and n_partitions() print information about MTs and Tables without being computationally expensive and execute immediately. Functions such as count() may be computationally expensive and take longer to run, especially if the data is large.

  • The data in a MT or Table is divided into chunks called partitions, and each partition can be read and processed in parallel by the available cores (see https://spark.apache.org/docs/latest/rdd-programming-guide.html). When considering how the data is partitioned for Hail operations:

    • More partitions means more parallelization

    • Fewer partitions means more memory is required

    • There should be fewer than 100K partitions

  • It is recommended to use a lower number of nodes when running certain Hail operations (e.g. import into MT, annotation with VEP), especially when the Spark UI shows nodes sitting idle while running a task.

  • When considering memory management for Hail operations:

    • On the DNAnexus platform, the Spark cluster is allocated roughly 70% of the available memory of the selected instance type by default (see https://documentation.dnanexus.com/developer/apps/developing-spark-apps#default-instance-configurations)

    • Memory usage and availability can be viewed on the Executors tab of the Spark UI (see https://spark.apache.org/docs/latest/web-ui.html#executors-tab)

    • The easiest way to manage memory availability is to choose a different instance type (see https://documentation.dnanexus.com/developer/api/running-analyses/instance-types)

    • Another method for memory management is to use CPU reservation by fine-tuning the Spark configuration. An example of how to fine-tune the Spark configuration within the notebook before initializing Hail:

from pyspark.sql import SparkSession
import hail as hl

# Reserve 2 cores and 18 GB of memory per Spark executor before Hail starts.
builder = (
    SparkSession
    .builder
    .config("spark.executor.cores", 2)
    .config("spark.executor.memory", "18g")
    .enableHiveSupport()
)

spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
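In the same vein, checking how a MT is partitioned before a heavy operation is cheap and can guide the choice of instance type and node count. A minimal sketch (the URL and target partition count are placeholders):

import hail as hl

mt = hl.read_matrix_table("dnax://mydb/geno.mt")   # placeholder URL

mt.describe()                 # schema only; inexpensive, returns immediately
print(mt.n_partitions())      # partition count; inexpensive, returns immediately
# mt.count() scans the data and can be expensive on large datasets

# If the partition count is a poor fit for the cluster, repartition before heavy work.
mt = mt.repartition(500)      # example target; choose based on data and cluster size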

While the metrics presented for the notebooks can be used as an estimated guide for extrapolating (and interpolating) run time (and cost), keep in mind that most systems do not scale linearly.

