Learn how to use this workflow to detect somatic small variants and CNVs.
The workflow is compatible with somatic files generated from whole genome sequencing (WGS), whole exome sequencing (WES), and targeted next-generation sequencing panels (covering a specific set of variants or regions of interest). This workflow also allows for variant filtering based on allele frequency, contamination, and orientation bias.
This workflow uses several input files, some of which will need to be prepared separately prior to running this workflow. The apps used to prep the input files can be run from the user interface (UI) or the command-line interface (CLI).
| Location | Available Resources |
| --- | --- |
| Project: “Reference Genome Files”, Directory: “gatk.resources.b37” or “gatk.resources.GRCh38” | Common germline variant sites VCF; Germline population VCF; Known variants; Panel of Normals; Panel of Normals Index |
| Project: “Reference Genome Files”, Directory: “H. Sapiens - GRCh37 - hs37d5 (1000 Genomes Phase II)” or “H. Sapiens - GRCh38 with alt contigs - hs38DH” | BWA reference genome index; Reference sequence; Reference sequence dictionary |
Instructions on how to use these files as inputs to the workflow are described in the next section.
Some reference-genome-related input files, like the BWA reference genome index, are available in public projects, like “Reference Genome Files”, and can be selected as inputs under “Suggested Items” in the top left corner.
Command Line Interface
The Somatic Small Variant and CNV Discovery Workflow can also be run non-interactively if file IDs are already known.
Example:
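A minimal sketch of a non-interactive run (the input names and file IDs below are hypothetical placeholders, not the workflow's actual input names; run dx run <workflow-id> --help to list the real inputs):

```bash
# Run the AWS US (East) workflow by ID. The -i input names and file/project
# IDs are placeholders; list the actual input names with --help first.
dx run globalworkflow-GGy3fyj0XbybQxzV4gy8V085 \
  -itumor_bam=file-xxxx \
  -inormal_bam=file-yyyy \
  --destination=project-zzzz:/results \
  --brief -y
```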
Depending on what region the execution project is in, the Somatic Small Variant and CNV Discovery Workflow will have a different name and ID:
| Region | Workflow Name | Workflow ID |
| --- | --- | --- |
| AWS US (East) | somatic_small_variant_and_cnv_discovery | globalworkflow-GGy3fyj0XbybQxzV4gy8V085 |
| AWS Asia Pacific - Sydney | somatic_small_variant_and_cnv_discovery_sydney | globalworkflow-GGy4kfj5f18KQf524kJ1V4QP |
| AWS Europe - Frankfurt | somatic_small_variant_and_cnv_discovery_frankfurt | globalworkflow-GGy491Q4ZZYyKZ92KXXyGjq7 |
| AWS Europe - London | somatic_small_variant_and_cnv_discovery_london_g | globalworkflow-GGy4K0BKQ3Q8YBYF19gxP2Xj |
| Azure Amsterdam | somatic_small_variant_and_cnv_discovery_azure_eu | globalworkflow-GGy5038BQX5PQ8PK6pggk4bg |
| Azure US | somatic_small_variant_and_cnv_discovery_azure_us | globalworkflow-GGy4x009x22GgqJ34gb6YJJf |
GATK Best Practices for small variant discovery advises creating the Panel of Normals (PON) by first running the variant caller, Mutect2, individually on a set of normal samples, and then combining the resulting variant calls using desired criteria (e.g. excluding any sites that are not present in at least two normals). The result is a sites-only VCF file which may be reused as a PON for subsequent processing, again with Mutect2.
GATK Best Practices also notes that a PON helps Mutect2 detect additional problematic sites in sequencing data: technical artifacts that may arise from sequencing, data processing, and/or mapping.
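Outside of this workflow, that procedure might look like the following sketch (GATK 4.x interface; exact flags differ between GATK releases, and newer releases combine the per-sample calls through GenomicsDBImport rather than repeated -vcfs arguments):

```bash
# Step 1: call each normal sample individually in tumor-only mode.
gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R reference.fasta -I normal2.bam --max-mnp-distance 0 -O normal2.vcf.gz

# Step 2: combine the per-sample calls into a sites-only PON VCF
# (early GATK 4.x interface shown; newer versions take a GenomicsDB workspace).
gatk CreateSomaticPanelOfNormals -vcfs normal1.vcf.gz -vcfs normal2.vcf.gz -O pon.vcf.gz
```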
The CNVkit application on the DNAnexus platform may be used separately to construct a new copy number reference profile. To build a copy number reference profile, run the application with normal-sample BAM files, a reference FASTA file, and a baited (tiled, targeted) genomic regions file in BED or GATK/Picard-style interval list format. The output will be a .cnn file that can be used as an input to this workflow. For example, using the CLI and dx-toolkit:
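A hypothetical invocation (the app name and input names are assumptions; consult the CNVkit app's help on the platform for its actual interface):

```bash
# Placeholder sketch: build a .cnn copy number reference from normal samples.
# App name, input names, and file IDs are all hypothetical.
dx run app-cnvkit_batch \
  -inormal_bams=file-aaaa \
  -inormal_bams=file-bbbb \
  -ireference_fasta=file-cccc \
  -itargets=file-dddd \
  --brief -y
```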
If a copy number reference profile is not provided as an input, this workflow will build the .cnn file using the normal samples. The .cnn file will be one of the output files of the workflow.
If a copy number reference profile from a previous CNVkit analysis (with the same normal samples) is available, it may be reused for subsequent processing of further tumor samples by using it as an input. File reuse will likely save time and cost as the workflow will not need to build the reference profile each time from the same set of normal samples.
The workflow provides users an option to perform Base Quality Score Recalibration (BQSR). Though GATK’s Best Practices suggest performing BQSR, omitting this step can save time and resources. When using data from recent sequencers (generated after 2015), this step can be omitted.
The Somatic Small Variant and CNV Discovery Workflow can be run with large scale datasets where the workflow can be run simultaneously on multiple tumor/normal pairs. See Running Batch Jobs.
SAIGE (Scalable and Accurate Implementation of GEneralized mixed model; Chen, H. et al. 2016) is implemented as an R package. It accounts for sample relatedness, provides accurate P-values even when case-control ratios are extremely unbalanced, and can be used for genetic association tests in large cohorts with more than 400,000 individuals. SAIGE performs single-variant association tests for binary and quantitative traits.
For example, UK Biobank (UKB) data contains related individuals and many phenotypes with unbalanced case/control ratios, such as rare disease diagnoses. SAIGE has been used on case/control ratios as imbalanced as 1:1138, with 358 cases and 407,399 controls. [Ref]
SAIGE authors provide a tutorial at https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE#running-saige-and-saige-gene for how to use the software.
This document showcases how to run a SAIGE GWAS analysis on the DNAnexus platform using UKB data, with the following steps:
Merge the assay genotypes across all autosomes together into PLINK format.
Use the output of the previous step to run the saige_gwas_grm application, which fits the null logistic or linear mixed model to construct the Genetic Relatedness Matrix (GRM) and generates the variance ratio and model files.
Perform single-variant association tests (SVAT) using the saige_gwas_svat application.
Optionally, concatenate the results from multiple saige_gwas_svat analyses together.
The Swiss Army Knife (SAK) application (https://platform.dnanexus.com/app/swiss-army-knife on the DNAnexus platform or https://ukbiobank.dnanexus.com/app/swiss-army-knife on the UKB platform) can be used to concatenate the autosomal assayed genotypes together and generate a single set of PLINK binary files that will be used as input to the saige_gwas_grm app.
The SAK interface will prompt for inputs. On the “Analysis Inputs” tab, provide the required input files by selecting the files for chromosomes 1-22 with the assayed genotypes.
Next, in the “Command line” input, paste the following code which uses plink to merge files together (--merge-list option is documented at https://www.cog-genomics.org/plink/1.9/data#merge_list)
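A sketch of such a command sequence (it assumes, as Swiss Army Knife provides, that the selected .bed/.bim/.fam files are present in the job's working directory):

```bash
# Build the list of PLINK filesets to merge, one prefix per line.
ls *.bed | sed 's/\.bed$//' > files_to_merge.txt

# Merge all listed filesets into a single (.bed, .bim, .fam) set.
plink --merge-list files_to_merge.txt --make-bed --out genotypes_merged

# Remove the temporary list so only the merged fileset is returned as output.
rm files_to_merge.txt
```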
This code will create a list of input files to be merged, merge the listed files using plink into a (.bim, .bed, .fam) set of files, and return the set of merged (.bim, .bed, .fam) files as output.
Next, the output of Step 1 will be used to run the saige_gwas_grm application to generate the variance ratio and model files. The saige_gwas_grm application takes the following inputs:
A genotype file set in PLINK binary format (.bim, .bed, .fam). These PLINK binary files should contain variants merged across all autosomes that will be used to generate the genetic relatedness matrix model and variance ratio files.
A phenotype file: a space- or tab-delimited file with a header, containing a column of sample IDs as they appear in the genotype data, a column for the phenotype, and optional columns for non-genetic covariates, such as gender and age. The phenotype file should only contain samples that are present in both the GRM and SVAT stages.
Then run the saige_gwas_grm app on the merged PLINK (.bim, .bed, .fam) fileset to obtain the model and variance ratio files to be used as inputs for the saige_gwas_svat app.
Select the files for the required inputs and set the configuration parameters for the run, including covariates and phenotype information, as well as advanced options to define thresholds for variants to be included.
The GRM app will produce a model .rda file, a variance ratio file, and an association result file for the subset of randomly selected markers. Use the default mem3_ssd1_v2_x32 instance type pre-selected in the app. The model .rda file and variance ratio file will be used as inputs to the saige_gwas_svat app to perform single-variant association tests.
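From the CLI, a hypothetical equivalent (the input names below are placeholders; use dx run app-saige_gwas_grm -h to list the actual names):

```bash
# Placeholder sketch: fit the null model on the merged PLINK fileset.
dx run app-saige_gwas_grm \
  -igenotype_bed=file-aaaa \
  -igenotype_bim=file-bbbb \
  -igenotype_fam=file-cccc \
  -iphenotype_file=file-dddd \
  --instance-type mem3_ssd1_v2_x32 -y
```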
The saige_gwas_svat app computes single variant association tests for a chunk of genomic data. UKB imputed data is chunked by chromosome, so we’ll launch saige_gwas_svat app in batch mode to compute single variant association tests on each chromosome in parallel.
Upon selecting the SAIGE GWAS SVAT app for analysis, the GUI will prompt the user to add input files and configuration parameters. Select the Batch Run option for the Genotype BGEN Index file inputs, which will be processed in parallel batches.
Configure the files you wish to batch run, complete the required inputs section and press “Start Analysis”.
Batch executions can also be launched from the CLI using instructions at https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs.
UKB’s whole genome variant data is stored in 60,000 pVCF files. This example will show how to perform single variant association tests on chromosome 22 using CLI.
First, merge the chromosome 22 pVCF files into a single file as follows:
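The sketch below assumes the per-chunk pVCF naming pattern shown (a placeholder; adjust the glob to your project's layout), and uses ls -v so the chunks stay in natural, genomic order:

```bash
# List the chromosome 22 pVCF chunks in natural order (b1, b2, ..., b10, ...).
ls -v ukb*_c22_b*_v1.vcf.gz > chr22_chunks.txt

# Naive concatenation is appropriate here: the chunks are already sorted and
# share identical headers, so the BGZF blocks can be joined without recompression.
bcftools concat --naive -f chr22_chunks.txt -o ukb_c22_merged.vcf.gz

# Index the merged file for use with saige_gwas_svat.
tabix -p vcf ukb_c22_merged.vcf.gz
```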
Note that larger chromosomes may need to be merged into several merged files that can be passed to parallel saige_gwas_svat runs for faster processing and smaller memory footprint.
Next, run the saige_gwas_svat app. To find out what inputs are required for the saige_gwas_svat app, use:
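For example (assuming the app is published under this name; substitute the app name or ID from your region):

```bash
# Prints the app help, including required and optional inputs.
dx run app-saige_gwas_svat -h
```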
Run the saige_gwas_svat app using the model and variance ratio files from Step 2, together with the merged chromosome 22 file and its corresponding .tbi index file from the code block above:
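A placeholder sketch (input names are illustrative; use the names reported by the -h output above, and assume the merged VCF and index have been uploaded to the project):

```bash
dx run app-saige_gwas_svat \
  -imodel_file=file-aaaa \
  -ivariance_ratio_file=file-bbbb \
  -ivcf_file=ukb_c22_merged.vcf.gz \
  -ivcf_index_file=ukb_c22_merged.vcf.gz.tbi \
  -y
```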
If needed, SAK can be used to concatenate the results of multiple saige_gwas_svat runs. As input, include all association result files. Assuming association result files have a common naming pattern (for example “saige_step2_ukb_imp_chr*_v3.txt”), use the following code in SAK’s “command line” input field to concatenate the association result files together:
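A sketch of such a command (assuming each result file repeats a single header line):

```bash
# Write the header once, then append each per-chromosome result without its header.
out=saige_step2_ukb_imp_allchr_v3.txt
files=(saige_step2_ukb_imp_chr*_v3.txt)
head -n 1 "${files[0]}" > "$out"
for f in "${files[@]}"; do
  tail -n +2 "$f" >> "$out"
done
```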
The LocusZoom app visualizes GWAS result files stored on the DNAnexus platform including an interactive Manhattan plot, QQ plot, and a table of high p-value loci, as well as standard LocusZoom regional plots. It supports the use of GRCh37 and GRCh38 reference genomes.
This DNAnexus app is based on the hosted LocusZoom codebase and uses a virtual machine, so it incurs compute charges during use. The GWAS results stay on the DNAnexus platform while using this app.
The LocusZoom app is launched from the Tools Library menu.
A user can specify an array of GWAS results files to visualize as the app's input. These files will show up in the GWAS file dropdown once the app is running. Note that other GWAS results can be visualized by entering the file ID of a platform file containing the GWAS results in the input entry field.
After selecting which GWAS results file you want to view, a set of interactive dialogs that recognize common GWAS results formats will appear on screen. You can select more specific settings for the GWAS visualization, and select Next when you are done.
Once your file is done loading, clicking on the name of the file in the Analyze tab will bring up the LocusZoom interactive analysis environment.
Clicking on a locus in the Manhattan Plot or in the Top Loci table will zoom into that locus.
Note that this app does not save its state, so you will need to follow the steps above to re-import the GWAS results file in another instance of LocusZoom app if you wish to view the visualization again.
Human DNA comprises the following entities:
22 pairs of non-sex chromosomes, labeled with numbers from 1 to 22, roughly in order of their sizes (with 1 being the longest).
One pair of sex chromosomes (labeled with letters), consisting of two X chromosomes in females or one X and one Y chromosome in males.
The mitochondrial genome; that is, the DNA contained in special organelles known as mitochondria.
The Human Genome Project set out to identify the sequences of these 25 distinct DNA entities (chromosomes 1 through 22, chromosomes X and Y, and the mitochondria), aka "the human genome". In February of 2009, the Genome Reference Consortium (GRC) released "build 37" of the human genome, called GRCh37. In 2013, the GRC released a newer "build 38" of the human genome, called GRCh38.
Due to the complexity of DNA sequencing and genome assembly, the GRCh37 release included the following sequences:
24 "relatively complete" sequences for chromosomes 1 to 22, X and Y.
A complete mitochondrial sequence.
Several "unlocalized sequences". These are sequences that are known to originate from specific chromosomes, but their exact location within the chromosome is not known.
Several "unplaced sequences". These are sequences that are known to originate from the human genome, but their chromosomal association is not known.
Several "alternate loci". These are sequences that contain alternate representations of specific human regions.
In releasing all these sequences, GRC did not provide a canonical naming scheme for these sequences, nor did it impose a particular ordering of the sequences. This presents a problem in bioinformatics, as all file formats (SAM/BAM, VCF, GFF, BED, etc.) require a unique string identifier when referring to a particular sequence. Everything from read mappings, to variants, to genomic annotations (such as dbSNP or gene databases) needs to identify its genomic location by sequence name and coordinate. This freedom led to different conventions being adopted by different teams.
The 1000 Genomes Project, in its first phase, used the following conventions, which are commonly referred to as "b37" (a term particularly popular among the GATK and IGV communities):
The 24 "relatively complete" chromosomal sequences were named "1" to "22", "X" and "Y".
The GRCh37 mitochondrial sequence was named "MT".
The unlocalized sequences were named after their accession numbers, such as "GL000191.1", "GL000194.1", etc.
The unplaced sequences were named after their accession numbers, such as "GL000211.1", "GL000241.1", etc.
The alternate loci were not included in the b37 dataset.
These conventions (where chromosomes are called "1" to "22", "X", "Y" and "MT") are also followed by the ENSEMBL genome browser, the NCBI dbSNP (in VCF files), the Sanger COSMIC (in VCF files), etc. and are the preferred standard for new projects.
When GRCh37 was released, the UCSC genome browser team performed the following adaptation to the sequences, and called the end result "hg19":
The 24 "relatively complete" chromosomal sequences were given the names "chr1" to "chr22", "chrX" and "chrY".
The GRCh37 mitochondrial sequence was not copied over. Instead, the UCSC genome browser team copied an older mitochondrial sequence from the previous release ("build 36"), and gave it the name "chrM".
The unlocalized sequences were given custom names such as "chr1_gl000191_random" and "chr4_gl000194_random".
The unplaced sequences were given custom names such as "chrUn_gl000221" and "chrUn_gl000241".
The alternate loci were given custom names such as "chr6_apd_hap1" and "chr4_ctg9_hap1".
Unfortunately, the use of the non-GRCh37 mitochondrial sequence makes this incompatible with the actual GRCh37. Mappings or annotations that fall on the hg19 mitochondrial sequence cannot be easily transferred over to the GRCh37/b37 mitochondrial sequence.
Despite the nonstandard sequence naming, the stale mitochondrial sequence, and the inclusion of alternate loci (which is sometimes undesirable for read mapping), hg19 has gained popularity due to its exposure via the UCSC genome browser, and is often the convention used by vendors when reporting exome enrichment kit coordinates.
In its second phase, the 1000 Genomes Project extended the b37 dataset with additional sequences:
A human herpesvirus 4 type 1 sequence (named "NC_007605").
A "decoy" sequence derived from HuRef, human BAC and Fosmid clones, and NA12878 (named "hs37d5").
In addition, the pseudo-autosomal regions (PAR) of chromosome Y have been masked out (replaced with "N"), so that the respective regions in chromosome X may be treated as diploid.
Collectively these changes make this set of sequences optimal for read mapping and variation calling, as they decrease false positives, while being generally compatible with b37. More information can be found here.
The Torrent Suite software (which Ion Torrent makes available for their instruments) allows downloading of a particular human reference genome from the Ion Torrent servers. Ion Torrent calls it "hg19", but it has distinct differences from the UCSC hg19. In particular, it uses the UCSC naming conventions ("chr1" to "chr22", "chrX", "chrY", "chrM"), but has replaced the stale UCSC hg19 mitochondrial sequence with the newer GRCh37 one. This renders the general rule of "chrM refers to the old mitochondria, and MT refers to the new mitochondria" invalid, because now there is a sequence named "chrM" which refers to the new mitochondria.
The 1000 Genomes Phase II (hs37d5) sequence is particularly preferred when read mapping is performed. It leads to better mapping quality due to masking of PAR regions in chromosome Y and the addition of the decoy sequences, while being compatible with b37, GATK, and IGV.
Hail is an open-source library built on top of Apache Spark for scalable genomic data analysis. DXJupyterLab Spark Cluster is an app provided by DNAnexus that allows users to use Hail within notebooks with ease.
To launch a JupyterLab notebook with Hail on the platform, follow the same instructions for launching the DXJupyterLab Spark Cluster app while also selecting either:
feature=HAIL-0.2.78, where the Hail Java library is pre-installed on all cluster nodes, or
feature=HAIL-0.2.78-VEP-1.0.3, where the Ensembl Variant Effect Predictor (VEP) plugin is pre-installed along with Hail on all cluster nodes.
If feature=HAIL-0.2.78-VEP-1.0.3 is selected, the spin-up time for the JupyterLab notebook will take a few minutes longer (compared to feature=HAIL-0.2.78) as both Hail and VEP are installed on all cluster nodes upon start-up. Note that only VEP version 103 GRCh38 is available on DNAnexus (custom annotations are not available).
The instance type and number of nodes selected will affect how powerful the Spark cluster will be. Generally, the default settings allow for casual interrogation of the data. If you are running simple queries with a small amount of data, you can save costs by selecting a smaller instance type. If you are running complex queries or analyzing a large amount of data in memory, you may need to select a larger instance type. To increase parallelization efficiency and reduce processing time, you may need to select more nodes.
The following sections provide a guide for common Hail operations. Each section includes:
A description of the operation
Launch specs for different datasets, including the instance type, number of nodes, an example dx run command, and the approximate wall time taken to run the notebook (these values may vary in practice)
A link to an example notebook in OpenBio that can be used as a starting point
The notebooks are as follows:
Import pVCF genomic data into a Hail MatrixTable (MT)
Import BGEN genomic data into a Hail MT
Filter loci in a Hail MT
Filter variant IDs in a Hail MT
Filter sample IDs in a Hail MT
Replace sample IDs in a Hail MT
Annotate genomic data using Hail VEP
Annotate genomic data using Hail DB
Compute variant (locus) QC metrics
Compute sample QC metrics
Perform a GWAS using Firth logistic regression
Visualize GWAS results
Annotate GWAS results using Hail VEP
Annotate GWAS results using Hail DB
Export genomic data to BGEN format
Information provided in these sections may help guide you in choosing the best launch specs and knowing what to (generally) expect when working with your data.
This notebook shows how to import genomic data from a VCF with multiple samples, termed project-VCF (pVCF), into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: one file in total, one file per chromosome, or multiple files per chromosome.
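The core of the import might look like this sketch (the mounted path and the dnax:// database URL are placeholders, not the notebook's actual values):

```python
import hail as hl

hl.init()  # within DXJupyterLab, attaches to the app's Spark cluster

# Read a bgzipped pVCF mounted from the project; force_bgz treats .gz as BGZF.
mt = hl.import_vcf(
    "file:///mnt/project/data/cohort_chr21.vcf.gz",
    force_bgz=True,
    reference_genome="GRCh38",
)

# Persist the MatrixTable to a DNAnexus Spark database (placeholder URL).
mt.write("dnax://database-xxxx/cohort_chr21.mt")
```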
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | 35 min | N/A |
This notebook shows how to import BGEN genomic data into a Hail MT and store it in DNAnexus. Input data may be structured in one of three different ways: one file in total, one file per chromosome, or multiple files per chromosome. When importing BGEN files with Hail, note that (a sketch follows the list below):
a Hail-specific index file is required for each BGEN file
only biallelic variants are supported
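A minimal sketch (paths are placeholders; the index is written to a writable location via index_file_map because /mnt/project is read-only):

```python
import hail as hl

hl.init()

bgen = "file:///mnt/project/data/cohort_chr21.bgen"
idx_map = {bgen: "file:///tmp/cohort_chr21.idx2"}  # /mnt/project is read-only

# Create the Hail-specific index required for each BGEN file.
hl.index_bgen(bgen, index_file_map=idx_map, reference_genome="GRCh38")

# Import (biallelic variants only) and persist as an MT; add sample_file=...
# if the BGEN does not embed sample IDs.
mt = hl.import_bgen(bgen, entry_fields=["GT", "dosage"], index_file_map=idx_map)
mt.write("dnax://database-xxxx/cohort_chr21.mt")
```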
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, chr 1, 2, 3 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | Creating index files: < 30 sec; Import and write MT: < 30 sec | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, filter chromosomes and positions, and store the results as a Hail Table in DNAnexus.
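The core of the operation looks like this sketch (the dnax:// URLs, interval, and genome build are placeholders):

```python
import hail as hl

# Keep only loci falling in a hypothetical interval, then store the rows.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
intervals = [hl.parse_locus_interval("chr1:1-2000000", reference_genome="GRCh38")]
filtered = hl.filter_intervals(mt, intervals)
filtered.rows().write("dnax://database-xxxx/filtered_loci.ht")
```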
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, filter variant IDs, and then store results as a Hail Table in DNAnexus.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, filter sample IDs, and then store results as a Hail Table in DNAnexus.
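A minimal sketch (the sample IDs and dnax:// URLs are placeholders):

```python
import hail as hl

# Keep a hypothetical set of sample IDs and store the remaining columns.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
keep = hl.literal({"sample_100123", "sample_100456"})
filtered = mt.filter_cols(keep.contains(mt.s))
filtered.cols().write("dnax://database-xxxx/filtered_samples.ht")
```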
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, replace sample IDs using a mapping table, and then store results as a Hail Table in DNAnexus.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting up the DXJupyterlab with Spark Cluster app, VEP data files (version 103 GRCh38) will be pre-installed on every node.
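A sketch of the annotation step (dnax:// URLs are placeholders; it assumes Hail discovers the preinstalled VEP configuration automatically, e.g. via VEP_CONFIG_URI; if not, pass the configuration path as hl.vep(mt, "<config.json>")):

```python
import hail as hl

# Annotate every variant with VEP; results appear as a new row field.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
annotated = hl.vep(mt)
annotated.rows().write("dnax://database-xxxx/vep_annotations.ht")
```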
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 6 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 46 min | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they reside in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project of any region.
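A minimal sketch (the dataset name is one example from Hail's Annotation DB catalog; the dnax:// URLs are placeholders):

```python
import hail as hl

# Connect to Hail's Annotation DB hosted on AWS (US region) and annotate rows.
db = hl.experimental.DB(region="us", cloud="aws")
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
mt = db.annotate_rows_db(mt, "gnomad_lof_metrics")
mt.rows().write("dnax://database-xxxx/db_annotations.ht")
```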
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem2_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78` | < 5 min | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, compute locus quality control (QC) metrics using Hail’s variant_qc() method, and then store results as a Hail Table in DNAnexus. The variant_qc() method computes variant statistics from the genotype data and creates a new field in the MT with this information.
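A minimal sketch (dnax:// URLs are placeholders):

```python
import hail as hl

# Compute per-variant QC metrics; they are added as the row field mt.variant_qc.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
mt = hl.variant_qc(mt)
mt.rows().select("variant_qc").write("dnax://database-xxxx/locus_qc.ht")
```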
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | 24 min | N/A |
This notebook shows how to retrieve a Hail MT from DNAnexus, compute sample quality control (QC) metrics using Hail’s sample_qc() method, and then store results as a Hail Table in DNAnexus. The sample_qc() method computes per-sample metrics and creates a new field in the MT with this information.
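A minimal sketch (dnax:// URLs are placeholders):

```python
import hail as hl

# Compute per-sample QC metrics; they are added as the column field mt.sample_qc.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
mt = hl.sample_qc(mt)
mt.cols().select("sample_qc").write("dnax://database-xxxx/sample_qc.ht")
```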
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x16 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x16 --instance-count=5 -ifeature=HAIL-0.2.78` | 8 min | N/A |
This notebook shows how to perform a genome-wide association study (GWAS) for one case–control trait using Firth logistic regression, and then save results as a Hail Table in DNAnexus. The notebook first retrieves previously stored genomic, locus QC, and sample QC Tables for analysis preparation, and then creates a phenotype Hail Table containing the case–control trait. Hail’s logistic_regression_rows() method is used to run the analysis.
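The regression call itself might look like this sketch (the phenotype annotation mt.pheno.is_case and the dnax:// URLs are hypothetical; the covariates list holds only the intercept here):

```python
import hail as hl

# mt is assumed to carry a phenotype annotation joined from a phenotype Table.
mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
gwas = hl.logistic_regression_rows(
    test="firth",
    y=mt.pheno.is_case,          # hypothetical case-control field
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0],            # intercept; append real covariates (age, sex, PCs)
)
gwas.write("dnax://database-xxxx/gwas_results.ht")
```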
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78` | 1 hr 30 min | `"spark.hadoop.dnanexus.fs.output.upload.chunk.size": 67108864, "spark.hadoop.dnanexus.fs.cache.filestatus.maxsize": 44330, "spark.hadoop.dnanexus.fs.cache.filestatus.expiretime": 28800, "spark.hadoop.dnanexus.fs.cache.readfileurl.maxsize": 44330` |
This notebook shows how to retrieve a Table of GWAS results from DNAnexus and then visualize it with a Q-Q plot and a Manhattan plot using Hail. The visualizations require the Python library Bokeh, which comes pre-installed with Hail in the JupyterLab notebook environment.
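A minimal sketch (the dnax:// URL is a placeholder; p_value is the field produced by Hail's regression methods):

```python
import hail as hl
from bokeh.io import output_notebook, show

output_notebook()  # render Bokeh figures inline in the notebook

gwas = hl.read_table("dnax://database-xxxx/gwas_results.ht")
show(hl.plot.qq(gwas.p_value))
show(hl.plot.manhattan(gwas.p_value))
```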
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | QQ plot: < 30 sec; Manhattan plot: < 30 sec | N/A |
This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Variant Effect Predictor (VEP), and then store results as a Hail Table in DNAnexus. Hail runs VEP in parallel on every variant in a dataset, which requires VEP data files to be available on every node. By selecting feature=HAIL-0.2.78-VEP-1.0.3 when starting up the DXJupyterlab with Spark Cluster app, VEP data files (version 103 GRCh38) will be pre-installed on every node.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 8 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 5 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=5 -ifeature=HAIL-0.2.78-VEP-1.0.3` | 9 min | N/A |
This notebook shows how to retrieve a GWAS results Table from DNAnexus, annotate using Hail’s Annotation Database (DB), and then store results as a Hail Table in DNAnexus. Hail’s Annotation DB is a curated collection of variant annotations. Hail’s annotation datasets are available in Open Data on AWS, where they reside in an S3 bucket in the US region; however, they may be accessed from a DNAnexus project of any region.
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | < 30 sec | N/A |
This notebook shows how to export a Hail MT of genomic data as a single BGEN file for each chromosome, in v1.2 format with 8 bits per probability. This format is recommended if downstream analysis includes using regenie. An additional step (not shown in the notebook) recommended for regenie is to create a .bgi index file for each BGEN file. The .bgi files can be created using the bgenix tool, which is part of the BGEN library.
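For example, a sketch of the indexing step (the chrN.bgen naming pattern is a placeholder for the exported files):

```bash
# Create a .bgi index next to each exported BGEN file (chrN.bgen.bgi).
for f in chr*.bgen; do
  bgenix -g "$f" -index
done
```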
| Data | Instance type | Num nodes | Example dx run command | Time (approx) | Additional Notes (including Spark config) |
| --- | --- | --- | --- | --- | --- |
| WES, 100 samples, 10K variants, Chr 1-22 | mem1_ssd1_v2_x4 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x4 --instance-count=2 -ifeature=HAIL-0.2.78` | < 1 min | N/A |
| WES, 100K samples, 1.4M variants, Chr 1, 20, 21 | mem1_ssd1_v2_x8 | 2 | `dx run dxjupyterlab_spark_cluster --instance-type=mem1_ssd1_v2_x8 --instance-count=2 -ifeature=HAIL-0.2.78` | 33 min | N/A |
When accessing input data files in the project for use with Hail, it is recommended to use /mnt/project to read the content of the files. This requires that the input data files are already uploaded to the project before starting the DXJupyterLab Spark Cluster app. All example notebooks that use input data files show how to access mounted input data files from the project.
See https://hail.is/docs/0.2/cloud/general_advice.html for general advice when using Hail.
The Hail functions describe() and n_partitions() each print information about MTs and Tables without being computationally expensive, and will execute immediately. Functions such as count() may be computationally expensive and take longer to complete, especially if the data is large.
It is recommended to use a lower number of nodes when running certain Hail operations (e.g. import into an MT, annotation with VEP), especially when the Spark UI shows nodes sitting idle while running a task.
The data in an MT or Table is divided into chunks called partitions, where each partition can be read and processed in parallel by the available cores (see https://spark.apache.org/docs/latest/rdd-programming-guide.html). When considering how the data is partitioned for Hail operations (see the sketch after this list):
More partitions means more parallelization
Fewer partitions means more memory is required per partition
There should be fewer than 100K partitions
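A sketch of inspecting and adjusting partitioning (the dnax:// URL and target count are illustrative):

```python
import hail as hl

mt = hl.read_matrix_table("dnax://database-xxxx/cohort.mt")
print(mt.n_partitions())   # cheap; executes immediately
mt = mt.repartition(500)   # rebalance if the count is extreme for the data size
```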
When considering memory management for Hail operations:
On the DNAnexus platform, the Spark cluster is allocated roughly 70% of the available memory of the selected instance type by default (See https://documentation.dnanexus.com/developer/apps/developing-spark-apps#default-instance-configurations)
Memory usage and availability can be viewed on the Executors Tab in the Spark UI (see https://spark.apache.org/docs/latest/web-ui.html#executors-tab)
The easiest way to manage memory availability is to choose a different instance type (see https://documentation.dnanexus.com/developer/api/running-analyses/instance-types)
Another method for memory management is to use CPU and memory reservation by fine-tuning Spark configurations. An example of how to fine-tune Spark configurations within the notebook before initializing Hail:
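A sketch (the configuration values are illustrative; tune them to the chosen instance type):

```python
import pyspark
import hail as hl

# Reserve cores and memory per executor before Hail attaches to the cluster.
conf = pyspark.SparkConf().setAll([
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "24g"),
])
sc = pyspark.SparkContext(conf=conf)
hl.init(sc=sc)
```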
This page is a list of open-source tools developed by the DNAnexus science team.
GLnexus: Scalable gVCF merging and joint variant calling for population sequencing projects. Also on BioRxiv.
IndexTools: A toolkit for extremely fast NGS analysis based on index files.
While IGV is available as a tool that can be used on the DNAnexus Platform in your project's Visualize tab, you can also use the Integrative Genomics Viewer (IGV) software on your local machine to visualize files stored on DNAnexus.
The Integrative Genomics Viewer (IGV) is a visualization tool developed and maintained by the Broad Institute for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. For more information about IGV or to download the latest version, please visit the official website at https://www.broadinstitute.org/igv/.
There are two methods of visualizing files stored on DNAnexus using IGV. You can either load individual tracks one-by-one or create a session file to automatically load multiple files.
Both methods require you to first set up your IGV preferences for use with DNAnexus URLs.
Requirements: IGV v2.3.62 or later
This is a one-time setup that you must repeat each time you download or update IGV on your computer.
After opening your local IGV application, go to the menu bar and select View and then Preferences.
In the preference window, go to the Advanced tab (on the right). If you see an option to have IGV Automatically discover index and coverage files, please uncheck this option to enable DNAnexus URLs to work with IGV.
Click OK to save your preferences. In some later versions of IGV, this option has been removed, and you can proceed without this step.
Step 1
From the DNAnexus Platform, select the files you would like to visualize. Check the boxes next to the names of the desired files. You will need to select both the file (e.g. SRR504516.bam) and the index file (e.g. SRR504516.bam.bai) as both are needed for IGV.
Step 2
Generate the URLs for the files. After selecting your files, click on the Download button on the upper right side of the screen. Then copy the URLs generated, or click on the page icon on the upper right to copy all the URLs at once.
Warning: These URLs will only be valid for 24 hours. If you would like the URLs to remain valid for longer, please use our [CLI tools](/user/helpstrings-of-sdk-command-line-utilities#make_download_url) for generating URLs.
Step 3
From the IGV application on your local machine, go to the menu bar and select File and then Load from URL.
Enter the URL of the file you wish to view into the "File URL" field of the "Load from URL" window that opens up. Repeat this step for the index file's URL.
Repeat this step for each file you want to visualize during your IGV session.
Step 4
Now you are able to browse your DNAnexus files from your local machine. Please note that certain actions may take a bit longer to run, as IGV does need to download some data locally. However, IGV will only download the data required to visualize the portion of the genome you're viewing, not the entire file.
Alternatively, you can create an "index-aware session file" to automatically load multiple tracks into IGV. This session file is simply a text file with the following format:
The file must have a file extension of .idxsession.
The file must have one line for each track you want to visualize.
If you are visualizing a BAM or VCF file, the line must also contain the URL for the index file, separated by a space.
If you are visualizing a BAM file, you may optionally include a URL for the .tdf coverage file, also separated by a space.
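For illustration, a hypothetical .idxsession file with one BAM track and one VCF track might look like this (the URLs are placeholders generated as described above):

```
https://dl.dnanex.us/F/D/aaaa/SRR504516.bam https://dl.dnanex.us/F/D/bbbb/SRR504516.bam.bai
https://dl.dnanex.us/F/D/cccc/variants.vcf.gz https://dl.dnanex.us/F/D/dddd/variants.vcf.gz.tbi
```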
Once you have this session file, you can open the session file in IGV and IGV will automatically load all the tracks you specified in your file. To do this, in IGV, go to the menu >> File >> Open Session...
That will then open a window for you to select the session file to open in IGV.