Human Reference Genomes
What exactly is "human DNA"?
Human DNA comprises the following entities:
22 pairs of non-sex chromosomes, labeled with numbers from 1 to 22, roughly in order of their sizes (with 1 being the longest).
One pair of sex chromosomes (labeled with letters), consisting of two X chromosomes in females or one X and one Y chromosome in males.
The mitochondrial genome; that is, the DNA contained in special organelles known as mitochondria.
What is GRCh37?
The Human Genome Project set out to identify the sequences of these 25 distinct DNA entities (chromosomes 1 through 22, chromosomes X and Y, and the mitochondria), aka "the human genome". In February of 2009, the Genome Reference Consortium (GRC) released "build 37" of the human genome, called GRCh37. In 2013, the GRC released a newer "build 38" of the human genome, called GRCh38.
Due to the complexity of DNA sequencing and genome assembly, the GRCh37 release included the following sequences:
24 "relatively complete" sequences for chromosomes 1 to 22, X and Y.
A complete mitochondrial sequence.
Several "unlocalized sequences". These are sequences that are known to originate from specific chromosomes, but their exact location within the chromosome is not known.
Several "unplaced sequences". These are sequences that are known to originate from the human genome, but their chromosomal association is not known.
Several "alternate loci". These are sequences that contain alternate representations of specific human regions.
In releasing all these sequences, the GRC did not provide a canonical naming scheme for them, nor did it impose a particular ordering. This presents a problem in bioinformatics, as the common file formats (SAM/BAM, VCF, GFF, BED, etc.) require a unique string identifier when referring to a particular sequence. Everything from read mappings, to variants, to genomic annotations (such as dbSNP or gene databases) needs to identify its genomic location by sequence name and coordinate. This freedom led to different conventions being adopted by different teams.
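In practice, a naming mismatch usually surfaces when an annotation file refers to sequences that the reference does not contain. The following is a minimal sketch of such a check, assuming the reference names come from the @SQ lines of a SAM header and the annotation is a tab-separated BED file. The function names (parse_sq_names, bed_names, unmatched_names) are hypothetical helpers written for this illustration, not part of any library.

```python
def parse_sq_names(sam_header_text):
    """Return the sequence names (SN: fields) of the @SQ header lines, in order."""
    names = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def bed_names(bed_text):
    """Return the set of sequence names used in the first column of a BED file."""
    names = set()
    for line in bed_text.splitlines():
        # Skip blank lines, comments, and UCSC "track"/"browser" lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split("\t")[0])
    return names


def unmatched_names(reference_names, annotation_names):
    """Return annotation sequence names that are absent from the reference."""
    return sorted(set(annotation_names) - set(reference_names))


# Toy inputs: a b37-style header and an hg19-style BED file.
header = "@HD\tVN:1.6\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16569\n"
bed = "chr1\t100\t200\nchrM\t0\t50\n"
print(unmatched_names(parse_sq_names(header), bed_names(bed)))
```

A non-empty result means the two files follow different conventions (or different references) and should not be combined as they are.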
The "b37" conventions (by the 1000 Genomes Project Phase I)
The 1000 Genomes Project, in its first phase, used the following conventions, which are commonly referred to as "b37" (a term particularly popular among the GATK and IGV communities):
The 24 "relatively complete" chromosomal sequences were named "1" to "22", "X" and "Y".
The GRCh37 mitochondrial sequence was named "MT".
The unlocalized sequences were named after their accession numbers, such as "GL000191.1", "GL000194.1", etc.
The unplaced sequences were named after their accession numbers, such as "GL000211.1", "GL000241.1", etc.
The alternate loci were not included in the b37 dataset.
These conventions (where chromosomes are called "1" to "22", "X", "Y" and "MT") are also followed by the ENSEMBL genome browser, the NCBI dbSNP (in VCF files), the Sanger COSMIC (in VCF files), and others, and are the preferred standard for new projects.
The "hg19" conventions (by UCSC)
When GRCh37 was released, the UCSC genome browser team performed the following adaptation to the sequences, and called the end result "hg19":
The 24 "relatively complete" chromosomal sequences were given the names "chr1" to "chr22", "chrX" and "chrY".
The GRCh37 mitochondrial sequence was not copied over. Instead, the UCSC genome browser team copied an older mitochondrial sequence from the previous release ("build 36"), and gave it the name "chrM".
The unlocalized sequences were given custom names such as "chr1_gl000191_random" and "chr4_gl000194_random".
The unplaced sequences were given custom names such as "chrUn_gl000221" and "chrUn_gl000241".
The alternate loci were given custom names such as "chr6_apd_hap1" and "chr4_ctg9_hap1".
Unfortunately, the use of the non-GRCh37 mitochondrial sequence makes hg19 incompatible with the actual GRCh37. Mappings or annotations that fall on the hg19 mitochondrial sequence cannot be easily transferred over to the GRCh37/b37 mitochondrial sequence.
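The practical consequence can be sketched as a name converter that renames the 24 chromosomal sequences but deliberately refuses to translate the mitochondrial name. This is an illustration only: b37_to_hg19 is a hypothetical helper, and it leaves out the unlocalized and unplaced sequences, which would need an explicit lookup table because their hg19 names also encode the chromosome of origin.

```python
# The 24 chromosomal sequences differ only in name between b37 and hg19.
PRIMARY = [str(n) for n in range(1, 23)] + ["X", "Y"]
B37_TO_HG19 = {name: "chr" + name for name in PRIMARY}


def b37_to_hg19(name):
    """Translate a b37 chromosome name to its hg19 name.

    Raises ValueError for "MT" (the hg19 "chrM" is a different sequence,
    so coordinates would not carry over) and for any name this sketch
    does not cover.
    """
    if name == "MT":
        raise ValueError(
            'b37 "MT" and hg19 "chrM" are different sequences; '
            "renaming alone would give wrong coordinates"
        )
    try:
        return B37_TO_HG19[name]
    except KeyError:
        raise ValueError(f"no simple hg19 name for {name!r}") from None


print(b37_to_hg19("7"))   # chr7
print(b37_to_hg19("X"))   # chrX
```

Failing loudly on "MT" is a design choice here: silently mapping it to "chrM" would produce files that look valid but point at the wrong bases.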
Despite the nonstandard sequence naming, the stale mitochondrial sequence, and the inclusion of alternate loci (which is sometimes undesirable for read mapping), hg19 has gained popularity due to its exposure via the UCSC genome browser, and is often the convention used by vendors when reporting exome enrichment kit coordinates.
The "b37+decoy" / "hs37d5" extensions (by the 1000 Genomes Project Phase II)
In its second phase, the 1000 Genomes Project extended the b37 dataset with additional sequences:
A human herpesvirus 4 type 1 sequence (named "NC_007605").
A "decoy" sequence derived from HuRef, human BAC and Fosmid clones, and NA12878 (named "hs37d5").
In addition, the pseudo-autosomal regions (PAR) of chromosome Y have been masked out (replaced with "N"), so that the respective regions in chromosome X may be treated as diploid.
Collectively, these changes make this set of sequences well suited to read mapping and variant calling, as they decrease false positives while remaining generally compatible with b37.
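The masking step amounts to overwriting a region with "N" while keeping every other coordinate unchanged. A minimal sketch on a toy sequence follows; mask_regions is a hypothetical helper, the intervals are made up for the example, and the real PAR coordinates are not reproduced here.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each region replaced by "N".

    `regions` is an iterable of (start, end) pairs, 0-based and half-open.
    The length of the sequence is preserved, so coordinates outside the
    masked regions stay valid.
    """
    bases = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(bases):
            raise ValueError(f"region ({start}, {end}) is out of bounds")
        bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Toy "chromosome" with a masked region at each end.
toy = "ACGTACGTAC"
print(mask_regions(toy, [(0, 3), (8, 10)]))  # NNNTACGTNN
```

Because an aligner cannot place reads on a run of "N", reads from the masked regions map only to the corresponding regions of chromosome X.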
The "Ion Torrent hg19"
The Torrent Suite software (which Ion Torrent makes available for their instruments) allows downloading of a particular human reference genome from the Ion Torrent servers. Ion Torrent calls it "hg19", but it differs from the UCSC hg19. In particular, it uses the UCSC naming conventions ("chr1" to "chr22", "chrX", "chrY", "chrM"), but has replaced the stale UCSC hg19 mitochondrial sequence with the newer GRCh37 one. This invalidates the general rule of "chrM refers to the old mitochondria, and MT refers to the new mitochondria", because now there is a sequence named "chrM" which refers to the new mitochondria.
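Since the name alone no longer identifies the mitochondrial sequence, one option is to also look at its length. The sketch below assumes the sequence lengths are read from a FASTA index (.fai) file, whose first two tab-separated columns are name and length. The two lengths used are not stated in the text above: 16569 bp for the GRCh37 mitochondrial sequence and 16571 bp for the older one shipped as UCSC hg19 "chrM". The function names and returned labels are hypothetical.

```python
GRCH37_MT_LENGTH = 16569       # assumed length of the GRCh37 "MT" sequence
UCSC_HG19_CHRM_LENGTH = 16571  # assumed length of the original hg19 "chrM"


def parse_fai(fai_text):
    """Return {sequence name: length} from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


def describe_reference(lengths):
    """Guess which GRCh37-derived convention a set of sequences follows.

    Caveat: this only separates the references discussed in this article.
    Other assemblies (for example GRCh38-based ones) may also carry a
    16569 bp "chrM", so the result is a hint, not proof.
    """
    if "MT" in lengths:
        return "hs37d5 (b37+decoy)" if "hs37d5" in lengths else "b37"
    chrm = lengths.get("chrM")
    if chrm == UCSC_HG19_CHRM_LENGTH:
        return "UCSC hg19"
    if chrm == GRCH37_MT_LENGTH:
        return "chr-prefixed names with the GRCh37 mitochondrion"
    return "unknown"


fai = "chr1\t249250621\t6\t50\t51\nchrM\t16569\t254235640\t50\t51\n"
print(describe_reference(parse_fai(fai)))
```

For the toy index above, the function reports chr-prefixed names with the GRCh37 mitochondrion, which is the situation described for the Ion Torrent hg19.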
Which human sequence should one use?
The 1000 Genomes Phase II (hs37d5) sequence is particularly preferred when read mapping is performed. It leads to better mapping quality due to masking of PAR regions in chromosome Y and the addition of the decoy sequences, while being compatible with b37, GATK, and IGV.