DNAnexus Documentation

Copyright 2025 DNAnexus

Human Reference Genomes

What exactly is "human DNA"?

Human DNA comprises the following entities:

  • 22 pairs of non-sex chromosomes, labeled with numbers from 1 to 22, roughly in order of their sizes (with 1 being the longest).

  • One pair of sex chromosomes (labeled with letters), consisting of two X chromosomes in females or one X and one Y chromosome in males.

  • The mitochondrial genome; that is, the DNA contained in special organelles known as mitochondria.

What is GRCh37?

The Human Genome Project set out to identify the sequences of these 25 distinct DNA entities (chromosomes 1 through 22, chromosomes X and Y, and the mitochondrial genome), collectively known as "the human genome". In February 2009, the Genome Reference Consortium (GRC) released "build 37" of the human genome, called GRCh37. In 2013, the GRC released a newer "build 38", called GRCh38.

Due to the complexity of DNA sequencing and genome assembly, the GRCh37 release included the following sequences:

  1. 24 "relatively complete" sequences for chromosomes 1 to 22, X and Y.

  2. A complete mitochondrial sequence.

  3. Several "unlocalized sequences". These are sequences that are known to originate from specific chromosomes, but their exact location within the chromosome is not known.

  4. Several "unplaced sequences". These are sequences that are known to originate from the human genome, but their chromosomal association is not known.

  5. Several "alternate loci". These are sequences that contain alternate representations of specific human regions.

In releasing these sequences, the GRC neither provided a canonical naming scheme nor imposed a particular ordering. This presents a problem in bioinformatics, because common file formats (SAM/BAM, VCF, GFF, BED, etc.) require a unique string identifier when referring to a particular sequence. Everything from read mappings to variants to genomic annotations (such as dbSNP or gene databases) needs to identify its genomic location by sequence name and coordinate. This freedom led different teams to adopt different conventions.

The "b37" conventions (by the 1000 Genomes Project Phase I)

The 1000 Genomes Project, in its first phase, used the following conventions, which are commonly referred to as "b37" (a term particularly popular among the GATK and IGV communities):

  1. The 24 "relatively complete" chromosomal sequences were named "1" to "22", "X" and "Y".

  2. The GRCh37 mitochondrial sequence was named "MT".

  3. The unlocalized sequences were named after their accession numbers, such as "GL000191.1", "GL000194.1", etc.

  4. The unplaced sequences were named after their accession numbers, such as "GL000211.1", "GL000241.1", etc.

  5. The alternate loci were not included in the b37 dataset.

These conventions (where chromosomes are called "1" to "22", "X", "Y" and "MT") are also followed by the Ensembl genome browser, NCBI dbSNP (in VCF files), Sanger COSMIC (in VCF files), and others, and are the preferred standard for new projects.

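The "hg19" conventions (by UCSC)

When GRCh37 was released, the UCSC genome browser team made the following adaptations to the sequences, and called the end result "hg19":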

  1. The 24 "relatively complete" chromosomal sequences were given the names "chr1" to "chr22", "chrX" and "chrY".

  2. The GRCh37 mitochondrial sequence was not copied over. Instead, the UCSC genome browser team copied an older mitochondrial sequence from the previous release ("build 36"), and gave it the name "chrM".

  3. The unlocalized sequences were given custom names such as "chr1_gl000191_random" and "chr4_gl000194_random".

  4. The unplaced sequences were given custom names such as "chrUn_gl000221" and "chrUn_gl000241".

  5. The alternate loci were given custom names such as "chr6_apd_hap1" and "chr4_ctg9_hap1".

Unfortunately, the use of a non-GRCh37 mitochondrial sequence makes hg19 incompatible with the actual GRCh37. Mappings or annotations that fall on the hg19 mitochondrial sequence cannot be easily transferred to the GRCh37/b37 mitochondrial sequence.

Despite the nonstandard sequence naming, the stale mitochondrial sequence, and the inclusion of alternate loci (which is sometimes undesirable for read mapping), hg19 has gained popularity due to its exposure via the UCSC genome browser, and is often the convention used by vendors when reporting exome enrichment kit coordinates.
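As a minimal sketch of what the naming differences mean in practice, the Python snippet below translates b37 names to hg19 names for the 24 primary chromosomes only. The function name `b37_to_hg19` and the lookup table are hypothetical illustrations, not part of any DNAnexus, UCSC, or 1000 Genomes tool. The sketch deliberately refuses to translate "MT", because the two conventions use different mitochondrial sequences, and it does not attempt the unlocalized or unplaced sequences, whose hg19 names cannot be derived from the accession number alone.

```python
# Hypothetical helper: translate b37 contig names to hg19 contig names.
# Only the 24 primary chromosomes map cleanly between the two conventions.
PRIMARY_CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]
B37_TO_HG19 = {name: "chr" + name for name in PRIMARY_CHROMOSOMES}


def b37_to_hg19(name):
    """Return the hg19 name for a b37 primary chromosome name.

    Raises ValueError for "MT" (hg19 "chrM" is a different sequence) and
    for any other name that has no simple one-to-one renaming.
    """
    if name == "MT":
        raise ValueError(
            "b37 'MT' and hg19 'chrM' are different sequences; "
            "renaming alone would not make the coordinates comparable"
        )
    if name not in B37_TO_HG19:
        raise ValueError("no simple hg19 equivalent for %r" % name)
    return B37_TO_HG19[name]


if __name__ == "__main__":
    print(b37_to_hg19("1"))
    print(b37_to_hg19("X"))
```

Raising an error for the mitochondrial sequence, rather than silently renaming it, is a design choice that keeps coordinates on the stale hg19 mitochondrial sequence from being mistaken for GRCh37 coordinates.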

The "b37+decoy" / "hs37d5" extensions (by the 1000 Genomes Project Phase II)

In its second phase, the 1000 Genomes Project extended the b37 dataset with additional sequences:

  • A human herpesvirus 4 type 1 sequence (named "NC_007605").

  • A "decoy" sequence derived from HuRef, human BAC and Fosmid clones, and NA12878 (named "hs37d5").

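In addition, the pseudo-autosomal regions (PAR) of chromosome Y have been masked out (replaced with "N"), so that the respective regions in chromosome X may be treated as diploid.

Collectively, these changes make this set of sequences well suited for read mapping and variant calling, as they decrease false positives, while remaining generally compatible with b37.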
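The masking of pseudo-autosomal regions can be illustrated with a short Python sketch. The function `mask_regions` and the toy sequence and coordinates are hypothetical; they are not the real chromosome Y sequence or the real PAR coordinates, and this is not the procedure the 1000 Genomes Project actually ran. The sketch only shows the idea of replacing given 1-based, inclusive intervals with "N" while keeping the sequence length (and therefore all downstream coordinates) unchanged.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each region replaced by 'N'.

    `regions` is an iterable of (start, end) pairs using 1-based,
    inclusive coordinates, as is common in genomic annotations.
    """
    bases = list(sequence)
    for start, end in regions:
        if not 1 <= start <= end <= len(bases):
            raise ValueError("region (%d, %d) is out of bounds" % (start, end))
        # Overwrite the interval in place; the total length does not change.
        bases[start - 1:end] = "N" * (end - start + 1)
    return "".join(bases)


if __name__ == "__main__":
    toy_chromosome = "ACGTACGTACGT"   # toy data, not real chromosome Y
    toy_par = [(1, 3), (10, 12)]      # toy intervals, not real PAR coordinates
    print(mask_regions(toy_chromosome, toy_par))
```

Because the masked bases are replaced rather than deleted, positions outside the masked intervals keep the same coordinates as in the unmasked sequence.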

The "Ion Torrent hg19"

The Torrent Suite software (which Ion Torrent makes available for its instruments) allows a particular human reference genome to be downloaded from the Ion Torrent servers. Ion Torrent calls it "hg19", but it differs from the UCSC hg19. In particular, it uses the UCSC naming conventions ("chr1" to "chr22", "chrX", "chrY", "chrM"), but replaces the stale UCSC hg19 mitochondrial sequence with the newer GRCh37 one. This invalidates the general rule that "chrM refers to the old mitochondrial sequence, and MT refers to the new one", because there is now a sequence named "chrM" that refers to the new mitochondrial sequence.
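Since the name "chrM" alone no longer identifies the mitochondrial sequence, one possible way to tell these references apart is to also look at the mitochondrial sequence length. The Python sketch below is a hedged illustration: the function `classify_reference` and its return labels are hypothetical, and the two lengths are facts that do not come from this page (16,571 bp for the UCSC hg19 "chrM" and 16,569 bp for the GRCh37 mitochondrial sequence). It assumes the reference is already known to be a build 37 derivative, since other builds may reuse the same names and lengths.

```python
# Lengths taken from outside this page (assumption stated in the lead-in).
HG19_CHRM_LENGTH = 16571     # older mitochondrial sequence used by UCSC hg19
GRCH37_MT_LENGTH = 16569     # mitochondrial sequence released with GRCh37


def classify_reference(contig_lengths):
    """Guess the build 37 convention from a {contig name: length} mapping.

    The mapping could come from a FASTA index or a BAM header, for example.
    Returns a descriptive label, or "unknown" if nothing matches.
    """
    if "1" in contig_lengths and "MT" in contig_lengths:
        # b37-style names; the decoy contig distinguishes the Phase II set.
        return "hs37d5" if "hs37d5" in contig_lengths else "b37"
    if "chr1" in contig_lengths and "chrM" in contig_lengths:
        mito_length = contig_lengths["chrM"]
        if mito_length == HG19_CHRM_LENGTH:
            return "UCSC hg19"
        if mito_length == GRCH37_MT_LENGTH:
            return "hg19 names with GRCh37 mitochondrial sequence"
    return "unknown"


if __name__ == "__main__":
    print(classify_reference({"chr1": 249250621, "chrM": 16569}))
```

A length check is only a heuristic; comparing sequence checksums would be a stricter test, but it requires access to the sequences themselves rather than just their names and lengths.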

Which human sequence should one use?

The 1000 Genomes Phase II (hs37d5) sequence is preferred when performing read mapping. It leads to better mapping quality due to the masking of the pseudo-autosomal regions in chromosome Y and the addition of the decoy sequences, while remaining compatible with b37, GATK, and IGV.

Last updated 3 years ago
