Ingesting Data
Understand common use cases for the types of data you can ingest.
Data Ingestion
In Apollo, data ingestion transforms and stores data, creates an Apollo Dataset, and makes the data available for scalable, repeatable, and reproducible use.
The ingestion process loads data into the Apollo database, which uses Parquet as its backend. Combined with a Spark-based analysis framework, this setup enables high-performance, population-scale analysis, often involving data from hundreds of thousands to millions of participants.
Once ingested and turned into a dataset, the data is immediately available for use with specific Platform tools, such as the Cohort Browser, dxdata, and other dataset-enabled apps, applets, and workflows. This enables fast and highly scalable analysis.
Phenotypic / Clinical Data Ingestion
Phenotypic data refers to any data related to an individual's observable traits. The "individual" may be a participant, a sample, a project, or any desired primary focal point of a dataset.
Phenotypic data can include a wide range of information: determinants, status, and measures of health, documentation of care delivery (such as clinical data, general practitioner's notes, or telemetrics), and molecular biomarker data converted to a phenotypic style for easier analysis and categorization.
Because Apollo uses a bring-your-own-schema structure, phenotypic data ingestion can support most data structures with single paths from the main entity to other entities (no circular references).
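The single-path constraint can be checked mechanically before ingestion. As an illustrative sketch (the helper below is not part of any DNAnexus tool; the entity names are hypothetical), model each secondary entity's link to its parent entity and verify that every entity reaches the main entity without revisiting a node:

```python
def validate_entity_links(links, main_entity):
    """links maps each secondary entity to the entity it joins to.
    Returns True if every entity reaches main_entity via a single,
    acyclic path -- the structure phenotypic ingestion supports."""
    for entity in links:
        seen = set()
        current = entity
        while current != main_entity:
            if current in seen or current not in links:
                return False  # circular reference, or a dangling link
            seen.add(current)
            current = links[current]
    return True

# Hypothetical schema: visits join to participant, labs join to visits.
links = {"visit": "participant", "lab_result": "visit"}
print(validate_entity_links(links, "participant"))  # True

# A circular reference is rejected.
links_bad = {"a": "b", "b": "a"}
print(validate_entity_links(links_bad, "participant"))  # False
```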
Ingesting a Novel Small Dataset
Small datasets have a high degree of quality and predictability, with only a few logical entities, fewer than 100 features (columns), and typically no more than a few hundred thousand examples (rows) per entity. These datasets can represent a completed analysis, a sample of a larger dataset, or limited data availability.
This type of dataset is ideal for learning how to use data ingestion tools before working with larger datasets, as managing, preparing, and ingesting the data can be done all at once.
For a small dataset, use the Data Model Loader application to ingest the data files, along with a data dictionary and optional coding. This ensures that the data is properly ingested into the database and a dataset is created for use with the Cohort Browser, specific apps, and in a structured way through dxdata for use in Jupyter or other command line environments.
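For orientation, a data dictionary describes each field of each entity, including its type, any coding it uses, and how secondary entities link back to the main entity. The fragment below is a rough sketch only; the column set and values shown here are illustrative, so confirm the exact format against the Data Model Loader documentation for your Platform version:

```csv
entity,name,type,primary_key_type,coding_name,referenced_entity_field,relationship
participant,participant_id,string,global,,,
participant,sex,string,,sex_coding,,
visit,visit_id,string,local,,,
visit,participant_id,string,,,participant:participant_id,many_to_one
```

In this sketch, `participant` is the main entity, `visit` links back to it through `participant_id`, and `sex` references an optional coding file (`sex_coding`) that maps stored codes to display values.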
Ingesting a Novel Large Dataset
Large datasets span many logical entities, can have hundreds or thousands of features (columns), and can have millions of examples (rows) in each entity. These datasets can include extracts of:
EHR data
Biobank data
Large clinical datasets
Core company data
Other large, mature datasets
Datasets of this size may conform to standards and terminologies such as OMOP, SNOMED, or MedDRA, or be predictably structured, such as UK Biobank data. These datasets often require greater data engineering consideration to outline the data structures and logical entities, and may require harmonization or cleansing before the ingestion process begins.
Once the data is cleansed and structured, use the Data Model Loader application to ingest the data files, along with a data dictionary and optional coding. An incremental ingestion strategy is recommended to ensure iterative success and easier troubleshooting. For ingestions of this magnitude, customers often rely on help from the DNAnexus Professional Services team to ensure an optimal experience.
Large or Technical Clinical Data Additions
When data becomes too complex, or when large amounts of new data become available, the Dataset Extender app may not provide enough control for extending your Apollo dataset. Complex data might include multi-entity data, data types requiring custom coding, or extensive new entities, and the new entities may relate either to the main entity or to an existing secondary entity.
To add this data to an existing dataset:
Ingest the new data as if it were a novel dataset, using the Data Model Loader
Use the Clinical Dataset Merger to link the new clinical data to the existing dataset
The newly generated dataset contains all the original data and the new entities. This combined dataset is available for use with the Cohort Browser, specific apps, and through dxdata for use in Jupyter notebooks.
Minor Extensions of Existing Datasets
During translational research, new data may become available. You may want to append this data to an existing dataset for further use. This type of data typically represents a single entity (or may be an extension of an existing ingested entity) and consists of no more than a few hundred features (columns) and no more than a few million examples (rows).
To extend an existing dataset, use the Dataset Extender app to rapidly ingest delimited files and append them to an existing dataset with minimal configuration. The data is then available for use with the Cohort Browser, specific apps, and in a structured way through dxdata for use in Jupyter or other command line environments.
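A file prepared for the Dataset Extender is simply a delimited table whose rows can be joined to the existing main entity. A minimal sketch of preparing such a file with the standard library (the column names and the `eid` join key are hypothetical; match them to your own dataset's primary key):

```python
import csv

# Hypothetical new measurements keyed by the main entity's ID column.
rows = [
    {"eid": "1001", "hba1c_mmol_mol": "41", "measurement_date": "2024-03-02"},
    {"eid": "1002", "hba1c_mmol_mol": "38", "measurement_date": "2024-03-05"},
]

fieldnames = ["eid", "hba1c_mmol_mol", "measurement_date"]
with open("hba1c_extension.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```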
Molecular (Assay) Data Ingestion
Molecular or assay data refers to the qualitative and/or quantitative representation of molecular features. For example, single nucleotide polymorphisms (SNPs) derived from whole exome sequencing (WES) of germline DNA, or bulk-mRNA transcript expression counts from RNA-seq of tissue samples.
Assay data is often well-defined by the community and has standardized data structures and formats. DNAnexus provides explicit support for the ingestion of commonly used assay types for stand-alone use in a novel dataset and/or integration with existing datasets to optimize data organization and query performance for downstream analysis.
Datasets may contain zero, one, or many assays, and assays may be of the same type or of different types. The supported assay models are genetic variation, somatic variation, and molecular expression.
Genetic Variation Assay Model
The "Genetic Variant" assay model provides support for genetic variant (SNP) resolution at the sample level. Population-level summaries are provided through the Cohort Browser for filter building and cohort validation.
During ingestion, homozygous reference variants are intentionally filtered out to focus on non-reference variants that are annotated with structural and functional information. Population-scale SNP arrays, whole exome sequencing (WES), and whole genome sequencing (WGS) are most commonly ingested into this format.
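The effect of that filter can be sketched directly on VCF-style genotype (GT) calls. This is an illustrative sketch of the filtering rule, not the loader's actual implementation:

```python
def is_non_reference(gt):
    """True when a VCF-style GT call carries at least one ALT allele.
    Homozygous-reference calls (e.g. 0/0 or 0|0) and no-calls (./.)
    are dropped during ingestion; everything else is kept and annotated."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)

calls = ["0/0", "0/1", "1/1", "0|0", "./."]
kept = [gt for gt in calls if is_non_reference(gt)]
print(kept)  # ['0/1', '1/1']
```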
Assistance from DNAnexus Professional Services is required for setting up this type of assay dataset.
Somatic Variant Assay Model
The "Somatic Variant" assay model provides support for genetic variation as derived from somatic tissue and, if relevant, from paired normal tissue. Sets of individual-level tumor-only VCFs or paired tumor-normal VCFs are most commonly ingested into this data model.
Using the Cohort Browser or the dx-toolkit CLI command dx extract_assay somatic, you can build filters, create cohorts, and compare cohorts based on allelic-level Short Variants, Copy Number Variations (CNVs), Structural Variants, and Fusions.
For a detailed explanation of the model, as well as accepted inputs and examples of how to ingest data using the model, see Somatic Variant Assay Loader.
Molecular Expression Assay Model
The "Molecular Expression" assay model provides support for the quantitative assessment of multiple features per sample. For example, expression counts for all mRNA transcripts for each individual's liver tissue samples in a patient population.
Input for this model is typically a matrix of counts, where column headers are the individual sample IDs and row names are the respective feature IDs.
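As a sketch, such a matrix (sample IDs as column headers, feature IDs as row names) can be parsed as follows; the sample and transcript IDs here are made up for illustration:

```python
import csv
import io

# Illustrative counts matrix: columns are sample IDs, rows are feature IDs.
raw = """feature_id\tsample_A\tsample_B
ENST0000001\t150\t98
ENST0000002\t0\t12
"""

counts = {}
reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)   # ['feature_id', 'sample_A', 'sample_B']
samples = header[1:]
for row in reader:
    feature, values = row[0], [int(v) for v in row[1:]]
    counts[feature] = dict(zip(samples, values))

print(counts["ENST0000001"]["sample_B"])  # 98
```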