Overview of common use cases for data ingestion.
In Apollo, data ingestion is the process by which data is transformed and stored, an Apollo Dataset is created, and the data is made available to the end user for scalable, repeatable, reproducible data consumption. Data ingestion loads data into the Apollo database, which is backed by Parquet. When paired with a Spark-based analysis framework, this combination supports analysis scalability and performance at population scale (often, data representing hundreds of thousands or even millions of participants). Once the data has been ingested and a Dataset created, the data can be used quickly and repeatedly with various Platform tools, such as the Cohort Browser, dxdata, and other Dataset-enabled apps, applets, and workflows, for rapid, scalable analysis.
Phenotypic data generally refers to any data related to an individual's observable traits. The “individual” may be a participant, a sample, a project, or any desired primary focal point of a Dataset. Phenotypic data may span a wide range, from determinants, status, and measures of health to documentation of care delivery, such as clinical data, general practitioner’s (GP) notes, or even telemetrics. It may also contain molecular biomarker data converted to a phenotypic style for easier analysis and categorization. Because Apollo has a bring-your-own-schema structure, phenotypic data ingestion can support most data structures in which a single path leads from the main entity to each other entity (no circular references).
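The "single path, no circular references" constraint can be illustrated with a small check over a list of entity links. This is a minimal sketch, not part of any Apollo tool: the function name `check_single_paths`, the `(parent, child)` link representation, and the entity names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def check_single_paths(main_entity, links):
    """Return a list of problems found in a hypothetical entity-link list.

    `links` is a list of (parent, child) pairs. The structure passes when
    every entity is reached from `main_entity` by exactly one path, which
    also rules out circular references.
    """
    children = defaultdict(list)
    entities = {main_entity}
    for parent, child in links:
        children[parent].append(child)
        entities.update((parent, child))

    problems = []
    visited = set()
    stack = [main_entity]
    while stack:
        entity = stack.pop()
        if entity in visited:
            # Reached a second time: either a cycle or two distinct paths.
            problems.append(f"{entity}: reachable by more than one path")
            continue
        visited.add(entity)
        stack.extend(children[entity])

    # Anything never visited is disconnected from the main entity.
    for entity in sorted(entities - visited):
        problems.append(f"{entity}: not reachable from {main_entity}")
    return problems


# Hypothetical schema: participant -> sample -> assay_result, participant -> visit
schema = [("participant", "sample"), ("sample", "assay_result"), ("participant", "visit")]
print(check_single_paths("participant", schema))
```

An empty list means the hypothetical schema is a tree rooted at the main entity. Adding a link such as `("assay_result", "participant")` would introduce a circular reference and be reported.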
Small datasets are datasets with a high degree of quality and predictability, with only a few logical entities that each have fewer than a hundred features (columns) and usually no more than a few hundred thousand examples (rows). These datasets can represent the results of an analysis that has been performed, a sample of a larger dataset, or simply data of limited availability.
This type of dataset is a good choice for getting familiar with the data ingestion tools before moving on to a larger dataset, because managing, prepping, and ingesting the dataset can be done all at once.
For a small dataset, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. This ensures that the data is properly loaded into the database and that a Dataset is created, so that the data can then be used with the Cohort Browser and various apps, and is available in a structured manner through dxdata for use in Jupyter or other command line environments.
Large datasets are datasets of varying quality that span many logical entities, can have hundreds or thousands of features (columns), and can have millions of examples (rows) in each entity. These datasets can include extracts of the following:
- EHR data
- biobank data
- large clinical datasets
- core company data
- other large, mature datasets
Datasets of this size may conform to ontologies such as OMOP, SNOMED, or MedDRA, or be predictably structured, such as UK Biobank. These datasets often require greater data engineering consideration to outline the data structures and logical entities, and can require harmonization or cleansing before the ingestion process begins.
Once the data is cleansed and structured, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. A more incremental ingestion strategy is recommended to ensure iterative success and easier troubleshooting should issues arise. For ingestions of this magnitude, xVantage services are often engaged to help ensure an optimal experience.
When the data generated becomes too complex (e.g. multi-Entity data, data types requiring custom coding, extremely wide new Entities), or if large amounts of new data become available, the Dataset Extender app may no longer provide enough control for extending your Apollo Dataset. The new data being added may also contain multiple Entities' worth of data and may relate either to the main Entity or to an existing secondary Entity. To add this data to an existing Dataset, ingest the new data as if it were a novel Dataset using Data Model Loader, and then use the Clinical Dataset Merger to link the new clinical data to the existing Dataset. The newly generated Dataset will contain all of the original data together with the new Entities, for use with the Cohort Browser and various apps, and all of the data is available in a structured manner through dxdata for use in Jupyter.
Through the process of translational research, new data can become available or be generated. To facilitate smoother usage of the data, the user may want to append the data to an existing dataset for further use. This type of data is usually representative of only a single entity (or may be an extension of an existing ingested entity) and consists of no more than a few hundred features (columns) and no more than a few million examples (rows). To extend an existing dataset, the Dataset Extender app can be used to rapidly ingest delimited files and append them to an existing dataset with minimal configuration. The extended dataset can be used with the Cohort Browser and various apps, and is available in a structured manner through dxdata for use in Jupyter or other command line environments.
Molecular or assay data refers to the qualitative and/or quantitative representation of molecular features. Two possible examples are single nucleotide polymorphisms (SNPs) derived from whole exome sequencing (WES) of germline DNA, and bulk mRNA transcript expression counts derived from RNA-seq of tissue samples. Assay data tends to be well-defined by the community and often has standardized data structures and formats. Given this defined nature, we provide explicit support for the ingestion of commonly used assay types for stand-alone use in a novel Dataset and/or integration with existing Datasets to optimize data organization and query performance for downstream analysis. Datasets may contain none, one, or many assay instances, and assays may be of the same type or of different types. Representation of the various assay types is provided through the following assay models.
The “Genetic Variant” assay model provides support for genetic variant (SNP) resolution at the sample level. Population-level summaries are provided through the Cohort Browser for filter building and cohort validation. During ingestion, homozygous reference variants are intentionally filtered out to focus on non-reference variants that are annotated with structural and functional information. Population-scale SNP arrays, whole exome sequencing (WES), and even whole genome sequencing (WGS) are most commonly ingested into this format. xVantage services are currently required for setting up this type of assay Dataset.
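To make the homozygous-reference filtering concrete, the sketch below applies the same idea to a handful of VCF-style genotype strings. It is an illustration only and does not reproduce the actual ingestion pipeline: the function names, the `(sample_id, variant_id, genotype)` tuple layout, and the example records are hypothetical. Keeping missing calls such as `./.` is a choice made for this sketch, not a statement about how Apollo treats them.

```python
import re


def is_homozygous_reference(genotype):
    """True when every allele in a VCF-style genotype is the reference allele (0)."""
    alleles = re.split(r"[/|]", genotype)  # handles unphased "0/0" and phased "0|0"
    return all(allele == "0" for allele in alleles)


def keep_non_reference(calls):
    """Drop homozygous-reference calls from (sample_id, variant_id, genotype) tuples."""
    return [call for call in calls if not is_homozygous_reference(call[2])]


# Hypothetical genotype calls for two samples at two sites.
calls = [
    ("S1", "chr1:100:A:G", "0/0"),
    ("S1", "chr1:200:C:T", "0/1"),
    ("S2", "chr1:100:A:G", "1|1"),
    ("S2", "chr1:200:C:T", "0|0"),
]
for sample_id, variant_id, genotype in keep_non_reference(calls):
    print(sample_id, variant_id, genotype)
```

Only the heterozygous and homozygous-alternate calls remain, which is why a sample with no record at a site is interpreted as carrying the reference allele there.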
The “Molecular Expression” assay model provides support for the quantitative assessment of multiple features per sample. An example of this could be expression counts for all mRNA transcripts for each individual’s liver tissue sample in a patient population. Typically, input for this model is a matrix of counts, where the column headers are the individual sample IDs and the row names are the respective feature IDs. For a detailed explanation of the model, as well as accepted inputs and examples of how to ingest data using the model, please refer to the Molecular Expression Assay Loader application documentation.
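The matrix layout described above (sample IDs as column headers, feature IDs as row names) can be pictured by reshaping it into one row per feature and sample. The following is a rough sketch using only the Python standard library. The function name `matrix_to_long`, the tab delimiter, and the IDs in the example are assumptions for illustration, and the inputs actually accepted by the loader are defined in its own documentation.

```python
import csv
import io


def matrix_to_long(matrix_text, delimiter="\t"):
    """Turn a feature-by-sample matrix into (feature_id, sample_id, value) rows."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    header = next(reader)
    sample_ids = header[1:]  # the first header cell labels the feature column
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        feature_id, values = record[0], record[1:]
        if len(values) != len(sample_ids):
            raise ValueError(
                f"{feature_id}: expected {len(sample_ids)} values, got {len(values)}"
            )
        for sample_id, value in zip(sample_ids, values):
            rows.append((feature_id, sample_id, float(value)))
    return rows


# Hypothetical matrix: two features (rows) measured in two samples (columns).
example = "feature_id\tS1\tS2\nGENE_A\t10\t0\nGENE_B\t3.5\t7\n"
for row in matrix_to_long(example):
    print(row)
```

Checking that every row has as many values as there are sample columns catches ragged matrices early, before any ingestion is attempted.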