Overview of common use cases for data ingestion.
In Apollo, data ingestion is the process by which data is transformed and stored, an Apollo Dataset is created, and the data is made available to the end user for scalable, repeatable, reproducible data consumption. Data ingestion loads data into the Apollo database, which is backed by Parquet. When paired with a Spark-based analysis framework, this combination supports scalable, performant analysis at population scale, often on data representing hundreds of thousands, or even millions, of participants. Once the data has been ingested and a Dataset created, the data can quickly and repeatedly be used with various Platform tools, such as the Cohort Browser, dxdata, and other Dataset-enabled apps, applets, and workflows, for rapid and exceptionally scalable analysis.
Phenotypic data generally refers to any data related to an individual's observable traits. The “individual” may be a participant, a sample, a project, or any desired primary focal point of a Dataset. Phenotypic data may span a wide range, from determinants, status, and measures of health to documentation of care delivery, such as clinical data, general practitioner (GP) notes, or even telemetrics. It may also contain molecular biomarker data converted to a phenotypic style for easier analysis and categorization. As Apollo has a bring-your-own-schema structure, phenotypic data ingestion can support most data structures that have single paths from the main entity to other entities (no circular references).
Small datasets are datasets with a high degree of quality and predictability, containing only a few logical entities, each with fewer than a hundred features (columns) and usually no more than a few hundred thousand examples (rows). These datasets can represent the results of a completed analysis, a sample of a larger dataset, or simply data of limited availability.
This type of dataset is a great one to use for getting familiar with the data ingestion tools before moving on to a larger dataset, as managing, prepping, and ingesting it can be done all at once.
For a small dataset, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional codings. This ensures that the data is properly loaded into the database and that a Dataset is created, so the data can then be used with the Cohort Browser and various apps, and is available in a structured manner through dxdata for use in Jupyter or other command-line environments.
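As a sketch, a Data Model Loader run from the command line might look like the following. The app executable name and the input field names shown here are illustrative assumptions; confirm the actual input specification on the Platform before running:

```
# Illustrative only: confirm the app name and the exact input names
# with "dx run <app-name> --help" before running.
dx run data-model-loader \
  -i data_file=file-xxxx \
  -i data_dictionary=file-yyyy \
  -i codings=file-zzzz
```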
Large datasets are datasets of varying quality that span many logical entities, and can have hundreds or thousands of features (columns) and millions of examples (rows) in each entity. These datasets can include extracts of the following:
EHR data
biobank data
large clinical datasets
core company data
other large, mature datasets
Datasets of this size may conform to standards such as OMOP, SNOMED, or MedDRA, or be predictably structured, such as UK Biobank. These datasets often require greater data engineering consideration to outline the data structures and logical entities, and can require harmonization or cleansing before the ingestion process begins.
Once the data is cleansed and structured, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional codings. A more incremental ingestion strategy is recommended to ensure iterative success and easier troubleshooting should issues arise. For ingestions of this magnitude, customers often rely on help from the DNAnexus Professional Services team to ensure an optimal experience.
When the data generated becomes too complex (e.g., multi-Entity data, data types requiring custom coding, extremely wide new Entities), or when large amounts of new data become available, the Dataset Extender app may no longer provide enough control for extending your Apollo Dataset. The new data may span multiple Entities and may relate either to the main Entity or to an existing secondary Entity. To add this data to an existing Dataset, ingest the new data as if it were a novel Dataset using the Data Model Loader, and then use the Clinical Dataset Merger to link the new clinical data to the existing Dataset. The newly generated Dataset will contain all of the original data and the new Entities in the same Dataset for use with the Cohort Browser and various apps, with all of the data available in a structured manner through dxdata for use in Jupyter.
Through the process of translational research, new data can become available or be generated. To facilitate smoother usage of the data, the user may want to append it to an existing dataset for further use. This type of data is usually representative of a single entity (or may be an extension of an existing ingested entity) and consists of no more than a few hundred features (columns) and no more than a few million examples (rows). To extend an existing dataset, the Dataset Extender app can be used to rapidly ingest delimited files and append them to an existing dataset with minimal configuration, for use with the Cohort Browser and various apps, with the data available in a structured manner through dxdata for use in Jupyter or other command-line environments.
Molecular or assay data refers to the qualitative and/or quantitative representation of molecular features. For example, single nucleotide polymorphisms (SNPs) derived from whole exome sequencing (WES) of germline DNA, or bulk-mRNA transcript expression counts as derived from RNA-seq of tissue samples, are two possible types of data. Assay data tends to be well-defined by the community and often has standardized data structures and formats. Given this defined nature, we provide explicit support for the ingestion of commonly used assay types for stand-alone use in a novel Dataset and/or integration with existing Datasets to optimize data organization and query performance for downstream analysis. Datasets may contain zero, one, or many assay instances, and assays may be of the same type or of different types. Representations of the various assay types are provided below through the following assay models.
The “Genetic Variant” assay model provides support for genetic variant (SNP) resolution at the sample level. Population-level summaries are provided through the Cohort Browser for filter building and cohort validation. During ingestion, homozygous reference variants are intentionally filtered out to focus on non-reference variants, which are annotated with structural and functional information. Data from population-scale SNP arrays, whole exome sequencing (WES), and even whole genome sequencing (WGS) is most commonly ingested into this format. Assistance from the DNAnexus Professional Services team is currently required for setting up this type of assay Dataset.
The “Molecular Expression” assay model provides support for the quantitative assessment of multiple features per sample. An example of this could be expression counts for all mRNA transcripts for each individual’s liver tissue sample in a patient population. Typically, input for this model is a matrix of counts, where column headers are the individual sample IDs and row names are the respective feature IDs. For a detailed explanation of the model, as well as accepted inputs and examples of how to ingest data using the model, please refer to the Molecular Expression Assay Loader application documentation.
The Molecular Expression Assay Loader contains numerous checks and validations. Below are a few of the more common errors one might encounter.
The length of an “Assay Title” (assay_title) is restricted to fewer than 256 ASCII characters. Rerun the app using a title that conforms to these limitations.
The format of an “Assay Name” (assay_name) is restricted to fewer than 256 characters, containing only alphanumeric characters (a-zA-Z0-9), underscores, or dashes (“_” or “-”), with no spaces, and beginning with an alphabetic character (a-zA-Z). Rerun the app using a name that conforms to these limitations.
The format of a “Database” (database) is restricted to fewer than 256 characters, containing only alphanumeric characters (a-zA-Z0-9), underscores, or dashes (“_” or “-”), with no spaces, and beginning with an alphabetic character (a-zA-Z). Rerun the app using a database name that conforms to these limitations.
The format of a “Dataset Name” (dataset_name) is restricted to fewer than 256 characters, containing only alphanumeric characters (a-zA-Z0-9), underscores, or dashes (“_” or “-”), with no spaces, and beginning with an alphabetic character (a-zA-Z). Rerun the app using a dataset name that conforms to these limitations.
If a feature is provided, there should be no missing, NULL, or “NA” expression values. Only values in the range [0, inf) are allowed. Review your input data and confirm that values are present for all features across all samples.
The “auto” detect method failed to detect the underlying source data as one of the expected inputs (matrix, long, or manifest-based). Review example inputs and ensure your input data follows the specified convention.
The submitted file does not conform to the expected file type based on its file extension. Reformat the file to one of the expected file types specified in the documentation.
Example usage of the Molecular Expression Assay Loader application.
If launching the app from outside the Platform, first download and set up dx-toolkit. If running the app from the Platform (Cloud Workspace, JupyterLab, etc.), dx-toolkit is already installed and ready for use.
From the CLI, launch the app using dx run. For example:
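A hypothetical invocation is sketched below. The app executable name and the input name for the expression data file are illustrative assumptions; the assay_title, assay_name, database, and dataset_name inputs correspond to the fields described elsewhere in this document:

```
# Illustrative only: confirm the app name and input names
# with "dx run <app-name> --help" before running.
dx run molecular-expression-assay-loader \
  -i assay_title="Liver RNA-seq" \
  -i assay_name="liver_rnaseq" \
  -i database="expression_db" \
  -i dataset_name="liver_rnaseq_dataset" \
  -i expression_data=file-xxxx
```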
If launching the app from the GUI, log on to the DNAnexus Platform.
Navigate to the tool.
Follow instructions provided on the Run App dialogue to get to the Inputs page.
A Dataset created by the Molecular Expression Assay Loader app may be used in the Cohort Browser just like any other Dataset; however, only the phenotypic data will be available for cohort browsing and cohort criteria selection. Either double-click the Dataset (a record object on the Platform), or right-click it and select Explore Data from More Actions.
The Dataset created by the Molecular Expression Assay Loader app can be linked to an existing Dataset that contains phenotypic data, or both phenotypic and assay data, to create a combined new Dataset for use. This can be done by adding the Dataset as an input to the Assay Dataset Merger app. This allows for the creation of a rich clinico-omic Dataset where both the molecular information and the clinical data related to your study are linked together to accelerate analysis.
A Spark-enabled JupyterLab instance may be used to parse assay metadata and access molecular expression data, as referenced in a Molecular Expression Assay Loader Dataset. See the tutorial notebook in OpenBio.
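As a minimal sketch of that workflow (assuming a Spark-enabled JupyterLab session on the Platform with dxdata available; the record ID, entity name, and field names below are placeholders):

```python
import dxdata

# Load the Dataset by its record ID (placeholder shown)
dataset = dxdata.load_dataset(id="record-XXXX")

# Inspect the entities parsed from the Dataset metadata
print([entity.name for entity in dataset.entities])

# Retrieve selected fields from one entity as a Spark DataFrame
sample = dataset["sample"]  # entity name is a placeholder
df = sample.retrieve_fields(
    names=["sample_id"],       # field names are placeholders
    engine=dxdata.connect(),   # Spark connection to the Apollo database
)
df.show(5)
```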
There are two categories of input to consider when ingesting data using the Molecular Expression Assay Loader Application: feature contexts and data formats.
For the molecular expression model, the core unit to be measured is the “feature”. To represent a molecular expression assay in a Dataset, there are three terms used to describe a feature: feature type, feature ID type, and feature value type. The feature type refers to the general category of what is being measured. The feature ID type refers to a standardized naming method for how an individual feature is identified. The feature value type refers to the method of measurement. For practical purposes, the following is a list of accepted combinations:
Feature Type    Feature ID Type          Feature Value Type
mRNA            Either ENSG* or ENST*    RPKM (double), FPKM (double), FPKM-UQ (double), TPM (double), or count (integer)
Software programs and data suppliers provide data in different formats. DNAnexus aims to support common formats to reduce any data transformation burden prior to ingestion. The following formats are currently supported for simplified ingestion.
N x M matrix of N features (rows) by M samples (columns), where each feature and sample is unique. A header row must be provided as part of this format, including a column for the feature ID. For example:
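For instance, a tab-delimited matrix of two features by two samples might look like this (feature IDs, sample IDs, and values are illustrative):

```
feature_id	sample_01	sample_02
ENSG00000141510	10.5	8.2
ENSG00000157764	0.0	3.7
```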
(N x M) x 3 table of N features with M samples (rows) and 3 columns with headers, where the first column is the “feature_id,” the second column is the “sample_id,” and the third column is the “value.” Each row should contain a unique combination of feature ID and sample ID. For example:
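For instance, a long-format table covering two features and two samples (feature IDs, sample IDs, and values are illustrative):

```
feature_id	sample_id	value
ENSG00000141510	sample_01	10.5
ENSG00000141510	sample_02	8.2
ENSG00000157764	sample_01	0.0
ENSG00000157764	sample_02	3.7
```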
Two sets of files; one manifest file which describes the respective data file ID and associated sample, and the set of individual data files. The manifest file should have two columns with headers, “file_id” and “sample_id.” Individual files should each have two columns with the headers “feature_id” and “value.” For example:
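For instance (file IDs, sample IDs, feature IDs, and values are illustrative), a manifest file:

```
file_id	sample_id
file-aaaa	sample_01
file-bbbb	sample_02
```

and one of the individual data files it references:

```
feature_id	value
ENSG00000141510	10.5
ENSG00000157764	0.0
```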
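If a supplier delivers the matrix layout but the long (feature, sample, value) layout is preferred, the data can be reshaped locally before ingestion. Below is a minimal sketch using pandas (pandas here is an illustrative choice, not a Platform requirement; feature IDs, sample IDs, and values are made up):

```python
import io

import pandas as pd

# A tiny wide-format matrix: one row per feature, one column per sample
matrix_tsv = (
    "feature_id\tsample_01\tsample_02\n"
    "ENSG00000141510\t10.5\t8.2\n"
    "ENSG00000157764\t0.0\t3.7\n"
)
wide = pd.read_csv(io.StringIO(matrix_tsv), sep="\t")

# Melt into the (N x M) x 3 long layout: feature_id, sample_id, value
long_df = wide.melt(id_vars="feature_id", var_name="sample_id", value_name="value")
print(long_df)
```

The same reshape works in reverse with a pivot if the loader input needs to be a matrix instead.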