Ingesting Data

Overview of common use cases for data ingestion

Note: This is intended only for customers with a DNAnexus Apollo license. Contact sales@dnanexus.com for more information.

Phenotypic / Clinical Data Ingestion

Ingesting a novel small dataset

Small datasets are high-quality, predictable datasets with only a few logical entities, fewer than a hundred features (columns) per entity, and usually no more than a few hundred thousand examples (rows) in each entity. Such a dataset might hold the results of an analysis, a sample of a larger dataset, or data that is only available in limited quantity.

This type of dataset is a good way to get familiar with the data ingestion tools before moving on to a larger dataset, because managing, preparing, and ingesting the data can be done all at once.

For a small dataset, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. This ensures that the data is properly loaded into the database and that a dataset is created, so the data can be used with the Cohort Browser and various apps, and is available in a structured manner through dxdata for use in Jupyter or other command-line environments.
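Before submitting files to the loader, it can help to confirm that every column in a data file is described in the data dictionary. The following is a minimal sketch of such a check in plain Python, not part of any DNAnexus tool. It assumes the dictionary is a CSV with `entity` and `name` columns; the function name `undocumented_columns` and the sample content are hypothetical and used only for illustration.

```python
import csv
import io


def undocumented_columns(data_csv: str, dictionary_csv: str, entity: str) -> list:
    """Return data-file columns that have no row in the data dictionary."""
    # Collect the field names the dictionary declares for this entity.
    # Assumption: the dictionary has "entity" and "name" columns.
    declared = {
        row["name"]
        for row in csv.DictReader(io.StringIO(dictionary_csv))
        if row["entity"] == entity
    }
    # Only the header row of the data file is needed for this check.
    header = next(csv.reader(io.StringIO(data_csv)))
    return [col for col in header if col not in declared]


# Hypothetical example content; real files would be read from disk.
dictionary_csv = (
    "entity,name,type\n"
    "patient,patient_id,string\n"
    "patient,age,integer\n"
)
data_csv = "patient_id,age,sex\nP001,42,F\n"

missing = undocumented_columns(data_csv, dictionary_csv, "patient")
print(missing)
```

Here the `sex` column would be reported, because the sample dictionary does not describe it. Catching this kind of mismatch locally is usually quicker than discovering it during an ingestion run.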

Ingesting a novel large dataset

Large datasets are datasets of varying quality that span many logical entities, can have hundreds or thousands of features (columns), and can have millions of examples (rows) in each entity. These datasets can be extracts of EHR data, biobank data, large clinical datasets, core company data, or other large, mature datasets. Datasets of this size may conform to data standards or ontologies such as OMOP, SNOMED, or MedDRA, or be predictably structured, as UK Biobank data is.

These datasets often require more data engineering effort to outline the data structures and logical entities, and can require harmonization or cleansing before the ingestion process begins. Once the data is cleansed and structured, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. An incremental ingestion strategy, in which a few entities are loaded and verified at a time, is recommended to ensure iterative success and easier troubleshooting should issues arise. For ingestions of this magnitude, xVantage services are often used to help ensure an optimal experience.
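The incremental strategy can be pictured as a loop that ingests one small batch of entities at a time and stops at the first failure, so that any problem is confined to a known batch. The sketch below is illustrative only. The `ingest_batch` callback stands in for whatever actually launches an ingestion and is an assumption, as are the entity names and the simulated error.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def ingest_incrementally(
    batches: Sequence[List[str]],
    ingest_batch: Callable[[List[str]], None],
) -> Tuple[List[List[str]], Optional[List[str]]]:
    """Ingest batches in order; return (completed batches, failed batch or None)."""
    completed: List[List[str]] = []
    for batch in batches:
        try:
            ingest_batch(batch)
        except Exception as err:
            # Stop here so troubleshooting is limited to this batch.
            print(f"Stopped at batch {batch}: {err}")
            return completed, batch
        completed.append(batch)
    return completed, None


def fake_ingest(batch: List[str]) -> None:
    # Hypothetical stand-in for a real ingestion run; fails on one entity.
    if "lab_results" in batch:
        raise ValueError("unparseable date in lab_results")


batches = [["patient"], ["visit", "lab_results"], ["medication"]]
completed, failed = ingest_incrementally(batches, fake_ingest)
```

In this run the first batch completes, the second is reported as failed, and the third is never attempted. Keeping batches small is a design choice: it trades a few extra runs for a much narrower search when something goes wrong.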

Minor extension of existing datasets

Throughout the process of translational research, new data can become available or be generated. To make the data easier to use, a user may want to append it to an existing dataset for further use by themselves or their team. This type of data usually represents a single entity (or extends an entity that has already been ingested) and consists of no more than a few hundred features (columns) and no more than a few million examples (rows). To extend an existing dataset, the Dataset Extender app can be used to rapidly ingest delimited files and append them to the dataset with minimal configuration. The appended data can then be used with the Cohort Browser and various apps, and is available in a structured manner through dxdata for use in Jupyter or other command-line environments.
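As a rough pre-check, a delimited file can be profiled to see whether it falls within the size range described above. The sketch below uses only the Python standard library. The thresholds `MAX_FEATURES` and `MAX_ROWS` are assumed readings of "a few hundred" and "a few million", not documented product limits, and the function name is hypothetical.

```python
import csv
import io

MAX_FEATURES = 300        # assumption: one reading of "a few hundred" columns
MAX_ROWS = 3_000_000      # assumption: one reading of "a few million" rows


def profile_extension_file(handle, delimiter=","):
    """Count columns and data rows of a delimited file and compare to the limits."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)                 # first line holds the column names
    n_rows = sum(1 for _ in reader)       # remaining lines are data rows
    return {
        "features": len(header),
        "rows": n_rows,
        "fits": len(header) <= MAX_FEATURES and n_rows <= MAX_ROWS,
    }


# Hypothetical tab-delimited content; a real file would be opened from disk.
sample = io.StringIO("sample_id\tbiomarker\nS1\t0.4\nS2\t0.9\n")
profile = profile_extension_file(sample, delimiter="\t")
print(profile)
```

A file that does not fit these rough limits may be better handled with the approach described for large datasets.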