Ingesting data into a Spark-based database improves scalability and performance. On DNAnexus, ingested data can also be used with the Cohort Browser, the Association Browser, and various dataset-enabled tools.
Small datasets are high-quality, predictable datasets with only a few logical entities, each with fewer than a hundred features (columns) and usually no more than a few hundred thousand examples (rows). Such a dataset might hold the results of an analysis, a sample of a larger dataset, or data that is only available in limited quantity.
This type of dataset is a good way to get familiar with the data ingestion tools before moving on to a larger dataset, because managing, preparing, and ingesting it can be done all at once.
For a small dataset, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. This ensures that the data is loaded correctly into the database and that a dataset is created, so that the data can be used with the Cohort Browser and various apps, and is available in a structured form through dxdata in Jupyter or other command-line environments.
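Before running an ingestion, it can help to confirm that the data dictionary and the data files agree with each other. The following is a minimal sketch of such a check, not part of the Data Model Loader itself. It assumes the data dictionary is a CSV with `entity` and `name` columns and that each entity maps to one delimited file; the function name `check_dictionary` and the example entities and fields are hypothetical.

```python
import csv
import io


def check_dictionary(dictionary_csv, headers_by_entity):
    """Return a list of mismatches between a data dictionary and data file headers."""
    # Collect the field names the dictionary declares for each entity.
    # Assumption: the dictionary has "entity" and "name" columns.
    declared = {}
    for row in csv.DictReader(io.StringIO(dictionary_csv)):
        declared.setdefault(row["entity"], set()).add(row["name"])

    problems = []
    for entity, fields in sorted(declared.items()):
        if entity not in headers_by_entity:
            problems.append(f"{entity}: no data file supplied")
            continue
        headers = set(headers_by_entity[entity])
        for name in sorted(fields - headers):
            problems.append(f"{entity}.{name}: in dictionary but not in data file")
        for name in sorted(headers - fields):
            problems.append(f"{entity}.{name}: in data file but not in dictionary")
    return problems


# Hypothetical example inputs, for illustration only.
DICTIONARY = """entity,name,type
patient,patient_id,string
patient,age,integer
visit,visit_id,string
"""

HEADERS = {
    "patient": ["patient_id", "age", "sex"],
    "visit": ["visit_id"],
}

problems = check_dictionary(DICTIONARY, HEADERS)
for problem in problems:
    print(problem)
```

Here the `sex` column appears in the patient file but is not declared in the dictionary, so the check reports it. An empty list means the two inputs line up.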
Large datasets are datasets of varying quality that span many logical entities, with hundreds or thousands of features (columns) and millions of examples (rows) in each entity. These datasets can include extracts of the following:
large clinical datasets
core company data
other large, mature datasets
Datasets of this size may conform to ontologies such as OMOP, SNOMED, or MedDRA, or be predictably structured, as UK Biobank is. They often require more data engineering work to outline the data structures and logical entities, and can require harmonization or cleansing before the ingestion process begins.
Once the data is cleansed and structured, the Data Model Loader application can be used to ingest the data files along with a data dictionary and optional coding. A more incremental ingestion strategy is recommended, so that each step can be verified and issues are easier to troubleshoot. For ingestions of this magnitude, xVantage services are often used to help the process go smoothly.
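One way to picture an incremental strategy is to ingest a few entities at a time and stop at the first failure, so that the problem is isolated to a small, known stage. The sketch below illustrates that pattern only. The `ingest` callable is a hypothetical placeholder for whatever actually launches an ingestion, and the stage and entity names are made up.

```python
def ingest_incrementally(stages, ingest):
    """Run ingestion one stage at a time, stopping at the first failure.

    stages: list of (stage_name, entity_names) pairs, in ingestion order.
    ingest: hypothetical callable that ingests the given entities and
            raises an exception on failure.
    Returns (completed_stage_names, failure), where failure is None or a
    (stage_name, message) pair.
    """
    completed = []
    for name, entities in stages:
        try:
            ingest(entities)
        except Exception as error:
            # Stop here so earlier stages stay verified and later ones are not attempted.
            return completed, (name, str(error))
        completed.append(name)
    return completed, None


def fake_ingest(entities):
    """Stand-in for a real ingestion step; fails on one entity for demonstration."""
    if "lab_result" in entities:
        raise ValueError("lab_result: unparseable date in row 17")


# Hypothetical staging plan, for illustration only.
stages = [
    ("core", ["patient"]),
    ("encounters", ["visit"]),
    ("labs", ["lab_result"]),
    ("medications", ["prescription"]),
]

completed, failure = ingest_incrementally(stages, fake_ingest)
print(completed)
print(failure)
```

In this run the first two stages complete, the third fails, and the fourth is never attempted, which keeps troubleshooting focused on a single stage.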
Throughout the process of translational research, new data can become available or be generated. To make this data easier to use, the user may want to append it to an existing dataset. This type of data usually represents a single entity (or extends an entity that has already been ingested) and consists of no more than a few hundred features (columns) and no more than a few million examples (rows). To extend an existing dataset, the Dataset Extender app can be used to rapidly ingest delimited files and append them to the dataset with minimal configuration. The appended data can then be used with the Cohort Browser and various apps, and is available in a structured form through dxdata in Jupyter or other command-line environments.
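A quick look at the header of a delimited file can catch simple problems before it is appended. The following sketch is an illustration under stated assumptions, not a description of what the Dataset Extender requires: it assumes the new file should share a key column with the existing entity and should not reuse existing field names. The function `check_extension_file` and all field names are hypothetical.

```python
import csv
import io


def check_extension_file(delimited_text, key_field, existing_fields, delimiter=","):
    """Return a list of header problems in a delimited file meant to extend an entity."""
    # Only the first row (the header) is inspected.
    header = next(csv.reader(io.StringIO(delimited_text), delimiter=delimiter))

    problems = []
    if key_field not in header:
        problems.append(f"missing key column: {key_field}")
    for name in sorted({n for n in header if header.count(n) > 1}):
        problems.append(f"duplicate column: {name}")
    # Assumption: new columns should not collide with fields already in the entity.
    collisions = (set(header) & set(existing_fields)) - {key_field}
    for name in sorted(collisions):
        problems.append(f"column already exists in entity: {name}")
    return problems


# Hypothetical example inputs, for illustration only.
NEW_DATA = "patient_id\tage\tgenotype_score\n p001\t54\t0.82\n"
EXISTING_FIELDS = ["patient_id", "age", "sex"]

issues = check_extension_file(NEW_DATA, "patient_id", EXISTING_FIELDS, delimiter="\t")
for issue in issues:
    print(issue)
```

The example file repeats the existing `age` field, so the check flags it; the shared key column `patient_id` is expected and is not reported.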