A Dataset is a DNAnexus platform object (a record of type dataset) that encapsulates both data and metadata. It maps the logical data structure (phenotypes, genotypes, etc.) to the physical layout of the underlying database(s) and metadata lookups, which lets you combine phenotypic or genotypic data from multiple databases in a single record.

Logically, a dataset is a set of structured data organized into Entities (tables) and Fields (columns), along with metadata such as field titles, field units, field coded values, entity relationships, and semantic concepts. It is typically used to represent phenotypic data, either by itself or in combination with linked assays such as genomic data or gene expression data.

Currently, the data is stored in Apollo's Spark Databases, although the physical structure of the databases may differ from the logical structure of the original entities and fields. You can view all of your datasets in the user interface by selecting Dataset from the Project menu.
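The Entity and Field structure described above can be pictured with a small Python sketch. It is illustrative only: the class names, attribute names, coded values, and `database.table.column` strings are assumptions made for this example, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Field:
    """One logical column plus the metadata a dataset keeps about it."""
    name: str
    title: str
    physical_column: str  # hypothetical "database.table.column" location
    units: Optional[str] = None
    coded_values: Dict[str, str] = field(default_factory=dict)

    def decode(self, raw: str) -> str:
        """Return the display value for a stored code, or the raw value if uncoded."""
        return self.coded_values.get(raw, raw)


@dataclass
class Entity:
    """One logical table: a named collection of fields."""
    name: str
    fields: Dict[str, Field] = field(default_factory=dict)

    def add(self, new_field: Field) -> None:
        self.fields[new_field.name] = new_field


# Made-up example data: two fields of one entity stored in different physical tables.
participant = Entity("participant")
participant.add(Field("sex", "Sex", "pheno_db.participant_0001.sex_code",
                      coded_values={"1": "Male", "2": "Female"}))
participant.add(Field("height", "Standing height", "pheno_db.participant_0002.height",
                      units="cm"))

print(participant.fields["sex"].decode("2"))
print(participant.fields["height"].physical_column)
```

The point of the sketch is the separation of concerns: callers work with logical names such as `participant.sex`, while the physical location and the coding are metadata that can change without affecting them.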

The dataset object is created during data ingestion. It is then used to translate between the physical data storage and the way the data is consumed, for example in cohort building (Cohort Browser), ad hoc analysis (Jupyter Notebooks via dxdata), application analysis, or results exploration (Association Browser).

Figure: A representation of a pheno-geno dataset with data split across multiple databases.

The fundamental dataset structure is a series of linked JSON files that flexibly describe the stored data and its relationships. Because the structure is not fixed in advance, the dataset can adapt to the data being brought to Apollo rather than forcing the data to conform to a predefined structure.
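As a rough illustration of what "linked JSON files" can mean, the sketch below resolves a root descriptor that refers to per-entity descriptors. The file names, keys (`entities`, `fields`, `links`), and helper functions are all hypothetical; the actual descriptor format used by the platform is not shown here and may differ substantially.

```python
import json

# Hypothetical descriptor documents, keyed by file name.
DOCUMENTS = {
    "dataset.json": json.dumps({
        "name": "example_dataset",
        "entities": ["participant.json", "sample.json"],
    }),
    "participant.json": json.dumps({
        "name": "participant",
        "fields": ["participant_id", "sex"],
    }),
    "sample.json": json.dumps({
        "name": "sample",
        "fields": ["sample_id", "participant_id"],
        "links": [{"field": "participant_id", "to": "participant.participant_id"}],
    }),
}


def resolve_dataset(root: str, documents: dict) -> dict:
    """Follow the entity references in the root document and inline them by name."""
    dataset = json.loads(documents[root])
    entities = (json.loads(documents[ref]) for ref in dataset["entities"])
    dataset["entities"] = {entity["name"]: entity for entity in entities}
    return dataset


def relationships(dataset: dict) -> list:
    """Collect (source field, target field) pairs declared by any entity."""
    pairs = []
    for entity in dataset["entities"].values():
        for link in entity.get("links", []):
            pairs.append((f"{entity['name']}.{link['field']}", link["to"]))
    return pairs


dataset = resolve_dataset("dataset.json", DOCUMENTS)
print(sorted(dataset["entities"]))
print(relationships(dataset))
```

In a layout like this, adding a new entity means adding one document and one reference to it, which is the kind of flexibility a non-predefined structure allows.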

Despite this flexibility, dedicated ingestion tools exist for certain predefined data structures, such as pVCF ingestion. To get the most out of ingested data, even when using flexible ingestion tools like the Data Model Loader, think through the typical use cases and the underlying data architecture first, and preprocess the data accordingly before ingestion.

Example uses of a dataset include: