Guide for Ingesting a Simple Four Table Dataset

Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.

On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.

Requirements

  • DNAnexus Apollo License

  • Access to the Data Model Loader app

  • A way to manipulate and create CSV files

  • Uploader or greater permissions in the project the app will be running

Example Raw Files

Guide

Step 1. Identify Your Data

Specify the Data to Be Used

For this dataset the following CSV files from the Example section above will be used: patient.csv, hospital.csv, encounter.csv, and test.csv. The data has a structure as follows:

  • Many patients will point to one hospital but some may be missing hospital information.

  • A patient may have none, one, or many encounters.

  • Each encounter may have none, one, or many tests.

Specify the Main Entity and Main Field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

  • All of the Patients that were in Hospital X and had an Encounter with Doctor Y

  • All of the Patients with a weight over A

  • All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).

Specify Additional Entities

Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:

  • Hospital

  • Encounter

  • Test

Step 2. Provide File Specifications

While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.

The Patient CSV (main entity) file should be formatted as follows:

patient_id

name

age

risk

date_of_birth

weight

resident_state_prov

title

hospital_id

1

John

29

h

1983-03-05

135.56035

KS

Mr

2

2

Sally

27

m

1951-08-12

101.22172

SC

Dr|Hon

2

3

Cassy

l

1943-08-10

248.04192

MO

1

Create a Data Dictionary File

Create a data_dictionary.csv file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

patient

patient_id

global

patient

name

patient

age

patient

risk

patient

date_of_birth

patient

weight

patient

resident_state_prov

patient

title

patient

hospital_id

Specify Field Characteristics

For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.

Layer in the types for the different fields. For this example, there are the following groupings:

  • Integer: patient_id, age, hospital_id

  • Date: date_of_birth

  • Float: weight

  • String: name, risk, resident_state

    • Multi-select: title

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

patient

title

yes

string

patient

hospital_id

integer

Specify Coding Settings for Categorical Fields

For categorical fields, build the codings.csvfile and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:

The Coding CSV file should be formatted as follows:

coding_name

code

meaning

parent_code

display_order

risk_level

h

High

risk_level

m

Medium

risk_level

l

Low

state_prov

US

United States

state_prov

CA

Canada

state_prov

KS

Kansas

US

state_prov

SC

South Carolina

US

state_prov

MO

Missouri

US

state_prov

OH

Ohio

US

state_prov

TX

Texas

US

state_prov

PA

Pennsylvania

US

state_prov

AL

Alabama

US

state_prov

YK

Yukon Territory

CA

state_prov

BC

British Columbia

CA

titles

Rev

Reverend

titles

Mr

Mr

titles

Dr

Doctor

titles

Hon

Honorable

titles

Mrs

Mrs

titles

Ms

Ms

Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv with the corresponding coding_name from the coding.csv.

The Data Dictionary CSV should be formatted as follows:

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

risk_level

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

state_prov

patient

title

yes

string

titles

patient

hospital_id

integer

Specify Data Formats

Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:

  • Integer: Patient ID, Age, Hospital_id

  • String: Name

    • String Categorical: Risk

    • String Categorical Hierarchical: Resident State Province

    • String Categorical Multi-select: Title

  • Date: Date of Birth

  • Float: Weight

Define Entity Relationships

Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:

  • hospital to patient on hospital_id in a "many_to_one" relationship

  • encounter to patient on patient_id in a "many_to_one" relationship

  • test to encounter on encounter_id in a "many_to_one" relationship

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

risk_level

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

state_prov

patient

title

yes

string

titles

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

encounter

encounter_id

local

integer

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

test

encounter_id

encounter:encounter_id

many_to_one

integer

Add Additional Descriptive Information

Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.

If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.

If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.

Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

folder_path

title

description

units

linkout

patient

patient_id

global

integer

ID

Patient ID

patient

name

string

Patient

Patient

name of the patient

patient

age

integer

Patient>PII

Age

age at signup

patient

risk

string

risk_level

Patient

Risk Level

evaluated risk level at signup

patient

date_of_birth

date

Patient>PII

DOB

date of birth

patient

weight

float

Patient

Weight

baseline weight taken at signup

lbs

patient

resident_state_prov

string

state_prov

Patient

Home State / Province

state of current residency

patient

title

yes

string

titles

Patient

Titles

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

ID

Hospital ID

encounter

encounter_id

local

integer

ID

Encounter ID

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

ID

Test ID

test

encounter_id

encounter:encounter_id

many_to_one

integer

Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use this coding as a guide for categorical values.

Note that this table only highlights one entity in totality and only a few fields from the other entities. The expectation is that you will now go through the remaining three files and fill in the data yourself.

Generate Entity Details

After that, the only step remaining is to generate the entity_metadata.csv to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title.

The Entity Metadata CSV should be formatted as follows:

entity

entity_title

entity_label_singular

entity_label_plural

entity_description

patient

Patients

Patient

the patient who visited

encounter

Visit

Patient Visit

Patient Visits

each admission is logged as an individual visit

test

Lab Tests

Test

Tests

all tests performed during a visit or after but linked to the same visit

hospital

Hospital

Hospitals

Step 3. Ingest Data

Examples of Completed Files Ready for Ingestion

At this point, not only do you have the raw CSV file above, but you have a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:

Running the Data Model Loader

With the files provided, configure the Data Model Loader and run the ingestion. See below for an example configuration.

The four raw example files are input into the Data CSVs input field.

Step 4. Explore and Analyze Data

Once the Data Model Loader job has completed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the cohort browser:

Last updated