Guide for Ingesting a Simple Four Table Dataset

Overview

Generalized phenotypic data ingestion is done with an ingestion process that takes in well-described data in the form of a data dictionary file, a codings file (if needed), an optional entity dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create a dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.

Requirements

  • DNAnexus Apollo License

  • Access to the Data Model Loader app

  • A way to manipulate and create CSV files

  • Uploader or greater permissions in the project the app will be running

Example Raw Files

Guide

1.Identifying Your Data

1.1 Decide upon the set of data you want to use

For this dataset the following CSV files from the Example section above will be used: patient.csv, hospital.csv, encounter.csv, and test.csv. The data has a structure as follows:

ERD for the example data
  • Many patients will point to one hospital but some may be missing hospital information.

  • A patient may have none, one, or many encounters.

  • Each encounter may have none, one, or many tests.

1.2 Decide upon your main entity and main field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

  • All of the Patients that were in Hospital X and had an Encounter with Doctor Y

  • All of the Patients with a weight over A

  • All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).

1.3 Define all other entities

Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:

  • Hospital

  • Encounter

  • Test

2. Defining Your Files

While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.

Patient CSV (main entity)

patient_id

name

age

risk

date_of_birth

weight

resident_state_prov

title

hospital_id

1

John

29

h

1983-03-05

135.56035

KS

Mr

2

2

Sally

27

m

1951-08-12

101.22172

SC

Dr|Hon

2

3

Cassy

l

1943-08-10

248.04192

MO

1

2.1 Creating a data dictionary file and fill in entity and name for each field & 2.2 Define the primary_key_type.

Create a data_dictionary.csv file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.

Data Dictionary CSV

entity

name

primary_key_type

patient

patient_id

global

patient

name

patient

age

patient

risk

patient

date_of_birth

patient

weight

patient

resident_state_prov

patient

title

patient

hospital_id

2.3 For each field in each entity, determine: is_sparse_coding, is_multi_select, and type.

Now layer in the types for the different fields. For this example, there are the following groupings:

  • Integer: patient_id, age, hospital_id

  • Date: date_of_birth

  • Float: weight

  • String: name, risk, resident_state

    • Multi-select: title

Data Dictionary CSV

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

patient

title

yes

string

patient

hospital_id

integer

2.4 Build the codings.csvfile and determine coding_name.

Next layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:

Coding CSV

coding_name

code

meaning

parent_code

display_order

risk_level

h

High

risk_level

m

Medium

risk_level

l

Low

state_prov

US

United States

state_prov

CA

Canada

state_prov

KS

Kansas

US

state_prov

SC

South Carolina

US

state_prov

MO

Missouri

US

state_prov

OH

Ohio

US

state_prov

TX

Texas

US

state_prov

PA

Pennsylvania

US

state_prov

AL

Alabama

US

state_prov

YK

Yukon Territory

CA

state_prov

BC

British Columbia

CA

titles

Rev

Reverend

titles

Mr

Mr

titles

Dr

Doctor

titles

Hon

Honorable

titles

Mrs

Mrs

titles

Ms

Ms

Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv with the corresponding coding_name from the coding.csv.

Data Dictionary CSV

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

risk_level

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

state_prov

patient

title

yes

string

titles

patient

hospital_id

integer

2.5 Determine if desired data format is allowed.

Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:

  • Integer: Patient ID, Age, Hospital_id

  • String: Name

    • String Categorical: Risk

    • String Categorical Hierarchical: Resident State Province

    • String Categorical Multi-select: Title

  • Date: Date of Birth

  • Float: Weight

2.6 Define entity relationships and determine referenced_entity_field and relationship.

Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:

  • hospital to patient on hospital_id in a "many_to_one" relationship

  • encounter to patient on patient_id in a "many_to_one" relationship

  • test to encounter on encounter_id in a "many_to_one" relationship

Data Dictionary CSV

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

risk_level

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

state_prov

patient

title

yes

string

titles

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

encounter

encounter_id

local

integer

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

test

encounter_id

encounter:encounter_id

many_to_one

integer

2.7 Add additional descriptive information: title, description, units, and linkout & 2.8 Specify folder for cohort browser using folder_path.

Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.

If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.

If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.

Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.

Data Dictionary CSV

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

folder_path

title

description

units

linkout

patient

patient_id

global

integer

ID

Patient ID

patient

name

string

Patient

Patient

name of the patient

patient

age

integer

Patient>PII

Age

age at signup

patient

risk

string

risk_level

Patient

Risk Level

evaluated risk level at signup

https://ourinternalwebsite.com/resources/id=3948483

patient

date_of_birth

date

Patient>PII

DOB

date of birth

patient

weight

float

Patient

Weight

baseline weight taken at signup

lbs

patient

resident_state_prov

string

state_prov

Patient

Home State / Province

state of current residency

patient

title

yes

string

titles

Patient

Titles

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

ID

Hospital ID

encounter

encounter_id

local

integer

ID

Encounter ID

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

ID

Test ID

test

encounter_id

encounter:encounter_id

many_to_one

integer

Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use this coding as a guide for categorical values.

Note that this table only highlights one entity in totality and only a few fields from the other entities. The expectation is that you will now go through the remaining three files and fill in the data yourself.

2.9 Organize data into respective entities: data.csv(s).

Already completed prior to starting these steps.

2.10 Generate your entity details: entity_metadata.csv.

After that, the only step remaining is to generate the entity_metadata.csv to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title. Add details as follows:

Entity Metadata CSV

entity

entity_title

entity_label_singular

entity_label_plural

entity_description

patient

Patients

Patient

the patient who visited

encounter

Visit

Patient Visit

Patient Visits

each admission is logged as an individual visit

test

Lab Tests

Test

Tests

all tests performed during a visit or after but linked to the same visit

hospital

Hospital

Hospitals

3. Data Ingestion

3.1 Examples of completed files ready for ingestion.

At this point, not only do you have the raw CSV file above, but you have a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:

3.2 Running the Data Model Loader

With the files provided, configure the Data Model Loader and run the ingestion. See below for an example configuration.

The 4 raw example files are input into the Data CSVs input.

4. Data Exploration and Analysis

Once the data model loader is executed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the cohort browser:

The files created from data model loader
View of the Field Selector
Example Cohort Browser view of ingested Example Files