Links

Guide for Ingesting a Simple Four Table Dataset

Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.

Overview

This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.
The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.

Requirements

An Apollo license is required to access the features described in this walkthrough. Contact DNAnexus Sales for more information.
  • DNAnexus Apollo License
  • Access to the Data Model Loader app
  • A way to manipulate and create CSV files
  • Uploader or greater permissions in the project the app will be running

Example Raw Files

patient.csv
521B
Text
hospital.csv
288B
Text
encounter.csv
652B
Text
test.csv
443B
Text

Guide

Step 1. Identify Your Data

Specify the Data to Be Used

For this dataset the following CSV files from the Example section above will be used: patient.csv, hospital.csv, encounter.csv, and test.csv. The data has a structure as follows:
ERD for the example data
  • Many patients will point to one hospital but some may be missing hospital information.
  • A patient may have none, one, or many encounters.
  • Each encounter may have none, one, or many tests.

Specify the Main Entity and Main Field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:
  • All of the Patients that were in Hospital X and had an Encounter with Doctor Y
  • All of the Patients with a weight over A
  • All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y
Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).

Specify Additional Entities

Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:
  • Hospital
  • Encounter
  • Test

Step 2. Provide File Specifications

While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.
The Patient CSV (main entity) file should be formatted as follows:
patient_id
name
age
risk
date_of_birth
weight
resident_state_prov
title
hospital_id
1
John
29
h
1983-03-05
135.56035
KS
Mr
2
2
Sally
27
m
1951-08-12
101.22172
SC
Dr|Hon
2
3
Cassy
l
1943-08-10
248.04192
MO
1

Create a Data Dictionary File

Create a data_dictionary.csv file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
patient
patient_id
global
patient
name
patient
age
patient
risk
patient
date_of_birth
patient
weight
patient
resident_state_prov
patient
title
patient
hospital_id

Specify Field Characteristics

For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.
Layer in the types for the different fields. For this example, there are the following groupings:
  • Integer: patient_id, age, hospital_id
  • Date: date_of_birth
  • Float: weight
  • String: name, risk, resident_state
    • Multi-select: title
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
patient
title
yes
string
patient
hospital_id
integer

Specify Coding Settings for Categorical Fields

For categorical fields, build the codings.csvfile and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:
The Coding CSV file should be formatted as follows:
coding_name
code
meaning
parent_code
display_order
risk_level
h
High
risk_level
m
Medium
risk_level
l
Low
state_prov
US
United States
state_prov
CA
Canada
state_prov
KS
Kansas
US
state_prov
SC
South Carolina
US
state_prov
MO
Missouri
US
state_prov
OH
Ohio
US
state_prov
TX
Texas
US
state_prov
PA
Pennsylvania
US
state_prov
AL
Alabama
US
state_prov
YK
Yukon Territory
CA
state_prov
BC
British Columbia
CA
titles
Rev
Reverend
titles
Mr
Mr
titles
Dr
Doctor
titles
Hon
Honorable
titles
Mrs
Mrs
titles
Ms
Ms
Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv with the corresponding coding_name from the coding.csv.
The Data Dictionary CSV should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
risk_level
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
state_prov
patient
title
yes
string
titles
patient
hospital_id
integer

Specify Data Formats

Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:
  • Integer: Patient ID, Age, Hospital_id
  • String: Name
    • String Categorical: Risk
    • String Categorical Hierarchical: Resident State Province
    • String Categorical Multi-select: Title
  • Date: Date of Birth
  • Float: Weight

Define Entity Relationships

Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:
  • hospital to patient on hospital_id in a "many_to_one" relationship
  • encounter to patient on patient_id in a "many_to_one" relationship
  • test to encounter on encounter_id in a "many_to_one" relationship
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
risk_level
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
state_prov
patient
title
yes
string
titles
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
encounter
encounter_id
local
integer
encounter
patient_id
patient:patient_id
many_to_one
integer
test
test_id
local
integer
test
encounter_id
encounter:encounter_id
many_to_one
integer

Add Additional Descriptive Information

Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.
If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.
If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.
Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
folder_path
title
description
units
linkout
patient
patient_id
global
integer
ID
Patient ID
patient
name
string
Patient
Patient
name of the patient
patient
age
integer
Patient>PII
Age
age at signup
patient
risk
string
risk_level
Patient
Risk Level
evaluated risk level at signup
patient
date_of_birth
date
Patient>PII
DOB
date of birth
patient
weight
float
Patient
Weight
baseline weight taken at signup
lbs
patient
resident_state_prov
string
state_prov
Patient
Home State / Province
state of current residency
patient
title
yes
string
titles
Patient
Titles
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
ID
Hospital ID
encounter
encounter_id
local
integer
ID
Encounter ID