Guide for Ingesting a Simple Four Table Dataset
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
Overview
This walkthrough describes phenotypic data ingestion using a data dictionary file, a codings file if needed, an optional entity dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loaderapp, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata
.
The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loaderapp to create a database encapsulated by a Dataset record, which are then immediately accessible for use.
Requirements
DNAnexus Apollo License
Access to the Data Model Loader app
A way to manipulate and create CSV files
Uploader or greater permissions in the project the app will be running
Example Raw Files
Guide
Step 1. Identify Your Data
Specify the Data to Be Used
For this dataset the following CSV files from the Example section above will be used: patient.csv
, hospital.csv
, encounter.csv
, and test.csv
. The data has a structure as follows:

Many patients will point to one hospital but some may be missing hospital information.
A patient may have none, one, or many encounters.
Each encounter may have none, one, or many tests.
Specify the Main Entity and Main Field
For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:
The Patients that were in Hospital X and had an Encounter with Doctor Y
The Patients with a weight over A
The Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y
Given these types of questions and the data structure, the main entity will be the patient
. The patient data includes a patient_id
that is unique for the table and will be used as the key (main field).
Specify Additional Entities
Since the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:
Hospital
Encounter
Test
Step 2. Provide File Specifications
While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.
The Patient CSV (main entity) file should be formatted as follows:
patient_id
name
age
risk
date_of_birth
weight
resident_state_prov
title
hospital_id
1
John
29
h
1983-03-05
135.56035
KS
Mr
2
2
Sally
27
m
1951-08-12
101.22172
SC
Dr|Hon
2
3
Cassy
l
1943-08-10
248.04192
MO
1
Create a Data Dictionary File
Create a data_dictionary.csv
file and fill in the fields for entity
and name
. Add global
as the primary_key_type
value for patient_id
.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
patient
patient_id
global
patient
name
patient
age
patient
risk
patient
date_of_birth
patient
weight
patient
resident_state_prov
patient
title
patient
hospital_id
Specify Field Characteristics
For each field in each entity, indicate is_sparse_coding
and is_multi_select
, and specify the field's type
.
Layer in the types for the different fields. For this example, there are the following groupings:
Integer:
patient_id
,age
,hospital_id
Date:
date_of_birth
Float:
weight
String:
name
,risk
,resident_state
Multi-select:
title
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
patient
title
yes
string
patient
hospital_id
integer
Specify Coding Settings for Categorical Fields
For categorical fields, build the codings.csv
file and determine coding_name
. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For patient
, create categorical selections for risk
, resident_state_prov
, and titles
. For these, first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code
column, the text to display in the Cohort Browser is in the meaning
column, and if it is a hierarchical field such as state_prov
, the parent_code
points to the code
of the parent for the entry. The display_order
can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:
The Coding CSV file should be formatted as follows:
coding_name
code
meaning
parent_code
display_order
risk_level
h
High
risk_level
m
Medium
risk_level
l
Low
state_prov
US
United States
state_prov
CA
Canada
state_prov
KS
Kansas
US
state_prov
SC
South Carolina
US
state_prov
MO
Missouri
US
state_prov
OH
Ohio
US
state_prov
TX
Texas
US
state_prov
PA
Pennsylvania
US
state_prov
AL
Alabama
US
state_prov
YK
Yukon Territory
CA
state_prov
BC
British Columbia
CA
titles
Rev
Reverend
titles
Mr
Mr
titles
Dr
Doctor
titles
Hon
Honorable
titles
Mrs
Mrs
titles
Ms
Ms
Now that you created the coding file, go back and fill in the coding_name
column in data_dictionary.csv
with the corresponding coding_name
from the coding.csv
.
The Data Dictionary CSV should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
risk_level
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
state_prov
patient
title
yes
string
titles
patient
hospital_id
integer
Specify Data Formats
Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:
Integer: Patient ID, Age, Hospital ID
String: Name
String Categorical: Risk
String Categorical Hierarchical: Resident State Province
String Categorical Multi-select: Title
Date: Date of Birth
Float: Weight
Define Entity Relationships
Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships are created by entering the <entity>:<field name>
into the referenced_entity_field
and the relationship into the relationship
. The relationships to create for the example dataset are:
hospital
topatient
onhospital_id
in amany_to_one
relationshipencounter
topatient
onpatient_id
in amany_to_one
relationshiptest
toencounter
onencounter_id
in amany_to_one
relationship
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
patient
title
yes
string
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
encounter
encounter_id
local
integer
encounter
patient_id
patient:patient_id
many_to_one
integer
test
test_id
local
integer
test
encounter_id
encounter:encounter_id
many_to_one
integer
Add Additional Descriptive Information
Now that you have the core data relationships and types in place, it's time to add metadata about the field (title
, description
, units
, and linkout
) along with specifying the folder path desired to be shown in the field selector of the Cohort Browser. For more information on each field, see data dictionary details.
Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
folder_path
title
description
units
linkout
patient
patient_id
global
integer
ID
Patient ID
patient
name
string
Patient
Patient
name of the patient
patient
age
integer
Patient>PII
Age
age at signup
patient
risk
string
risk_level
Patient
Risk Level
evaluated risk level at signup
https://ourinternalwebsite.com/resources/id=3948483
patient
date_of_birth
date
Patient>PII
DOB
date of birth
patient
weight
float
Patient
Weight
baseline weight taken at signup
lbs
patient
resident_state_prov
string
state_prov
Patient
Home State / Province
state of current residency
patient
title
yes
string
titles
Patient
Titles
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
ID
Hospital ID
encounter
encounter_id
local
integer
ID
Encounter ID
encounter
patient_id
patient:patient_id
many_to_one
integer
test
test_id
local
integer
ID
Test ID
test
encounter_id
encounter:encounter_id
many_to_one
integer
Now that the patient
is set, repeat the steps for the other entities: hospital
, encounter
, and test
. Feel free to use this coding as a guide for categorical values.
Generate Entity Details
After that, the only step remaining is to generate the entity_metadata.csv
to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title
.
The Entity Metadata CSV should be formatted as follows:
entity
entity_title
entity_label_singular
entity_label_plural
entity_description
patient
Patients
Patient
the patient who visited
encounter
Visit
Patient Visit
Patient Visits
each admission is logged as an individual visit
test
Lab Tests
Test
Tests
all tests performed during a visit or after but linked to the same visit
hospital
Hospital
Hospitals
Step 3. Ingest Data
Examples of Completed Files Ready for Ingestion
At this point, not only do you have the raw CSV file above, but you have a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:
Running the Data Model Loader
With the files provided, configure the Data Model Loaderand run the ingestion. See below for an example configuration.


The four raw example files are input into the Data CSVs input field.
Step 4. Explore and Analyze Data
Once the Data Model Loader job has completed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the Cohort Browser:



Last updated
Was this helpful?