Guide for Ingesting a Simple Four Table Dataset
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
Last updated
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
Last updated
Copyright 2024 DNAnexus
On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.
This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.
The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.
DNAnexus Apollo License
Access to the Data Model Loader app
A way to manipulate and create CSV files
Uploader or greater permissions in the project the app will be running
For this dataset the following CSV files from the Example section above will be used: patient.csv
, hospital.csv
, encounter.csv
, and test.csv
. The data has a structure as follows:
Many patients will point to one hospital but some may be missing hospital information.
A patient may have none, one, or many encounters.
Each encounter may have none, one, or many tests.
For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:
All of the Patients that were in Hospital X and had an Encounter with Doctor Y
All of the Patients with a weight over A
All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y
Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).
Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:
Hospital
Encounter
Test
While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.
The Patient CSV (main entity) file should be formatted as follows:
patient_id
name
age
risk
date_of_birth
weight
resident_state_prov
title
hospital_id
1
John
29
h
1983-03-05
135.56035
KS
Mr
2
2
Sally
27
m
1951-08-12
101.22172
SC
Dr|Hon
2
3
Cassy
l
1943-08-10
248.04192
MO
1
Create a data_dictionary.csv
file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
patient
patient_id
global
patient
name
patient
age
patient
risk
patient
date_of_birth
patient
weight
patient
resident_state_prov
patient
title
patient
hospital_id
For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.
Layer in the types for the different fields. For this example, there are the following groupings:
Integer: patient_id, age, hospital_id
Date: date_of_birth
Float: weight
String: name, risk, resident_state
Multi-select: title
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
patient
title
yes
string
patient
hospital_id
integer
For categorical fields, build the codings.csv
file and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:
The Coding CSV file should be formatted as follows:
coding_name
code
meaning
parent_code
display_order
risk_level
h
High
risk_level
m
Medium
risk_level
l
Low
state_prov
US
United States
state_prov
CA
Canada
state_prov
KS
Kansas
US
state_prov
SC
South Carolina
US
state_prov
MO
Missouri
US
state_prov
OH
Ohio
US
state_prov
TX
Texas
US
state_prov
PA
Pennsylvania
US
state_prov
AL
Alabama
US
state_prov
YK
Yukon Territory
CA
state_prov
BC
British Columbia
CA
titles
Rev
Reverend
titles
Mr
Mr
titles
Dr
Doctor
titles
Hon
Honorable
titles
Mrs
Mrs
titles
Ms
Ms
Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv
with the corresponding coding_name from the coding.csv
.
The Data Dictionary CSV should be formatted as follows:
entity
name
primary_key_type
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
risk_level
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
state_prov
patient
title
yes
string
titles
patient
hospital_id
integer
Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:
Integer: Patient ID, Age, Hospital_id
String: Name
String Categorical: Risk
String Categorical Hierarchical: Resident State Province
String Categorical Multi-select: Title
Date: Date of Birth
Float: Weight
Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:
hospital to patient on hospital_id in a "many_to_one" relationship
encounter to patient on patient_id in a "many_to_one" relationship
test to encounter on encounter_id in a "many_to_one" relationship
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
patient
patient_id
global
integer
patient
name
string
patient
age
integer
patient
risk
string
risk_level
patient
date_of_birth
date
patient
weight
float
patient
resident_state_prov
string
state_prov
patient
title
yes
string
titles
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
encounter
encounter_id
local
integer
encounter
patient_id
patient:patient_id
many_to_one
integer
test
test_id
local
integer
test
encounter_id
encounter:encounter_id
many_to_one
integer
Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.
If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.
If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.
Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.
The Data Dictionary CSV file should be formatted as follows:
entity
name
primary_key_type
referenced_entity_field
relationship
is_sparse_coding
is_multi_select
type
coding_name
folder_path
title
description
units
linkout
patient
patient_id
global
integer
ID
Patient ID
patient
name
string
Patient
Patient
name of the patient
patient
age
integer
Patient>PII
Age
age at signup
patient
risk
string
risk_level
Patient
Risk Level
evaluated risk level at signup
patient
date_of_birth
date
Patient>PII
DOB
date of birth
patient
weight
float
Patient
Weight
baseline weight taken at signup
lbs
patient
resident_state_prov
string
state_prov
Patient
Home State / Province
state of current residency
patient
title
yes
string
titles
Patient
Titles
patient
hospital_id
hospital:hospital_id
many_to_one
integer
hospital
hospital_id
local
integer
ID
Hospital ID
encounter
encounter_id
local
integer
ID
Encounter ID
encounter
patient_id
patient:patient_id
many_to_one
integer
test
test_id
local
integer
ID
Test ID
test
encounter_id
encounter:encounter_id
many_to_one
integer
Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use this coding as a guide for categorical values.
Note that this table only highlights one entity in totality and only a few fields from the other entities. The expectation is that you will now go through the remaining three files and fill in the data yourself.
After that, the only step remaining is to generate the entity_metadata.csv
to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title.
The Entity Metadata CSV should be formatted as follows:
entity
entity_title
entity_label_singular
entity_label_plural
entity_description
patient
Patients
Patient
the patient who visited
encounter
Visit
Patient Visit
Patient Visits
each admission is logged as an individual visit
test
Lab Tests
Test
Tests
all tests performed during a visit or after but linked to the same visit
hospital
Hospital
Hospitals
At this point, not only do you have the raw CSV file above, but you have a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:
With the files provided, configure the Data Model Loader and run the ingestion. See below for an example configuration.
The four raw example files are input into the Data CSVs input field.
Once the Data Model Loader job has completed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the cohort browser: