Guide for Ingesting a Simple Four Table Dataset
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
Last updated
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
Last updated
Copyright 2024 DNAnexus
On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.
This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.
The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.
DNAnexus Apollo License
Access to the Data Model Loader app
A way to manipulate and create CSV files
Uploader or greater permissions in the project the app will be running
For this dataset the following CSV files from the Example section above will be used: patient.csv
, hospital.csv
, encounter.csv
, and test.csv
. The data has a structure as follows:
Many patients will point to one hospital but some may be missing hospital information.
A patient may have none, one, or many encounters.
Each encounter may have none, one, or many tests.
For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:
All of the Patients that were in Hospital X and had an Encounter with Doctor Y
All of the Patients with a weight over A
All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y
Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).
Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:
Hospital
Encounter
Test
While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.
The Patient CSV (main entity) file should be formatted as follows:
patient_id | name | age | risk | date_of_birth | weight | resident_state_prov | title | hospital_id |
1 | John | 29 | h | 1983-03-05 | 135.56035 | KS | Mr | 2 |
2 | Sally | 27 | m | 1951-08-12 | 101.22172 | SC | Dr|Hon | 2 |
3 | Cassy | l | 1943-08-10 | 248.04192 | MO | 1 |
Create a data_dictionary.csv
file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type |
patient | patient_id | global |
patient | name | |
patient | age | |
patient | risk | |
patient | date_of_birth | |
patient | weight | |
patient | resident_state_prov | |
patient | title | |
patient | hospital_id |
For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.
Layer in the types for the different fields. For this example, there are the following groupings:
Integer: patient_id, age, hospital_id
Date: date_of_birth
Float: weight
String: name, risk, resident_state
Multi-select: title
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | is_sparse_coding | is_multi_select | type |
patient | patient_id | global | integer | ||
patient | name | string | |||
patient | age | integer | |||
patient | risk | string | |||
patient | date_of_birth | date | |||
patient | weight | float | |||
patient | resident_state_prov | string | |||
patient | title | yes | string | ||
patient | hospital_id | integer |
For categorical fields, build the codings.csv
file and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:
The Coding CSV file should be formatted as follows:
coding_name | code | meaning | parent_code | display_order |
risk_level | h | High | ||
risk_level | m | Medium | ||
risk_level | l | Low | ||
state_prov | US | United States | ||
state_prov | CA | Canada | ||
state_prov | KS | Kansas | US | |
state_prov | SC | South Carolina | US | |
state_prov | MO | Missouri | US | |
state_prov | OH | Ohio | US | |
state_prov | TX | Texas | US | |
state_prov | PA | Pennsylvania | US | |
state_prov | AL | Alabama | US | |
state_prov | YK | Yukon Territory | CA | |
state_prov | BC | British Columbia | CA | |
titles | Rev | Reverend | ||
titles | Mr | Mr | ||
titles | Dr | Doctor | ||
titles | Hon | Honorable | ||
titles | Mrs | Mrs | ||
titles | Ms | Ms |
Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv
with the corresponding coding_name from the coding.csv
.
The Data Dictionary CSV should be formatted as follows:
entity | name | primary_key_type | is_sparse_coding | is_multi_select | type | coding_name |
patient | patient_id | global | integer | |||
patient | name | string | ||||
patient | age | integer | ||||
patient | risk | string | risk_level | |||
patient | date_of_birth | date | ||||
patient | weight | float | ||||
patient | resident_state_prov | string | state_prov | |||
patient | title | yes | string | titles | ||
patient | hospital_id | integer |
Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:
Integer: Patient ID, Age, Hospital_id
String: Name
String Categorical: Risk
String Categorical Hierarchical: Resident State Province
String Categorical Multi-select: Title
Date: Date of Birth
Float: Weight
Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:
hospital to patient on hospital_id in a "many_to_one" relationship
encounter to patient on patient_id in a "many_to_one" relationship
test to encounter on encounter_id in a "many_to_one" relationship
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name |
patient | patient_id | global | integer | |||||
patient | name | string | ||||||
patient | age | integer | ||||||
patient | risk | string | risk_level | |||||
patient | date_of_birth | date | ||||||
patient | weight | float | ||||||
patient | resident_state_prov | string | state_prov | |||||
patient | title | yes | string | titles | ||||
patient | hospital_id | hospital:hospital_id | many_to_one | integer | ||||
hospital | hospital_id | local | integer | |||||
encounter | encounter_id | local | integer | |||||
encounter | patient_id | patient:patient_id | many_to_one | integer | ||||
test | test_id | local | integer | |||||
test | encounter_id | encounter:encounter_id | many_to_one | integer |
Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.
If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.
If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.
Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name | folder_path | title | description | units | linkout |
patient | patient_id | global | integer | ID | Patient ID | ||||||||
patient | name | string | Patient | Patient | name of the patient | ||||||||
patient | age | integer | Patient>PII | Age | age at signup | ||||||||
patient | risk | string | risk_level | Patient | Risk Level | evaluated risk level at signup | |||||||
patient | date_of_birth | date | Patient>PII | DOB | date of birth | ||||||||
patient | weight | float | Patient | Weight | baseline weight taken at signup | lbs | |||||||
patient | resident_state_prov | string | state_prov | Patient | Home State / Province | state of current residency | |||||||
patient | title | yes | string | titles | Patient | Titles | |||||||
patient | hospital_id | hospital:hospital_id | many_to_one | integer | |||||||||
hospital | hospital_id | local | integer | ID | Hospital ID | ||||||||
encounter | encounter_id | local | integer | ID | Encounter ID | ||||||||
encounter | patient_id | patient:patient_id | many_to_one | integer | |||||||||
test | test_id | local | integer | ID | Test ID | ||||||||
test | encounter_id | encounter:encounter_id | many_to_one | integer |
Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use this coding as a guide for categorical values.
Note that this table only highlights one entity in totality and only a few fields from the other entities. The expectation is that you will now go through the remaining three files and fill in the data yourself.
After that, the only step remaining is to generate the entity_metadata.csv
to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title.
The Entity Metadata CSV should be formatted as follows:
entity | entity_title | entity_label_singular | entity_label_plural | entity_description |
patient | Patients | Patient | the patient who visited | |
encounter | Visit | Patient Visit | Patient Visits | each admission is logged as an individual visit |
test | Lab Tests | Test | Tests | all tests performed during a visit or after but linked to the same visit |
hospital | Hospital | Hospitals |
At this point, not only do you have the raw CSV file above, but you have a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:
With the files provided, configure the Data Model Loader and run the ingestion. See below for an example configuration.
The four raw example files are input into the Data CSVs input field.
Once the Data Model Loader job has completed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the cohort browser: