Comment on page
Guide for Ingesting a Simple Four Table Dataset
Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.
On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.
This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.
The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which are then immediately accessible for use.
- DNAnexus Apollo License
- A way to manipulate and create CSV files
- Uploader or greater permissions in the project the app will be running
patient.csv
521B
Text
hospital.csv
288B
Text
encounter.csv
652B
Text
test.csv
443B
Text
For this dataset the following CSV files from the Example section above will be used:
patient.csv
, hospital.csv
, encounter.csv
, and test.csv
. The data has a structure as follows:
ERD for the example data
- Many patients will point to one hospital but some may be missing hospital information.
- A patient may have none, one, or many encounters.
- Each encounter may have none, one, or many tests.
For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:
- All of the Patients that were in Hospital X and had an Encounter with Doctor Y
- All of the Patients with a weight over A
- All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y
Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).
Since all of the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:
- Hospital
- Encounter
- Test
While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.
The Patient CSV (main entity) file should be formatted as follows:
patient_id | name | age | risk | date_of_birth | weight | resident_state_prov | title | hospital_id |
1 | John | 29 | h | 1983-03-05 | 135.56035 | KS | Mr | 2 |
2 | Sally | 27 | m | 1951-08-12 | 101.22172 | SC | Dr|Hon | 2 |
3 | Cassy | | l | 1943-08-10 | 248.04192 | MO | | 1 |
Create a
data_dictionary.csv
file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type |
patient | patient_id | global |
patient | name | |
patient | age | |
patient | risk | |
patient | date_of_birth | |
patient | weight | |
patient | resident_state_prov | |
patient | title | |
patient | hospital_id | |
For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.
Layer in the types for the different fields. For this example, there are the following groupings:
- Integer: patient_id, age, hospital_id
- Date: date_of_birth
- Float: weight
- String: name, risk, resident_state
- Multi-select: title
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | is_sparse_coding | is_multi_select | type |
patient | patient_id | global | | | integer |
patient | name | | | | string |
patient | age | | | | integer |
patient | risk | | | | string |
patient | date_of_birth | | | | date |
patient | weight | | | | float |
patient | resident_state_prov | | | | string |
patient | title | | | yes | string |
patient | hospital_id | | | | integer |
For categorical fields, build the
codings.csv
file and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and titles. For these first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column, and if it is a hierarchical field (e.g. state_prov), the parent code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:The Coding CSV file should be formatted as follows:
coding_name | code | meaning | parent_code | display_order |
risk_level | h | High | | |
risk_level | m | Medium | | |
risk_level | l | Low | | |
state_prov | US | United States | | |
state_prov | CA | Canada | | |
state_prov | KS | Kansas | US | |
state_prov | SC | South Carolina | US | |
state_prov | MO | Missouri | US | |
state_prov | OH | Ohio | US | |
state_prov | TX | Texas | US | |
state_prov | PA | Pennsylvania | US | |
state_prov | AL | Alabama | US | |
state_prov | YK | Yukon Territory | CA | |
state_prov | BC | British Columbia | CA | |
titles | Rev | Reverend | | |
titles | Mr | Mr | | |
titles | Dr | Doctor | | |
titles | Hon | Honorable | | |
titles | Mrs | Mrs | | |
titles | Ms | Ms | | |
Now that you created the coding file, go back and fill in the coding_name column in
data_dictionary.csv
with the corresponding coding_name from the coding.csv
.The Data Dictionary CSV should be formatted as follows:
entity | name | primary_key_type | is_sparse_coding | is_multi_select | type | coding_name |
patient | patient_id | global | | | integer | |
patient | name | | | | string | |
patient | age | | | | integer | |
patient | risk | | | | string | risk_level |
patient | date_of_birth | | | | date | |
patient | weight | | | | float | |
patient | resident_state_prov | | | | string | state_prov |
patient | title | | | yes | string | titles |
patient | hospital_id | | | | integer | |
Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:
- Integer: Patient ID, Age, Hospital_id
- String: Name
- String Categorical: Risk
- String Categorical Hierarchical: Resident State Province
- String Categorical Multi-select: Title
- Date: Date of Birth
- Float: Weight
Since everything looks as expected, now link the entities together. For Patient, since the relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle none cases where joins can't be made on all rows). The relationships is created by entering the the <entity>:<field name> into the referenced_entity_field and the relationship into the relationship. The relationships to create for the example dataset are:
- hospital to patient on hospital_id in a "many_to_one" relationship
- encounter to patient on patient_id in a "many_to_one" relationship
- test to encounter on encounter_id in a "many_to_one" relationship
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name |
patient | patient_id | global | | | | | integer | |
patient | name | | | | | | string | |
patient | age | | | | | | integer | |
patient | risk | | | | | | string | risk_level |
patient | date_of_birth | | | | | | date | |
patient | weight | | | | | | float | |
patient | resident_state_prov | | | | | | string | state_prov |
patient | title | | | | | yes | string | titles |
patient | hospital_id | | hospital:hospital_id | many_to_one | | | integer | |
hospital | hospital_id | local | | | | | integer | |
encounter | encounter_id | local | | | | | integer | |
encounter | patient_id | | patient:patient_id | many_to_one | | | integer | |
test | test_id | local | | | | | integer | |
test | encounter_id | | encounter:encounter_id | many_to_one | | | integer | |
Now that you have the core data relationships and types in places, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the cohort browser. For more information on each field see the detailed documentation.
If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.
If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.
Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.
The Data Dictionary CSV file should be formatted as follows:
entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name | folder_path | title | description | units | linkout |
patient | patient_id | global | | | | | integer | | ID | Patient ID | | | |
patient | name | | | | | | string | | Patient | Patient | name of the patient | | |
patient | age | | | | | | integer | | Patient>PII | Age | age at signup | | |
patient | risk | | | | | | string | risk_level | Patient | Risk Level | evaluated risk level at signup | | |
patient | date_of_birth | | | | | | date | | Patient>PII | DOB | date of birth | | |
patient | weight | | | | | | float | | Patient | Weight | baseline weight taken at signup | lbs | |
patient | resident_state_prov | | | | | | string | state_prov | Patient | Home State / Province | state of current residency | | |
patient | title | | | | | yes | string | titles | Patient | Titles | | | |
patient | hospital_id | | hospital:hospital_id | many_to_one | | | integer | | | | | | |
hospital | hospital_id | local | | | | | integer | | ID | Hospital ID | | | |
encounter | encounter_id | local | | | | | integer | | ID | Encounter ID | | | |
encounter | patient_id | | patient:patient_id | many_to_one | | | integer |