Guide for Ingesting a Simple Four Table Dataset

Learn to use the Data Model Leader app to ingest phenotypic data and create a dataset for use in Apollo.

On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

This walkthrough describes phenotypic data ingestion using a data dictionary file, a codings file if needed, an optional entity dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loaderapp, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loaderapp to create a database encapsulated by a Dataset record, which are then immediately accessible for use.

Requirements

DNAnexus Apollo License
Access to the Data Model Loader app
A way to manipulate and create CSV files
Uploader or greater permissions in the project the app will be running

Example Raw Files

Guide

Step 1. Identify Your Data

Specify the Data to Be Used

For this dataset the following CSV files from the Example section above will be used: patient.csv, hospital.csv, encounter.csv, and test.csv. The data has a structure as follows:

Many patients will point to one hospital but some may be missing hospital information.
A patient may have none, one, or many encounters.
Each encounter may have none, one, or many tests.

Specify the Main Entity and Main Field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

The Patients that were in Hospital X and had an Encounter with Doctor Y
The Patients with a weight over A
The Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).

Specify Additional Entities

Since the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities will be kept as:

Hospital
Encounter
Test

Step 2. Provide File Specifications

While the data ingestion key steps document guides the user through column by column for the different data files, this walkthrough will go through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.

The Patient CSV (main entity) file should be formatted as follows:

patient_id

name

age

risk

date_of_birth

weight

resident_state_prov

title

hospital_id

John

1983-03-05

135.56035

Sally

1951-08-12

101.22172

Dr|Hon

Cassy

1943-08-10

248.04192

Create a Data Dictionary File

Create a data_dictionary.csv file and fill in the fields for entity and name. Add global as the primary_key_type value for patient_id.

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

patient

patient_id

global

patient

name

patient

age

patient

risk

patient

date_of_birth

patient

weight

patient

resident_state_prov

patient

title

patient

hospital_id

Specify Field Characteristics

For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.

Layer in the types for the different fields. For this example, there are the following groupings:

Integer: patient_id, age, hospital_id
Date: date_of_birth
Float: weight
String: name, risk, resident_state
- Multi-select: title

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

patient

title

yes

string

patient

hospital_id

integer

Specify Coding Settings for Categorical Fields

For categorical fields, build the codings.csv file and determine coding_name. Layer in the coding for categorical fields and fields that you want to summarize using categorical chart types. For patient, create categorical selections for risk, resident_state_prov, and titles. For these, first generate a codings file from the dictionary as follows: the raw code in the data is populated in the code column, the text to display in the Cohort Browser is in the meaning column. If it is a hierarchical field such as state_prov, the parent_code points to the code of the parent for the entry. The display_order can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:

The Coding CSV file should be formatted as follows:

coding_name

code

meaning

parent_code

display_order

risk_level

High

risk_level

Medium

risk_level

Low

state_prov

United States

state_prov

Canada

state_prov

Kansas

state_prov

South Carolina

state_prov

Missouri

state_prov

Ohio

state_prov

Texas

state_prov

Pennsylvania

state_prov

Alabama

state_prov

Yukon Territory

state_prov

British Columbia

titles

Rev

Reverend

titles

Doctor

titles

Hon

Honorable

titles

Mrs

titles

Now that you created the coding file, go back and fill in the coding_name column in data_dictionary.csv with the corresponding coding_name from the coding.csv.

The Data Dictionary CSV should be formatted as follows:

entity

name

primary_key_type

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

risk_level

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

state_prov

patient

title

yes

string

titles

patient

hospital_id

integer

Specify Data Formats

Now you can validate that your fields will be ingested as their expected ingestion data type. The following are the data type to field mappings expected for this walkthrough:

Integer: Patient ID, Age, Hospital ID
String: Name
- String Categorical: Risk
- String Categorical Hierarchical: Resident State Province
- String Categorical Multi-select: Title
Date: Date of Birth
Float: Weight

Define Entity Relationships

Now, link the entities together by defining their relationships. For the Patient entity, use one-to-one or many-to-one relationships as appropriate. To create these links, enter the referenced entity and field in the format <entity>:<field name> in the referenced_entity_field column. Specify the type of relationship in the relationship column. If a join cannot be made for all rows, the system will automatically handle missing cases.

For this example dataset, create the following relationships:

Link hospital to patient using hospital_id with a many_to_one relationship.
Link encounter to patient using patient_id with a many_to_one relationship.
Link test to encounter using encounter_id with a many_to_one relationship.

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

patient

patient_id

global

integer

patient

name

string

patient

age

integer

patient

risk

string

patient

date_of_birth

date

patient

weight

float

patient

resident_state_prov

string

patient

title

yes

string

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

encounter

encounter_id

local

integer

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

test

encounter_id

encounter:encounter_id

many_to_one

integer

Add Additional Descriptive Information

Now that you have the core data relationships and types in place, it's time to add metadata about the field (title, description, units, and linkout) along with specifying the folder path desired to be shown in the field selector of the Cohort Browser. For more information on each field, see data dictionary details.

If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty.

If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show and leave the path empty for the fields to hide.

Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.

The Data Dictionary CSV file should be formatted as follows:

entity

name

primary_key_type

referenced_entity_field

relationship

is_sparse_coding

is_multi_select

type

coding_name

folder_path

title

description

units

linkout

patient

patient_id

global

integer

ID

Patient ID

patient

name

string

Patient

name of the patient

patient

age

integer

Patient>PII

Age

age at signup

patient

risk

string

risk_level

Patient

Risk Level

evaluated risk level at signup

https://ourinternalwebsite.com/resources/id=3948483

patient

date_of_birth

date

Patient>PII

DOB

date of birth

patient

weight

float

Patient

Weight

baseline weight taken at signup

lbs

patient

resident_state_prov

string

state_prov

Patient

Home State / Province

state of current residency

patient

title

yes

string

titles

Patient

Titles

patient

hospital_id

hospital:hospital_id

many_to_one

integer

hospital

hospital_id

local

integer

ID

Hospital ID

encounter

encounter_id

local

integer

ID

Encounter ID

encounter

patient_id

patient:patient_id

many_to_one

integer

test

test_id

local

integer

ID

Test ID

test

encounter_id

encounter:encounter_id

many_to_one

integer

Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use this coding as a guide for categorical values.

This table only highlights one entity in totality and only a few fields from the other entities. You will need to go through the remaining three files and fill in the data yourself.

Generate Entity Details

After that, the only step remaining is to generate the entity_metadata.csv to clean up the labels that the browser will use for the entities (refer to the detailed docs for more guidance). Blank spaces are left in the label fields if the field is a repeat of the entity_title.

The Entity Metadata CSV should be formatted as follows:

entity

entity_title

entity_label_singular

entity_label_plural

entity_description

patient

Patients

Patient

the patient who visited

encounter

Visit

Patient Visit

Patient Visits

each admission is logged as an individual visit

test

Lab Tests

Test

Tests

all tests performed during a visit or after but linked to the same visit

hospital

Hospital

Hospitals

Step 3. Ingest Data

Examples of Completed Files Ready for Ingestion

You now have a raw CSV file above, a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:

Running the Data Model Loader

With the files provided, configure the Data Model Loaderand run the ingestion. See below for an example configuration.

The four raw example files are input into the Data CSVs input field.

Step 4. Explore and Analyze Data

Once the Data Model Loader job has completed, you will see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the Cohort Browser:

Last updated 27 days ago

Was this helpful?