# Guide for Ingesting a Four Table Dataset

{% hint style="info" %}
On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Overview

This walkthrough describes phenotypic data ingestion using a [data dictionary file](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-dictionary), a [codings file](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#codings) if needed, an optional [entity dictionary](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#entity-dictionary) file, and accompanying [data CSV files](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-files). The files are loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)app, which validates and ingests the input CSV files to create an [Apollo dataset](https://documentation.dnanexus.com/developer/datasets). This Dataset is then accessible using the [Cohort Browser](https://documentation.dnanexus.com/user/cohort-browser), or using [JupyterLab](https://documentation.dnanexus.com/user/jupyter-notebooks/running-dxjupyterlab) and the `dxdata` Python SDK.

The following steps show how to organize your data into the required file sets following the [Data Ingestion Key Steps](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-ingestion-key-steps). These files can then be loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)app to create a database encapsulated by a [Dataset](https://documentation.dnanexus.com/developer/datasets) record, which are then immediately accessible for use.

## Requirements

* DNAnexus Apollo License
* Access to the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader) app
* A way to manipulate and create CSV files
* Uploader or greater permissions in the project the app runs

### Example Raw Files

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-340e06192fe8d625bd4fd414a7a489b8f3deed7e%2Fpatient.csv?alt=media&token=b60f799f-e2df-4646-83f7-2e292f4a7917>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-09b8e0be300706ae7619ff88b35a9e1de8b1d1a8%2Fhospital.csv?alt=media&token=d307f85d-a52e-421c-94ea-bcdca7ed325c>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-c69e2697fa2c1b36ab6adfe7044f2c00cbd82e89%2Fencounter.csv?alt=media&token=db77c894-1da5-4514-b500-39a0e10fca6f>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-2e20c889e154eca9d5fbdb9a57912f97914a0498%2Ftest.csv?alt=media&token=345f91d8-0616-4a05-9ce5-3bf9ae6182b5>" %}

## Guide

### Step 1. Identify Your Data

#### Specify the Data to Be Used

For this dataset the following CSV files from the [Example section](#example-raw-files) above are used: `patient.csv`, `hospital.csv`, `encounter.csv`, and `test.csv`. The data has a structure as follows:

![ERD for the example data](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-0f5dd6333452cdcafec487e729b7bb43de7bafe8%2Fimage.png?alt=media)

* Many patients point to one hospital but some may be missing hospital information.
* A patient may have none, one, or many encounters.
* Each encounter may have none, one, or many tests.

#### Specify the Main Entity and Main Field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

* The Patients that were in Hospital X and had an Encounter with Doctor Y
* The Patients with a weight over A
* The Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity is the `patient`. The patient data includes a `patient_id` that is unique for the table and is used as the key (main field).

#### Specify Additional Entities

Since the data already has a one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities are kept as:

* `Hospital`
* `Encounter`
* `Test`

### Step 2. Provide File Specifications

While the [data ingestion key steps](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-ingestion-key-steps) document guides the user through column by column for the different data files, this walkthrough goes through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.

The Patient CSV (main entity) file should be formatted as follows:

| `patient_id` | `name` | `age` | `risk` | `date_of_birth` | `weight`  | `resident_state_prov` | `title`      | `hospital_id` |
| ------------ | ------ | ----- | ------ | --------------- | --------- | --------------------- | ------------ | ------------- |
| 1            | John   | 29    | h      | 1983-03-05      | 135.56035 | KS                    | Mr           | 2             |
| 2            | Sally  | 27    | m      | 1951-08-12      | 101.22172 | SC                    | Dr\|Hon      | 2             |
| 3            | Diane  | 33    | l      | 1991-09-30      | 126.9032  | OH                    | Mrs\|Hon\|Dr | 1             |

#### Create a Data Dictionary File

Create a `data_dictionary.csv` file and fill in the fields for `entity` and `name`. Add `global` as the `primary_key_type` value for `patient_id`.

The Data Dictionary CSV file should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` |
| --------- | --------------------- | ------------------ |
| `patient` | `patient_id`          | `global`           |
| `patient` | `name`                |                    |
| `patient` | `age`                 |                    |
| `patient` | `risk`                |                    |
| `patient` | `date_of_birth`       |                    |
| `patient` | `weight`              |                    |
| `patient` | `resident_state_prov` |                    |
| `patient` | `title`               |                    |
| `patient` | `hospital_id`         |                    |

#### Specify Field Characteristics

For each field in each entity, indicate `is_sparse_coding` and `is_multi_select`, and specify the field's `type`.

Layer in the types for the different fields. For this example, there are the following groupings:

* **Integer:** `patient_id`, `age`, `hospital_id`
* **Date:** `date_of_birth`
* **Float:** `weight`
* **String:** `name`, `risk`, `resident_state`
  * **Multi-select:** `title`

The Data Dictionary CSV file should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` | `is_sparse_coding` | `is_multi_select` | `type`    |
| --------- | --------------------- | ------------------ | ------------------ | ----------------- | --------- |
| `patient` | `patient_id`          | `global`           |                    |                   | `integer` |
| `patient` | `name`                |                    |                    |                   | `string`  |
| `patient` | `age`                 |                    |                    |                   | `integer` |
| `patient` | `risk`                |                    |                    |                   | `string`  |
| `patient` | `date_of_birth`       |                    |                    |                   | `date`    |
| `patient` | `weight`              |                    |                    |                   | `float`   |
| `patient` | `resident_state_prov` |                    |                    |                   | `string`  |
| `patient` | `title`               |                    |                    | `yes`             | `string`  |
| `patient` | `hospital_id`         |                    |                    |                   | `integer` |

#### Specify Coding Settings for Categorical Fields

For categorical fields, build the `codings.csv` file and determine `coding_name`. Layer in the coding for categorical fields and fields that you want to summarize using [categorical chart types](https://documentation.dnanexus.com/user/cohort-browser/chart-types). For `patient`, create categorical selections for `risk`, `resident_state_prov`, and `titles`. For these, first generate a codings file from the dictionary as follows: the raw code in the data is populated in the `code` column, the text to display in the Cohort Browser is in the `meaning` column. If it is a hierarchical field such as `state_prov`, the `parent_code` points to the `code` of the parent for the entry. The `display_order` can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:

The Coding CSV file should be formatted as follows:

| `coding_name` | `code` | `meaning`        | `parent_code` | `display_order` |
| ------------- | ------ | ---------------- | ------------- | --------------- |
| `risk_level`  | h      | High             |               |                 |
| `risk_level`  | m      | Medium           |               |                 |
| `risk_level`  | l      | Low              |               |                 |
| `state_prov`  | US     | United States    |               |                 |
| `state_prov`  | CA     | Canada           |               |                 |
| `state_prov`  | KS     | Kansas           | US            |                 |
| `state_prov`  | SC     | South Carolina   | US            |                 |
| `state_prov`  | MO     | Missouri         | US            |                 |
| `state_prov`  | OH     | Ohio             | US            |                 |
| `state_prov`  | TX     | Texas            | US            |                 |
| `state_prov`  | PA     | Pennsylvania     | US            |                 |
| `state_prov`  | AL     | Alabama          | US            |                 |
| `state_prov`  | YK     | Yukon Territory  | CA            |                 |
| `state_prov`  | BC     | British Columbia | CA            |                 |
| `titles`      | Rev    | Reverend         |               |                 |
| `titles`      | Mr     | Mr               |               |                 |
| `titles`      | Dr     | Doctor           |               |                 |
| `titles`      | Hon    | Honorable        |               |                 |
| `titles`      | Mrs    | Mrs              |               |                 |
| `titles`      | Ms     | Ms               |               |                 |

After you created the coding file, go back and fill in the `coding_name` column in `data_dictionary.csv` with the corresponding `coding_name` from the `coding.csv`.

The Data Dictionary CSV should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` |
| --------- | --------------------- | ------------------ | ------------------ | ----------------- | --------- | ------------- |
| `patient` | `patient_id`          | `global`           |                    |                   | `integer` |               |
| `patient` | `name`                |                    |                    |                   | `string`  |               |
| `patient` | `age`                 |                    |                    |                   | `integer` |               |
| `patient` | `risk`                |                    |                    |                   | `string`  | `risk_level`  |
| `patient` | `date_of_birth`       |                    |                    |                   | `date`    |               |
| `patient` | `weight`              |                    |                    |                   | `float`   |               |
| `patient` | `resident_state_prov` |                    |                    |                   | `string`  | `state_prov`  |
| `patient` | `title`               |                    |                    | `yes`             | `string`  | `titles`      |
| `patient` | `hospital_id`         |                    |                    |                   | `integer` |               |

#### Specify Data Formats

You can validate that your fields are ingested as their expected [ingestion data type](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/ingestion-data-types). The following are the data type to field mappings expected for this walkthrough:

* **Integer:** Patient ID, Age, Hospital ID
* **String:** Name
  * **String Categorical:** Risk
  * **String Categorical Hierarchical**: Resident State Province
  * **String Categorical Multi-select**: Title
* **Date:** Date of Birth
* **Float:** Weight

#### Define Entity Relationships

Link the entities together by defining their relationships. For the Patient entity, use one-to-one or many-to-one relationships as appropriate. To create these links, enter the referenced entity and field in the format `<entity>:<field name>` in the `referenced_entity_field` column. Specify the type of relationship in the `relationship` column. If a join cannot be made for all rows, the system automatically handles missing cases.

For this example dataset, create the following relationships:

* Link `hospital` to `patient` using `hospital_id` with a `many_to_one` relationship.
* Link `encounter` to `patient` using `patient_id` with a `many_to_one` relationship.
* Link `test` to `encounter` using `encounter_id` with a `many_to_one` relationship.

The Data Dictionary CSV file should be formatted as follows:

| `entity`    | `name`                | `primary_key_type` | `referenced_entity_field` | `relationship` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` |
| ----------- | --------------------- | ------------------ | ------------------------- | -------------- | ------------------ | ----------------- | --------- | ------------- |
| `patient`   | `patient_id`          | `global`           |                           |                |                    |                   | `integer` |               |
| `patient`   | `name`                |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `age`                 |                    |                           |                |                    |                   | `integer` |               |
| `patient`   | `risk`                |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `date_of_birth`       |                    |                           |                |                    |                   | `date`    |               |
| `patient`   | `weight`              |                    |                           |                |                    |                   | `float`   |               |
| `patient`   | `resident_state_prov` |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `title`               |                    |                           |                |                    | `yes`             | `string`  |               |
| `patient`   | `hospital_id`         |                    | `hospital:hospital_id`    | `many_to_one`  |                    |                   | `integer` |               |
| `hospital`  | `hospital_id`         | `local`            |                           |                |                    |                   | `integer` |               |
| `encounter` | `encounter_id`        | `local`            |                           |                |                    |                   | `integer` |               |
| `encounter` | `patient_id`          |                    | `patient:patient_id`      | `many_to_one`  |                    |                   | `integer` |               |
| `test`      | `test_id`             | `local`            |                           |                |                    |                   | `integer` |               |
| `test`      | `encounter_id`        |                    | `encounter:encounter_id`  | `many_to_one`  |                    |                   | `integer` |               |

#### Add Additional Descriptive Information

After you have the core data relationships and types in place, it's time to add metadata about the field (`title`, `description`, `units`, and `linkout`) along with specifying the folder path desired to be shown in the field selector of the Cohort Browser. For more information on each field, see [data dictionary details](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-dictionary).

{% hint style="info" %}
If you would like the folder structure to mirror the entity structure exactly, leave the `folder_path` field empty.

If you would like to hide certain fields from the field selector, populate the `folder_path` field for only the fields you would like to show and leave the path empty for the fields to hide.
{% endhint %}

Instead of following the examples generally set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.

The Data Dictionary CSV file should be formatted as follows:

| `entity`    | `name`                | `primary_key_type` | `referenced_entity_field` | `relationship` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` | `folder_path` | `title`                 | `description`                   | `units` | `linkout`                                             |
| ----------- | --------------------- | ------------------ | ------------------------- | -------------- | ------------------ | ----------------- | --------- | ------------- | ------------- | ----------------------- | ------------------------------- | ------- | ----------------------------------------------------- |
| `patient`   | `patient_id`          | `global`           |                           |                |                    |                   | `integer` |               | `ID`          | `Patient ID`            |                                 |         |                                                       |
| `patient`   | `name`                |                    |                           |                |                    |                   | `string`  |               | `Patient`     | `Patient`               | name of the patient             |         |                                                       |
| `patient`   | `age`                 |                    |                           |                |                    |                   | `integer` |               | `Patient>PII` | `Age`                   | age at signup                   |         |                                                       |
| `patient`   | `risk`                |                    |                           |                |                    |                   | `string`  | `risk_level`  | `Patient`     | `Risk Level`            | evaluated risk level at signup  |         | `https://ourinternalwebsite.com/resources/id=3948483` |
| `patient`   | `date_of_birth`       |                    |                           |                |                    |                   | `date`    |               | `Patient>PII` | `DOB`                   | date of birth                   |         |                                                       |
| `patient`   | `weight`              |                    |                           |                |                    |                   | `float`   |               | `Patient`     | `Weight`                | baseline weight taken at signup | `lbs`   |                                                       |
| `patient`   | `resident_state_prov` |                    |                           |                |                    |                   | `string`  | `state_prov`  | `Patient`     | `Home State / Province` | state of current residency      |         |                                                       |
| `patient`   | `title`               |                    |                           |                |                    | `yes`             | `string`  | `titles`      | `Patient`     | `Titles`                |                                 |         |                                                       |
| `patient`   | `hospital_id`         |                    | `hospital:hospital_id`    | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |
| `hospital`  | `hospital_id`         | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Hospital ID`           |                                 |         |                                                       |
| `encounter` | `encounter_id`        | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Encounter ID`          |                                 |         |                                                       |
| `encounter` | `patient_id`          |                    | `patient:patient_id`      | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |
| `test`      | `test_id`             | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Test ID`               |                                 |         |                                                       |
| `test`      | `encounter_id`        |                    | `encounter:encounter_id`  | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |

After the `patient` is set, repeat the steps for the other entities: `hospital`, `encounter`, and `test`. Feel free to use this coding as a guide for categorical values.

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-750e5e1fa93582511a5175f8c41e5a54191685da%2Fcoding.csv?alt=media&token=6148f63c-810f-4c49-b3ae-62a2871ea62e>" %}
Sample Coding CSV
{% endfile %}

{% hint style="info" %}
This table only highlights one entity in totality and only a few fields from the other entities. You need to go through the remaining three files and fill in the data yourself.
{% endhint %}

#### Generate Entity Details

After that, the only step remaining is to generate the `entity_metadata.csv` to clean up the labels that the browser uses for the entities (refer to the [detailed docs](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-file-inputs-data-model-loader) for more guidance). Blank spaces are left in the label fields if the field is a repeat of the `entity_title`.

The Entity Metadata CSV should be formatted as follows:

| `entity`    | `entity_title` | `entity_label_singular` | `entity_label_plural` | `entity_description`                                                     |
| ----------- | -------------- | ----------------------- | --------------------- | ------------------------------------------------------------------------ |
| `patient`   | `Patients`     | `Patient`               |                       | the patient who visited                                                  |
| `encounter` | `Visit`        | `Patient Visit`         | `Patient Visits`      | each admission is logged as an individual visit                          |
| `test`      | `Lab Tests`    | `Test`                  | `Tests`               | all tests performed during a visit or after but linked to the same visit |
| `hospital`  | `Hospital`     |                         | `Hospitals`           |                                                                          |

### Step 3. Ingest Data

#### Examples of Completed Files Ready for Ingestion

You have a raw CSV file above, a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-6a7219def8c58c44981e0975e5beac17d4aabfd6%2Fdata_dictionary.csv?alt=media&token=6372e791-09a1-42de-8198-205f6b50df66>" %}
Sample Data Dictionary CSV
{% endfile %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-2d714f5759f8a363f66348592805e1c9865df061%2Fentity_metadata.csv?alt=media&token=821fcb8d-0217-4c25-80d9-c05f505f2280>" %}
Sample Entity Metadata CSV
{% endfile %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-750e5e1fa93582511a5175f8c41e5a54191685da%2Fcoding.csv?alt=media&token=6148f63c-810f-4c49-b3ae-62a2871ea62e>" %}
Sample Coding CSV
{% endfile %}

#### Running the Data Model Loader

With the files provided, configure the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)and run the ingestion. See below for an example configuration.

![Example inputs to the Data Model Loader.](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-64104ed66dea2573ccb2d7e8746e10a0b0bb0797%2Fingest_four_dataset_1.png?alt=media)

![Example Common setup for the Data Model Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-c002eb0ecbe51bb0301b596216ba7fab2b9802f7%2Fingest_four_dataset_2.png?alt=media)

The [four raw example files](#example-raw-files) are input into the **Data CSVs** input field.

### Step 4. Explore and Analyze Data

Once the Data Model Loader job has completed, you see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the Cohort Browser:

![The files created from Data Model Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-78374aea474a80bdb91f360033408672d774c4e8%2Fimage.png?alt=media)

![View of the Field Selector](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-35f077ec561118d567e5ff40792dbabbe417beac%2Fimage.png?alt=media)

![Example Cohort Browser view of ingested Example Files](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-27327cbaf1c6e7afd99de6efd517829af919ee92%2Fimage.png?alt=media)
