# Guide for Ingesting a Four Table Dataset

{% hint style="info" %}
On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Overview

This walkthrough describes phenotypic data ingestion using a [data dictionary file](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-dictionary), a [codings file](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#codings) if needed, an optional [entity dictionary](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#entity-dictionary) file, and accompanying [data CSV files](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-files). The files are loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)app, which validates and ingests the input CSV files to create an [Apollo dataset](https://documentation.dnanexus.com/developer/datasets). This Dataset is then accessible using the [Cohort Browser](https://documentation.dnanexus.com/user/cohort-browser), or using [JupyterLab](https://documentation.dnanexus.com/user/jupyter-notebooks/running-dxjupyterlab) and the `dxdata` Python SDK.

The following steps show how to organize your data into the required file sets following the [Data Ingestion Key Steps](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-ingestion-key-steps). These files can then be loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)app to create a database encapsulated by a [Dataset](https://documentation.dnanexus.com/developer/datasets) record, which is then immediately accessible for use.

## Requirements

* DNAnexus Apollo License
* Access to the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader) app
* A way to manipulate and create CSV files
* Uploader or greater permissions in the project the app runs

### Example Raw Files

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-340e06192fe8d625bd4fd414a7a489b8f3deed7e%2Fpatient.csv?alt=media&token=b60f799f-e2df-4646-83f7-2e292f4a7917>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-09b8e0be300706ae7619ff88b35a9e1de8b1d1a8%2Fhospital.csv?alt=media&token=d307f85d-a52e-421c-94ea-bcdca7ed325c>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-c69e2697fa2c1b36ab6adfe7044f2c00cbd82e89%2Fencounter.csv?alt=media&token=db77c894-1da5-4514-b500-39a0e10fca6f>" %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-2e20c889e154eca9d5fbdb9a57912f97914a0498%2Ftest.csv?alt=media&token=345f91d8-0616-4a05-9ce5-3bf9ae6182b5>" %}

## Guide

### Step 1. Identify Your Data

#### Specify the Data to Be Used

For this dataset the following CSV files from the [Example section](#example-raw-files) above are used: `patient.csv`, `hospital.csv`, `encounter.csv`, and `test.csv`. The data has a structure as follows:

![ERD for the example data](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-0f5dd6333452cdcafec487e729b7bb43de7bafe8%2Fimage.png?alt=media)

* Many patients point to one hospital but some may be missing hospital information.
* A patient may have none, one, or many encounters.
* Each encounter may have none, one, or many tests.

#### Specify the Main Entity and Main Field

For this dataset create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

* The Patients that were in Hospital X and had an Encounter with Doctor Y
* The Patients with a weight over A
* The Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity is the `patient`. The patient data includes a `patient_id` that is unique for the table and is used as the key (main field).

#### Specify Additional Entities

Since the data already has one-to-many relationships, there are no one-to-one relationships, and there are no large sets of related multi-select fields, the secondary entities are kept as:

* `Hospital`
* `Encounter`
* `Test`

### Step 2. Provide File Specifications

While the [data ingestion key steps](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-ingestion-key-steps) document guides the user through column by column for the different data files, this walkthrough goes through one entity at a time. This is recommended to ensure precision when working with a larger or multi-entity dataset.

The Patient CSV (main entity) file should be formatted as follows:

| `patient_id` | `name` | `age` | `risk` | `date_of_birth` | `weight`  | `resident_state_prov` | `title`      | `hospital_id` |
| ------------ | ------ | ----- | ------ | --------------- | --------- | --------------------- | ------------ | ------------- |
| 1            | John   | 29    | h      | 1983-03-05      | 135.56035 | KS                    | Mr           | 2             |
| 2            | Sally  | 27    | m      | 1951-08-12      | 101.22172 | SC                    | Dr\|Hon      | 2             |
| 3            | Diane  | 33    | l      | 1991-09-30      | 126.9032  | OH                    | Mrs\|Hon\|Dr | 1             |

#### Create a Data Dictionary File

Create a `data_dictionary.csv` file and fill in the fields for `entity` and `name`. Add `global` as the `primary_key_type` value for `patient_id`.

The Data Dictionary CSV file should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` |
| --------- | --------------------- | ------------------ |
| `patient` | `patient_id`          | `global`           |
| `patient` | `name`                |                    |
| `patient` | `age`                 |                    |
| `patient` | `risk`                |                    |
| `patient` | `date_of_birth`       |                    |
| `patient` | `weight`              |                    |
| `patient` | `resident_state_prov` |                    |
| `patient` | `title`               |                    |
| `patient` | `hospital_id`         |                    |

#### Specify Field Characteristics

For each field in each entity, indicate `is_sparse_coding` and `is_multi_select`, and specify the field's `type`.

Layer in the types for the different fields. For this example, there are the following groupings:

* **Integer:** `patient_id`, `age`, `hospital_id`
* **Date:** `date_of_birth`
* **Float:** `weight`
* **String:** `name`, `risk`, `resident_state`
  * **Multi-select:** `title`

The Data Dictionary CSV file should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` | `is_sparse_coding` | `is_multi_select` | `type`    |
| --------- | --------------------- | ------------------ | ------------------ | ----------------- | --------- |
| `patient` | `patient_id`          | `global`           |                    |                   | `integer` |
| `patient` | `name`                |                    |                    |                   | `string`  |
| `patient` | `age`                 |                    |                    |                   | `integer` |
| `patient` | `risk`                |                    |                    |                   | `string`  |
| `patient` | `date_of_birth`       |                    |                    |                   | `date`    |
| `patient` | `weight`              |                    |                    |                   | `float`   |
| `patient` | `resident_state_prov` |                    |                    |                   | `string`  |
| `patient` | `title`               |                    |                    | `yes`             | `string`  |
| `patient` | `hospital_id`         |                    |                    |                   | `integer` |

#### Specify Coding Settings for Categorical Fields

For categorical fields, build the `codings.csv` file and determine `coding_name`. Layer in the coding for categorical fields and fields that you want to summarize using [categorical chart types](https://documentation.dnanexus.com/user/cohort-browser/chart-types). For `patient`, create categorical selections for `risk`, `resident_state_prov`, and `titles`. For these, first generate a codings file from the dictionary as follows: the raw code in the data is populated in the `code` column, the text to display in the Cohort Browser is in the `meaning` column. If it is a hierarchical field such as `state_prov`, the `parent_code` points to the `code` of the parent for the entry. The `display_order` can be used to override the order codes are displayed in summary tiles. The coding file should look as follows:

The Coding CSV file should be formatted as follows:

| `coding_name` | `code` | `meaning`        | `parent_code` | `display_order` |
| ------------- | ------ | ---------------- | ------------- | --------------- |
| `risk_level`  | h      | High             |               |                 |
| `risk_level`  | m      | Medium           |               |                 |
| `risk_level`  | l      | Low              |               |                 |
| `state_prov`  | US     | United States    |               |                 |
| `state_prov`  | CA     | Canada           |               |                 |
| `state_prov`  | KS     | Kansas           | US            |                 |
| `state_prov`  | SC     | South Carolina   | US            |                 |
| `state_prov`  | MO     | Missouri         | US            |                 |
| `state_prov`  | OH     | Ohio             | US            |                 |
| `state_prov`  | TX     | Texas            | US            |                 |
| `state_prov`  | PA     | Pennsylvania     | US            |                 |
| `state_prov`  | AL     | Alabama          | US            |                 |
| `state_prov`  | YK     | Yukon Territory  | CA            |                 |
| `state_prov`  | BC     | British Columbia | CA            |                 |
| `titles`      | Rev    | Reverend         |               |                 |
| `titles`      | Mr     | Mr               |               |                 |
| `titles`      | Dr     | Doctor           |               |                 |
| `titles`      | Hon    | Honorable        |               |                 |
| `titles`      | Mrs    | Mrs              |               |                 |
| `titles`      | Ms     | Ms               |               |                 |

After you created the coding file, go back and fill in the `coding_name` column in `data_dictionary.csv` with the corresponding `coding_name` from the `coding.csv`.

The Data Dictionary CSV should be formatted as follows:

| `entity`  | `name`                | `primary_key_type` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` |
| --------- | --------------------- | ------------------ | ------------------ | ----------------- | --------- | ------------- |
| `patient` | `patient_id`          | `global`           |                    |                   | `integer` |               |
| `patient` | `name`                |                    |                    |                   | `string`  |               |
| `patient` | `age`                 |                    |                    |                   | `integer` |               |
| `patient` | `risk`                |                    |                    |                   | `string`  | `risk_level`  |
| `patient` | `date_of_birth`       |                    |                    |                   | `date`    |               |
| `patient` | `weight`              |                    |                    |                   | `float`   |               |
| `patient` | `resident_state_prov` |                    |                    |                   | `string`  | `state_prov`  |
| `patient` | `title`               |                    |                    | `yes`             | `string`  | `titles`      |
| `patient` | `hospital_id`         |                    |                    |                   | `integer` |               |

#### Specify Data Formats

You can validate that your fields are ingested as their expected [ingestion data type](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/ingestion-data-types). The following are the data type to field mappings expected for this walkthrough:

* **Integer:** Patient ID, Age, Hospital ID
* **String:** Name
  * **String Categorical:** Risk
  * **String Categorical Hierarchical**: Resident State Province
  * **String Categorical Multi-select**: Title
* **Date:** Date of Birth
* **Float:** Weight

#### Define Entity Relationships

Link the entities together by defining their relationships. For the Patient entity, use one-to-one or many-to-one relationships as appropriate. To create these links, enter the referenced entity and field in the format `<entity>:<field name>` in the `referenced_entity_field` column. Specify the type of relationship in the `relationship` column. If a join cannot be made for all rows, the system automatically handles missing cases.

For this example dataset, create the following relationships:

* Link `hospital` to `patient` using `hospital_id` with a `many_to_one` relationship.
* Link `encounter` to `patient` using `patient_id` with a `many_to_one` relationship.
* Link `test` to `encounter` using `encounter_id` with a `many_to_one` relationship.

The Data Dictionary CSV file should be formatted as follows:

| `entity`    | `name`                | `primary_key_type` | `referenced_entity_field` | `relationship` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` |
| ----------- | --------------------- | ------------------ | ------------------------- | -------------- | ------------------ | ----------------- | --------- | ------------- |
| `patient`   | `patient_id`          | `global`           |                           |                |                    |                   | `integer` |               |
| `patient`   | `name`                |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `age`                 |                    |                           |                |                    |                   | `integer` |               |
| `patient`   | `risk`                |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `date_of_birth`       |                    |                           |                |                    |                   | `date`    |               |
| `patient`   | `weight`              |                    |                           |                |                    |                   | `float`   |               |
| `patient`   | `resident_state_prov` |                    |                           |                |                    |                   | `string`  |               |
| `patient`   | `title`               |                    |                           |                |                    | `yes`             | `string`  |               |
| `patient`   | `hospital_id`         |                    | `hospital:hospital_id`    | `many_to_one`  |                    |                   | `integer` |               |
| `hospital`  | `hospital_id`         | `local`            |                           |                |                    |                   | `integer` |               |
| `encounter` | `encounter_id`        | `local`            |                           |                |                    |                   | `integer` |               |
| `encounter` | `patient_id`          |                    | `patient:patient_id`      | `many_to_one`  |                    |                   | `integer` |               |
| `test`      | `test_id`             | `local`            |                           |                |                    |                   | `integer` |               |
| `test`      | `encounter_id`        |                    | `encounter:encounter_id`  | `many_to_one`  |                    |                   | `integer` |               |

#### Add Additional Descriptive Information

After you have the core data relationships and types in place, it's time to add metadata about the field (`title`, `description`, `units`, and `linkout`) along with specifying the folder path to show in the field selector of the Cohort Browser. For more information on each field, see [data dictionary details](https://documentation.dnanexus.com/ingesting-data/data-model-loader/data-file-inputs-data-model-loader#data-dictionary).

{% hint style="info" %}
If you want the folder structure to mirror the entity structure exactly, leave the `folder_path` field empty.

If you want to hide certain fields from the field selector, populate the `folder_path` field for only the fields you want to show and leave the path empty for the fields to hide.
{% endhint %}

Instead of following the examples set for entities for folder paths, put the identifiers in their own "ID" folder and patient identifying information (PII) in a subfolder under "Patient" called "PII". Foreign keys are hidden from the field selector.

The Data Dictionary CSV file should be formatted as follows:

| `entity`    | `name`                | `primary_key_type` | `referenced_entity_field` | `relationship` | `is_sparse_coding` | `is_multi_select` | `type`    | `coding_name` | `folder_path` | `title`                 | `description`                   | `units` | `linkout`                                             |
| ----------- | --------------------- | ------------------ | ------------------------- | -------------- | ------------------ | ----------------- | --------- | ------------- | ------------- | ----------------------- | ------------------------------- | ------- | ----------------------------------------------------- |
| `patient`   | `patient_id`          | `global`           |                           |                |                    |                   | `integer` |               | `ID`          | `Patient ID`            |                                 |         |                                                       |
| `patient`   | `name`                |                    |                           |                |                    |                   | `string`  |               | `Patient`     | `Patient`               | name of the patient             |         |                                                       |
| `patient`   | `age`                 |                    |                           |                |                    |                   | `integer` |               | `Patient>PII` | `Age`                   | age at signup                   |         |                                                       |
| `patient`   | `risk`                |                    |                           |                |                    |                   | `string`  | `risk_level`  | `Patient`     | `Risk Level`            | evaluated risk level at signup  |         | `https://ourinternalwebsite.com/resources/id=3948483` |
| `patient`   | `date_of_birth`       |                    |                           |                |                    |                   | `date`    |               | `Patient>PII` | `DOB`                   | date of birth                   |         |                                                       |
| `patient`   | `weight`              |                    |                           |                |                    |                   | `float`   |               | `Patient`     | `Weight`                | baseline weight taken at signup | `lbs`   |                                                       |
| `patient`   | `resident_state_prov` |                    |                           |                |                    |                   | `string`  | `state_prov`  | `Patient`     | `Home State / Province` | state of current residency      |         |                                                       |
| `patient`   | `title`               |                    |                           |                |                    | `yes`             | `string`  | `titles`      | `Patient`     | `Titles`                |                                 |         |                                                       |
| `patient`   | `hospital_id`         |                    | `hospital:hospital_id`    | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |
| `hospital`  | `hospital_id`         | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Hospital ID`           |                                 |         |                                                       |
| `encounter` | `encounter_id`        | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Encounter ID`          |                                 |         |                                                       |
| `encounter` | `patient_id`          |                    | `patient:patient_id`      | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |
| `test`      | `test_id`             | `local`            |                           |                |                    |                   | `integer` |               | `ID`          | `Test ID`               |                                 |         |                                                       |
| `test`      | `encounter_id`        |                    | `encounter:encounter_id`  | `many_to_one`  |                    |                   | `integer` |               |               |                         |                                 |         |                                                       |

After the `patient` is set, repeat the steps for the other entities: `hospital`, `encounter`, and `test`. Feel free to use this coding as a guide for categorical values.

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-750e5e1fa93582511a5175f8c41e5a54191685da%2Fcoding.csv?alt=media&token=6148f63c-810f-4c49-b3ae-62a2871ea62e>" %}
Sample Coding CSV
{% endfile %}

{% hint style="info" %}
This table only highlights one entity in totality and only a few fields from the other entities. You need to go through the remaining three files and fill in the data yourself.
{% endhint %}

#### Generate Entity Details

After that, the only step remaining is to generate the `entity_metadata.csv` to clean up the labels that the browser uses for the entities (refer to the [detailed docs](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-file-inputs-data-model-loader) for more guidance). Blank spaces are left in the label fields if the field is a repeat of the `entity_title`.

The Entity Metadata CSV should be formatted as follows:

| `entity`    | `entity_title` | `entity_label_singular` | `entity_label_plural` | `entity_description`                                                     |
| ----------- | -------------- | ----------------------- | --------------------- | ------------------------------------------------------------------------ |
| `patient`   | `Patients`     | `Patient`               |                       | the patient who visited                                                  |
| `encounter` | `Visit`        | `Patient Visit`         | `Patient Visits`      | each admission is logged as an individual visit                          |
| `test`      | `Lab Tests`    | `Test`                  | `Tests`               | all tests performed during a visit or after but linked to the same visit |
| `hospital`  | `Hospital`     |                         | `Hospitals`           |                                                                          |

### Step 3. Ingest Data

#### Examples of Completed Files Ready for Ingestion

You have the raw CSV files above, a data dictionary file, an entity metadata file, and a codings file. If you struggled, below are exemplar files that can be used for ingestion:

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-6a7219def8c58c44981e0975e5beac17d4aabfd6%2Fdata_dictionary.csv?alt=media&token=6372e791-09a1-42de-8198-205f6b50df66>" %}
Sample Data Dictionary CSV
{% endfile %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-2d714f5759f8a363f66348592805e1c9865df061%2Fentity_metadata.csv?alt=media&token=821fcb8d-0217-4c25-80d9-c05f505f2280>" %}
Sample Entity Metadata CSV
{% endfile %}

{% file src="<https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-750e5e1fa93582511a5175f8c41e5a54191685da%2Fcoding.csv?alt=media&token=6148f63c-810f-4c49-b3ae-62a2871ea62e>" %}
Sample Coding CSV
{% endfile %}

#### Running the Data Model Loader

With the files provided, configure the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader)and run the ingestion. See below for an example configuration.

![Example inputs to the Data Model Loader.](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-64104ed66dea2573ccb2d7e8746e10a0b0bb0797%2Fingest_four_dataset_1.png?alt=media)

![Example Common setup for the Data Model Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-c002eb0ecbe51bb0301b596216ba7fab2b9802f7%2Fingest_four_dataset_2.png?alt=media)

The [four raw example files](#example-raw-files) are input into the **Data CSVs** input field.

### Step 4. Explore and Analyze Data

Once the Data Model Loader job has completed, you see a dataset and a database output. If you used the supplied metadata files, you should see something similar to this in the Cohort Browser:

![The files created from Data Model Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-78374aea474a80bdb91f360033408672d774c4e8%2Fimage.png?alt=media)

![View of the Field Selector](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-35f077ec561118d567e5ff40792dbabbe417beac%2Fimage.png?alt=media)

![Example Cohort Browser view of ingested Example Files](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-27327cbaf1c6e7afd99de6efd517829af919ee92%2Fimage.png?alt=media)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.dnanexus.com/developer/walkthroughs/guide-for-ingesting-a-simple-three-table-dataset.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
