# Data Files Used by the Data Model Loader

{% hint style="info" %}
The information on this page is for the use of customers working with the [DNAnexus Professional Services](https://www.dnanexus.com/professional-services) team. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Overview

Phenotypic data must be delivered in a data model defined by the following files:

* A data dictionary CSV file, named `data_dictionary.csv` in this example.
* A data CSV file, `data.csv`.
* A codings CSV file, `codings.csv`.

### Data Dictionary

**Description:** The data dictionary file describes data in the data file (`*.data.csv`) by file name and by column within the specified data file. The data dictionary lists the metadata for the input `data.csv` file and respective individual columns within the data file. This metadata includes human-readable field names, data types, data coding schemes, and the relationships between the data tables. The information supplied in the data dictionary guides the ingestion process.

The data dictionary can be used to guide Data Model Loader to preload ICD codings and automatically build the codings CSV file. To take advantage of this automated feature in Data Model Loader, reserved terms (`icd9cm:2015`, `icd9pcs:2015`, `icd10cm:2024`, and `icd10pcs:2024`) must be used as `coding_name` in the data dictionary. By default, ICD codings are hierarchically grouped. To disable hierarchy, the reserved term should be followed by `:exclude_hierarchy`, for example, `icd9cm:2015:exclude_hierarchy`.

**Deliverable**: The `data_dictionary.csv` file is similar in structure to a spreadsheet, where columns are defined in Table 1 in the Appendix below. Each row in the data dictionary file represents a field for which an entity, that is, "patient", "sample", or "encounter," has a value.

{% hint style="info" %}
All columns have to be present, however they can be left empty if there are no values for the corresponding `entity:field` combination. The following example shows only enough columns to illustrate the general format. A complete template can be provided on request.
{% endhint %}

#### Example `data_dictionary.csv`

| `entity`    | `name`               | `type`    | `primary_key_type` | `coding_name` | `is_sparse_coding` | `referenced_entity_field` | `relationship` | `folder_path`            |
| ----------- | -------------------- | --------- | ------------------ | ------------- | ------------------ | ------------------------- | -------------- | ------------------------ |
| `subject`   | `subject_id`         | `string`  | `global`           |               |                    |                           |                |                          |
| `subject`   | `first_name`         | `string`  |                    |               |                    |                           |                |                          |
| `subject`   | `last_name`          | `string`  |                    |               |                    |                           |                |                          |
| `subject`   | `birth_year`         | `date`    |                    |               |                    |                           |                |                          |
| `encounter` | `encounter_id`       | `string`  | `local`            |               |                    |                           |                |                          |
| `encounter` | `subject_id`         | `string`  |                    |               |                    | `subject:subject_id`      | `many_to_one`  |                          |
| `encounter` | `date`               | `date`    |                    |               | `no`               |                           |                | `Encounter`              |
| `encounter` | `height`             | `float`   |                    |               | `no`               |                           |                | `Encounter>Measurements` |
| `encounter` | `weight`             | `integer` |                    |               | `no`               |                           |                | `Encounter>Measurements` |
| `encounter` | `sys_blood_pressure` | `integer` |                    |               | `no`               |                           |                | `Encounter>Measurements` |

### Entity Dictionary

**Description:** The optional entity dictionary file describes entities that the data file represent (`*.data.csv`). The entity dictionary lists the metadata for each input `data.csv`. This metadata includes human-readable entity title, singular and plural labels, and a description. The information supplied in the entity dictionary provides additional contextualized information for the user.

**Deliverable**: The `entity_dictionary.csv` file is similar in structure to a spreadsheet, where columns are defined in Table 3 in the Appendix below. Each row in the entity dictionary file represents an entity for which data file and entries in the `entity_dictionary.csv` exist.

{% hint style="info" %}
All columns have to be present, however they can be left empty if there are no values for the corresponding `entity:field` combination. The following example shows only enough columns to illustrate the general format. A complete template can be provided on request.
{% endhint %}

#### Example `entity_dictionary.csv`

| `entity`    | `entity_title`     | `entity_label_singular` | `entity_label_plural` | `entity_description`                                                    |
| ----------- | ------------------ | ----------------------- | --------------------- | ----------------------------------------------------------------------- |
| `patient`   | `Patient`          | `Patient Record`        | `Patient Records`     | The unique record for each patient participating in the clinical trial. |
| `clinical`  | `Clinicals`        |                         |                       | Clinically supplied data at the patient level.                          |
| `encounter` | `Encounters`       | `Encounter`             | `Encounters`          |                                                                         |
| `lab_test`  | `Laboratory Tests` | `Lab Test`              | `Lab Tests`           | Laboratory tests ordered during a patient's hospital visit.             |

### Data Files

**Description:** Phenotypic data must be supplied in a CSV file.

**Deliverable:** The `data.csv` file is described in the data dictionary with its name. The `data.csv` file must be a flattened data structure (a table) with a header, columns, and rows. The values must be of the following listed types: `date`, `datetime`, `integer`, `float`, or `string`.

#### Example subject `data.csv`

| `subject_id` | `first_name` | `last_name` | `birth_year` |
| ------------ | ------------ | ----------- | ------------ |
| 1            | Mom          | Family      | 1985         |
| 2            | Dad          | Family      | 1984         |
| 3            | Baby         | Family      | 2016         |

### Codings

**Description:** The `codings.csv` file is a single file that contains information for coded values present in all the `*.data.csv` files. Often a meaning is designed for human consumption and codes are designed to optimize data storage and retrieval. For example, instead of storing the value "Acquired pure red cell aplasia erythroblastopenia" multiple times in a database, it is easier to code this as the shortened ICD10 code, `D60`. Data are stored as codes in the encoded fields. When retrieving a coded value, such as the ICD10 code `D60`, from a database and displaying the value for consumption, the system converts this back to the full meaning, "Acquired pure red cell aplasia erythroblastopenia," so that the information is intelligible to the users.

Hierarchical Data: The codings file supports specification of a hierarchy. For example, ICD10 codes, `A00` to `B99`, all point to the category, "Certain infectious and parasitic diseases". The code, `A41`, points to "Other sepsis" and codes, `A41.X`, point to specific types of sepsis. Such a hierarchical structure can be defined in the `codings.csv` in the `parent_code` column (see example below).

**Deliverable:** The `codings.csv` file must have columns as defined in Table 2 of the Appendix. Each categorical field in your data needs to have an associated entry in the `codings.csv` file. The same `coding_name` value must be used in the respective `data_dictionary.csv` field to match each field with its respective codes.

#### Example `codings.csv` file

| `coding_name`                   | `code` | `meaning`          | `parent_code` | `display_order` |
| ------------------------------- | ------ | ------------------ | ------------- | --------------- |
| `blood_pressure`                | -1     | Did not measure    |               | 1               |
| `blood_pressure`                | -2     | Measurement failed |               | 2               |
| `t_f_na`                        | 1      | True               |               |                 |
| `t_f_na`                        | 2      | False              |               |                 |
| `t_f_na`                        | 3      | Not Applicable     |               |                 |
| `reason_discontinued_treatment` | 1      | Pain subsided      |               |                 |
| `reason_discontinued_treatment` | 2      | Lost medication    |               |                 |
| `reason_discontinued_treatment` | 3      | Forgot             |               |                 |
| `allergies`                     | a      | Environment        |               | 1               |
| `allergies`                     | b      | Food               |               | 2               |
| `allergies`                     | c      | Medicine           |               | 3               |

## Appendix

### Table 1: Data Dictionary File Description (`data_dictionary.csv`)

| Column                    | Possible values                                   | Description                                                                                                                                                                 |
| ------------------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `entity`                  | `^[a-zA-Z][a-zA-Z_0-9]*$`                         | Required: The name of the entity associated with the current field.                                                                                                         |
| `name`                    | `^[a-zA-Z][a-zA-Z_0-9]*$`                         | Required: The name of the current field. No two fields, within an entity, may have the same name.                                                                           |
| `type`                    | `date`, `datetime`, `integer`, `float`, `string`  | Required: The type of the raw data as given to the system. Type `date` is a timepoint with day granularity.                                                                 |
| `coding_name`             | `[\w+-]+`                                         | This column denotes whether the current field is codified. For example, a gender field could have type set to `integer`.                                                    |
| `description`             | Any character except `^[*'\|"]*$` (or left empty) | A human readable description of the field that is displayed to users in the Cohort Browser.                                                                                 |
| `folder_path`             | `^[a-zA-Z_0-9][>]*$`                              | Fields may be displayed in user defined groups, with the strict purpose of visualizing data in a browser. This field organizes the display hierarchy in the Cohort Browser. |
| `is_multi_select`         | `yes` (or left empty)                             | If you are using a `coding_name` and the field is multi-select, then enter `yes` in this column.                                                                            |
| `is_sparse_coding`        | `yes` (or left empty)                             | Set this field to the value `yes` if the `coding_name` field is sparsely codified. This indicates that the field has many possible values but only some are relevant.       |
| `linkout`                 | Compliant URL (or left empty)                     | A URL for linking to additional information.                                                                                                                                |
| `primary_key_type`        | `global`, `local`                                 | This column denotes whether the current field is a primary key for the entity, and whether that entity is the main entity in the dataset.                                   |
| `referenced_entity_field` | `^entity_name:field_name$` (or left empty)        | If the field is a foreign key that references a field in another entity, specify the entity and field being referenced.                                                     |
| `relationship`            | `one_to_one`, `many_to_one` (or left empty)       | This describes the relationship between two entities (tables). Used with `referenced_entity_field` to define the relationship type.                                         |
| `title`                   | `^[a-zA-Z_0-9][>]*$`                              | A human readable title for `name` (column name) which is displayed to users in the Cohort Browser.                                                                          |
| `units`                   | `^[a-zA-Z][a-zA-Z_0-9]*$`                         | For the end user to understand the data, it is necessary to specify the unit of measurement.                                                                                |

\* A value is required unless specified as "or left empty."

### Table 2: Codings File Description (`codings.csv`)

| Column          | Possible values\*                                 | Description                                                                                                                                                                             |
| --------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `coding_name`   | `^[\w+-]+$` (max length 256 characters)           | The name of the set of codes referenced when a field is codified. This column cannot be empty.                                                                                          |
| `code`          | Unique for a specific `coding_name`               | Specifies the code for the respective `code:meaning` pair.                                                                                                                              |
| `meaning`       | Any non-empty string (max length 2000 characters) | Specifies the value each code should take on, for a given grouping of codes, grouped by the `coding_name` field.                                                                        |
| `parent_code`   | An existing code in the `coding_name` set         | The hierarchical structure to a coding group may be maintained by specifying parent codes within the grouping.                                                                          |
| `display_order` | A positive integer (or left empty)                | The order for how values are displayed may be retained by directly specifying the order. Use this to control the order of values in list-type charts (non-hierarchical) and bar graphs. |

### Table 3: Entity Dictionary File Description (`entity_dictionary.csv`)

| Column                  | Description                                                                                   | Details                                                                                                                                                                                                                                                           |
| ----------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `entity`                | A machine-readable entity name. Values here must be reflected in the main data dictionary.    | No two entities may have the same name. Entity names need not be descriptive, however it is recommended they are descriptive and reflect the underlying data.                                                                                                     |
| `entity_title`          | A human-readable title for the value in the entity field.                                     | This should be a descriptive title for the entity. Often this will be a singular form, but plural forms are acceptable when that better reflects the underlying data or established terminology. The title is used when a value equals `1` in the Cohort Browser. |
| `entity_label_singular` | A human-readable entity label in singular grammatical number, making sense as a single value. | This should often be the same value as specified in `entity_title` and is descriptive of the entity. The singular label is used when a value equals `1` in the Cohort Browser.                                                                                    |
| `entity_label_plural`   | A human-readable entity label in plural grammatical number, making sense as plural values.    | This should often be the plural version of `entity_label_singular` (or `entity_title` if singular is blank) and is descriptive of the entity. The plural label is used when a value not equal to `1` in the Cohort Browser.                                       |
| `entity_description`    | A human-readable description of the field, making sense as a plural value.                    | This can describe details about the entity, how it is tracked, any additional information a person may need to know about the entity. A description is exposed to the user in the entity tooltip on the Cohort Browser.                                           |
