# Data Ingestion Key Steps

{% hint style="info" %}
An Apollo license is required to use Data Model Loader. Org approval may also be required. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

Generalized phenotypic data ingestion takes well-described data in the form of a [Data Dictionary](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#data-dictionary) file, a [Codings](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#codings) file if needed, an optional [Entity Dictionary](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#entity-dictionary) file, and accompanying [data CSV files](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#data-files). The files are loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader) app, which validates and ingests the input CSV files to create a [Dataset](https://documentation.dnanexus.com/developer/datasets). This Dataset is then accessible using the [Cohort Browser](https://documentation.dnanexus.com/user/cohort-browser), or using [JupyterLab](https://documentation.dnanexus.com/user/jupyter-notebooks) and the `dxdata` Python SDK.

The following steps show how to organize your data into the required file sets. These files can then be loaded using the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader) app to create a database encapsulated by a [Dataset](https://documentation.dnanexus.com/developer/datasets) record, which is then immediately accessible for use.

## Step 1. Identify Your Data

### Decide What Type of Data You Want to Ingest

Is this data phenotypic or clinical data, or is it molecular data?

Include

* Examples of phenotypic or clinical data include a patient's height, encounters with a physician or hospital, surgeries, drugs taken for any treatments, medical histories, and demographic information. Clinical data may also include descriptive content, such as information on samples extracted (for example, the weight and size of a tumor, or the date and time at which a tumor was excised).
* General features that describe molecular content may be considered clinical data. For example, including the allele content of the BRCA2 gene for a patient is not recommended, but a field such as "Tested positive for BRCA2 risk allele: (yes/no/untested)" may be of use.

Exclude

* Examples of molecular data include: allele content of the BRCA2 gene for a patient. This guide does not cover molecular data ingestion. To ingest these and other complex datasets, engaging with the [DNAnexus Professional Services team](https://www.dnanexus.com/professional-services) is advised to ensure an optimal experience.

### Determine the Main Entity and Main Field

Once you have the data to include, decide on the central organizing focus of the data. In most situations, the focus is at the individual level (i.e., subject, patient, case, participant, or other individual-level entity). You can think of the main entity as the "item" you want to summarize data around and build cohorts of.

For example, individuals are often grouped into cohorts of subjects, such as:

1. Subjects that haven't smoked cigarettes before
2. Subjects that smoke one pack (or more) of cigarettes a week

The *main entity* here would be `subject`. The *main field* would be a unique identifier of the individual subject, such as `subject_id`.

{% hint style="info" %}
This assumes all data in a set of data is "linked" together in some way. For example, if you have a set of data that includes patients and samples, all included samples are expected to be from a patient contained in the set of data. Samples that have no identified patient should not be included in the data.
{% endhint %}
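The linkage requirement above can be checked before building any files. The following is a minimal sketch, assuming hypothetical in-memory rows and a hypothetical helper named `find_orphans`; none of these names come from Data Model Loader itself.

```python
# Hypothetical rows; in practice these would come from your own source data.
subjects = [{"subject_id": "S1"}, {"subject_id": "S2"}]
samples = [
    {"sample_id": "SM1", "subject_id": "S1"},
    {"sample_id": "SM2", "subject_id": "S2"},
    {"sample_id": "SM3", "subject_id": "S9"},  # no matching subject
]


def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose `key` value does not appear in the parent rows."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


# Samples without an identified subject should be left out of the data set.
orphans = find_orphans(samples, subjects, "subject_id")
linked_samples = [row for row in samples if row not in orphans]
```

Here `orphans` holds the samples to exclude and `linked_samples` holds the samples that are safe to include.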

### Define All Other Entities

An entity is a grouping of data. You may group data however it best fits your needs. We recommend grouping all data that shares a *one-to-one* relationship into a single entity, and placing any nested data in a separate entity. Entities are dependent on the data you have.

{% hint style="info" %}
*Example*

If you have a main entity, `subject`, you may have another entity, `encounter`, which contains information from the many encounters a subject has at the hospital, and you may have another entity, `sample`, which contains information on the many samples extracted from a subject.

* The entity `subject` would contain all data that is one-to-one with the subject, such as, `first_name`, `last_name`, `date_of_birth`, `sex`, `race`, `ethnicity`.
* `encounter` would contain the `date` of the hospital visit and perhaps the diagnosis, `ICD10_code`, that may have resulted from the visit.
* `sample` would contain any sample information on the subject, such as the `date` the sample was extracted, the `tissue_type` of the extraction, and the `tissue_weight` of the tissue extracted.
* Entities are dependent on the data you have. If you don't have sample data, you don't need a `sample` entity!
  {% endhint %}

## Step 2. Define Your Files

### Fill in `entity` and `name` for Each Field

After defining all entities, create a `data_dictionary.csv` file with the following column names: `entity`, `name`, `primary_key_type`, `coding_name`, `is_sparse_coding`, `is_multi_select`, `longitudinal_axis_type`, `type`, `referenced_entity_field`, `relationship`, `folder_path`, `title`, `description`, `units`, `concept`, and `linkout`. See the [Phenotypic Data Ingest File Details](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/data-file-inputs-data-model-loader) for quick reference.

Begin filling out `data_dictionary.csv` by listing all fields within your data under the `name` column and labeling each field with its entity name under the `entity` column.
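As a starting point, the header and the `entity` and `name` columns can be generated with a short script. This is a sketch only: the column list is taken from the text above, while `fields_by_entity` and `write_data_dictionary_skeleton` are hypothetical names, and the example fields stand in for your own.

```python
import csv
import io

# Column names as listed in this guide.
DATA_DICTIONARY_COLUMNS = [
    "entity", "name", "primary_key_type", "coding_name", "is_sparse_coding",
    "is_multi_select", "longitudinal_axis_type", "type",
    "referenced_entity_field", "relationship", "folder_path", "title",
    "description", "units", "concept", "linkout",
]

# Hypothetical entities and fields; replace with the ones in your data.
fields_by_entity = {
    "subject": ["subject_id", "date_of_birth", "sex"],
    "encounter": ["subject_id", "date", "ICD10_code"],
}


def write_data_dictionary_skeleton(stream, fields_by_entity):
    """Write the header plus one row per field, leaving other columns blank."""
    writer = csv.DictWriter(stream, fieldnames=DATA_DICTIONARY_COLUMNS, restval="")
    writer.writeheader()
    for entity, names in fields_by_entity.items():
        for name in names:
            writer.writerow({"entity": entity, "name": name})


# Written to an in-memory buffer here so the result can be inspected.
buffer = io.StringIO()
write_data_dictionary_skeleton(buffer, fields_by_entity)
skeleton_csv = buffer.getvalue()
```

To write the file to disk instead, pass a file opened with `open("data_dictionary.csv", "w", newline="")` in place of the buffer. The remaining columns are filled in during the steps that follow.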

### Define the `primary_key_type`

#### Main Entity

Because your main entity and main field are already defined, write the value `global` under the column `primary_key_type` for the main field's row in `data_dictionary.csv`.

#### All Other Entities

For every entity other than the main entity, determine whether the entity contains a field whose value serves as a primary key. Primary key values must not be null, and each must be unique, to serve as a unique identifier for the entity. Only one defined primary key per entity is allowed. For example, the entity `sample` might contain the primary key field `sample_id`.

For each defined primary key, write the value `local` under the column `primary_key_type` for that field's row in `data_dictionary.csv`.
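Before marking a field as a primary key, it can help to confirm that its values are non-null and unique. The sketch below assumes rows held as dictionaries and uses a hypothetical helper, `primary_key_problems`; it illustrates the rule stated above and is not part of Data Model Loader.

```python
def primary_key_problems(rows, key_field):
    """Return a list of problems that would disqualify `key_field` as a primary key."""
    problems = []
    seen = set()
    for index, row in enumerate(rows, start=1):
        value = row.get(key_field)
        if value is None or str(value).strip() == "":
            problems.append(f"row {index}: {key_field} is null or empty")
        elif value in seen:
            problems.append(f"row {index}: duplicate {key_field} {value!r}")
        else:
            seen.add(value)
    return problems


# Hypothetical sample rows with one empty and one repeated identifier.
sample_rows = [
    {"sample_id": "SM1", "tissue_type": "liver"},
    {"sample_id": "", "tissue_type": "lung"},
    {"sample_id": "SM1", "tissue_type": "skin"},
]

problems = primary_key_problems(sample_rows, "sample_id")
```

An empty `problems` list suggests the field is a reasonable candidate for `local` (or `global`, for the main field).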

### For Each Field in Each Entity, Indicate `is_sparse_coding`, `is_multi_select`, and `type`

See the [Ingestion Data Types](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/ingestion-data-types) page for the definition of each column.

For each field in `data_dictionary.csv`, provide the value `yes` under the `is_sparse_coding` or `is_multi_select` column if that property applies to the field. Also, determine the `type` of the field.

### Build a `codings.csv` File and Indicate `coding_name`

Next, determine whether each field is categorical. A field is categorical if its values come from a limited set of repeated options rather than being unique to each row, such as the field `Tumor status`, where reasonable answers would be `malignant`, `benign`, `not applicable`, or `undetermined`. For a categorical field, you need to include the specific `coding_name` used. If there are any categorical fields in your data, you need to create a `codings.csv` file with the following column names: `coding_name`, `code`, `meaning`, `parent_code`, `display_order`, and `concept`. See [codings file format](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#codings) for more information.

If you have files with ICD codes, it is possible to leverage automated pre-loaded ICD codings. In this case, reserved terms (`icd9cm:2015`, `icd9pcs:2015`, `icd10cm:2024`, and `icd10pcs:2024`) must be used as `coding_name` in the data dictionary without corresponding information in the `codings.csv` file. You can overwrite specific codes by adding them to the input `codings.csv` file.

If a field is categorical, for each value in the field, fill in the `code`, `meaning`, `parent_code`, `display_order`, and `concept` within the `codings.csv` file. Finally, label the set of codes with a unique `coding_name`. Enter this `coding_name` in the `codings.csv` file, and use the same `coding_name` value for the respective field in `data_dictionary.csv`.
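The following sketch builds `codings.csv` rows for the `Tumor status` example and checks that every value in the data has a code. It rests on several assumptions: that `code` holds the value as stored in the data file and `meaning` its display label, that `parent_code` and `concept` may be left blank for a flat coding, and that `display_order` can start at 1. The coding name `tumor_status_coding` and both helper functions are hypothetical; consult the codings file format page for the authoritative rules.

```python
import csv
import io

# Column names as listed in this guide.
CODINGS_COLUMNS = [
    "coding_name", "code", "meaning", "parent_code", "display_order", "concept",
]


def build_coding_rows(coding_name, code_to_meaning):
    """Build one codings row per code, numbered in the order given."""
    return [
        {
            "coding_name": coding_name,
            "code": code,
            "meaning": meaning,
            "parent_code": "",  # assumption: blank for a non-hierarchical coding
            "display_order": order,
            "concept": "",  # assumption: blank when no concept applies
        }
        for order, (code, meaning) in enumerate(code_to_meaning.items(), start=1)
    ]


def uncoded_values(data_values, coding_rows):
    """Return non-empty data values that have no matching code."""
    coded = {row["code"] for row in coding_rows}
    return sorted({v for v in data_values if v != "" and v not in coded})


tumor_status_rows = build_coding_rows("tumor_status_coding", {
    "malignant": "Malignant",
    "benign": "Benign",
    "not applicable": "Not applicable",
    "undetermined": "Undetermined",
})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=CODINGS_COLUMNS)
writer.writeheader()
writer.writerows(tumor_status_rows)
codings_csv = buffer.getvalue()

# Hypothetical column of data values; "unknown" has no code yet.
unmatched = uncoded_values(["benign", "malignant", "unknown", ""], tumor_status_rows)
```

Any value reported in `unmatched` needs either a new row in `codings.csv` or a correction in the data.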

### Indicate if a Desired Data Format is Allowed

You now have all the information needed to determine whether the desired format of your data (float, integer, date, string, boolean, or other data types) is accepted during ingestion. Data Model Loader restricts the formats it allows. Review each field in your set of data and confirm that it fits a represented data format (see the [Ingestion Data Types](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader/ingestion-data-types) page).

If it is not represented, we recommend reformatting your data, restructuring your set of entities, or forcing the field into an allowed data format.
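A rough first pass over each column can flag fields whose values do not fit the format you intended. The heuristic below is only a sketch: it assumes ISO-formatted dates (`YYYY-MM-DD`) and `true`/`false` booleans, the function name `guess_format` is hypothetical, and the labels it returns are informal descriptions rather than the exact `type` values Data Model Loader expects. The Ingestion Data Types page remains the authority.

```python
from datetime import date


def guess_format(values):
    """Guess a coarse format for a column of string values, ignoring blanks."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "string"

    def all_parse(parse):
        # True only if every non-empty value is accepted by `parse`.
        for value in non_empty:
            try:
                parse(value)
            except ValueError:
                return False
        return True

    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):
        return "date"
    return "string"


# Hypothetical columns read from a data file.
height_format = guess_format(["172.5", "160", ""])
visit_format = guess_format(["2021-03-04", "2022-11-30"])
mixed_format = guess_format(["12", "unknown"])
```

A column that comes back as `string` when you expected a number or date usually contains a stray value worth cleaning up before ingestion.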

### Define Entity Relationships and Determine `referenced_entity_field` and `relationship`

This assumes that your included data is linked together in some way.

* If you only have one entity, these `data_dictionary.csv` columns may be ignored.
* If there is more than one entity, describe how each entity relates to another entity. Start with the main entity, and identify all entities that have either a one-to-one or many-to-one relationship with it. For example, the entity `encounter` typically has a many-to-one relationship with the main entity, `subject`. In this case, the `encounter` entity should include a field that links directly to a field in the `subject` entity. Both fields might be named `subject_id`. For the `subject_id` field in the `encounter` entity, set the `referenced_entity_field` value in `data_dictionary.csv` to `subject:subject_id`, and set the `relationship` value to `many_to_one`.
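The `encounter` example above can be expressed as a small script that fills in the two columns and checks that the referenced field exists in the dictionary. This is a sketch under stated assumptions: the rows are plain dictionaries, and `link_to_parent` and `unresolved_references` are hypothetical helpers, not part of any DNAnexus tooling.

```python
def link_to_parent(row, parent_entity, parent_field, relationship):
    """Return a copy of a data dictionary row that references a parent field."""
    linked = dict(row)
    linked["referenced_entity_field"] = f"{parent_entity}:{parent_field}"
    linked["relationship"] = relationship
    return linked


def unresolved_references(dictionary_rows):
    """Return references that point at an entity/field pair missing from the dictionary."""
    known = {(row["entity"], row["name"]) for row in dictionary_rows}
    problems = []
    for row in dictionary_rows:
        reference = row.get("referenced_entity_field", "")
        if not reference:
            continue
        entity, _, field = reference.partition(":")
        if (entity, field) not in known:
            problems.append(f"{row['entity']}.{row['name']} -> {reference}")
    return problems


# Hypothetical data dictionary rows, reduced to the columns relevant here.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id", "primary_key_type": "global"},
    link_to_parent(
        {"entity": "encounter", "name": "subject_id"},
        "subject", "subject_id", "many_to_one",
    ),
]

dangling = unresolved_references(dictionary_rows)
```

An empty `dangling` list means every `referenced_entity_field` value resolves to a field that is defined in the dictionary.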

### Add Additional Descriptive Information: `title`, `description`, `units`, and `linkout`

For each field, fill in any desired descriptive information. See the [Data Dictionary](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#data-dictionary) section for details on each column.

### Specify Folder for Display in the Cohort Browser: `folder_path`

Fill in `folder_path` for each field. If a value is not specified, the field is not displayed in the Cohort Browser.

We recommend grouping fields similarly to how entities are grouped, although this is not required.

### Organize Data into Respective Entities: `data.csv(s)`

For each entity in your `data_dictionary.csv` file, write all data from each field into a flat `data.csv` file. The `data.csv` should be named after the entity.

For example, the entity `subject` should have a respective `data.csv` file named `subject.csv`.

The `subject.csv` file should contain, as columns, all fields listed for `subject` in the `data_dictionary.csv` file, with the rows filled in with data.
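A quick consistency check between each entity's data file and the data dictionary can catch missing or stray columns early. The sketch below compares a CSV header against the dictionary rows for one entity; the helper `check_data_csv` and the sample content are hypothetical, and an in-memory string stands in for a real `subject.csv`.

```python
import csv
import io


def check_data_csv(stream, dictionary_rows, entity):
    """Compare a data CSV header with the fields the dictionary lists for `entity`."""
    expected = [row["name"] for row in dictionary_rows if row["entity"] == entity]
    header = next(csv.reader(stream), [])
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Hypothetical data dictionary rows, reduced to the two columns needed here.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id"},
    {"entity": "subject", "name": "date_of_birth"},
    {"entity": "subject", "name": "sex"},
    {"entity": "encounter", "name": "date"},
]

# Stand-in for the contents of subject.csv.
subject_csv = io.StringIO("subject_id,sex,zip_code\nS1,F,94040\n")

missing, unexpected = check_data_csv(subject_csv, dictionary_rows, "subject")
```

For a file on disk, open it with `open("subject.csv", newline="")` and pass the file object in place of the in-memory stream.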

### Generate Entity Details: `entity_metadata.csv`

For each entity created, add a row to `entity_metadata.csv` with `entity` and `entity_title` populated, and optionally supply an `entity_label_singular`, `entity_label_plural`, and `entity_description`.

For more details on each column refer to the [Entity Dictionary](https://documentation.dnanexus.com/developer/ingesting-data/data-file-inputs-data-model-loader#entity-dictionary) section.
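As an illustration, the file can be produced from a short list of entities. This sketch uses the column names given above; the helper `write_entity_metadata`, the titles, and the labels are hypothetical placeholders, and the optional columns are left blank when not supplied.

```python
import csv
import io

# Column names as listed in this guide.
ENTITY_METADATA_COLUMNS = [
    "entity", "entity_title", "entity_label_singular",
    "entity_label_plural", "entity_description",
]

# Hypothetical entities; only `entity` and `entity_title` are filled in for every row.
entities = [
    {
        "entity": "subject",
        "entity_title": "Subject",
        "entity_label_singular": "subject",
        "entity_label_plural": "subjects",
    },
    {"entity": "encounter", "entity_title": "Encounter"},
]


def write_entity_metadata(stream, entities):
    """Write one row per entity, refusing rows that lack a required value."""
    writer = csv.DictWriter(stream, fieldnames=ENTITY_METADATA_COLUMNS, restval="")
    writer.writeheader()
    for row in entities:
        if not row.get("entity") or not row.get("entity_title"):
            raise ValueError(f"entity and entity_title are required: {row}")
        writer.writerow(row)


buffer = io.StringIO()
write_entity_metadata(buffer, entities)
entity_metadata_csv = buffer.getvalue()
```

As with the other files, pass a file opened with `newline=""` instead of the buffer to write `entity_metadata.csv` to disk.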

You should now have one `data.csv` file per entity and a `data_dictionary.csv` file. Optionally, you may also have an `entity_metadata.csv` file, a `codings.csv` file, or both.

## Step 3. Ingest Your Files

The resulting `data.csv` files are loaded, together with the `data_dictionary.csv` file and any `entity_metadata.csv` and `codings.csv` files you created, into the [Data Model Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-model-loader) app.
