Data Ingestion Key Steps

Overview

Generalized phenotypic data ingestion takes well-described data in the form of a Data Dictionary file, a Codings file if needed, an optional Entity Dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create a Dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which is then immediately accessible for use.

Guide

1. Identifying Your Data

1.1 Decide upon the set of data you want to ingest

Is this data phenotypic or clinical data, or is it molecular data?

Include

  • Examples of phenotypic or clinical data include: a patient's height, encounters with a physician or hospital, surgeries, drugs taken for any treatments, etc. Clinical data may also include descriptive content, such as information on samples extracted: for example, the weight and size of a tumor, or the date and time at which a tumor was excised.

  • Gross features that describe molecular content may be considered clinical data. For example, we don't recommend including the allele content of the BRCA2 gene for a patient; however, a field such as "Tested positive for BRCA2 risk allele: (yes/no/untested)" may be of use.

Exclude

  • Examples of molecular data include: allele content of the BRCA2 gene for a patient. This guide does not cover molecular data ingestion. To ingest these and other complex datasets, engaging with xVantage is advised to ensure an optimal experience.

1.2 Decide upon your main entity and main field

Once you have all of the data to include, you next need to decide upon the central organizing focus of the data. In almost all situations, the focus will be at the individual subject level (i.e., subject, patient, case, etc.). You can think of the main entity as the "item" you want to summarize data around and build cohorts of.

Example

We often want to group individuals into cohorts of subjects:

  1. Subjects that haven't smoked cigarettes before and

  2. Subjects that smoke one pack (or more) of cigarettes a week.

The main entity here would be "subject". The main field would be a unique identifier of the individual subject, such as "subject_id".

Note: We assume all data in a set of data is "linked" together in some manner. For example, if you have a set of data that includes patients and samples, we would expect all included samples to be from a patient contained in the set of data. Samples that have no identified patient should not be included in the data.
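The linkage assumption in the note above can be checked before any files are built. The following is a minimal sketch, not part of the ingestion tooling: the rows, the helper name split_linked, and the field names "subject_id" and "sample_id" are all hypothetical, and an empty string is assumed to mean "no identified patient."

```python
# Hypothetical rows; in practice these would come from your own source data.
subjects = [{"subject_id": "S1"}, {"subject_id": "S2"}]
samples = [
    {"sample_id": "A", "subject_id": "S1"},
    {"sample_id": "B", "subject_id": "S9"},  # subject not present in the data set
    {"sample_id": "C", "subject_id": ""},    # no identified subject (assumed blank)
]


def split_linked(samples, subjects, key="subject_id"):
    """Separate samples that link to a known subject from those that do not."""
    known = {row[key] for row in subjects}
    linked = [s for s in samples if s.get(key) in known]
    orphans = [s for s in samples if s.get(key) not in known]
    return linked, orphans


linked, orphans = split_linked(samples, subjects)
print("keep:", [s["sample_id"] for s in linked])
print("exclude:", [s["sample_id"] for s in orphans])
```

Only the linked samples would go on to the sample entity; the excluded ones can be reviewed separately.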

1.3 Define all other entities

An entity is simply a grouping of data. You may group data however it best fits your needs. We recommend grouping all data that shares a one-to-one relationship into a single entity, and grouping any nested data as a separate entity. Entities will be dependent on the data you have.

Example

If you have a main entity, subject, you may have another entity, encounter, which contains information from the many encounters a subject has at the hospital, and you may have another entity, sample, which contains information on the many samples extracted from a subject.

  • The entity subject would contain all data that is one-to-one with the subject, such as, "first_name," "last_name," "date_of_birth," "sex," "race," "ethnicity."

  • encounter would contain the "date" of the hospital visit and perhaps the diagnosis, "ICD10_code," that may have resulted from the visit.

  • sample would contain any sample information on the subject, such as the "date" the sample was extracted, the "tissue_type" of the extraction, and the "tissue_weight" of the tissue extracted.

  • Entities will be dependent on the data you have. If you don't have sample data, you don't need a sample entity!

2. Defining Your Files

2.1 Fill in "entity" and "name" for each field

Now that all entities have been defined, create a data_dictionary.csv file with the following column names: entity, name, primary_key_type, coding_name, is_sparse_coding, is_multi_select, longitudinal_axis_type, type, referenced_entity_field, relationship, folder_path, title, description, units, concept, and linkout. See the Phenotypic Data Ingest File Details for quick reference.

Begin filling out data_dictionary.csv by listing all fields within your data under the "name" column and labeling each field with the respective entity name under the column "entity."
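As a starting point, the skeleton of data_dictionary.csv can be generated from a mapping of entities to field names. This is a minimal sketch using only the Python standard library; the column names are the ones listed above, while the entity_fields mapping and the helper names build_rows and to_csv are hypothetical. All other columns are left blank to be filled in during the later steps.

```python
import csv
import io

# Column names as listed in this step.
COLUMNS = [
    "entity", "name", "primary_key_type", "coding_name", "is_sparse_coding",
    "is_multi_select", "longitudinal_axis_type", "type",
    "referenced_entity_field", "relationship", "folder_path", "title",
    "description", "units", "concept", "linkout",
]

# Hypothetical entities and fields, following the examples in this guide.
entity_fields = {
    "subject": ["subject_id", "date_of_birth", "sex"],
    "encounter": ["subject_id", "date", "ICD10_code"],
}


def build_rows(entity_fields):
    """Create one data dictionary row per field, with only entity and name set."""
    rows = []
    for entity, names in entity_fields.items():
        for name in names:
            row = dict.fromkeys(COLUMNS, "")
            row["entity"] = entity
            row["name"] = name
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_rows(entity_fields)
csv_text = to_csv(rows)
print(csv_text)
```

Writing csv_text to a file named data_dictionary.csv gives a file with one row per field, ready for the remaining columns.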

2.2 Defining the "primary_key_type"

Main Entity

As your main entity and main field are already defined, write the value "global" under the column "primary_key_type" for that row of information in data_dictionary.csv.

All Other Entities

For every entity other than the main entity, determine whether the entity contains a field that is a "primary key." A primary key is UNIQUE and NOT NULL and acts as an identifier for the entity. Only one defined primary key per entity is allowed. For example, the entity, sample, might contain the primary key, "sample_id." Although non-main entities don't need a primary key, it is advised that one exist.

For each defined primary key, write the value "local" under the column "primary_key_type" for that row of information in data_dictionary.csv.
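Before marking a field as a primary key, it can help to confirm that its values really are unique and not null. The sketch below is illustrative only: the helper names and sample rows are hypothetical, and treating None or a blank string as "null" is an assumption about how missing values appear in your data.

```python
def is_valid_primary_key(values):
    """Return True if no value is missing and no value repeats."""
    seen = set()
    for value in values:
        if value is None or str(value).strip() == "":
            return False  # NOT NULL violated (blank assumed to mean null)
        if value in seen:
            return False  # UNIQUE violated
        seen.add(value)
    return True


def mark_primary_key(dictionary_rows, entity, name, key_type):
    """Set primary_key_type on one field, allowing one primary key per entity."""
    if any(r["entity"] == entity and r.get("primary_key_type")
           for r in dictionary_rows):
        raise ValueError(f"entity {entity!r} already has a primary key")
    matches = [r for r in dictionary_rows
               if r["entity"] == entity and r["name"] == name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one row for {entity}.{name}")
    matches[0]["primary_key_type"] = key_type


# Hypothetical data dictionary rows and sample identifiers.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id", "primary_key_type": ""},
    {"entity": "sample", "name": "sample_id", "primary_key_type": ""},
    {"entity": "sample", "name": "tissue_type", "primary_key_type": ""},
]
sample_ids = ["A", "B", "C"]

mark_primary_key(dictionary_rows, "subject", "subject_id", "global")
if is_valid_primary_key(sample_ids):
    mark_primary_key(dictionary_rows, "sample", "sample_id", "local")
```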

2.3 For each field in each entity, determine: "is_sparse_coding," "is_multi_select," and "type"

See the Ingestion Data Types page for guidelines as to the definition of each column.

For each field in data_dictionary.csv, write the value "yes" under the "is_sparse_coding" or "is_multi_select" column if that property applies to the field. Also, determine the field "type."

2.4 Build a codings.csv file and determine "coding_name"

Next, determine whether each field is a categorical field. If so, you will need to include the specific "coding_name" used. A field is categorical if its values are drawn from a limited set of repeated values rather than being unique, such as with the field "Tumor status." Here, reasonable answers would be "malignant," "benign," "not applicable," or "undetermined." If there are any categorical fields in your data, you will need to create a codings.csv file with the following column names: coding_name, code, meaning, parent_code, display_order, and concept. See the codings csv page for quick reference.

If a field is categorical, for each value in the field, fill in the "code," "meaning," "parent_code," "display_order," and "concept" within the codings.csv file. Finally, label the set of codes with a unique "coding_name." The same "coding_name" value must appear both in the codings.csv file and in the row for the respective field in data_dictionary.csv.
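The rows for one coding can be assembled and cross-checked against the data dictionary with a few lines of Python. This is a sketch under stated assumptions: the codes "1" through "4", the coding names, the helper names, and numbering "display_order" from 1 are all hypothetical choices for illustration, and "parent_code" and "concept" are left blank here.

```python
# Column names for codings.csv as listed in this step.
CODING_COLUMNS = [
    "coding_name", "code", "meaning", "parent_code", "display_order", "concept",
]


def build_coding(coding_name, code_meanings):
    """Build codings.csv rows for one coding from (code, meaning) pairs."""
    rows = []
    for order, (code, meaning) in enumerate(code_meanings, start=1):
        row = dict.fromkeys(CODING_COLUMNS, "")
        row.update(coding_name=coding_name, code=code, meaning=meaning,
                   display_order=order)  # ordering from 1 is an assumption
        rows.append(row)
    return rows


def undefined_codings(dictionary_rows, coding_rows):
    """Return coding names used in the data dictionary but absent from codings."""
    defined = {r["coding_name"] for r in coding_rows}
    used = {r["coding_name"] for r in dictionary_rows if r.get("coding_name")}
    return sorted(used - defined)


# Hypothetical codes for the "Tumor status" example.
tumor_status = build_coding("tumor_status", [
    ("1", "malignant"),
    ("2", "benign"),
    ("3", "not applicable"),
    ("4", "undetermined"),
])

# Hypothetical data dictionary rows; the second references a coding not yet built.
dictionary_rows = [
    {"entity": "sample", "name": "tumor_status", "coding_name": "tumor_status"},
    {"entity": "sample", "name": "tissue_type", "coding_name": "tissue_type"},
    {"entity": "sample", "name": "tissue_weight", "coding_name": ""},
]
missing = undefined_codings(dictionary_rows, tumor_status)
print("codings still to define:", missing)
```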

2.5 Determine if desired data format is allowed

You now have all the information necessary to determine whether the desired format of each field (float, integer, date, etc.) will be accepted during ingestion. We currently restrict the allowable formats. Review each field in your set of data and confirm that it fits a represented data format (see the Ingestion Data Types page).

If it is not represented, we suggest reformatting your data, your set of entities, or forcing a field into an allowed data format.

2.6 Define entity relationships and determine: "referenced_entity_field" and "relationship"

As specified earlier, we assume that all of your included data is linked together in some way.

  • If you only have one entity, these data_dictionary.csv columns may be ignored.

  • If there is more than one entity, you need to describe how each entity relates to another entity.

Start with the main entity, and find all entities that have either a one-to-one or a many-to-one relationship with it. For example, the entity, encounter, would be related many-to-one with the main entity, subject. In the encounter entity, there should be a field that directly links to a field in the subject entity. Each field would be named "subject_id," and for the "subject_id" field in the encounter entity, the data_dictionary.csv "referenced_entity_field" value would be "subject:subject_id," and the "relationship" value would be "many_to_one."
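The encounter example above can be expressed as a small helper that fills in the two relationship columns, plus a check that every reference points at a field that actually exists in the data dictionary. This is a sketch only: the helper names and the sample rows are hypothetical, and the "entity:field" format is taken from the "subject:subject_id" example in this step.

```python
def link_field(row, ref_entity, ref_field, relationship):
    """Return a copy of a data dictionary row with its relationship columns set."""
    linked = dict(row)
    linked["referenced_entity_field"] = f"{ref_entity}:{ref_field}"
    linked["relationship"] = relationship
    return linked


def unresolved_references(dictionary_rows):
    """List referenced_entity_field values that match no entity/name row."""
    defined = {(r["entity"], r["name"]) for r in dictionary_rows}
    missing = []
    for r in dictionary_rows:
        ref = r.get("referenced_entity_field", "")
        if ref:
            entity, _, field = ref.partition(":")
            if (entity, field) not in defined:
                missing.append(ref)
    return missing


# Hypothetical rows following the subject/encounter example.
subject_id = {"entity": "subject", "name": "subject_id"}
encounter_subject_id = link_field(
    {"entity": "encounter", "name": "subject_id"},
    "subject", "subject_id", "many_to_one",
)
dictionary_rows = [subject_id, encounter_subject_id]
print(encounter_subject_id["referenced_entity_field"])
print("unresolved:", unresolved_references(dictionary_rows))
```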

2.7 Add additional descriptive information: "title," "description," "units," and "linkout"

For each field, fill in any desired descriptive information. See the Data Dictionary section for details on each column.

2.8 Specify folder for Cohort Browser: "folder_path"

Fill in "folder_path" for each field. If a value is not specified, the field will not be displayed in the Cohort Browser.

We recommend grouping fields similarly to how entities are grouped; however, this is not required.
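Because a field with no "folder_path" is not displayed in the Cohort Browser, it can be useful to list such fields and confirm that each omission is intentional. A minimal sketch, with a hypothetical helper name, hypothetical rows, and a hypothetical folder name "Demographics":

```python
def fields_without_folder(dictionary_rows):
    """Return 'entity.name' for every row whose folder_path is blank."""
    return [
        f"{r['entity']}.{r['name']}"
        for r in dictionary_rows
        if not (r.get("folder_path") or "").strip()
    ]


# Hypothetical data dictionary rows.
dictionary_rows = [
    {"entity": "subject", "name": "sex", "folder_path": "Demographics"},
    {"entity": "subject", "name": "race", "folder_path": ""},
    {"entity": "sample", "name": "tissue_type"},  # column left out entirely
]
print("not displayed:", fields_without_folder(dictionary_rows))
```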

2.9 Organize data into respective entities: data.csv(s)

For each entity in your data_dictionary.csv file, write all data from each field into a flat data.csv file. The data.csv should be named after the entity.

Example

The entity subject should have a respective data file named "subject.csv."

"subject.csv" should contain all respective fields in the data_dictionary.csv file as columns, and rows should be filled in with data.
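One way to keep each data file consistent with the data dictionary is to derive the CSV columns from the dictionary rows for that entity. The sketch below uses only the standard library; the helper name entity_csv and the sample records are hypothetical, and writing missing values as empty cells is an assumption about how blanks should be represented.

```python
import csv
import io


def entity_csv(entity, dictionary_rows, records):
    """Return (file name, CSV text) for one entity, with columns taken
    from the data dictionary rows that belong to that entity."""
    fields = [r["name"] for r in dictionary_rows if r["entity"] == entity]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only dictionary fields; missing values become empty cells.
        writer.writerow({f: record.get(f, "") for f in fields})
    return f"{entity}.csv", buf.getvalue()


# Hypothetical dictionary rows and subject records.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id"},
    {"entity": "subject", "name": "sex"},
    {"entity": "encounter", "name": "date"},
]
records = [
    {"subject_id": "S1", "sex": "F", "notes": "not in dictionary"},
    {"subject_id": "S2"},
]
file_name, text = entity_csv("subject", dictionary_rows, records)
print(file_name)
print(text)
```

Fields present in a record but absent from the data dictionary (such as "notes" here) are dropped, which surfaces any mismatch between the data and the dictionary early.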

2.10 Generate your entity details: entity_metadata.csv

For each entity created, add a row and ensure that the entity and entity_title are populated. Optionally supply an entity_label_singular, entity_label_plural, and entity_description.

For more details on each column refer to the Entity Dictionary section.
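The rows of entity_metadata.csv can be built and checked for the two required values as follows. This is a sketch, not a definitive layout: the column order simply follows the order in which the columns are mentioned above, and the helper names and example titles are hypothetical.

```python
# Columns mentioned in this step; the order is an assumption.
ENTITY_COLUMNS = [
    "entity", "entity_title", "entity_label_singular",
    "entity_label_plural", "entity_description",
]
REQUIRED = ("entity", "entity_title")


def entity_row(entity, entity_title, **optional):
    """Build one entity_metadata.csv row; optional columns default to blank."""
    row = dict.fromkeys(ENTITY_COLUMNS, "")
    row["entity"] = entity
    row["entity_title"] = entity_title
    for column, value in optional.items():
        if column not in ENTITY_COLUMNS:
            raise KeyError(f"unknown column: {column}")
        row[column] = value
    return row


def missing_required(rows):
    """Return (entity, column) pairs where a required value is blank."""
    return [
        (row.get("entity", ""), column)
        for row in rows
        for column in REQUIRED
        if not row.get(column)
    ]


# Hypothetical entities and titles.
entity_rows = [
    entity_row("subject", "Subject",
               entity_label_singular="subject", entity_label_plural="subjects"),
    entity_row("encounter", ""),  # title left blank to show the check
]
print("missing:", missing_required(entity_rows))
```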

You should now have the data.csv files, the data_dictionary.csv, the entity_metadata.csv (optional), and the codings.csv (optional).

3. Ingesting Your Files

The resulting data.csv files will be loaded with the data_dictionary.csv, entity_metadata.csv, and codings.csv files into the Data Model Loader app.