Data Ingestion Key Steps

An Apollo license is required to use Data Model Loader. Org approval may also be required. Contact DNAnexus Sales for more information.

Generalized phenotypic data ingestion takes well-described data in the form of a Data Dictionary file, a Codings file if needed, an optional Entity Dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create a Dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which is then immediately accessible for use.

Step 1. Identify Your Data

Decide What Type of Data You Will Ingest

Is this data phenotypic or clinical data, or is it molecular data?

Include

  • Examples of phenotypic or clinical data include: a patient's height, encounters with a physician or hospital, surgeries, drugs taken for any treatments, medical histories, and demographic information. Clinical data may also include descriptive content, such as information on extracted samples (for example, the weight and size of a tumor, or the date and time at which a tumor was excised).

  • Gross features that describe molecular content may be considered clinical data. For example, we don't recommend including the allele content of the BRCA2 gene for a patient, but a field such as "Tested positive for BRCA2 risk allele: (yes/no/untested)" may be of use.

Exclude

  • Examples of molecular data include: allele content of the BRCA2 gene for a patient. This guide does not cover molecular data ingestion. To ingest these and other complex datasets, engaging with the DNAnexus Professional Services team is advised to ensure an optimal experience.

Determine the Main Entity and Main Field

Once you have the data to include, decide on the central organizing focus of the data. In most situations, the focus will be at the individual subject level (i.e., subject, patient, case, participant, or other individual-level entity). You can think of the main entity as the "item" you want to summarize data around and build cohorts of.

For example, we often want to group individuals into cohorts of subjects, such as:

  1. Subjects that haven't smoked cigarettes before

  2. Subjects that smoke one pack (or more) of cigarettes a week

The main entity here would be subject. The main field would be a unique identifier of the individual subject, such as subject_id.

We assume all data in a set of data is "linked" together in some way. For example, if you have a set of data that includes patients and samples, we would expect all included samples to be from a patient contained in the set of data. Samples that have no identified patient should not be included in the data.
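The linkage assumption can be checked before building any files. The following is a minimal sketch, assuming your records are already available as Python dictionaries; the record values and the helper name split_linked are hypothetical and not part of any DNAnexus tool.

```python
# Hypothetical in-memory records; in practice these come from your source data.
subjects = [{"subject_id": "S1"}, {"subject_id": "S2"}]
samples = [
    {"sample_id": "SM1", "subject_id": "S1"},
    {"sample_id": "SM2", "subject_id": "S3"},   # subject S3 is not in the data
    {"sample_id": "SM3", "subject_id": None},   # no identified subject
]


def split_linked(children, parents, key):
    """Separate child records that reference a known parent from those that do not."""
    parent_keys = {parent[key] for parent in parents}
    linked = [c for c in children if c.get(key) in parent_keys]
    orphans = [c for c in children if c.get(key) not in parent_keys]
    return linked, orphans


linked_samples, orphan_samples = split_linked(samples, subjects, "subject_id")
```

Only linked_samples would be carried forward; orphan_samples lists the records to exclude or investigate.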

Define All Other Entities

An entity is simply a grouping of data. You may group data however it best fits your needs. We recommend grouping all data that shares a one-to-one relationship as a single entity, and grouping any nested data as a separate entity. Entities will be dependent on the data you have.

Example

If you have a main entity, subject, you may have another entity, encounter, which contains information from the many encounters a subject has at the hospital, and you may have another entity, sample, which contains information on the many samples extracted from a subject.

  • The entity subject would contain all data that is one-to-one with the subject, such as, first_name, last_name, date_of_birth, sex, race, ethnicity.

  • encounter would contain the date of the hospital visit and perhaps the diagnosis, ICD10_code, that may have resulted from the visit.

  • sample would contain any sample information on the subject, such as date sample was extracted, tissue_type of extraction, and tissue_weight of tissue extracted.

  • Entities will be dependent on the data you have. If you don't have sample data, you don't need a sample entity!

Step 2. Define Your Files

Fill in Entity and Name for Each Field

Now that all entities have been defined, create a data_dictionary.csv file with the following column names: entity, name, primary_key_type, coding_name, is_sparse_coding, is_multi_select, longitudinal_axis_type, type, referenced_entity_field, relationship, folder_path, title, description, units, concept, and linkout. See the Phenotypic Data Ingest File Details for quick reference.

Begin filling out data_dictionary.csv by listing all fields within your data under the name column and label each field with the respective entity name under the column entity.
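As a minimal sketch of this first pass, the following Python builds the data dictionary with only the entity and name columns filled in and leaves every other column blank for later steps. The column names come from the list above; the example fields and the helper names are illustrative assumptions, and the text is written to an in-memory buffer rather than to a file.

```python
import csv
import io

# Column names as listed for data_dictionary.csv.
COLUMNS = [
    "entity", "name", "primary_key_type", "coding_name", "is_sparse_coding",
    "is_multi_select", "longitudinal_axis_type", "type",
    "referenced_entity_field", "relationship", "folder_path", "title",
    "description", "units", "concept", "linkout",
]

# Illustrative fields grouped by entity; replace with your own.
fields_by_entity = {
    "subject": ["subject_id", "date_of_birth", "sex"],
    "encounter": ["subject_id", "ICD10_code"],
}


def build_rows(fields):
    """Create one dictionary row per field with only entity and name set."""
    rows = []
    for entity, names in fields.items():
        for name in names:
            row = dict.fromkeys(COLUMNS, "")
            row["entity"] = entity
            row["name"] = name
            rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialize rows to CSV text with the expected header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dictionary_rows = build_rows(fields_by_entity)
dictionary_csv = to_csv_text(dictionary_rows)
```

Writing dictionary_csv to a file named data_dictionary.csv gives a starting point to fill in during the remaining steps.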

Define the primary_key_type

Main Entity

As your main entity and main field are already defined, write the value global under the column primary_key_type for that row of information in data_dictionary.csv.

All Other Entities

For every other entity (other than the main entity) determine if the entity contains a field whose value serves as a primary key. Primary key values must not be null, and each must be unique, to serve as a unique identifier for the entity. Only one defined primary key per entity is allowed. For example, the entity sample might contain the primary key field sample_id.

For each defined primary key, write the value local under the column primary_key_type for that row of information in data_dictionary.csv.
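Before marking a field as a primary key, it can help to confirm that its values are non-null and unique. This is a rough sketch under the assumption that empty strings count as null; the function name and sample records are hypothetical.

```python
def primary_key_problems(records, key_field):
    """Return a list of messages for null or duplicate primary key values."""
    problems = []
    seen = set()
    for row_number, record in enumerate(records, start=1):
        value = record.get(key_field)
        if value is None or value == "":
            # Assumption: an empty string is treated the same as a null value.
            problems.append(f"row {row_number}: {key_field} is null")
        elif value in seen:
            problems.append(f"row {row_number}: duplicate {key_field} {value!r}")
        else:
            seen.add(value)
    return problems


# Hypothetical sample records with one duplicate and one missing key.
sample_records = [
    {"sample_id": "SM1", "tissue_type": "liver"},
    {"sample_id": "SM1", "tissue_type": "lung"},
    {"sample_id": "", "tissue_type": "skin"},
]
problems = primary_key_problems(sample_records, "sample_id")
```

An empty result suggests the field is a reasonable primary key candidate for that entity.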

For Each Field in Each Entity, Indicate is_sparse_coding, is_multi_select, and type

See the Ingestion Data Types page for guidelines on the definition of each column.

For each field in data_dictionary.csv, enter the value yes under is_sparse_coding or is_multi_select where that property applies to the field. Also, determine the field type.

Build a codings.csv File and Indicate coding_name

Next, determine whether each field is categorical. If it is, you will need to include the specific coding_name used. A field is categorical if its values are drawn from a limited set of repeated options rather than being unique to each row, such as with the field Tumor status. Here, reasonable answers would be malignant, benign, not applicable, or undetermined. If there are any categorical fields in your data, you will need to create a codings.csv file with the following column names: coding_name, code, meaning, parent_code, display_order, and concept. See codings file format for more information.

If you have files with ICD codes, it is possible to leverage automated pre-loaded ICD codings. In this case, reserved terms (icd9cm:2015, icd9pcs:2015, icd10cm:2024, and icd10pcs:2024) must be used as coding_name in the data dictionary without corresponding information in the codings.csv file. Note that if necessary, it is possible to overwrite specific codes by adding them to the input codings.csv file.

If a field is categorical, for each value in the field, fill in the code, meaning, parent_code, display_order, and concept within the codings.csv file. Finally, label the set of codes with a unique coding_name, and use that same coding_name value for the corresponding field in data_dictionary.csv.
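To illustrate, here is a minimal sketch that builds codings rows for the Tumor status example and checks that every coding_name used in the data dictionary is either defined in the codings rows or is one of the reserved ICD terms. The code values ("1" to "4"), the coding name tumor_status, the display_order numbering, and the helper names are assumptions for illustration; consult the codings file format for the values actually accepted.

```python
# Reserved coding names for pre-loaded ICD codings, as listed above.
RESERVED_ICD_CODINGS = {"icd9cm:2015", "icd9pcs:2015", "icd10cm:2024", "icd10pcs:2024"}


def coding_rows(coding_name, code_meanings):
    """Build flat (non-hierarchical) codings rows from (code, meaning) pairs."""
    rows = []
    for order, (code, meaning) in enumerate(code_meanings, start=1):
        rows.append({
            "coding_name": coding_name,
            "code": code,
            "meaning": meaning,
            "parent_code": "",        # left blank: no hierarchy in this example
            "display_order": order,   # assumption: 1-based ordering
            "concept": "",
        })
    return rows


def missing_codings(dictionary_coding_names, codings):
    """List coding names used in the dictionary but not defined or reserved."""
    defined = {row["coding_name"] for row in codings}
    return sorted(
        name for name in dictionary_coding_names
        if name and name not in defined and name not in RESERVED_ICD_CODINGS
    )


# Hypothetical codes for the Tumor status field.
tumor_status_rows = coding_rows("tumor_status", [
    ("1", "malignant"),
    ("2", "benign"),
    ("3", "not applicable"),
    ("4", "undetermined"),
])

# Coding names as they might appear in data_dictionary.csv (blank = not categorical).
undefined = missing_codings(
    ["tumor_status", "icd10cm:2024", "smoking_status", ""], tumor_status_rows
)
```

Here undefined contains only smoking_status, which would still need its own rows in codings.csv.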

Indicate if a Desired Data Format is Allowed

You now have all necessary information to determine whether the desired format of data (float, integer, date, string, boolean, or other data types) will be accepted during ingestion or not. We currently have limitations in place and restrict allowable formats. Review each field in your set of data and confirm that it fits a represented data format (see the Ingestion Data Types page).

If it is not represented, we suggest reformatting your data, restructuring your set of entities, or forcing the field into an allowed data format.
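One way to review a field is to test whether its values parse as the intended format. The sketch below is only an approximation: the names in PARSERS are the format names mentioned above rather than confirmed values for the type column, and the ISO 8601 date format (YYYY-MM-DD) is an assumption. The Ingestion Data Types page remains the authority on what the loader accepts.

```python
from datetime import date

# Assumed parsers per format name; not an official list of loader types.
PARSERS = {
    "integer": int,
    "float": float,
    "date": date.fromisoformat,  # assumption: dates are written as YYYY-MM-DD
    "string": str,
}


def unparseable(values, format_name):
    """Return the non-empty values that do not parse as the given format."""
    parse = PARSERS[format_name]
    bad = []
    for value in values:
        if value == "":
            continue  # treat empty cells as missing, not as format errors
        try:
            parse(value)
        except ValueError:
            bad.append(value)
    return bad


bad_integers = unparseable(["1", "2.5", "", "x"], "integer")
bad_dates = unparseable(["2024-01-31", "31/01/2024"], "date")
```

Values reported here are candidates for reformatting before ingestion.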

Define Entity Relationships and Determine referenced_entity_field and relationship

As specified earlier, we assume that your included data is linked together in some way.

  • If you only have one entity, these data_dictionary.csv columns may be ignored.

  • If there is more than one entity, you need to describe how each entity relates to another entity. Start with the main entity, and identify all entities that have either a one-to-one or many-to-one relationship with it. For example, the entity encounter typically has a many-to-one relationship with the main entity, subject. In this case, the encounter entity should include a field that links directly to a field in the subject entity. Both fields might be named subject_id. For the subject_id field in the encounter entity, set the referenced_entity_field value in data_dictionary.csv to subject:subject_id, and set the relationship value to many_to_one.
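The encounter example above can be sketched as a small update to the data dictionary rows. The helper link_field is hypothetical, and the rows show only the columns relevant to this step.

```python
# Data dictionary rows reduced to the columns relevant here.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id", "primary_key_type": "global",
     "referenced_entity_field": "", "relationship": ""},
    {"entity": "encounter", "name": "subject_id", "primary_key_type": "",
     "referenced_entity_field": "", "relationship": ""},
]


def link_field(rows, entity, name, target, relationship):
    """Set referenced_entity_field and relationship on one dictionary row.

    target uses the entity:field form, for example "subject:subject_id".
    """
    target_entity, target_field = target.split(":")
    known = {(row["entity"], row["name"]) for row in rows}
    if (target_entity, target_field) not in known:
        raise ValueError(f"{target} is not defined in the data dictionary")
    for row in rows:
        if row["entity"] == entity and row["name"] == name:
            row["referenced_entity_field"] = target
            row["relationship"] = relationship
            return row
    raise KeyError(f"{entity}:{name} is not defined in the data dictionary")


linked_row = link_field(
    dictionary_rows, "encounter", "subject_id", "subject:subject_id", "many_to_one"
)
```

Checking that the target exists before linking catches misspelled entity or field names early.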

Add Additional Descriptive Information: title, description, units, and linkout

For each field, fill in any desired descriptive information. See the Data Dictionary section for details on each column.

Specify Folder for Display in the Cohort Browser: folder_path

Fill in folder_path for each field. If a value is not specified, the field will not be displayed in the Cohort Browser.

We recommend grouping fields similarly to how entities are grouped, but this is not required.
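Because fields without a folder_path are not displayed, a quick check for blank values may be useful. This is a small sketch with hypothetical rows; the folder name Demographics is illustrative only.

```python
# Hypothetical data dictionary rows reduced to the relevant columns.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id", "folder_path": ""},
    {"entity": "subject", "name": "sex", "folder_path": "Demographics"},
    {"entity": "encounter", "name": "ICD10_code", "folder_path": "  "},
]


def fields_without_folder(rows):
    """List entity.field names whose folder_path is empty or whitespace."""
    return [
        f'{row["entity"]}.{row["name"]}'
        for row in rows
        if not row.get("folder_path", "").strip()
    ]


hidden_fields = fields_without_folder(dictionary_rows)
```

Any field listed in hidden_fields would be absent from the Cohort Browser unless a folder_path is added.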

Organize Data into Respective Entities: data.csv(s)

For each entity in your data_dictionary.csv file, write all data from each field into a flat data.csv file. The data.csv should be named after the entity.

For example, the entity subject should have a respective data.csv file, named subject.csv.

The subject.csv file should contain all respective fields in the data_dictionary.csv file as columns, and rows should be filled in with data.
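As a minimal sketch, the following derives the columns of an entity's data file from the data dictionary rows and serializes the records to CSV text. The helper name, the sample records, and the choice to leave missing values as empty cells are assumptions.

```python
import csv
import io


def entity_csv(entity, dictionary_rows, records):
    """Return (file name, CSV text) for one entity, with columns from the dictionary."""
    columns = [row["name"] for row in dictionary_rows if row["entity"] == entity]
    for record in records:
        extra = set(record) - set(columns)
        if extra:
            raise ValueError(f"fields not in data dictionary: {sorted(extra)}")
    buffer = io.StringIO()
    # restval="" leaves missing values as empty cells (an assumption).
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return f"{entity}.csv", buffer.getvalue()


# Hypothetical dictionary rows and subject records.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id"},
    {"entity": "subject", "name": "sex"},
    {"entity": "encounter", "name": "subject_id"},
]
file_name, csv_text = entity_csv(
    "subject",
    dictionary_rows,
    [{"subject_id": "S1", "sex": "F"}, {"subject_id": "S2"}],
)
```

Rejecting fields that are absent from the data dictionary keeps each data file and the dictionary in agreement.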

Generate Entity Details: entity_metadata.csv

For each entity created, add a row to entity_metadata.csv with entity and entity_title populated, and optionally supply an entity_label_singular, entity_label_plural, and entity_description.

For more details on each column refer to the Entity Dictionary section.
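A rough sketch of generating this file follows. The column names are those mentioned above, but their order, the example titles, and the helper name are assumptions; the required-column check reflects only the rule that entity and entity_title must be populated.

```python
import csv
import io

# Column order is an assumption based on the order the columns are mentioned.
ENTITY_COLUMNS = [
    "entity", "entity_title", "entity_label_singular",
    "entity_label_plural", "entity_description",
]


def entity_metadata_text(entities):
    """Serialize entity rows to CSV text, requiring entity and entity_title."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=ENTITY_COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in entities:
        if not row.get("entity") or not row.get("entity_title"):
            raise ValueError("entity and entity_title must be populated")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical titles and labels for two entities.
metadata_csv = entity_metadata_text([
    {"entity": "subject", "entity_title": "Subject",
     "entity_label_singular": "subject", "entity_label_plural": "subjects"},
    {"entity": "encounter", "entity_title": "Encounter"},
])
```

Optional columns that are not supplied are written as empty cells.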

You should now have a data.csv file for each entity and a data_dictionary.csv file. Optionally, you may also have an entity_metadata.csv and/or a codings.csv file.

Step 3. Ingest Your Files

The resulting data.csv files will be loaded with the data_dictionary.csv, entity_metadata.csv, and codings.csv files into the Data Model Loader app.
