Data Ingestion Key Steps
Generalized phenotypic data ingestion takes well-described data in the form of a data dictionary file (data_dictionary.csv), a codings file (codings.csv) if needed, an optional entity metadata file (entity_metadata.csv), and the accompanying data CSV files. The files are loaded using the app, which validates and ingests the input CSV files to create a Dataset. This Dataset is then accessible using the , or using and our Python SDK, dxdata.
The following steps show how to organize your data into the required file sets. These files can then be loaded using the app to create a database encapsulated by a record, which is then immediately accessible for use.
Is this data phenotypic or clinical data, or is it molecular data?
Include
Examples of phenotypic or clinical data include a patient's height, encounters with a physician or hospital, surgeries, and drugs taken for any treatments. Clinical data may also include descriptive content, such as information on extracted samples: for example, the weight and size of a tumor, or the date and time at which a tumor was excised.
Gross features that describe molecular content may be considered clinical data. For example, we don't recommend including the allele content of the BRCA2 gene for a patient; however, a field such as "Tested positive for BRCA2 risk allele: (yes/no/untested)" may be of use.
Exclude
Examples of molecular data include: allele content of the BRCA2 gene for a patient. This guide does not cover molecular data ingestion. To ingest these and other complex datasets, engaging with the is advised to ensure an optimal experience.
Once you have all of the data to include, decide on the central organizing focus of the data, the main entity. In almost all situations, the focus will be at the individual subject level (e.g., subject, patient, or case). You can think of the main entity as the "item" you want to summarize data around and build cohorts of.
For example, we often want to group individuals into cohorts of subjects, such as:
Subjects that haven't smoked cigarettes before
Subjects that smoke one pack (or more) of cigarettes a week.
The main entity here would be "subject". The main field would be a unique identifier of the individual subject, such as "subject_id".
An entity is simply a grouping of data. You may group data however it best fits your needs. We recommend grouping all data that shares a one-to-one relationship into a single entity, and grouping any nested data into a separate entity. The entities you define will depend on the data you have.
Begin filling out data_dictionary.csv by listing all fields within your data under the "name" column and labeling each field with its respective entity name under the "entity" column.
As your main entity and main field are already defined, write the value "global" under the column primary_key_type for that row in data_dictionary.csv.
For every other entity (other than the main entity) determine if the entity contains a field whose value serves as a primary key. Primary key values must not be null, and each must be unique, to serve as a unique identifier for the entity. Only one defined primary key per entity is allowed. For example, the entity sample might contain the primary key field sample_id.
For each defined primary key, write the value "local" under the column primary_key_type for that row in data_dictionary.csv.
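The primary key rules above (no null values, every value unique) can be checked before loading. The following is a minimal sketch, not part of the ingestion app: the function name `check_primary_key` and the sample rows are hypothetical, and it only assumes that each entity's data is available as a CSV with a header row.

```python
import csv
import io


def check_primary_key(rows, key_field):
    """Return a list of problems found in a candidate primary key column."""
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=1):
        value = (row.get(key_field) or "").strip()
        if not value:
            # Primary key values must not be null.
            problems.append(f"row {line_no}: {key_field} is empty")
        elif value in seen:
            # Each primary key value must be unique within the entity.
            problems.append(f"row {line_no}: duplicate {key_field} {value!r}")
        else:
            seen.add(value)
    return problems


# Hypothetical data for a "sample" entity with primary key field sample_id.
sample_csv = io.StringIO(
    "sample_id,subject_id,tumor_weight_g\n"
    "S1,P1,12.5\n"
    "S2,P1,8.0\n"
    "S2,P2,3.1\n"
    ",P3,4.4\n"
)
rows = list(csv.DictReader(sample_csv))
for problem in check_primary_key(rows, "sample_id"):
    print(problem)
```

An empty result means the field is a usable primary key candidate for that entity.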
For each field in data_dictionary.csv, provide the value "yes" under the "is_sparse_coding" or "is_multi_select" column if either applies to the field. Also determine the field "type".
If you have files with ICD codes, you can leverage automated pre-loaded ICD codings. In this case, one of the reserved terms (icd9cm:2015, icd9pcs:2015, icd10cm:2024, or icd10pcs:2024) must be used as the coding_name in the data dictionary, without corresponding information in the codings.csv file. If necessary, you can overwrite specific codes by adding them to the input codings.csv file.
If a field is categorical, for each value in the field, fill in the "code", "meaning", "parent_code", "display_order", and "concept" columns in the codings.csv file. Finally, label the set of codes with a unique "coding_name". The same "coding_name" value must appear in the codings.csv file and in the corresponding field's row in data_dictionary.csv.
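Because each coding_name must match between the two files, while the reserved ICD terms need no rows in codings.csv, a small cross-check can catch mismatches early. This is an illustrative sketch only: `undefined_coding_names` and the sample rows are hypothetical, and the rows are assumed to be dictionaries such as `csv.DictReader` would produce.

```python
# Reserved coding names listed in the text; they need no rows in codings.csv.
RESERVED_CODINGS = {"icd9cm:2015", "icd9pcs:2015", "icd10cm:2024", "icd10pcs:2024"}


def undefined_coding_names(dictionary_rows, coding_rows):
    """Return coding names used in the data dictionary but defined nowhere."""
    defined = {row["coding_name"] for row in coding_rows}
    used = {row["coding_name"] for row in dictionary_rows if row.get("coding_name")}
    return sorted(used - defined - RESERVED_CODINGS)


# Hypothetical rows from data_dictionary.csv and codings.csv.
dictionary_rows = [
    {"entity": "subject", "name": "subject_id", "coding_name": ""},
    {"entity": "subject", "name": "tumor_status", "coding_name": "tumor_status_codes"},
    {"entity": "encounter", "name": "diagnosis", "coding_name": "icd10cm:2024"},
    {"entity": "subject", "name": "smoking", "coding_name": "smoking_codes"},
]
coding_rows = [
    {"coding_name": "tumor_status_codes", "code": "1", "meaning": "malignant"},
    {"coding_name": "tumor_status_codes", "code": "2", "meaning": "benign"},
]

print(undefined_coding_names(dictionary_rows, coding_rows))
```

Here the hypothetical "smoking_codes" name is reported because it has no rows in the codings data, while the reserved ICD name is accepted without any.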
If a field's format is not among the represented data formats, we suggest reformatting your data, revising your set of entities, or forcing the field into an allowed data format.
As specified earlier, we assume that all of your included data is linked together in some way.
If you only have one entity, the "referenced_entity_field" and "relationship" columns of data_dictionary.csv may be ignored.
If there is more than one entity, you need to describe how each entity relates to another entity.
Start with the main entity, and find all entities that have either a one-to-one or a many-to-one relationship with it. For example, the entity encounter would be related many-to-one to the main entity, subject. In the encounter entity, there should be a field that directly links to a field in the subject entity. Both fields would be named "subject_id", and for the "subject_id" field in the encounter entity, the data_dictionary.csv "referenced_entity_field" value would be "subject:subject_id" and the "relationship" value would be "many_to_one".
Fill in "folder_path" for each field. If a value is not specified, the field will not be displayed in the Cohort Browser.
We recommend grouping fields similarly to how entities are grouped; however, this is not required.
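The linking and folder columns can be reviewed together before loading. The sketch below is a hypothetical pre-check, not the loader's own validation: `check_links_and_folders` is an invented name, and the accepted relationship value "one_to_one" is an assumption (only "many_to_one" appears in the text). It flags references that point at unknown fields, entities with no link to another entity, and fields that would not be displayed in the Cohort Browser because "folder_path" is empty.

```python
def check_links_and_folders(dictionary_rows, main_entity):
    """Return notes about entity links and folder paths in data dictionary rows."""
    known_fields = {(row["entity"], row["name"]) for row in dictionary_rows}
    entities = {row["entity"] for row in dictionary_rows}
    linked = set()
    notes = []
    for row in dictionary_rows:
        label = f"{row['entity']}.{row['name']}"
        reference = row.get("referenced_entity_field", "")
        if reference:
            # References use the "entity:field" form, e.g. "subject:subject_id".
            if tuple(reference.split(":", 1)) not in known_fields:
                notes.append(f"{label}: unknown reference {reference}")
            # "one_to_one" is an assumed spelling; only "many_to_one" is in the text.
            if row.get("relationship") not in ("one_to_one", "many_to_one"):
                notes.append(f"{label}: unexpected relationship value")
            linked.add(row["entity"])
        if not row.get("folder_path"):
            notes.append(f"{label}: no folder_path, not displayed in the Cohort Browser")
    for entity in sorted(entities - linked - {main_entity}):
        notes.append(f"{entity}: no link to another entity")
    return notes


# Hypothetical data dictionary rows (only the relevant columns are shown).
link_rows = [
    {"entity": "subject", "name": "subject_id", "referenced_entity_field": "",
     "relationship": "", "folder_path": "Subject"},
    {"entity": "encounter", "name": "encounter_id", "referenced_entity_field": "",
     "relationship": "", "folder_path": "Encounter"},
    {"entity": "encounter", "name": "subject_id",
     "referenced_entity_field": "subject:subject_id",
     "relationship": "many_to_one", "folder_path": ""},
    {"entity": "sample", "name": "sample_id", "referenced_entity_field": "",
     "relationship": "", "folder_path": "Sample"},
]

for note in check_links_and_folders(link_rows, main_entity="subject"):
    print(note)
```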
For each entity in your data_dictionary.csv file, write all data from each field into a flat data.csv file. Each data.csv file should be named after its entity.
For example, the entity subject should have a respective data.csv file, named subject.csv.
The subject.csv file should contain all of the subject entity's fields from the data_dictionary.csv file as columns, with the rows filled in with data.
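One way to confirm that an entity's data file matches its data dictionary entries is to compare the CSV header with the field names listed for that entity. This is a sketch under assumptions: `compare_columns` is a hypothetical helper, and the dictionary rows and header shown are made-up examples.

```python
import csv
import io


def compare_columns(entity, dictionary_rows, data_header):
    """Return (missing, unexpected) column names for one entity's data file."""
    expected = [row["name"] for row in dictionary_rows if row["entity"] == entity]
    missing = [name for name in expected if name not in data_header]
    unexpected = [name for name in data_header if name not in expected]
    return missing, unexpected


# Hypothetical data dictionary rows (only "entity" and "name" are needed here).
dict_rows = [
    {"entity": "subject", "name": "subject_id"},
    {"entity": "subject", "name": "height_cm"},
    {"entity": "subject", "name": "tumor_status"},
    {"entity": "encounter", "name": "encounter_id"},
]

# Hypothetical contents of subject.csv.
subject_csv = io.StringIO("subject_id,height_cm,weight_kg\nP1,172,80\n")
header = next(csv.reader(subject_csv))

missing, unexpected = compare_columns("subject", dict_rows, header)
print("missing from subject.csv:", missing)
print("not in data dictionary:", unexpected)
```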
For each entity created, add a row to entity_metadata.csv. Ensure that entity and entity_title are populated, and optionally supply an entity_label_singular, entity_label_plural, and entity_description.
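A short sketch of producing the entity metadata file follows. It assumes the file is entity_metadata.csv with the five column names given above in that order; the helper name `write_entity_metadata` and the example entities are hypothetical.

```python
import csv
import io

# Column names as listed in the text; the order here is an assumption.
METADATA_COLUMNS = [
    "entity", "entity_title", "entity_label_singular",
    "entity_label_plural", "entity_description",
]


def write_entity_metadata(entities, out):
    """Write entity metadata rows, requiring entity and entity_title."""
    for entry in entities:
        for required in ("entity", "entity_title"):
            if not entry.get(required):
                raise ValueError(f"missing required value: {required}")
    # restval="" leaves the optional columns blank when they are not supplied.
    writer = csv.DictWriter(out, fieldnames=METADATA_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(entities)


# Hypothetical entities.
entities = [
    {"entity": "subject", "entity_title": "Subject",
     "entity_label_singular": "subject", "entity_label_plural": "subjects"},
    {"entity": "encounter", "entity_title": "Encounter"},
]

metadata_out = io.StringIO()
write_entity_metadata(entities, metadata_out)
print(metadata_out.getvalue())
```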
You should now have one data.csv file per entity and a data_dictionary.csv file. Optionally, you may also have an entity_metadata.csv and/or a codings.csv file.
Now that all entities have been defined, create a data_dictionary.csv file with the following column names: entity, name, primary_key_type, coding_name, is_sparse_coding, is_multi_select, longitudinal_axis_type, type, referenced_entity_field, relationship, folder_path, title, description, units, concept, and linkout. See the for quick reference.
See the page for guidelines as to the definition of each column.
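As an illustration, the data dictionary can be written with Python's standard csv module using the column names listed above. This is a sketch rather than a required workflow: the example fields, the "string" type value, and the folder paths are assumptions chosen for illustration and should be replaced with values that the column guidelines allow.

```python
import csv
import io

# Column names as listed in the text.
COLUMNS = [
    "entity", "name", "primary_key_type", "coding_name", "is_sparse_coding",
    "is_multi_select", "longitudinal_axis_type", "type",
    "referenced_entity_field", "relationship", "folder_path", "title",
    "description", "units", "concept", "linkout",
]

# Hypothetical fields; "string" and the folder paths are placeholder values.
fields = [
    {"entity": "subject", "name": "subject_id", "primary_key_type": "global",
     "type": "string", "folder_path": "Subject", "title": "Subject ID"},
    {"entity": "encounter", "name": "encounter_id", "primary_key_type": "local",
     "type": "string", "folder_path": "Encounter", "title": "Encounter ID"},
    {"entity": "encounter", "name": "subject_id", "type": "string",
     "referenced_entity_field": "subject:subject_id",
     "relationship": "many_to_one", "folder_path": "Encounter",
     "title": "Subject ID"},
]


def write_data_dictionary(rows, out):
    """Write rows under the full header, leaving unspecified columns blank."""
    # DictWriter raises ValueError on keys outside COLUMNS, which catches typos.
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


dictionary_out = io.StringIO()
write_data_dictionary(fields, dictionary_out)
print(dictionary_out.getvalue())
```

Writing to a real file instead of `io.StringIO` only requires opening it with `open("data_dictionary.csv", "w", newline="")`.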
Next, determine whether each field is categorical. If so, you will need to include the specific "coding_name" used. A field is categorical if its values come from a limited set of repeated options rather than being unique, such as the field "Tumor status". Here, reasonable answers would be "malignant", "benign", "not applicable", or "undetermined". If there are any categorical fields in your data, you will need to create a codings.csv file with the following column names: coding_name, code, meaning, parent_code, display_order, and concept. See for more information.
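For the "Tumor status" example, the codings rows could be generated as follows. This is a minimal sketch: the coding name "tumor_status_codes", the numeric codes, and numbering display_order from 1 are all assumptions made for illustration, and parent_code and concept are simply left blank.

```python
import csv
import io

# Column names as listed in the text.
CODING_COLUMNS = [
    "coding_name", "code", "meaning", "parent_code", "display_order", "concept",
]


def build_coding_rows(coding_name, code_to_meaning):
    """Build one codings row per code, in the order the codes are given."""
    rows = []
    for order, (code, meaning) in enumerate(code_to_meaning.items(), start=1):
        rows.append({
            "coding_name": coding_name,
            "code": code,
            "meaning": meaning,
            "parent_code": "",       # flat coding, no hierarchy
            "display_order": order,  # assumed to start at 1
            "concept": "",
        })
    return rows


# Hypothetical codes for the "Tumor status" field.
tumor_rows = build_coding_rows("tumor_status_codes", {
    "1": "malignant",
    "2": "benign",
    "3": "not applicable",
    "4": "undetermined",
})

codings_out = io.StringIO()
writer = csv.DictWriter(codings_out, fieldnames=CODING_COLUMNS)
writer.writeheader()
writer.writerows(tumor_rows)
print(codings_out.getvalue())
```

The same coding name would then go in the coding_name column of the field's row in data_dictionary.csv.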
You now have all the information necessary to determine whether the format of each field (float, integer, date, etc.) will be accepted during ingestion. Only certain formats are currently allowed. Review each field in your data and confirm that it fits a represented data format (see the page).
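A quick way to review a field is to try parsing every value as the intended type and list the ones that fail. The sketch below covers only the three formats named above and is not the app's validation logic: `invalid_values` is a hypothetical helper, ISO 8601 (YYYY-MM-DD) dates are an assumption, and empty cells are treated as missing values rather than errors.

```python
from datetime import date

# Parsers for the formats named in the text; ISO dates are an assumption.
PARSERS = {
    "integer": int,
    "float": float,
    "date": date.fromisoformat,
}


def invalid_values(values, type_name):
    """Return the values that cannot be parsed as the given type."""
    parse = PARSERS[type_name]
    bad = []
    for value in values:
        if value == "":
            continue  # treat empty cells as missing, not invalid
        try:
            parse(value)
        except ValueError:
            bad.append(value)
    return bad


# Hypothetical column contents.
print(invalid_values(["172", "168", "n/a", ""], "integer"))
print(invalid_values(["2024-01-31", "31/01/2024"], "date"))
```

Values reported by the check are candidates for reformatting before the files are loaded.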
For each field, fill in any desired descriptive information. See the section for details on each column.
For more details on each column refer to the section.
The resulting data.csv files will be loaded, together with the data_dictionary.csv, entity_metadata.csv, and codings.csv files, into the app.