Data Files Used by the Data Model Loader
Overview
Phenotypic data must be delivered in a data model defined by the following files:
A data dictionary CSV file, named
data_dictionary.csv
in this example.A data CSV file,
data.csv
.A codings CSV file,
codings.csv
.
Data Dictionary
Description: The data dictionary file describes data in the data file (*.data.csv
) by file name and by column within the specified data file. The data dictionary lists the metadata for the input data.csv
file and respective individual columns within the data file. This metadata includes human-readable field names, data types, data coding schemes, and the relationships between the data tables. The information supplied in the data dictionary guides the ingestion process.
The data dictionary can be used to guide Data Model Loader to preload ICD codings and automatically build the codings CSV file. To take advantage of this automated feature in Data Model Loader, reserved terms (icd9cm:2015
, icd9pcs:2015
, icd10cm:2024
, and icd10pcs:2024
) must be used as coding_name
in the data dictionary. By default, ICD codings are hierarchically grouped. To disable hierarchy, the reserved term should be followed by :exclude_hierarchy
, for example icd9cm:2015:exclude_hierarchy
.
Deliverable: The data_dictionary.csv
file is similar in structure to a spreadsheet, where columns are defined in Table 1 in the Appendix below. Each row in the data dictionary file represents a field for which an entity, i.e. "patient", "sample", or "encounter," will have a value.
All columns have to be present; however, they can be left empty if there are no values for the corresponding entity:field combination. The example below does not include all columns, just enough to illustrate the general format. A complete template can be provided upon request.
Example data_dictionary.csv
data_dictionary.csv
subject
subject_id
string
global
subject
first_name
string
subject
last_name
string
subject
birth_year
date
encounter
encounter_id
string
local
encounter
subject_id
string
subject:subject_id
many_to_one
encounter
date
date
no
Encounter
encounter
height
float
no
Encounter>Measurements
encounter
weight
integer
no
Encounter>Measurements
encounter
sys_blood_pressure
integer
no
Encounter>Measurements
Entity Dictionary
Description: The optional entity dictionary file describes entities that the data file represent (*.data.csv). The entity dictionary lists the metadata for each input data.csv
. This metadata includes human-readable entity title, singular and plural labels, and a description. The information supplied in the entity dictionary provides additional contextualized information for the user.
Deliverable: The entity_dictionary.csv
file is similar in structure to a spreadsheet, where columns are defined in Table 3 in the Appendix below. Each row in the entity dictionary file represents an entity for which data file and entries in the entity_dictionary.csv
exist.
Example entity_dictionary.csv
entity_dictionary.csv
patient
Patient
Patient Record
Patient Records
The unique record for each patient participating in the clinical trial.
clinical
Clinicals
Clinically supplied data at the patient level.
encounter
Encounters
Encounter
Encounters
lab_test
Laboratory Tests
Lab Test
Lab Tests
Laboratory tests ordered during a patient's hospital visit
Data Files
Description: Phenotypic data must be supplied in a CSV file.
Deliverable: The data.csv
file is described in the data dictionary with its name. The data.csv
file must be a flattened data structure (a table) with a header, columns and rows. The values must be of the following listed types: date, datetime, integer, float, or string.
Example subject data.csv
data.csv
1
Mom
Family
1985
2
Dad
Family
1984
3
Baby
Family
2016
Codings
Description: The codings.csv
file is a single file that contains information for coded values present in all the *.data.csv
files. Often a meaning is designed for human consumption and codes are designed to optimize data storage and retrieval. For example, instead of storing the value “Acquired pure red cell aplasia [erythroblastopenia]” multiple times in a database, it is easier to code this as the shortened ICD10 code, “D60.” Data are stored as codes in the encoded fields. When retrieving a coded value, such as the ICD10 code "D60", from a database and displaying the value for consumption, the system converts this back to the full meaning, “Acquired pure red cell aplasia [erythroblastopenia],” so that the information is intelligible to the users.
Hierarchical Data: The codings file supports specification of a hierarchy. For example, ICD10 codes, A00 to B99, all point to the category, “Certain infectious and parasitic diseases”. The code, A31, points to “Other sepsis” and codes, A31.X, point to specific types of sepsis. Such a hierarchical structure can be defined in the codings.csv
in the parent_code column (see example below).
Deliverable: The codings.csv
file must have columns as defined in Table 2 of the Appendix. Each categorical field in your data will need to have an associated entry in the codings.csv
file. The same coding_name value must be used in the respective data_dictionary.csv
field to match each field with its respective codes.
Example codings.csv
file
codings.csv
fileblood_pressure
-1
Did not measure
1
blood_pressure
-2
Measurement failed
2
t_f_na
1
True
t_f_na
2
False
t_f_na
3
Not Applicable
reason_discontinued_treatment
1
Pain subsided
reason_discontinued_treatment
2
Lost medication
reason_discontinued_treatment
3
Forgot
allergies
a
Environment
1
allergies
b
Food
2
allergies
c
Medicine
3
Appendix
Table 1: Data Dictionary File Description (data_dictionary.csv
)
data_dictionary.csv
)entity
^[a-zA-Z][a-zA-Z_0-9]*$
Required: The name of the entity associated with the current field.
name
^[a-zA-Z][a-zA-Z_0-9]*$
Required: The name of the current field. No two fields, within an entity, may have the same name.
type
date, datetime, integer,
Required: The type of the raw data as given to the system. Type "date" is a timepoint with day granularity...
coding_name
[\w+-]+
This column denotes whether the current field is codified. For example a gender field could have type set to "integer".
description
Any character except ^[*'|"]*$
(or left empty)
A human readable description of the field that will be displayed to users in the Apollo Cohort Browser.
folder_path
^[a-zA-Z_0-9][>]*$
Fields may be displayed in user defined groups, with the strict purpose of visualizing data in a browser. This field will...
is_multi_select
yes (or left empty)
If you are using a coding_name and the field is multi-select, then enter "yes" in this column.
is_sparse_coding
yes (or left empty)
Set this field to the value "yes" if the coding name field is sparsely codified. The meaning of sparsely codified field...
linkout
Compliant URL (or left empty)
A URL for linking to additional information.
primary_key_type
global, local
This column denotes whether the current field is a primary key for the entity, and whether that entity is the main...
referenced_entity_field
^entity_name:field_name$
(or left empty)
If the field is a foreign key that references a field in another entity (e.g., a patient_id
field present in the...
relationship
one_to_one, many_to_one (or left empty)
This describes the relationship between two entities (tables). If, for the specified field, the entity has many rows...
title
^[a-zA-Z_0-9][>]*$
A human readable title for name (column name) which will be displayed to users in the Apollo Cohort Browser.
units
^[a-zA-Z][a-zA-Z_0-9]*$
For the end user to understand the data, it is necessary to specify the unit of measurement.
* A value is required unless specified as "or left empty."
Table 2: Codings File Description (codings.csv
)
codings.csv
)coding_name
/r"[\w+-]+"/
max length 256 characters
The name of the set of codes referenced when a field is codified. This column cannot be empty.
code
Unique for a specific coding_name
Specifies the code for the respective code:meaning pair.
meaning
Any non-empty string.
max length 2000 characters
Specifies the value each code should take on, for a given grouping of codes, grouped by the coding_name field.
parent_code
An existing code in the coding_name set
The hierarchical structure to a coding group may be maintained by specifying parent codes within the grouping.
display_order
A non-zero, non-negative integer or left empty
The order for how values are displayed may be retained by directly specifying the order. Note: use this to control the order of values list-type charts (non-hierarchical) and bar graphs.
Table 3: Entity Dictionary File Description (entity_dictionary.csv
)
entity_dictionary.csv
)entity
A machine-readable entity name. Values here must be reflected in the main data dictionary.
No two entities may have the same name. Entity names need not be descriptive, however it is recommended they are descriptive and reflect the underlying data.
name
A human-readable field name, which will be displayed to users in the Apollo Cohort Browser.
No two fields within an entity may have the same name.
type
The type of raw data as given to the system, such as date, datetime, integer, float, or string.
There may be one and only one type for each entity:field combination. Type "date" is a timepoint with day granularity in ISO 8601 date format, type "datetime" is a timepoint with millisecond granularity in ISO 8601 datetime format, type "integer" is an integer number with 64-bit signed integer precision, type "float" is a floating-point number with 64-bit (double) floating-point precision, and type "string" is a string.
coding_name
This column denotes whether the current field is codified and which coding is being used.
If this column is not empty, it must match the coding name field in the codings.csv file except when leveraging automated pre-loaded ICD codings. In this case, reserved terms must be used and hierarchically grouped unless the :exclude_hierarchy term is appended.
description
A human-readable description of the field that will be displayed to users in the Apollo Cohort Browser.
A description is exposed to the user in the field tooltip on the Cohort Browser.
folder_path
This column specifies whether fields are displayed in user-defined groups or not.
Fields may be displayed in user defined groups, with the purpose of visualizing data in a browser. This then appears as a kind of table of contents for the data in the UI to make it easier to find fields.
is_multi_select
If you are using a coding name and the field is multi-select, then enter “yes” in this column.
This is only applicable to codified fields and implies that there is a coding name.
is_sparse_coding
If the field's coding name field is sparsely codified, then enter “yes” in this column.
This denotes whether the current field's coding name field is sparsely codified and depends on type.
linkout
A URL for linking to additional information.
primary_key_type
This column denotes whether the current field is an entity's primary key and if it is the main entity.
There must be exactly one "global" primary_key_type per dataset and if a field is a primary key, the values must be NOT NULL and UNIQUE.
referenced_entity_field
This column denotes whether the field is a foreign key that references another entity's data.
If this column is not blank for a field, then the relationship and referenced_entity_field must also contain a value. Every entity in the data dictionary should be referenced or reference another entity (all entities linked).
relationship
This describes the relationship between two entities. If many rows in one table map back to a column in another, then the relationship is "many_to_one".
If this column is not blank for a field, then the referenced_entity_field and relationship must also contain a value.
title
A human-readable title for the field name, which will be displayed to users in the Apollo Cohort Browser.
This title is used in forming existential cohort filter statements and is available in the field flyouts in the Cohort Browser.
units
For the end user to understand the data, it is necessary to specify the unit of measurement.
For example, for a temperature field the unit may be Celsius, Fahrenheit, or Kelvin and will display in the metadata for the field and any corresponding charts.
entity_title
A human-readable title for the value in the entity field, making sense as a singular format.
This should often be a single value that makes sense in the singular format and is descriptive of the entity. The title is used when a value equals 1 in the Cohort Browser.
entity_label_singular
A human-readable entity label in singular grammatical number, making sense as a single value.
This should often be the same value as specified in entity_title and is descriptive of the entity. The singular label is used when a value equals 1 in the Cohort Browser.
entity_label_plural
A human-readable entity label in plural grammatical number, making sense as plural values.
This should often be the plural version of entity_label_singular (or entity_title if singular is blank) and is descriptive of the entity. The plural label is used when a value not equal to 1 in the Cohort Browser.
entity_description
A human-readable description of the field, making sense as a plural value.
This can describe details about the entity, how it is tracked, any additional information a person may need to know about the entity. A description is exposed to the user in the entity tooltip on the Cohort Browser.
Last updated
Was this helpful?