Data Files used by the Data Model Loader

Overview

Phenotypic data must be delivered in a data model defined by the following files:

  • A data dictionary CSV file, named data_dictionary.csv in this example.

  • A data CSV file, data.csv.

  • A codings CSV file, codings.csv.

Note: This is intended only for customers working with the DNAnexus xVantage team. Contact [email protected] for more information.

Data Dictionary

Description: The data dictionary file describes data in the data file (*.data.csv) by file name and by column within the specified data file. The data dictionary lists the metadata for the input data.csv file and respective individual columns within the data file. This metadata includes human-readable field names, data types, data coding schemes, and the relationships between the data tables. The information supplied in the data dictionary guides the ingestion process.

Deliverable: The data_dictionary.csv file is similar in structure to a spreadsheet, where columns are defined in Table 1 in the Appendix below. Each row in the data dictionary file represents a field for which an entity, i.e. "patient", "sample", or "encounter," will have a value.

Note: all columns have to be present, however they can be left empty if there are no values for the corresponding entity:field combination. The example below does not include all columns, just enough to illustrate the general format. A complete template can be provided upon request.

Example data_dictionary.csv

entity

name

type

primary_key_type

coding_name

is_sparse_coding

referenced_entity_field

relationship

folder_path

subject

subject_id

string

global

subject

first_name

string

subject

last_name

string

subject

birth_year

date

encounter

encounter_id

string

local

encounter

subject_id

string

subject:subject_id

many_to_one

encounter

date

date

no

Encounter

encounter

height

float

no

Encounter>Measurements

encounter

weight

integer

no

Encounter>Measurements

encounter

sys_blood_pressure

integer

no

Encounter>Measurements

Entity Dictionary (optional)

Description: The entity dictionary file describes entities that the data file represent (*.data.csv). The entity dictionary lists the metadata for each input data.csv. This metadata includes human-readable entity title, singular and plural labels, and a description. The information supplied in the entity dictionary provides additional contextualized information for the user.

Deliverable: The entity_dictionary.csv file is similar in structure to a spreadsheet, where columns are defined in Table 3 in the Appendix below. Each row in the entity dictionary file represents an entity for which data file and entries in the entity_dictionary.csv exist.

Note: all columns have to be present, however they can be left empty if there are no values for the corresponding entity:field combination. The example below does not include all columns, just enough to illustrate the general format. A complete template can be provided upon request.

Example entity_dictionary.csv

entity

entity_title

entity_label_singular

entity_label_plural

entity_description

patient

Patient

Patient Record

Patient Records

The unique record for each patient participating in the clinical trial.

clinical

Clinicals

Clinically supplied data at the patient level.

encounter

Encounters

Encounter

Encounters

lab_test

Laboratory Tests

Lab Test

Lab Tests

Laboratory tests ordered during a patient's hospital visit

Data Files

Description: Phenotypic data must be supplied in a CSV file.

Deliverable: The data.csv file is described in the data dictionary with its name. The data.csv file must be a flattened data structure (a table) with a header, columns and rows. The values must be of the following listed types: date, datetime, integer, float, or string.

Example subject data.csv

subject_id

first_name

last_name

birth_year

1

Mom

Family

1985

2

Dad

Family

1984

3

Baby

Family

2016

Codings

Description: The codings.csv file is a single file that contains information for coded values present in all the *.data.csv files. Often a meaning is designed for human consumption and codes are designed to optimize data storage and retrieval. For example, instead of storing the value “Acquired pure red cell aplasia [erythroblastopenia]” multiple times in a database, it is easier to code this as the shortened ICD10 code, “D60.” Data are stored as codes in the encoded fields. When retrieving a coded value, such as the ICD10 code "D60", from a database and displaying the value for consumption, the system converts this back to the full meaning, “Acquired pure red cell aplasia [erythroblastopenia],” so that the information is intelligible to the users.

Hierarchical Data: The codings file supports specification of a hierarchy. For example, ICD10 codes, A00 to B99, all point to the category, “Certain infectious and parasitic diseases”. The code, A31, points to “Other sepsis” and codes, A31.X, point to specific types of sepsis. Such a hierarchical structure can be defined in the codings.csv in the parent_code column (see example below).

Deliverable: The codings.csv file must have columns as defined in Table 2 of the Appendix. Each categorical field in your data will need to have an associated entry in the codings.csv file. The same coding_name value must be used in the respective data_dictionary.csv field to match each field with its respective codes.

Example codings.csv file

coding_name

code

meaning

parent_code

display_order

blood_pressure

-1

Did not measure

1

blood_pressure

-2

Measurement failed

2

t_f_na

1

True

t_f_na

2

False

t_f_na

3

Not Applicable

reason_discontinued_treatment

1

Pain subsided

reason_discontinued_treatment

2

Lost medication

reason_discontinued_treatment

3

Forgot

allergies

a

Environment

1

allergies

b

Food

2

allergies

c

Medicine

3

Appendix

Table 1: Data Dictionary File Description (data_dictionary.csv)

Column

Possible values*

Description

entity

/^[a-zA-Z][a-zA-Z_0-9]*$/

max length 256 characters

Required: The name of the entity associated with the current field. For example, "subject" or "sample" or encounter.

Note: An entity is a logical table in a dataset (i.e. patients, prescriptions, lab tests). Datasets must contain exactly one main entity (denoted using the primary_key_type column below), which is used as the primary entry point to a dataset, and for which cohorts are to be built. Beyond the main entity, datasets may contain additional entities linked to the main entity using relationships such as one-to-one or one-to-many.

name

/^[a-zA-Z][a-zA-Z_0-9]*$/

max length 2000 characters

Required: The name of the current field. No two fields, within an entity, may have the same name.

type

date

datetime

integer

float

string

Required: The type of the raw data as given to the system. Type "date" is a timepoint with day granularity in ISO 8601 date format. Type "datetime" is a timepoint with millisecond granularity in ISO 8601 datetime format. Type "integer" is an integer number with 64-bit signed integer precision. Type "float" is a floating-point number with 64-bit (double) floating-point precision. When given as a fractional number, values to the right of the decimal point are supported up to 15 digits. Type "string" is a string. There may be one, and only one, “type” for each entity:field.

coding_name

/^[a-zA-Z][a-zA-Z_0-9]*$/

(or left empty)

This column denotes whether the current field is codified. For example a gender field could have type set to "integer" and be codified with a coding where 0 means "female" and 1 means "male". If coding_name is not empty, it must match the coding_name field in the codings.csv file. This column cannot be left empty if is_multi_select (below) is set to "yes".

description

Any character except /^[*'|"]*$/

(or left empty)

A human readable description of the field that will be displayed to users in the Apollo Cohort Browser.

folder_path

/^[a-zA-Z_0-9][ ][a-zA-Z_0-9][>]*$/

(or left empty)

max length 2000 characters

Fields may be displayed in user defined groups, with the strict purpose of visualizing data in a browser. This field will display in the browser as a series of folders and is the method to group entity:name pairs in a hierarchical manner. This then appears as a kind of table of contents for the data in the UI to make it easier to find data fields of interest. The value should be comprised of a single folder name (e.g. “Online follow-up”) or nested folders separated by the carrot symbol “>” (e.g. "Online follow-up > Mental health > Anxiety"). Each folder name must be separated by ">".

If empty, the field will not be accessible within the browser. However, folder_path must be empty where primary_key_type is “local” and where reference_entity_field is not empty. Also, folder_path should be empty for any non-supported data types (e.g., free-text strings) as they can still be imported into the dataset, just not displayed in the Cohort Browser.

is_multi_select

yes

(or left empty)

If you are using a coding_name and the field is multi-select, then enter “yes” in this column. Multi-select fields contains potentially multiple values originating from answering a multiple-choice question (i.e. subject may choose "select all that apply" on a questionnaire). This is only applicable to codified fields. Thus, for "is_multi_select" to be set to "yes", it requires the "coding_name" to also be present. In the data.csv, the data for a multi-select field should be represented as multiple values separated by a pipe character "|" (e.g. “a|b|d|h”). If this column is left blank, then the field is treated as single-select response (i.e. a question in which only one answer is allowed from a list of options) or a single value.

is_sparse_coding

yes

(or left empty)

Set this field to the value “yes” if the coding name field is sparsely codified. The meaning of sparsely codified field depends on the "type".

  • For fields of type date, datetime, integer or float, a sparsely codified field is a field which measures a value of that type, but which supports also special values with special meanings. For example, a weight measurement field could have type set to float for measuring someone's weight, but also be sparsely codified with a code "-1" that means "Patient declined to measure" and a code "-2" that means "Measurement out of range". For such a field, each individual data entry would be a floating point weight measurement, the number -1, the number -2, or NULL. The corresponding entry in the codings.csv file would indicate that code of “ -2” has meaning of "Measurement out of range.”

  • For columns of type "string", a sparsely codified field is a field that can take values from a controlled vocabulary of possible options (originally encoded as string) or a write-in value. For example, a medication field could have "type" set to "string" and be sparsely codified with a RXNORM controlled vocabulary where "RX5640" means "Ibuprofen", etc. For such a field, each individual data entry would be an RXNORM medication code or any other free-form string entry that does not exist in RxNORM (i.e. a written response in an "other" field in a data input screen) .

linkout

Compliant URL

(or left empty)

A URL for linking to additional information.

primary_key_type

global

local

This column denotes whether the current field is a primary key for the entity, and whether that entity is the main entity. If the current field is the primary key for the entity, and the entity is the main entity, this column should contain the value "global". If the current field is the primary key for the entity but the entity is not the main entity, this column should contain the value "local".

There must be exactly one "global" primary_key_type per dataset.

If a field is a primary key, the values must be NOT NULL and UNIQUE.

referenced_entity_field

entity_name:field_name

(or left empty)

If the field is a foreign key that references a field in another entity (e.g., a patient_id field present in the prescriptions entity), then the corresponding reference_entity_field must contain this mapping as the entity_name:field_name. For example: if there is a patient_id field in the prescription entity that maps to the id field in the patients entity, then the corresponding referenced_entity_field is patients:id).

If referenced_entity_field is not blank for a given field, then the relationship field must also contain a value.

relationship

one_to_one

many_to_one

(or left empty)

This describes the relationship between two entities (tables). If, for the specified field, the entity has many rows that map back to a column (i.e. field) in another table, then the relationship is a "many_to_one" relationship for the field.

For example, assume there are two entities, subject and encounter. Each subject may have multiple encounters. The relationship of the encounter_id field to the subject_id field in the encounter.data.csv is "many_to_one" where many encounter_id values could map to the same subject_id value, which represents a subject in the subject_id field of the subject.data.csv. The corresponding reference_entity_field for encounter:subject_id would then have to be subject:subject_id.

If relationship is not blank for a given field, then the referenced_entity_field field must also contain a value.

title

/^[a-zA-Z_0-9][ ][a-zA-Z_0-9]*$/

(or left empty)

max length 2000 characters

A human readable title for name (column name) which will be displayed to users in the Apollo Cohort Browser as the field name.

units

/^[a-zA-Z][a-zA-Z_0-9]*$/

(or left empty)

For the end user to understand the data, it is necessary to specify the unit of measurement. For example, for a temperature field the unit may be "Celsius", "Fahrenheit", "Kelvin", or anything else one prefers. The units will display in the metadata for the field and in any corresponding charts.

* A value is required unless specified as "or left empty."

Table 2: Codings File Description (codings.csv)

Column

Possible values*

Description

coding_name

/^[a-zA-Z][a-zA-Z_0-9]*$/

max length 256 characters

The name of the set of codes referenced when a field is codified. This column cannot be empty.

code

/^[a-zA-Z][a-zA-Z_0-9]*$/

Specifies the code for the respective code:meaning pair.

meaning

Any non-empty string.

max length 2000 characters

Specifies the value each code should take on, for a given grouping of codes, grouped by the coding_name field.

parent_code

An existing code in the coding_name set

The hierarchical structure to a coding group may be maintained by specifying parent codes within the grouping

display_order

A non-zero, non-negative integer

(or left empty)

The order for how values are displayed may be retained by directly specifying the order. Note: use this to control the order of values list-type charts (non-hierarchical) and bar graphs.

Table 3: Entity Dictionary File Description (entity_dictionary.csv)

Column

Possible values*

Description

entity

/^[a-zA-Z][a-zA-Z_0-9]*$/

max length 256 characters

A machine-readable entity name. Values here must be reflected in the entity column of the main data dictionary and the data files (<entity>.csv). For example "subject" or "sample " or "encounter". Note: An entity is a logical table in a dataset (i.e. patients, prescriptions, lab tests).

  • No two entities may have the same name.

  • Entity names need not be descriptive, however it is recommended that they are descriptive and that they reflect the underlying data.

entity_title

/^[a-zA-Z_0-9][ ][a-zA-Z_0-9]*$/

max length 2000 characters

A human-readable title for the value in the entity field. We suggest that the entity_title should be a sensible and informative value, such that it is easy to quickly discern general content of the entity.

  • If left blank, the title will take on the value as specified in the field, entity.

  • Entity titles need not be descriptive, however it is strongly recommended that they are descriptive and that they reflect the underlying data.

This title is used in forming existential cohort filter statement and is available in the data flyouts in the cohort browser.

entity_label_singular

Any non-empty string.

max length 2000 characters

A human-readable entity label in singular grammatical number, like "Lab test". This should often be a value that makes sense in the singular format and is descriptive of the entity.

  • If no value is supplied, default values will be the value as specified in entity_title.

The singular label is used in the tiles on the cohort browser when a value = 1, -1.

entity_label_plural

Any non-empty string.

max length 2000 characters

A human-readable entity label in plural grammatical number, like "Lab tests" . This should often be the plural version of the entity_label_singular (or entity_title if singular label is blank).

  • If no value is supplied, default values will be the value as specified in entity_title.

The plural label is used in the tiles on the cohort browser when a value not equal to 1 (e.g. 0 or >1).

entity_description

Any character except /^['|"]/$/

(or left empty)

A human-readable description of the field. This can describe details about the entity, how it is tracked, any additional information a person may need to know about the entity.

Description is exposed to the user in the entity tooltip on the cohort browser.