Data Files Used by the Data Model Loader

The information on this page is for the use of customers working with the DNAnexus Professional Services team. Contact DNAnexus Sales for more information.

Overview

Phenotypic data must be delivered in a data model defined by the following files:

  • A data dictionary CSV file, named data_dictionary.csv in this example.

  • A data CSV file, data.csv.

  • A codings CSV file, codings.csv.

Data Dictionary

Description: The data dictionary file describes data in the data file (*.data.csv) by file name and by column within the specified data file. The data dictionary lists the metadata for the input data.csv file and respective individual columns within the data file. This metadata includes human-readable field names, data types, data coding schemes, and the relationships between the data tables. The information supplied in the data dictionary guides the ingestion process.

The data dictionary can be used to guide Data Model Loader to preload ICD codings and automatically build the codings CSV file. To take advantage of this automated feature in Data Model Loader, reserved terms (icd9cm:2015, icd9pcs:2015, icd10cm:2024, and icd10pcs:2024) must be used as coding_name in the data dictionary. By default, ICD codings are hierarchically grouped. To disable hierarchy, the reserved term should be followed by :exclude_hierarchy, for example, icd9cm:2015:exclude_hierarchy.

Deliverable: The data_dictionary.csv file is similar in structure to a spreadsheet, where columns are defined in Table 1 in the Appendix below. Each row in the data dictionary file represents a field for which an entity, that is, "patient", "sample", or "encounter," will have a value.

All columns have to be present. However, they can be left empty if there are no values for the corresponding entity:field combination. The example below does not include all columns, just enough to illustrate the general format. A complete template can be provided on request.

Example data_dictionary.csv

entity

name

type

primary_key_type

coding_name

is_sparse_coding

referenced_entity_field

relationship

folder_path

subject

subject_id

string

global

subject

first_name

string

subject

last_name

string

subject

birth_year

date

encounter

encounter_id

string

local

encounter

subject_id

string

subject:subject_id

many_to_one

encounter

date

date

no

Encounter

encounter

height

float

no

Encounter>Measurements

encounter

weight

integer

no

Encounter>Measurements

encounter

sys_blood_pressure

integer

no

Encounter>Measurements

Entity Dictionary

Description: The optional entity dictionary file describes entities that the data file represent (*.data.csv). The entity dictionary lists the metadata for each input data.csv. This metadata includes human-readable entity title, singular and plural labels, and a description. The information supplied in the entity dictionary provides additional contextualized information for the user.

Deliverable: The entity_dictionary.csv file is similar in structure to a spreadsheet, where columns are defined in Table 3 in the Appendix below. Each row in the entity dictionary file represents an entity for which data file and entries in the entity_dictionary.csv exist.

All columns have to be present, however they can be left empty if there are no values for the corresponding entity:field combination. The example below does not include all columns, just enough to illustrate the general format. A complete template can be provided on request.

Example entity_dictionary.csv

entity

entity_title

entity_label_singular

entity_label_plural

entity_description

patient

Patient

Patient Record

Patient Records

The unique record for each patient participating in the clinical trial.

clinical

Clinicals

Clinically supplied data at the patient level.

encounter

Encounters

Encounter

Encounters

lab_test

Laboratory Tests

Lab Test

Lab Tests

Laboratory tests ordered during a patient's hospital visit

Data Files

Description: Phenotypic data must be supplied in a CSV file.

Deliverable: The data.csv file is described in the data dictionary with its name. The data.csv file must be a flattened data structure (a table) with a header, columns, and rows. The values must be of the following listed types: date, datetime, integer, float, or string.

Example subject data.csv

subject_id

first_name

last_name

birth_year

1

Mom

Family

1985

2

Dad

Family

1984

3

Baby

Family

2016

Codings

Description: The codings.csv file is a single file that contains information for coded values present in all the *.data.csv files. Often a meaning is designed for human consumption and codes are designed to optimize data storage and retrieval. For example, instead of storing the value "Acquired pure red cell aplasia erythroblastopenia" multiple times in a database, it is easier to code this as the shortened ICD10 code, D60. Data are stored as codes in the encoded fields. When retrieving a coded value, such as the ICD10 code D60, from a database and displaying the value for consumption, the system converts this back to the full meaning, "Acquired pure red cell aplasia erythroblastopenia," so that the information is intelligible to the users.

Hierarchical Data: The codings file supports specification of a hierarchy. For example, ICD10 codes, A00 to B99, all point to the category, "Certain infectious and parasitic diseases". The code, A31, points to "Other sepsis" and codes, A31.X, point to specific types of sepsis. Such a hierarchical structure can be defined in the codings.csv in the parent_code column (see example below).

Deliverable: The codings.csv file must have columns as defined in Table 2 of the Appendix. Each categorical field in your data will need to have an associated entry in the codings.csv file. The same coding_name value must be used in the respective data_dictionary.csv field to match each field with its respective codes.

Example codings.csv file

coding_name

code

meaning

parent_code

display_order

blood_pressure

-1

Did not measure

1

blood_pressure

-2

Measurement failed

2

t_f_na

1

True

t_f_na

2

False

t_f_na

3

Not Applicable

reason_discontinued_treatment

1

Pain subsided

reason_discontinued_treatment

2

Lost medication

reason_discontinued_treatment

3

Forgot

allergies

a

Environment

1

allergies

b

Food

2

allergies

c

Medicine

3

Appendix

Table 1: Data Dictionary File Description (data_dictionary.csv)

Column
Possible values
Description

entity

^[a-zA-Z][a-zA-Z_0-9]*$

Required: The name of the entity associated with the current field.

name

^[a-zA-Z][a-zA-Z_0-9]*$

Required: The name of the current field. No two fields, within an entity, may have the same name.

type

date, datetime, integer, float, string

Required: The type of the raw data as given to the system. Type date is a timepoint with day granularity.

coding_name

[\w+-]+

This column denotes whether the current field is codified. For example, a gender field could have type set to integer.

description

Any character except ^[*'|"]*$ (or left empty)

A human readable description of the field that will be displayed to users in the Cohort Browser.

folder_path

^[a-zA-Z_0-9][>]*$

Fields may be displayed in user defined groups, with the strict purpose of visualizing data in a browser. This field organizes the display hierarchy in the Cohort Browser.

is_multi_select

yes (or left empty)

If you are using a coding_name and the field is multi-select, then enter yes in this column.

is_sparse_coding

yes (or left empty)

Set this field to the value yes if the coding_name field is sparsely codified. This indicates that the field has many possible values but only some are relevant.

linkout

Compliant URL (or left empty)

A URL for linking to additional information.

primary_key_type

global, local

This column denotes whether the current field is a primary key for the entity, and whether that entity is the main entity in the dataset.

referenced_entity_field

^entity_name:field_name$ (or left empty)

If the field is a foreign key that references a field in another entity, specify the entity and field being referenced.

relationship

one_to_one, many_to_one (or left empty)

This describes the relationship between two entities (tables). Used with referenced_entity_field to define the relationship type.

title

^[a-zA-Z_0-9][>]*$

A human readable title for name (column name) which will be displayed to users in the Cohort Browser.

units

^[a-zA-Z][a-zA-Z_0-9]*$

For the end user to understand the data, it is necessary to specify the unit of measurement.

* A value is required unless specified as "or left empty."

Table 2: Codings File Description (codings.csv)

Column
Possible values*
Description

coding_name

/r"[\w+-]+"/ (max length 256 characters)

The name of the set of codes referenced when a field is codified. This column cannot be empty.

code

Unique for a specific coding_name

Specifies the code for the respective code:meaning pair.

meaning

Any non-empty string (max length 2000 characters)

Specifies the value each code should take on, for a given grouping of codes, grouped by the coding_name field.

parent_code

An existing code in the coding_name set

The hierarchical structure to a coding group may be maintained by specifying parent codes within the grouping.

display_order

A non-zero, non-negative integer or left empty

The order for how values are displayed may be retained by directly specifying the order. Use this to control the order of values list-type charts (non-hierarchical) and bar graphs.

Table 3: Entity Dictionary File Description (entity_dictionary.csv)

Column
Description
Details

entity

A machine-readable entity name. Values here must be reflected in the main data dictionary.

No two entities may have the same name. Entity names need not be descriptive, however it is recommended they are descriptive and reflect the underlying data.

name

A human-readable field name, which will be displayed to users in the Cohort Browser.

No two fields within an entity may have the same name.

type

The type of raw data as given to the system, such as date, datetime, integer, float, or string.

There may be one and only one type for each entity:field combination. Type date is a timepoint with day granularity in ISO 8601 date format, type datetime is a timepoint with millisecond granularity in ISO 8601 datetime format, type integer is an integer number with 64-bit signed integer precision, type float is a floating-point number with 64-bit (double) floating-point precision, and type string is a string.

coding_name

This column denotes whether the current field is codified and which coding is being used.

If this column is not empty, it must match the coding_name field in the codings.csv file unless leveraging automated pre-loaded ICD codings. In this case, reserved terms must be used and hierarchically grouped unless the :exclude_hierarchy term is appended.

description

A human-readable description of the field that will be displayed to users in the Cohort Browser.

A description is exposed to the user in the field tooltip on the Cohort Browser.

folder_path

This column specifies whether fields are displayed in user-defined groups or not.

Fields may be displayed in user defined groups, with the purpose of visualizing data in a browser. This then appears as a kind of table of contents for the data in the UI to make it easier to find fields.

is_multi_select

If you are using a coding_name and the field is multi-select, then enter yes in this column.

This is only applicable to codified fields and implies that there is a coding_name.

is_sparse_coding

If the field's coding_name field is sparsely codified, then enter yes in this column.

This denotes whether the current field's coding_name field is sparsely codified and depends on type.

linkout

A URL for linking to additional information.

primary_key_type

This column denotes whether the current field is an entity's primary key and if it is the main entity.

There must be exactly one global primary_key_type per dataset and if a field is a primary key, the values must be NOT NULL and UNIQUE.

referenced_entity_field

This column denotes whether the field is a foreign key that references another entity's data.

If this column is not blank for a field, then the relationship and referenced_entity_field must also contain a value. Every entity in the data dictionary should be referenced or reference another entity (all entities linked).

relationship

This describes the relationship between two entities. If many rows in one table map back to a column in another, then the relationship is many_to_one.

If this column is not blank for a field, then the referenced_entity_field and relationship must also contain a value.

title

A human-readable title for the field name, which will be displayed to users in the Cohort Browser.

This title is used in forming existential cohort filter statements and is available when selecting fields in the Cohort Browser.

units

For the end user to understand the data, it is necessary to specify the unit of measurement.

For example, for a temperature field the unit may be Celsius, Fahrenheit, or Kelvin and will display in the metadata for the field and any corresponding charts.

entity_title

A human-readable title for the value in the entity field, making sense as a singular format.

This should often be a single value that makes sense in the singular format and is descriptive of the entity. The title is used when a value equals 1 in the Cohort Browser.

entity_label_singular

A human-readable entity label in singular grammatical number, making sense as a single value.

This should often be the same value as specified in entity_title and is descriptive of the entity. The singular label is used when a value equals 1 in the Cohort Browser.

entity_label_plural

A human-readable entity label in plural grammatical number, making sense as plural values.

This should often be the plural version of entity_label_singular (or entity_title if singular is blank) and is descriptive of the entity. The plural label is used when a value not equal to 1 in the Cohort Browser.

entity_description

A human-readable description of the field, making sense as a plural value.

This can describe details about the entity, how it is tracked, any additional information a person may need to know about the entity. A description is exposed to the user in the entity tooltip on the Cohort Browser.

Last updated

Was this helpful?