Data Model Loader

Learn to use the Data Model Loader, the primary mechanism for moving phenotypic, clinical, and other semi-structured data to the Platform and for creating datasets from that data.

An Apollo license is required to use Data Model Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

The Data Model Loader application reads the raw data along with a coding file and dictionary files, runs a series of validations to ensure proper structure is maintained, ingests the data into the Apollo database, and then creates a dataset representing the data, dictionaries, and codings provided.

The intent of this application is to allow users to bring harmonized, cleaned, and prepared data in its native logical structure to the DNAnexus Platform for use with dataset-based modules (e.g. Cohort Browser, dxdata, GWAS pipelines, etc.).

To ingest large, complex datasets (e.g. a Main EHR dataset with hundreds of thousands of patients), engaging with xVantage is advised to ensure an optimal experience.

Using the Data Model Loader App

Launching the App

To launch the Data Model Loader application, enter this command via the command line:

dx run data_model_loader_v2
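
To review the app's exact input names and default values before launching it, print the app's help text:

dx run data_model_loader_v2 -h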

Inputs

The Data Model Loader app requires the following inputs (a minimal example command follows this list):

  • Data Dictionary CSV - defines how the raw data fields relate to one another, along with metadata about each field such as labels, descriptions, units of measure, linkouts, and more. This file allows DNAnexus to flexibly ingest different data models and data types.

  • Input Configuration

    • Output Dataset Name - name of the dataset that will be created based on the data dictionary and the ingested data in the database.

    • Database Name - this can be an existing database name if the user wishes to create a dataset using the loaded data or load data into an existing database (note that the Database ID input is then also required). If the user wants the app to create a new database and load the data into it, the user needs to provide a project-specific, unique database name.

    • Dataset Schema - a unique identifier used to tag datasets and cohorts together, representing a grouping of like records built on the same core data.

  • Instance Type - while the default should be sufficient for most small to medium datasets, if the input files are large, increase the instance size to help complete the process efficiently.
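
As a sketch, a minimal run providing the required inputs might look like the command below; for a full end-to-end load you would also supply the Data CSV(s) described under the optional inputs. The input names shown here are illustrative, so confirm the exact names with dx run data_model_loader_v2 -h.

dx run data_model_loader_v2 \
  -idata_dictionary=data_dictionary.csv \
  -ioutput_dataset_name=my_dataset \
  -idatabase_name=my_project_database \
  -idataset_schema=my_schema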

Additional optional inputs are listed below (an example run using several of these options follows the list):

  • Data CSV(s) - the raw data CSVs. These files are required for any entities whose data is not yet ingested. In most cases, the data CSVs are part of the end-to-end run. For very complex data loading, data may be ingested incrementally and one final run may be done to create the dataset, in which case data CSVs are not required.

  • Coding CSV - the granular details for any categorical or coded raw values. This CSV is required for any ingestion where the Data Dictionary CSV contains coded fields.

  • Entity Dictionary CSV - the metadata for the logical entities in the dataset. This CSV is only needed if additional information about the logical entities should be provided for end users. Values in this file override the filename that would otherwise be used in the Cohort Browser and when accessing entity metadata through dxdata.

  • Dashboard Record - a file to set as the default dashboard for the dataset. For v3.0 datasets, use the Dashboard View Record.

  • Mode Configuration JSON - used for advanced ingestion to adjust the ingestion mode on a per-entity level. See the application documentation for more details.

  • Input Configurations

    • Database ID - recommended to be filled in if you are using a database that already exists on the Platform. This is common if you are ingesting data into an existing database or creating a new dataset on top of previously ingested data. The value usually has the form "database-<identifying alphanumeric string>" and can be found in the info panel of the database you are attempting to use.

    • Dashboard Template - the dashboard configuration template used to generate the dashboard configuration JSON. The default setting, “Global Defaults”, does not create a dashboard template, allowing the system to select the most appropriate one. If you would like to provide your own dashboard, select “Custom”. Note that “Legacy Cohort Browser Dashboard” is only supported with single-entity datasets.

    • Maximum Columns Per Table - if an extra-wide data file is being ingested, this setting allows the system to automatically distribute the data across a number of tables no wider than the configured maximum, ensuring optimal performance. If configured, this value should be 400 or lower. For example, a 1,000-column entity with a maximum of 400 would be split across three tables. The dataset will automatically join the tables to create one logical entity, and all database tables will be named <entity>-<generated split number>.

    • Skip Ingestion - by default this is False since most ingestions include data ingestion and dataset creation. Toggle this to True if data is already ingested and only dataset creation is needed.

    • See app documentation for further granular configurations
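
As a sketch, a dataset-only run on top of previously ingested data might combine the required inputs with the Database ID and Skip Ingestion options, as shown below. The input names are again illustrative; confirm them with dx run data_model_loader_v2 -h.

dx run data_model_loader_v2 \
  -idata_dictionary=data_dictionary.csv \
  -ioutput_dataset_name=my_dataset \
  -idatabase_name=my_project_database \
  -idatabase_id=database-xxxx \
  -idataset_schema=my_schema \
  -iskip_ingestion=true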

Process

For a full end-to-end data load, the Data Model Loader executes the following process: it first validates the inputs, stages the data, transforms the staged data into finalized tables, and then creates a dataset record based on the app inputs.

Validation

The data dictionary, coding files, and data are validated as per the specifications here. Validation is performed in 4 phases:

  1. Validation that the job inputs conform to expectations.

  2. Validation of the data dictionary, codings file, and entity dictionary as a model.

  3. Validation that the input data files and their columns match the modeled dictionary.

  4. Validation of each data file against the dictionary defined.

If a phase fails validation, all errors for the phase are aggregated to the error file and the job fails.
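
To see which validation phase a job failed in while it is running, the job's output can be streamed from the command line (substituting the actual job ID):

dx watch job-xxxx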

Staging

Staging is the process in which the data CSV files are validated, transformed, and ingested into temporary staging tables. If errors occur in the coding or type casting of specific data, CSV files are generated in the 'Project:dml_files/<job-id>/error_files/' folder to highlight the specific rows where errors occurred (see the example after the list below).

The following validations and transformations are performed during the staging phase:

  1. Empty field transformations.

  2. Data type transformations.

  3. If validation is enabled, the following validations/transformations are performed:

    1. Infinity check: After data type transformation, if a double value is positive or negative infinity, it is substituted with the maximum or minimum supported double value, respectively.

    2. Distinct array: Array fields are transformed to contain distinct values.

    3. Valid code: If the field is coded, check if the value is a valid code.

    4. Primary key: Check that the primary key is set.

    5. Generate error details: If any of the validations fail, an `error_details` column is added containing the cause of the failure; this column is included in the generated error files.

  4. Addition of helper columns for use throughout the system.

    1. These helper columns are added for hierarchical fields.

At the end of this stage, the staging tables (<entity-name>_staging) are created.
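
If a staging run reports row-level errors, the generated error files can be listed directly from the folder mentioned above (run within the project, substituting the actual job ID):

dx ls "dml_files/<job-id>/error_files/"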

Finalizing

Once all the files are staged successfully, the final tables are created. During this move, all data, even if it was ingested incrementally, is moved to the final tables (<entity_name>).

Dataset Creation

If all the final tables are created successfully, a dataset record is created in the project. The dataset is created based on the data dictionary and the entity metadata (entity dictionary) files that were provided.
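
Once the run completes, the new dataset record can be inspected from the command line using the Output Dataset Name that was provided (a hypothetical name is shown here):

dx describe my_dataset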

Outputs

  • Database - the ID of the database used is returned.

  • Dataset - The dataset record created.

  • Dashboard View - If a dashboard was configured, a dashboard view will be created based on the dataset.

  • Logs - Available under 'Project:dml_files/<job-id>/logs/'.

    • data_loader.log - Logs on the process being run. This is a superset of the stdout and stderr and is useful when troubleshooting (see the download example after this list).

    • Spark cluster logs - for advanced troubleshooting.
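
For example, the main process log can be downloaded for offline troubleshooting (substituting the actual job ID):

dx download "dml_files/<job-id>/logs/data_loader.log"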

Best Practices

Since the Data Model Loader app loads data into the Apollo Database, certain terms are reserved words and not allowed as entity names or field names. This is to ensure that the system can perform all of the expected functionality on the database created. The reserved terms are as follows.

Entities and fields cannot be named:

['all','alter','and','anti','any','array','as','at','authorization','between','both','builtin','by','case','cast','check','collate','column','commit','constraint','create','cross','cube','current','current_date','current_time','current_timestamp','current_user','default','delete','describe','distinct','drop','else','end','escape','except','exists','external','extract','false','fetch','filter','for','foreign','from','full','function','global','grant','group','grouping','having','in','information_schema','inner','insert','intersect','interval','into','is','join','lateral','leading','left','like','local','minus','natural','no','not','null','of','on','only','or','order','out','outer','overlaps','partition','position','primary','range','references','revoke','right','rollback','rollup','row','rows','select','semi','session','session_user','set','some','start','table','tablesample','then','time','to','trailing','true','truncate','union','unique','unknown','update','user','using','values','when','where','window','with']

Field names cannot start with current_*.
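
A simple way to screen proposed names before ingestion is to keep the reserved terms in a plain text file (one per line) and compare a list of proposed entity and field names against it. The file names below are hypothetical:

# reserved_words.txt holds the reserved terms above, one per line;
# field_names.txt holds the proposed entity and field names, one per line.
grep -i -x -f reserved_words.txt field_names.txt
# Also flag field names that start with current_
grep -i '^current_' field_names.txt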

An entity dictionary is recommended for datasets that will be used in the Cohort Browser to enable the best user experience. For tables with over 100 million rows, please work with xVantage Services for an optimized experience.
