Data Model Loader

Learn to use the Data Model Loader, the primary mechanism for moving phenotypic, clinical, and other semi-structured data to the Platform and for creating datasets from that data.

An Apollo license is required to use Data Model Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

The Data Model Loader application reads the raw data along with a coding file and dictionary files, runs a series of validations to ensure proper structure is maintained, ingests the data into the Apollo database, and then creates a dataset representing the data, dictionaries, and codings provided.

The intent of this application is to allow users to bring harmonized, cleaned, and prepared data in its native logical structure to the DNAnexus Platform for use with dataset-based modules (e.g. Cohort Browser, dxdata, GWAS pipelines, etc.).

To ingest large, complex datasets (e.g. a Main EHR dataset with hundreds of thousands of patients), engaging with xVantage is advised to ensure an optimal experience.

Using the Data Model Loader App

Launching the App

To launch the Data Model Loader application, enter this command via the command line:

dx run data_model_loader_v2
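
To review the app's exact input names and default values before launching it, print the app's help text:

dx run data_model_loader_v2 -h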

Inputs

The Data Model Loader app requires the following inputs (a minimal example command follows this list):

  • Data Dictionary CSV - defines how the raw data fields relate to one another, along with metadata about each field such as labels, descriptions, units of measure, linkouts, and more. This file allows DNAnexus to flexibly ingest different data models and data types.

  • Input Configuration

    • Output Dataset Name - name of the dataset that will be created based on the data dictionary and the ingested data in the database.

    • Database Name - this can be an existing database name if the user wishes to create a dataset using the loaded data or load data into an existing database (note that the Database ID input is then also required). If the user wants the app to create a new database and load the data into it, the user needs to provide a project-specific, unique database name.

    • Dataset Schema - a unique identifier used to tag datasets and cohorts together, representing a grouping of like records built on the same core data.

  • Instance Type - while the default should be sufficient for most small to medium datasets, if the input files are large, increase the instance size to help complete the process efficiently.
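
As a sketch, a minimal run providing the required inputs might look like the command below; for a full end-to-end load you would also supply the Data CSV(s) described under the optional inputs. The input names shown here are illustrative, so confirm the exact names with dx run data_model_loader_v2 -h.

dx run data_model_loader_v2 \
  -idata_dictionary=data_dictionary.csv \
  -ioutput_dataset_name=my_dataset \
  -idatabase_name=my_project_database \
  -idataset_schema=my_schema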

Additional optional inputs are listed below (an example run using several of these options follows the list):

  • Data CSV(s) - the raw data CSVs. These files are required for any entities whose data is not yet ingested. In most cases, the data CSVs are part of the end-to-end run. For very complex data loading, data may be ingested incrementally and one final run may be done to create the dataset, in which case data CSVs are not required.

  • Coding CSV - the granular details for any categorical or coded raw values. This CSV is required for any ingestion where the Data Dictionary CSV contains coded fields.

  • Entity Dictionary CSV - the metadata for the logical entities in the dataset. This CSV is only needed if additional information about the logical entities should be provided for end users. Values in this file override the filename that would otherwise be used in the Cohort Browser and when accessing entity metadata through dxdata.

  • Dashboard Record - a file to set as the default dashboard for the dataset. For v3.0 datasets, use the Dashboard View Record.

  • Mode Configuration JSON - used for advanced ingestion to adjust the ingestion mode on a per-entity level. See the application documentation for more details.

  • Input Configurations

    • Database ID - recommended to be filled in if you are using a database that already exists on the Platform. This is common if you are ingesting data into an existing database or creating a new dataset on top of previously ingested data. The value usually has the form "database-<identifying alphanumeric string>" and can be found in the info panel of the database you are attempting to use.

    • Dashboard Template - the dashboard configuration template used to generate the dashboard configuration JSON. The default setting, “Global Defaults”, does not create a dashboard template, allowing the system to select the most appropriate one. If you would like to provide your own dashboard, select “Custom”. Note that “Legacy Cohort Browser Dashboard” is only supported with single-entity datasets.

    • Maximum Columns Per Table - if an extra-wide data file is being ingested, this setting allows the system to automatically distribute the data across a number of tables no wider than the configured maximum, ensuring optimal performance. If configured, this value should be 400 or lower. For example, a 1,000-column entity with a maximum of 400 would be split across three tables. The dataset will automatically join the tables to create one logical entity, and all database tables will be named <entity>-<generated split number>.

    • Skip Ingestion - by default this is False since most ingestions include data ingestion and dataset creation. Toggle this to True if data is already ingested and only dataset creation is needed.

    • See app documentation for further granular configurations
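
As a sketch, a dataset-only run on top of previously ingested data might combine the required inputs with the Database ID and Skip Ingestion options, as shown below. The input names are again illustrative; confirm them with dx run data_model_loader_v2 -h.

dx run data_model_loader_v2 \
  -idata_dictionary=data_dictionary.csv \
  -ioutput_dataset_name=my_dataset \
  -idatabase_name=my_project_database \
  -idatabase_id=database-xxxx \
  -idataset_schema=my_schema \
  -iskip_ingestion=true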

Process

For a full end-to-end data load, the Data Model Loader executes the following process: it first validates the inputs, stages the data, transforms the staged data into finalized tables, and then creates a dataset record based on the app inputs.

Validation

The data dictionary, coding files, and data are validated as per the specifications here. Validation is performed in 4 phases:

  1. Validation that the job inputs conform to expectations.

  2. Validation of the data dictionary, codings file, and entity dictionary as a model.

  3. Validation that the input data files and their columns match the modeled dictionary.

  4. Validation of each data file against the dictionary defined.

If a phase fails validation, all errors for the phase are aggregated to the error file and the job fails.
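
To see which validation phase a job failed in while it is running, the job's output can be streamed from the command line (substituting the actual job ID):

dx watch job-xxxx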

Staging

Staging is the process in which the data CSV files are validated, transformed, and ingested into temporary staging tables. If errors occur in the coding or type casting of specific data, CSV files are generated in the 'Project:dml_files/<job-id>/error_files/' folder to highlight the specific rows where errors occurred (see the example after the list below).

The following validations and transformations are performed during the staging phase:

  1. Empty field transformations.

  2. Data type transformations.

  3. If validation is enabled, the following validations/transformations are performed:

    1. Infinity check: After data type transformation, if a double value is positive or negative infinity, it is substituted with the maximum or minimum supported double value, respectively.

    2. Distinct array: Array fields are transformed to contain distinct values.

    3. Valid code: If the field is coded, check if the value is a valid code.

    4. Primary key: Check that the primary key is set.

    5. Generate error details: If any of the validations fail, an `error_details` column is added containing the cause of the failure; this column is included in the generated error files.

  4. Addition of helper columns for use throughout the system.

    1. These helper columns are added for hierarchical fields.

At the end of this stage, the staging tables (<entity-name>_staging) are created.
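
If a staging run reports row-level errors, the generated error files can be listed directly from the folder mentioned above (run within the project, substituting the actual job ID):

dx ls "dml_files/<job-id>/error_files/"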

Finalizing

Once all the files are staged successfully, the final tables are created. During this move, all data, even if it was ingested incrementally, is moved to the final tables (<entity_name>).

Dataset Creation

If all the final tables are created successfully, a dataset record is created in the project. The dataset is created based on the data dictionary and the entity metadata (entity dictionary) files that were provided.
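
Once the run completes, the new dataset record can be inspected from the command line using the Output Dataset Name that was provided (a hypothetical name is shown here):

dx describe my_dataset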

Outputs

  • Database - the ID of the database used is returned.

  • Dataset - The dataset record created.

  • Dashboard View - If a dashboard was configured, a dashboard view will be created based on the dataset.

  • Logs - Available under 'Project:dml_files/<job-id>/logs/'.

    • data_loader.log - Logs on the process being run. This is a superset of the stdout and stderr and is useful when troubleshooting (see the download example after this list).

    • Spark cluster logs - for advanced troubleshooting.
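
For example, the main process log can be downloaded for offline troubleshooting (substituting the actual job ID):

dx download "dml_files/<job-id>/logs/data_loader.log"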

Best Practices

Since the Data Model Loader app loads data into the Apollo Database, certain terms are reserved words and not allowed as entity names or field names. This is to ensure that the system can perform all of the expected functionality on the database created. The reserved terms are as follows.

Entities and fields cannot be named:

['all','alter','and','anti','any','array','as','at','authorization','between','both','builtin','by','case','cast','check','collate','column','commit','constraint','create','cross','cube','current','current_date','current_time','current_timestamp','current_user','default','delete','describe','distinct','drop','else','end','escape','except','exists','external','extract','false','fetch','filter','for','foreign','from','full','function','global','grant','group','grouping','having','in','information_schema','inner','insert','intersect','interval','into','is','join','lateral','leading','left','like','local','minus','natural','no','not','null','of','on','only','or','order','out','outer','overlaps','partition','position','primary','range','references','revoke','right','rollback','rollup','row','rows','select','semi','session','session_user','set','some','start','table','tablesample','then','time','to','trailing','true','truncate','union','unique','unknown','update','user','using','values','when','where','window','with']

Field names cannot start with current_*.
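
A simple way to screen proposed names before ingestion is to keep the reserved terms in a plain text file (one per line) and compare a list of proposed entity and field names against it. The file names below are hypothetical:

# reserved_words.txt holds the reserved terms above, one per line;
# field_names.txt holds the proposed entity and field names, one per line.
grep -i -x -f reserved_words.txt field_names.txt
# Also flag field names that start with current_
grep -i '^current_' field_names.txt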

An entity dictionary is recommended for datasets that will be used in the Cohort Browser to enable the best user experience. For tables with over 100 million rows, please work with xVantage Services for an optimized experience.
