Data Model Loader

Note: This is intended only for customers with a DNAnexus Apollo license and Org approval (if applicable). Contact [email protected] for more information.

dx run data_model_loader_v2 (use -h for help)

The Data Model Loader application is the primary mechanism by which phenotypic, clinical, and other semi-structured data is ingested into the platform and through which datasets are created. This application reads the raw data along with a coding file and dictionary files, runs a series of validations to ensure proper structure is maintained, ingests the data into the Spark database, and then creates a dataset representing the data, dictionaries, and codings provided.

The intent of this application is to allow users to bring harmonized, cleaned, prepared data in its native logical structure to the DNAnexus platform for use with dataset based modules (e.g. Cohort Browser, dxdata, GWAS pipelines...etc).

To ingest large complex datasets, engaging with xVantage is advised to ensure an optimal experience (e.g. attempting to ingest a Main EHR dataset with hundreds of thousands of patients).

Overview

Overview of all required and optional file inputs for the Data Model Loader App

Inputs

The Data Model Loader app as requires as an input:

  • Data Dictionary CSV - defines how the raw data relates to one another along with metadata information about each raw data field such as labels, descriptions, units of measure, linkouts and more. This file allows for DNAnexus to flexibly ingest a different data models and data types.

  • Input Configuration

    • Output Dataset Name - name of the dataset that will be created based on the data dictionary and the ingested data in the database

    • Database Name - a unique name of the database to create (if not created) and load data into or the database of the loaded data for which a dataset will be created

    • Dataset Schema - a unique identifier to tag datasets and cohorts together to represent a grouping of like records on the same core data

  • Instance Type - while a default should be sufficient for most small to medium datasets, if input files large, ensure that the instance is increased to help efficient complete the process.

Additional Optional Inputs are

  • Data CSV(s) - the raw data CSVs. These files are required for any entities where data is not yet ingested. In most cases for data loading. The data CSVs will be part of the end to end run. For very complex data loading, data may be incrementally ingested and one final run may be done to create a dataset therefore data CSVs are not required.

  • Coding CSV - the granular details for any categorical or coded raw values. The CSV is required for any ingestion where the Data Dictionary CSV contains any coded fields.

  • Entity Dictionary CSV - the metadata of the logical entities in the dataset. This CSV is only needed if more information is desired to be provided about the logical entities for use by end users. Values in this file are used to override using the filename in the Cohort Browser and when trying to access entity metadata through dxdata.

  • Dashboard Record - a file to set as the default dashboard for the dataset. For v3.0 datasets use the Dashboard View Record.

  • Mode Configuration JSON - used for advanced ingestion to adjust the ingestion mode on a per-entity level. See the application documentation for more details.

  • Input Configurations

    • Database ID - recommended to be filled in if you are using a Database that already exists on the platform. This is common if you are ingesting data into an existing database or creating a new dataset on top of previously ingested data. The input is usually "database-identifyingAlphaNumerString" and can be found in the info panel on the database you are attempting to use.

    • Maximum Columns Per Table - if an extra wide data file is being ingested, this setting allows for the system to automatically distribute data across a number of tables no wider than the maximum set to ensure optimal performance. This value should be 400 or lower if it is being configured. The dataset will automatically join the tables to create one logical entity and all database tables will be named <entity>-<generated split number>.

    • Skip Ingestion - by default this is False since most ingestions include data ingestion and dataset creation. Toggle this to True if data is already ingested and only dataset creation is needed.

    • See app documentation for further granular configurations

Process

For a full end to end data load the data model loader executes the following process that first validates the inputs, stages the data, transforms the staged data to finalized tables, then creates a dataset record based on the App inputs.

Validation

The data dictionary, coding files, and data are validated as per specification. If there are any validation failures, the application would immediately terminate with error messages available.

Staging

The process where the data CSV files are validated, transformed and ingested into temporary staging tables. If the errors are in the coding or type casting of specific data, CSV files are generated in 'Project:dml_files/<job-id>/error_files/' folder to highlight specific rows where errors occurred.

The following validations and transformations are performed during staging phase.

  1. Empty field transformations.

  2. Data type transformations.

  3. If validation is enabled the following validations/transformations are performed:

    1. Infinity check: After data type transformation, if the double values are negative or positive infinity, the values are substituted with max double value and min double value supported.

    2. Distinct array : Array fields are transformed to contain distinct values.

    3. Valid code : If the field is coded, check if the value is a valid code.

    4. Primary key : Check if primary key is set.

    5. Generate error details: If any of the validation fails, `error_details` column is added which contains the cause of failure which will be part of the error files generated.

  4. Add of helper columns for usage throughout the system.

    1. These helper columns are added for hierarchical fields.

At the end of this stage, the staging tables (<entity-name>_staging) are created.

Finalizing

Once all the files are staged successfully, final tables are created. During this move, all data, even if it was ingested incrementally, is moved to final tables (<entity_name).

Dataset Creation

If all the final tables are created successfully, dataset record will be created in the project. The dataset is created based on the data dictionary and the entity metadata (entity dictionary) files that provided.

Outputs

  • Database - the id of the database used is returned. In the UI this is shown as the database object.

  • Dataset - The dataset record created.

  • Dashboard View - If a dashboard was configured, a dashboard view will be created based on the dataset.

  • Logs - Available under 'Project: dml_files/<job-id>/logs/'.

    • data_loader.log - Logs on the process being run. This is a superset of the stdout and stderr and is useful when troubleshooting.

    • Spark cluster logs - for advanced troubleshooting.