Data Model Loader

Create datasets with your phenotypic, clinical, and other semi-structured data.


An Apollo license is required to use Data Model Loader. Org approval may also be required. Contact DNAnexus Sales for more information.

What is Data Model Loader

With Data Model Loader (DML), you can move your phenotypic, clinical, and other semi-structured data to the DNAnexus Platform as a dataset. This lets you bring your harmonized, cleaned, and prepared data, in its native logical structure, to the Platform, creating an object you can use with dataset-based modules such as the Cohort Browser, dxdata, GWAS pipelines, and more.

Here's what Data Model Loader does:

  • Reads the raw data along with a coding file and dictionary files.

  • Runs a series of validations to ensure proper structure is maintained.

  • Ingests the data into the Apollo database.

  • Creates a dataset representing the data, dictionaries, and codings provided.

Launch Data Model Loader

To launch the Data Model Loader application, enter this command in your terminal:

dx run data_model_loader_v2

Inputs

The Data Model Loader app requires the following inputs:

  • Data Dictionary CSV - Defines how the raw data fields relate to one another, along with metadata about each field, such as labels, descriptions, units of measure, linkouts, and more. This file allows DNAnexus to flexibly ingest different data models and data types. To leverage the automated, pre-loaded ICD codings, specify the ICD coding columns in your data files by using one of the reserved terms as the coding_name in the data dictionary. The reserved terms are: icd9cm:2015, icd9pcs:2015, icd10cm:2024, and icd10pcs:2024. By default, ICD codings are hierarchically grouped. To disable the hierarchy, follow the reserved term with :exclude_hierarchy, for example icd9cm:2015:exclude_hierarchy (see the sketch following this list).

  • Input Configuration

    • Output Dataset Name - Defines the name of the dataset to create based on the data dictionary and the ingested data in the database.

Optionally, you can specify the following inputs:

  • Data CSV(s) - Specifies the raw data CSVs. These files are required for any entities whose data has not yet been ingested. In most cases, the data CSVs are part of a single end-to-end run. For very complex data loading, data may be ingested incrementally, with one final run to create the dataset; in that final run, data CSVs are not required.

  • Coding CSV - Specifies the granular details for any categorical or coded raw values. This CSV is required for any ingestion where the Data Dictionary CSV contains coded fields. When using pre-loaded ICD codings (by including the reserved terms in the dictionary CSV), there is no need to define the meanings of ICD codes in the Coding CSV; these meanings are added automatically, drawing on ICD sources. You can, however, overwrite specific ICD codes by adding them to the Coding CSV, in which case the user-specified ICD codings and meanings take precedence over the pre-loaded common vocabularies.

  • Dashboard Record - Specifies a file to use as the default dashboard for the dataset. For v3.0 datasets, use the Dashboard View Record.

  • Mode Configuration JSON - Used for advanced ingestion to adjust the ingestion mode on a per-entity level. See the application documentation for more details.

  • Input Configuration

    • Database Name - Can be the name of an existing database if you want to create a dataset from already-loaded data or load data into an existing database; in that case, the Database ID input is also required. If you want to create a new database and load the data into it, provide a project-specific, unique database name. If not provided, a database name is assigned automatically.

    • Database ID - If you're using an existing database on the Platform, provide its ID. This is common if you are ingesting data into an existing database or creating a new dataset on top of previously ingested data. The ID has the form "database-" followed by an identifying alphanumeric string, and can be found in the info panel of the database you want to use.

    • Dataset Schema - A unique identifier used to tag datasets and cohorts together, representing a grouping of like records on the same core data. If not provided, the default value is used.

    • Dashboard Template - The dashboard configuration template used to generate the dashboard configuration JSON. The default setting, "Global Defaults", does not create a dashboard template, allowing the system to select the most appropriate one. If you would like to provide your own dashboard, select "Custom". Note that "Legacy Cohort Browser Dashboard" is only supported with single-entity datasets.

    • Maximum Columns Per Table - When ingesting an extra-wide data file, this option automatically distributes the data across multiple tables, each no wider than the maximum set, to ensure optimal performance. If configured, the value must be 400 or lower. The dataset automatically joins the tables into one logical entity, and the database tables are named <entity>-<generated split number>.

    • Skip Ingestion - If your data is already ingested and you only need to create a dataset, you can skip ingestion. Because most runs include both data ingestion and dataset creation, this is false by default.

    • Limit Spark Logs - Specifies whether to limit the Spark logs written to the job log. If true, the job log includes only critical Spark log events, and the full Spark logs are written to a dedicated spark_logs.log file in the Data Model Loader's dml_files folder in the project.

    • For details on more granular configuration, see the Data Model Loader app documentation.
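
As an illustrative sketch of a full end-to-end run, the command below supplies a data dictionary, coding file, raw data CSV, and output dataset name. The input names and the data dictionary column layout shown are assumptions for illustration only, not the app's verified specification; run dx run data_model_loader_v2 -h to see the actual input names.

# Hypothetical end-to-end run; the input names (data_dictionary_csv, coding_csv,
# data_csv, output_dataset_name) are illustrative assumptions.
dx run data_model_loader_v2 \
  -idata_dictionary_csv="/ingest/data_dictionary.csv" \
  -icoding_csv="/ingest/codings.csv" \
  -idata_csv="/ingest/patient.csv" \
  -ioutput_dataset_name="my_dataset"

# Hypothetical data dictionary rows (column names are illustrative) showing one field
# that uses the pre-loaded ICD-10-CM coding and one that disables its hierarchy:
#   entity,name,type,coding_name
#   patient,diagnosis_code,string,icd10cm:2024
#   patient,diagnosis_code_flat,string,icd10cm:2024:exclude_hierarchy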

Process

For a full end-to-end data load, the Data Model Loader executes the following process: it first validates the inputs, stages the data, transforms the staged data into finalized tables, and then creates a dataset record based on the app inputs.

Validation

The validation has 4 phases:

  1. Validation that the job inputs conform to expectations.

  2. Validation of the data dictionary, codings file, and entity dictionary as a model.

  3. Validation that the input data files and their columns match the dictionary modeled in phase 2.

  4. Validation of each data file against the defined dictionary.

If a phase fails validation, all errors for the phase are aggregated into the error file and the job fails.

Staging

During staging, the data CSV files are validated, transformed, and ingested into temporary staging tables. If errors occur in the coding or type casting of specific data, CSV files are generated in the 'Project:dml_files/<job-id>/error_files/' folder to highlight the specific rows where errors occurred.
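
For example (a sketch; <job-id> is a placeholder for the ID of your Data Model Loader job), the error files can be listed and retrieved with dx:

# List and download the per-row error CSVs generated during staging.
dx ls "dml_files/<job-id>/error_files/"
dx download -r "dml_files/<job-id>/error_files/"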

The following validations and transformations are performed during the staging phase.

  1. Empty field transformations.

  2. Data type transformations.

  3. If validation is enabled, the following validations and transformations are performed:

    1. Infinity check: After data type transformation, double values of positive or negative infinity are replaced with the maximum and minimum supported double values, respectively.

    2. Distinct array: Array fields are transformed to contain only distinct values.

    3. Valid code: If the field is coded, the value is checked to be a valid code.

    4. Primary key: The primary key is checked to be set.

    5. Generate error details: If any validation fails, an error_details column containing the cause of the failure is added; this column is included in the generated error files.

  4. Addition of helper columns for use throughout the system.

    1. These helper columns are added for hierarchical fields.

At the end of this stage, the staging tables (<entity-name>_staging) are created.

Finalizing

Once all the files are staged successfully, final tables are created. During this move, all data, even if it was ingested incrementally, is moved to the final tables (<entity_name>).

Dataset Creation

If all the final tables are created successfully, a dataset record is created in the project. The dataset is based on the data dictionary and the entity metadata (entity dictionary) files provided.

Outputs

  • Database - The ID of the database that was used.

  • Dataset - The dataset record created.

  • Dashboard View - If a dashboard was configured, a dashboard view will be created based on the dataset.

  • Logs - Available under 'Project:dml_files/<job-id>/logs/'.

    • data_loader.log - Logs of the process being run. This is a superset of the stdout and stderr and is useful when troubleshooting.

    • Spark cluster logs - for advanced troubleshooting.
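
As a quick check (a sketch; the dataset name is whatever you supplied as Output Dataset Name, and <job-id> is a placeholder), you can inspect the outputs from the command line:

# Describe the newly created dataset record.
dx describe my_dataset
# Retrieve the main process log written by this job.
dx download "dml_files/<job-id>/logs/data_loader.log"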

Best Practices

Because the Data Model Loader app loads data into the Apollo Database, certain terms are reserved and not allowed as entity names or field names. This is to ensure that the system can perform all of the expected functionality on the database created.

The reserved terms are as follows.

['all','alter','and','anti','any','array','as','at','authorization','between','both','builtin','by','case','cast','check','collate','column','commit','constraint','create','cross','cube','current','current_date','current_time','current_timestamp','current_user','default','delete','describe','distinct','drop','else','end','escape','except','exists','external','extract','false','fetch','filter','for','foreign','from','full','function','global','grant','group','grouping','having','in','information_schema','inner','insert','intersect','interval','into','is','join','lateral','leading','left','like','local','minus','natural','no','not','null','of','on','only','or','order','out','outer','overlaps','partition','position','primary','range','references','revoke','right','rollback','rollup','row','rows','select','semi','session','session_user','set','some','start','table','tablesample','then','time','to','trailing','true','truncate','union','unique','unknown','update','user','using','values','when','where','window','with']

Also, field names cannot start with current_*.

To ingest large, complex datasets, such as a main EHR dataset with hundreds of thousands of patients, engage with the DNAnexus Professional Services team for an optimal experience.

Instance Type - The default instance type is sufficient for loading most small to medium-size datasets. If your input files are large, consider using a more powerful instance type to ensure success.
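
A minimal sketch of overriding the default (the instance type shown is only an example of a larger instance; add the app inputs as usual):

# Illustrative only - choose an instance type suited to your input size.
dx run data_model_loader_v2 --instance-type mem2_ssd1_v2_x16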

Entity Dictionary CSV - Specifies the metadata of the logical entities in the dataset. Use this if you need to provide more information about the logical entities for use by end users. Values in this CSV override the filename in the Cohort Browser and when accessing entity metadata through dxdata.

The data dictionary, coding files, and data are validated as described in Data Files Used by the Data Model Loader.

For datasets that you plan to use in the Cohort Browser, we recommend specifying an entity dictionary for an ideal user experience. For tables with over 100 million rows, contact the DNAnexus Professional Services team for an optimized experience.
