DNAnexus Documentation

Copyright 2025 DNAnexus


Data Model Loader

Learn to use the Data Model Loader, the primary mechanism for moving phenotypic, clinical, and other semi-structured data to the Platform and for creating datasets from that data.

Last updated 1 year ago


An Apollo license is required to use the Data Model Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

This application allows users to bring harmonized, cleaned, and prepared data, in its native logical structure, to the DNAnexus Platform for use with dataset-based modules (e.g. the Cohort Browser, dxdata, GWAS pipelines, etc.).

Using the Data Model Loader App

Launching the App

To launch the Data Model Loader application, enter this command via the command line:

dx run data_model_loader_v2
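The input labels described below map to underscore-style input names on the command line. To see the app's exact input names, defaults, and usage before running (standard `dx run` behavior):

```shell
# Print the app's inputs, defaults, and usage information.
dx run data_model_loader_v2 --help
```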

Inputs

The Data Model Loader app requires the following inputs:

  • Data Dictionary CSV - defines how the raw data fields relate to one another, along with metadata about each field, such as labels, descriptions, units of measure, linkouts, and more. This file allows DNAnexus to flexibly ingest different data models and data types.

  • Input Configuration

    • Output Dataset Name - name of the dataset that will be created based on the data dictionary and the ingested data in the database.

    • Database Name - this can be the name of an existing database if the user wishes to create a dataset from previously loaded data or to load data into an existing database (note that the Database ID input is then also required). If the user wants the app to create a new database and load the data into it, the user must provide a project-specific, unique database name.

    • Dataset Schema - a unique identifier to tag datasets and cohorts together to represent a grouping of like records on the same core data.

  • Instance Type - while the default should be sufficient for most small to medium datasets, if the input files are large, increase the instance type so that the process completes efficiently.
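To illustrate the shape of a Data Dictionary CSV, the sketch below builds a minimal single-entity dictionary and sanity-checks it locally. The column names used here (entity, name, type, primary_key_type, coding_name, title, units) are illustrative assumptions only; see Data Files Used by the Data Model Loader for the authoritative column set.

```shell
# Hypothetical minimal data dictionary for one "patient" entity.
# Column names are illustrative -- check the official file specification.
cat > data_dictionary.csv <<'EOF'
entity,name,type,primary_key_type,coding_name,title,units
patient,pid,string,global,,Patient ID,
patient,age,integer,,,Age at enrollment,years
patient,smoking_status,string,,smoking_codes,Smoking status,
EOF

# Local sanity check before upload: every row has the same column count.
awk -F',' 'NR==1{n=NF} NF!=n{print "bad row: " $0; exit 1}' data_dictionary.csv
```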

Additional Optional Inputs are:

  • Data CSV(s) - the raw data CSVs. These files are required for any entities whose data is not yet ingested. In most cases, the data CSVs are part of a single end-to-end run. For very complex data loading, data may instead be ingested incrementally, with one final run to create the dataset; in that case the data CSVs are not required.

  • Coding CSV - the granular details for any categorical or coded raw values. This CSV is required whenever the Data Dictionary CSV contains coded fields.

  • Entity Dictionary CSV - the metadata for the logical entities in the dataset. This CSV is only needed if you want to provide additional information about the logical entities for end users. Values in this file override the filename as the entity label in the Cohort Browser and when accessing entity metadata through dxdata.

  • Dashboard Record - a file to set as the default dashboard for the dataset. For v3.0 datasets, use the Dashboard View Record.

  • Mode Configuration JSON - used for advanced ingestion to adjust the ingestion mode on a per-entity level. See the application documentation for more details.

  • Input Configurations

    • Database ID - recommended when using a database that already exists on the Platform. This is common if you are ingesting data into an existing database or creating a new dataset on top of previously ingested data. The input has the form "database-identifyingAlphaNumericString" and can be found in the info panel of the database you are attempting to use.

    • Dashboard Template - the dashboard configuration template used to generate the dashboard configuration JSON. The default setting, "Global Defaults", does not create a dashboard template, allowing the system to select the most appropriate one. If you would like to provide your own dashboard, select "Custom". Note that "Legacy Cohort Browser Dashboard" is only supported with single-entity datasets.

    • Maximum Columns Per Table - if an extra-wide data file is being ingested, this setting allows the system to automatically distribute the data across multiple tables, each no wider than the maximum, to ensure optimal performance. If configured, this value should be 400 or lower. The dataset will automatically join the tables into one logical entity, and the database tables will be named <entity>-<generated split number>.

    • Skip Ingestion - False by default, since most runs include both data ingestion and dataset creation. Set this to True if the data is already ingested and only dataset creation is needed.

    • See the app documentation for further granular configurations.
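Putting the inputs together, a hypothetical end-to-end invocation might look like the following. The `-i` input names and `file-xxxx` IDs are illustrative placeholders, not the app's confirmed input names; verify them with `dx run data_model_loader_v2 --help`.

```shell
# Hypothetical end-to-end run: ingest data and create a dataset.
# All -i input names and file IDs below are illustrative placeholders.
dx run data_model_loader_v2 \
  -idata_dictionary=file-xxxx \
  -idata_csv=file-yyyy \
  -icoding_csv=file-zzzz \
  -ioutput_dataset_name=my_study_dataset \
  -idatabase_name=my_study_db \
  -y --brief   # -y: launch without prompting; --brief: print only the job ID
```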

Process

For a full end-to-end data load, the Data Model Loader executes the following process: it first validates the inputs, stages the data, transforms the staged data into finalized tables, and then creates a dataset record based on the app inputs.

Validation

  1. Validation that the job inputs conform to expectations.

  2. Validation of the data dictionary, codings file, and entity dictionary as a model.

  3. Validation that the input data files and columns match the dictionary model validated in the previous phase.

  4. Validation of each data file against the dictionary defined.

If a phase fails validation, all errors for that phase are aggregated into the error file and the job fails.

Staging

Staging is the process where the data CSV files are validated, transformed, and ingested into temporary staging tables. If errors occur in the coding or type casting of specific data, CSV files highlighting the specific rows where errors occurred are generated in the 'Project:dml_files/<job-id>/error_files/' folder.
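If a staging failure occurs, the generated error CSVs can be listed and retrieved with standard `dx` commands (the job ID below is a placeholder):

```shell
# List and download the per-row error CSVs for a failed load.
dx ls "dml_files/job-xxxx/error_files/"
dx download -r "dml_files/job-xxxx/error_files/" -o ./dml_errors/
```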

The following validations and transformations are performed during the staging phase:

  1. Empty field transformations.

  2. Data type transformations.

  3. If validation is enabled the following validations/transformations are performed:

    1. Infinity check: after data type transformation, double values of negative or positive infinity are substituted with the minimum or maximum supported double value, respectively.

    2. Distinct array: array fields are transformed to contain only distinct values.

    3. Valid code: if the field is coded, check that the value is a valid code.

    4. Primary key: check that the primary key is set.

    5. Generate error details: if any validation fails, an `error_details` column containing the cause of the failure is added; this column is included in the generated error files.

  4. Addition of helper columns for use throughout the system.

    1. These helper columns are added for hierarchical fields.

At the end of this stage, the staging tables (<entity-name>_staging) are created.

Finalizing

Once all the files are staged successfully, final tables are created. During this move, all data, even if it was ingested incrementally, is moved to the final tables (<entity_name>).

Dataset Creation

If all the final tables are created successfully, a dataset record is created in the project. The dataset is based on the data dictionary and the entity metadata (entity dictionary) files that were provided.

Outputs

  • Database - the ID of the database used is returned.

  • Dataset - The dataset record created.

  • Dashboard View - If a dashboard was configured, a dashboard view will be created based on the dataset.

  • Logs - Available under 'Project:dml_files/<job-id>/logs/'.

    • data_loader.log - logs of the process being run. This is a superset of stdout and stderr and is useful when troubleshooting.

    • Spark cluster logs - for advanced troubleshooting.
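For example, the main loader log can be streamed without downloading it first (the job ID below is a placeholder):

```shell
# Stream the loader log and show the last 50 lines.
dx cat "dml_files/job-xxxx/logs/data_loader.log" | tail -n 50
```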

Best Practices

Since the Data Model Loader app loads data into the Apollo Database, certain terms are reserved words and not allowed as entity names or field names. This is to ensure that the system can perform all of the expected functionality on the database created. The reserved terms are as follows.

Entities and fields cannot be named:

['all','alter','and','anti','any','array','as','at','authorization','between','both','builtin','by','case','cast','check','collate','column','commit','constraint','create','cross','cube','current','current_date','current_time','current_timestamp','current_user','default','delete','describe','distinct','drop','else','end','escape','except','exists','external','extract','false','fetch','filter','for','foreign','from','full','function','global','grant','group','grouping','having','in','information_schema','inner','insert','intersect','interval','into','is','join','lateral','leading','left','like','local','minus','natural','no','not','null','of','on','only','or','order','out','outer','overlaps','partition','position','primary','range','references','revoke','right','rollback','rollup','row','rows','select','semi','session','session_user','set','some','start','table','tablesample','then','time','to','trailing','true','truncate','union','unique','unknown','update','user','using','values','when','where','window','with']

Field names cannot start with current_*.
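As a convenience, a proposed name can be pre-checked locally before building the data dictionary. The helper below is a sketch, not part of the app, and its reserved list is abbreviated for illustration; use the full list above in practice.

```shell
# Sketch: reject names that are reserved words or start with "current_".
# The reserved list here is abbreviated -- use the full list from this page.
is_valid_name() {
  local name="$1"
  local reserved=" all alter and array case cast column create delete distinct drop from group insert into join order select table union update user where with "
  case "$name" in
    current_*) return 1 ;;   # reserved prefix
  esac
  case "$reserved" in
    *" $name "*) return 1 ;; # reserved word
  esac
  return 0
}

is_valid_name patient_id  && echo "patient_id: ok"
is_valid_name select      || echo "select: reserved"
is_valid_name current_age || echo "current_age: reserved prefix"
```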

The Data Model Loader application reads the raw data along with a coding file and dictionary files, runs a series of validations to ensure proper structure is maintained, ingests the data into the Apollo database, and then creates a dataset representing the data, dictionaries, and codings provided.

To ingest large, complex datasets (e.g. a main EHR dataset with hundreds of thousands of patients), engaging with the DNAnexus Professional Services team is advised to ensure an optimal experience.

The data dictionary, coding files, and data are validated per their specifications. Validation is performed in 4 phases, as described in the Process section above.

An entity dictionary is recommended for datasets that will be used in the Cohort Browser, to enable the best user experience. For tables with over 100 million rows, please work with the DNAnexus Professional Services team for an optimized experience.

[Image: Overview of all required and optional file inputs for the Data Model Loader App]