Data Ingestion Key Steps

An Apollo license is required to use the Data Model Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

Generalized phenotypic data ingestion is done with an ingestion process that takes in well-described data in the form of a Data Dictionary file, a Codings file if needed, an optional Entity Dictionary file, and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create a Dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, which is then immediately accessible for use.
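Once ingestion completes, the Dataset can also be explored programmatically. The following is a minimal sketch, assuming a JupyterLab environment with dxdata available and the record ID of the ingested Dataset (the ID shown is a placeholder):

```python
import dxdata

# Load the ingested Dataset by its record ID (placeholder shown).
dataset = dxdata.load_dataset(id="record-XXXXXXXXXXXXXXXXXXXXXXXX")

# List the entities and fields defined during ingestion.
for entity in dataset.entities:
    print(entity.name, [field.name for field in entity.fields])
```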

Guide

Step 1. Identify Your Data

Decide What Type of Data You Will Ingest

Is this data phenotypic or clinical data, or is it molecular data?

Include

  • Examples of phenotypic or clinical data include: a patient's height, encounters with a physician or hospital, surgeries, drugs taken for treatment, etc. Clinical data may also include descriptive content, such as information on samples extracted. For example, the weight and size of a tumor, or the date and time at which a tumor was excised.

  • Gross features that describe molecular content may be considered clinical data. For example, we don't recommend including the allele content of the BRCA2 gene for a patient; however, a field such as "Tested positive for BRCA2 risk allele (yes/no/untested)" may be of use.

Exclude

  • Examples of molecular data include: the allele content of the BRCA2 gene for a patient. This guide does not cover molecular data ingestion. To ingest these and other complex datasets, engaging with the DNAnexus Professional Services team is advised to ensure an optimal experience.

Determine the Main Entity and Main Field

Once you have identified all of the data to include, you next need to decide on the central organizing focus of the data. In almost all situations, the focus will be at the individual subject level (i.e., subject, patient, case, etc.). You can think of the main entity as the "item" you want to summarize data around and build cohorts of.

For example, we often want to group individuals into cohorts of subjects, such as:

  1. Subjects that haven't smoked cigarettes before

  2. Subjects that smoke one pack (or more) of cigarettes a week.

The main entity here would be "subject". The main field would be a unique identifier of the individual subject, such as "subject_id."

We assume all data in a set of data is "linked" together in some manner. For example, if you have a set of data that includes patients and samples, we would expect all included samples to be from a patient contained in the set of data. Samples that have no identified patient should not be included in the data.

Define All Other Entities

An entity is simply a grouping of data. You may group data however best fits your needs. We recommend grouping all data that shares a one-to-one relationship into a single entity, and grouping any nested data as a separate entity. The entities you define will depend on the data you have.

Example

If you have a main entity, subject, you may have another entity, encounter, which contains information from the many encounters a subject has at the hospital, and you may have another entity, sample, which contains information on the many samples extracted from a subject. (A brief sketch of this layout follows the example below.)

  • The entity subject would contain all data that is one-to-one with the subject, such as, "first_name," "last_name," "date_of_birth," "sex," "race," "ethnicity."

  • encounter would contain the "date" of the hospital visit and perhaps the diagnosis, "ICD10_code," that may have resulted from the visit.

  • sample would contain any sample information on the subject, such as the "date" the sample was extracted, the "tissue_type" of the extraction, and the "tissue_weight" of the tissue extracted.

  • Entities will be dependent on the data you have. If you don't have sample data, you don't need a sample entity!
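The following is a minimal sketch of this layout using pandas; the entities, field names, and values are illustrative, taken from the example above:

```python
import pandas as pd

# One row per subject: fields with a one-to-one relationship to the subject.
subject = pd.DataFrame({
    "subject_id": ["s-001", "s-002"],
    "date_of_birth": ["1970-01-01", "1982-05-17"],
    "sex": ["F", "M"],
})

# Nested (one-to-many) data is split into its own entity, linked by subject_id.
encounter = pd.DataFrame({
    "subject_id": ["s-001", "s-001", "s-002"],
    "date": ["2021-03-02", "2022-11-15", "2023-06-30"],
    "ICD10_code": ["E11.9", "I10", "J45.909"],
})

sample = pd.DataFrame({
    "sample_id": ["sa-01", "sa-02"],
    "subject_id": ["s-001", "s-002"],
    "tissue_type": ["tumor", "normal"],
})
```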

Step 2. Defining Your Files

Fill in "Entity" and "Name" for Each Field

Begin filling out data_dictionary.csv by list all fields within your data under the "name" column and label each field with the respective entity name under the column "entity."
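Continuing the sketch above, the first two columns of the data dictionary might start out like this (field names illustrative):

```python
import pandas as pd

# One row per field, labeled with the entity it belongs to.
# The remaining columns are filled in over the following steps.
data_dictionary = pd.DataFrame(
    [
        {"entity": "subject",   "name": "subject_id"},
        {"entity": "subject",   "name": "date_of_birth"},
        {"entity": "subject",   "name": "sex"},
        {"entity": "encounter", "name": "subject_id"},
        {"entity": "encounter", "name": "date"},
        {"entity": "encounter", "name": "ICD10_code"},
        {"entity": "sample",    "name": "sample_id"},
        {"entity": "sample",    "name": "subject_id"},
        {"entity": "sample",    "name": "tissue_type"},
    ]
)
```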

Define the "primary_key_type"

Main Entity

As your main entity and its main field are already defined, write the value "global" under the "primary_key_type" column for that row of information in data_dictionary.csv.

All Other Entities

For every other entity (other than the main entity), determine whether the entity contains a field that is a "primary key." A primary key is UNIQUE and NOT NULL, and acts as an identifier for the entity. Only one primary key may be defined per entity. For example, the entity sample might contain the primary key "sample_id." Although non-main entities don't need a primary key, it is advised that one exists.

For each defined primary key, write the value, "local" under the column "primary_key_type" for that row of information in data_dictionary.csv.
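Continuing the sketch (names illustrative), the primary key rows could be marked like this:

```python
# The main entity's main field gets "global"; other entities' unique
# identifiers get "local".
main_key = (data_dictionary["entity"] == "subject") & (data_dictionary["name"] == "subject_id")
data_dictionary.loc[main_key, "primary_key_type"] = "global"

sample_key = (data_dictionary["entity"] == "sample") & (data_dictionary["name"] == "sample_id")
data_dictionary.loc[sample_key, "primary_key_type"] = "local"
```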

For Each Field in Each Entity, Indicate "is_sparse_coding," "is_multi_select," and "type"

For each field in data_dictionary.csv, provide the value "yes" under the respective column if the field "is_sparse_coding" or "is_multi_select." Also determine the field's "type."
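Continuing the sketch, types and optional flags might be filled in as follows; the type names shown are illustrative, so check the Ingestion Data Types page for the allowed values:

```python
# Record each field's type; flag sparse-coded or multi-select fields with "yes".
data_dictionary.loc[data_dictionary["name"] == "date_of_birth", "type"] = "date"
data_dictionary.loc[data_dictionary["name"] == "ICD10_code", "type"] = "string"
data_dictionary.loc[data_dictionary["name"] == "ICD10_code", "is_multi_select"] = "yes"
data_dictionary.loc[data_dictionary["name"] == "tissue_type", "type"] = "string"
# ...and so on for every remaining field in the dictionary.
```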

Build a codings.csv File and Indicate "coding_name"

Next, determine whether each field is categorical. A field is categorical if its values are not unique, such as the field "Tumor status," where reasonable answers would be "malignant," "benign," "not applicable," or "undetermined." If there are any categorical fields in your data, you will need to create a codings.csv file with the following column names: coding_name, code, meaning, parent_code, display_order, and concept, and to include the specific "coding_name" used for each such field in data_dictionary.csv. See the Codings file description for quick reference.

For each value in a categorical field, fill in the "code," "meaning," "parent_code," "display_order," and "concept" columns in the codings.csv file. Finally, label the set of codes with a unique "coding_name," and use the same "coding_name" value for that field in data_dictionary.csv.
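A minimal codings.csv sketch for the "Tumor status" example; the codes and the coding_name are illustrative:

```python
import pandas as pd

# One row per allowed value; the whole set shares a single coding_name,
# which is also entered for the field in data_dictionary.csv.
codings = pd.DataFrame(
    [
        {"coding_name": "tumor_status", "code": "1", "meaning": "malignant",      "parent_code": "", "display_order": 1, "concept": ""},
        {"coding_name": "tumor_status", "code": "2", "meaning": "benign",         "parent_code": "", "display_order": 2, "concept": ""},
        {"coding_name": "tumor_status", "code": "3", "meaning": "not applicable", "parent_code": "", "display_order": 3, "concept": ""},
        {"coding_name": "tumor_status", "code": "4", "meaning": "undetermined",   "parent_code": "", "display_order": 4, "concept": ""},
    ]
)
codings.to_csv("codings.csv", index=False)
```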

Indicate if a Desired Data Format is Allowed

You now have all of the information needed to determine whether the desired format of each field (float, integer, date, etc.) will be accepted during ingestion. We currently restrict the allowable formats. Review each field in your set of data and confirm that it fits a supported data format (see the Ingestion Data Types page).

If a desired format is not supported, we suggest reformatting your data, restructuring your set of entities, or forcing the field into an allowed data format.

Define Entity Relationships and Determine "referenced_entity_field" and "relationship"

As specified earlier, we assume that all of your included data is linked together in some way.

  • If you only have one entity, these data_dictionary.csv columns may be ignored.

  • If there is more than one entity, you need to describe how each entity relates to another entity.

Start with the main entity and find all entities that have either a one-to-one or a many-to-one relationship with it. For example, the entity encounter would be related many-to-one with the main entity, subject. In the encounter entity, there should be a field that directly links to a field in the subject entity. Each field would be named "subject_id," and for the "subject_id" field in the encounter entity, the data_dictionary.csv "referenced_entity_field" value would be "subject:subject_id" and the "relationship" value would be "many_to_one."
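Continuing the sketch, the linking row for the encounter entity would look like this (names illustrative):

```python
# Link the encounter entity back to the main entity via its subject_id field.
link = (data_dictionary["entity"] == "encounter") & (data_dictionary["name"] == "subject_id")
data_dictionary.loc[link, "referenced_entity_field"] = "subject:subject_id"
data_dictionary.loc[link, "relationship"] = "many_to_one"
```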

Add Additional Descriptive Information: "title," "description," "units," and "linkout"

For each field, fill in any desired descriptive information. See the Data Dictionary file description for details on each of these columns.

Specify Folder for Display in the Cohort Browser: "folder_path"

Fill in "folder_path" for each field. If a value is not specified, the field will not be displayed in the Cohort Browser.

We recommend grouping fields similarly to how entities are grouped; however, this is not required.
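Continuing the sketch, a display folder could be assigned and the dictionary written out; the folder name is illustrative, and the Data Dictionary file description covers how nested folders are expressed:

```python
# Group the subject fields under a Cohort Browser display folder,
# then write the completed dictionary.
data_dictionary.loc[data_dictionary["entity"] == "subject", "folder_path"] = "Demographics"
data_dictionary.to_csv("data_dictionary.csv", index=False)
```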

Organize Data into Respective Entities: data.csv(s)

For each entity in your data_dictionary.csv file, write all data from each field into a flat data.csv file. The data.csv should be named after the entity.

Example

The entity subject should have a respective data file, named "subject.csv."

"subject.csv" should contain as columns all of the fields defined for the subject entity in the data_dictionary.csv file, and its rows should be filled in with data.

Generate Entity Details: entity_metadata.csv

For each entity created, add a row and ensure that the entity and entity_title columns are populated; optionally supply an entity_label_singular, entity_label_plural, and entity_description. For more details on each column, refer to the Entity Dictionary file description.
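A minimal entity_metadata.csv sketch for the example entities (titles and labels illustrative):

```python
import pandas as pd

# One row per entity; entity and entity_title are required, the rest optional.
entity_metadata = pd.DataFrame(
    [
        {"entity": "subject",   "entity_title": "Subject",   "entity_label_singular": "subject",   "entity_label_plural": "subjects"},
        {"entity": "encounter", "entity_title": "Encounter", "entity_label_singular": "encounter", "entity_label_plural": "encounters"},
        {"entity": "sample",    "entity_title": "Sample",    "entity_label_singular": "sample",    "entity_label_plural": "samples"},
    ]
)
entity_metadata.to_csv("entity_metadata.csv", index=False)
```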

You should now have the data.csv files, the data_dictionary.csv, the entity_metadata.csv (optional), and the codings.csv (optional).

Step 3. Ingest Your Files

The resulting data.csv files are then loaded, together with the data_dictionary.csv, entity_metadata.csv, and codings.csv files, into the Data Model Loader app.
