Guide for Ingesting a Simple Four Table Dataset

Learn to use the Data Model Loader app to ingest phenotypic data and create a dataset for use in Apollo.


On the DNAnexus Platform, an Apollo license is required to use the features described on this page. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

This walkthrough describes phenotypic data ingestion using a data dictionary file; a codings file, if needed; an optional entity dictionary file; and accompanying data CSV files. The files are loaded using the Data Model Loader app, which validates and ingests the input CSV files to create an Apollo dataset. This Dataset is then accessible using the Cohort Browser, or using JupyterLab and our Python SDK, dxdata.

The following steps show how to organize your data into the required file sets following the Data Ingestion Key Steps. These files can then be loaded using the Data Model Loader app to create a database encapsulated by a Dataset record, both of which are then immediately accessible for use.

Requirements

  • DNAnexus Apollo License

  • Access to the Data Model Loader app

  • A way to manipulate and create CSV files

  • UPLOAD or greater permissions in the project in which the app will be run

Example Raw Files

This walkthrough uses four raw example files:

  • patient.csv (521B)

  • hospital.csv (288B)

  • encounter.csv (652B)

  • test.csv (443B)

Guide

Step 1. Identify Your Data

Specify the Data to Be Used

For this dataset, the following CSV files from the Example Raw Files section above will be used: patient.csv, hospital.csv, encounter.csv, and test.csv. The data is structured as follows:

[Figure: ERD for the example data]

  • Many patients will point to one hospital, but some may be missing hospital information.

  • A patient may have none, one, or many encounters.

  • Each encounter may have none, one, or many tests.

Specify the Main Entity and Main Field

For this dataset, the goal is to create cohorts of different patients to perform analysis on. Some examples of cohorts that can be built are:

  • All of the Patients that were in Hospital X and had an Encounter with Doctor Y

  • All of the Patients with a weight over A

  • All of the Patients born before a date, with a risk factor of H that had an Encounter with Doctor Y

Given these types of questions and the data structure, the main entity will be the patient. The patient data includes a patient_id that is unique for the table and will be used as the key (main field).

Specify Additional Entities

Since the data already has one-to-many relationships, no one-to-one relationships, and no large sets of related multi-select fields, the secondary entities will be kept as:

  • Hospital

  • Encounter

  • Test

Step 2. Provide File Specifications

While the Data Ingestion Key Steps documentation guides the user through the data files column by column, this walkthrough goes through one entity at a time. This approach is recommended to ensure precision when working with a larger or multi-entity dataset.

The Patient CSV (main entity) file should be formatted as follows:

| patient_id | name | age | risk | date_of_birth | weight | resident_state_prov | title | hospital_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | John | 29 | h | 1983-03-05 | 135.56035 | KS | Mr | 2 |
| 2 | Sally | 27 | m | 1951-08-12 | 101.22172 | SC | Dr\|Hon | 2 |
| 3 | Cassy |  | l | 1943-08-10 | 248.04192 | MO |  | 1 |
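For reference, the raw patient.csv behind this table would look something like the following sketch. Missing values are assumed to be left as empty cells, and a multi-select value uses a pipe delimiter, as in Sally's title:

```csv
patient_id,name,age,risk,date_of_birth,weight,resident_state_prov,title,hospital_id
1,John,29,h,1983-03-05,135.56035,KS,Mr,2
2,Sally,27,m,1951-08-12,101.22172,SC,Dr|Hon,2
3,Cassy,,l,1943-08-10,248.04192,MO,,1
```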

Create a Data Dictionary File

Create a data_dictionary.csv file and fill in the fields for entity and name. Add "global" as the primary_key_type value for patient_id.

The Data Dictionary CSV file should be formatted as follows:

| entity | name | primary_key_type |
| --- | --- | --- |
| patient | patient_id | global |
| patient | name |  |
| patient | age |  |
| patient | risk |  |
| patient | date_of_birth |  |
| patient | weight |  |
| patient | resident_state_prov |  |
| patient | title |  |
| patient | hospital_id |  |
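If your entities have many columns, it can help to generate this skeleton rather than hand-typing it. A minimal sketch, assuming the four raw CSVs from this walkthrough sit in the working directory:

```python
import csv

# Entity CSVs from this walkthrough; file names are assumed to match the entity names.
entities = ["patient", "hospital", "encounter", "test"]

with open("data_dictionary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["entity", "name", "primary_key_type"])
    for entity in entities:
        with open(f"{entity}.csv", newline="") as f:
            header = next(csv.reader(f))  # first row of the raw CSV holds the field names
        for column in header:
            # primary_key_type is filled in by hand afterwards:
            # "global" for patient.patient_id, "local" for the other entity keys.
            writer.writerow([entity, column, ""])
```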

Specify Field Characteristics

For each field in each entity, indicate is_sparse_coding and is_multi_select, and specify the field's type.

Layer in the types for the different fields. For this example, there are the following groupings:

  • Integer: patient_id, age, hospital_id

  • Date: date_of_birth

  • Float: weight

  • String: name, risk, resident_state_prov

    • Multi-select: title

The Data Dictionary CSV file should be formatted as follows:

| entity | name | primary_key_type | is_sparse_coding | is_multi_select | type |
| --- | --- | --- | --- | --- | --- |
| patient | patient_id | global |  |  | integer |
| patient | name |  |  |  | string |
| patient | age |  |  |  | integer |
| patient | risk |  |  |  | string |
| patient | date_of_birth |  |  |  | date |
| patient | weight |  |  |  | float |
| patient | resident_state_prov |  |  |  | string |
| patient | title |  |  | yes | string |
| patient | hospital_id |  |  |  | integer |
Specify Coding Settings for Categorical Fields

For categorical fields, build the coding.csv file and determine each coding_name. Layer in codings for categorical fields and for any fields that you want to summarize using categorical chart types. For Patient, create categorical selections for risk, resident_state_prov, and title. First generate a coding file from the dictionary as follows: the raw code in the data is populated in the code column; the text to display in the Cohort Browser goes in the meaning column; and for a hierarchical field (e.g. state_prov), parent_code points to the code of the entry's parent. The display_order can be used to override the order in which codes are displayed in summary tiles.

The Coding CSV file should be formatted as follows:

| coding_name | code | meaning | parent_code | display_order |
| --- | --- | --- | --- | --- |
| risk_level | h | High |  |  |
| risk_level | m | Medium |  |  |
| risk_level | l | Low |  |  |
| state_prov | US | United States |  |  |
| state_prov | CA | Canada |  |  |
| state_prov | KS | Kansas | US |  |
| state_prov | SC | South Carolina | US |  |
| state_prov | MO | Missouri | US |  |
| state_prov | OH | Ohio | US |  |
| state_prov | TX | Texas | US |  |
| state_prov | PA | Pennsylvania | US |  |
| state_prov | AL | Alabama | US |  |
| state_prov | YK | Yukon Territory | CA |  |
| state_prov | BC | British Columbia | CA |  |
| titles | Rev | Reverend |  |  |
| titles | Mr | Mr |  |  |
| titles | Dr | Doctor |  |  |
| titles | Hon | Honorable |  |  |
| titles | Mrs | Mrs |  |  |
| titles | Ms | Ms |  |  |
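Before moving on, it is worth confirming that every raw value in a coded column actually has an entry in coding.csv. A hypothetical sketch using pandas, with the file and column names from this example:

```python
import pandas as pd

codings = pd.read_csv("coding.csv")

def check_coding(data_file, column, coding_name, multi_select=False):
    """Report raw values in a coded column that have no entry in coding.csv."""
    values = pd.read_csv(data_file)[column].dropna().astype(str)
    if multi_select:
        values = values.str.split("|").explode()  # multi-select values are pipe-delimited
    valid = set(codings.loc[codings["coding_name"] == coding_name, "code"].astype(str))
    missing = sorted(set(values) - valid)
    if missing:
        print(f"{data_file}:{column} has uncoded values: {missing}")

check_coding("patient.csv", "risk", "risk_level")
check_coding("patient.csv", "resident_state_prov", "state_prov")
check_coding("patient.csv", "title", "titles", multi_select=True)
```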

Now that you have created the coding file, go back and fill in the coding_name column in data_dictionary.csv with the corresponding coding_name from coding.csv.

The Data Dictionary CSV should be formatted as follows:

| entity | name | primary_key_type | is_sparse_coding | is_multi_select | type | coding_name |
| --- | --- | --- | --- | --- | --- | --- |
| patient | patient_id | global |  |  | integer |  |
| patient | name |  |  |  | string |  |
| patient | age |  |  |  | integer |  |
| patient | risk |  |  |  | string | risk_level |
| patient | date_of_birth |  |  |  | date |  |
| patient | weight |  |  |  | float |  |
| patient | resident_state_prov |  |  |  | string | state_prov |
| patient | title |  |  | yes | string | titles |
| patient | hospital_id |  |  |  | integer |  |

Specify Data Formats

Now you can validate that your fields will be ingested as their expected ingestion data types. The following are the data type to field mappings expected for this walkthrough:

  • Integer: patient_id, age, hospital_id

  • String: name

    • String Categorical: risk

    • String Categorical Hierarchical: resident_state_prov

    • String Categorical Multi-select: title

  • Date: date_of_birth

  • Float: weight
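A quick way to sanity-check these assignments before ingestion is to let pandas attempt the conversions. A sketch, assuming the raw patient.csv shown earlier:

```python
import pandas as pd

patient = pd.read_csv("patient.csv")

# Each conversion raises (or yields unexpected <NA> values) if a cell cannot
# be read as the type declared in the data dictionary.
patient["patient_id"].astype("Int64")   # integer key, no missing values expected
patient["age"].astype("Int64")          # nullable integer: Cassy's age is blank
patient["hospital_id"].astype("Int64")  # integer foreign key
patient["weight"].astype("float")
pd.to_datetime(patient["date_of_birth"], format="%Y-%m-%d")  # date
```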

Define Entity Relationships

Since everything looks as expected, now link the entities together. For Patient, since a relationship can only be one-to-one or many-to-one, create field linkages down the relationship (note that relationships automatically handle cases where joins cannot be made on all rows). Each relationship is created by entering <entity>:<field name> in the referenced_entity_field column and the relationship type in the relationship column. The relationships to create for the example dataset are listed below, followed by a quick way to check them:

  • hospital to patient on hospital_id in a "many_to_one" relationship

  • encounter to patient on patient_id in a "many_to_one" relationship

  • test to encounter on encounter_id in a "many_to_one" relationship
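Because every relationship here is many-to-one through a foreign key, a simple containment check catches broken links before the loader does. A sketch with the file and column names from this example; rows with a blank foreign key are allowed, since relationships handle missing links:

```python
import pandas as pd

def check_link(child_file, fk, parent_file, pk):
    """Every non-null foreign key in the child must exist as a key in the parent."""
    child = pd.read_csv(child_file)[fk].dropna()
    parent = set(pd.read_csv(parent_file)[pk])
    orphans = sorted(set(child) - parent)
    if orphans:
        print(f"{child_file}.{fk} -> {parent_file}.{pk}: unmatched keys {orphans}")

check_link("patient.csv", "hospital_id", "hospital.csv", "hospital_id")
check_link("encounter.csv", "patient_id", "patient.csv", "patient_id")
check_link("test.csv", "encounter_id", "encounter.csv", "encounter_id")
```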

The Data Dictionary CSV file should be formatted as follows:

| entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| patient | patient_id | global |  |  |  |  | integer |  |
| patient | name |  |  |  |  |  | string |  |
| patient | age |  |  |  |  |  | integer |  |
| patient | risk |  |  |  |  |  | string | risk_level |
| patient | date_of_birth |  |  |  |  |  | date |  |
| patient | weight |  |  |  |  |  | float |  |
| patient | resident_state_prov |  |  |  |  |  | string | state_prov |
| patient | title |  |  |  |  | yes | string | titles |
| patient | hospital_id |  | hospital:hospital_id | many_to_one |  |  | integer |  |
| hospital | hospital_id | local |  |  |  |  | integer |  |
| encounter | encounter_id | local |  |  |  |  | integer |  |
| encounter | patient_id |  | patient:patient_id | many_to_one |  |  | integer |  |
| test | test_id | local |  |  |  |  | integer |  |
| test | encounter_id |  | encounter:encounter_id | many_to_one |  |  | integer |  |

Add Additional Descriptive Information

Now that you have the core data relationships and types in place, it's time to add metadata about each field (title, description, units, and linkout), along with specifying the folder path under which the field appears in the field selector of the Cohort Browser. For more information on each field, see the detailed docs.

If you would like the folder structure to mirror the entity structure exactly, leave the folder_path field empty for every field.

If you would like to hide certain fields from the field selector, populate the folder_path field for only the fields you would like to show, and leave the path empty for the fields to hide.

For this walkthrough, instead of mirroring the entity structure, put the identifiers in their own "ID" folder and personally identifiable information (PII) in a "PII" subfolder under "Patient". Foreign keys are left without a folder_path, so they are hidden from the field selector.

The Data Dictionary CSV file should be formatted as follows:

| entity | name | primary_key_type | referenced_entity_field | relationship | is_sparse_coding | is_multi_select | type | coding_name | folder_path | title | description | units | linkout |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| patient | patient_id | global |  |  |  |  | integer |  | ID | Patient ID |  |  |  |
| patient | name |  |  |  |  |  | string |  | Patient | Patient | name of the patient |  | https://ourinternalwebsite.com/resources/id=3948483 |
| patient | age |  |  |  |  |  | integer |  | Patient>PII | Age | age at signup |  |  |
| patient | risk |  |  |  |  |  | string | risk_level | Patient | Risk Level | evaluated risk level at signup |  |  |
| patient | date_of_birth |  |  |  |  |  | date |  | Patient>PII | DOB | date of birth |  |  |
| patient | weight |  |  |  |  |  | float |  | Patient | Weight | baseline weight taken at signup | lbs |  |
| patient | resident_state_prov |  |  |  |  |  | string | state_prov | Patient | Home State / Province | state of current residency |  |  |
| patient | title |  |  |  |  | yes | string | titles | Patient | Titles |  |  |  |
| patient | hospital_id |  | hospital:hospital_id | many_to_one |  |  | integer |  |  |  |  |  |  |
| hospital | hospital_id | local |  |  |  |  | integer |  | ID | Hospital ID |  |  |  |
| encounter | encounter_id | local |  |  |  |  | integer |  | ID | Encounter ID |  |  |  |
| encounter | patient_id |  | patient:patient_id | many_to_one |  |  | integer |  |  |  |  |  |  |
| test | test_id | local |  |  |  |  | integer |  | ID | Test ID |  |  |  |
| test | encounter_id |  | encounter:encounter_id | many_to_one |  |  | integer |  |  |  |  |  |  |

Now that the patient is set, repeat the steps for the other entities: hospital, encounter, and test. Feel free to use the codings file above as a guide for categorical values.

Note that this table only highlights one entity in totality and only a few fields from the other entities. The expectation is that you will now go through the remaining three files and fill in the data yourself.

Generate Entity Details

After that, the only step remaining is to generate the entity_metadata.csv to clean up the labels the browser will use for the entities (refer to the detailed documentation for more guidance). Label fields are left blank when they would simply repeat the entity_title.

The Entity Metadata CSV should be formatted as follows:

| entity | entity_title | entity_label_singular | entity_label_plural | entity_description |
| --- | --- | --- | --- | --- |
| patient | Patients | Patient |  | the patient who visited |
| encounter | Visit | Patient Visit | Patient Visits | each admission is logged as an individual visit |
| test | Lab Tests | Test | Tests | all tests performed during a visit or after but linked to the same visit |
| hospital | Hospital |  | Hospitals |  |
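As a raw file, the entity metadata above would look roughly like this; cells stay empty where a label would repeat the entity_title:

```csv
entity,entity_title,entity_label_singular,entity_label_plural,entity_description
patient,Patients,Patient,,the patient who visited
encounter,Visit,Patient Visit,Patient Visits,each admission is logged as an individual visit
test,Lab Tests,Test,Tests,all tests performed during a visit or after but linked to the same visit
hospital,Hospital,,Hospitals,
```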

Step 3. Ingest Data

Examples of Completed Files Ready for Ingestion

At this point you have not only the raw CSV files from above, but also a data dictionary file, an entity metadata file, and a codings file. If you got stuck, the following exemplar files can be used for ingestion:

  • data_dictionary.csv (2KB): Exemplar Data Dictionary CSV

  • entity_metadata.csv (367B): Exemplar Entity Metadata CSV

  • coding.csv (1KB): Exemplar Coding CSV

Running the Data Model Loader

With the files provided, configure the Data Model Loader and run the ingestion. See below for an example configuration. The four raw example files are input into the Data CSVs input field.

[Figure: Example inputs to the Data Model Loader]

[Figure: Example common setup for the Data Model Loader]
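The same run can be launched programmatically with dxpy. The sketch below is illustrative only: the app name and input field names are assumptions, not the loader's actual I/O spec, so check `dx run <app> --help` for the real keys, and replace the placeholder file IDs with those of your uploaded CSVs:

```python
import dxpy

# Assumed app name; confirm the real one on the platform before running.
app = dxpy.DXApp(name="data_model_loader")

job = app.run({
    # Hypothetical input names; see the app's I/O spec for the actual keys.
    "data_csvs": [dxpy.dxlink("file-xxxx"), dxpy.dxlink("file-yyyy")],
    "data_dictionary_csv": dxpy.dxlink("file-zzzz"),
    "codings_csv": dxpy.dxlink("file-wwww"),
    "entity_metadata_csv": dxpy.dxlink("file-vvvv"),
})
print(job.get_id())  # monitor with `dx watch <job-id>` or in the UI
```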

Step 4. Explore and Analyze Data

Once the Data Model Loader job has completed, you will see a dataset and a database output.

[Figure: The files created by the Data Model Loader]

If you used the supplied metadata files, you should see something similar to the following in the Cohort Browser:

[Figure: View of the Field Selector]

[Figure: Example Cohort Browser view of the ingested example files]
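Beyond the Cohort Browser, the new dataset is queryable from a Spark cluster-enabled DXJupyterLab session using dxdata. A minimal sketch, with a placeholder record ID and the field names from this walkthrough:

```python
import dxdata

# Placeholder ID: use the dataset record produced by your loader job.
dataset = dxdata.load_dataset(id="record-xxxx")

patient = dataset["patient"]                 # the main entity
print([f.name for f in patient.fields])      # patient_id, name, age, ...

# Retrieve a few fields into a Spark DataFrame.
engine = dxdata.connect()
df = patient.retrieve_fields(names=["patient_id", "age", "risk"], engine=engine)
df.show()
```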

