Example Input

There are two categories of input to consider when ingesting data using the Molecular Expression Assay Loader Application: feature contexts and data formats.

An Apollo license is required to use the Molecular Expression Assay Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Input Feature

For the molecular expression model, the core unit to be measured is the “feature”. To represent a molecular expression assay in a Dataset, there are three terms used to describe a feature: feature type, feature ID type, and feature value type. The feature type refers to the general category of what is being measured. The feature ID type refers to a standardized naming method for how an individual feature is identified. The feature value type refers to the method of measurement. For practical purposes, the following is a list of accepted combinations:

| Feature type | Feature ID type | Feature value type |
| --- | --- | --- |
| mRNA | Either ENSG* or ENST* | RPKM (double), FPKM (double), FPKM-UQ (double), TPM (double), count (integer) |
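The accepted combinations can be checked before ingestion. The following is a minimal sketch, not part of the loader: `check_feature` and `VALUE_TYPES` are hypothetical names, the ID check is only a prefix test based on the `ENSG*` / `ENST*` patterns listed here, and the loader's own validation may be stricter.

```python
# Value types from the accepted combinations, mapped to the Python type
# used to parse them (double -> float, integer -> int).
VALUE_TYPES = {
    "RPKM": float,
    "FPKM": float,
    "FPKM-UQ": float,
    "TPM": float,
    "count": int,
}

# Assumption: a prefix test is enough for this sketch; the loader may
# apply stricter rules to feature IDs.
ID_PREFIXES = ("ENSG", "ENST")


def check_feature(feature_id: str, raw_value: str, value_type: str):
    """Return the parsed value, or raise ValueError if the input looks invalid."""
    if not feature_id.startswith(ID_PREFIXES):
        raise ValueError(f"unexpected feature ID: {feature_id!r}")
    if value_type not in VALUE_TYPES:
        raise ValueError(f"unsupported feature value type: {value_type!r}")
    # int("1.5") raises ValueError, so fractional counts are rejected.
    return VALUE_TYPES[value_type](raw_value)
```

For example, `check_feature("ENSG00000200998", "22", "count")` returns the integer `22`, while a fractional value passed with the `count` type raises `ValueError`.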

Input Data Format

Software programs and data suppliers provide data in different formats. DNAnexus aims to support common formats to reduce any data transformation burden prior to ingestion. The following formats are currently supported for simplified ingestion.

Matrix Format

An N x M matrix of N features (rows) by M samples (columns), where each feature and each sample appears only once. The file must include a header row, and the header must include a column for the feature ID. For example:

example_matrix.csv
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
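To illustrate how the matrix layout relates to the long layout, the sketch below unpivots matrix-format text into `(feature_id, sample_id, value)` rows and checks the uniqueness requirements along the way. The function name `matrix_to_long` is hypothetical, and the sketch assumes the feature ID is in the first column, as in the example file.

```python
import csv
import io


def matrix_to_long(matrix_text: str):
    """Yield (feature_id, sample_id, value) tuples from matrix-format CSV text."""
    reader = csv.reader(io.StringIO(matrix_text))
    header = next(reader)
    samples = header[1:]  # assumption: the first column holds the feature ID
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample column in header")
    seen = set()
    for row in reader:
        if not row:  # skip blank lines
            continue
        feature_id, values = row[0], row[1:]
        if feature_id in seen:
            raise ValueError(f"duplicate feature ID: {feature_id}")
        seen.add(feature_id)
        if len(values) != len(samples):
            raise ValueError(f"wrong number of values for {feature_id}")
        for sample_id, value in zip(samples, values):
            yield feature_id, sample_id, value
```

Values are passed through as strings; rows are emitted feature by feature, in the order they appear in the matrix.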

Long Format

A table with N x M rows (one row for each of N features in each of M samples) and 3 columns with headers: the first column is "feature_id", the second is "sample_id", and the third is "value". Each row should contain a unique combination of feature ID and sample ID. For example:

example_table.csv
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
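Because each feature ID and sample ID combination should appear only once, it can be useful to look for repeats before ingestion. The sketch below is one way to do that with the standard library; `duplicate_pairs` is a hypothetical helper, not part of the loader.

```python
import csv
import io
from collections import Counter


def duplicate_pairs(long_text: str):
    """Return the (feature_id, sample_id) pairs that occur more than once."""
    reader = csv.DictReader(io.StringIO(long_text))
    counts = Counter((row["feature_id"], row["sample_id"]) for row in reader)
    return sorted(pair for pair, n in counts.items() if n > 1)
```

An empty list means every row has a unique combination of feature ID and sample ID.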

Manifest Format (Single Sample Files with a Manifest File)

Two sets of files: one manifest file, which lists each data file's ID and its associated sample, and the set of individual data files. The manifest file should have two columns with the headers "file_id" and "sample_id". Each individual data file should have two columns with the headers "feature_id" and "value". For example:

example_manifest.csv
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3

sample_1.csv (file-id: file-G7Pj4Z00Gjv8KJ89PjGbbJG4)
feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1

sample_2.csv (file-id: file-G7Pj4f88GjvGYqGzJgFj9gvK)
feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1

sample_3.csv (file-id: file-G7Pj4h80Gjv7Kv49qjYnbJG6)
feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
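A manifest can be sanity-checked locally before it is passed to the loader. The sketch below parses manifest text and maps each sample to its project and file ID. It assumes that every "file_id" entry uses the `project-...:file-...` form shown in the example and that each sample appears once; `read_manifest` is a hypothetical helper, and the loader itself may accept other forms.

```python
import csv
import io


def read_manifest(manifest_text: str) -> dict:
    """Map each sample_id to a (project_id, file_id) pair."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    if reader.fieldnames != ["file_id", "sample_id"]:
        raise ValueError(f"unexpected manifest header: {reader.fieldnames}")
    samples = {}
    for row in reader:
        # Assumption: IDs use the "project-...:file-..." form from the example.
        project_id, sep, file_id = row["file_id"].partition(":")
        if (
            not sep
            or not project_id.startswith("project-")
            or not file_id.startswith("file-")
        ):
            raise ValueError(f"unexpected file_id: {row['file_id']!r}")
        if row["sample_id"] in samples:
            raise ValueError(f"sample listed twice: {row['sample_id']}")
        samples[row["sample_id"]] = (project_id, file_id)
    return samples
```

The returned mapping can then be compared against the data files on hand, for example to confirm that every listed file has the "feature_id" and "value" headers.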
