Example Input

There are two categories of input to consider when ingesting data using the Molecular Expression Assay Loader Application: feature contexts and data formats.

An Apollo license is required to use the Molecular Expression Assay Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Input Feature

In the molecular expression model, the core unit of measurement is the "feature". Three terms describe a feature when a molecular expression assay is represented in a Dataset: feature type, feature ID type, and feature value type. The feature type is the general category of what is being measured. The feature ID type is the standardized naming scheme used to identify an individual feature. The feature value type is the method of measurement. The following combinations are accepted:

| Feature type | Feature ID type | Feature value type |
| --- | --- | --- |
| mRNA | Either ENSG* or ENST* | RPKM (double), FPKM (double), FPKM-UQ (double), TPM (double), count (integer) |
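As a rough pre-ingestion check, the accepted combinations listed above can be sketched in Python. This is only an illustration, not the loader's own validation: the function name `check_feature`, the regular expression, and the mapping of "double" to `float` and "integer" to `int` are assumptions made for this example.

```python
import re

# Assumption: an Ensembl gene or transcript ID is "ENSG" or "ENST" followed by
# digits, optionally with a ".<version>" suffix. The loader's rules may differ.
ENSEMBL_ID = re.compile(r"^ENS[GT]\d+(\.\d+)?$")

# Feature value types listed above, mapped to the Python type used to parse them.
VALUE_TYPES = {
    "RPKM": float,
    "FPKM": float,
    "FPKM-UQ": float,
    "TPM": float,
    "count": int,
}


def check_feature(feature_id: str, raw_value: str, value_type: str):
    """Return the parsed value, or raise ValueError if the ID or value looks wrong."""
    if value_type not in VALUE_TYPES:
        raise ValueError(f"unsupported feature value type: {value_type}")
    if not ENSEMBL_ID.match(feature_id):
        raise ValueError(f"not an ENSG*/ENST* identifier: {feature_id}")
    # int("22.5") raises ValueError, so fractional values are rejected for counts.
    return VALUE_TYPES[value_type](raw_value)
```

For instance, `check_feature("ENSG00000200998", "22", "count")` parses the value as an integer, while a fractional value supplied with the `count` type raises `ValueError`.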

Input Data Format

Software programs and data suppliers provide data in different formats. DNAnexus aims to support common formats to reduce any data transformation burden before ingestion. The following formats are currently supported for simplified ingestion.

Matrix Format

An N x M matrix of N features (rows) by M samples (columns), where each feature ID and each sample ID appears only once. The file must include a header row, and that header row must include a column for the feature ID. For example:

example_matrix.csv

feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
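The matrix rules above (a header row, unique features, unique samples) can be checked locally before upload. The sketch below uses only the Python standard library; `validate_matrix` is a hypothetical helper written for this example, and the loader may enforce additional rules that are not covered here.

```python
import csv
import io


def validate_matrix(csv_text: str) -> tuple[int, int]:
    """Check the matrix layout and return (n_features, n_samples)."""
    # Skip completely blank lines so a trailing newline is harmless.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("file is empty; a header row is required")
    header, body = rows[0], rows[1:]
    samples = header[1:]  # first header cell names the feature ID column
    if len(set(samples)) != len(samples):
        raise ValueError("sample IDs in the header must be unique")
    seen = set()
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"row for {row[0]!r} does not have {len(header)} columns")
        if row[0] in seen:
            raise ValueError(f"duplicate feature ID: {row[0]}")
        seen.add(row[0])
    return len(body), len(samples)
```

Passing the contents of example_matrix.csv to this function returns the matrix dimensions; a repeated feature ID or sample ID raises `ValueError`.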

Long Format

A table with N x M rows (one row for each combination of N features and M samples) and 3 columns with headers: the first column is feature_id, the second column is sample_id, and the third column is value. Each row should contain a unique combination of feature ID and sample ID. For example:

example_table.csv

feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
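The matrix and long layouts carry the same values, so one can be reshaped into the other. The following sketch converts matrix-format text into long-format text with the Python standard library. `matrix_to_long` is a hypothetical helper for illustration, and the sample-by-sample row order is chosen only to match example_table.csv; nothing in this document states that the loader requires a particular row order.

```python
import csv
import io


def matrix_to_long(matrix_csv: str) -> str:
    """Reshape matrix-format CSV text into long-format CSV text."""
    reader = csv.reader(io.StringIO(matrix_csv))
    header = next(reader)
    samples = header[1:]  # every header cell after the feature ID column
    rows = [row for row in reader if row]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["feature_id", "sample_id", "value"])
    # Emit all features for sample_1, then sample_2, and so on.
    for col, sample in enumerate(samples, start=1):
        for row in rows:
            writer.writerow([row[0], sample, row[col]])
    return out.getvalue()
```

Applied to the contents of example_matrix.csv, this produces the rows shown in example_table.csv.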

Manifest Format (Single Sample Files with a Manifest File)

Two sets of files: one manifest file that maps the file_id of each data file to its associated sample_id, and a set of individual data files, one per sample. The manifest file should have two columns with the headers file_id and sample_id. Each individual data file should have two columns with the headers feature_id and value. For example:

example_manifest.csv

file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3

sample_1.csv (file-id: file-G7Pj4Z00Gjv8KJ89PjGbbJG4)

feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1

sample_2.csv (file-id: file-G7Pj4f88GjvGYqGzJgFj9gvK)

feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1

sample_3.csv (file-id: file-G7Pj4h80Gjv7Kv49qjYnbJG6)

feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
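A manifest can be sanity-checked locally before ingestion. The sketch below reads manifest text and maps each sample to its project and file identifiers. It assumes that every file_id is written as `<project-id>:<file-id>`, as in example_manifest.csv; whether the loader accepts other forms is not stated here. `read_manifest` is a hypothetical helper, and it does not contact the platform or confirm that the files exist.

```python
import csv
import io


def read_manifest(manifest_csv: str) -> dict[str, tuple[str, str]]:
    """Map each sample_id to its (project_id, file_id) pair."""
    reader = csv.DictReader(io.StringIO(manifest_csv))
    if reader.fieldnames != ["file_id", "sample_id"]:
        raise ValueError("manifest header must be: file_id,sample_id")
    samples: dict[str, tuple[str, str]] = {}
    for row in reader:
        # Assumption: file_id looks like "project-xxxx:file-xxxx".
        project_id, sep, file_id = row["file_id"].partition(":")
        if not sep or not project_id.startswith("project-") or not file_id.startswith("file-"):
            raise ValueError(f"unexpected file_id: {row['file_id']}")
        if row["sample_id"] in samples:
            raise ValueError(f"sample listed more than once: {row['sample_id']}")
        samples[row["sample_id"]] = (project_id, file_id)
    return samples
```

With the contents of example_manifest.csv, the result has one entry per sample, for example `sample_1` mapped to its project ID and file ID.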

