Example Input
There are two categories of input to consider when ingesting data using the Molecular Expression Assay Loader Application: feature contexts and data formats.

Input Feature

For the molecular expression model, the core unit to be measured is the “feature”. To represent a molecular expression assay in a Dataset, there are three terms used to describe a feature: feature type, feature ID type, and feature value type. The feature type refers to the general category of what is being measured. The feature ID type refers to a standardized naming method for how an individual feature is identified. The feature value type refers to the method of measurement. For practical purposes, the following is a list of accepted combinations:
Feature type | Feature ID Type | Feature Value Type
mRNA | Either ENSG* or ENST* | RPKM (double), FPKM (double), FPKM-UQ (double), TPM (double), or count (integer)
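
The accepted combinations can be sketched as a small pre-ingestion check. This is a minimal illustration, not the loader's own validation, which this page does not describe. The function name `check_feature` is hypothetical, and the regular expression is an assumption: it reads "ENSG*" and "ENST*" as an Ensembl-style prefix followed by digits and an optional version suffix.

```python
import re

# Accepted value types and the Python type each value is expected to parse as,
# taken from the list of accepted combinations.
VALUE_TYPES = {
    "RPKM": float,
    "FPKM": float,
    "FPKM-UQ": float,
    "TPM": float,
    "count": int,
}

# Assumption: "ENSG*" / "ENST*" means the prefix, then digits, then an
# optional ".version" suffix. The loader may accept a broader pattern.
FEATURE_ID_PATTERN = re.compile(r"^ENS[GT]\d+(\.\d+)?$")


def check_feature(feature_id: str, raw_value: str, value_type: str) -> bool:
    """Return True if the ID and value fit one of the accepted mRNA combinations."""
    if value_type not in VALUE_TYPES:
        return False
    if not FEATURE_ID_PATTERN.match(feature_id):
        return False
    try:
        # int("22.5") raises ValueError, so fractional counts are rejected.
        VALUE_TYPES[value_type](raw_value)
    except ValueError:
        return False
    return True
```

For instance, `check_feature("ENSG00000200998", "22", "count")` passes, while the same call with the value `"22.5"` does not, because a count is expected to be an integer.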

Input Data Format

Software programs and data suppliers provide data in different formats. DNAnexus aims to support common formats to reduce any data transformation burden prior to ingestion. The following formats are currently supported for simplified ingestion.

Matrix Format

An N x M matrix of N features (rows) by M samples (columns), in which each feature ID and each sample ID appears only once. The file must include a header row, with a header for the feature ID column. For example:
example_matrix.csv
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
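
The matrix rules above (a header row, unique features, unique samples) can be checked before ingestion with a short script. The following is a sketch using only the Python standard library. The function name `check_matrix` and the wording of its messages are hypothetical, and the row-length check is an added assumption about what a well-formed matrix looks like.

```python
import csv
import io


def check_matrix(text: str) -> list[str]:
    """Return a list of problems found in matrix-format CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty; a header row is required"]

    header, body = rows[0], rows[1:]
    problems = []

    # Every column after the feature ID column is treated as a sample.
    samples = header[1:]
    if len(set(samples)) != len(samples):
        problems.append("duplicate sample column in header")

    # Each feature ID should label exactly one row.
    feature_ids = [row[0] for row in body if row]
    if len(set(feature_ids)) != len(feature_ids):
        problems.append("duplicate feature ID")

    # Assumption: every data row has one field per header column.
    for line_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, got {len(row)}"
            )
    return problems
```

An empty list means the text passed these checks. It does not guarantee that ingestion will succeed.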

Long Format

A table with N x M rows (one row for each combination of N features and M samples) and 3 columns with headers: the first column is “feature_id,” the second is “sample_id,” and the third is “value.” Each row should contain a unique combination of feature ID and sample ID. For example:
example_table.csv
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
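
The long format carries the same information as the matrix format, with one row per feature and sample pair. To make the relationship concrete, here is a small sketch that reshapes matrix-format text into long-format text with the standard library. The function name `matrix_to_long` is hypothetical, and the sample-by-sample row order is chosen only to mirror the example above. The format description does not state that order matters.

```python
import csv
import io


def matrix_to_long(matrix_text: str) -> str:
    """Reshape matrix-format CSV text into long-format CSV text."""
    reader = csv.reader(io.StringIO(matrix_text))
    header = next(reader)
    samples = header[1:]  # columns after the feature ID column
    rows = [row for row in reader if row]  # skip blank lines

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["feature_id", "sample_id", "value"])

    # Emit all features for the first sample, then the second, and so on.
    for column, sample in enumerate(samples, start=1):
        for row in rows:
            writer.writerow([row[0], sample, row[column]])
    return out.getvalue()
```

Applied to the three-feature, three-sample matrix example, this produces a header plus nine rows, one for each feature and sample combination.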

Manifest Format (single-sample files with a manifest file)

Two sets of files: one manifest file, which lists each data file ID with its associated sample, and the set of individual data files, one per sample. The manifest file should have two columns with headers, “file_id” and “sample_id.” Individual files should each have two columns with the headers “feature_id” and “value.” For example:
example_manifest.csv
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3
sample_1.csv (file-id: file-G7Pj4Z00Gjv8KJ89PjGbbJG4)
feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1
sample_2.csv (file-id: file-G7Pj4f88GjvGYqGzJgFj9gvK)
feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1
sample_3.csv (file-id: file-G7Pj4h80Gjv7Kv49qjYnbJG6)
feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
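
To show how the manifest ties the per-sample files together, the sketch below joins a manifest with its data files into (feature_id, sample_id, value) records, the same triples the long format holds. It is an illustration only. The function name `manifest_to_long` is hypothetical, and the file contents are passed in as an in-memory dictionary keyed by the manifest's `file_id` value, so no platform API is called or assumed.

```python
import csv
import io


def manifest_to_long(
    manifest_text: str, file_contents: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Join a manifest with its per-sample files into long-format records.

    file_contents maps each file_id string from the manifest to the CSV text
    of that file (a stand-in for fetching the file from the platform).
    """
    records = []
    for entry in csv.DictReader(io.StringIO(manifest_text)):
        data_text = file_contents[entry["file_id"]]
        for row in csv.DictReader(io.StringIO(data_text)):
            # The sample ID comes from the manifest, not from the data file.
            records.append((row["feature_id"], entry["sample_id"], row["value"]))
    return records
```

A `KeyError` from the dictionary lookup indicates that the manifest names a file that was not supplied, which is one of the easier mistakes to make when assembling this format by hand.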