Example Input
There are two categories of input to consider when ingesting data using the Molecular Expression Assay Loader Application: feature contexts and data formats.
Input Feature
For the molecular expression model, the core unit to be measured is the "feature". To represent a molecular expression assay in a Dataset, there are three terms used to describe a feature: feature type, feature ID type, and feature value type. The feature type refers to the general category of what is being measured. The feature ID type refers to a standardized naming method for how an individual feature is identified. The feature value type refers to the method of measurement. For practical purposes, the following is a list of accepted combinations:
Feature type: mRNA
Feature ID type: either ENSG* or ENST*
Feature value type: RPKM (double), FPKM (double), FPKM-UQ (double), TPM (double), or count (integer)
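The accepted combinations above can be checked before ingestion with a small helper. The following is a minimal sketch, not part of the Loader Application: the function name `check_feature` is hypothetical, and it assumes that "ENSG*" and "ENST*" mean IDs beginning with those prefixes and that "double" and "integer" map to Python's `float` and `int`.

```python
# Assumption: "ENSG*" / "ENST*" are read as "IDs that start with this prefix".
ID_PREFIXES = ("ENSG", "ENST")

# Feature value types from the list above, mapped to the Python type
# used here to parse them (double -> float, integer -> int).
VALUE_PARSERS = {
    "RPKM": float,
    "FPKM": float,
    "FPKM-UQ": float,
    "TPM": float,
    "count": int,
}


def check_feature(feature_id, raw_value, value_type):
    """Return the parsed value, or raise ValueError if the input is not accepted."""
    if not feature_id.startswith(ID_PREFIXES):
        raise ValueError(f"unexpected feature ID: {feature_id!r}")
    if value_type not in VALUE_PARSERS:
        raise ValueError(f"unexpected feature value type: {value_type!r}")
    # int("22.5") raises ValueError, so non-integer counts are rejected here.
    return VALUE_PARSERS[value_type](raw_value)
```

For instance, `check_feature("ENSG00000200998", "22", "count")` returns the integer 22, while passing `"22.5"` with the `count` type raises `ValueError`.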
Input Data Format
Software programs and data suppliers provide data in different formats. DNAnexus aims to support common formats to reduce any data transformation burden before ingestion. The following formats are currently supported for simplified ingestion.
Matrix Format
An N x M matrix of N features (rows) by M samples (columns), where each feature and each sample is unique. The file must include a header row, including a column for the feature ID. For example:
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
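The uniqueness requirements of the matrix format can be verified locally before ingestion. The sketch below uses only Python's standard `csv` module; the function name `read_matrix` is hypothetical, the feature ID is assumed to be the first column as in the example, and values are kept as strings so that the check works for any feature value type.

```python
import csv
import io

MATRIX_CSV = """feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
"""


def read_matrix(text):
    """Parse matrix-format text into (samples, {feature_id: {sample_id: value}})."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    samples = header[1:]  # assumption: the first column holds the feature ID
    if len(set(samples)) != len(samples):
        raise ValueError("sample columns must be unique")
    matrix = {}
    for row in body:
        feature_id, values = row[0], row[1:]
        if feature_id in matrix:
            raise ValueError(f"duplicate feature: {feature_id}")
        if len(values) != len(samples):
            raise ValueError(f"wrong number of values for {feature_id}")
        matrix[feature_id] = dict(zip(samples, values))
    return samples, matrix


samples, matrix = read_matrix(MATRIX_CSV)
```

After this runs, `matrix["ENSG00000260796"]["sample_3"]` holds the string `"53"`.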
Long Format
(N x M) x 3 table of N features with M samples (rows) and 3 columns with headers, where the first column is the feature_id, the second column is the sample_id, and the third column is the value. Each row should contain a unique combination of feature ID and sample ID. For example:
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
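The long format carries the same information as the matrix format, with one row per feature and sample combination. As an illustration of that relationship, the sketch below reshapes matrix-format text into long-format text using Python's standard `csv` module. The function name `matrix_to_long` is hypothetical and this conversion is not a step required by the Loader Application.

```python
import csv
import io

MATRIX_CSV = """feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
"""


def matrix_to_long(matrix_text):
    """Reshape matrix-format CSV text into long-format CSV text."""
    rows = list(csv.reader(io.StringIO(matrix_text)))
    samples = rows[0][1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["feature_id", "sample_id", "value"])
    seen = set()
    # Emit all features for sample_1, then sample_2, and so on.
    for col, sample_id in enumerate(samples, start=1):
        for row in rows[1:]:
            key = (row[0], sample_id)
            if key in seen:
                raise ValueError(f"duplicate feature/sample pair: {key}")
            seen.add(key)
            writer.writerow([row[0], sample_id, row[col]])
    return out.getvalue()


long_csv = matrix_to_long(MATRIX_CSV)
```

Applied to the matrix example, this yields the header row followed by nine data rows, in the same order as the long-format example above.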
Manifest Format (Single Sample Files with a Manifest File)
Two sets of files: one manifest file that maps each data file's file_id to its associated sample_id, and a set of individual data files. The manifest file should have two columns with the headers file_id and sample_id. Individual files should each have two columns with the headers feature_id and value. For example, a manifest file followed by the three individual data files it lists:
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3
feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1
feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1
feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
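To show how the manifest ties the individual files together, the sketch below joins each data file to its sample_id and produces (feature_id, sample_id, value) rows. It is an illustration only: the function name `manifest_to_long` is hypothetical, and the file contents are supplied as an in-memory dictionary keyed by file_id, since retrieving the files from the platform is outside the scope of this example.

```python
import csv
import io

MANIFEST = """file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3
"""

# Stand-in for the individual data files: file_id -> file contents.
FILES = {
    "project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4":
        "feature_id,value\nENSG00000200998,22\nENSG00000260796,64\nENSG00000225672,1\n",
    "project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK":
        "feature_id,value\nENSG00000200998,48\nENSG00000260796,4\nENSG00000225672,1\n",
    "project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6":
        "feature_id,value\nENSG00000200998,1\nENSG00000260796,53\nENSG00000225672,1\n",
}


def manifest_to_long(manifest_text, files):
    """Return a list of (feature_id, sample_id, value) tuples."""
    rows = []
    for entry in csv.DictReader(io.StringIO(manifest_text)):
        content = files[entry["file_id"]]  # KeyError if a listed file is missing
        for record in csv.DictReader(io.StringIO(content)):
            rows.append((record["feature_id"], entry["sample_id"], record["value"]))
    return rows


long_rows = manifest_to_long(MANIFEST, FILES)
```

The resulting rows match the long-format representation of the same data: three features for each of the three samples.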