Molecular Expression Assay Loader
Overview
The Molecular Expression Assay Loader app helps you ingest molecular expression data into an Apollo Dataset. You can use this dataset on its own or combine it with existing datasets and tools like the Cohort Browser, JupyterLab, and analysis applications.
The app reads your raw expression data, validates it, uploads it to the Apollo database, adds annotations, and creates a Dataset with a "Molecular Expression" assay. The molecular expression model supports bulk mRNA gene (ENSG) or transcript (ENST) expression values with measurement types including rpkm, fpkm, fpkm-uq, tpm, or count.
For large molecular expression datasets (where samples × features exceed 100 million), consult with DNAnexus Professional Services to ensure optimal performance and user experience.
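As a rough pre-check before ingestion, the sizing guidance above can be expressed as a product of the matrix dimensions. This is a minimal sketch: the function and constant names are illustrative and not part of the app, and it treats "exceed" as strictly greater than 100 million.

```python
# Illustrative helper; the names here are hypothetical and not part of the app.
LARGE_DATASET_THRESHOLD = 100_000_000  # samples x features, per the guidance above


def exceeds_large_dataset_threshold(n_samples, n_features):
    """Return True when samples x features is strictly above 100 million."""
    return n_samples * n_features > LARGE_DATASET_THRESHOLD


# 2,000 samples x 60,000 features = 120 million values
print(exceeds_large_dataset_threshold(2_000, 60_000))
```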

How to Use the App
Using the UI
To use the Molecular Expression Assay Loader within the DNAnexus Platform:
In the DNAnexus Platform, go to Tools > Tools Library.
For the Molecular Expression Assay Loader app, click Run Latest Version.
In Output to, select a project and output location for the app's outputs.
Click Next.
In the Inputs tab, specify the required inputs.
Click Start Analysis.

Using the CLI
To use the Molecular Expression Assay Loader from the command-line interface, install the DNAnexus Platform SDK.
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the dx command right away.
Use the following command format, customizing the input parameters for your specific data.
dx run app-molecular-expression-assay-loader \
-i source_expression_data=example_matrix.csv \
-i reference="GRCh38.p13" \
-i feature_type="mRNA" \
-i feature_id_type="transcript_ENST" \
-i feature_value_type="tpm" \
-i assay_title="my_expression_assay" \
-i dataset_name="my_expression_dataset" \
-i database="my_somatic_database"

Inputs
Supported Data Types
The molecular expression model uses "features" as the core measurement unit. Each feature is described by three components: feature type (the general category being measured), feature ID type (standardized naming method), and feature value type (method of measurement). The model supports bulk mRNA expression data with Ensembl gene or transcript IDs.
Required Parameters
Source Expression Data - The source files for the molecular expression data being ingested.
Source Expression Data Format - Format of source files (auto, matrix, long, or manifest). In auto mode, the app detects the format automatically.
Assay Title - Human-readable, short descriptive name to reference the molecular expression assay. Must be less than 256 ASCII characters.
Feature Type - Type of molecular feature assayed. Supports mRNA.
Feature ID Type - Standardized ID type for referencing features. Options:
gene_ENSG - Ensembl gene ID, such as ENSG00000139618
transcript_ENST - Ensembl transcript ID, such as ENST00000380152
Must match the ID type found in your source expression data.
Feature Value Type - Data type of measured molecular feature values:
rpkm - Reads per kilobase of transcript per million reads mapped
fpkm - Fragments per kilobase of transcript per million reads mapped
fpkm-uq - Fragments per kilobase of transcript per million mapped reads, upper quartile normalized
tpm - Transcripts per million
count - Gene counts (integer values)
Reference - Reference genome for annotation. Options: GRCh38.p13 or GRCh37.
Database - Name or platform-specific ID of database for the ingested data. Must start with a lowercase alphabetic character or underscore, using only alphanumeric characters, underscores, and hyphens. Maximum 256 characters.
See in-product app documentation for additional configuration options.
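The required parameters above map onto the -i inputs shown in the CLI example. The sketch below assembles that command as an argument list and rejects values outside the options listed in this section. It is an illustration only: build_loader_command and ALLOWED are hypothetical names, the allowed values are copied from this page, and the code only builds the list without running anything.

```python
# Allowed values copied from the parameter descriptions on this page.
ALLOWED = {
    "reference": {"GRCh38.p13", "GRCh37"},
    "feature_type": {"mRNA"},
    "feature_id_type": {"gene_ENSG", "transcript_ENST"},
    "feature_value_type": {"rpkm", "fpkm", "fpkm-uq", "tpm", "count"},
}
APP_NAME = "app-molecular-expression-assay-loader"


def build_loader_command(inputs):
    """Return a `dx run` argument list for the given inputs (hypothetical helper)."""
    for name, allowed in ALLOWED.items():
        value = inputs.get(name)
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    command = ["dx", "run", APP_NAME]
    for name, value in inputs.items():
        command += ["-i", f"{name}={value}"]
    return command


# Same values as the CLI example earlier on this page.
example_inputs = {
    "source_expression_data": "example_matrix.csv",
    "reference": "GRCh38.p13",
    "feature_type": "mRNA",
    "feature_id_type": "transcript_ENST",
    "feature_value_type": "tpm",
    "assay_title": "my_expression_assay",
    "dataset_name": "my_expression_dataset",
    "database": "my_somatic_database",
}
print(" ".join(build_loader_command(example_inputs)))
```

To launch the job, the returned list could be passed to subprocess.run(command, check=True) in an environment where the dx command is installed and you are logged in.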
Supported Data Formats
The Molecular Expression Assay Loader app supports three different data formats to reduce transformation burden before ingestion.
Matrix Format
N x M matrix of N features (rows) by M samples (columns), where each feature and sample is unique. A header row must be provided as part of this format, including a column for the feature ID. For example:
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1

Long Format
(N x M) x 3 table of N features with M samples (rows) and 3 columns with headers, where the first column is the feature_id, the second column is the sample_id, and the third column is the "value." Each row should contain a unique combination of feature ID and sample ID. For example:
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1

Manifest Format
Two sets of files: one manifest file that describes the respective data file_id and associated sample_id, and a set of individual data files. The manifest file should have two columns with headers, file_id and sample_id. Individual files should each have two columns with the headers feature_id and value. For example:
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3

feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1

feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1

feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1

Outputs
Database - Output database name.
Dataset - Dataset containing the MolecularExpressionAssay object and a phenotypic model with assay sample IDs abstracted from the Source Expression Data file.
Cluster Logs - When collect_logs is TRUE, the logs are written into the job folder.
You can use the generated dataset in the Cohort Browser by clicking on the dataset name, or selecting the dataset record and clicking Explore Data.

If you'd like to merge multiple datasets for comprehensive analysis, see Creating Multi-Assay Datasets.
Best Practices & Troubleshooting
Data Preparation
File Format Requirements
Ensure your file extension matches column delimiters (comma for CSV files, tab for TSV files).
Accepted file extensions: csv, tsv, txt, gz, bz2.
If you see a "provided a {file_extension} file" error, convert the file to an accepted file type.
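The extension and delimiter rules above can be checked locally before upload. The following is a sketch only: check_file_name_and_header is a hypothetical helper, and judging a compressed file by its inner extension (for example, treating matrix.csv.gz as CSV) is an assumption rather than documented app behavior.

```python
ACCEPTED_EXTENSIONS = {"csv", "tsv", "txt", "gz", "bz2"}  # from the list above
EXPECTED_DELIMITER = {"csv": ",", "tsv": "\t"}


def check_file_name_and_header(file_name, header_line):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    parts = file_name.lower().split(".")
    ext = parts[-1] if len(parts) > 1 else ""
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"extension {ext!r} is not accepted")
        return problems
    # Assumption: for compressed files, use the extension before .gz or .bz2.
    if ext in {"gz", "bz2"} and len(parts) > 2:
        ext = parts[-2]
    delimiter = EXPECTED_DELIMITER.get(ext)
    if delimiter is not None and delimiter not in header_line:
        problems.append(f"{ext} file header does not contain the expected delimiter")
    return problems
```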
Feature ID Guidelines
Use Ensembl IDs without version numbers. For example, use ENSG00000139618, not ENSG00000139618.10.
Data with version formatting still ingests, but downstream annotation may fail.
Ensure feature ID type matches what's in your source data.
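One way to apply the feature ID guidelines above is to strip version suffixes and confirm the ID prefix before ingestion. This is a sketch under stated assumptions: normalize_feature_id is a hypothetical helper, and the pattern assumes human Ensembl IDs of the form ENSG or ENST followed by 11 digits, as in the examples on this page.

```python
import re

# Assumed pattern: ENSG/ENST, 11 digits, and an optional ".<version>" suffix.
_ENSEMBL_ID = re.compile(r"(ENS[GT]\d{11})(\.\d+)?")
PREFIX_BY_ID_TYPE = {"gene_ENSG": "ENSG", "transcript_ENST": "ENST"}


def normalize_feature_id(feature_id, feature_id_type):
    """Strip a version suffix and confirm the ID matches the declared ID type."""
    match = _ENSEMBL_ID.fullmatch(feature_id.strip())
    if match is None:
        raise ValueError(f"{feature_id!r} does not look like an Ensembl ID")
    base_id = match.group(1)
    if not base_id.startswith(PREFIX_BY_ID_TYPE[feature_id_type]):
        raise ValueError(f"{base_id} does not match feature ID type {feature_id_type}")
    return base_id


print(normalize_feature_id("ENSG00000139618.10", "gene_ENSG"))  # ENSG00000139618
```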
Data Quality Requirements
Expression data must not contain NULL, missing, or "NA" values. Only non-negative numeric values are allowed.
If you see a "Field {field} is sparsely coded" error, review your data to ensure all features have values for all samples.
Each feature and sample combination should be unique.
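The data quality requirements above can be checked on a matrix-format file before ingestion. The sketch below is not the app's own validation: find_matrix_problems is a hypothetical helper, it assumes comma-delimited text with a header row, and it treats infinite and NaN values as out of range.

```python
import csv
import io
import math


def find_matrix_problems(csv_text):
    """Return human-readable problems found in a matrix-format CSV string."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ["file is empty"]
    samples = rows[0][1:]  # first header column holds the feature ID
    problems = []
    if len(set(samples)) != len(samples):
        problems.append("duplicate sample IDs in header")
    seen_features = set()
    for row in rows[1:]:
        feature_id, values = row[0], row[1:]
        if feature_id in seen_features:
            problems.append(f"duplicate feature ID {feature_id}")
        seen_features.add(feature_id)
        if len(values) != len(samples):
            problems.append(f"{feature_id}: expected {len(samples)} values, found {len(values)}")
        for sample, raw in zip(samples, values):
            try:
                value = float(raw)
            except ValueError:
                # Covers empty cells and text such as "NA" or "NULL".
                problems.append(f"{feature_id}/{sample}: non-numeric value {raw!r}")
                continue
            if math.isnan(value) or math.isinf(value) or value < 0:
                problems.append(f"{feature_id}/{sample}: value {raw!r} is out of range")
    return problems
```

An empty list means none of these checks found an issue. It does not guarantee that the app will accept the file.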
Input Parameter Issues
Naming Conventions
All names (assay, dataset, database) must follow these rules:
Start with an alphabetic character.
Maximum 256 characters.
Use only alphanumeric characters, underscores (_), and hyphens (-).
No spaces allowed.
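The naming rules listed above translate directly into a regular expression. This is a minimal sketch with a hypothetical helper name. It follows only this list, assumes "alphanumeric" means ASCII letters and digits, and does not cover the separate Database parameter description, which mentions a lowercase letter or underscore as the first character.

```python
import re

# A letter first, then letters, digits, underscores, or hyphens; 256 characters at most.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,255}")


def is_valid_name(name):
    """Return True when the name satisfies the naming conventions listed above."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["my_expression_assay", "1st_assay", "my assay"]:
    print(candidate, is_valid_name(candidate))
```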
Common Errors
assay_title must contain only less than 256 ASCII characters - Shorten your title.
assay_name should start with an alphabetic character - Ensure names begin with a letter.
database name/ID should start with an alphabetic character - Check database name format.
Data Format Issues
Format Detection Problems
If you see a "could not identify the type of data" error, automatic format detection failed.
Ensure your data follows one of the three supported formats (Matrix, Long, or Manifest).
For Manifest format: individual files must have headers with first column as feature_id and second as value.
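To see which of the three layouts a file resembles before relying on auto mode, you can inspect its header row. The heuristic below is an assumption based on the example headers on this page. It is not the app's detection logic, and guess_expression_format is a hypothetical name.

```python
def guess_expression_format(header):
    """Guess "matrix", "long", or "manifest" from a list of header column names."""
    columns = [column.strip() for column in header]
    if columns == ["file_id", "sample_id"]:
        return "manifest"
    if columns == ["feature_id", "sample_id", "value"]:
        return "long"
    # Matrix: a feature ID column followed by one or more unique sample columns.
    # Note: a per-sample manifest data file (feature_id,value) also lands here.
    if len(columns) >= 2 and len(set(columns)) == len(columns):
        return "matrix"
    raise ValueError("header does not match the matrix, long, or manifest layout")


print(guess_expression_format(["feature_id", "sample_1", "sample_2", "sample_3"]))  # matrix
```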
File Structure for Manifest Format
When using manifest format, ensure:
Manifest file has columns: file_id, sample_id.
Individual sample files have columns: feature_id, value.
All individual files contain proper headers.
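The manifest checklist above can be sketched as a local check. It assumes comma-delimited text and that you have already read each data file into a string keyed by its file_id. Fetching files from the platform is out of scope here, and check_manifest is a hypothetical helper.

```python
import csv
import io


def check_manifest(manifest_text, data_file_texts):
    """Check manifest and data file headers; data_file_texts maps file_id to file text."""
    rows = [row for row in csv.reader(io.StringIO(manifest_text)) if row]
    if not rows or rows[0] != ["file_id", "sample_id"]:
        return ["manifest header must be file_id,sample_id"]
    problems = []
    for row in rows[1:]:
        if len(row) != 2:
            problems.append(f"manifest row {row!r} must have two columns")
            continue
        file_id, sample_id = row
        text = data_file_texts.get(file_id)
        if text is None:
            problems.append(f"{sample_id}: no data file provided for {file_id}")
            continue
        lines = text.splitlines()
        first_line = lines[0] if lines else ""
        if first_line.split(",") != ["feature_id", "value"]:
            problems.append(f"{sample_id}: data file header must be feature_id,value")
    return problems
```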
Next Steps
For details on analyzing gene expression assays in the Cohort Browser, see Analyzing Gene Expression Data.
Create multi-assay datasets by combining with other dataset types (clinical, germline variant, somatic variant).
Use dx extract_assay expression to access molecular expression data.
Parse assay metadata and access molecular expression data in Spark-enabled JupyterLab instances.
See tutorial notebooks in OpenBio for: