Molecular Expression Assay Loader

An Apollo license is required to use the Molecular Expression Assay Loader on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Overview

The Molecular Expression Assay Loader app helps you ingest molecular expression data into an Apollo Dataset. You can use this dataset on its own or combine it with existing datasets and tools like the Cohort Browser, JupyterLab, and analysis applications.

The app reads your raw expression data, validates it, uploads it to the Apollo database, adds annotations, and creates a Dataset with a "Molecular Expression" assay. The molecular expression model supports bulk mRNA gene (ENSG) or transcript (ENST) expression values with measurement types including rpkm, fpkm, fpkm-uq, tpm, or count.

For large molecular expression datasets (where samples × features exceed 100 million), consult with DNAnexus Professional Services to ensure optimal performance and user experience.
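As a rough way to see where a dataset falls relative to that threshold, the sketch below counts features and samples in matrix-format CSV text and multiplies them. The function names and the in-memory example are illustrative assumptions, not part of the app.

```python
import csv
import io

# Threshold taken from the guidance above: samples x features.
LARGE_DATASET_THRESHOLD = 100_000_000


def matrix_size(csv_text: str) -> tuple[int, int]:
    """Return (n_features, n_samples) for matrix-format CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    n_samples = len(header) - 1  # the first column holds the feature ID
    n_features = sum(1 for row in reader if row)  # ignore blank lines
    return n_features, n_samples


def exceeds_threshold(n_features: int, n_samples: int) -> bool:
    """True when samples x features is strictly above the threshold."""
    return n_features * n_samples > LARGE_DATASET_THRESHOLD


example = """feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
"""
features, samples = matrix_size(example)
print(features, samples, exceeds_threshold(features, samples))  # 3 3 False
```

For a real file, stream it from disk instead of holding the text in memory; the counting logic stays the same.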

Inputs and outputs of the Molecular Expression Assay Loader

How to Use the App

Using the UI

To use the Molecular Expression Assay Loader within the DNAnexus Platform:

  1. In the DNAnexus Platform, go to Tools > Tools Library.

  2. For the Molecular Expression Assay Loader app, click Run Latest Version.

  3. In Output to, select a project and output location for the app's outputs.

  4. Click Next.

  5. In the Inputs tab, specify the required inputs.

  6. Click Start Analysis.

The input fields of the Molecular Expression Assay Loader

Using the CLI

To use the Molecular Expression Assay Loader from the command-line interface, install the DNAnexus Platform SDK.

Use the following command format, customizing the input parameters for your specific data.

dx run app-molecular-expression-assay-loader \
  -i source_expression_data=example_matrix.csv \
  -i reference="GRCh38.p13" \
  -i feature_type="mRNA" \
  -i feature_id_type="transcript_ENST" \
  -i feature_value_type="tpm" \
  -i assay_title="my_expression_assay" \
  -i dataset_name="my_expression_dataset" \
  -i database="my_somatic_database"
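
If you launch the app from a script, one option is to assemble the same command as an argument list. This is a minimal sketch that only builds and prints the command shown above; `build_dx_run_command` is a hypothetical helper, and actually running it (for example with `subprocess.run`) assumes the `dx` client is installed and logged in.

```python
import shlex


def build_dx_run_command(inputs: dict[str, str]) -> list[str]:
    """Build the argument list for the dx run command shown above."""
    argv = ["dx", "run", "app-molecular-expression-assay-loader"]
    for name, value in inputs.items():
        # Each input becomes one "-i name=value" pair.
        argv += ["-i", f"{name}={value}"]
    return argv


inputs = {
    "source_expression_data": "example_matrix.csv",
    "reference": "GRCh38.p13",
    "feature_type": "mRNA",
    "feature_id_type": "transcript_ENST",
    "feature_value_type": "tpm",
    "assay_title": "my_expression_assay",
    "dataset_name": "my_expression_dataset",
    "database": "my_somatic_database",
}
argv = build_dx_run_command(inputs)
print(shlex.join(argv))  # shell-quoted form, for review before running
```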

Inputs

Supported Data Types

The molecular expression model uses "features" as the core measurement unit. Each feature is described by three components: feature type (the general category being measured), feature ID type (the standardized naming method), and feature value type (the method of measurement). The model supports bulk mRNA expression data with Ensembl gene or transcript IDs.

Required Parameters

  • Source Expression Data - The source files for the molecular expression data being ingested.

  • Source Expression Data Format - Format of source files (auto, matrix, long, or manifest). In auto mode, the app detects the format automatically.

  • Assay Title - Short, human-readable name used to reference the molecular expression assay. Must be fewer than 256 ASCII characters.

  • Feature Type - Type of molecular feature assayed. Supports mRNA.

  • Feature ID Type - Standardized ID type for referencing features. Options:

    • gene_ENSG - Ensembl gene ID, such as ENSG00000139618

    • transcript_ENST - Ensembl transcript ID, such as ENST00000380152

    Must match the ID type found in your source expression data.

  • Feature Value Type - Data type of measured molecular feature values:

    • rpkm - Reads per kilobase of transcript per million reads mapped

    • fpkm - Fragments per kilobase of transcript per million reads mapped

    • fpkm-uq - Fragments per kilobase of transcript per million mapped reads, upper-quartile normalized

    • tpm - Transcripts per million

    • count - Gene counts (integer values)

  • Reference - Reference genome for annotation. Options: GRCh38.p13 or GRCh37.

  • Database - Name or platform-specific ID of database for the ingested data. Must start with a lowercase alphabetic character or underscore, using only alphanumeric characters, underscores, and hyphens. Maximum 256 characters.

See in-product app documentation for additional configuration options.
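
Before submitting a job, it can help to check parameter values against the options listed above. The following is a sketch under assumptions: the dictionary keys reuse the input names from the CLI example, the helper names are hypothetical, and the prefix check is only a quick heuristic for whether feature IDs match the chosen Feature ID Type.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED_VALUES = {
    "feature_type": {"mRNA"},
    "feature_id_type": {"gene_ENSG", "transcript_ENST"},
    "feature_value_type": {"rpkm", "fpkm", "fpkm-uq", "tpm", "count"},
    "reference": {"GRCh38.p13", "GRCh37"},
}

# Ensembl ID prefix expected for each feature ID type.
ID_PREFIX = {"gene_ENSG": "ENSG", "transcript_ENST": "ENST"}


def check_parameters(params: dict[str, str]) -> list[str]:
    """Return one message per parameter whose value is not an allowed option."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        value = params.get(name)
        if value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


def ids_match_type(feature_ids: list[str], feature_id_type: str) -> bool:
    """True when every feature ID starts with the prefix for the chosen type."""
    prefix = ID_PREFIX[feature_id_type]
    return all(fid.startswith(prefix) for fid in feature_ids)


params = {
    "feature_type": "mRNA",
    "feature_id_type": "gene_ENSG",
    "feature_value_type": "tpm",
    "reference": "GRCh38.p13",
}
print(check_parameters(params))  # []
print(ids_match_type(["ENSG00000139618"], "gene_ENSG"))  # True
```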

Supported Data Formats

The Molecular Expression Assay Loader app supports three different data formats to reduce transformation burden before ingestion.

Matrix Format

N x M matrix of N features (rows) by M samples (columns), where each feature and sample is unique. A header row must be provided as part of this format, including a column for the feature ID. For example:

example_matrix.csv
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
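
The uniqueness and header requirements for matrix format can be checked before upload. This sketch is an illustration rather than the app's own validation; `check_matrix` is a hypothetical name and it assumes comma-delimited text.

```python
import csv
import io


def check_matrix(csv_text: str) -> list[str]:
    """Return a list of problems found in matrix-format CSV text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    problems = []

    samples = header[1:]  # the first header cell names the feature ID column
    if len(set(samples)) != len(samples):
        problems.append("duplicate sample columns")

    features = [row[0] for row in body]
    if len(set(features)) != len(features):
        problems.append("duplicate feature rows")

    for row in body:
        if len(row) != len(header):
            problems.append(
                f"row {row[0]} has {len(row)} columns, expected {len(header)}"
            )
    return problems


matrix_text = """feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
"""
print(check_matrix(matrix_text))  # []
```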

Long Format

A table with N x M rows (one row for each combination of N features and M samples) and 3 columns with headers: the first column is the feature_id, the second is the sample_id, and the third is the value. Each row should contain a unique combination of feature ID and sample ID. For example:

example_table.csv
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
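
The relationship between the matrix and long formats can be sketched as a simple reshaping step. The helper below is illustrative (the name `matrix_to_long` is an assumption) and emits rows sample by sample, the same order as the example table above.

```python
import csv
import io


def matrix_to_long(csv_text: str) -> list[tuple[str, str, str]]:
    """Reshape matrix-format CSV text into (feature_id, sample_id, value) rows."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    records = []
    # Column 0 is the feature ID; columns 1..M are samples.
    for column, sample_id in enumerate(header[1:], start=1):
        for row in body:
            records.append((row[0], sample_id, row[column]))
    return records


matrix_text = """feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
"""
records = matrix_to_long(matrix_text)
print("feature_id,sample_id,value")
for record in records:
    print(",".join(record))
```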

Manifest Format

Two sets of files: a manifest file that maps each data file's file_id to its sample_id, and a set of individual data files, one per sample. The manifest file should have two columns with the headers file_id and sample_id. Each individual file should have two columns with the headers feature_id and value. For example:

example_manifest.csv
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3

sample_1.csv (file-id: file-G7Pj4Z00Gjv8KJ89PjGbbJG4)
feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1

sample_2.csv (file-id: file-G7Pj4f88GjvGYqGzJgFj9gvK)
feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1

sample_3.csv (file-id: file-G7Pj4h80Gjv7Kv49qjYnbJG6)
feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
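
To sanity-check a manifest before ingestion, you can parse it and split each file_id into its project and file parts. This is a sketch only: `parse_manifest` is a hypothetical helper, and treating an entry without a project prefix as a bare file ID is an assumption, not documented behavior.

```python
import csv
import io


def parse_manifest(csv_text: str) -> dict[str, tuple[str, str]]:
    """Map each sample_id to a (project_id, file_id) pair from the manifest."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        project_id, separator, file_id = row["file_id"].partition(":")
        if not separator:
            # Assumption: no "project-...:" prefix means a bare file ID.
            project_id, file_id = "", row["file_id"]
        if row["sample_id"] in mapping:
            raise ValueError(f"duplicate sample_id: {row['sample_id']}")
        mapping[row["sample_id"]] = (project_id, file_id)
    return mapping


manifest_text = """file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3
"""
manifest = parse_manifest(manifest_text)
for sample_id, (project_id, file_id) in manifest.items():
    print(sample_id, project_id, file_id)
```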

Outputs

  • Database - Output database name.

  • Dataset - Dataset containing the MolecularExpressionAssay object and a phenotypic model with assay sample IDs abstracted from the Source Expression Data file.

  • Cluster Logs - When collect_logs is TRUE, the logs are written into the job folder.

You can use the generated dataset in the Cohort Browser by clicking on the dataset name, or selecting the dataset record and clicking Explore Data.

Best Practices & Troubleshooting

Data Preparation

File Format Requirements

  • Ensure your file extension matches column delimiters (comma for CSV files, tab for TSV files).

  • Accepted file extensions: csv, tsv, txt, gz, bz2.

  • If you see a "provided a {file_extension} file" error, convert the file to an accepted file type.
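
The extension and delimiter rules above can be checked locally. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it treats .gz and .bz2 as compression wrappers around an inner extension, which the source does not state explicitly.

```python
from pathlib import PurePath
from typing import Optional

ACCEPTED_EXTENSIONS = {"csv", "tsv", "txt", "gz", "bz2"}
COMPRESSED_EXTENSIONS = {"gz", "bz2"}
DELIMITER_BY_EXTENSION = {"csv": ",", "tsv": "\t"}


def expected_delimiter(filename: str) -> Optional[str]:
    """Return the delimiter implied by the extension, or None if undetermined."""
    suffixes = [s.lstrip(".").lower() for s in PurePath(filename).suffixes]
    if not suffixes or suffixes[-1] not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {filename}")
    if suffixes[-1] in COMPRESSED_EXTENSIONS:
        suffixes = suffixes[:-1]  # look at the extension inside the archive
    if not suffixes:
        return None
    return DELIMITER_BY_EXTENSION.get(suffixes[-1])  # None for txt


def header_uses_delimiter(header_line: str, delimiter: str) -> bool:
    """True when the header line contains the expected delimiter."""
    return delimiter in header_line


print(repr(expected_delimiter("example_matrix.csv")))  # ','
print(repr(expected_delimiter("example_matrix.tsv.gz")))  # '\t'
```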

Feature ID Guidelines

  • Use Ensembl IDs without version numbers. For example, use ENSG00000139618, not ENSG00000139618.10.

  • Data with versioned IDs is still ingested, but downstream annotation may fail.

  • Ensure feature ID type matches what's in your source data.
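
Version suffixes can be removed before ingestion with a small regular expression. This is a minimal sketch that assumes IDs of the form shown above (ENSG or ENST followed by 11 digits and an optional ".N" version); `strip_version` is a hypothetical helper, and IDs in any other form are returned unchanged.

```python
import re

# ENSG/ENST + 11 digits, followed by a ".<version>" suffix.
VERSIONED_ID = re.compile(r"^(ENS[GT]\d{11})\.\d+$")


def strip_version(feature_id: str) -> str:
    """Drop a trailing version number from an Ensembl gene or transcript ID."""
    match = VERSIONED_ID.match(feature_id)
    return match.group(1) if match else feature_id


print(strip_version("ENSG00000139618.10"))  # ENSG00000139618
print(strip_version("ENSG00000139618"))     # ENSG00000139618
```

After stripping versions, re-check that feature IDs are still unique, since two versioned IDs could collapse to the same unversioned ID.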

Data Quality Requirements

  • No NULL, missing, or "NA" values in expression data - only numeric values from 0 to infinity are allowed.

  • If you see a "Field {field} is sparsely coded" error, review your data to ensure that every feature has a value for every sample.

  • Each feature and sample combination should be unique.
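
These requirements can be checked on long-format records before upload. The sketch below is illustrative, not the app's validation logic: `check_values` is a hypothetical name, and rejecting NaN and infinite values is an assumption about what counts as a usable numeric value.

```python
import math


def check_values(records: list[tuple[str, str, str]]) -> list[str]:
    """Check (feature_id, sample_id, value) rows for the issues listed above."""
    problems = []
    seen = set()
    for feature_id, sample_id, raw in records:
        key = (feature_id, sample_id)
        if key in seen:
            problems.append(f"duplicate combination {key}")
        seen.add(key)
        try:
            value = float(raw)
        except ValueError:
            # Covers empty strings, "NA", "NULL", and other non-numeric text.
            problems.append(f"non-numeric value {raw!r} for {key}")
            continue
        # Assumption: NaN and infinity are treated as invalid.
        if not math.isfinite(value) or value < 0:
            problems.append(f"out-of-range value {raw!r} for {key}")

    # Every feature should have a value for every sample.
    features = {feature for feature, _ in seen}
    samples = {sample for _, sample in seen}
    missing = len(features) * len(samples) - len(seen)
    if missing:
        problems.append(f"{missing} feature/sample combinations have no value")
    return problems


records = [
    ("ENSG00000200998", "sample_1", "22"),
    ("ENSG00000200998", "sample_2", "NA"),
    ("ENSG00000260796", "sample_1", "64"),
]
for problem in check_values(records):
    print(problem)
```

With the records above, the check reports the "NA" value and one missing feature/sample combination.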

Input Parameter Issues

Naming Conventions

All names (assay, dataset, database) must follow these rules:

  • Start with an alphabetic character.

  • Maximum 256 characters.

  • Use only alphanumeric characters, underscores _, and hyphens -.

  • No spaces allowed.
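
The rules in this list translate directly into a pattern check. The sketch below encodes only the four rules listed here; `is_valid_name` is a hypothetical helper, and the Database parameter description above additionally mentions a lowercase or underscore first character, which this sketch does not model.

```python
import re

# Starts with a letter, then only letters, digits, underscores, or hyphens.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")
MAX_NAME_LENGTH = 256


def is_valid_name(name: str) -> bool:
    """Check a name against the naming rules listed above."""
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.match(name) is not None


for candidate in ["my_expression_assay", "1st_assay", "my assay"]:
    print(candidate, is_valid_name(candidate))
```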

Common Errors

  • assay_title must contain only less than 256 ASCII characters - Shorten your title.

  • assay_name should start with an alphabetic character - Ensure names begin with a letter.

  • database name/ID should start with an alphabetic character - Check database name format.

Data Format Issues

Format Detection Problems

  • If you see a "could not identify the type of data" error, automatic format detection failed.

  • Ensure your data follows one of the three supported formats (Matrix, Long, or Manifest).

  • For Manifest format: individual files must have headers with first column as feature_id and second as value.

File Structure for Manifest Format

When using manifest format, ensure:

  • Manifest file has columns: file_id, sample_id.

  • Individual sample files have columns: feature_id, value.

  • All individual files contain proper headers.
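
To check which format a file's header most likely corresponds to, a simple header comparison can help. This is a heuristic sketch, not the app's detection logic: `guess_format` is a hypothetical helper, and it assumes the feature ID column of a matrix file is named feature_id, as in the examples above.

```python
def guess_format(header: list[str]) -> str:
    """Guess the data format from a file's header columns."""
    columns = [column.strip() for column in header]
    if columns == ["file_id", "sample_id"]:
        return "manifest"
    if columns == ["feature_id", "sample_id", "value"]:
        return "long"
    if columns == ["feature_id", "value"]:
        # Ambiguous with a one-sample matrix whose sample is named "value".
        return "manifest data file"
    if len(columns) >= 2 and columns[0] == "feature_id":
        return "matrix"
    return "unknown"


print(guess_format("feature_id,sample_1,sample_2,sample_3".split(",")))  # matrix
print(guess_format("feature_id,sample_id,value".split(",")))  # long
print(guess_format("file_id,sample_id".split(",")))  # manifest
```

If the result is "unknown", compare the header against the three supported formats and consider setting the format explicitly instead of using auto.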

Next Steps

See the tutorial notebooks in OpenBio.
