# Molecular Expression Assay Loader

{% hint style="info" %}
An Apollo license is required to use the Molecular Expression Assay Loader on the DNAnexus Platform. Org approval may also be required. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Overview

The [Molecular Expression Assay Loader](https://platform.dnanexus.com/app/molecular-expression-assay-loader) app helps you ingest molecular expression data into an [Apollo Dataset](https://documentation.dnanexus.com/developer/datasets). You can use this dataset on its own or combine it with existing datasets and tools like the Cohort Browser, JupyterLab, and analysis applications.

The app reads your raw expression data, validates it, uploads it to the Apollo database, adds annotations, and creates a Dataset with a "Molecular Expression" assay. The molecular expression model supports bulk mRNA gene (`ENSG`) or transcript (`ENST`) expression values in one of the following measurement types: `rpkm`, `fpkm`, `fpkm-uq`, `tpm`, or `count`.

For large molecular expression datasets (where samples × features exceed 100 million), consult with [DNAnexus Professional Services](https://www.dnanexus.com/professional-services) to ensure optimal performance and user experience.
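
As a rough illustration of the threshold above, the following sketch multiplies the sample and feature counts and compares the product against 100 million. The function name and the example counts are hypothetical and chosen only for illustration.

```python
# Threshold taken from the guidance above: samples x features > 100 million.
LARGE_DATASET_THRESHOLD = 100_000_000


def is_large_dataset(n_samples: int, n_features: int) -> bool:
    """Return True when samples x features exceeds the 100 million threshold."""
    return n_samples * n_features > LARGE_DATASET_THRESHOLD


# Hypothetical dataset sizes, for illustration only.
print(is_large_dataset(500, 60_000))    # 30,000,000 values -> False
print(is_large_dataset(2_000, 60_000))  # 120,000,000 values -> True
```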

![Inputs and outputs of the Molecular Expression Assay Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-025225a8e32e7e3f6b87d4e09b5e1192cda3ae69%2Fmolecular_assay.png?alt=media)

## How to Use the App

### Using the UI

To use the Molecular Expression Assay Loader within the DNAnexus Platform:

1. In the DNAnexus Platform, go to **Tools** > [**Tools Library**](https://platform.dnanexus.com/panx/tools).
2. For the [Molecular Expression Assay Loader](https://platform.dnanexus.com/app/molecular-expression-assay-loader) app, click **Run Latest Version**.
3. In **Output to**, select a project and output location for the app's outputs.
4. Click **Next**.
5. In the **Inputs** tab, specify the required inputs.
6. Click **Start Analysis**.

![The input fields of the Molecular Expression Assay Loader](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-0a49cf2b487cb830e635fc0c0408d398b798e3de%2Frunner_molecular_expression_assay_loader_inputs.png?alt=media)

### Using the CLI

To use the Molecular Expression Assay Loader from the command-line interface, install the [DNAnexus Platform SDK](https://documentation.dnanexus.com/downloads#dnanexus-platform-sdk).

{% hint style="success" %}
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the `dx` command right away.
{% endhint %}

Use the following command format, customizing the input parameters for your specific data.

```shell
dx run app-molecular-expression-assay-loader \
  -i source_expression_data=example_matrix.csv \
  -i reference="GRCh38.p13" \
  -i feature_type="mRNA" \
  -i feature_id_type="transcript_ENST" \
  -i feature_value_type="tpm" \
  -i assay_title="my_expression_assay" \
  -i dataset_name="my_expression_dataset" \
  -i database="my_somatic_database"
```

## Inputs

### Supported Data Types

The molecular expression model uses "features" as the core measurement unit. Each feature is described by three components: feature type (the general category being measured), feature ID type (standardized naming method), and feature value type (method of measurement). The model supports bulk mRNA expression data with Ensembl gene or transcript IDs.

### Required Parameters

* **Source Expression Data** - The source files for the molecular expression data being ingested.
* **Source Expression Data Format** - Format of source files (`auto`, `matrix`, `long`, or `manifest`). In `auto` mode, the app detects the format automatically.
* **Assay Title** - Short, human-readable, descriptive name used to reference the molecular expression assay. Must be fewer than 256 ASCII characters.
* **Feature Type** - Type of molecular feature assayed. Supports `mRNA`.
* **Feature ID Type** - Standardized ID type for referencing features. Options:

  * `gene_ENSG` - Ensembl gene ID, such as ENSG00000139618
  * `transcript_ENST` - Ensembl transcript ID, such as ENST00000380152

  Must match the ID type found in your source expression data.
* **Feature Value Type** - Data type of measured molecular feature values:
  * `rpkm` - Reads per kilobase of transcript per million reads mapped
  * `fpkm` - Fragments per kilobase of transcript per million reads mapped
  * `fpkm-uq` - Fragments per kilobase of transcript per million mapped reads upper quartile
  * `tpm` - Transcripts per million
  * `count` - Gene counts (integer values)
* **Reference** - Reference genome for annotation. Options: `GRCh38.p13` or `GRCh37`.
* **Database** - Name or platform-specific ID of database for the ingested data. Must start with a lowercase alphabetic character or underscore, using only alphanumeric characters, underscores, and hyphens. Maximum 256 characters.

See [in-product app documentation](https://platform.dnanexus.com/panx/tool/app/molecular-expression-assay-loader) for additional configuration options.
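
The enumerated options above can be checked locally before a job is submitted. The sketch below is one possible pre-flight check, not part of the app: the input names are taken from the `dx run` example on this page, the allowed values from the list above, and the `check_inputs` helper is hypothetical.

```python
# Allowed values copied from the parameter list above, keyed by the input
# names used in the `dx run` example on this page.
ALLOWED_VALUES = {
    "reference": {"GRCh38.p13", "GRCh37"},
    "feature_type": {"mRNA"},
    "feature_id_type": {"gene_ENSG", "transcript_ENST"},
    "feature_value_type": {"rpkm", "fpkm", "fpkm-uq", "tpm", "count"},
}


def check_inputs(inputs: dict) -> list:
    """Return a list of problems; an empty list means every checked input is allowed."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        value = inputs.get(name)
        if value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


inputs = {
    "reference": "GRCh38.p13",
    "feature_type": "mRNA",
    "feature_id_type": "transcript_ENST",
    "feature_value_type": "TPM",  # wrong case on purpose; the listed option is `tpm`
}
for problem in check_inputs(inputs):
    print(problem)
```

This check assumes the options are case-sensitive, which is the safer reading of the list above.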

### Supported Data Formats

The Molecular Expression Assay Loader app supports three different data formats to reduce transformation burden before ingestion.

#### Matrix Format

*N x M* matrix of *N* features (rows) by *M* samples (columns), where each feature and sample is unique. A header row must be provided as part of this format, including a column for the feature ID. For example:

{% code title="example\_matrix.csv" %}

```csv
feature_id,sample_1,sample_2,sample_3
ENSG00000200998,22,48,1
ENSG00000260796,64,4,53
ENSG00000225672,1,1,1
```

{% endcode %}
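
Because each feature and each sample must be unique in a matrix, it can help to look for duplicates before ingestion. The following is a minimal sketch using only the Python standard library. It assumes a comma-delimited file with the feature ID in the first column; `find_matrix_duplicates` is a hypothetical helper, not part of the app.

```python
import csv
import io
from collections import Counter


def find_matrix_duplicates(matrix_file):
    """Return (duplicate_feature_ids, duplicate_sample_ids) for a Matrix-format CSV."""
    reader = csv.reader(matrix_file)
    header = next(reader)
    # Every column after the first is a sample ID.
    sample_counts = Counter(header[1:])
    # The first cell of each non-empty row is a feature ID.
    feature_counts = Counter(row[0] for row in reader if row)
    duplicate_features = sorted(f for f, n in feature_counts.items() if n > 1)
    duplicate_samples = sorted(s for s, n in sample_counts.items() if n > 1)
    return duplicate_features, duplicate_samples


# Illustrative data with one repeated feature and one repeated sample.
example = io.StringIO(
    "feature_id,sample_1,sample_2,sample_1\n"
    "ENSG00000200998,22,48,1\n"
    "ENSG00000260796,64,4,53\n"
    "ENSG00000200998,1,1,1\n"
)
print(find_matrix_duplicates(example))
# (['ENSG00000200998'], ['sample_1'])
```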

#### Long Format

An (*N x M*) *x* 3 table with one row for each combination of *N* features and *M* samples, and 3 columns with headers: the first column is `feature_id`, the second is `sample_id`, and the third is `value`. Each row should contain a unique combination of feature ID and sample ID. For example:

{% code title="example\_table.csv" %}

```csv
feature_id,sample_id,value
ENSG00000200998,sample_1,22
ENSG00000260796,sample_1,64
ENSG00000225672,sample_1,1
ENSG00000200998,sample_2,48
ENSG00000260796,sample_2,4
ENSG00000225672,sample_2,1
ENSG00000200998,sample_3,1
ENSG00000260796,sample_3,53
ENSG00000225672,sample_3,1
```

{% endcode %}
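
The Matrix and Long formats carry the same information, so one can be derived from the other. The sketch below reshapes a Matrix-format CSV into the Long format with the Python standard library. The app accepts either format directly, so this is only an illustration of how the two layouts relate; `matrix_to_long` is a hypothetical helper and the code assumes comma-delimited input.

```python
import csv
import io


def matrix_to_long(matrix_file, long_file):
    """Write a Long-format table (feature_id, sample_id, value) from a Matrix-format CSV."""
    reader = csv.reader(matrix_file)
    header = next(reader)
    sample_ids = header[1:]  # first column is the feature ID
    rows = [row for row in reader if row]  # skip blank lines

    writer = csv.writer(long_file, lineterminator="\n")
    writer.writerow(["feature_id", "sample_id", "value"])
    # Emit all features for one sample before moving on to the next sample.
    for column, sample_id in enumerate(sample_ids, start=1):
        for row in rows:
            writer.writerow([row[0], sample_id, row[column]])


matrix = io.StringIO(
    "feature_id,sample_1,sample_2,sample_3\n"
    "ENSG00000200998,22,48,1\n"
    "ENSG00000260796,64,4,53\n"
    "ENSG00000225672,1,1,1\n"
)
long_table = io.StringIO()
matrix_to_long(matrix, long_table)
print(long_table.getvalue())
```

With the `example_matrix.csv` content as input, the output rows match `example_table.csv` above.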

#### Manifest Format

Two sets of files: a manifest file that lists each data file's `file_id` with its associated `sample_id`, and the individual data files themselves. The manifest file should have two columns with the headers `file_id` and `sample_id`. Each individual file should have two columns with the headers `feature_id` and `value`. For example:

{% code title="example\_manifest.csv" %}

```csv
file_id,sample_id
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK,sample_2
project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4h80Gjv7Kv49qjYnbJG6,sample_3
```

{% endcode %}

{% code title="sample\_1.csv (file-id: file-G7Pj4Z00Gjv8KJ89PjGbbJG4)" %}

```csv
feature_id,value
ENSG00000200998,22
ENSG00000260796,64
ENSG00000225672,1
```

{% endcode %}

{% code title="sample\_2.csv (file-id: file-G7Pj4f88GjvGYqGzJgFj9gvK)" %}

```csv
feature_id,value
ENSG00000200998,48
ENSG00000260796,4
ENSG00000225672,1
```

{% endcode %}

{% code title="sample\_3.csv (file-id: file-G7Pj4h80Gjv7Kv49qjYnbJG6)" %}

```csv
feature_id,value
ENSG00000200998,1
ENSG00000260796,53
ENSG00000225672,1
```

{% endcode %}
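
If your data starts out as a single matrix, the per-sample files and the manifest can be generated from it. The following is a sketch under several assumptions: the input is comma-delimited with unique sample IDs, the per-sample files are uploaded separately, and the resulting file IDs are supplied by you afterwards. Both helper functions are hypothetical, and the file IDs are the example values from this page.

```python
import csv
import io


def split_matrix_by_sample(matrix_file):
    """Return {sample_id: CSV text with feature_id,value columns} from a Matrix-format CSV."""
    reader = csv.reader(matrix_file)
    header = next(reader)
    sample_ids = header[1:]
    buffers = {sample_id: io.StringIO() for sample_id in sample_ids}
    writers = {s: csv.writer(buf, lineterminator="\n") for s, buf in buffers.items()}
    for writer in writers.values():
        writer.writerow(["feature_id", "value"])
    for row in reader:
        if not row:
            continue  # skip blank lines
        for column, sample_id in enumerate(sample_ids, start=1):
            writers[sample_id].writerow([row[0], row[column]])
    return {s: buf.getvalue() for s, buf in buffers.items()}


def build_manifest(file_ids_by_sample):
    """Return manifest CSV text with file_id,sample_id columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file_id", "sample_id"])
    for sample_id, file_id in file_ids_by_sample.items():
        writer.writerow([file_id, sample_id])
    return out.getvalue()


matrix = io.StringIO(
    "feature_id,sample_1,sample_2\n"
    "ENSG00000200998,22,48\n"
    "ENSG00000260796,64,4\n"
)
per_sample = split_matrix_by_sample(matrix)
print(per_sample["sample_1"])

# File IDs are only known after upload; these are the example IDs from this page.
manifest = build_manifest({
    "sample_1": "project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4Z00Gjv8KJ89PjGbbJG4",
    "sample_2": "project-G5Azzzj061qF9vJpKhK3Kf7k:file-G7Pj4f88GjvGYqGzJgFj9gvK",
})
print(manifest)
```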

## Outputs

* **Database** - Output database name.
* **Dataset** - Dataset containing the MolecularExpressionAssay object and a phenotypic model with assay sample IDs extracted from the Source Expression Data file.
* **Cluster Logs** - When `collect_logs` is `TRUE`, the logs are written into the job folder.

You can use the generated dataset in the [Cohort Browser](https://documentation.dnanexus.com/user/cohort-browser) by clicking on the dataset name, or selecting the dataset record and clicking **Explore Data**.

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-aee583f182b015f7d27dd9281fb6117472de5eef%2Fmolecular_app_3.png?alt=media)

{% hint style="success" %}
If you'd like to merge multiple datasets for comprehensive analysis, see [Creating Multi-Assay Datasets](https://documentation.dnanexus.com/developer/dataset-management/creating-multi-assay-datasets).
{% endhint %}

## Best Practices & Troubleshooting

### Data Preparation

#### File Format Requirements

* Ensure your file extension matches column delimiters (comma for CSV files, tab for TSV files).
* Accepted file extensions: `csv`, `tsv`, `txt`, `gz`, `bz2`.
* If you see a "provided a `{file_extension}` file" error, convert the file to an accepted file type.

#### Feature ID Guidelines

* Use Ensembl IDs without version numbers. For example, use `ENSG00000139618`, not `ENSG00000139618.10`.
* Data with versioned IDs is still ingested, but downstream annotation may fail.
* Ensure feature ID type matches what's in your source data.
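
Version suffixes can be removed before ingestion. The sketch below uses a regular expression that assumes human Ensembl gene or transcript IDs of the form `ENSG` or `ENST` followed by 11 digits and an optional `.version` suffix. IDs with other shapes are rejected rather than guessed at. `strip_ensembl_version` is a hypothetical helper.

```python
import re

# Assumption: human Ensembl gene/transcript IDs, 11 digits, optional ".<version>".
ENSEMBL_ID = re.compile(r"(ENS[GT]\d{11})(?:\.\d+)?")


def strip_ensembl_version(feature_id: str) -> str:
    """Return the Ensembl ID without its version suffix, or raise ValueError."""
    match = ENSEMBL_ID.fullmatch(feature_id.strip())
    if match is None:
        raise ValueError(f"not a recognized Ensembl gene or transcript ID: {feature_id!r}")
    return match.group(1)


print(strip_ensembl_version("ENSG00000139618.10"))  # ENSG00000139618
print(strip_ensembl_version("ENST00000380152"))     # ENST00000380152
```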

#### Data Quality Requirements

* Expression data must not contain NULL, missing, or "NA" values. Only non-negative numeric values are allowed.
* If you see a "Field `{field}` is sparsely coded" error, review your data to ensure all features have values for all samples.
* Each feature and sample combination should be unique.
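
A local scan can surface problem cells before the app reports them. The following sketch checks a Matrix-format CSV for empty, non-numeric, and negative values. It also treats `nan` and `inf` as invalid, which is an assumption rather than documented app behavior. Both helpers are hypothetical.

```python
import csv
import io
import math


def is_valid_value(raw: str) -> bool:
    """Return True for a finite, non-negative number; False for blanks, 'NA', and the like."""
    try:
        value = float(raw)
    except ValueError:
        return False
    return math.isfinite(value) and value >= 0


def find_bad_cells(matrix_file):
    """Return a list of (feature_id, sample_id, raw_value) for every invalid cell."""
    reader = csv.reader(matrix_file)
    header = next(reader)
    sample_ids = header[1:]
    bad = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature_id, values = row[0], row[1:]
        # Pad short rows so that missing trailing cells are reported as empty.
        values += [""] * (len(sample_ids) - len(values))
        for sample_id, raw in zip(sample_ids, values):
            if not is_valid_value(raw):
                bad.append((feature_id, sample_id, raw))
    return bad


example = io.StringIO(
    "feature_id,sample_1,sample_2,sample_3\n"
    "ENSG00000200998,22,NA,1\n"
    "ENSG00000260796,64,4\n"
    "ENSG00000225672,-3,1,1\n"
)
for cell in find_bad_cells(example):
    print(cell)
```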

### Input Parameter Issues

#### Naming Conventions

All names (assay, dataset, database) must follow these rules:

* Start with an alphabetic character.
* Maximum 256 characters.
* Use only alphanumeric characters, underscores `_`, and hyphens `-`.
* No spaces allowed.
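
The four rules above can be expressed as a single regular expression. This sketch follows only the rules in this list and assumes that "alphabetic" and "alphanumeric" mean ASCII letters and digits. `is_valid_name` is a hypothetical helper, and the app's own validation may differ in detail.

```python
import re

# One leading letter, then up to 255 letters, digits, underscores, or hyphens
# (256 characters in total). Spaces are excluded by the character class.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,255}")


def is_valid_name(name: str) -> bool:
    """Return True when the name satisfies the naming rules listed above."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("my_expression_dataset"))  # True
print(is_valid_name("2024_assay"))             # False: starts with a digit
print(is_valid_name("my assay"))               # False: contains a space
```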

#### Common Errors

* `assay_title must contain only less than 256 ASCII characters` - Shorten your title.
* `assay_name should start with an alphabetic character` - Ensure names begin with a letter.
* `database name/ID should start with an alphabetic character` - Check database name format.

### Data Format Issues

#### Format Detection Problems

* If you see a "could not identify the type of data" error, automatic format detection failed.
* Ensure your data follows one of the three supported formats (Matrix, Long, or Manifest).
* For Manifest format: individual files must have headers with first column as `feature_id` and second as `value`.

#### File Structure for Manifest Format

When using manifest format, ensure:

* Manifest file has columns: `file_id`, `sample_id`.
* Individual sample files have columns: `feature_id`, `value`.
* All individual files contain proper headers.
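
The header requirements above are easy to verify locally. The sketch below reads the first row of a comma-delimited file and compares it with the expected column names. It is a local convenience check under that delimiter assumption, not the app's own detection logic, and `has_expected_header` is a hypothetical helper.

```python
import csv
import io

# Expected header rows, taken from the checklist above.
EXPECTED_HEADERS = {
    "manifest": ["file_id", "sample_id"],
    "sample": ["feature_id", "value"],
}


def has_expected_header(csv_file, kind: str) -> bool:
    """Return True when the first row matches the expected columns for `kind`."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    return [column.strip() for column in header] == EXPECTED_HEADERS[kind]


manifest = io.StringIO("file_id,sample_id\nfile-G7Pj4Z00Gjv8KJ89PjGbbJG4,sample_1\n")
headerless_sample = io.StringIO("ENSG00000200998,22\nENSG00000260796,64\n")

print(has_expected_header(manifest, "manifest"))        # True
print(has_expected_header(headerless_sample, "sample"))  # False: first row is data
```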

## Next Steps

* For details on analyzing gene expression assays in the Cohort Browser, see [Analyzing Gene Expression Data](https://documentation.dnanexus.com/user/cohort-browser/analyzing-gene-expression).
* [Create multi-assay datasets](https://documentation.dnanexus.com/developer/dataset-management/creating-multi-assay-datasets) by combining with other dataset types (clinical, germline variant, somatic variant).
* Use [`dx extract_assay expression`](https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#extract_assay-expression) to access molecular expression data.
* Parse assay metadata and access molecular expression data in Spark-enabled JupyterLab instances.

See tutorial notebooks in [OpenBio](https://github.com/dnanexus/OpenBio/) for:

* [Molecular expression assay dataset basics](https://github.com/dnanexus/OpenBio/blob/master/transcriptomics/dataset_tutorial_notebooks/molecular-expression-assay-dataset-basic.ipynb)
* [General usage of `dx extract_assay`](https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit)
