Molecular Expression Assay Loader
Note: This is intended only for customers with a DNAnexus Apollo license and Org approval (if applicable). Contact [email protected] for more information.
dx run app-molecular-expression-assay-loader (use -h for help)
The Molecular Expression Assay Loader application provides a simplified mechanism for ingesting molecular expression assay data into an Apollo Dataset for stand-alone use and/or integrated use with existing Datasets and downstream tools such as the Cohort Browser, JupyterLab, and analysis applications. The application will read raw expression data, validate input data, ingest data into the Apollo database, annotate the data, and finally return a Dataset containing a “Molecular Expression” assay instance representing the data. See Example Usage for common use cases.
The current molecular expression model supports bulk mRNA gene (“ENSG”) or transcript (“ENST”) expression values of the following value types: "rpkm", "fpkm", "fpkm-uq", "tpm", or "count". For ingestion of large sets of molecular expression data, where the sample × feature dimensions exceed 100 million, first consult with xVantage services to ensure an optimal experience. See Example Input for input type and format examples.
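As a rough illustration of the size guidance above, the sketch below multiplies the sample and feature counts and compares the product with the 100 million threshold. The function and constant names are hypothetical and are not part of the app.

```python
# Threshold taken from the guidance above: 100 million sample x feature cells.
LARGE_INGEST_THRESHOLD = 100_000_000


def exceeds_large_ingest_threshold(n_samples: int, n_features: int) -> bool:
    """Return True when the sample x feature dimensions exceed 100 million."""
    if n_samples < 0 or n_features < 0:
        raise ValueError("sample and feature counts must be non-negative")
    return n_samples * n_features > LARGE_INGEST_THRESHOLD


# 1,000 samples x 60,000 features = 60 million cells (below the threshold).
print(exceeds_large_ingest_threshold(1_000, 60_000))
# 2,000 samples x 60,000 features = 120 million cells (above the threshold).
print(exceeds_large_ingest_threshold(2_000, 60_000))
```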

Overview

Structure of the Molecular Expression Assay Loader.

Inputs

The Molecular Expression Assay Loader app requires the following general inputs:
  • Source Expression Data - The source file(s) for the molecular expression data being ingested. Different formats for the source file are acceptable (described in specifications).
  • Source Expression Data Format - The format of the source file(s) for the molecular expression data, one of "auto", "matrix", "long", or "manifest". In the default mode ("auto"), the app detects the file format automatically. Any other value is checked against the auto-detected format, and the app returns an error if the two do not match. Review the examples page for more details.
  • Assay Title - A human readable, user-provided, short descriptive name (often unique) to reference the specific molecular expression assay.
  • Feature Type - Type of molecular feature assayed; currently "mRNA".
  • Feature ID Type - Standardized ID type to reference, one of "gene_ENSG" (Ensembl gene ID) or "transcript_ENST" (Ensembl transcript ID). The feature ID type should match the type found in the source expression data.
  • Feature Value Type - Data type of measured molecular feature value in the source expression data, one of the following:
    • "rpkm" (reads per kilobase of transcript per million reads mapped)
    • "fpkm" (fragments per kilobase of transcript per million reads mapped)
    • "fpkm-uq" (fragments per kilobase of transcript per million mapped reads, upper quartile normalized)
    • "tpm" (transcripts per million)
    • "count" (gene counts)
  • Reference - Reference used for annotation, one of "GRCh38.p13" or "GRCh37".
  • Database - User-provided name or platform-specific ID of the database to be used for the data being ingested. If the name or ID already exists in the context project, the behavior depends on the ETL Create Mode and ETL Insert Mode parameters. The name must start with a lowercase alphabetic character or an underscore. Allowed characters are alphanumerics, underscores, and hyphens.
See app documentation for further granular configurations.
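The allowed values listed above can be checked before launching a job. The following is a minimal sketch only: the dictionary keys are hypothetical labels chosen for illustration (they are not the app's actual input field names), and the database name pattern assumes that "alphanumeric" covers both upper- and lowercase letters. The name check applies to database names, not to platform-specific database IDs.

```python
import re
from typing import Dict, List

# Allowed values as listed in the input descriptions above.
# The keys are hypothetical labels, not the app's real input names.
ALLOWED_VALUES = {
    "format": {"auto", "matrix", "long", "manifest"},
    "feature_type": {"mRNA"},
    "feature_id_type": {"gene_ENSG", "transcript_ENST"},
    "feature_value_type": {"rpkm", "fpkm", "fpkm-uq", "tpm", "count"},
    "reference": {"GRCh38.p13", "GRCh37"},
}

# Starts with a lowercase letter or underscore, followed by alphanumerics,
# underscores, or hyphens (assumption: uppercase letters are accepted).
DATABASE_NAME_RE = re.compile(r"[a-z_][A-Za-z0-9_-]*")


def check_inputs(params: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        value = params.get(key)
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    if not DATABASE_NAME_RE.fullmatch(params.get("database", "")):
        problems.append("database name has an invalid first character or characters")
    return problems


example = {
    "format": "auto",
    "feature_type": "mRNA",
    "feature_id_type": "gene_ENSG",
    "feature_value_type": "tpm",
    "reference": "GRCh38.p13",
    "database": "my_expression_db",
}
print(check_inputs(example))
```

Collecting all problems in a list, rather than raising on the first one, lets a user correct every setting in a single pass.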

Process

  1. User specifies descriptive information about the assay.
  2. Source data is read, the content and structure are validated, and then content is loaded into a Spark DataFrame.
  3. An annotation resource is selected, given the input feature specification, and loaded as a Spark DataFrame.
  4. The Spark DataFrames are stored as parquet files and represented/accessed on the DNAnexus Platform as an Apollo Database.
  5. A Dataset mapping the physical data to a logical model is created, and contains a “Molecular Expression” assay object, having two entities representing the expression values and annotation, and a clinical/phenotypic “sample” entity, representing the sample IDs specified.

Outputs

  • Database - Output database name.
  • Dataset - Dataset containing the MolecularExpressionAssay object and a pheno model entity, which contains the assay sample IDs extracted from the Source Expression Data file.
  • Cluster Logs - When collect_logs is TRUE, the logs are written into the job folder.

Best Practices

  1. Ensure that your file suffix matches the column delimiter. For example, the delimiter should be a comma if your suffix is “.csv” and a tab if your suffix is “.tsv”.
  2. If supplying a manifest file with associated individual sample data files, ensure that each individual file contains a header, where the first column is for the feature ID and the second column is for the expression value.
  3. Feature IDs that are Ensembl IDs should fit the expected format without a “.version” suffix, for example ENSG00000139618 and not ENSG00000139618.10. Data with version formatting will still be ingested; however, downstream annotation will fail unless the version suffix is explicitly accounted for when joining to the annotation table.
  4. There should be no NULL, missing, or “NA” values present in the expression values. Only numeric values in [0, inf) are allowed.
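The checks above can be run locally before submitting data. The sketch below is illustrative only and does not reproduce the app's own validation; all function names are hypothetical. It maps a file suffix to its expected delimiter, strips a trailing version from an Ensembl gene or transcript ID, and tests whether an expression value is a finite number in [0, inf).

```python
import math
import re
from pathlib import Path
from typing import Optional

# Suffix-to-delimiter pairs named in the best practices above.
SUFFIX_DELIMITERS = {".csv": ",", ".tsv": "\t"}

# Matches an Ensembl gene or transcript ID followed by ".<version>".
VERSIONED_ID_RE = re.compile(r"^(ENS[GT]\d+)\.\d+$")


def expected_delimiter(filename: str) -> Optional[str]:
    """Return the delimiter implied by the file suffix, or None if unknown."""
    return SUFFIX_DELIMITERS.get(Path(filename).suffix.lower())


def strip_ensembl_version(feature_id: str) -> str:
    """Drop a trailing version, e.g. ENSG00000139618.10 -> ENSG00000139618."""
    match = VERSIONED_ID_RE.match(feature_id)
    return match.group(1) if match else feature_id


def is_valid_expression_value(raw) -> bool:
    """True only for finite, non-negative numbers; rejects NA, empty, NaN, inf."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return False
    return math.isfinite(value) and value >= 0


print(repr(expected_delimiter("expression.tsv")))
print(strip_ensembl_version("ENSG00000139618.10"))
print(is_valid_expression_value("NA"), is_valid_expression_value("12.5"))
```

Stripping versions before ingestion avoids having to handle them later when joining to the annotation table.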