Transcriptomic Expression Quantification Workflow Walkthrough

Transcriptomic Expression Quantification (TEQ) is an end-to-end RNA-seq workflow which accepts unmapped transcript reads (FASTQ format) as input and generates a gene or transcript quantification matrix as output. A user may choose from five pipeline configurations when launching the workflow, each providing different combinations of tools for read mapping and quantification. These five pipelines give the user the freedom to perform genome- or transcriptome-based alignment and gene- or transcript-level quantification. The five configurations consist of the following tool combinations:

  • STAR

  • STAR + Salmon

  • STAR + RSEM

  • Kallisto

  • Salmon

The following walkthrough demonstrates an example run of the workflow using the “STAR + RSEM” pipeline, launched either through the DNAnexus platform site (Graphical User Interface, GUI) or through the SDK (Command Line Interface, CLI).

Launching a Workflow from Graphical User Interface (GUI)

Accessing the Workflow

From the GUI, the configurations described in this tutorial can be found in the Tools Library (https://platform.dnanexus.com/panx/tools) by searching for “TEQ”. To search by name, click the button labeled “Any name” in the upper left of the screen.

Step 1: Preparing alignment and quantification references

All five configurations have a mandatory analysis input field called "alignment reference". The "quantification reference" field is additionally required for the “STAR + Salmon” and “STAR + RSEM” configurations. Note that alignment tools such as STAR or Kallisto refer to these references as a "reference index" in their own nomenclature.
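The rule above can be captured in a small shell helper when scripting runs. This is only a sketch: the function name `needs_quant_reference` is hypothetical, and the pipeline labels are taken from the list of configurations above (only 'STAR + RSEM' appears as a CLI value in this guide, so the other spellings are assumptions).

```shell
# Returns 0 if the given pipeline also needs a "quantification reference",
# 1 if only an "alignment reference" is needed, 2 for an unknown label.
# Labels other than 'STAR + RSEM' are assumed spellings, not confirmed CLI values.
needs_quant_reference() {
  case "$1" in
    "STAR + Salmon" | "STAR + RSEM") return 0 ;;
    "STAR" | "Kallisto" | "Salmon") return 1 ;;
    *) echo "unknown pipeline: $1" >&2; return 2 ;;
  esac
}

pipeline="STAR + RSEM"
if needs_quant_reference "$pipeline"; then
  echo "$pipeline: prepare alignment and quantification references"
else
  echo "$pipeline: prepare alignment reference only"
fi
```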

The following DNAnexus Platform Apps may be used to generate the corresponding references (see table below). If the pipeline of choice requires an "alignment reference" as well as a "quantification reference", make sure to use matching releases of the genome, transcriptome, and genome annotation (GTF) files when generating the references. For example, if the genome reference is from “GENCODE release 40”, then the transcriptome file should also be sourced from “GENCODE release 40”. Once the "alignment reference" and "quantification reference" inputs have been prepared, these files may be reused for future workflow runs.

Reference input requirements for each pipeline and the corresponding DNAnexus app to be used for generating the reference are listed below.

To run the STAR + RSEM pipeline, the star_generate_genome_index app is used to create the input for "alignment reference". The same procedure is followed to generate the "alignment reference" for the STAR and STAR + Salmon pipelines. The "quantification reference" for the STAR + RSEM pipeline is prepared using the rsem_prepare_genome app.

To generate a compatible "alignment reference" for TEQ, prepare the genome index using a genome file and a genome annotation file in GTF format. This guide uses files from Ensembl (https://www.ensembl.org/Homo_sapiens/Info/Index). Other sources are fine, as long as the genome and annotation files are compatible. Use the URL Fetcher App (https://platform.dnanexus.com/app/url_fetcher) to download the genome and the corresponding GTF from Ensembl. You can use the GUI to run this app (see screenshot below), or simply upload the files to the project on the platform from a local machine. The files used here are the Ensembl release 106 genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz) and GTF (Homo_sapiens.GRCh38.106.chr.gtf.gz).

The STAR Generate Genome Index app is now ready to run. Simply map the previously downloaded files to the "reference genome" and "transcript annotations" inputs of the app (see below for the UI screenshot).

Next, prepare the "quantification reference" for the pipeline’s quantification engine, which is RSEM in this case. The procedure and inputs are the same as for STAR index generation. The only difference is that the RSEM Prepare Genome app (https://platform.dnanexus.com/app/rsem_prepare_genome) is used, with the same genome and GTF files downloaded earlier from Ensembl.

Step 2: Running the TEQ pipeline

Now the necessary inputs for the STAR + RSEM pipeline have been generated and the analysis can begin. From the GUI, you can find the TEQ workflow in the Tools Library. The TEQ workflow is region-specific, so select the workflow matching the account region.

Next, provide the necessary inputs for running the STAR + RSEM pipeline, including the "alignment reference", "quantification reference", and "transcript annotation" files previously downloaded and prepared. Using a transcript annotation (GTF file) other than the one used to prepare the "alignment reference" and "quantification reference" is not recommended and may result in errors.

In this example, a paired-end analysis is being conducted, so both “reads” and “reads 2” are necessary inputs. All FASTQ files supplied to “reads” and “reads 2” in a single run must come from the same sample, including when several FASTQ files are provided per input (multi-FASTQ analysis). For multi-sample analysis, please use the Batch Job functionality.
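When scripting a paired-end run, a quick name-based check can catch a mismatched or missing mate file before launch. The sketch below assumes a hypothetical naming convention in which mates are called `<prefix>_1.fastq.gz` and `<prefix>_2.fastq.gz`; the function names `mate_of` and `check_pairs` are illustrative and not part of dx-toolkit. It compares file names only and cannot confirm that the files really come from one sample.

```shell
# Print the expected read-2 file name for a read-1 file name.
# Assumes the hypothetical "<prefix>_1.fastq.gz" / "<prefix>_2.fastq.gz" convention.
mate_of() {
  case "$1" in
    *_1.fastq.gz) printf '%s_2.fastq.gz\n' "${1%_1.fastq.gz}" ;;
    *) return 1 ;;
  esac
}

# Takes alternating arguments: read1 read2 read1 read2 ...
# Succeeds only if every read-1 file is followed by its expected mate.
check_pairs() {
  while [ "$#" -ge 2 ]; do
    expected=$(mate_of "$1") || { echo "not a read-1 file: $1" >&2; return 1; }
    [ "$expected" = "$2" ] || { echo "expected $expected, got $2" >&2; return 1; }
    shift 2
  done
  [ "$#" -eq 0 ] || { echo "unpaired file: $1" >&2; return 1; }
}

check_pairs sampleA_L001_1.fastq.gz sampleA_L001_2.fastq.gz \
            sampleA_L002_1.fastq.gz sampleA_L002_2.fastq.gz \
  && echo "pairs look consistent"
```

The same check also flags a mate that is missing its `.gz` suffix, since the expected name is derived from the read-1 file.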

Select "STAR + RSEM" as input in the "Common" section. Other settings may be left “as is” (default). To generate a more comprehensive QC report, enable the “perform mapping QC” option at the bottom of the input section.

Using the CLI

Below are the commands to run this analysis from the CLI using dx-toolkit.

Fetch reference data using the app URL Fetcher:

dx run url_fetcher -iurl=ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

dx run url_fetcher -iurl=ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.chr.gtf.gz
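Because the genome and GTF must come from the same release, it can be worth comparing the release numbers embedded in the two URLs before fetching. The following is a minimal sketch that relies on the `/release-<N>/` path segment seen in the Ensembl URLs above; the helper names `ensembl_release` and `same_release` are hypothetical, and other sources (for example GENCODE) use different URL layouts.

```shell
# Extract the release number from a URL containing "/release-<N>/".
# Prints nothing if the URL has no such segment.
ensembl_release() {
  printf '%s\n' "$1" | sed -n 's#.*/release-\([0-9][0-9]*\)/.*#\1#p'
}

# Succeeds only if both URLs carry a release number and the numbers are equal.
same_release() {
  a=$(ensembl_release "$1")
  b=$(ensembl_release "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

genome_url="ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
gtf_url="ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.chr.gtf.gz"

if same_release "$genome_url" "$gtf_url"; then
  echo "releases match: $(ensembl_release "$genome_url")"
else
  echo "release mismatch between genome and GTF" >&2
fi
```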

Next, generate an index file using the app STAR Generate Genome Index:

dx run star_generate_genome_index \
 -iref_input_fasta=Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
 -itranscr_annotation_gtf=Homo_sapiens.GRCh38.106.chr.gtf.gz

Prepare the genome for RSEM using the app RSEM Prepare Genome:

dx run rsem_prepare_genome \
 -ireference_fastagz=Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
 -igene_annotation_gtf=Homo_sapiens.GRCh38.106.chr.gtf.gz \
 -ireference_prefix="Homo_sapiens.GRCh38.dna.primary_assembly"

Finally, run the global workflow Transcriptomic Expression Quantification:

dx run globalworkflow-transcriptomic_expression_quantification \
  -ipipeline_to_run='STAR + RSEM' \
  -ialign_reference=Homo_sapiens.GRCh38.dna.primary_assembly.star-index.tar.gz \
  -iquant_reference=Homo_sapiens.GRCh38.dna.primary_assembly.RSEM.tar.gz \
  -iannotation_gtf=Homo_sapiens.GRCh38.106.chr.gtf.gz \
  -ireads_fastqgz=reads_1.fastq.gz \
  -ireads2_fastqgz=reads_2.fastq.gz \
  -iperform_mappings_qc=true

TEQ is available in the following regions. Note that the workflow name differs across regions, which matters especially when calling the workflow from the CLI.

  • AWS US East: transcriptomic_expression_quantification

  • AWS Asia Pacific - Sydney: transcriptomic_expression_quantification_ap

  • AWS Europe - Frankfurt: transcriptomic_expression_quantification_eu

  • AWS Europe - London: transcriptomic_expression_quantification_eu_west_2_g

  • Azure Europe: transcriptomic_expression_quantification_azure_eu

  • Azure US: transcriptomic_expression_quantification_azure_us
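For scripted launches, the region-specific names listed above can be wrapped in a small lookup. This is a sketch only: the function name `teq_workflow_for_region` is hypothetical, and combining each regional name with the `globalworkflow-` prefix used in the earlier `dx run` command is an assumption rather than something this guide confirms for every region.

```shell
# Map a region label (as written in the list above) to its TEQ workflow name.
teq_workflow_for_region() {
  case "$1" in
    "AWS US East") echo "transcriptomic_expression_quantification" ;;
    "AWS Asia Pacific - Sydney") echo "transcriptomic_expression_quantification_ap" ;;
    "AWS Europe - Frankfurt") echo "transcriptomic_expression_quantification_eu" ;;
    "AWS Europe - London") echo "transcriptomic_expression_quantification_eu_west_2_g" ;;
    "Azure Europe") echo "transcriptomic_expression_quantification_azure_eu" ;;
    "Azure US") echo "transcriptomic_expression_quantification_azure_us" ;;
    *) echo "no TEQ workflow listed for region: $1" >&2; return 1 ;;
  esac
}

# Assumption: the "globalworkflow-" prefix applies to every regional name.
name=$(teq_workflow_for_region "AWS Europe - London") \
  && echo "workflow to run: globalworkflow-$name"
```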
