RNA-seq Variant and Fusion Transcript Calling Walkthrough
Learn how to use the GATK4 Haplotype Caller and Filter for RNAseq and the STAR Fusion Caller.
DNAnexus enables access to the RNA-seq variant and fusion transcript calling tools GATK4 HaplotypeCaller and Filter for RNAseq and STAR Fusion Caller.
The following tutorial highlights example runs for both tools, using either the DNAnexus Platform user interface (UI) and the command line interface (CLI).
There are different reference projects “Reference Genome Files” in each DNAnexus supported region and each has a unique project ID. The tutorial uses the project ID from AWS US (East), but make sure to use the project specific to your region.
Region | Reference Genome Files Project ID |
---|---|
AWS US (East) | project-BQpp3Y804Y0xbyG4GJPQ01xv |
Azure US (West) | project-F0yyz6j9Jz8YpxQV8B8Kk7Zy |
Azure EU (West) | project-FGX8gVQB9X7K5f1pKfPvz9yG |
AWS Europe (London) | project-G4FX6vXKb5P1FBFFJjXv4ypq |
AWS Europe (Frankfurt) | project-F3zxk7Q4F30Xp8fG69K1Vppj |
AWS Asia (Sydney) | project-F4gXb605fKQyBq5vJBG31KGG |
Overview
GATK4 HaplotypeCaller and Filter for RNAseq calls germline SNPs and INDELs from aligned, RNA-seq derived reads. To perform calls from transcript data, three GATK4 tools are called internally to accomplish related goals:
SplitNCigarReads splits reads which contain Ns in the CIGAR string, accounting for splicing events.
HaplotypeCaller calls variants via local, de-novo assembly of haplotypes.
VariantFiltration filters results by marking variant calls based on INFO and/or FORMAT annotations.
To see the guide on calling RNAseq germline SNPs and INDELs, click here.
STAR Fusion Caller identifies fusion transcripts from RNAseq data leveraging the chimeric and discordant read alignments reported by STAR mapping. STAR Fusion Caller leverages up to three tools within its workflow:
STAR Mapping app. In case it was not run previously, STAR Fusion Caller will run STAR aligner to generate the required file containing chimeric read alignments for fusion identification.
STAR-Fusion software package from the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) is used to identify candidate fusion transcripts supported by Illumina reads. For further information on STAR-Fusion, please see the tool documentation.
FusionInspector, another software package from the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), assists in fusion transcript discovery by performing a supervised analysis of fusion predictions, attempting to recover and re-score evidence for such predictions. More information can be found in the tool documentation.
To see the guide on calling RNAseq fusion transcripts, click here.
Calling RNAseq Germline SNPs and INDELS
For this tutorial, experimental data stored in the ENCODE repository was selected. Paired end FASTQ files ENCFF355VKA.fastq.gz and ENCFF900YDP.fastq.gz are used. RNA-seq was captured from poly(a) mRNA derived from the human spinal cord tissue of a female embryo.
Note: GATK HaplotypeCaller and Filter for RNAseq are considered suitable for calling germline variants only.
Before starting the tutorial, make sure the paired-end files are uploaded to your project.
Overview
List of steps:
Step 1: Alignment with STAR Mapping
Step 2: Mark Duplicates with PICARD MarkDuplicates
Step 3: Defining Read Groups with optional local samtools and Swiss Army Knife (SAK)
Step 4: Calling Filtered Variants with GATK4 HaplotypeCaller and Filter for RNAseq
The following section of the tutorial will describe how to prepare your raw FASTQ data into a cleaned BAM format. Briefly, GATK best practices on RNAseq short variant discovery recommends using STAR Mapping software as the alignment tool and sequentially going through a data cleaning step involving Picard MarkDuplicates. The following data preparation steps (1-3) follow these best practices.
However, as the GATK4 HaplotypeCaller and Filter for RNAseq tool accepts any BAM alignment of acceptable quality, if you already happen to have data fitting this description it is possible to skip the below steps, 1-3, and simply use the tool starting with step 4.
BAM outputs from TEQ - Transcriptomic Expression Quantification can be used as an input for GATK4 HaplotypeCaller and Filter for RNAseq if the same genome reference is used in both runs.
Step 1: Alignment with STAR Mapping
For alignment and variant calling, various genome reference files may be used, however GRCh38.no_alt_analysis_set.fa.gz
is recommended. The file may be found on the public DNAnexus project, “Reference Genome Files,” following the path /H. Sapiens - GRCh38 without alt contigs/
. The associated index file which should be selected as the “Reference genome indexed for STAR” is, GRCh38.no_alt_analysis_set.v2.star-index.tar.gz
. If a different reference file is desired, STAR Generate Genome Index may be used to generate the complimentary index file.
Results from the run will then appear in the selected project and include the log ENCFF355VKA.fastq.log.gz
, which contains statistics on the mapping and the sorted alignment BAM file, ENCFF355VKA.fastq.genome.bam
which is used in the next steps of this tutorial.
Using the UI
From the Tools Library, find the STAR Mapping app and click the run icon. After selecting the output location of the app, the user will be taken to the Run App page. From here, select the file GRCh38.no_alt_analysis_set.v2.star-index.tar.gz
from the public project “Reference Genome Files” following the path /H. Sapiens - GRCh38 without alt contigs/
for the "Reference genome indexed for STAR" input that the app requires.
Select ENCFF355VKA.fastq.gz
as the forward reads and ENCFF900YDP.fastq.gz
as the reverse reads (or specify your reads of interest).
After clicking Select, you will be taken back to the Inputs tab. Click the Start Analysis button to begin your analysis.
Using the CLI
STAR Mapping may be launched to map reads to a reference genome file using the following command:
Downloading Logs
Upon job completion, the log file may be downloaded (dx download <project-id:file-id>
) and should look similar to the image below:
Step 2: Mark Duplicates with PICARD MarkDuplicates
Picard Mark duplicates identifies duplicated reads among a set of mappings. Results will include ENCFF355VKA.fastq.genome.deduplicated.bam
. You will need the sorted deduplicated bam files or the next steps.
Using the UI
PICARD MarkDuplicates may be launched directly through DNAnexus Platform UI. The only mandatory input of the tool is a sorted mapping file. First select the BAM file from step 1, ENCFF355VKA.fastq.genome.bam
:
After clicking Select, follow the rest of the directions on screen to start your analysis.
Using the CLI
Mark duplicated reads by confirming the dx-toolkit
command:
Step 3: Defining Read Groups
The BAM files used for input with GATK4 HaplotypeCaller and Filter for RNAseq are expected to contain read groups as tags. These are used to identify sampling in multi-sample outputs when a sequencer involves multiple lanes. For most cases, when the file is derived from a single biological sample, the read group tag might not be present and thus data will need to be manipulated in order to be eligible as input of GATK4 HaplotypeCaller and Filter for RNAseq. For more information about read groups, please refer to GATK documentation regarding read groups.
Optionally, to check if the read group information is present in the BAM file, you will need to download the file and scan through the command line using samtools.
This part of the tutorial is optional, as GATK4 HaplotypeCaller and Filter for RNAseq informs you if read groups are missing. You will need to add the tags manually for missing read groups.
To list the read group information in the BAM file, use the following command from samtools:
samtools view -H ENCFF355VKA.fastq.genome.deduplicated.bam | grep '^@RG'
This command will print lines starting with "@RG" within the header, if present. In the above case, the return is blank, and thus manual insertion of read groups is necessary. More information may be found in GATK documentation on errors about read group (RG) information. The following step of this tutorial is also based on the mentioned documentation.
A manual insertion of read groups may be performed using the Picard AddOrReplaceReadGroups tool, which is available as part of SAK.
Using the UI
SAK may be launched directly through the UI. Select ENCFF355VKA.fastq.genome.deduplicated.bam
as an input file.
Copy and paste the following command on the "Command line" input, under the "Common" section:
The analysis can now be launched.
Using the CLI
Manually define read groups by running the following command:
Step 4: Call and Filter Variants Using GATK4 HaplotypeCaller and Filter for RNAseq
The result of this run will consist of a VCF containing the called variants with the filtering status, which in this case will be ENCFF355VKA.filtered.vcf.gz
and the respective index file ENCFF355VKA.filtered.vcf.gz.tbi
.
Using the UI
Launch GATK4 HaplotypeCaller and Filter for RNAseq using the UI and select the file ENCFF355VKA.fastq.genome.deduplicated.withRG.bam
from the previous step as the expected “RNAseq sorted mappings”.
Use the suggested file GRCh38.no_alt_analysis_set.fa.gz
located in the public project “Reference Genome Files” following the path /H. Sapiens - GRCh38 without alt contigs/
as your “Reference genome FASTA file”. Other references may be used, but they must match the STAR-indexed genome used in step 1.
Select the suggested file, Homo_sapiens_assembly38.dbsnp138.vcf.gz
as your “dbSNP annotation”. It is also located on the public project, “Reference Genome Files,” following the path /gatk.resources.GRCh38/Common Variants/.
Specify the input filter, “QUAL score by allele depth threshold (QD),” as 2 and input, “Fisher strand threshold (FS),” as 30.
Now the analysis can be launched.
Using the CLI
Run GATK4 HaplotypeCaller and Filter for RNAseq by submitting the following command:
Runs With Different Input Sizes
For comparison purposes, information regarding wall-clock time and core hours for runs with different input sizes can be found below:
File size category | Original FASTQs (each file) |
| BAM Alignment |
| Wall-clock time | Core hours |
| Gunzipped size | reads | size | alignments |
|
|
Small - Paired end (tutorial data) | 760 MB | 12.9 M | 1.6 GB | 34.4 M | 32min | 17 |
Medium - Single End | 1.8 GB | 26.7 M | 2.1 GB | 33.1 M | 43min | 23 |
Large - Paired End | 8.2 GB | 106.4 M | 14.5 GB | 237.6M | 3h40min | 117 |
The data is based on runs with the default instance type: mem3_ssd1_v2_x32
.
Calling RNAseq Fusion Transcripts
Data used in this tutorial may be downloaded from the STAR-Fusion tutorial repository, or by using the following git command line:
git clone https://github.com/STAR-Fusion/STAR-Fusion-Tutorial.git
The repository contains multiple files, however for this tutorial, only the following FASTQ files are necessary; rnaseq_1.fastq.gz
and rnaseq_2.fastq.gz
. Before starting the tutorial, make sure the paired-end files are uploaded to the selected project.
Here STAR Fusion Caller to identify fusion transcripts straight from sequenced reads. Alternatively, a gzipped junction file containing chimeric alignments (*Chimeric.out.junction) from an earlier STAR mapping alignment run may be provided to skip alignment step and kickstart fusion detection. If you prefer to run STAR mapping separately, please consult the manual for recommended STAR parameters relevant to fusion detection.
A diagram representing the workflow performed by the tool is shown below.
Step 1: Running STAR Fusion Caller
Using the UI
STAR Fusion Caller may be launched directly through the UI. When launching this and other tools, users are asked to specify a project for running and selecting input files and options. STAR Fusion Caller requires sequenced reads and CTAT genome lib as mandatory inputs.
Specify rnaseq_1.fastq.gz
and rnaseq_2.fastq.gz
as your forward and reverse reads, respectively, or use your desired read file(s).
This tutorial uses the suggested file, GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play.tar.gz
as input for, “CTAT genome lib”. The library is located in the public project “Reference Genome Files”, folder ctat/
. Alternatively, other libraries can be prepared leveraging CTAT genome lib builder.
Using the CLI
Run STAR Fusion Caller by submitting the following command:
Runs with Different Input Sizes
For comparison purposes, information regarding runs with different input sizes can be found below:
Input size | Num of chimeric alignments | Num of detected fusions | Mode | Wall-clock time (min) | Core hours |
100K | 4547 | 115 | With initial alignment | 8.57 | 2.29 |
|
|
| Kickstart* | 8.26 | 2.2 |
10M | 458425 | 362 | With initial alignment | 18.44 | 4.92 |
|
|
| Kickstart* | 15.39 | 4.1 |
30M | 1373885 | 418 | With initial alignment | 50.31 | 13.42 |
|
|
| Kickstart* | 40.14 | 10.7 |
*Kickstart mode = skipping initial alignment step (chimeric.out.junction
file is provided as input)
Note that information is using runs with the default instance type: mem3_ssd1_v2_x16
.
Step 2: Interpreting STAR Fusion Caller Results
The expected output files are detailed below. Most of the files are intermediates from data processing within the tool.
Alignment outputs from STAR mapping:
Reads mapped to the genome (genome_bam):
rnaseq.genome.bam
Genome BAM index (genome_bam_bai):
rnaseq.genome.bam.bai
Log file containing mapping statistics (map_stat_log):
rnaseq.log.gz
Chimeric junctions (chimeric_junctions_gz):
rnaseq.Chimeric.out.junction.gz
STAR-Fusion outputs:
Fusion predictions (fusion_predictions_tsv):
rnaseq.fusion_predictions.tsv
Abridged fusion predictions (fusion_predictions_abridged_tsv):
rnaseq.fusion_predictions.abridged.tsv
FusionInspector outputs:
FusionInspector HTML summary file (fusioninspector_summary_html):
rnaseq.FusionInspector_web.html
Abridged re-scored fusion predictions (rescored_fusion_predictions_abridged_tsv):
rnaseq.FusionInspector.fusions.abridged.tsv.annotated
Re-scored fusion predictions (rescored_fusion_predictions_tsv):
rnaseq.FusionInspector.fusions.tsv
FusionInspector extra outputs (fusioninspector_extra_outputs_targz):
rnaseq.FusionInspector_extra_outputs.tar.gz
The main results may be examined by browsing rnaseq.FusionInspector_web.html
, where fusions may be easily seen and navigated. This information includes gene fusion events with break point locations:
Alternatively, additional outputs in standardized formats (bed, gtf, and fasta) may be found within the compressed directory, rnaseq.FusionInspector_extra_outputs.tar.gz
, and may be visualized in a genome browser, such as the Integrative Genome Viewer (IGV).
Further information such as Fusion Allelic Ratio (FAR) and more can be found among the abridged fusion predictions TSV, rnaseq.FusionInspector.fusions.abridged.tsv.annotated
. FAR is used to estimate the relative expression of the fusion transcript as compared to the alternative non-fused allele of the gene. For more information, please see additional documentation here.
Last updated