Comment on page
RNA-seq Variant and Fusion Transcript Calling Walkthrough
Learn how to use the GATK4 Haplotype Caller and Filter for RNAseq and the STAR Fusion Caller.
There are different reference projects “Reference Genome Files” in each DNAnexus supported region and each has a unique project ID. The tutorial uses the project ID from AWS US (East), but make sure to use the project specific to your region.
GATK4 HaplotypeCaller and Filter for RNAseq calls germline SNPs and INDELs from aligned, RNA-seq derived reads. To perform calls from transcript data, three GATK4 tools are called internally to accomplish related goals:
STAR Fusion Caller identifies fusion transcripts from RNAseq data leveraging the chimeric and discordant read alignments reported by STAR mapping. STAR Fusion Caller leverages up to three tools within its workflow:
- STAR-Fusion software package from the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) is used to identify candidate fusion transcripts supported by Illumina reads. For further information on STAR-Fusion, please see the tool documentation.
- FusionInspector, another software package from the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), assists in fusion transcript discovery by performing a supervised analysis of fusion predictions, attempting to recover and re-score evidence for such predictions. More information can be found in the tool documentation.
Note: GATK HaplotypeCaller and Filter for RNAseq are considered suitable for calling germline variants only.
Before starting the tutorial, make sure the paired-end files are uploaded to your project.
List of steps:
- Step 1: Alignment with STAR Mapping
- Step 2: Mark Duplicates with PICARD MarkDuplicates
- Step 3: Defining Read Groups with optional local samtools and Swiss Army Knife (SAK)
- Step 4: Calling Filtered Variants with GATK4 HaplotypeCaller and Filter for RNAseq
The following section of the tutorial will describe how to prepare your raw FASTQ data into a cleaned BAM format. Briefly, GATK best practices on RNAseq short variant discovery recommends using STAR Mapping software as the alignment tool and sequentially going through a data cleaning step involving Picard MarkDuplicates. The following data preparation steps (1-3) follow these best practices.
However, as the GATK4 HaplotypeCaller and Filter for RNAseq tool accepts any BAM alignment of acceptable quality, if you already happen to have data fitting this description it is possible to skip the below steps, 1-3, and simply use the tool starting with step 4.
For alignment and variant calling, various genome reference files may be used, however
GRCh38.no_alt_analysis_set.fa.gzis recommended. The file may be found on the public DNAnexus project, “Reference Genome Files,” following the path
/H. Sapiens - GRCh38 without alt contigs/. The associated index file which should be selected as the “Reference genome indexed for STAR” is,
GRCh38.no_alt_analysis_set.v2.star-index.tar.gz. If a different reference file is desired, STAR Generate Genome Index may be used to generate the complimentary index file.
Results from the run will then appear in the selected project and include the log
ENCFF355VKA.fastq.log.gz, which contains statistics on the mapping and the sorted alignment BAM file,
ENCFF355VKA.fastq.genome.bamwhich is used in the next steps of this tutorial.
From the Tools Library, find the STAR Mapping app and click the run icon. After selecting the output location of the app, the user will be taken to the Run App page. From here, select the file
GRCh38.no_alt_analysis_set.v2.star-index.tar.gzfrom the public project “Reference Genome Files” following the path
/H. Sapiens - GRCh38 without alt contigs/for the "Reference genome indexed for STAR" input that the app requires.
ENCFF355VKA.fastq.gzas the forward reads and
ENCFF900YDP.fastq.gzas the reverse reads (or specify your reads of interest).
After clicking Select, you will be taken back to the Inputs tab. Click the Start Analysis button to begin your analysis.
STAR Mapping may be launched to map reads to a reference genome file using the following command:
dx run star_mapping \
-igenome_index=project-BQpp3Y804Y0xbyG4GJPQ01xv://H.\ Sapiens\ -\ GRCh38\ without\ alt\ contigs/GRCh38.no_alt_analysis_set.v2.star-index.tar.gz \
Upon job completion, the log file may be downloaded (
dx download <project-id:file-id>) and should look similar to the image below:
Picard Mark duplicates identifies duplicated reads among a set of mappings. Results will include
ENCFF355VKA.fastq.genome.deduplicated.bam. You will need the sorted deduplicated bam files or the next steps.
PICARD MarkDuplicates may be launched directly through DNAnexus Platform UI. The only mandatory input of the tool is a sorted mapping file. First select the BAM file from step 1,
After clicking Select, follow the rest of the directions on screen to start your analysis.
Mark duplicated reads by confirming the
dx run picard_mark_duplicates \
The BAM files used for input with GATK4 HaplotypeCaller and Filter for RNAseq are expected to contain read groups as tags. These are used to identify sampling in multi-sample outputs when a sequencer involves multiple lanes. For most cases, when the file is derived from a single biological sample, the read group tag might not be present and thus data will need to be manipulated in order to be eligible as input of GATK4 HaplotypeCaller and Filter for RNAseq. For more information about read groups, please refer to GATK documentation regarding read groups.
Optionally, to check if the read group information is present in the BAM file, you will need to download the file and scan through the command line using samtools.
This part of the tutorial is optional, as GATK4 HaplotypeCaller and Filter for RNAseq informs you if read groups are missing. You will need to add the tags manually for missing read groups.
To list the read group information in the BAM file, use the following command from samtools:
samtools view -H ENCFF355VKA.fastq.genome.deduplicated.bam | grep '^@RG'
This command will print lines starting with "@RG" within the header, if present. In the above case, the return is blank, and thus manual insertion of read groups is necessary. More information may be found in GATK documentation on errors about read group (RG) information. The following step of this tutorial is also based on the mentioned documentation.
SAK may be launched directly through the UI. Select
ENCFF355VKA.fastq.genome.deduplicated.bamas an input file.
Copy and paste the following command on the "Command line" input, under the "Common" section:
java -jar $HOME/picard.jar AddOrReplaceReadGroups I=ENCFF355VKA.fastq.genome.deduplicated.bam O=ENCFF355VKA.fastq.genome.deduplicated.withRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=Sample1 RGPU=unit
The analysis can now be launched.
Manually define read groups by running the following command:
dx run swiss-army-knife \
-icmd=java\ -jar\ \$HOME/picard.jar\ AddOrReplaceReadGroups\ I=ENCFF355VKA.fastq.genome.deduplicated.bam\ O=ENCFF355VKA.fastq.genome.deduplicated.withRG.bam\ SORT_ORDER=coordinate\ RGID=foo\ RGLB=bar\ RGPL=illumina\ RGSM=Sample1\ RGPU=unit
The result of this run will consist of a VCF containing the called variants with the filtering status, which in this case will be
ENCFF355VKA.filtered.vcf.gzand the respective index file
Use the suggested file
GRCh38.no_alt_analysis_set.fa.gzlocated in the public project “Reference Genome Files” following the path
/H. Sapiens - GRCh38 without alt contigs/as your “Reference genome FASTA file”. Other references may be used, but they must match the STAR-indexed genome used in step 1.
Select the suggested file,
Homo_sapiens_assembly38.dbsnp138.vcf.gzas your “dbSNP annotation”. It is also located on the public project, “Reference Genome Files,” following the path
Specify the input filter, “QUAL score by allele depth threshold (QD),” as 2 and input, “Fisher strand threshold (FS),” as 30.
Now the analysis can be launched.
Run GATK4 HaplotypeCaller and Filter for RNAseq by submitting the following command:
dx run gatk4_haplotypecaller_rnaseq \
-igenome_fastagz=project-BQpp3Y804Y0xbyG4GJPQ01xv://H.\ Sapiens\ -\ GRCh38\ without\ alt\ contigs/GRCh38.no_alt_analysis_set.fa.gz \
-idbsnp_vcfgz=project-BQpp3Y804Y0xbyG4GJPQ01xv://gatk.resources.GRCh38/Common\ Variants/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
For comparison purposes, information regarding wall-clock time and core hours for runs with different input sizes can be found below:
The data is based on runs with the default instance type:
git clone https://github.com/STAR-Fusion/STAR-Fusion-Tutorial.git
The repository contains multiple files, however for this tutorial, only the following FASTQ files are necessary;
rnaseq_2.fastq.gz. Before starting the tutorial, make sure the paired-end files are uploaded to the selected project.
Here STAR Fusion Caller to identify fusion transcripts straight from sequenced reads. Alternatively, a gzipped junction file containing chimeric alignments (*Chimeric.out.junction) from an earlier STAR mapping alignment run may be provided to skip alignment step and kickstart fusion detection. If you prefer to run STAR mapping separately, please consult the manual for recommended STAR parameters relevant to fusion detection.
A diagram representing the workflow performed by the tool is shown below.
Using the UI
STAR Fusion Caller may be launched directly through the UI. When launching this and other tools, users are asked to specify a project for running and selecting input files and options. STAR Fusion Caller requires sequenced reads and CTAT genome lib as mandatory inputs.
rnaseq_2.fastq.gzas your forward and reverse reads, respectively, or use your desired read file(s).
This tutorial uses the suggested file,
GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play.tar.gzas input for, “CTAT genome lib”. The library is located in the public project “Reference Genome Files”, folder
ctat/. Alternatively, other libraries can be prepared leveraging CTAT genome lib builder.
Run STAR Fusion Caller by submitting the following command:
dx run star_fusion_caller \
-ireads_fwd_fq=rnaseq_1.fastq.gz -ireads_rev_fq=rnaseq_2.fastq.gz \
For comparison purposes, information regarding runs with different input sizes can be found below:
*Kickstart mode = skipping initial alignment step (
chimeric.out.junctionfile is provided as input)
Note that information is using runs with the default instance type:
The expected output files are detailed below. Most of the files are intermediates from data processing within the tool.
Alignment outputs from STAR mapping:
- Reads mapped to the genome (genome_bam):
- Genome BAM index (genome_bam_bai):
- Log file containing mapping statistics (map_stat_log):
- Chimeric junctions (chimeric_junctions_gz):
- Fusion predictions (fusion_predictions_tsv):
- Abridged fusion predictions (fusion_predictions_abridged_tsv):
- FusionInspector HTML summary file (fusioninspector_summary_html):
- Abridged re-scored fusion predictions (rescored_fusion_predictions_abridged_tsv):
- Re-scored fusion predictions (rescored_fusion_predictions_tsv):
- FusionInspector extra outputs (fusioninspector_extra_outputs_targz):
The main results may be examined by browsing
rnaseq.FusionInspector_web.html, where fusions may be easily seen and navigated. This information includes gene fusion events with break point locations:
Alternatively, additional outputs in standardized formats (bed, gtf, and fasta) may be found within the compressed directory,
rnaseq.FusionInspector_extra_outputs.tar.gz, and may be visualized in a genome browser, such as the Integrative Genome Viewer (IGV).
Further information such as Fusion Allelic Ratio (FAR) and more can be found among the abridged fusion predictions TSV,
rnaseq.FusionInspector.fusions.abridged.tsv.annotated. FAR is used to estimate the relative expression of the fusion transcript as compared to the alternative non-fused allele of the gene. For more information, please see additional documentation here.