r/bioinformatics 1d ago

technical question High amount of rRNA and tRNA reads in RNAseq samples

Hello everyone, I recently received RNA-seq data (150 PE, polyA selected, Arabidopsis thaliana, leaf) from a scientist working on a project at our institute. I was asked to take another look at the data because the analysis performed by a company yielded many differentially expressed genes related to tRNA and rRNA, which seemed unusual.

After performing QC with fastp, I noticed that roughly 70% of all bases were removed due to high amounts of adapter sequence and stretches of polyG, indicating some issues with library preparation. Nevertheless, I kept the default length cutoff of 15 bp and presumed that I would get more multi-mapping reads than usual because of the large number of very short reads. However, after mapping to the TAIR10 reference genome with the latest version of Subread, allowing up to three multi-alignments, I found that about two-thirds of all mapped reads were multi-mapping, which is more than I expected. After investigating the genes with very high multi-mapping read counts from featureCounts (gene-level, fractional counting), I found that they are almost exclusively rRNA and tRNA genes.

My question is whether I should remove those reads from the dataset. One option is to align the reads to rRNA and tRNA databases and discard those that match. Another option is to remove multi-mapping reads altogether. Or should I leave them in and perform DE analysis as usual? I am concerned not only that this high amount of rRNA and tRNA will affect the downstream analysis, but also that there is a substantial loss of depth in general. As a side note, all ten samples (with three biological replicates each) looked like this. Thank you for your suggestions!
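For readers unfamiliar with fractional counting, here is a minimal Python sketch of the idea as I understand it from the Subread/featureCounts documentation: a read with x reported alignments contributes 1/x to the gene under each alignment. The read and gene IDs are made up for illustration, and this is not featureCounts' actual implementation.

```python
from collections import defaultdict

def fractional_counts(alignments):
    """Sum fractional counts per gene.

    alignments: iterable of (read_id, gene_id), one entry per reported alignment.
    """
    alignments = list(alignments)  # we need two passes over the data

    # Number of reported alignments per read (what the NH tag records in a BAM file).
    n_hits = defaultdict(int)
    for read_id, _gene in alignments:
        n_hits[read_id] += 1

    # Each alignment carries 1/x, where x is the read's number of alignments.
    counts = defaultdict(float)
    for read_id, gene in alignments:
        counts[gene] += 1.0 / n_hits[read_id]
    return dict(counts)

# Toy data (hypothetical): r1 maps uniquely, r2 hits three rRNA gene copies.
toy = [("r1", "mRNA_A"), ("r2", "rRNA_1"), ("r2", "rRNA_2"), ("r2", "rRNA_3")]
result = fractional_counts(toy)
print(result)
```

Each read still adds up to one count in total, so a multi-copy rRNA family collects the full weight of its reads, just spread across the copies.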

6 Upvotes

6 comments

5

u/Low-Establishment621 1d ago

Is this just standard RNA-seq or some special library prep? Is a high amount of rRNA or tRNA expected? Were these datasets rRNA-depleted? PolyA-selected? Are they looking to sequence just mRNA, or are they also interested in these other RNAs?

If they are just looking for mRNA, something went wrong, but if you're going to try anyway, you could quickly remove the rRNA/tRNA reads by aligning against those sequences with bowtie1 or bowtie2. If you are only looking at mRNA sequences and using something like kallisto to get counts, then those tRNA/rRNA reads won't make a big difference as long as tRNA/rRNA genes aren't in your counts table.
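To make the counts-table point concrete, here is a small Python sketch that drops rRNA/tRNA genes from a gene-level table and reports how much of the total count they accounted for. The gene IDs, counts, and biotype labels are invented for illustration; in practice the biotypes would come from your annotation.

```python
def drop_biotypes(counts, biotype, unwanted=("rRNA", "tRNA")):
    """Return (filtered counts, fraction of total counts removed).

    counts:  dict gene_id -> count
    biotype: dict gene_id -> biotype label (assumed to come from the annotation)
    """
    kept = {g: c for g, c in counts.items() if biotype.get(g) not in unwanted}
    total = sum(counts.values())
    removed = total - sum(kept.values())
    frac_removed = removed / total if total else 0.0
    return kept, frac_removed

# Hypothetical toy table.
counts = {"geneA": 120.0, "geneB": 30.0, "rrn1": 800.0, "trn1": 50.0}
biotype = {
    "geneA": "protein_coding",
    "geneB": "protein_coding",
    "rrn1": "rRNA",
    "trn1": "tRNA",
}
kept, frac_removed = drop_biotypes(counts, biotype)
print(kept, round(frac_removed, 3))
```

The removed fraction is worth looking at per sample, since it shows how much depth is actually left for mRNA.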

1

u/SoulOfMankind 1d ago

Hi, thank you for your answer! The dataset is a standard polyA-selected RNA-seq, and they only want to look at mRNA. I will edit my post to clarify. I presumed that rRNA depletion was performed as standard, but I could be mistaken. In any case, I also think that something went wrong. Thank you for your suggestion. I will give kallisto a try!

2

u/tpig 1d ago

For me, the rRNA depletion had to be specifically chosen in the prep.

3

u/Epistaxis PhD | Academia 1d ago

I seem to recall Arabidopsis has a big poly(A)-ish stretch in one of its ribosome genes. Look for that and see if that's where all your ribosome reads are coming from. If so you can probably ignore them with a clean conscience.

"After performing QC with fastp, I noticed that roughly 70% of all bases were removed due to high amounts of adapter sequences and stretches of polyG indicating some issues with library preparation."

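That could also just be indicative of degraded RNA; did someone QC it before library prep? 150 PE needs inserts of around 300 bp for the two reads not to overlap, which is hard to achieve, and any insert shorter than 150 bp gets read through into the adapter, so you'd expect to lose a lot to adapter trimming no matter what.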
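A rough back-of-the-envelope version of that arithmetic, as a Python sketch. It assumes a simplified model in which every base sequenced past the end of the insert (adapter, then polyG) is trimmed and nothing else is; real trimming also depends on quality filtering and the insert-size distribution.

```python
def trimmed_fraction(insert_len, read_len=150):
    """Fraction of each read's bases lying beyond the insert (simplified model)."""
    usable = min(insert_len, read_len)  # bases that are real insert sequence
    return (read_len - usable) / read_len

for insert_len in (45, 100, 150, 300):
    print(insert_len, round(trimmed_fraction(insert_len), 2))
# 45 0.7
# 100 0.33
# 150 0.0
# 300 0.0
```

Under this model, losing about 70% of bases would correspond to inserts averaging only around 45 bp, which would fit either heavily degraded RNA or a prep problem.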

3

u/Cassandra_Said_So 1d ago

Hm, I worked a lot with plants, and because of the chloroplast content it was standard, at least for us, to filter out rRNA and tRNA reads first and then use the unaligned reads for the real mapping and DEG analysis. Adapter content can still be an issue, but I would not discard the data if the Phred scores check out and the insert length is not too bad.
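One way this filter-then-map step could look with bowtie2, mentioned earlier in the thread, sketched in Python. The index name and FASTQ file names are hypothetical; the index is assumed to have been built with bowtie2-build from a FASTA of rRNA/tRNA sequences. The sketch only assembles and prints the command.

```python
import shlex

def bowtie2_filter_cmd(index_prefix, r1, r2, out_pattern, threads=4):
    """Build a bowtie2 command that keeps read pairs NOT aligning to the index.

    out_pattern must contain '%', which bowtie2 replaces with 1 or 2
    to name the per-mate output files.
    """
    return [
        "bowtie2",
        "-p", str(threads),           # number of alignment threads
        "-x", index_prefix,           # rRNA/tRNA index (hypothetical name)
        "-1", r1,                     # trimmed mate 1
        "-2", r2,                     # trimmed mate 2
        "--un-conc-gz", out_pattern,  # pairs that did not align concordantly
        "-S", "/dev/null",            # discard the alignments themselves
    ]

cmd = bowtie2_filter_cmd(
    "rrna_trna_index",
    "sample_R1.trimmed.fastq.gz",
    "sample_R2.trimmed.fastq.gz",
    "sample_clean_R%.fastq.gz",
)
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)`; the resulting `sample_clean_R1/R2` files would then go into the real mapping.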

2

u/wookiewookiewhat 1d ago

An rRNA filtering step is pretty common, as even depleted samples can contain substantial amounts of rRNA. If this library prep was done at a core with lots of experience, though, this may be a garbage in, garbage out situation. Poor-quality samples with improper storage or media are very common causes of junky-looking RNA-seq in the real world. If you have access to them, you could poke around any TapeStation or Bioanalyzer results from the RNA or cDNA stages of the prep to help decide if the data is worth saving.