Hawaii Gigas Methylation Analysis Part 3

Revisiting trimming

After I posted this issue, Sam tagged me in another issue. Turns out some WGBS data Hollie received recently had the same problem: mainly a poly-G tail in the second paired read, which causes sequences to fail the per sequence GC content metric. When I reviewed the overrepresented sequences in the first paired reads, I noticed that there were still adapter sequences in these reads! When I mentioned this to Sam at Science Hour, he said that TrimGalore and cutadapt usually need to be run twice on a set of samples. This is because only the most abundant adapter sequence gets removed the first time, while the remaining adapters get removed the second time. Based on this discussion I decided to:

  • Trim all sequences a second time
  • Manually remove the poly-G tail

I created a new script. For the second round of trimming, the only parts of my code that changed were the paths to the files to trim and the directories used. I included code to run MultiQC on these files. Next, I set up code to manually trim out poly-G tails, write those files to a different subdirectory, and run MultiQC on the files after three rounds of trimming.
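
To give a rough idea of what manually removing a poly-G tail involves, here is a minimal Python sketch. It is not the script used for this analysis: the function names (trim_poly_g, trim_fastq_records) and the min_run threshold of 10 bases are assumptions made up for illustration. The idea is to strip a run of G's from the 3' end of a read when the run is long enough to look like an artifact, and to shorten the quality string to match.

```python
def trim_poly_g(seq, qual, min_run=10):
    """Remove a trailing run of G's if it is at least min_run bases long.

    min_run is an assumed threshold, not a value taken from the analysis.
    Returns the (possibly shortened) sequence and quality strings.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length < min_run:
        # Short G runs are probably real sequence, so leave the read alone.
        return seq, qual
    # Keep the quality string the same length as the trimmed sequence.
    return stripped, qual[:len(stripped)]


def trim_fastq_records(lines, min_run=10):
    """Yield poly-G trimmed FASTQ records from an iterable of text lines.

    Assumes a simple four-line-per-record FASTQ layout.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        seq, qual = trim_poly_g(seq, qual, min_run)
        yield header, seq, plus, qual
```

Note that this simple approach also removes any genuine G's sitting directly before the tail, since there is no way to tell them apart from the artifact by sequence alone.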

In the meantime, Sam said he’ll contact Zymo since he sent in the sequencing request to begin with! Between him and Hollie, hopefully Zymo can give us more information.

Going forward

  1. Review MultiQC output after trimming
  2. Get more information from Zymo about poly-G tails in samples
  3. Start Bismark
  4. Transfer scripts used to a Nextflow workflow
  5. Identify methylation analysis framework beyond methylKit
Written on January 15, 2021