Hawaii Gigas Methylation Analysis Part 4
Reviewing second round of trimming
I transfered the output off
rsync --archive --progress --verbose /gscratch/scrubbed/yaaminiv/Hawes/analyses/trimgalore-2/* firstname.lastname@example.org:/Volumes/web/spartina/project-oyster-oa/Haws/trimmed-data-2/ #Transfer output to gannet (mounted on computer)
I also saved the output for the second round of trimming (removing remaining adapter sequences) here and the third round of trimming (removing poly-G tails] here. I started by comparing the
multiqc reports from the first, second, and third rounds of trimming.
Figures 1-3. Status checks for each round of trimming
Based on the status checks, there are no overrepresented sequences in the resultant files from the third round of trimming! I double checked this by looking at the reports from sample 24. There were no overrepresented sequences in the first paired read after the second round of trimming, but there were overrepresented sequences in the second pair. This makes sense because I trimmed out the remaining adapter sequences, but not the poly-G tail. After the third round of trimming, neither paired read had any overrepresented sequences!
I also checked the per sequence GC content in each
Figures 4-6. Per sequence GC content for each round of trimming
There are peaks at the end of the distribution after the first and second round of trimming that can be atttributed to the poly-G tails. After the third round of trimming, that second peak is gone! While trimming out the poly-G tail works, it does make me wonder what other artifacts may be present in my sequences. That can’t be answered until Zymo takes a look at the sequences. Another thing Hollie pointed out is that while trimming is relatively easy, it does mean that we’re “wasting” reads on the poly-G tail instead of data. I decided to look at the length of the sequences between the second and third rounds of trimming. Even without the poly-G tails, I would have had to do two rounds of trimming with
cutadapt anyways. The third round of trimming is the “extra” round that would lead to potential data loss.
To do this, I downloaded the plot data from sequence counts in the
multiqc report. I know I could technically get these numbers from the general stats files
multiqc produces, but that would require a little more finagling and I liked the format of these tables better. I saved the sequence cound information for the second trim here and third trim here. For each file, I calculated the total number of reads, percent unique reads, and percent duplicate reads in Excel. Between the two trims, I expected the total number of reads and percent duplicate reads to decrease after trim 3, but the percent unique reads to increase. I created a new file with this information, and calculated the difference in total reads between trims and the percent of reads “lost” after trim 3. Looking at the file, I lose 1-3% of reads after trimming the poly-G tail. This is about 511,333 to 1,197,753 reads for each individual file.
Obtaining adapter sequences
Since Hollie is also talking to Zymo about her data, she asked me to send her the adapter sequences in my data. To do this, I used
grep to extract the trimmed sequences from each trimming report:
grep Sequence *trimming_report.txt > adapter-sequences.txt #Obtain adapter sequences for first trim cd trimmed-data-2 #Change directory grep Sequence *trimming_report.txt | cat >> ../adapter-sequences.txt #Obtain adapter sequences for second trim and append to existing adapter sequence file
The adapter sequences are saved in this file.
While waiting for information from Zymo, I decided to start
bismark. I need to run the alignments on the files after the third round of trimming. I edited this script to reflect the changes in file names.
rsync --archive --progress --verbose email@example.com:/Users/yaamini/Documents/project-oyster-oa/code/Haws/02-bismark.sh . #Transfer script to user directory sbatch 02-bismark.sh #Run script
- Get more information from Zymo about poly-G tails in samples
- Transfer scripts used to a nextflow workflow
- Identify methylation analysis framework beyond