Hawaii Gigas Methylation Analysis Part 2

Checking trimmed data and starting bismark

Reviewing trimgalore and multiqc output

Once the trim job finished, the first thing I did was transfer all the output to gannet:

rsync —archive —progress —verbose /gscratch/scrubbed/yaaminiv/Hawes/analyses/trimgalore/* yaamini@ #Transfer files off of mox and onto gannet

After moving the files, I opened the MultiQC report to look at sample quality post-trimming. For each sample, cutadapt generally trimmed ≤ 10 bp of adapter sequences from each sample. This was in addition to the 10 bp hard trim from each end.

One thing I noticed is that samples 7 and 8 have more unique reads than the others. When looking at the percentage of unique vs. duplicate reads, all samples look roughly the same.

Screen Shot 2021-01-14 at 12 34 02 PM

Screen Shot 2021-01-14 at 12 34 13 PM

Figures 1-2. FastQC sequence counts.

Sequence quality, per sequence quality scores, per base sequence content, per base N content, sequence duplication levels, and adapter conetent were good for all samples! The per sequence GC content failed for all samples. Most samples had slight abnormalities or very unusual overrepresented sequences. All samples had slightly abnormal sequence length distributions, amnd most had slightly abnormal per tile sequence quality scores.

Screen Shot 2021-01-14 at 11 32 08 AM

Screen Shot 2021-01-14 at 11 32 28 AM

Screen Shot 2021-01-14 at 11 32 56 AM

Screen Shot 2021-01-14 at 11 33 04 AM

Figures 3-7. Status checks for metrics that were slightly abnormal or very unusual.

In looking at some of the FastQC reports from Sam, I noticed that the per sequence GC content remained the same before and after trimming for several samples. Many of the samples with overrepresented sequences were the 2nd paired files, and had a sequence with no hit.

Screen Shot 2021-01-14 at 2 21 41 PM

Figure 8. Most common overrepresented sequence.

It’s been a while since I’ve looked at FastQC and MultiQC reports, so I looked at the coral data. Per sequence GC content, sequence length distribution, and overrepresented sequences were potential problem areas with the coral data as well. I also looked at my C. virginica gonad methylation data MultiQC report. Most samples had slightly unusual per sequence GC content and per base sequence content, and all samples had slighty unusual sequence length distribution. Since I followed the trimming protocol suggested and have similar-ish results to some other data, I decided the files were decent enough to start bismark. I also posted this issue to confirm my thinking with others.

Starting bismark

First, I downloaded the C. gigas bisulfite-converted genome that Sam created previously. I used rsync to transfer it to my data directory on mox. Then, I created this script based on the coral methylation workflow. I updated the paths for bismark, bowtie2, and samtools for the latest program versions, and also added set -e to the top of the script. This will cause the script to exit if any command fails, instead of rnning through the entire script spitting out errors.

I transferred the script to my user directory so I could run it:

rsync —archive —progress —verbose yaamini@ . #Transfer to user directory
sbatch 02-bismark.sh #Run script

When I ran the script, it immediately failed! So at least I knew set -e was working well. When I checked the script, I realized I did not update the filenames based on the trimmed files, so xargs could not find the trimmed files to use. I updated the filenames, transferred the revised script to mox, then ran the script. I still got the same error! I checked the script again and couldn’t find any error. When I looked at my file structure, I realized that changing the “Hawes” directory to “Haws” made it such that there were two scripts. Since my script was open in Atom when I renamed the directory, I was still editing a version saved at a path different from the one I was transferring to mox! I quit Atom, moved the most updated version of the script to the correct directory, then transferred it to mox.

This time it ran successfully…kind of. It was able to find the trimmed files, but it couldn’t find the C. gigas genome:

Failed to move to directory /gscratch/scrubbed/yaaminiv/Hawes/data/Cg-genome/Bisulfite_Genome/CT_conversion/: No such file or directory

The genome was in a directory nested inside /gscratch/scrubbed/yaaminiv/Hawes/data/Cg-genome/. I changed the directory path so that bismark could find the bisulfite genome folder in the directory I specified: /gscratch/scrubbed/yaaminiv/Hawes/data/Cg-genome/Crassostrea_gigas.oyster_v9.dna_sm.toplevel/

Finally, the script started running! I’ll keep tabs on the progress over the next few weeks.

Going forward

  1. Look at bismark output
  2. Transfer scripts used to a nextflow workflow
  3. Identify methylation analysis framework beyond methylKit
Written on January 14, 2021