Hawaii Gigas Methylation Analysis Part 2
Checking trimmed data and starting
Once the trim job finished, the first thing I did was transfer all the output to
rsync —archive —progress —verbose /gscratch/scrubbed/yaaminiv/Hawes/analyses/trimgalore/* email@example.com:/Volumes/web/spartina/project-oyster-oa/data/Haws/trimmed #Transfer files off of mox and onto gannet
After moving the files, I opened the MultiQC report to look at sample quality post-trimming. For each sample,
cutadapt generally trimmed ≤ 10 bp of adapter sequences from each sample. This was in addition to the 10 bp hard trim from each end.
One thing I noticed is that samples 7 and 8 have more unique reads than the others. When looking at the percentage of unique vs. duplicate reads, all samples look roughly the same.
Figures 1-2. FastQC sequence counts.
Sequence quality, per sequence quality scores, per base sequence content, per base N content, sequence duplication levels, and adapter conetent were good for all samples! The per sequence GC content failed for all samples. Most samples had slight abnormalities or very unusual overrepresented sequences. All samples had slightly abnormal sequence length distributions, amnd most had slightly abnormal per tile sequence quality scores.
Figures 3-7. Status checks for metrics that were slightly abnormal or very unusual.
In looking at some of the FastQC reports from Sam, I noticed that the per sequence GC content remained the same before and after trimming for several samples. Many of the samples with overrepresented sequences were the 2nd paired files, and had a sequence with no hit.
Figure 8. Most common overrepresented sequence.
It’s been a while since I’ve looked at FastQC and MultiQC reports, so I looked at the coral data. Per sequence GC content, sequence length distribution, and overrepresented sequences were potential problem areas with the coral data as well. I also looked at my C. virginica gonad methylation data MultiQC report. Most samples had slightly unusual per sequence GC content and per base sequence content, and all samples had slighty unusual sequence length distribution. Since I followed the trimming protocol suggested and have similar-ish results to some other data, I decided the files were decent enough to start
bismark. I also posted this issue to confirm my thinking with others.
First, I downloaded the C. gigas bisulfite-converted genome that Sam created previously. I used
rsync to transfer it to my data directory on
mox. Then, I created this script based on the coral methylation workflow. I updated the paths for
samtools for the latest program versions, and also added
set -e to the top of the script. This will cause the script to exit if any command fails, instead of rnning through the entire script spitting out errors.
I transferred the script to my user directory so I could run it:
rsync —archive —progress —verbose firstname.lastname@example.org:/Users/yaamini/Documents/project-oyster-oa/code/Haws/02-bismark.sh . #Transfer to user directory sbatch 02-bismark.sh #Run script
When I ran the script, it immediately failed! So at least I knew
set -e was working well. When I checked the script, I realized I did not update the filenames based on the trimmed files, so
xargs could not find the trimmed files to use. I updated the filenames, transferred the revised script to
mox, then ran the script. I still got the same error! I checked the script again and couldn’t find any error. When I looked at my file structure, I realized that changing the “Hawes” directory to “Haws” made it such that there were two scripts. Since my script was open in Atom when I renamed the directory, I was still editing a version saved at a path different from the one I was transferring to
mox! I quit Atom, moved the most updated version of the script to the correct directory, then transferred it to
This time it ran successfully…kind of. It was able to find the trimmed files, but it couldn’t find the C. gigas genome:
Failed to move to directory /gscratch/scrubbed/yaaminiv/Hawes/data/Cg-genome/Bisulfite_Genome/CT_conversion/: No such file or directory
The genome was in a directory nested inside
/gscratch/scrubbed/yaaminiv/Hawes/data/Cg-genome/. I changed the directory path so that
bismark could find the bisulfite genome folder in the directory I specified:
Finally, the script started running! I’ll keep tabs on the progress over the next few weeks.
- Look at
- Transfer scripts used to a nextflow workflow
- Identify methylation analysis framework beyond