WGBS Analysis Part 10
Analyzing the full dataset
At the end of last week, I got my samples back from ZymoResearch! These are WGBS samples of C. gigas female gonad tissue from my Manchester experiment. The purpose of sequencing these samples is to determine if the low pH exposure altered the methylome. If we find something interesting from these samples, we may sequence some larval samples, or look into gonad RNA.
The first thing I needed to do was download the data. ZymoResearch provided two URLs: one for sample fastq files, and another for sample checksums. Each URL directed to a text file. The file with fastq information contained a link that could be curled to access the individual files. In order to get the data, I needed a straightforward way to specify multiple URLs in a
I found this link with information on how to do just that! Once I downloaded the text file, I could use
xargs to specify the individual URls to download:
curl https://epiquest.s3.amazonaws.com/epiquest_zr3616/yZDEfJOtQKi6iHfwuDtQa3mUJiy4vC/download_fastq.txt > download_fastq.txt #Download file provided by ZymoResearch xargs -n 1 curl -O < download_fastq.txt #Download fastq files
Then, I verified checksums:
curl https://epiquest.s3.amazonaws.com/epiquest_zr3616/yZDEfJOtQKi6iHfwuDtQa3mUJiy4vC/zr3616_MD5.txt > zr3616_MD5.txt #Download MD5 checksums from ZymoResearch find *fq.gz | md5sum -c zr3616_MD5.txt #Cross-reference original checksums with downloaded files. All files passed
Once I had the data, I needed to move it to
owl and include metadata in the
Nightingales Google Sheet. I used
mv to move samples from my Desktop to
owl mounted on my computer, but couldn’t write to the directory. I posted this issue to get write access to both the
owl folder and Google Sheet. As an aside, Sam reminded me I should use
rsync to transfer the files:
rsync —archive —progress —verbose zr3616_* /Volumes/web-1/nightingales/C_gigas/ #Transfer to nightingales
While the files transferred over many hours, I updated the Google Sheet with sample metadata.
Raw data quality
When I received the data from ZymoResearch, the pdf implied that adapters were trimmed out of the data. Confused, I posted this discussion to determine if I needed to modify my trimming protocol. Sam informed me that the samples were likely not trimmed, and I would see this if I ran
FastQC. Once the files were transferred to Nightingales, I needed to transfer them to
mox and do just that.
rsync to transfer the files:
rsync —archive —verbose —progress firstname.lastname@example.org:/Volumes/web-1/nightingales/C_gigas/zr3616*gz /gscratch/scrubbed/yaaminiv/Manchester/data/ #Transfer to mox
Then, I set up a
fastqc script. To streamline the repository, I moved my preliminary analyses to the pooled subset subdirectory. This way, my scripts will remain linked to this repository, but it won’t clutter it. Running
fastqc seemed simple. I located the program on mox in the
srlab programs directory. The command to run the program is
fastqc + filenames, so I specified the directory with my data, and included code for
](https://multiqc.info/) report compilation. When I ran this script, it obviously didn’t work!
Figuring Sam must have run
fastqc recently, I found this lab notebook entry with a
fastqc script for
mox. I wasn’t sure why Sam needed to work with the files in an array, instead of calling
fastqc will generate sequence quality reports,
multiqc will compile them, and checksums will be generated. I probably don’t need the checksum information since
rsync verifies checksums when moving files, and I have the original checksum information from ZymoResearch. In any case, it couldn’t hurt.
Once I finished modifying the script, I transferred the script and run it:
rsync —archive —verbose —progress email@example.com:/users/yaamini/Documents/project-gigas-oa-meth/code/01-fastqc.sh . #Transfer script to user directory sbatch 01-fastqc.sh #Run script
My script finished running in 40 minutes. It exited with an error message, but when I checked the
multiqc report all samples were accounted for. Looking at the report, I saw that there were no identified adapter sequences in any sample.
Figure 1. Summary of FastQC report parameters for all samples
When I dug into individual reports, I noticed that there were overrepresented PCR primers. And of course, poly-G tails were present in almost all R2 samples, leading to a spike in the tail end of the per sequence GC content distribution.
Figures 2-3. Overrepresented sequences and per sequence GC content distribution for sample 2 R2.
The lack of adapter sequences was concerning, so I looked at the Hawaii C. gigas MultiQC report. Although there were plenty of adapter and overrepresented sequences in that data, the summary at the bottom of the report doesn’t indicate this!
Figure 4. Summary of FastQC report parameters for Hawaii samples
This is a very odd classification convention, but I think it means my samples were not trimmed. I can proceed to the next step!
Since I saw poly-G tails in the FastQC reports, I know I need three rounds of trimming. Luckily, I already had my Hawaii data
trimgalore script I could modify for this project. For this project’s script, I updated file paths, directory paths, and included
echo commands so I could easily search the slurm output and determine which steps were completed. I then transferred the script to
mox and let it run:
rsync --archive --progress --verbose firstname.lastname@example.org:/Users/yaamini/Documents/project-gigas-oa-meth/code/02-trimgalore.sh . #Transfer script to user directory sbatch 02-trimaglore.sh #Run script
- Check trimming output
- Identify DML
- Determine if RNA should be extracted
- Determine if larval DNA/RNA should be extracted