WGBS Analysis Part 10
Analyzing the full dataset
At the end of last week, I got my samples back from ZymoResearch! These are WGBS samples of C. gigas female gonad tissue from my Manchester experiment. The purpose of sequencing these samples is to determine if the low pH exposure altered the methylome. If we find something interesting from these samples, we may sequence some larval samples, or look into gonad RNA.
Obtaining sequences
The first thing I needed to do was download the data. ZymoResearch provided two URLs: one for sample fastq files, and another for sample checksums. Each URL directed to a text file. The file with fastq information contained a link that could be curled to access the individual files. In order to get the data, I needed a straightforward way to specify multiple URLs in a curl or wget command.
I found this link with information on how to do just that! Once I downloaded the text file, I could use xargs to specify the individual URLs to download:
curl https://epiquest.s3.amazonaws.com/epiquest_zr3616/yZDEfJOtQKi6iHfwuDtQa3mUJiy4vC/download_fastq.txt > download_fastq.txt #Download file provided by ZymoResearch
xargs -n 1 curl -O < download_fastq.txt #Download fastq files
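As an aside, wget can do the same thing without xargs: its --input-file (-i) option reads URLs from a file. A one-line equivalent (untested on this dataset) would be:
wget -i download_fastq.txt #Download every URL listed in the text file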
Then, I verified checksums:
curl https://epiquest.s3.amazonaws.com/epiquest_zr3616/yZDEfJOtQKi6iHfwuDtQa3mUJiy4vC/zr3616_MD5.txt > zr3616_MD5.txt #Download MD5 checksums from ZymoResearch
find *fq.gz | md5sum -c zr3616_MD5.txt #Cross-reference original checksums with downloaded files. All files passed
Once I had the data, I needed to move it to owl and include metadata in the Nightingales Google Sheet. I used mv to move samples from my Desktop to owl mounted on my computer, but couldn't write to the directory. I posted this issue to get write access to both the owl folder and Google Sheet. As an aside, Sam reminded me I should use rsync to transfer the files:
rsync --archive --progress --verbose zr3616_* /Volumes/web-1/nightingales/C_gigas/ #Transfer to nightingales
While the files transferred over many hours, I updated the Google Sheet with sample metadata.
Raw data quality
When I received the data from ZymoResearch, the pdf implied that adapters were trimmed out of the data. Confused, I posted this discussion to determine if I needed to modify my trimming protocol. Sam informed me that the samples were likely not trimmed, and I would see this if I ran FastQC. Once the files were transferred to Nightingales, I needed to transfer them to mox and do just that.
I used rsync to transfer the files:
rsync --archive --verbose --progress yaamini@172.25.149.226:/Volumes/web-1/nightingales/C_gigas/zr3616*gz /gscratch/scrubbed/yaaminiv/Manchester/data/ #Transfer to mox
Then, I set up a fastqc script. To streamline the repository, I moved my preliminary analyses to the pooled subset subdirectory. This way, my scripts will remain linked to this repository without cluttering it. Running fastqc seemed simple. I located the program on mox in the srlab programs directory. The command to run the program is fastqc + filenames, so I specified the directory with my data, and included code for [MultiQC](https://multiqc.info/) report compilation. When I ran this script, it obviously didn't work!
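For context, the script boiled down to something like the sketch below (program paths are from memory, not copied from the script, and the FastQC version is a guess):
/gscratch/srlab/programs/fastqc_v0.11.9/fastqc /gscratch/scrubbed/yaaminiv/Manchester/data/*.fq.gz #Run FastQC on every fastq file
multiqc /gscratch/scrubbed/yaaminiv/Manchester/data/ #Compile the individual reports into one MultiQC report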
Figuring Sam must have run fastqc recently, I found this lab notebook entry with a fastqc script for mox.
I wasn't sure why Sam needed to work with the files in an array instead of calling fastqc directly, but I figured it had to do with the Java error I encountered when I tried to do just that. I copied the code into my script and modified the variables so they pointed to the correct directories and samples. With his code, fastqc will generate sequence quality reports, multiqc will compile them, and checksums will be generated. I probably don't need the checksum information since rsync verifies checksums when moving files, and I have the original checksum information from ZymoResearch. In any case, it couldn't hurt.
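In broad strokes, his approach stores the fastq files in a bash array and loops over it, which makes it easy to generate a checksum for each file in the same pass. A rough sketch of that structure (directory paths are placeholders, not the exact ones from the script):
fastq_array=(/gscratch/scrubbed/yaaminiv/Manchester/data/*.fq.gz) #Store all fastq files in an array
for fastq in "${fastq_array[@]}"
do
  /gscratch/srlab/programs/fastqc_v0.11.9/fastqc --outdir . "${fastq}" #Generate a quality report for each file
  md5sum "${fastq}" >> fastq_checksums.md5 #Record a checksum for each file
done
multiqc . #Compile the FastQC reports into a single MultiQC report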
Once I finished modifying the script, I transferred the script and ran it:
rsync --archive --verbose --progress yaamini@172.25.149.226:/Users/yaamini/Documents/project-gigas-oa-meth/code/01-fastqc.sh . #Transfer script to user directory
sbatch 01-fastqc.sh #Run script
My script finished running in 40 minutes. It exited with an error message, but when I checked the multiqc report, all samples were accounted for. Looking at the report, I saw that there were no identified adapter sequences in any sample.
Figure 1. Summary of FastQC report parameters for all samples
When I dug into individual reports, I noticed that there were overrepresented PCR primers. And of course, poly-G tails were present in almost all R2 samples, leading to a spike in the tail end of the per sequence GC content distribution.
Figures 2-3. Overrepresented sequences and per sequence GC content distribution for sample 2 R2.
The lack of adapter sequences was concerning, so I looked at the Hawaii C. gigas MultiQC report. Although there were plenty of adapter and overrepresented sequences in that data, the summary at the bottom of the report doesn’t indicate this!
Figure 4. Summary of FastQC report parameters for Hawaii samples
This is a very odd classification convention, but I think it means my samples were not trimmed. I can proceed to the next step!
Trimming samples
Since I saw poly-G tails in the FastQC reports, I knew I needed three rounds of trimming. Luckily, I already had my Hawaii data trimgalore script I could modify for this project. For this project's script, I updated file and directory paths, and included echo commands so I could easily search the slurm output and determine which steps were completed. I then transferred the script to mox and let it run:
rsync --archive --progress --verbose yaamini@172.25.149.226:/Users/yaamini/Documents/project-gigas-oa-meth/code/02-trimgalore.sh . #Transfer script to user directory
sbatch 02-trimgalore.sh #Run script
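Each trimming round in the script is its own trim_galore call. A loose sketch of the first two rounds, assuming trim_galore is on the PATH (file names and flags are illustrative, not the exact contents of 02-trimgalore.sh):
trim_galore --paired --fastqc --output_dir trimmed/ zr3616_1_R1.fq.gz zr3616_1_R2.fq.gz #Round 1 (sketch): standard adapter trimming
trim_galore --paired --adapter GGGGGGGGGG --adapter2 GGGGGGGGGG --output_dir trimmed-polyG/ trimmed/zr3616_1_R1_val_1.fq.gz trimmed/zr3616_1_R2_val_2.fq.gz #Round 2 (sketch): pass a run of Gs as the adapter to strip poly-G tails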
Going forward
- Check trimming output
- Start bismark
- Identify DML
- Determine if RNA should be extracted
- Determine if larval DNA/RNA should be extracted