Gigas and Virginica Comparison Part 2
C. gigas vs. C. virginica: general methylation landscapes
Now that I’ve tested methylation island analysis, I want to finalize my C. virginica and C. gigas MI tracks and look at the location of MI in the genome and in relation to DML.
C. virginica MI
Finalizing track parameters
Based on this issue, I tested out 500 bp windows and a 0.02 mCpG fraction with either a 25 bp or 50 bp step size.
Table 1. Testing 500 bp window sizes and different step sizes
Initial Window Size (bp) | mCpG Fraction | **Step Size (bp) ** | Number of MI | Max mCpG in MI | Min mCpG in MI |
---|---|---|---|---|---|
500 | 0.02 | 25 | 64795 | 24777 | 10 |
500 | 0.02 | 50 | 63483 | 24777 | 10 |
I then visualized the tracks in IGV! The two 500 bp tracks are pretty comparable, but there are definitely places without MI in the 500 bp tracks that were in the 200 and 300 bp tracks.
Figures 1-5. MI tracks with 200, 300, and 500 bp windows and 0.02 mCpG fractions against all 5x CpGs, methylated CpGs, and mRNA.
When I asked for feedback on the tracks in this issue, Steven pointed out that some of the MI were in fact less than 500 bp.
Figure 6. MI that is at least 500 bp.
Figure 7. MI that is less than 500 bp.
I don’t undertand why the script still created MI that were less than 500 bp! I decided to manually filter by length using this code:
!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4}}' 2020-02-06-Methylation-Islands-500_0.02_25.tab \
> 2020-02-06-Methylation-Islands-500_0.02_25-filtered.tab
Almost 30,000 of the MI I identified did not meet this size threshold!
Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.
Initial Window Size (bp) | mCpG Fraction | Step Size (bp) | Number of MI | Number of MI after Filtering | Max mCpG in MI | Min mCpG in MI |
---|---|---|---|---|---|---|
500 | 0.02 | 25 | 64795 | 36060 | 24777 | 11 |
500 | 0.02 | 50 | 63483 | 37063 | 24777 | 11 |
Based on the feedback, I decided to move forward with the filtered 500 bp windows, 0.02 mCpG fraction, and 50 bp window sizes.
Location in the genome
In this Jupyter notebook, I characterized the location of genome feature tracks and DML with respect to MI.
Table 3. MI overlaps with genome feature tracks and DML.
Feature | Number of Overlaps with MI |
---|---|
Exons | 240133 |
Introns | 92472 |
Genes | 15009 |
mRNA | 29483 |
Transposable Elements | 107926 |
Putative Promoters | 8846 |
DML | 537 |
Hypermethylated DML | 283 |
Hypomethylated DML | 254 |
Most of my DML (89.7%) were in my MI! Additionally, 32.8% of exons, 29.2% of introns, 49.0% of mRNA, and 38.6% of genes overlapped with MI. In contrast, only 15.6% of transposable elements and 14.7% of promoters overlapped with MI.
To visualize the MI in the C. virginica genome, I decided to create boxplots similar to Jeong et al.. I started by importing MI-feature track overlaps in this R Markdown script. For each file, I calculated the length of the MI, then divided the base piars that overlapped by the MI length. This gave me an overlap percentage that I used in boxplots! I created a base plot, but decided not to spend too much time on it now and return to it for my reviewer comments in a few days. The code I used for calculations and to make the plot was important than actually making the plot pretty and saving it.
C. gigas MI
Now that I generated C. virginic MI, it’s time to move onto C. gigas MI.
General gonad methylation landscape
To identify methylated CpGs in the C. gigas genome, I need to characterize the general methylation landscape. Although Claire and Mac did this previously with ctendia, I wanted to go through this process with my gonad samples in the event that somatic and gonad tissue had variations in methylation landscapes. Additionally, Claire and Mac identified mCpGs using 5x data, and all of my analyses were done with 10x data.
Rerunning bismark
When I went to access my coverage files on ganent
, I found that I had overwritten these files! the folder with bismark
output no longer existed…which meant I had to rerun bismark
. :sob:
Based on previous lab notebook posts detailing my struggle running mox
last time, I made a quick change to this script so my deduplication would run only on my sample files, and not proceed iteratively through a bunch of other files. I used the following code to transfer my files over to mox
:
rsync --archive --progress --verbose yaamini@172.25.149.226:/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2019-08-30-Bismark-Parameter-Testing/*fastq.gz . #Transfer sample files to scrubbed/yaamini/data/2019-09-03-Trimmed-Files
rsync --archive --progress --verbose yaamini@172.25.149.226:/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2019-08-30-Bismark-Parameter-Testing/Crassostrea_gigas.oyster_v9.dna_sm.toplevel . #Transfer bisulfite converted genome to scrubbed/yaamini/data/2019-09-03-Bismark-Inputs
rsync --archive --progress --verbose yaamini@172.25.149.226:/Users/yaamini/Documents/project-gigas-oa-meth/scripts/2019-09-12-Bismark.sh . #Transfer script to srlab/yaaminiv
It took me a few failed runs to realize I transfered files to the incorrect directories or didn’t transfer the files at all. I definitely opted for a more complicated file structure in my script but if I would have perused the script more carefully beforehand it wouldn’t have been a problem. A good reminder for me to really read the script! Once every file was where I said it would be, I ran my script and hoped for the best. I used the following code to check up on progress:
cat slurm-1786754.out
Even though I set up the script to run through all of bismark
, I thankfully only need the coverage files to proceed with my analyses.
One day and six hours later, my bismark
run completed! I used this code to transfer the files off mox
:
rsync --archive --progress --verbose /gscratch/scrubbed/yaamini/analyses/Gigas-WGBS/2019-09-11-Gigas-Bismark/* yaamini@172.25.149.226:/Volumes/web/spartina/2019-09-03-project-gigas-oa-meth/analyses/2019-09-11-Bismark #Transfer files to gannet
Characterizing the methylation landscape
While my files were processing, I set up this Jupyter notebook to create a master file and characterize methylation. Using the head
of two C. virginica coverage files, I used a series of commands to merge the information in both coverage files. I then used these commands on the coverage files I got from bismark
. Yay efficiency!
Before I created a master coverage file, I quickly created 5x and 10x sample bedgraphs in this Jupyter notebook since the output wasn’t on gannet
or in my repository…RIP.
Alright, BACK TO THE GOOD STUFF. First, I created a new column in each coverage file that combined chromosome and start bp information. I will need to use this information to sort files and join them.
for f in ../2019-09-13-IGV-Verification/*cov
do
awk '{print $1"-"$2"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' ${f} \
| sort -k1,1 \
> ${f}.sorted
done
Next, I joined the two files by the chr-start column. In order to simulate an outer join, I specified that unmatched lines in each file should be included in the output using the -a
argument. The only time a locus would be in one file and not another is when there are no reads for that locus in one sample. To clean up the file, I replaced any empty fields (methylation percentage, count methylated, count unmethylated) with 0 using -e 0
. I then specified which columns I wanted printed in the output using -o
. Before saving the output, I used tr
to translate the “-“ in the first column to a tab delimiter. This way, I did not have to deal with any redundancy in the locus start positions from each files or figure out how to merge the columns while dealing with zeroes.
#Join the first column in the first file with the first column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
#Print unpairable lines in file 1 (-a1) and 2 (-a2) to simulate an outer join. Replace empty fields with 0 (-e) and only print the following fields (-o)
#Convert - to \t to uncouple the chromosome and start position
!join -1 1 -2 1 \
-t $'\t' \
-a1 -a2 -e 0 -o '0,1.5,1.6,1.7,2.5,2.6,2.7' \
FILE \
FILE \
| tr '-' "\t" \
> cgigas_gonad-10x_raw.cov
Finally, I consolidated the methylated and unmethylated read counts from each file, summed the total number of reads at each locus, and calculated the revised methylation percentage.
#Calculate total count methylated and unmethylated for each locus
#Sum number of reads at each locus
#Calculating revised methylation percentage for each locus
#Multiply percentages by 100
!awk '{ print $1"\t"$2"\t"$2"\t"$4+$7"\t"$5+$8}' cgigas_gonad-10x_raw.cov \
| awk '{ print $1"\t"$2"\t"$3"\t"$4+$5"\t"$4"\t"$5}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$5/$4"\t"$5"\t"$6}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$4*100"\t"$5"\t"$6}' \
> cgigas_gonad-10x_concat.cov
I have data for 12,728,174 CpGs at 10x read depth. After reducing the dataset to loci with at least 10x coverage, I identified methylated (≥ 50%), sparsely methylated (≥ 10%, < 50%), and unmethylated (< 10%) CpG loci in this Jupyter notebook. I also characterized the location of these loci in relation to various genome features.
Table 1. General C. gigas gonad methylation landscape.
Feature | All 10x Loci | Methylated Loci | Sparsely Methylated Loci | Unmethylated Loci |
---|---|---|---|---|
Number of CpGs | 12728174 | 1677041 | 2267700 | 8783433 |
Exons | 1803226 | 559020 | 199748 | 1044458 |
Introns | 3834047 | 706876 | 768464 | 2358707 |
Genes | 5637273 | 1265896 | 968212 | 3403165 |
Transposable Elements | 612466 | 41024 | 161419 | 410023 |
Putative Promoters | 803342 | 59704 | 118610 | 625028 |
Other | 5998040 | 354355 | 1093226 | 4550459 |
Creating the MI track
In this Jupyter notebook I generated the methylation island track using the same parameters I used for C. virginica.
Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.
Initial Window Size (bp) | mCpG Fraction | *Step Size** | Number of MI | Number of MI after Filtering | Max mCpG in MI | Min mCpG in MI |
---|---|---|---|---|---|---|
500 | 0.02 | 50 | 38443 | 23173 | 3298 | 11 |
I then visualized the MI in this IGV session. The MI in C. gigas are much more sparse, which is different than C. virginica.
Figures 1-4. Methylation islands in the C. gigas gonad
Location in the genome
In this Jupyter notebook, I used bedtools
to characterize the overlaps between C. gigas MI and genome feature tracks! I was able to run the code and get counts, but the output format for the Intron-MI and Promoter-MI files looked wonky:
Figure 5. Weird file formatting
I posted this issue to figure out why my output looked this way, since I needed proper output to correctly count overlaps and create MI overlap figures.
Table 3. Overlaps between methylation islands and various genome features.
Feature | Number of Overlaps |
---|---|
Exons | 49607 (25.2%) |
Introns | 51980 (29.5%) |
Genes | 8761 (31.3%) |
Transposable Elements | 5868 (4.9%) |
Putative Promoters | 2746 (9.8%) |
DML | 384 (61.1%) |
The sparse MI definitely translates to fewer overlaps between DML, but similar overlaps between MI, exons, introns, and genes in both species. Since I don’t trust the intron output, I’m unsure if the actual number here is correct, but in the offchance it is, I included it. Transposable elements and promoters in the C. virginica genome had more overlaps with MI than those same features in the C. gigas genome.
In this R Markdown file I created a standalone C. gigas MI overlap boxplot as well as a grouped barplot with C. virginica and C. gigas data. Again, I didn’t pay too much attention to the standalone plot since I figured I’d come back and use it again when I had more data. Even though my intron overlap data formatting issue prevented me from creating a boxplot with C. gigas intron information, I created a grouped barplot for C. virginica and C. gigas data. I learned you could use at
to set where on the x-axis to place individual boxplots to artificially group them!
Figure 6. Exon, intron, and transposable element overlaps. C. gigas intron-MI overlap data is missing due to output formatting issues.
I think some of the differences are a little clearer in this boxplot. Time to move on from this analysis!
Going forward
- Fix C. gigas intron-MI and promoter-MI formatting issues
- Update C. gigas MI overlap counts and grouped barplot
- Compare C. gigas and C. virginica DML and GOSlim terms
- Draft poster for ASLO and get feedback
- Finalize ASLO poster and send for printing by Thursday to get the ASLO discount!