Gigas and Virginica Comparison Part 2

C. gigas vs. C. virginica: general methylation landscapes

Now that I’ve tested methylation island analysis, I want to finalize my C. virginica and C. gigas MI tracks and look at the location of MI in the genome and in relation to DML.

C. virginica MI

Finalizing track parameters

Based on this issue, I tested out 500 bp windows and a 0.02 mCpG fraction with either a 25 bp or 50 bp step size.

Table 1. Testing 500 bp window sizes and different step sizes

Initial Window Size (bp) mCpG Fraction **Step Size (bp) ** Number of MI Max mCpG in MI Min mCpG in MI
500 0.02 25 64795 24777 10
500 0.02 50 63483 24777 10

I then visualized the tracks in IGV! The two 500 bp tracks are pretty comparable, but there are definitely places without MI in the 500 bp tracks that were in the 200 and 300 bp tracks.

Screen Shot 2020-02-08 at 3 19 06 PM

Screen Shot 2020-02-08 at 3 19 27 PM

Screen Shot 2020-02-08 at 3 20 08 PM

Screen Shot 2020-02-08 at 3 20 54 PM

Screen Shot 2020-02-08 at 3 24 32 PM

Figures 1-5. MI tracks with 200, 300, and 500 bp windows and 0.02 mCpG fractions against all 5x CpGs, methylated CpGs, and mRNA.

When I asked for feedback on the tracks in this issue, Steven pointed out that some of the MI were in fact less than 500 bp.

Screen Shot 2020-02-08 at 4 16 08 PM

Figure 6. MI that is at least 500 bp.

Screen Shot 2020-02-08 at 4 19 57 PM

Figure 7. MI that is less than 500 bp.

I don’t undertand why the script still created MI that were less than 500 bp! I decided to manually filter by length using this code:

!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4}}' \

Almost 30,000 of the MI I identified did not meet this size threshold!

Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.

Initial Window Size (bp) mCpG Fraction Step Size (bp) Number of MI Number of MI after Filtering Max mCpG in MI Min mCpG in MI
500 0.02 25 64795 36060 24777 11
500 0.02 50 63483 37063 24777 11

Based on the feedback, I decided to move forward with the filtered 500 bp windows, 0.02 mCpG fraction, and 50 bp window sizes.

Location in the genome

In this Jupyter notebook, I characterized the location of genome feature tracks and DML with respect to MI.

Table 3. MI overlaps with genome feature tracks and DML.

Feature Number of Overlaps with MI
Exons 240133
Introns 92472
Genes 15009
mRNA 29483
Transposable Elements 107926
Putative Promoters 8846
DML 537
Hypermethylated DML 283
Hypomethylated DML 254

Most of my DML (89.7%) were in my MI! Additionally, 32.8% of exons, 29.2% of introns, 49.0% of mRNA, and 38.6% of genes overlapped with MI. In contrast, only 15.6% of transposable elements and 14.7% of promoters overlapped with MI.

To visualize the MI in the C. virginica genome, I decided to create boxplots similar to Jeong et al.. I started by importing MI-feature track overlaps in this R Markdown script. For each file, I calculated the length of the MI, then divided the base piars that overlapped by the MI length. This gave me an overlap percentage that I used in boxplots! I created a base plot, but decided not to spend too much time on it now and return to it for my reviewer comments in a few days. The code I used for calculations and to make the plot was important than actually making the plot pretty and saving it.

C. gigas MI

Now that I generated C. virginic MI, it’s time to move onto C. gigas MI.

General gonad methylation landscape

To identify methylated CpGs in the C. gigas genome, I need to characterize the general methylation landscape. Although Claire and Mac did this previously with ctendia, I wanted to go through this process with my gonad samples in the event that somatic and gonad tissue had variations in methylation landscapes. Additionally, Claire and Mac identified mCpGs using 5x data, and all of my analyses were done with 10x data.

Rerunning bismark

When I went to access my coverage files on ganent, I found that I had overwritten these files! the folder with bismark output no longer existed…which meant I had to rerun bismark. :sob:

Based on previous lab notebook posts detailing my struggle running mox last time, I made a quick change to this script so my deduplication would run only on my sample files, and not proceed iteratively through a bunch of other files. I used the following code to transfer my files over to mox:

rsync --archive --progress --verbose yaamini@*fastq.gz . #Transfer sample files to scrubbed/yaamini/data/2019-09-03-Trimmed-Files
rsync --archive --progress --verbose yaamini@ . #Transfer bisulfite converted genome to scrubbed/yaamini/data/2019-09-03-Bismark-Inputs
rsync --archive --progress --verbose yaamini@ . #Transfer script to srlab/yaaminiv

It took me a few failed runs to realize I transfered files to the incorrect directories or didn’t transfer the files at all. I definitely opted for a more complicated file structure in my script but if I would have perused the script more carefully beforehand it wouldn’t have been a problem. A good reminder for me to really read the script! Once every file was where I said it would be, I ran my script and hoped for the best. I used the following code to check up on progress:

cat slurm-1786754.out

Even though I set up the script to run through all of bismark, I thankfully only need the coverage files to proceed with my analyses.

One day and six hours later, my bismark run completed! I used this code to transfer the files off mox:

rsync --archive --progress --verbose /gscratch/scrubbed/yaamini/analyses/Gigas-WGBS/2019-09-11-Gigas-Bismark/* yaamini@ #Transfer files to gannet
Characterizing the methylation landscape

While my files were processing, I set up this Jupyter notebook to create a master file and characterize methylation. Using the head of two C. virginica coverage files, I used a series of commands to merge the information in both coverage files. I then used these commands on the coverage files I got from bismark. Yay efficiency!

Before I created a master coverage file, I quickly created 5x and 10x sample bedgraphs in this Jupyter notebook since the output wasn’t on gannet or in my repository…RIP.

Alright, BACK TO THE GOOD STUFF. First, I created a new column in each coverage file that combined chromosome and start bp information. I will need to use this information to sort files and join them.

for f in ../2019-09-13-IGV-Verification/*cov
  awk '{print $1"-"$2"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' ${f} \
  | sort -k1,1 \
  > ${f}.sorted

Next, I joined the two files by the chr-start column. In order to simulate an outer join, I specified that unmatched lines in each file should be included in the output using the -a argument. The only time a locus would be in one file and not another is when there are no reads for that locus in one sample. To clean up the file, I replaced any empty fields (methylation percentage, count methylated, count unmethylated) with 0 using -e 0. I then specified which columns I wanted printed in the output using -o. Before saving the output, I used tr to translate the “-“ in the first column to a tab delimiter. This way, I did not have to deal with any redundancy in the locus start positions from each files or figure out how to merge the columns while dealing with zeroes.

#Join the first column in the first file with the first column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
#Print unpairable lines in file 1 (-a1) and 2 (-a2) to simulate an outer join. Replace empty fields with 0 (-e) and only print the following fields (-o)
#Convert - to \t to uncouple the chromosome and start position

!join -1 1 -2 1 \
-t $'\t' \
-a1 -a2 -e 0 -o '0,1.5,1.6,1.7,2.5,2.6,2.7' \
| tr '-' "\t" \
> cgigas_gonad-10x_raw.cov

Finally, I consolidated the methylated and unmethylated read counts from each file, summed the total number of reads at each locus, and calculated the revised methylation percentage.

#Calculate total count methylated and unmethylated for each locus
#Sum number of reads at each locus
#Calculating revised methylation percentage for each locus
#Multiply percentages by 100

!awk '{ print $1"\t"$2"\t"$2"\t"$4+$7"\t"$5+$8}' cgigas_gonad-10x_raw.cov \
| awk '{ print $1"\t"$2"\t"$3"\t"$4+$5"\t"$4"\t"$5}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$5/$4"\t"$5"\t"$6}' \
| awk '{ print $1"\t"$2"\t"$3"\t"$4*100"\t"$5"\t"$6}' \
> cgigas_gonad-10x_concat.cov

I have data for 12,728,174 CpGs at 10x read depth. After reducing the dataset to loci with at least 10x coverage, I identified methylated (≥ 50%), sparsely methylated (≥ 10%, < 50%), and unmethylated (< 10%) CpG loci in this Jupyter notebook. I also characterized the location of these loci in relation to various genome features.

Table 1. General C. gigas gonad methylation landscape.

Feature All 10x Loci Methylated Loci Sparsely Methylated Loci Unmethylated Loci
Number of CpGs 12728174 1677041 2267700 8783433
Exons 1803226 559020 199748 1044458
Introns 3834047 706876 768464 2358707
Genes 5637273 1265896 968212 3403165
Transposable Elements 612466 41024 161419 410023
Putative Promoters 803342 59704 118610 625028
Other 5998040 354355 1093226 4550459

Creating the MI track

In this Jupyter notebook I generated the methylation island track using the same parameters I used for C. virginica.

Tables 2. MI characteristics after filtering by size. No MI should be less than 500 bp.

Initial Window Size (bp) mCpG Fraction *Step Size** Number of MI Number of MI after Filtering Max mCpG in MI Min mCpG in MI
500 0.02 50 38443 23173 3298 11

I then visualized the MI in this IGV session. The MI in C. gigas are much more sparse, which is different than C. virginica.

Screen Shot 2020-02-13 at 2 46 36 AM

Screen Shot 2020-02-13 at 2 46 59 AM

Screen Shot 2020-02-13 at 2 48 38 AM

Screen Shot 2020-02-13 at 2 49 12 AM

Figures 1-4. Methylation islands in the C. gigas gonad

Location in the genome

In this Jupyter notebook, I used bedtools to characterize the overlaps between C. gigas MI and genome feature tracks! I was able to run the code and get counts, but the output format for the Intron-MI and Promoter-MI files looked wonky:

Screen Shot 2020-02-13 at 9 20 36 AM

Figure 5. Weird file formatting

I posted this issue to figure out why my output looked this way, since I needed proper output to correctly count overlaps and create MI overlap figures.

Table 3. Overlaps between methylation islands and various genome features.

Feature Number of Overlaps
Exons 49607 (25.2%)
Introns 51980 (29.5%)
Genes 8761 (31.3%)
Transposable Elements 5868 (4.9%)
Putative Promoters 2746 (9.8%)
DML 384 (61.1%)

The sparse MI definitely translates to fewer overlaps between DML, but similar overlaps between MI, exons, introns, and genes in both species. Since I don’t trust the intron output, I’m unsure if the actual number here is correct, but in the offchance it is, I included it. Transposable elements and promoters in the C. virginica genome had more overlaps with MI than those same features in the C. gigas genome.

In this R Markdown file I created a standalone C. gigas MI overlap boxplot as well as a grouped barplot with C. virginica and C. gigas data. Again, I didn’t pay too much attention to the standalone plot since I figured I’d come back and use it again when I had more data. Even though my intron overlap data formatting issue prevented me from creating a boxplot with C. gigas intron information, I created a grouped barplot for C. virginica and C. gigas data. I learned you could use at to set where on the x-axis to place individual boxplots to artificially group them!

Screen Shot 2020-02-13 at 9 43 38 AM

Figure 6. Exon, intron, and transposable element overlaps. C. gigas intron-MI overlap data is missing due to output formatting issues.

I think some of the differences are a little clearer in this boxplot. Time to move on from this analysis!

Going forward

  1. Fix C. gigas intron-MI and promoter-MI formatting issues
  2. Update C. gigas MI overlap counts and grouped barplot
  3. Compare C. gigas and C. virginica DML and GOSlim terms
  4. Draft poster for ASLO and get feedback
  5. Finalize ASLO poster and send for printing by Thursday to get the ASLO discount!
Written on February 8, 2020