MethCompare Part 10

Revisiting CpG distributions

Based on feedback from Friday’s meeting, I concatenated my CpG distribution data and created stacked barplots to investigate trends by methods in this R Markdown script.

Chi-squared tests

The first thing I needed to do was combine data from the three replicates for each method. To do this, I created a new empty data frame. For each column (total lines, methylated CpGs, sparsely methylated CpGs, umethylated CpGs), I took the sum of the three samples for each method. Then I obtained percentages for each CpG type. This is the data table I used with prop.test.

Since my datatable had columns for percent CpG and rows for each method, I transposed the table to use in prop.test. I conducted pairwise tests within each species that compared the distribution of CpG type in one method with another.

Table 1. Chi-squared test results for M. capitata

Comparison Chi-squared df p-value
WGBS vs. RRBS 3.7661 2 0.1521
WGBS vs. MBD-BSSeq 2.2279 2 0.3283
RRBS vs. MBD-BSSeq 9.7074 2 0.0078

Table 2. Chi-squared test results for P. acuta

Comparison Chi-squared df p-value
WGBS vs. RRBS 0.012088 2 0.994
WGBS vs. MBD-BSSeq 10.811 2 0.004493
RRBS vs. MBD-BSSeq 11.352 2 0.003427

As I expected, the distribution of CpG type is significantly different between RRBS and MBD-BSSeq in both species. It’s interesting to see that the WGBS vs. MBD-BSSeq comparison is ony different for P. acuta. This could be because M. capitata is a more heavily-methylated species, so by default WGBS does a good job picking up heavily methylated regions.

I also conducted similar pairwise tests to look at method sensitivity between species.

Table 3. Chi-squared test results for methods compared between M. capitata and P. acuta

Method Compared Chi-squared df p-value
WGBS 9.2868 2 0.009625
RRBS 3.6296 2 0.1629
MBD-BSSeq 1.5601 2 0.4584

It looks like WGBS is different between the two species, which would feed into the hypothesis that M. capitata is a more heavily methylated species to begin with.

Creating stacked barplots

The next thing I wanted to do with this data is create stacked barplots similar to what I made for the C. virginica gonad methylation paper. I wanted each bar to be a different method, and then I wanted to “stack” the percent of CpGs that were methylated, sparsely methylated, or unmethylated. This was simple enough, but the tricky part was figuring out how to make each method a different color with a separate gradient to signify the different CpG types. I thought I could just create a color vector with nine entries instead of three, but barplot reused the first three entries for all the bars! After doing some digging, I found this link. To do this, I needed to create a new matrix from the original data I’m plotting, then finding a way to ignore the other columns and column names.

xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-1] <- NA #Ignore all other columns
colnames(xx)[-1] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[1], plotColors[2], plotColors[3]), add=T, axes=F) #Add WGBS greens

xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-2] <- NA #Ignore all other columns
colnames(xx)[-2] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[4], plotColors[5], plotColors[6]), add=T, axes=F) #Add RRBS blues

xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-3] <- NA #Ignore all other columns
colnames(xx)[-3] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[7], plotColors[8], plotColors[9]), add=T, axes=F) #Add MBD-BSSeq reds

After I figured that out, I found a way to add line segments and text to indicate which pairwise comparisons were significant.

segments(x0 = 2, y0 = 106, x1 = 3, y1 = 106, lwd = 1.5) #Add line connecting RRBS and MBD-BSSeq
segments(x0 = 2, y0 = 106, x1 = 2, y1 = 103, lwd = 1.5) #Add left segment from connecting line to RRBS
segments(x0 = 3, y0 = 106, x1 = 3, y1 = 103, lwd = 1.5) #Add right segment from connecting line to MBD
text("p = 0.008", x = 2.5, y = 108, cex = 1.2) #Add p-value

I messed around with adding a greyscale legend, but I didn’t like the way it looked. I’ll deal with that later.

Screen Shot 2020-05-04 at 4 46 33 PM

Figure 1. CpG distribution for M. capitata (pdf here)

Screen Shot 2020-05-04 at 4 46 42 PM

Figure 2. CpG distribution for P. acuta (pdf here)

Second thoughts

After all of this work, I started to have some doubts about how I concatenated the data. I summed all the categories between the three samples then calculated percentages. However, Hollie used averages to create Circos plots. I wasn’t sure which method was better (and I went back-and-forth when I was doing it in the first place), so I commented in this issue only an hour after I closed it. Whoops.

Going forward

  1. Figure out if my concatenation method is valid
  2. Create tracks for 1 kb upstream and downstream flanks
  3. Locate TE tracks
  4. Characterize intersections between data, flanking regions, and TE and create summary tables
  5. Create figures for CpG characterization in various genome features
  6. Update methods
  7. Update results
  8. Figure out methylation island analysis
Written on May 4, 2020