# MethCompare Part 10

## Revisiting CpG distributions

Based on feedback from Friday’s meeting, I concatenated my CpG distribution data and created stacked barplots to investigate trends by methods in this R Markdown script.

### Chi-squared tests

The first thing I needed to do was combine data from the three replicates for each method. To do this, I created a new empty data frame. For each column (total lines, methylated CpGs, sparsely methylated CpGs, umethylated CpGs), I took the sum of the three samples for each method. Then I obtained percentages for each CpG type. This is the data table I used with `prop.test`

.

Since my datatable had columns for percent CpG and rows for each method, I transposed the table to use in `prop.test`

. I conducted pairwise tests within each species that compared the distribution of CpG type in one method with another.

**Table 1**. Chi-squared test results for *M. capitata*

Comparison |
Chi-squared |
df |
p-value |
---|---|---|---|

WGBS vs. RRBS | 3.7661 | 2 | 0.1521 |

WGBS vs. MBD-BSSeq | 2.2279 | 2 | 0.3283 |

RRBS vs. MBD-BSSeq | 9.7074 | 2 | 0.0078 |

**Table 2**. Chi-squared test results for *P. acuta*

Comparison |
Chi-squared |
df |
p-value |
---|---|---|---|

WGBS vs. RRBS | 0.012088 | 2 | 0.994 |

WGBS vs. MBD-BSSeq | 10.811 | 2 | 0.004493 |

RRBS vs. MBD-BSSeq | 11.352 | 2 | 0.003427 |

As I expected, the distribution of CpG type is significantly different between RRBS and MBD-BSSeq in both species. It’s interesting to see that the WGBS vs. MBD-BSSeq comparison is ony different for *P. acuta*. This could be because *M. capitata* is a more heavily-methylated species, so by default WGBS does a good job picking up heavily methylated regions.

I also conducted similar pairwise tests to look at method sensitivity between species.

**Table 3**. Chi-squared test results for methods compared between *M. capitata* and *P. acuta*

Method Compared |
Chi-squared |
df |
p-value |
---|---|---|---|

WGBS | 9.2868 | 2 | 0.009625 |

RRBS | 3.6296 | 2 | 0.1629 |

MBD-BSSeq | 1.5601 | 2 | 0.4584 |

It looks like WGBS is different between the two species, which would feed into the hypothesis that *M. capitata* is a more heavily methylated species to begin with.

### Creating stacked barplots

The next thing I wanted to do with this data is create stacked barplots similar to what I made for the *C. virginica* gonad methylation paper. I wanted each bar to be a different method, and then I wanted to “stack” the percent of CpGs that were methylated, sparsely methylated, or unmethylated. This was simple enough, but the tricky part was figuring out how to make each method a different color with a separate gradient to signify the different CpG types. I thought I could just create a color vector with nine entries instead of three, but `barplot`

reused the first three entries for all the bars! After doing some digging, I found this link. To do this, I needed to create a new matrix from the original data I’m plotting, then finding a way to ignore the other columns and column names.

```
xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-1] <- NA #Ignore all other columns
colnames(xx)[-1] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[1], plotColors[2], plotColors[3]), add=T, axes=F) #Add WGBS greens
xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-2] <- NA #Ignore all other columns
colnames(xx)[-2] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[4], plotColors[5], plotColors[6]), add=T, axes=F) #Add RRBS blues
xx <- t(McapCpGTypeTotalPercents) #New matrix for piecewise color addtions
xx[,-3] <- NA #Ignore all other columns
colnames(xx)[-3] <- NA #Ignore all other column names
barplot(xx, col = c(plotColors[7], plotColors[8], plotColors[9]), add=T, axes=F) #Add MBD-BSSeq reds
```

After I figured that out, I found a way to add line segments and text to indicate which pairwise comparisons were significant.

```
segments(x0 = 2, y0 = 106, x1 = 3, y1 = 106, lwd = 1.5) #Add line connecting RRBS and MBD-BSSeq
segments(x0 = 2, y0 = 106, x1 = 2, y1 = 103, lwd = 1.5) #Add left segment from connecting line to RRBS
segments(x0 = 3, y0 = 106, x1 = 3, y1 = 103, lwd = 1.5) #Add right segment from connecting line to MBD
text("p = 0.008", x = 2.5, y = 108, cex = 1.2) #Add p-value
```

I messed around with adding a greyscale legend, but I didn’t like the way it looked. I’ll deal with that later.

**Figure 1**. CpG distribution for *M. capitata* (pdf here)

**Figure 2**. CpG distribution for *P. acuta* (pdf here)

### Second thoughts

After all of this work, I started to have some doubts about how I concatenated the data. I summed all the categories between the three samples then calculated percentages. However, Hollie used averages to create Circos plots. I wasn’t sure which method was better (and I went back-and-forth when I was doing it in the first place), so I commented in this issue only an hour after I closed it. Whoops.

### Going forward

- Figure out if my concatenation method is valid
- Create tracks for 1 kb upstream and downstream flanks
- Locate TE tracks
- Characterize intersections between data, flanking regions, and TE and create summary tables
- Create figures for CpG characterization in various genome features
- Update methods
- Update results
- Figure out methylation island analysis