MethCompare Part 13

Revamping distribution plots and creating genome feature plots

After testing my characterization pipeline with union bedgraphs, I needed to make stacked barplots with that data. I also need to explore CpG locations in genome features between methods using various CpG categories. It’s time to make plots on plots on plots.

CpG distribution stacked barplots

I created this R Markdown document to run my code. I contemplated putting it in my previous script, but I thought keeping the workflow separate like I did for the Jupyter notebooks would be better. I imported the tab-delimited files with the file counts for each species and created summary tables (M. capitata here and P. acuta here). I then isolated the percent information and used it for pairwise chi-squared tests and to create stacked barplots.

Table 1. Pairwise chi-squared results for M. capitata, with chi-squared statistics along the top and p-values at the bottom (df = 2). The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs.

Method WGBS RRBS MBD-BS
WGBS - 2.2883 7.8469
RRBS 0.3185 - 2.3195
MBD-BS 0.01977 0.3136 -

Table 2. Pairwise chi-squared results for P. acuta, with chi-squared statistics along the top and p-values at the bottom (df = 2). The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs.

Method WGBS RRBS MBD-BS
WGBS - 0.39734 8.2335
RRBS 0.8198 - 10.349
MBD-BS 0.0163 0.005659 -

It’s interesting that using the union data changed results for M. capitata but not P. acuta. The fact that MBD-BS was significantly different from WGBS is probably due to the higher percentage of methylated loci detected by MBD-BS. What these results tell me is that MBD-BS does pick up significantly more methylated loci for each species.

Previously, I used code to make each stacked bar a different color. In the middle of the night (I’m not kidding) I had an epiphany for a way to define i so I could combine three separate code chunks into a loop. In the light of day that’s what I did:

for (i in 1:ncol(t(McapCpGTypePercents))){
    xx <- t(McapCpGTypePercents)
    xx[,-i] <- NA
    colnames(xx)[-i] <- NA
    barplot(xx, col = c(plotColors[(3*i)-2], plotColors[(3*i)-1], plotColors[(3*i)]), add = TRUE, axes = FALSE, names.arg = c("", "", "")) 
} #Make each bar in the stacked barplot a different color gradient. Create a new dataframe with data for the plot and ignore all additional columns and column names. Plot the single bar with the correct color from plotColors. Do not add axes and add blank names for each column.

Screen Shot 2020-05-07 at 1 24 40 PM

Figure 1. CpG distribution for M. capitata (pdf here)

Screen Shot 2020-05-07 at 1 24 21 PM

Figure 2. CpG distribution for P. acuta (pdf here)

After making the stacked barplots, I conducted pairwise chi-squared tests for each method between species.

Table 3. Pairwise chi-squared results examining method sensitivity between species. The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs for each species.

Method Chi-squared df p-value
WGBS 4.2001 2 0.1225
RRBS 10.841 2 0.004424
MBD-BS 3.9764 2 0.1369

Again, my results were different. Instead of WGBS, RRBS performed differently between species. This lines up well with what we saw in Hollie’s circos plots. I updated and closed this issue.

Planning plots

Before I leapt into the world of stacked barplots for genome locations, I sketched out what datasets I needed to include in each figure:

Union data:

  • All CpGs in the genome (background)
  • WGBS
  • RRBS
  • MBD-BS

P. acuta DMC:

  • MBD vs. WGBS
  • MBD vs. RRBS

UpSet plot data

  • All CpGs in the genome (background)
  • WGBS, RRBS, and MBD-BS
  • WGBS and RRBS
  • WGBS and MBD-BS
  • RRBS and MBD-BS
  • WGBS only
  • RRBS only
  • MBD-BS only
  • None

I knew I could at least finish the first plot before our meeting, and the others until next week because I was bound to encounter questions about the methods and visualizations. Even if I didn’t make plots for the other two data categories, I could at least run them through the pipeline and visualize later.

All CpG genomic locations

Looking at these requirements, I knew my R script needed to include my CG motif intersection data. Actually, at first I had a bit of a panic thinking I needed locations for all my 5x CpG data and started messing with my script, but when I saw that Shelly used all CpGs in the genome for ther UpSet plot, I figured I could use the data I had already and add a bar for 5x CpGs if needed. I wanted to obtain this programmatically instead of copying and pasting this information into my script. I went back and saved line counts in this file for M. capitata and this file for P. acuta.

Union data genomic locations

In the same script, I imported file counts, reformatted my data, and created summary tables.

M. capitata:

P. acuta:

I worked through my pairwise chi-squared tests and things looked a bit funky. I then created two sets of stacked barplots for each species: 1) all three methods versus the location of the CpGs in the genome, and 2) the locations of the CpGs in the genome based on method and methylation status.

Screen Shot 2020-05-07 at 10 12 05 PM

Figure 3. CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) across the M. capitata genome and between three methods.

Screen Shot 2020-05-07 at 10 48 38 PM

Figure 4. M. capitata CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) split by CpG methylation status (methylated, sparsely methylated, and unmethylated)

Screen Shot 2020-05-07 at 10 49 40 PM

Figure 5. CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) across the P. acuta genome and between three methods.

Screen Shot 2020-05-07 at 10 48 56 PM

Figure 6. P. acuta CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) split by CpG methylation status (methylated, sparsely methylated, and unmethylated)

Even these figures looked weird! The trends in these stacked barplots didn’t match the trends I observed when characterizing CpGs by individual sample. When I thought about it some more, I was starting to doubt the CpG distribution results…

Reminding myself about statistics

In this moment I was reminded of two very important statistics rules (please imagine the appropriate LOTR meme when reading):

  1. One cannot simply use proportions in a chi-squared tests.

I, a rusty stats human, was using pairwise chi-squared tests with proportion data instead of count data! Oy vey. I went back and revised my CpG distribtuion and genomic location pairwise tests to use count data, and ended up with everything being extremely significant (p-value < 2.2e-16). Since I have a lot more comparisons than my C. virginica gonad methylation paper, I figured that chi-squared tests may not be the most powerful method to use. I asked Shelly how she dealt with her geoduck data, and she used an ANOVA. I think after our meeting I will revise my statistics with an ANOVA, Tukey post-hoc tests, and correction for multiple comparisons.

  1. One cannot simply average percentages.

For these union bedgraphs, I averaged percent methylation data for three samples when I really should have done the following calculation: (count methylated sample 1 + count methylated sample 2 + count methylated sample 3)/(total reads sample 1 + total reads sample 2 + total reads sample 3). The union bedgraph information doesn’t take this into account, so I’ll bring this up at our next meeting and see if Steven can programmatically generate the files I want before I go about doing that myself. So really, the plots I generated are null and void. I posted this issue to capture my action items.

…at least my code works and I can easily plug and chug revised data?

Identifying genomic locations

Seeing how I don’t trust any sort of statistics right now, I figured the best thing I could do was generate summary tables for method-associated DMC Mac created and the upset plot data from Shelly. I created this Jupyter notebook to run through all the steps. I generated quick line count tables that can be found in this folder for M. capitata and this folder for P. acuta.

Once I had my line counts, I created this R Markdown script to piece the data tables together. Since I didn’t have to worry about methylation status, I could easily use cbind to mash together tables before calculating percentages.

M. capitata:

P. acuta:

I also created rough stacked barplots for the upset plot data just so we had something to discuss. I will work on the statistical element later. It’s interesting to see how the bars are relatively consistent for M. capitata but have some variation for P. acuta.

Screen Shot 2020-05-08 at 1 18 15 AM

Figure 7. M. capitata upset plot data in various genome features (from bottom to top of each bar: CDS, introns, flanks, intergenic regions) (pdf here

Screen Shot 2020-05-08 at 1 26 18 AM

Figure 8. P. acuta upset plot data in various genome features (from bottom to top of each bar: CDS, introns, flanks, intergenic regions) (pdf here

Welp, I guess I have a lot to bring up during our weekly meeting!

Going forward

  1. Locate TE tracks
  2. Characterize intersections between data and TE, and create summary tables
  3. Calculate averages properly for union data and redo CpG characterization
  4. Perform ANOVA with post-hoc tests and correction for multiple comparisons
  5. Redo stacked barplot for CpG distributions for union data
  6. Redo stacked barplot for CpG genomic locations for union data
  7. Perform statistical tests for upset plot data
  8. Update methods
  9. Update results
  10. Figure out methylation island analysis
Written on May 7, 2020