MethCompare Part 13
Revamping distribution plots and creating genome feature plots
After testing my characterization pipeline with union bedgraphs, I needed to make stacked barplots with that data. I also need to explore CpG locations in genome features between methods using various CpG categories. It’s time to make plots on plots on plots.
CpG distribution stacked barplots
I created this R Markdown document to run my code. I contemplated putting it in my previous script, but I thought keeping the workflow separate like I did for the Jupyter notebooks would be better. I imported the tab-delimited files with the file counts for each species and created summary tables (M. capitata here and P. acuta here). I then isolated the percent information and used it for pairwise chi-squared tests and to create stacked barplots.
Table 1. Pairwise chi-squared results for M. capitata, with chi-squared statistics along the top and p-values at the bottom (df = 2). The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs.
Method | WGBS | RRBS | MBD-BS |
---|---|---|---|
WGBS | - | 2.2883 | 7.8469 |
RRBS | 0.3185 | - | 2.3195 |
MBD-BS | 0.01977 | 0.3136 | - |
Table 2. Pairwise chi-squared results for P. acuta, with chi-squared statistics along the top and p-values at the bottom (df = 2). The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs.
Method | WGBS | RRBS | MBD-BS |
---|---|---|---|
WGBS | - | 0.39734 | 8.2335 |
RRBS | 0.8198 | - | 10.349 |
MBD-BS | 0.0163 | 0.005659 | - |
It’s interesting that using the union data changed results for M. capitata but not P. acuta. The fact that MBD-BS was significantly different from WGBS is probably due to the higher percentage of methylated loci detected by MBD-BS. What these results tell me is that MBD-BS does pick up significantly more methylated loci for each species.
Previously, I used code to make each stacked bar a different color. In the middle of the night (I’m not kidding) I had an epiphany for a way to define i
so I could combine three separate code chunks into a loop. In the light of day that’s what I did:
for (i in 1:ncol(t(McapCpGTypePercents))){
xx <- t(McapCpGTypePercents)
xx[,-i] <- NA
colnames(xx)[-i] <- NA
barplot(xx, col = c(plotColors[(3*i)-2], plotColors[(3*i)-1], plotColors[(3*i)]), add = TRUE, axes = FALSE, names.arg = c("", "", ""))
} #Make each bar in the stacked barplot a different color gradient. Create a new dataframe with data for the plot and ignore all additional columns and column names. Plot the single bar with the correct color from plotColors. Do not add axes and add blank names for each column.
Figure 1. CpG distribution for M. capitata (pdf here)
Figure 2. CpG distribution for P. acuta (pdf here)
After making the stacked barplots, I conducted pairwise chi-squared tests for each method between species.
Table 3. Pairwise chi-squared results examining method sensitivity between species. The hypothesis was that each method would have the same distribution of methylated, sparsely methylated, and unmethylated CpGs for each species.
Method | Chi-squared | df | p-value |
---|---|---|---|
WGBS | 4.2001 | 2 | 0.1225 |
RRBS | 10.841 | 2 | 0.004424 |
MBD-BS | 3.9764 | 2 | 0.1369 |
Again, my results were different. Instead of WGBS, RRBS performed differently between species. This lines up well with what we saw in Hollie’s circos plots. I updated and closed this issue.
Planning plots
Before I leapt into the world of stacked barplots for genome locations, I sketched out what datasets I needed to include in each figure:
Union data:
- All CpGs in the genome (background)
- WGBS
- RRBS
- MBD-BS
P. acuta DMC:
- MBD vs. WGBS
- MBD vs. RRBS
UpSet plot data
- All CpGs in the genome (background)
- WGBS, RRBS, and MBD-BS
- WGBS and RRBS
- WGBS and MBD-BS
- RRBS and MBD-BS
- WGBS only
- RRBS only
- MBD-BS only
- None
I knew I could at least finish the first plot before our meeting, and the others until next week because I was bound to encounter questions about the methods and visualizations. Even if I didn’t make plots for the other two data categories, I could at least run them through the pipeline and visualize later.
All CpG genomic locations
Looking at these requirements, I knew my R script needed to include my CG motif intersection data. Actually, at first I had a bit of a panic thinking I needed locations for all my 5x CpG data and started messing with my script, but when I saw that Shelly used all CpGs in the genome for ther UpSet plot, I figured I could use the data I had already and add a bar for 5x CpGs if needed. I wanted to obtain this programmatically instead of copying and pasting this information into my script. I went back and saved line counts in this file for M. capitata and this file for P. acuta.
Union data genomic locations
In the same script, I imported file counts, reformatted my data, and created summary tables.
M. capitata:
P. acuta:
I worked through my pairwise chi-squared tests and things looked a bit funky. I then created two sets of stacked barplots for each species: 1) all three methods versus the location of the CpGs in the genome, and 2) the locations of the CpGs in the genome based on method and methylation status.
Figure 3. CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) across the M. capitata genome and between three methods.
Figure 4. M. capitata CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) split by CpG methylation status (methylated, sparsely methylated, and unmethylated)
Figure 5. CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) across the P. acuta genome and between three methods.
Figure 6. P. acuta CpG locations in various genome features (from bottom to top of each bar: CDS, intron, flanks, intergenic regions) split by CpG methylation status (methylated, sparsely methylated, and unmethylated)
Even these figures looked weird! The trends in these stacked barplots didn’t match the trends I observed when characterizing CpGs by individual sample. When I thought about it some more, I was starting to doubt the CpG distribution results…
Reminding myself about statistics
In this moment I was reminded of two very important statistics rules (please imagine the appropriate LOTR meme when reading):
- One cannot simply use proportions in a chi-squared tests.
I, a rusty stats human, was using pairwise chi-squared tests with proportion data instead of count data! Oy vey. I went back and revised my CpG distribtuion and genomic location pairwise tests to use count data, and ended up with everything being extremely significant (p-value < 2.2e-16). Since I have a lot more comparisons than my C. virginica gonad methylation paper, I figured that chi-squared tests may not be the most powerful method to use. I asked Shelly how she dealt with her geoduck data, and she used an ANOVA. I think after our meeting I will revise my statistics with an ANOVA, Tukey post-hoc tests, and correction for multiple comparisons.
- One cannot simply average percentages.
For these union bedgraphs, I averaged percent methylation data for three samples when I really should have done the following calculation: (count methylated sample 1 + count methylated sample 2 + count methylated sample 3)/(total reads sample 1 + total reads sample 2 + total reads sample 3). The union bedgraph information doesn’t take this into account, so I’ll bring this up at our next meeting and see if Steven can programmatically generate the files I want before I go about doing that myself. So really, the plots I generated are null and void. I posted this issue to capture my action items.
…at least my code works and I can easily plug and chug revised data?
Identifying genomic locations
Seeing how I don’t trust any sort of statistics right now, I figured the best thing I could do was generate summary tables for method-associated DMC Mac created and the upset plot data from Shelly. I created this Jupyter notebook to run through all the steps. I generated quick line count tables that can be found in this folder for M. capitata and this folder for P. acuta.
Once I had my line counts, I created this R Markdown script to piece the data tables together. Since I didn’t have to worry about methylation status, I could easily use cbind
to mash together tables before calculating percentages.
M. capitata:
P. acuta:
I also created rough stacked barplots for the upset plot data just so we had something to discuss. I will work on the statistical element later. It’s interesting to see how the bars are relatively consistent for M. capitata but have some variation for P. acuta.
Figure 7. M. capitata upset plot data in various genome features (from bottom to top of each bar: CDS, introns, flanks, intergenic regions) (pdf here
Figure 8. P. acuta upset plot data in various genome features (from bottom to top of each bar: CDS, introns, flanks, intergenic regions) (pdf here
Welp, I guess I have a lot to bring up during our weekly meeting!
Going forward
- Locate TE tracks
- Characterize intersections between data and TE, and create summary tables
- Calculate averages properly for union data and redo CpG characterization
- Perform ANOVA with post-hoc tests and correction for multiple comparisons
- Redo stacked barplot for CpG distributions for union data
- Redo stacked barplot for CpG genomic locations for union data
- Perform statistical tests for upset plot data
- Update methods
- Update results
- Figure out methylation island analysis