MethCompare Part 17
Fixing characterization code and remaking barplots
Welp I’m still working on this pipeline!
Checking duplicates in bedtools
When looking at the CDS track, Hollie noticed that CDS from alternative transcripts are included as individual entries, even if they overlap. Mac suggested that I check my bedtools
code and confirm that I’m only recording one overlap for each CpG. I looked at this Jupyter notebook to check my CpG genomic location code. I was using bedtools -wb
, which printed all overlapping regions with the CpG. I only needed to record the CpG location if it overlapped with my feature track once, so I changed it to bedtools -u
and reran my script. I saved the text files with counts for each species to use downstream.
Remaking barplots
I used the text files with counts for the number of overlaps between CpGs and genome features to remake my stacked barplots in this R Markdown script. Looking at the barplots, they didn’t really change that much, but at least it reflects data without any duplicate overlaps.
Figure 1. Genomic location of CpGs detected by each method.
When Shelly was looking over my code in this issue she noticed that MBD-BS in M. capitata has the highest percentage of methylated CpGs according to this table, but my figure shows WGBS as having more methylated CpGs. While reviewing this R Markdown script, I realized that when creating the table, MBD-BS data was in the first row, followed by RRBS and WGBS. Instead, I assumed that WGBS would be in the first row. I really should have been tipped off by the fact that the first row had lower CpG counts since MBD-BS coverage was poorer. I added row names to reduce confusion, then reorganized my rows so that the first was WGBS data, second was RRBS, and third was MBD-BS. I then remade the methylation status barplots so they accurately reflected the data.
Figure 2. Methylation status of CpGs detected by each method.
I added both figures to this issue to keep track of the changes I made and the motivation for doing so.
Thinking more about statistics
After doing some research and reviewing multivariate statistics notes, I think our best course of action is to use either an ANOSIM or perMANOVA. We have a multivariate data set-up: one independent variable (sequencing method) and multiple dependent variables (methylation status or genomic location). Using individual sample information, I can assess within-group and between-group differences. I could use a global ANOSIM or perMANOVA to first see if there are differences between groups. Then I can use pairwise multivariate tests to look at method differences and finally, use post-hoc tests (simper or disper) to understand significant pairwise differences. I posted my thoughts in this issue for feedback.
Updating scripts README
In our last meeting we decided to clean up the scripts README. I went through and ordered the scripts I created in the order they should be run. I also included brief descriptions about each script. I removed my Jupyter notebook that characterized 10x data, since we’re using 5x data instead of 10x! I thought about removing my script for creating figures with indiviudal sample data, but I thought maybe those figures might be useful for the supplement, or I could use the script to explore individual-level multivariate analyses.
Going forward
- Finalize statistical methods for comparison and conduct them
- Locate TE tracks
- [Characterize intersections between data and TE, and create summary tables]
- Look into program for mCpG identification