MethCompare Part 17

Fixing characterization code and remaking barplots

Welp I’m still working on this pipeline!

Checking duplicates in `bedtools`

When looking at the CDS track, Hollie noticed that CDS from alternative transcripts are included as individual entries, even if they overlap. Mac suggested that I check my bedtools code and confirm that I’m only recording one overlap for each CpG. I looked at this Jupyter notebook to check my CpG genomic location code. I was using bedtools -wb, which printed all overlapping regions with the CpG. I only needed to record the CpG location if it overlapped with my feature track once, so I changed it to bedtools -u and reran my script. I saved the text files with counts for each species to use downstream.

Remaking barplots

I used the text files with counts for the number of overlaps between CpGs and genome features to remake my stacked barplots in this R Markdown script. Looking at the barplots, they didn’t really change that much, but at least it reflects data without any duplicate overlaps.

Screen Shot 2020-06-03 at 2 12 49 PM

Figure 1. Genomic location of CpGs detected by each method.

When Shelly was looking over my code in this issue she noticed that MBD-BS in M. capitata has the highest percentage of methylated CpGs according to this table, but my figure shows WGBS as having more methylated CpGs. While reviewing this R Markdown script, I realized that when creating the table, MBD-BS data was in the first row, followed by RRBS and WGBS. Instead, I assumed that WGBS would be in the first row. I really should have been tipped off by the fact that the first row had lower CpG counts since MBD-BS coverage was poorer. I added row names to reduce confusion, then reorganized my rows so that the first was WGBS data, second was RRBS, and third was MBD-BS. I then remade the methylation status barplots so they accurately reflected the data.

Screen Shot 2020-06-03 at 1 55 25 PM

Figure 2. Methylation status of CpGs detected by each method.

I added both figures to this issue to keep track of the changes I made and the motivation for doing so.

Thinking more about statistics

After doing some research and reviewing multivariate statistics notes, I think our best course of action is to use either an ANOSIM or perMANOVA. We have a multivariate data set-up: one independent variable (sequencing method) and multiple dependent variables (methylation status or genomic location). Using individual sample information, I can assess within-group and between-group differences. I could use a global ANOSIM or perMANOVA to first see if there are differences between groups. Then I can use pairwise multivariate tests to look at method differences and finally, use post-hoc tests (simper or disper) to understand significant pairwise differences. I posted my thoughts in this issue for feedback.

Updating scripts README

In our last meeting we decided to clean up the scripts README. I went through and ordered the scripts I created in the order they should be run. I also included brief descriptions about each script. I removed my Jupyter notebook that characterized 10x data, since we’re using 5x data instead of 10x! I thought about removing my script for creating figures with indiviudal sample data, but I thought maybe those figures might be useful for the supplement, or I could use the script to explore individual-level multivariate analyses.

Going forward

Finalize statistical methods for comparison and conduct them
Locate TE tracks
[Characterize intersections between data and TE, and create summary tables]
Look into program for mCpG identification

Written on June 3, 2020