MethCompare Part 16

Genome features and running the CpG characterization pipeline (again)

I was minding my business yesterday, updating our draft paper, when I noticed something fishy. That fishy thing led me to another fishy thing, so I guess my GitHub-free week was a fantasy after all.

Intergenic-CG motif overlaps

A few weeks ago, I copied and pasted the CG motif overlap summary tables from this Jupyter notebook into the paper. I decided to update this table with upstream and downstream flank information, as well as the revised intergenic overlap counts since I modified this track recently. When looking back at my Jupyter notebook, I noticed I never characterized CG motif overlaps with the revised intergenic track! I totally thought I did this, but I guess not. In the same Jupyter notebook, I used intersectBed to get these counts and added them to the table. I then reran my R Markdown script. Since I revised the number of CpGs in intergenic regions, I needed these updated counts incorporated when I did my chi-squared tests. While going through this process, I realized that this information was already captured in the genomic location stacked barplot figure. I made sure all of my counts, figures, and tables were up-to-date with this new information, then I deleted the original CG motif overlap summary table that started me down this rabbit hole.
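As a rough illustration of what that overlap count amounts to, here is a minimal pure-Python sketch. It is not the intersectBed command from the notebook, only a stand-in for the same idea: count the CG motifs that overlap at least one interval in a feature track. The function name and the coordinates are hypothetical, and the sketch assumes BED-style coordinates (0-based start, exclusive end) and feature intervals that do not overlap each other on the same chromosome.

```python
from bisect import bisect_left
from collections import defaultdict


def count_overlaps(motifs, features):
    """Count motifs that overlap at least one feature interval.

    Both inputs are lists of (chrom, start, end) tuples in BED-style
    coordinates. Features on the same chromosome are assumed not to
    overlap one another (an assumption of this sketch).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in features:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    count = 0
    for chrom, start, end in motifs:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        starts = [s for s, _ in intervals]
        # Index of the last feature that begins before the motif ends.
        idx = bisect_left(starts, end) - 1
        # With non-overlapping features, only that one needs checking.
        if idx >= 0 and intervals[idx][1] > start:
            count += 1
    return count


# Made-up coordinates, for illustration only.
intergenic = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 5, 15)]
cg_motifs = [("chr1", 8, 10), ("chr1", 10, 12), ("chr1", 19, 21),
             ("chr2", 0, 2), ("chr3", 1, 3)]
print(count_overlaps(cg_motifs, intergenic))
```

Rebuilding the list of starts for every motif keeps the sketch short but is wasteful; for genome-scale tracks a dedicated tool such as bedtools is the better choice.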

CpG characterizations

Alright, back to updating the results section. When looking for the number of lines for each method-specific union bedgraph, I noticed the second fishy thing. According to my union bedgraph averaging output, the number of CpG loci with WGBS data was lower than the number of loci with MBD-BS data in M. capitata! I quickly checked my 5x union data characterization pipeline. When selecting columns from my pandas output, I noticed that row numbers were saved as the first column. Past Yaamini took this into account when putting in column numbers to subset for the chromosome, start, and end positions, but she didn’t use the correct column numbers to also include the column with average percent methylation data for each species!! As a result, I ended up using sample 18 data as the WGBS average, the WGBS average as the RRBS average, and the RRBS average as the MBD-BS average. Yikes.
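A minimal sketch of how this kind of off-by-one column slip happens, and one way to guard against it. The column names and values are hypothetical, and the sketch uses Python's standard csv module rather than the pandas and command-line steps in the notebook. The idea is that selecting columns by header name stays correct even when an extra row-number column shifts every position.

```python
import csv
import io

# Hypothetical union table: pandas writes row numbers as an unnamed first
# column by default. Column names and values are invented for illustration.
raw = (
    ",chrom,start,end,sample18,WGBS_avg,RRBS_avg,MBDBS_avg\n"
    "0,scaffold_1,100,102,80.0,75.0,60.0,90.0\n"
    "1,scaffold_1,250,252,10.0,5.0,0.0,20.0\n"
)


def select_by_position(text, positions):
    """Subset columns by 0-based position (fragile if the layout shifts)."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    return [[row[i] for i in positions] for row in reader]


def select_by_name(text, names):
    """Subset columns by header name, so extra leading columns don't matter."""
    reader = csv.DictReader(io.StringIO(text))
    return [[row[name] for name in names] for row in reader]


# Positions 1-3 correctly skip the row-number column, but position 4 is
# sample18, not the WGBS average (which sits at position 5).
buggy = select_by_position(raw, [1, 2, 3, 4])
fixed = select_by_name(raw, ["chrom", "start", "end", "WGBS_avg"])

print(buggy[0])
print(fixed[0])
```

Writing the table with `index=False` in `DataFrame.to_csv` also keeps the row numbers out of the file in the first place.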

I fixed this error for M. capitata and P. acuta, then reran everything in this Jupyter notebook. I then took the output (line count information) and reran my R Markdown script. The output for both scripts can be found here. Immediately, I noticed changes to the plots I made (found here).


Figure 1. CpG methylation status by sequencing method.

For both species, WGBS detects a higher percentage of strongly methylated CpGs. MBD-BS detects more strongly methylated CpGs than RRBS, but in P. acuta neither enrichment method gets close to WGBS.


Figure 2. CpG genomic location by sequencing method.

Looking at genomic locations for M. capitata, MBD-BS and WGBS appear to perform similarly, while the RRBS distribution looks more like that of all CpGs in the genome. MBD-BS may have slightly more CpGs in CDS. In P. acuta, it’s clear that all methods show some degree of enrichment or gene bias, and it’s strongest in MBD-BS.


Figures 3-4. Proportion strongly methylated CpGs in genomic features for M. capitata and P. acuta. Scale for P. acuta is double that of M. capitata.

When considering the number of strongly methylated CpGs divided by the total number of CpGs in a given genomic feature, WGBS performs the best in M. capitata. In contrast, MBD-BS performs better than WGBS and RRBS in P. acuta. These differences may be due to poor MBD-BS coverage in M. capitata and the higher baseline methylation in M. capitata.

Nothing really jumped out at me when I skimmed the pairwise contingency test results, but the visualizations are different. I think they may be more consistent with our overall hypotheses, which is good. I updated the tables, figures, and text with this information.
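For anyone unfamiliar with what a contingency test computes, here is a minimal sketch of a Pearson chi-squared statistic on a table of counts. It assumes the pairwise tests are ordinary chi-squared tests on method-by-category count tables. The real tests live in the R Markdown script, the function name is hypothetical, and the counts below are invented.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom) for a table of counts.

    `table` is a list of rows, e.g. one row per sequencing method and one
    column per category. No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Invented counts: rows are two methods, columns are two categories.
stat, dof = chi_squared([[10, 20], [20, 10]])
print(stat, dof)
```

Note that R's `chisq.test` applies a continuity correction to 2x2 tables by default, so its statistic will differ from this uncorrected version unless `correct = FALSE` is set.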

Going forward

  1. Locate TE tracks
  2. Characterize intersections between data and TE, and create summary tables
  3. Perform statistical comparisons for upset plot data
  4. Look into program for mCpG identification
Written on May 20, 2020