MethCompare Part 9

CpG distributions

I made my grand return to R today to apply handy dandy file manipulations and generate tables and figures in this R Markdown script.

Programmatic tables

After characterizing CpGs with 5x coverage, I quickly realized I needed an efficient and programmatic way to create summary tables with CpG counts and percentages. Manually pulling all of the data from wc -l output and calculating percentages by hand wasn’t cutting it anymore. The first thing I did was save wc -l output from this Jupyter notebook as text files:

M. capitata:

P. acuta:

I started by generating a summary table with the total number of CpGs with data, plus the number and percentage of methylated, sparsely methylated, and unmethylated CpGs. To do this, I imported each .txt file into R using spaces as delimiters, used cbind to combine all the columns, removed the columns with file names, and calculated percentages. I saved the final tables after reorganizing the columns.
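Here's a minimal sketch of that import, cbind, and percentage workflow. It is not the exact code from my script: the sample names, counts, and column names are made up for illustration, and the wc -l output is typed inline instead of being read from the saved .txt files.

```r
# Hypothetical wc -l output for two samples (count, then file name).
# In practice each vector would be read from one of the saved .txt files.
total_txt  <- c("  1000 sample1-5x.bedgraph", "  2000 sample2-5x.bedgraph")
meth_txt   <- c("   100 sample1-Meth",        "   400 sample2-Meth")
sparse_txt <- c("   150 sample1-sparseMeth",  "   600 sample2-sparseMeth")
unmeth_txt <- c("   750 sample1-unMeth",      "  1000 sample2-unMeth")

# Read whitespace-delimited wc -l lines into a two-column data frame
read_wc <- function(lines, count_name) {
  read.table(text = lines, header = FALSE,
             col.names = c(count_name, paste0(count_name, "_file")),
             stringsAsFactors = FALSE)
}

# Combine all columns side by side, then drop the file name columns
counts <- cbind(read_wc(total_txt, "total"),
                read_wc(meth_txt, "meth"),
                read_wc(sparse_txt, "sparse"),
                read_wc(unmeth_txt, "unmeth"))
counts <- counts[, !grepl("_file$", colnames(counts))]
rownames(counts) <- c("sample1", "sample2")

# Calculate each CpG type as a percentage of total CpGs with data
counts$percentMeth   <- counts$meth   / counts$total * 100
counts$percentSparse <- counts$sparse / counts$total * 100
counts$percentUnmeth <- counts$unmeth / counts$total * 100

# Reorganize so each count sits next to its percentage
summary_table <- counts[, c("total", "meth", "percentMeth",
                            "sparse", "percentSparse",
                            "unmeth", "percentUnmeth")]
print(summary_table)
```

This assumes every file lists the samples in the same order, since cbind matches rows by position rather than by name. It also assumes the "total" line that wc -l prints when given multiple files has already been removed.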

Creating the summary tables for the genome feature overlaps was a little more difficult. The wc -l output had information for a feature’s overlap with total, methylated, sparsely methylated, and unmethylated loci in each sample. Using a loop, I extracted the information from the correct row and saved it in a new table. Similar to the CpG type tables, I calculated percentages and reorganized the columns before saving the final output.
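The row-extraction loop can be sketched roughly like this. I'm assuming the wc -l output for a feature lists four rows per sample in a fixed order (total, methylated, sparsely methylated, unmethylated). The file names, counts, and the "gene" feature are hypothetical.

```r
# Hypothetical wc -l output for one feature: four rows per sample, in the
# assumed order total, methylated, sparsely methylated, unmethylated
overlap_txt <- c("  500 sample1-5x-gene",
                 "   50 sample1-Meth-gene",
                 "  100 sample1-sparseMeth-gene",
                 "  350 sample1-unMeth-gene",
                 "  800 sample2-5x-gene",
                 "  200 sample2-Meth-gene",
                 "  200 sample2-sparseMeth-gene",
                 "  400 sample2-unMeth-gene")
overlaps <- read.table(text = overlap_txt, header = FALSE,
                       col.names = c("count", "filename"),
                       stringsAsFactors = FALSE)

rows_per_sample <- 4
n_samples <- nrow(overlaps) / rows_per_sample

# Empty table with one row per sample
gene_table <- data.frame(total  = numeric(n_samples),
                         meth   = numeric(n_samples),
                         sparse = numeric(n_samples),
                         unmeth = numeric(n_samples))

# For each sample, pull the four counts from the correct rows
for (i in seq_len(n_samples)) {
  first_row <- (i - 1) * rows_per_sample + 1
  gene_table$total[i]  <- overlaps$count[first_row]
  gene_table$meth[i]   <- overlaps$count[first_row + 1]
  gene_table$sparse[i] <- overlaps$count[first_row + 2]
  gene_table$unmeth[i] <- overlaps$count[first_row + 3]
}

# Percent of overlapping loci in each CpG category
gene_table$percentMeth   <- gene_table$meth   / gene_table$total * 100
gene_table$percentSparse <- gene_table$sparse / gene_table$total * 100
gene_table$percentUnmeth <- gene_table$unmeth / gene_table$total * 100
print(gene_table)
```

Indexing by position only works if the row order really is fixed. Matching on the file name would be a safer choice if the order could vary.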

I pasted the links to the summary tables in this issue, but here they are again:

M. capitata:

P. acuta:

Methylation distribution figure

The next thing I wanted to do was revise the methylation frequency distribution figure I created earlier. Hollie, Shelly, and Mac suggested that I rework the figure such that the x-axis bins CpG type (methylated, sparsely methylated, or unmethylated), and the y-axis is percent of total loci with data at 5x coverage. Using the summary tables I created in R, I removed all columns that did not contain percentages. Then I used barplot to plot the CpG type for each sample. I created a multipanel plot that allows method comparisons vertically and sample comparisons horizontally. I added axis labels and sequencing information, then saved each figure as a PDF.
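A multipanel barplot along these lines could look something like the sketch below. The method and sample labels, the percentages, and the output file name are all placeholders, and the layout (one row per method, one column per sample) is just one way to get vertical method comparisons and horizontal sample comparisons.

```r
# Hypothetical percent-only table: two methods x two samples,
# ordered by method so that par(mfrow) fills one method per row
percents <- data.frame(
  method        = c("Method A", "Method A", "Method B", "Method B"),
  sample        = c("Sample 1", "Sample 2", "Sample 1", "Sample 2"),
  percentMeth   = c(10, 20, 12, 18),
  percentSparse = c(15, 30, 13, 32),
  percentUnmeth = c(75, 50, 75, 50)
)
percent_cols <- c("percentMeth", "percentSparse", "percentUnmeth")

# Placeholder output location; a real script would use a project path
out_file <- file.path(tempdir(), "CpG-type-distribution.pdf")
pdf(out_file, width = 8, height = 6)

# 2 x 2 grid filled row by row: rows are methods, columns are samples
par(mfrow = c(2, 2), mar = c(3, 4, 2, 1), oma = c(2, 2, 0, 0))
for (i in seq_len(nrow(percents))) {
  barplot(unlist(percents[i, percent_cols]),
          names.arg = c("Methylated", "Sparse", "Unmethylated"),
          ylim = c(0, 100),
          main = paste(percents$method[i], percents$sample[i]))
}

# Shared axis labels in the outer margins
mtext("CpG type", side = 1, outer = TRUE)
mtext("% of CpGs with 5x coverage", side = 2, outer = TRUE)
dev.off()
```

Keeping the y-axis fixed at 0 to 100 in every panel makes the bars directly comparable across methods and samples.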

Going forward

  1. Create tracks for 2 kb upstream and downstream flanks
  2. Characterize intersections between data and flanking regions and create summary tables
  3. Create figures for CpG characterization in various genome features
  4. Generate transposable element tracks for each species
  5. Figure out how to meaningfully concatenate data for each method
  6. Figure out methylation island analysis
Written on April 30, 2020