MethCompare Part 9
CpG distributions
I made my grand return to R today to apply handy dandy file manipulations and generate tables and figures in this R Markdown script.
Programmatic tables
After characterizing CpGs with 5x coverage, I quickly realized I needed an efficient, programmatic way to create summary tables with CpG counts and percentages. Manually pulling all of the data from wc -l output and calculating percentages by hand wasn’t cutting it anymore. The first thing I did was save the wc -l output from this Jupyter notebook as text files:
M. capitata:
- Total 5x CpGs with data
- Methylated CpGs
- Sparsely methylated CpGs
- Unmethylated CpGs
- Gene overlaps
- CDS overlaps
- Intron overlaps
- Intergenic overlaps
P. acuta:
- Total 5x CpGs with data
- Methylated CpGs
- Sparsely methylated CpGs
- Unmethylated CpGs
- Gene overlaps
- CDS overlaps
- Intron overlaps
- Intergenic overlaps
I started by generating a summary table with the total number of CpGs with data and the number and percentage of methylated, sparsely methylated, and unmethylated CpGs. To do this, I imported each .txt file into R using spaces as delimiters, used cbind to mash all the columns together, removed the columns with file names, and calculated percentages. I saved the final tables after reorganizing the columns.
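Here's a minimal sketch of that import, cbind, and percentage workflow. It is not the actual script: the counts, file names, and column names below are made up for illustration, and the wc -l output is embedded as text so the example runs on its own. It also assumes every file lists the samples in the same order, since cbind matches rows by position.

```r
# Toy stand-ins for the saved wc -l text files (all values hypothetical)
total_txt  <- "  1000 sample1_5x.bedgraph
   800 sample2_5x.bedgraph
  1800 total"
meth_txt   <- "  100 sample1_5x-Meth
  120 sample2_5x-Meth
  220 total"
sparse_txt <- "  150 sample1_5x-sparseMeth
   80 sample2_5x-sparseMeth
  230 total"
unmeth_txt <- "  750 sample1_5x-unMeth
  600 sample2_5x-unMeth
 1350 total"

# Read one wc -l listing; the default sep = "" splits on runs of whitespace,
# which copes with the padding wc adds in front of each count
read_wc <- function(txt, count_name) {
  wc <- read.table(text = txt, header = FALSE,
                   col.names = c(count_name, "file"),
                   stringsAsFactors = FALSE)
  wc[wc$file != "total", ]  # drop the grand-total row wc appends
}

total  <- read_wc(total_txt,  "totalCpG")
meth   <- read_wc(meth_txt,   "Meth")
sparse <- read_wc(sparse_txt, "sparseMeth")
unmeth <- read_wc(unmeth_txt, "unMeth")

# Mash the columns together, keep sample names as row names,
# then remove every file-name column
cpgType <- cbind(total, meth, sparse, unmeth)
rownames(cpgType) <- total$file
cpgType <- cpgType[, !grepl("^file", names(cpgType))]

# Percent of all 5x CpGs with data that fall in each category
cpgType$percentMeth       <- round(100 * cpgType$Meth / cpgType$totalCpG, 2)
cpgType$percentSparseMeth <- round(100 * cpgType$sparseMeth / cpgType$totalCpG, 2)
cpgType$percentUnMeth     <- round(100 * cpgType$unMeth / cpgType$totalCpG, 2)

# Reorganize so each count sits next to its percentage
cpgType <- cpgType[, c("totalCpG", "Meth", "percentMeth",
                       "sparseMeth", "percentSparseMeth",
                       "unMeth", "percentUnMeth")]
print(cpgType)
```

With real files, the text = argument would be swapped for a file path, and write.table could save the finished table.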
Creating the summary tables for the genome feature overlaps was a little more difficult. The wc -l output had information for a feature’s overlap with total, methylated, sparsely methylated, and unmethylated loci in each sample. Using a loop, I extracted the information from the correct row and saved it in a new table. Similar to the CpG type tables, I calculated percentages and reorganized the columns before saving the final output.
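A rough sketch of that loop follows, under a few assumptions that may not match the real files: each sample contributes four consecutive rows (total, methylated, sparsely methylated, unmethylated), the counts and file names are invented, and the percentages are taken relative to all CpGs overlapping the feature in that sample.

```r
# Toy wc -l output for one feature (gene overlaps); all values hypothetical
gene_txt <- "  500 sample1-all-genes
   80 sample1-Meth-genes
   70 sample1-sparseMeth-genes
  350 sample1-unMeth-genes
  400 sample2-all-genes
  100 sample2-Meth-genes
   60 sample2-sparseMeth-genes
  240 sample2-unMeth-genes
 1800 total"

wc <- read.table(text = gene_txt, header = FALSE,
                 col.names = c("counts", "file"),
                 stringsAsFactors = FALSE)
wc <- wc[wc$file != "total", ]  # drop the grand-total row

rowsPerSample <- 4  # assumed layout: all, Meth, sparseMeth, unMeth
nSamples <- nrow(wc) / rowsPerSample

# Empty table to fill in, one row per sample
geneOverlaps <- data.frame(allCpGs    = numeric(nSamples),
                           Meth       = numeric(nSamples),
                           sparseMeth = numeric(nSamples),
                           unMeth     = numeric(nSamples))

# Pull each sample's block of four rows into a single table row
for (i in seq_len(nSamples)) {
  first <- (i - 1) * rowsPerSample + 1
  geneOverlaps[i, ] <- wc$counts[first:(first + rowsPerSample - 1)]
}

# Percent of feature-overlapping CpGs in each methylation category
geneOverlaps$percentMeth       <- round(100 * geneOverlaps$Meth / geneOverlaps$allCpGs, 2)
geneOverlaps$percentSparseMeth <- round(100 * geneOverlaps$sparseMeth / geneOverlaps$allCpGs, 2)
geneOverlaps$percentUnMeth     <- round(100 * geneOverlaps$unMeth / geneOverlaps$allCpGs, 2)
print(geneOverlaps)
```

Indexing by position is fragile if the row order ever changes; matching on the file name with grepl would be a safer alternative when the naming scheme is consistent.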
I pasted the links to the summary tables in this issue, but here they are again:
M. capitata:
P. acuta:
Methylation distribution figure
The next thing I wanted to do was revise the methylation frequency distribution figure I created earlier. Hollie, Shelly, and Mac suggested that I rework the figure such that the x-axis bins CpG type (methylated, sparsely methylated, or unmethylated), and the y-axis is percent of total loci with data at 5x coverage. Using the summary tables I created in R, I removed every column that wasn’t a percentage. Then I used barplot to plot the percentage of each CpG type for each sample. I created a multipanel plot that allows method comparisons vertically and sample comparisons horizontally. I added axis labels and sequencing information, then saved each figure as a PDF:
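One way such a multipanel figure could be put together in base R is sketched below. The method and sample labels, the percentages, and the output file name are all placeholders, not values from the real summary tables. Because par(mfrow = ...) fills panels row by row, ordering the table by method and then by sample puts methods in rows and samples in columns.

```r
# Hypothetical percent-only table: one row per method-by-sample combination
cpgPercents <- data.frame(
  method = rep(c("Method A", "Method B"), each = 2),
  sample = rep(c("Sample 1", "Sample 2"), times = 2),
  percentMeth       = c(10, 15, 20, 25),
  percentSparseMeth = c(15, 10, 10, 15),
  percentUnMeth     = c(75, 75, 70, 60)
)

# Written to a temporary directory so the sketch leaves nothing behind
outFile <- file.path(tempdir(), "cpg-type-distribution.pdf")
pdf(outFile, width = 8, height = 8)

# 2 x 2 grid, filled row by row: methods stack vertically, samples run horizontally
par(mfrow = c(2, 2), mar = c(5, 5, 3, 1))
for (i in seq_len(nrow(cpgPercents))) {
  barplot(unlist(cpgPercents[i, c("percentMeth", "percentSparseMeth", "percentUnMeth")]),
          names.arg = c("Methylated", "Sparse", "Unmethylated"),
          ylim = c(0, 100),  # shared y-axis so panels are directly comparable
          xlab = "CpG type",
          ylab = "% of 5x CpGs with data",
          main = paste(cpgPercents$method[i], cpgPercents$sample[i], sep = ", "))
}
invisible(dev.off())
```

Fixing ylim across panels keeps bar heights comparable between methods and samples at a glance.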
Going forward
- Create tracks for 2 kb upstream and downstream flanks
- Characterize intersections between data and flanking regions and create summary tables
- Create figures for CpG characterization in various genome features
- Generate transposable element tracks for each species
- Figure out how to meaningfully concatenate data for each method
- Figure out methylation island analysis