MethCompare Part 12

Characterizing union bedgraphs

I’m building up to characterizing CpG locations in various genome features. To start, I translated my individual sample characterization pipeline to union bedgraphs. While having individual sample information might be nice when we decide to look at interindividual variation in methods, our methods comparisons will benefit from concatenating the three individual samples.

Identifying methylated loci

I created this Jupyter notebook to run the union bedgraphs through my characterization pipeline. I downloaded the union bedgraphs from this link. Once I obtained the union data, I needed a way to average the three columns for each method while ignoring any missing values. This is easy to do in R, and I figured there would be a simmple way to do it with the shell or awk. After many attempts at targetted Google searches, I couldn’t find anything! I posted this issue. Sam suggested I use pandas, which is a python tool that would allow me to manipulate dataframes in my Jupyter notebook.

First, I installed pandas for the notebook:

import pandas as pd
print(pd.__version__)

I then imported the full union bedgraph into pandas and averaged the three samples for each method across rows while ignoring missing values:

Screen Shot 2020-05-07 at 9 53 20 AM Screen Shot 2020-05-07 at 10 01 17 AM Screen Shot 2020-05-07 at 10 01 28 AM

The function .mean(axis = 1) averages across rows and automatically ignores missing values. To save the dataframe, I used df.to_csv (confusing) but specified tab-delimited output. I also used na_rep to specify what character missing values should be saved as, and used quoating = 3 to make sure my output did not have any quotes. I learned how to select columns from this link and navigated this pandas page for the output saving code. Not the greatest documentation imo.

I then separated all the methods, removed lines where the percent methylation was “N/A”, and counted the number of lines with data:

#Remove header
#Keep chr, start, end, and WGBS average (col 2-4, 13)
#Remove rows where the 4th column (average %meth) is N/A
#Save file
!tail -n +2 Mcap_union_5x-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{print $2, $3, $4, $13}' \
| awk -F'\t' -v OFS='\t' '$4 != "N/A"' \
> Mcap_union_5x-averages-WGBS.bedgraph

I had to repeat this step a few times because when I would move forward, I learned that I didn’t correctly save the columns! I saved columns 1-3 for all methods, which was actually row number, chromosome, and start instead of chromosome, start, end. Whoops. The counts for M. capitata are here and the counts for P. acuta are here.

CpG locations in the genome

Once I finalized my union bedgraphs, I went through and identified methylated, sparsely methylated, and unmethylated by method for M. capitata and P. acuta. Again, I had to mess around with it a few times because I wanted to use a loop to call all my relevant files, but needed to figure out what wildcard I needed. Which also involved going back and remaking my method bedgraphs when I wanted to test new filenames. Once I identified the loci, I saved counts for each file and added the intermediate files to the .gitignore.

M. capitata:

P. acuta:

For each CpG type, I looked at intersections between genes, CDS, introns, flanking regions, and intergenic regions. Again, I added intermediate files to the .gitignore and saved line counts:

M. capitata:

P. acuta:

Going forward

  1. Remake stacked barplots with union data
  2. Update methods
  3. Update results
  4. Create figures for CpG characterization in various genome features
  5. Locate TE tracks
  6. Characterize intersections between data and TE, and create summary tables
  7. Figure out methylation island analysis
Written on May 6, 2020