MethCompare Part 4

Modifying the CpG characterization pipeline

Once I finished validating genome feature tracks, I decided to turn back to my one true love: characterizing CpGs in files (lol I’ve spent too much time 1) doing this kind of analysis and 2) alone because pandemic).

Reformatting the CpG characterization script

After counting overlaps between genome feature tracks and CG motifs, I figured I might as well set up code for P. acuta characterization in the Jupyter notebook. Originally, I only set up M. capitata code to test the pipeline since those were the subsets we had. I broke the script down into multiple steps, and within each step I had planned to switch between species. To differentiate between M. capitata and P. acuta files, I decided to append the species name to the end of each file name. When I tried to set up the code for P. acuta, I couldn’t figure out how to select only P. acuta bedgraphs to add the species name to. At this point, I decided that maybe my method was too complicated and I should just have subfolders for each species. So…that’s what I did.
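As a rough illustration of the subfolder approach (not the notebook's actual code), here is a minimal Python sketch that creates one directory per species so files never need a species suffix. The labels `Mcap` and `Pact` and the function name `make_species_dirs` are hypothetical, chosen only for this example.

```python
from pathlib import Path

# Hypothetical short labels for M. capitata and P. acuta
SPECIES = ["Mcap", "Pact"]


def make_species_dirs(base, species=SPECIES):
    """Create one subdirectory per species under `base` and return their paths."""
    base = Path(base)
    dirs = {}
    for name in species:
        species_dir = base / name
        # exist_ok=True makes reruns of the notebook cell harmless
        species_dir.mkdir(parents=True, exist_ok=True)
        dirs[name] = species_dir
    return dirs
```

With this layout, every downstream step can be pointed at a single species directory instead of filtering file names.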

For each species, I first create a species-specific subdirectory. Next, I count the overlaps between CG motifs and each genome feature. I then download coverage files into the subdirectory to proceed with characterizing methylated, sparsely methylated, and unmethylated CpGs and figure out where they’re located in the genome. I also removed the methylation island code I had since that didn’t really fit with the rest of the script.
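The characterization step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in the notebook: it assumes tab-delimited coverage lines with six columns (chromosome, start, end, percent methylation, methylated count, unmethylated count), and the cutoffs (at least 50% for methylated, at most 10% for unmethylated, sparsely methylated in between) and the minimum coverage of 5 reads are placeholder values that should be swapped for whatever the analysis actually uses.

```python
from collections import Counter


def classify_cpg(percent_meth, low=10.0, high=50.0):
    """Assign a CpG to a methylation category.

    The `low` and `high` cutoffs are assumptions for this sketch.
    """
    if percent_meth >= high:
        return "methylated"
    if percent_meth > low:
        return "sparsely methylated"
    return "unmethylated"


def count_categories(lines, min_cov=5):
    """Tally CpG categories from coverage-style lines.

    Assumed columns: chrom, start, end, percent methylation,
    methylated read count, unmethylated read count.
    `min_cov` is a hypothetical coverage filter.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent = float(fields[3])
        coverage = int(fields[4]) + int(fields[5])
        if coverage < min_cov:
            continue  # too few reads to trust the percentage
        counts[classify_cpg(percent)] += 1
    return counts
```

Running `count_categories` on an open file handle for each coverage file in a species subdirectory would give per-sample totals; writing each category out as its own BED file would then allow the same overlap counting used for the CG motifs.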

Going forward

  1. Rerun the pipeline with full samples once all samples on gannet are validated
  2. Update code for methylation frequency distribution figure
  3. Figure out how to meaningfully concatenate data for each method
  4. Look into exon annotations in Liew and Li papers
  5. Figure out methylation island analysis
Written on April 9, 2020