MethCompare Part 23

Calculating median percent methylation

We’re wrapping up our analyses. Hollie is looking at gene methylation by method and asked me to calculate median methylation by gene for each sample, since I did that before for C. virginica. I set out to do just that in this script. I started by testing code on M. capitata samples, then used the finalized code for P. acuta as well.

Looking at my files on gannet, I realized I never transferred the intermediate genonme feature overlap files! I started rerunning this Jupyter notebook to get the gene-sample overlap files. Even after I generated those files, I was missing percent methylation information. I decided to include percent methylation information in the intermediate BEDfiles before using intersectBED. After using intersectBed, I had chromosome, start, stop, and percent methylation information for each sample.

I returned to my R Markdown script to import the overlap files. Again, I realized I was missing something! I needed to associate CpG loci with genes. I used intersectBed to match CpGs with genes:

#For each sample-gene overlap file (ends in *bedgraph.bed-mcGenes)
#Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to genes
#wb: Print all lines in the second file
#a: sample-gene overlap file
#b: gene track
#Save output in a new file that has the same base name and ends in -geneID
for f in ../analyses/Characterizing-CpG-Methylation-5x/Mcap/*bedgraph.bed-mcGenes
do
  /usr/local/bin/intersectBed \
  -wb \
  -a ${f} \
  -b ../genome-feature-files/Mcap.GFFannotation.gene.gff \
  > ${f}-geneID
done

I then took the files and moved them into a Median-Methylation-Calculations subdirectory. Next, I needed to import the nine files for each species into R. I used code I learned about when working with C. virginica to import all the files at once:

setwd("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations") #Set working directory within the notebook chunk for list.files to find the necessary files
filesToImport <- list.files(pattern = "*geneID") #Create a file list for all 9 files to import. Only import overlaps for full samples (not divided by methylation status)
list2env(lapply(setNames(filesToImport,
                         make.names(gsub("_R1_001_val_1_bismark_bt2_pe._5x.bedgraph.bed-mcGenes-geneID", "", filesToImport))),
                read.delim, header = FALSE),
         envir = .GlobalEnv) #Import files to the .GlobalEnv with list2env. Use lapply to setNames of the files by taking all the common parts of their names out. Read files with read.delim and include header = FALSE. Files will be named Meth#
head(Meth10) #Confirm import

My next steps were simple: remove redundant columns, add column names, and use aggregate to calculate median percent methylation for each gene ID. That would amount to three lines of code for each sample, but I didn’t want to write an individual block of code for each sample! Based on this Stack Overflow post, I created a vector of dataframes, then pulled the dataframes out of the vector in a loop to format the files:

samplesMcap <- c("Meth10",
                 "Meth11",
                 "Meth12",
                 "Meth13",
                 "Meth14",
                 "Meth15",
                 "Meth16",
                 "Meth17",
                 "Meth18") #Create a vector of sample names

for(sample in samplesMcap) { #For each sample listed in samplesMcap
  sample.tmp <- get(sample) #Extract sample based on vector contents
  sample.tmp <- sample.tmp[,-c(5:12)] #Remove extraneous columns
  colnames(sample.tmp) <- c("chr", "start", "stop", "percentMeth", "geneID") #Rename columns
  assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents
}
head(Meth17) #Confirm formatting changes

Finally, I used similar loops to calculate median percent methylation by gene ID and write out tab-delimited files for each sample:

for(sample in samplesMcap) { #For each sample listed in samplesMcap
  sample.tmp <- get(sample) #Extract sample based on vector contents
  sample.tmp <- aggregate(percentMeth ~ geneID, data = sample.tmp, FUN = median) #Use aggregate to group geneID and calculate median percent methylation
  assign(sample, sample.tmp) #Replace sample with edited sample.tmp contents
}
head(Meth17) #Confirm median methylation calculation

for (i in 1:length(samplesMcap)) { #For each sample listed in samplesMcap
  sample <- get(samplesMcap[i]) #Extract sample based on vector contents
  fileName <- paste("../analyses/Characterizing-CpG-Methylation-5x/Mcap/Median-Methylation-Calculations/", samplesMcap[i], "-Median-Methylation", ".txt", sep = "") #Assign filename for each sample
  write.table(sample, file = fileName, sep = "\t", row.names = FALSE, col.names = TRUE) #Write out files into the Median-Methylation-Calculations subdirectory
}

The trick of putting dataframes in a vector to run the same commands on them is super useful. Definitely using that in the future. I posted the links to the M. capitata and P. acuta subdirectories in this issue for Hollie to use.

Aside from calculating methylation, I also modified figures based on suggestions from Hollie and updated the methods section. Not much left to work on (hopefully)!

Going forward

Update results
Look into program for mCpG identification

Written on July 9, 2020