WGBS Analysis Part 18

Testing base `methylKit` code

Alright, it’s time to run methylKit on mox! Steven and Sam suggested I run my methylKit code in this discussion, so I’ll finish testing my code, make some modifications, and create a script to run on mox.

Percent difference for DML

The first thing I needed to make a decision about was the percent difference I wanted to use as a threshold for DML identification. In this discussion, I pointed out that some papers use a 25% difference, while others (including my C. virginica paper) use a 50% difference. Initially, I set up my code to include 25%, 50%, 75%, and 99% differences in methylation between treatments. After loading in my saved .RData, I ran getMethylDiff using objects generated from calculateDiffMeth with no additional modifications and just the overdispersion correction.

Using a 25% or 50% methylation difference, I got > 1000 DML when not applying any covariate matrix or overdispersion correction! The number of DML identified were in the 100s when using a 75% or 99% difference. When I used the output from calculateDiffMeth with the overdispersion correcrtion, I identified 2119 DML that were 25% different, 564 that were 50% different, and 80 that were 75% different. Clearly making the model more complicated reduces the amount of DML identified. I think if I used output with a covariate matrix and overdispersion correction, a 25% or 50% methylation difference wouldn’t get me more than ~500 DML. I set up the DML identification code to use both a 25% and 50% methylation difference so I can compare the output from mox.

Testing randomization code

The main parts of my R Markdown script I needed to still test were the randomization trial and subsequent statistical testing. I started by running a modified version of the randomization trial that did not include any overdispersion or covariate information for calculateDiffMeth:

for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(.) %>%
    methylKit::getMethylDiff(., difference = 50, qvalue = 0.01) %>%
    nrow() -> pHRandTest50[i] ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 50% and have a q-value of 0.01. Save the number of DML identified
}

I thought the code would finish running in a few hours, but since it was taking a while and I didn’t get any error messages, I interrupted the code chunk and commented it out. I then set up two randomization tests: one for a 25% methylation difference, and another for a 50% methylation difference:

pHRandTest25 <- NULL #Create an empty matrix to store randomization trial results for 25% difference

for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(., covariates = covariateMetadata,
                                 overdispersion = "MN",
                                 test = "Chisq") %>%
    methylKit::getMethylDiff(., difference = 25, qvalue = 0.01) %>%
    nrow() -> pHRandTest25[i] ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 50% and have a q-value of 0.01. Save the number of DML identified
}

pHRandTest50 <- NULL #Create an empty matrix to store randomization trial results for 50% difference

for (i in 1:1000) { #For each iteration
  pHRandTreat <- sample(sampleMetadata$pHTreatment,
                        size = length(sampleMetadata$pHTreatment),
                        replace = FALSE) #Randomize treatment information
  pHRandDML <- methylKit::reorganize(processedFiles,
                                     sample.ids = sampleMetadata$sampleID,
                                     treatment = pHRandTreat) %>%
    methylKit::filterByCoverage(., lo.count = 5, lo.perc = NULL,
                                hi.count = NULL, hi.perc = 99.9) %>%
    methylKit::normalizeCoverage(.) %>%
    methylKit::unite(., destrand = FALSE) %>%
    methylKit::calculateDiffMeth(., covariates = covariateMetadata,
                                 overdispersion = "MN",
                                 test = "Chisq") %>%
    methylKit::getMethylDiff(., difference = 50, qvalue = 0.01) %>%
    nrow() -> pHRandTest50[i] ##Reorganize, filter, normalize, and unite samples. Calculate diffMeth and obtain DML that are 50% and have a q-value of 0.01. Save the number of DML identified
}

The output of the randomization test is a vector with the number of DML identified in each trial. I needed a way to determine if the number of DML I identified with getMethylDiff was significantly greater than what may be identified by chance. I dug deep into my stats brain (you know, that part of my brain that barely turns on) and decided to run a one-tailed t-test:

pHRandTestResults50 <- t.test(pHRandTest50, alternative = "less", mu = nrow(diffMethStatsTreatment50)) #Conduct a one-sample t-test to determine the probability of identifying more DML by chance. mu is the number of DML identified with a 50% methylation difference
pHRandTestResults50$statistic #t
pHRandTestResults50$parameter #df
pHRandTestResults50$p.value #P-value

I also wrote code to create a histogram with the random distribution of DML identified, and include a vertical line for the number of DML obtained with getMethylDiff:

jpeg(filename = "../output/06-methylKit/rand-test/Random-Distribution-diff50.jpeg", height = 1000, width = 1000) #Save file with designated name
hist(pHRandTest50, plot = TRUE, main = "", xlab = "Number of DML (50% difference)") #Plot a histogram with randomization trial results
dev.off()

Now that I’ve tested my base code, I will work on separating female and indeterminate samples and creating the actual mox script!

Going forward

Separate female and indeterminate samples
Write a mox script
Run R script on mox
Update methods
Obtain relatedness matrix and SNPs with EpiDiverse/snp
Revise methylKit study design: separate tests for indeterminate and female oysters
Write results
Identify genomic location of DML
Determine if RNA should be extracted
Determine if larval DNA/RNA should be extracted

Written on April 7, 2021

WGBS Analysis Part 18

Testing base methylKit code

Percent difference for DML

Testing randomization code

Going forward

Testing base `methylKit` code