MethCompare Part 20

Preliminary follow-up GLMs

Previously, I tried using NMDS and ANOSIM to understand if sequencing method influenced differences in proportions of CpGs in various methylation statuses or genomic locations. I’m still waiting on a response to some questions I had, but in the meantime I’ll try using generalized linear models to get at some of these differences. Since we have small sample sizes, using a combination of approaches gives us more power. Specifically I will be testing if sequencing method or genomic location influences the proportion of CpGs in a particular methylation status.

In this script, I started creating my models. Evan suggested I use glmmTMB to run beta family models with a logit link. Since the proportions of highly, moderately, and lowly methylated CpGs add up to 1, I need to run separate models:

McapCpGHighModel <- glmmTMB(percentMeth ~ CDS + Introns + Upstream.Flanks + Downstream.Flanks + Intergenic + seqMethod,
                            family = beta_family(link = "logit"),
                            data = McapCpGMasterHigh) #Run the full model (all genomic locations) using a beta distribution and a logit link

When I ran the full model, I got a convergence issue error. Digging into glmmTMB vignettes, it suggested that my model could be overparametrized. This made sense, since I have minimal data and five model parameters. I decided to run two models: one for gene features and another for non-gene features. First, I ran a null model (without sequencing method) and a full model (with sequencing method):

McapCpGHighModel.GeneNull <- glmmTMB(percentMeth ~ CDS + Introns,
                                     family = beta_family(link = "logit"),
                                     data = McapCpGMasterHigh) #Run null model (without sequencing method) using a beta distribution and a logit link
McapCpGHighModel.Gene <- glmmTMB(percentMeth ~ CDS + Introns + seqMethod,
                                 family = beta_family(link = "logit"),
                                 data = McapCpGMasterHigh) #Run model using a beta distribution and a logit link

Then, I compared these models with a likelihood ratio test:

McapCpGHighModel.GeneANOVA <- anova(McapCpGHighModel.Gene, McapCpGHighModel.GeneNull, test = "LRT") #Compare the null and treatment models with a likelihood ratio test. 
McapCpGHighModel.GeneANOVA #Look at ANOVA output

Finally, I looked at the summary output for the best model using summary. Usualy, the best-fit model was one that included the sequencing method. I’m not sure how (or if) to compare the gene and non-gene models, but AIC might be useful in that respect.

Still have questions, but at least it’s something to discuss at our meeting!

Going forward

  1. Finalize analysis
  2. Locate TE tracks
  3. [Characterize intersections between data and TE, and create summary tables]
  4. Look into program for mCpG identification
Written on June 18, 2020