MethCompare Part 20

Preliminary follow-up GLMs

Previously, I tried using NMDS and ANOSIM to understand if sequencing method influenced differences in proportions of CpGs in various methylation statuses or genomic locations. I’m still waiting on a response to some questions I had, but in the meantime I’ll try using generalized linear models to get at some of these differences. Since we have small sample sizes, using a combination of approaches gives us more power. Specifically I will be testing if sequencing method or genomic location influences the proportion of CpGs in a particular methylation status.

In this script, I started creating my models. Evan suggested I use glmmTMB to run beta family models with a logit link. Since the proportions of highly, moderately, and lowly methylated CpGs add up to 1, I need to run separate models:

McapCpGHighModel <- glmmTMB(percentMeth ~ CDS + Introns + Upstream.Flanks + Downstream.Flanks + Intergenic + seqMethod,
                            family = beta_family(link = "logit"),
                            data = McapCpGMasterHigh) #Run the full model (all genomic locations) using a beta distribution and a logit link

When I ran the full model, I got a convergence issue error. Digging into glmmTMB vignettes, it suggested that my model could be overparametrized. This made sense, since I have minimal data and five model parameters. I decided to run two models: one for gene features and another for non-gene features. First, I ran a null model (without sequencing method) and a full model (with sequencing method):

McapCpGHighModel.GeneNull <- glmmTMB(percentMeth ~ CDS + Introns,
                                     family = beta_family(link = "logit"),
                                     data = McapCpGMasterHigh) #Run null model (without sequencing method) using a beta distribution and a logit link
McapCpGHighModel.Gene <- glmmTMB(percentMeth ~ CDS + Introns + seqMethod,
                                 family = beta_family(link = "logit"),
                                 data = McapCpGMasterHigh) #Run model using a beta distribution and a logit link

Then, I compared these models with a likelihood ratio test:

McapCpGHighModel.GeneANOVA <- anova(McapCpGHighModel.Gene, McapCpGHighModel.GeneNull, test = "LRT") #Compare the null and treatment models with a likelihood ratio test. 
McapCpGHighModel.GeneANOVA #Look at ANOVA output

Finally, I looked at the summary output for the best model using summary. Usualy, the best-fit model was one that included the sequencing method. I’m not sure how (or if) to compare the gene and non-gene models, but AIC might be useful in that respect.

Still have questions, but at least it’s something to discuss at our meeting!

Going forward

Finalize analysis
Locate TE tracks
[Characterize intersections between data and TE, and create summary tables]
Look into program for mCpG identification

Written on June 18, 2020