DML Analysis Part 13

Different mincov values in methylKit

Using this R Markdown file, I tested the effect of different mincov values on sample clustering and the DMLs produced. After discussing methods in this issue, I went through this process with both Steven’s samples and my own samples.
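To make the mincov idea concrete, here is a minimal Python sketch (not methylKit itself) of what a minimum-coverage filter does: within each sample, drop loci with fewer than mincov reads, then keep only the loci that survive in every sample. The data layout, function names, and toy counts are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (chromosome, position) -> (methylated reads, total reads)
Locus = Tuple[str, int]
Sample = Dict[Locus, Tuple[int, int]]


def filter_by_mincov(sample: Sample, mincov: int) -> Sample:
    """Keep only loci whose total read coverage is at least mincov."""
    return {locus: counts for locus, counts in sample.items() if counts[1] >= mincov}


def shared_loci(samples: List[Sample], mincov: int) -> List[Locus]:
    """Loci that pass the coverage filter in every sample."""
    filtered = [filter_by_mincov(s, mincov) for s in samples]
    common = set(filtered[0])
    for s in filtered[1:]:
        common &= set(s)
    return sorted(common)


# Toy counts, invented for this example
sample_a: Sample = {("chr1", 100): (1, 1), ("chr1", 200): (2, 4), ("chr1", 300): (5, 6)}
sample_b: Sample = {("chr1", 100): (0, 2), ("chr1", 200): (3, 3), ("chr1", 300): (4, 7)}

for mincov in (1, 3, 5):
    print(mincov, len(shared_loci([sample_a, sample_b], mincov)))
```

With these toy counts, raising mincov from 1 to 3 to 5 shrinks the shared set from three loci to two to one, which is the same direction as the drop in total loci reported in the tables below.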

Steven’s samples

All of my output from this analysis can be found here. Below are some highlights:

cpgmeth1

cpgmeth3

cpgmeth5

Figures 1-3. Full sample CpG methylation clustering using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

PCA1

PCA3

PCA5

Figures 4-6. PCA of full sample methylation using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

I also wrote out differentially methylated loci that were at least 50% different between my treatment and control for mincov = 1, mincov = 3, and mincov = 5. I haven’t dug into what the exact differences are between these files, but there are at least differences in the number of DMLs produced.
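As a rough illustration of the "at least 50% different" cutoff, the sketch below computes percent methylation for each group at one locus and checks whether the absolute difference reaches the threshold. It is a simplification under stated assumptions: the function names and counts are hypothetical, and it leaves out the significance test that a differential methylation analysis applies alongside the difference cutoff.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (methylated reads, total reads), hypothetical layout


def percent_methylation(methylated: int, total: int) -> float:
    """Percent of reads at a locus that are methylated."""
    return 100.0 * methylated / total


def meets_difference(treatment: Counts, control: Counts, min_diff: float = 50.0) -> bool:
    """True if the groups differ by at least min_diff percentage points."""
    diff = percent_methylation(*treatment) - percent_methylation(*control)
    return abs(diff) >= min_diff


# Toy loci, invented for this example
print(meets_difference((9, 10), (2, 10)))  # 90% vs 20% methylated
print(meets_difference((6, 10), (3, 10)))  # 60% vs 30% methylated
```

The first toy locus differs by 70 percentage points and passes; the second differs by 30 and does not.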

Table 1. The mincov value, total number of loci produced, and the number of DMLs that were at least 50% different between treatment and control samples. More restrictive mincov values produced fewer significantly different DMLs.

| mincov | Total Loci | Number of Significantly Different DMLs |
|--------|------------|----------------------------------------|
| 1      | 1112085    | 4904                                   |
| 3      | 670301     | 1398                                   |
| 5      | 503780     | 816                                    |
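To put the counts in Table 1 on a common scale, here is a small Python sketch that expresses each row relative to mincov = 1 and as a share of the loci tested. The numbers come straight from the table; the function and field names are just for illustration.

```python
# mincov -> (total loci, DMLs at least 50% different), copied from Table 1
table = {1: (1112085, 4904), 3: (670301, 1398), 5: (503780, 816)}


def summarize(counts, baseline=1):
    """Express each row relative to the baseline mincov and to its own loci."""
    base_loci, base_dmls = counts[baseline]
    rows = []
    for mincov in sorted(counts):
        loci, dmls = counts[mincov]
        rows.append({
            "mincov": mincov,
            "loci_retained_pct": 100.0 * loci / base_loci,
            "dmls_retained_pct": 100.0 * dmls / base_dmls,
            "dml_pct_of_loci": 100.0 * dmls / loci,
        })
    return rows


rows = summarize(table)
for row in rows:
    print(row["mincov"],
          round(row["loci_retained_pct"], 1),
          round(row["dmls_retained_pct"], 1),
          round(row["dml_pct_of_loci"], 3))
```

Relative to mincov = 1, roughly 60% of loci remain at mincov = 3 and roughly 45% at mincov = 5. The DML count falls faster than the locus count, so the share of loci called as DMLs also drops, from about 0.44% to about 0.16%.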

One thing that was concerning about the pipeline is that I kept getting this warning:

glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
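R prints this message when a logistic regression ends up with fitted probabilities that are essentially 0 or 1. One plausible source, which I have not verified for these data, is loci where every read in one group is methylated or every read is unmethylated; with treatment as the only predictor, the fitted probability for that group then sits at the boundary. The Python sketch below flags such loci in toy counts. The function name and data are hypothetical, and this is a diagnostic idea rather than anything methylKit does.

```python
from typing import List, Tuple

Counts = Tuple[int, int]  # (methylated reads, total reads) for one sample


def has_boundary_group(treatment: List[Counts], control: List[Counts]) -> bool:
    """True if either group's pooled methylation at a locus is exactly 0% or 100%."""
    for group in (treatment, control):
        methylated = sum(m for m, _ in group)
        total = sum(t for _, t in group)
        if total == 0:
            continue  # no reads in this group, nothing to fit
        if methylated == 0 or methylated == total:
            return True
    return False


# Toy loci, invented for this example
low_cov = has_boundary_group([(1, 1), (2, 2)], [(0, 1), (1, 3)])       # treatment is 3/3
higher_cov = has_boundary_group([(4, 6), (5, 7)], [(1, 6), (2, 8)])    # both groups mixed
print(low_cov, higher_cov)
```

Boundary cases like the first toy locus are easier to hit when only a handful of reads cover a locus, so if this is the cause, I would expect the warning to be more frequent at mincov = 1 than at mincov = 5.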

My samples

I went through the bismark pipeline in my Jupyter notebook to get my deduplicated and sorted files. Initially I tried using bismark_methylation_extractor, but I was unable to extract methylation data for all files before genefish ran out of space (again…RIP). I moved all my large files to gannet and decided it probably wasn’t worth extracting the methylation data from genefish since I already have the pipeline running on Mox. If I have some downtime, I can always change the code so I’m running bismark_methylation_extractor from gannet.

All output from methylKit testing for my samples can be found here. I also wrote .csv files with DMLs for mincov = 1, mincov = 3, and mincov = 5.

cpgmeth1

cpgmeth3

cpgmeth5

Figures 7-9. Full sample CpG methylation clustering using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

PCA1

PCA3

PCA5

Figures 10-12. PCA of full sample methylation using a) mincov = 1 b) mincov = 3 or c) mincov = 5.

Table 2. The mincov value, total number of loci produced, and the number of DMLs that were at least 50% different between treatment and control samples. More restrictive mincov values produced fewer significantly different DMLs.

| mincov | Total Loci | Number of Significantly Different DMLs |
|--------|------------|----------------------------------------|
| 1      | 1112085    | 4904                                   |
| 3      | 670301     | 1398                                   |
| 5      | 503780     | 816                                    |

Look familiar…? That’s because it’s all the same as Steven’s samples! It’s good to know that different users going through the same pipeline get the same results (#ReproducibilityWin).

Going forward

Based on the dendrograms and PCA, I think mincov = 3 maximizes clustering of our treatment samples. The more similar those treatment samples are, the easier it is for me to create meaning out of our differential methylation data.

Written on October 18, 2018