WGBS Analysis Part 27
Characterizing DML
I have DML and I have genome feature tracks, so it’s time to do the thing I’ve done the most (besides yelling at R Studio server): characterize genomic locations with bedtools!
DML summary information
TL;DR, I have a lot of DML and I’m not sure if I picked the correct thresholds.
Table 1. Number of DML obtained for sex-specific contrasts at different percent methylation differences. A q-value of 0.01 was used for each test. After unite, I had 5306686 total CpG loci with data.
| Contrast | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| Female | 20830 | 3709 | 315 | N/A |
| Indeterminate | 129040 | 73414 | 21891 | 2861 |
My guess is that I have several DML because my sample size is lower. I think that using my most restrictive thresholds in downstream analyses is the best course of action until I get my randomization test results, which means a 75% difference for female-DML, and 100% difference for indeterminate-DML.
Another important note: methylKit used a Fisher test for the indeterminate samples since there were two of them! I need to go through the manual and confirm what exact test was used:

Looking at DML in IGV
I created this IGV session to look at my DML and see if my cutoffs made any sense. I quickly created BEDfiles for DML lists in this Jupyter notebook and saved the output here. I also color-coded the samples such that dark blue = female and light blue/cyan = indeterminate.

Figure 1. Sample 5x bedgraphs and DML tracks in IGV.
I first thumbed through the IGV session and looked at DML identified in female samples. Based on what I saw, I feel comfortable using a 75% difference to identify DML. I looked at some DML that were at least 50% different between samples, and while I was comfortable calling that CpG differentially methylated, the sheer number of 50% DML I have make me think that being more conservative is better for a small sample size. The 25% cutoff was not convincing.





Figures 2-6. Female DML and associated methylation information for each sample. Figures 2-3 look at DML that are 75% different, Figures 3-5 look at DML that are 50% different, and Figure 6 looks at a DML that is 25% different.
Common DML between sex-specific contrasts
Once I confirmed I wanted to use the 75% different female-DML and 100% different indeterminate-DML, I wanted to see if there were any DML that were common between my sex-specific lists. I returned to this Jupyter notebook and used intersectBed to pull out overlapping DML:
#Find overlaps between female- and indeterminate-DML
#Check head
#Count number of overlapping DML
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-75-Cov5-Fem.csv.bed \
-b DML-pH-100-Cov5-Ind.csv.bed \
> DML-pH-Cov5-Overlaps.bed
!head DML-pH-Cov5-Overlaps.bed
!wc -l DML-pH-Cov5-Overlaps.bed
I saved the output here. Turns out there are only 8 DML that overlap between the female and indeterminate samples!
Quick tangent: CG motif track
Before I quantified overlaps between genome feature tracks and DML, I realized I needed a better idea of how many CGs were in the Roslin genome! I opened this Jupyter notebook and used grep -c to count the number of CGs. The output was suspiciously low!

Clearly I was counting incorrectly, so I posted this discussion for feedback. Sam mentioned that I was missing instances with CGs on a new line! I looked through my old C. virginica code and found fgrep -o -i CG {genome-file} | wc -l. Using this code, I got ~13 million CpGs in the genome (for reference, the C. virginica genome has ~14 million). Sam still pointed out that this may not count across new lines, and in fact, he did bring up that point in an old issue! He suggested that instead, I convert the FASTA to a tab-delimited file, and count CGs that way. When I tried his code, I was getting 0 CGs:

I think the seqkit fx2tab wasn’t properly converting the FASTA to a tab-delimited format:

Steven suggested I use some code from Jay, but I still ended up with the same error:

At this point, I realized what I should be doing is creating a CG motif track! Having a separate track would allow me to look at it in IGV and obtain intersections between other BEDfiles of interest (like, all my genome features). I uploaded the Roslin genome and mitochondrial sequence FASTA to Galaxy and used EMBOSS fuzznuc to parse out CGs:


I uploaded my CG motif track to halfshell, and proceeded to calculate how many CG motifs were present in each genome feature track with intersectBed!
Table 2. Genome feature overlaps with CG motifs
| Genome Feature | Number of CG Motifs |
|---|---|
| Total | 13,246,547 |
| Genes | 7,425,225 |
| CDS | 1,516,641 |
| Exon | 2,216,766 |
| lncRNA | 454,393 |
| mRNA | 7,040,925 |
| Non-CDS | 11,040,607 |
| Introns | 5,221,932 |
| Exon UTR | 700,167 |
| All flanks | 1,091,847 |
| Upstream flanks | 592,834 |
| Downstream flanks | 549,787 |
| Intergenic regions | 4,730,728 |
| Transposable elements | 5,424,483 |
Genomic location of DML
I returned to my DML analysis notebook to keep using intersectBed and find overlaps between genome features and female-DML, indeterminate-DML, and overlapping DML.
Table 3. DML overlaps with genome features
| Genome Feature | Female | Indeterminate | Overlaps |
|---|---|---|---|
| Total | 315 | 2861 | 8 |
| Gene | 287 | 2519 | 8 |
| Exon UTR | 20 | 181 | 1 |
| CDS | 78 | 789 | 3 |
| Intron | 190 | 1552 | 4 |
| Upstream flanks | 2 | 33 | 0 |
| Downstream flanks | 16 | 140 | 0 |
| Intergenic regions | 12 | 178 | 0 |
| lncRNA | 5 | 36 | 0 |
| Transposable elements | 91 | 829 | 0 |
Similar to what I found with C. virginica, most of the DML are in genic regions! Hopefully this bodes well for any downstream enrichment analysis.
Going forward
- Finish running
methylKitrandomization tests on R Studio server - Update
moxhandbook with R information - Obtain relatedness matrix and SNPs with EpiDiverse/snp
- Filter C/T SNPs
- Identify overlaps between SNP data and other epigenomic data
- Write methods
- Write results
- Identify significantly enriched GOterms associated with DML
- Identify methylation islands and non-methylated regions
- Determine if larval DNA/RNA should be extracted