WGBS Analysis Part 27

Characterizing DML

I have DML and I have genome feature tracks, so it’s time to do the thing I’ve done the most (besides yelling at R Studio server): characterize genomic locations with bedtools!

DML summary information

TL;DR, I have a lot of DML and I’m not sure if I picked the correct thresholds.

Table 1. Number of DML obtained for sex-specific contrasts at different percent methylation differences. A q-value of 0.01 was used for each test. After unite, I had 5306686 total CpG loci with data.

Contrast	25%	50%	75%	100%
Female	20830	3709	315	N/A
Indeterminate	129040	73414	21891	2861

My guess is that I have several DML because my sample size is lower. I think that using my most restrictive thresholds in downstream analyses is the best course of action until I get my randomization test results, which means a 75% difference for female-DML, and 100% difference for indeterminate-DML.

Another important note: methylKit used a Fisher test for the indeterminate samples since there were two of them! I need to go through the manual and confirm what exact test was used:

Screen Shot 2021-04-30 at 10 35 49 AM

Looking at DML in IGV

I created this IGV session to look at my DML and see if my cutoffs made any sense. I quickly created BEDfiles for DML lists in this Jupyter notebook and saved the output here. I also color-coded the samples such that dark blue = female and light blue/cyan = indeterminate.

Screen Shot 2021-05-11 at 12 44 55 PM

Figure 1. Sample 5x bedgraphs and DML tracks in IGV.

I first thumbed through the IGV session and looked at DML identified in female samples. Based on what I saw, I feel comfortable using a 75% difference to identify DML. I looked at some DML that were at least 50% different between samples, and while I was comfortable calling that CpG differentially methylated, the sheer number of 50% DML I have make me think that being more conservative is better for a small sample size. The 25% cutoff was not convincing.

Screen Shot 2021-05-11 at 12 46 29 PM

Screen Shot 2021-05-11 at 12 48 04 PM

Screen Shot 2021-05-11 at 12 50 58 PM

Screen Shot 2021-05-11 at 12 51 30 PM

Screen Shot 2021-05-11 at 12 50 11 PM

Figures 2-6. Female DML and associated methylation information for each sample. Figures 2-3 look at DML that are 75% different, Figures 3-5 look at DML that are 50% different, and Figure 6 looks at a DML that is 25% different.

Common DML between sex-specific contrasts

Once I confirmed I wanted to use the 75% different female-DML and 100% different indeterminate-DML, I wanted to see if there were any DML that were common between my sex-specific lists. I returned to this Jupyter notebook and used intersectBed to pull out overlapping DML:

#Find overlaps between female- and indeterminate-DML
#Check head
#Count number of overlapping DML
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-75-Cov5-Fem.csv.bed \
-b DML-pH-100-Cov5-Ind.csv.bed \
> DML-pH-Cov5-Overlaps.bed
!head DML-pH-Cov5-Overlaps.bed
!wc -l DML-pH-Cov5-Overlaps.bed

I saved the output here. Turns out there are only 8 DML that overlap between the female and indeterminate samples!

Quick tangent: CG motif track

Before I quantified overlaps between genome feature tracks and DML, I realized I needed a better idea of how many CGs were in the Roslin genome! I opened this Jupyter notebook and used grep -c to count the number of CGs. The output was suspiciously low!

Screen Shot 2021-05-11 at 10 52 32 AM

Clearly I was counting incorrectly, so I posted this discussion for feedback. Sam mentioned that I was missing instances with CGs on a new line! I looked through my old C. virginica code and found fgrep -o -i CG {genome-file} | wc -l. Using this code, I got ~13 million CpGs in the genome (for reference, the C. virginica genome has ~14 million). Sam still pointed out that this may not count across new lines, and in fact, he did bring up that point in an old issue! He suggested that instead, I convert the FASTA to a tab-delimited file, and count CGs that way. When I tried his code, I was getting 0 CGs:

Screen Shot 2021-05-11 at 11 55 52 AM

I think the seqkit fx2tab wasn’t properly converting the FASTA to a tab-delimited format:

Screen Shot 2021-05-11 at 11 54 25 AM

Steven suggested I use some code from Jay, but I still ended up with the same error:

Screen Shot 2021-05-11 at 12 04 03 PM

At this point, I realized what I should be doing is creating a CG motif track! Having a separate track would allow me to look at it in IGV and obtain intersections between other BEDfiles of interest (like, all my genome features). I uploaded the Roslin genome and mitochondrial sequence FASTA to Galaxy and used EMBOSS fuzznuc to parse out CGs:

Screen Shot 2021-05-11 at 12 19 30 PM

Screen Shot 2021-05-11 at 4 08 28 PM

I uploaded my CG motif track to halfshell, and proceeded to calculate how many CG motifs were present in each genome feature track with intersectBed!

Table 2. Genome feature overlaps with CG motifs

Genome Feature	Number of CG Motifs
Total	13,246,547
Genes	7,425,225
CDS	1,516,641
Exon	2,216,766
lncRNA	454,393
mRNA	7,040,925
Non-CDS	11,040,607
Introns	5,221,932
Exon UTR	700,167
All flanks	1,091,847
Upstream flanks	592,834
Downstream flanks	549,787
Intergenic regions	4,730,728
Transposable elements	5,424,483

Genomic location of DML

I returned to my DML analysis notebook to keep using intersectBed and find overlaps between genome features and female-DML, indeterminate-DML, and overlapping DML.

Table 3. DML overlaps with genome features

Genome Feature	Female	Indeterminate	Overlaps
Total	315	2861	8
Gene	287	2519	8
Exon UTR	20	181	1
CDS	78	789	3
Intron	190	1552	4
Upstream flanks	2	33	0
Downstream flanks	16	140	0
Intergenic regions	12	178	0
lncRNA	5	36	0
Transposable elements	91	829	0

Similar to what I found with C. virginica, most of the DML are in genic regions! Hopefully this bodes well for any downstream enrichment analysis.

Going forward

Finish running methylKit randomization tests on R Studio server
Update mox handbook with R information
Obtain relatedness matrix and SNPs with EpiDiverse/snp
Filter C/T SNPs
Identify overlaps between SNP data and other epigenomic data
Write methods
Write results
Identify significantly enriched GOterms associated with DML
Identify methylation islands and non-methylated regions
Determine if larval DNA/RNA should be extracted

Written on May 10, 2021