WGBS Analysis Part 27
Characterizing DML
I have DML and I have genome feature tracks, so it’s time to do the thing I’ve done the most (besides yelling at R Studio server): characterize genomic locations with bedtools
!
DML summary information
TL;DR, I have a lot of DML and I’m not sure if I picked the correct thresholds.
Table 1. Number of DML obtained for sex-specific contrasts at different percent methylation differences. A q-value of 0.01 was used for each test. After unite
, I had 5306686 total CpG loci with data.
Contrast | 25% | 50% | 75% | 100% |
---|---|---|---|---|
Female | 20830 | 3709 | 315 | N/A |
Indeterminate | 129040 | 73414 | 21891 | 2861 |
My guess is that I have several DML because my sample size is lower. I think that using my most restrictive thresholds in downstream analyses is the best course of action until I get my randomization test results, which means a 75% difference for female-DML, and 100% difference for indeterminate-DML.
Another important note: methylKit
used a Fisher test for the indeterminate samples since there were two of them! I need to go through the manual and confirm what exact test was used:
Looking at DML in IGV
I created this IGV session to look at my DML and see if my cutoffs made any sense. I quickly created BEDfiles for DML lists in this Jupyter notebook and saved the output here. I also color-coded the samples such that dark blue = female and light blue/cyan = indeterminate.
Figure 1. Sample 5x bedgraphs and DML tracks in IGV.
I first thumbed through the IGV session and looked at DML identified in female samples. Based on what I saw, I feel comfortable using a 75% difference to identify DML. I looked at some DML that were at least 50% different between samples, and while I was comfortable calling that CpG differentially methylated, the sheer number of 50% DML I have make me think that being more conservative is better for a small sample size. The 25% cutoff was not convincing.
Figures 2-6. Female DML and associated methylation information for each sample. Figures 2-3 look at DML that are 75% different, Figures 3-5 look at DML that are 50% different, and Figure 6 looks at a DML that is 25% different.
Common DML between sex-specific contrasts
Once I confirmed I wanted to use the 75% different female-DML and 100% different indeterminate-DML, I wanted to see if there were any DML that were common between my sex-specific lists. I returned to this Jupyter notebook and used intersectBed
to pull out overlapping DML:
#Find overlaps between female- and indeterminate-DML
#Check head
#Count number of overlapping DML
!{bedtoolsDirectory}intersectBed \
-u \
-a DML-pH-75-Cov5-Fem.csv.bed \
-b DML-pH-100-Cov5-Ind.csv.bed \
> DML-pH-Cov5-Overlaps.bed
!head DML-pH-Cov5-Overlaps.bed
!wc -l DML-pH-Cov5-Overlaps.bed
I saved the output here. Turns out there are only 8 DML that overlap between the female and indeterminate samples!
Quick tangent: CG motif track
Before I quantified overlaps between genome feature tracks and DML, I realized I needed a better idea of how many CGs were in the Roslin genome! I opened this Jupyter notebook and used grep -c
to count the number of CGs. The output was suspiciously low!
Clearly I was counting incorrectly, so I posted this discussion for feedback. Sam mentioned that I was missing instances with CGs on a new line! I looked through my old C. virginica code and found fgrep -o -i CG {genome-file} | wc -l
. Using this code, I got ~13 million CpGs in the genome (for reference, the C. virginica genome has ~14 million). Sam still pointed out that this may not count across new lines, and in fact, he did bring up that point in an old issue! He suggested that instead, I convert the FASTA to a tab-delimited file, and count CGs that way. When I tried his code, I was getting 0 CGs:
I think the seqkit fx2tab
wasn’t properly converting the FASTA to a tab-delimited format:
Steven suggested I use some code from Jay, but I still ended up with the same error:
At this point, I realized what I should be doing is creating a CG motif track! Having a separate track would allow me to look at it in IGV and obtain intersections between other BEDfiles of interest (like, all my genome features). I uploaded the Roslin genome and mitochondrial sequence FASTA to Galaxy and used EMBOSS fuzznuc
to parse out CGs:
I uploaded my CG motif track to halfshell
, and proceeded to calculate how many CG motifs were present in each genome feature track with intersectBed
!
Table 2. Genome feature overlaps with CG motifs
Genome Feature | Number of CG Motifs |
---|---|
Total | 13,246,547 |
Genes | 7,425,225 |
CDS | 1,516,641 |
Exon | 2,216,766 |
lncRNA | 454,393 |
mRNA | 7,040,925 |
Non-CDS | 11,040,607 |
Introns | 5,221,932 |
Exon UTR | 700,167 |
All flanks | 1,091,847 |
Upstream flanks | 592,834 |
Downstream flanks | 549,787 |
Intergenic regions | 4,730,728 |
Transposable elements | 5,424,483 |
Genomic location of DML
I returned to my DML analysis notebook to keep using intersectBed
and find overlaps between genome features and female-DML, indeterminate-DML, and overlapping DML.
Table 3. DML overlaps with genome features
Genome Feature | Female | Indeterminate | Overlaps |
---|---|---|---|
Total | 315 | 2861 | 8 |
Gene | 287 | 2519 | 8 |
Exon UTR | 20 | 181 | 1 |
CDS | 78 | 789 | 3 |
Intron | 190 | 1552 | 4 |
Upstream flanks | 2 | 33 | 0 |
Downstream flanks | 16 | 140 | 0 |
Intergenic regions | 12 | 178 | 0 |
lncRNA | 5 | 36 | 0 |
Transposable elements | 91 | 829 | 0 |
Similar to what I found with C. virginica, most of the DML are in genic regions! Hopefully this bodes well for any downstream enrichment analysis.
Going forward
- Finish running
methylKit
randomization tests on R Studio server - Update
mox
handbook with R information - Obtain relatedness matrix and SNPs with EpiDiverse/snp
- Filter C/T SNPs
- Identify overlaps between SNP data and other epigenomic data
- Write methods
- Write results
- Identify significantly enriched GOterms associated with DML
- Identify methylation islands and non-methylated regions
- Determine if larval DNA/RNA should be extracted