WGBS Analysis Part 31
Before I defended, I wanted to revise my
methylKit code to test all low and ambient pH samples together, instead of separating female and indeterminate samples for analysis. I can justify including the indeterminate samples with the female samples since they may mature into female oysters, and I’m anyways using maturation stage as a covariate to account for any stage-specific differences.
4 vs. 4 test
I returned to the R Studio Server to run
methylKit, then transferred the code to this R Markdown script. Using a q-value of 0.01, I identified 12,826 DML with a 25% difference, 1,599 DML with a 50% difference, and 59 DML with a 75% difference. When I opened this IGV session to spot-check the DML, the 50% difference looked very reasonable.
Figures 1-3. IGV screenshots for DML with a 50% difference between treatments.
Even though 1,599 is way more DML that I’ve identified previously, I decided to stick with these parameters! I then updated the figures that used the
methylKit data. First, I wanted to include maturation stage information in the PCA to demonstrate that samples did not cluster together by maturation stage.
Next, I wanted to see if DML that overlapped with SNPs clustered together in my heatmap. When I added an asterisk to each row that was a SNP-DML overlap, I didn’t see any clustering. I decided to produce one heatmap for all DML identified, instead of splitting it by SNP-DML overlaps and unique DML.
Figure 5. Heatmap of DML
Finally, I plotted the distribution of DML in various chromosomes against the number of genes in that chromosome. The number of DML seemed to track the number of genes.
Figure 6. Chromosomal distribution of DML
Genomic location of revised DML
I took my revised DML list and used it in this Jupyter notebook to characterize the genomic location of DML. The majority of DML were found in genes, with more in introns than CDS or exon UTR. Unique C/T SNPs overlapped with ~20% of DML. I then updated my statistical tests and figures in this R Markdown script. The proportions of DML in CDS, introns, downstream flanks, lncRNA, and intergenic regions were significantly different than that of highly methylated CpGs in those same regions.
Now that I had a revised DML list, I wanted to try enrichment again. I annotated my DML list in this Jupyter notebook, I opened this R Markdown script. Prior to enrichment, I created this table with each GOterm associated with genes containing DML, and its frequency in my dataset. I used this information to create a REVIGO figure to visualize the biological processes associated with all DML for my defense slides.
Figure 8. All biological process GOterms associated with DML in genes
I then proceeded with a Fisher’s Exact test for enrichment with
topGO, and found enriched biological process and molecular function GOterms with this dataset! After I defend, I want to find a way to visualize this result.
- Visualize enrichment results
- Perform randomization test
moxhandbook with R information
- Determine if larval DNA/RNA should be extracted