WGBS Analysis Part 31

Revising methylKit parameters

Before I defended, I wanted to revise my methylKit code to test all low and ambient pH samples together, instead of separating female and indeterminate samples for analysis. I can justify including the indeterminate samples with the female samples since they may mature into female oysters, and I’m anyways using maturation stage as a covariate to account for any stage-specific differences.

4 vs. 4 test

I returned to the R Studio Server to run methylKit, then transferred the code to this R Markdown script. Using a q-value of 0.01, I identified 12,826 DML with a 25% difference, 1,599 DML with a 50% difference, and 59 DML with a 75% difference. When I opened this IGV session to spot-check the DML, the 50% difference looked very reasonable.

Screen Shot 2021-06-16 at 11 07 13 AM

Screen Shot 2021-06-16 at 11 07 44 AM

Screen Shot 2021-06-16 at 11 08 22 AM

Figures 1-3. IGV screenshots for DML with a 50% difference between treatments.

Even though 1,599 is way more DML that I’ve identified previously, I decided to stick with these parameters! I then updated the figures that used the methylKit data. First, I wanted to include maturation stage information in the PCA to demonstrate that samples did not cluster together by maturation stage.

Screen Shot 2021-06-16 at 2 40 22 PM

Figure 4. Global methylation PCA with stage-specific information

Next, I wanted to see if DML that overlapped with SNPs clustered together in my heatmap. When I added an asterisk to each row that was a SNP-DML overlap, I didn’t see any clustering. I decided to produce one heatmap for all DML identified, instead of splitting it by SNP-DML overlaps and unique DML.

Screen Shot 2021-07-06 at 10 23 19 AM

Figure 5. Heatmap of DML

Finally, I plotted the distribution of DML in various chromosomes against the number of genes in that chromosome. The number of DML seemed to track the number of genes.

Screen Shot 2021-07-06 at 10 23 38 AM

Figure 6. Chromosomal distribution of DML

Genomic location of revised DML

I took my revised DML list and used it in this Jupyter notebook to characterize the genomic location of DML. The majority of DML were found in genes, with more in introns than CDS or exon UTR. Unique C/T SNPs overlapped with ~20% of DML. I then updated my statistical tests and figures in this R Markdown script. The proportions of DML in CDS, introns, downstream flanks, lncRNA, and intergenic regions were significantly different than that of highly methylated CpGs in those same regions.

Screen Shot 2021-07-06 at 10 34 32 AM

Figure 7. Genomic location of highly methylated CpGs and DML

Revisiting topGO enrichment

Now that I had a revised DML list, I wanted to try enrichment again. I annotated my DML list in this Jupyter notebook, I opened this R Markdown script. Prior to enrichment, I created this table with each GOterm associated with genes containing DML, and its frequency in my dataset. I used this information to create a REVIGO figure to visualize the biological processes associated with all DML for my defense slides.

Screen Shot 2021-07-06 at 10 45 49 AM

Figure 8. All biological process GOterms associated with DML in genes

I then proceeded with a Fisher’s Exact test for enrichment with topGO, and found enriched biological process and molecular function GOterms with this dataset! After I defend, I want to find a way to visualize this result.

Going forward

  1. Visualize enrichment results
  2. Report mc.cores issue to methylKit
  3. Perform randomization test
  4. Update mox handbook with R information
  5. Determine if larval DNA/RNA should be extracted
Written on June 16, 2021