Methylation landscape analysis

We’re updating foundational methods and results for CEABiGR! I’m working on the methylation analysis section, and decided to do my standard methylation landscape characterization for the data we have. I’m going to characterize the methylation landscape for male and female samples separately, since we’re seeing sex-specific methylation and gene expression patterns.

Revising genome feature tracks

But first…I revised the C. virginica genome feature tracks. I made the original genome feature tracks in 2018, but I didn’t set them up so that a feature was only included in one category. For example, there are overlaps between flanking regions and intergenic regions, and I think I only used Gnomon annotations. I created this Jupyter notebook to update the way I created C. virginica genome feature tracks. I also pulled the RepeatMasker output from NCBI itself, instead of using the version created by Sam. I re-created the CG motif track as well so that the creation of all feature tracks was in one notebook, and counted the overlaps between CG motifs and all genome feature tracks.

Table 1. Number of genome features and overlaps with CG motifs

| Feature | Number of Unique Features | Overlaps with CG Motifs |
|---|---|---|
| CG Motifs | 14,458,736 | N/A |
| Genes | 38,838 | 7,778,105 |
| CDS | 645,368 | 1,728,303 |
| Exon | 731,916 | 2,334,303 |
| mRNA | 60,201 | 7,507,167 |
| lncRNA | 4,750 | 281,715 |
| Non-CDS | 337,305 | 12,138,514 |
| Intron | 311,341 | 5,497,597 |
| Exon UTR | 183,389 | 606,308 |
| Upstream Flanks | 34,817 | 694,265 |
| Downstream Flanks | 35,224 | 616,684 |
| Intergenic Regions | 23,949 | 5,417,334 |
| TE | 344,267 | 611,471 |
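
To make the "Overlaps with CG Motifs" column concrete, here is a minimal pure-Python sketch of counting how many CG motifs overlap a set of feature intervals on one chromosome. It is not the code from the notebook: the function name `count_overlaps`, the BED-style 0-based half-open coordinates, and the toy positions are all assumptions made for illustration, and a dedicated interval tool would be the practical choice for a whole genome.

```python
from bisect import bisect_left


def count_overlaps(motif_starts, features):
    """Count (motif, feature) overlap pairs on a single chromosome.

    motif_starts: sorted 0-based start positions of 2-bp CG motifs (assumed input).
    features: list of (start, end) half-open intervals, BED-style (assumed input).
    A motif that overlaps two features is counted twice, once per feature.
    """
    total = 0
    for start, end in features:
        # A motif [m, m + 2) overlaps [start, end) when m < end and m + 2 > start,
        # which for integer positions means start - 1 <= m <= end - 1.
        first = bisect_left(motif_starts, start - 1)
        past_last = bisect_left(motif_starts, end)
        total += past_last - first
    return total


# Hypothetical toy data, not taken from the C. virginica genome.
toy_motifs = [3, 10, 18, 25]
toy_features = [(0, 5), (11, 20)]
toy_count = count_overlaps(toy_motifs, toy_features)
print(toy_count)
```

Because the count is per motif-feature pair, a motif sitting in overlapping annotations contributes to more than one row, so the rows of a table like Table 1 are not expected to sum to the total number of CG motifs.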

Methylation landscape

In this Jupyter notebook, I created union bedGraphs for males and females separately. I kept my code for the all-sample union bedGraph since I can’t remember if Katherine used it for any of her analyses. Once I had the union bedGraphs, I counted the number of highly, moderately, and lowly methylated CpGs in each sample. I also counted the CpGs present in each genomic feature. I created this R Markdown script to create figures and perform chi-squared tests comparing the genome feature distribution of all CpGs in the C. virginica genome with that of highly methylated CpGs. As expected, the distributions were significantly different. All output can be found in this gannet folder, and the relevant count files, statistical output, and figures can be found on GitHub.
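
As a rough illustration of the per-sample counting step, the sketch below tallies CpGs into three methylation categories from a union bedGraph. Several things here are assumptions rather than details from the notebook: the cutoffs (at least 50% for high, 10% or less for low), the tab-separated layout with a header row and one column per sample, the `N/A` filler for missing data, and the sample names `F01` and `F02`.

```python
def classify(pct, high=50.0, low=10.0):
    """Assign a percent-methylation value to a category (cutoffs are assumed)."""
    if pct >= high:
        return "high"
    if pct > low:
        return "moderate"
    return "low"


def count_by_sample(lines, missing="N/A"):
    """Count CpGs per category for each sample column of a union bedGraph.

    lines: iterable of tab-separated rows; the first row is a header of the
    form chrom, start, end, then one column per sample (assumed layout).
    """
    rows = iter(lines)
    samples = next(rows).rstrip("\n").split("\t")[3:]
    counts = {s: {"high": 0, "moderate": 0, "low": 0} for s in samples}
    for row in rows:
        values = row.rstrip("\n").split("\t")[3:]
        for sample, value in zip(samples, values):
            if value == missing:
                continue  # no data for this CpG in this sample
            counts[sample][classify(float(value))] += 1
    return counts


# Hypothetical toy union bedGraph with two made-up samples.
toy_lines = [
    "chrom\tstart\tend\tF01\tF02",
    "chr1\t100\t102\t85.0\t90.0",
    "chr1\t250\t252\t30.0\tN/A",
    "chr1\t400\t402\t0.0\t10.0",
    "chr1\t610\t612\tN/A\t50.0",
]
toy_counts = count_by_sample(toy_lines)
print(toy_counts)
```

Skipping the filler value keeps each sample's totals restricted to CpGs with data in that sample, so per-sample totals can differ even though every sample shares the same rows.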

Figures 1-3. Genome feature overlaps for all 10x CpGs with data in at least one sample, highly methylated, moderately methylated, and lowly methylated CpGs for female and male union bedGraphs

[Figure 1]

[Figure 2]

[Figure 3]
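
For readers who want to see what the chi-squared comparison in the Methylation landscape section computes, here is a minimal sketch of Pearson's chi-squared statistic for a contingency table. It assumes a table with one row per CpG set (for example, all CpGs versus highly methylated CpGs) and one column per genome feature; how the R Markdown script actually arranged its table is not stated here, and the counts below are made up.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom) for Pearson's chi-squared test.

    table: list of rows of observed counts, all rows the same length.
    Expected counts come from the row and column totals under independence.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are two CpG sets, columns are two feature categories.
toy_table = [
    [10, 20],
    [20, 10],
]
toy_stat, toy_dof = chi_squared(toy_table)
print(toy_stat, toy_dof)
```

The p-value would then come from a chi-squared distribution with the returned degrees of freedom, which a statistics package handles directly.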

Going forward

  1. Update foundation methods
  2. Update foundation results
  3. Revise with new expression data from Ariana
  4. Tune sPLS parameters
  5. Run sex-specific sPLS
  6. Identify drivers
Written on May 11, 2022