Hawaii Gigas Methylation Analysis Part 22
The final countdown (for the Hawaii paper)
Two years and ten days later…I’m ready to finish off this paper. In the meantime, I’ve discussed the paper with Steven a few times and decided on the following:
- We gotta go back to `methylKit` results. While the `DSS` methods seemed useful and are what I used for my dissertation, the DML the algorithm identified did not make sense when I spot-checked them. I also can’t set a methylation threshold for DML with `DSS`, which makes it trickier to interpret or compare with other studies.
- I had Sam run `EpiDiverse/snp` with the Hawaii data, thinking that I could compare C->T SNP identification between methods. However, the `EpiDiverse/snp` output doesn’t provide a list of C->T SNPs. Sam and Steven are trying to troubleshoot this with the CEABIGR data, but for now I’m going to stick to BS-Snper output for SNP identification. The `EpiDiverse/snp` output, however, could give us genotypic information that we could incorporate later on, since Maria was unable to tell us if the diploid female and tetraploid male oysters used were from related lines, or how many half- or full-sibling families were used for the triploid oysters I eventually got.
- A few methodological novelties I want to try with this dataset include a randomization test with `methylKit`, integration with C. gigas ATAC-Seq, csRNA-Seq, and 5’-GRO data, and `KOG-MWU` for DML comparison with other Crassostrea spp. epigenetic studies.
- Figure out if we can get the pH and water quality data
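The randomization test mentioned in the list above has not been designed yet, so the following is only a minimal sketch of the general idea: shuffle the treatment labels many times and ask how often a shuffled dataset looks at least as extreme as the real one. It works on a single CpG with made-up percent methylation values. The function name, the group names, and the choice of a difference in means as the test statistic are all assumptions for illustration, not `methylKit` functionality.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=10000, seed=42):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper: the statistic and the number of permutations
    are illustrative choices, not taken from the post or from methylKit.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled values is equivalent to shuffling labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate can never be exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Made-up percent methylation at one CpG (NOT real Hawaii data).
low_ph = [82.0, 79.5, 85.1, 80.2]
ambient_ph = [41.3, 38.9, 45.0, 40.7]
p = permutation_p_value(low_ph, ambient_ph)
print(f"permutation p-value: {p:.4f}")
```

For the real dataset, the shuffled labels would presumably be fed back through the full `methylKit` DML call so that the number of DML under random labels can be compared with the observed number.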
Original `methylKit` results
First things first, I wanted to find my original `methylKit` results. Thankfully, I didn’t delete anything from the GitHub repository (and if I did, I could always go back to a different version). How people ever find anything without GitHub, an online lab notebook, and large file storage with web links is something I will never understand.
To remind myself of what I did previously, I went through the paper and my lab notebook. I used a 25% cutoff to identify DML, which is different than the 50% I normally use. I also used `min.per.group = 8L`, which means a locus needs to have suitable coverage in at least eight samples per treatment.
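To make the two thresholds concrete, here is a minimal sketch of the logic they imply. The 25% difference cutoff and the eight-samples-per-treatment rule come from the paragraph above. The 10x minimum coverage and the q-value cutoff of 0.01 are assumptions added so the example runs, and both function names are hypothetical rather than `methylKit` calls.

```python
def passes_coverage(coverage_by_group, min_per_group=8, min_coverage=10):
    """True if every treatment has enough adequately covered samples.

    coverage_by_group maps a treatment name to per-sample read depths at
    one locus (None = locus not covered in that sample).
    min_coverage=10 is an assumed value, not stated in the post.
    """
    return all(
        sum(1 for c in covs if c is not None and c >= min_coverage) >= min_per_group
        for covs in coverage_by_group.values()
    )


def is_dml(meth_diff, q_value, diff_cutoff=25.0, q_cutoff=0.01):
    """True if a locus clears the methylation-difference and q-value cutoffs.

    diff_cutoff=25.0 mirrors the 25% cutoff in the text;
    q_cutoff=0.01 is an assumed value.
    """
    return abs(meth_diff) >= diff_cutoff and q_value <= q_cutoff


# Made-up locus: read depth per sample in each treatment.
locus = {"low_pH": [12] * 8 + [3, None], "ambient": [15] * 9 + [0]}
print(passes_coverage(locus))   # both groups have >= 8 covered samples
print(is_dml(-31.2, 0.004))     # hypomethylated by 31.2 percentage points
```

Raising `diff_cutoff` to 50.0 reproduces the stricter threshold mentioned above, which makes it easy to see how many loci drop out between the two settings.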
The `methylKit` output is here, and I saved my `.RData` here. Turns out I never made figures with the `methylKit` version of the results, but I did find this lab notebook entry with the number of `methylKit` DML in each genome feature. My Jupyter notebook examining DML genomic location still has code that uses `methylKit` output. The numbers were consistent between the lab notebook and Jupyter notebook. However, I noticed that C->T SNPs were not removed prior to getting the count information! I removed the SNPs, then got updated counts. I just needed to modify a few lines of `sed` code at the end of the Jupyter notebook to get a table with the number of `methylKit` DML in each genome feature for contingency tests.
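The SNP-removal step matters because a C->T SNP reads as an unmethylated cytosine after bisulfite conversion. The actual work was done in the Jupyter notebook with `sed` and is not reproduced here. As a stand-in, this sketch shows the same idea in plain Python: drop any DML whose position matches a SNP, then tally what remains by genome feature. The tuple layout, the contig names, and both function names are hypothetical, and it assumes DML and SNP positions use the same coordinate convention.

```python
from collections import Counter


def remove_snps(dml, snps):
    """Return the DML whose (chromosome, position) is not a known C->T SNP."""
    snp_positions = set(snps)
    return [locus for locus in dml if (locus[0], locus[1]) not in snp_positions]


def count_by_feature(dml):
    """Count DML per genome feature (third element of each tuple)."""
    return Counter(feature for _, _, feature in dml)


# Made-up records: (chromosome, position, genome feature).
dml = [
    ("contig_1", 100, "Introns"),
    ("contig_1", 250, "CDS"),
    ("contig_2", 40, "Introns"),
]
snps = [("contig_1", 250)]  # hypothetical C->T SNP call

filtered = remove_snps(dml, snps)
counts = count_by_feature(filtered)
print(counts)
```

Running the count on both `dml` and `filtered` gives a quick check of how many loci the SNP filter removed from each feature.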
Table 1. Number of DML in each genome feature.
Genome Feature | pH DML (%) | Ploidy DML (%) | Common DML (%) |
---|---|---|---|
Total DML | 34 | 24 | 1 |
Hypermethylated DML | 24 | 8 (33.3%) | 0 |
Hypomethylated DML | 10 | 16 (66.7%) | 1 |
Genes | 28 | 20 | 1 |
Exon UTR | 5 | 0 | 0 |
CDS | 3 | 5 | 1 |
Introns | 20 | 15 | 0 |
Upstream flanks | 0 | 0 | 0 |
Downstream flanks | 4 | 1 | 0 |
Intergenic regions | 2 | 3 | 0 |
lncRNA | 3 | 0 | 0 |
Transposable elements | 15 | 6 | 0 |
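The post does not say which contrasts the contingency tests will use, so the following is only a sketch of how a 2x2 test could be run on counts like those in Table 1. It compares the hyper- and hypomethylated counts for pH and ploidy simply because both rows are available in the table. The function name is hypothetical, no continuity correction is applied, and with counts this small an exact test may be the better choice.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (chi2, p_value). With 1 degree of freedom the upper-tail
    probability has the closed form erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Rows: pH DML, ploidy DML. Columns: hypermethylated, hypomethylated.
# Counts copied from Table 1.
chi2, p_value = chi_square_2x2(24, 10, 8, 16)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A test of DML enrichment in a genome feature would have the same shape, but it needs the number of background CpGs in each feature, which Table 1 does not include.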
- Create `methylKit` genome location counts
- Revise code to use `methylKit` output instead of DSS output
- Change color scheme
Randomization test
Going forward
- Contingency tests for `methylKit` genome location
- Revise code to use `methylKit` output instead of DSS output
- Change color scheme for figures
- `methylKit` randomization test
- Add Rajan’s comments to the Google Doc
- Update methods
- Update results
- Revise discussion
- Revise introduction
- Transfer scripts used to a Nextflow workflow