Hawaii Gigas Methylation Analysis Part 16

Reworking pH-DML

When writing up my methods, I realized that I incorrectly re-assigned treatments in methylKit when I was shuffling between ploidy and pH treatment assignments! Now that I was using min.per.group with unite, I couldn’t just re-assign treatments because the min.per.group would always be based off of the ploidy treatment information, not the pH treatment information. This could mean that I would call DML with less than eight samples per pH treatment! So, back to the code I go.

Returning to `methylKit`

I opened the R Studio server and my R script to recode the treatment re-assignment. I just needed to re-assign treatments the step before unite:

methylationInformationFilteredCov5T <- methylKit::reorganize(processedFilteredFilesCov5,
                                                             sample.id = sampleMetadata$sampleID,
                                                             treatment = sampleMetadata$pHTreatment) %>%
  methylKit::unite(., destrand = FALSE, min.per.group = 8L) #Re-assign treatment information to filtered and normalized 5x CpG loci, then unite 5x loci with coverage in at least 8/12 samples/treatment

With this re-assignment, I had 5,086,421 CpGs with data in at least eight samples per pH treatment. I then used this new object to identify DML! After finishing up on mox, I used rsync to move my revised DML lists and data to gannet.

Unique C/T SNPs

The next thing I wanted to do was determine how many unique C/T SNPs were in my data in this Jupyter notebook. First, I used cat to combine all SNP lists:

#Combine C/T SNPs into one file
!cat *CT-SNPs.tab >> all-CT-SNPs.tab

Then, I used sort and uniq to pull out all the unique SNPs:

!sort all-CT-SNPs.tab \
| uniq \
> unique-CT-SNPs.tab

When I looked at the output, I realized there were still duplicate C/T SNPs! The eighth column had information from the VCF file that was unique from each sample, so that was likely preventing me from pulling out the unique SNPs properly. Instead, I used awk to pull the first five columns out (chr, bp, ?, C, T), and used that as input into sort and uniq.

!awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' all-CT-SNPs.tab \
| sort \
| uniq \
> unique-CT-SNPs.tab

That gave me 289,826 unique C/T SNPs in my data.

DML characterization

In this Jupyter notebook, I used my new pH-DML and unique C/T SNP lists to look at DML locations in the genome.

Table 1 Overlaps between DML and various genome features.

Genome Feature	pH-DML	ploidy-DML	Common DML
Total	42	29	2
Genes	36	25	2
Exon UTR	5	0	0
CDS	7	7	1
Introns	24	18	1
Upstream flanks	0	0	0
Downstream flanks	4	1	0
Intergenic regions	2	3	0
lncRNA	3	0	0
Transposable elements	16	8	0
Unique C/T SNPs	6	5	0

Going forward

Test-run DSS and ramwas
Try EpiDiverse/snp for SNP extraction from WGBS data
Run methylKit randomization test on mox
Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
Transfer scripts used to a nextflow workflow
Update methods
Update results
Create figures

Written on May 22, 2021