Hawaii Gigas Methylation Analysis Part 16

Reworking pH-DML

When writing up my methods, I realized that I incorrectly re-assigned treatments in methylKit when I was shuffling between ploidy and pH treatment assignments! Now that I was using min.per.group with unite, I couldn’t just re-assign treatments because the min.per.group would always be based off of the ploidy treatment information, not the pH treatment information. This could mean that I would call DML with less than eight samples per pH treatment! So, back to the code I go.

Returning to methylKit

I opened the R Studio server and my R script to recode the treatment re-assignment. I just needed to re-assign treatments the step before unite:

methylationInformationFilteredCov5T <- methylKit::reorganize(processedFilteredFilesCov5,
                                                             sample.id = sampleMetadata$sampleID,
                                                             treatment = sampleMetadata$pHTreatment) %>%
  methylKit::unite(., destrand = FALSE, min.per.group = 8L) #Re-assign treatment information to filtered and normalized 5x CpG loci, then unite 5x loci with coverage in at least 8/12 samples/treatment

With this re-assignment, I had 5,086,421 CpGs with data in at least eight samples per pH treatment. I then used this new object to identify DML! After finishing up on mox, I used rsync to move my revised DML lists and data to gannet.

Unique C/T SNPs

The next thing I wanted to do was determine how many unique C/T SNPs were in my data in this Jupyter notebook. First, I used cat to combine all SNP lists:

#Combine C/T SNPs into one file
!cat *CT-SNPs.tab >> all-CT-SNPs.tab

Then, I used sort and uniq to pull out all the unique SNPs:

!sort all-CT-SNPs.tab \
| uniq \
> unique-CT-SNPs.tab

When I looked at the output, I realized there were still duplicate C/T SNPs! The eighth column had information from the VCF file that was unique from each sample, so that was likely preventing me from pulling out the unique SNPs properly. Instead, I used awk to pull the first five columns out (chr, bp, ?, C, T), and used that as input into sort and uniq.

!awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' all-CT-SNPs.tab \
| sort \
| uniq \
> unique-CT-SNPs.tab

That gave me 289,826 unique C/T SNPs in my data.

DML characterization

In this Jupyter notebook, I used my new pH-DML and unique C/T SNP lists to look at DML locations in the genome.

Table 1 Overlaps between DML and various genome features.

Genome Feature pH-DML ploidy-DML Common DML
Total 42 29 2
Genes 36 25 2
Exon UTR 5 0 0
CDS 7 7 1
Introns 24 18 1
Upstream flanks 0 0 0
Downstream flanks 4 1 0
Intergenic regions 2 3 0
lncRNA 3 0 0
Transposable elements 16 8 0
Unique C/T SNPs 6 5 0

Going forward

  1. Test-run DSS and ramwas
  2. Try EpiDiverse/snp for SNP extraction from WGBS data
  3. Run methylKit randomization test on mox
  4. Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
  5. Transfer scripts used to a nextflow workflow
  6. Update methods
  7. Update results
  8. Create figures
Written on May 22, 2021