WGBS Analysis Part 6
Preliminary data for PCSGA
My PCSGA practice talk is tomorrow, and my real talk is on Tuesday! Today I’ll finish processing my data and get tables and figures prepared for my talk.
bismark output in
EVERYTHING FINALLY PROCESSED CORRECTLY (as far as I can tell)! I moved all my output off Mox and onto
gannet with this code:
rsync --archive --progress --verbose /gscratch/scrubbed/yaamini/analyses/Gigas-WGBS/2019-09-11-Gigas-Bismark/* email@example.com:/Volumes/web/spartina/2019-09-03-project-gigas-oa-meth/analyses/2019-09-12-Bismark
I used the revised deduplicated and sorted files in
methylKit. Something to remember:
methylKit uses a Fisher’s exact test instead of a logistic regression since there are no replicates.
Table 1. Number of DML identified by
methylKit that are a 100% different between samples with 10x coverage.
|Coverage||Date Created||Number of DML|
Looks like the new deduplicated and sorted files gave me a few more DML. Although I looked at different coverage metrics, I’m still going to use the 10x coverage data with a 100% methylation difference. These settings are stringent to account for the fact that I only have two samples.
I tried the
diffMethPerChr to obtain a breakdown of DML by chromosomes. The output doesn’t look that great, but at least it’s a visual that shows how sparse the DML are in the genome.
jpeg(filename = "2019-09-15-Loci-Analysis/2019-09-15-DML-Distribution-by-Chromosome.jpeg", height = 1000, width = 1000) #Save file with designated name diffMethPerChr(differentialMethylationStatsFilteredCov10Destrand, plot = TRUE, qvalue.cutoff = 0.001, meth.cutoff = 99) #Look at distribution of hyper- and hypomethylated DML per chromosome. Create an accompanying plot. dev.off()
Figure 1. Percentage of DML by chromosome.
There are also still weird things going on with the DML background when using
destrand = TRUE…I’ll worry about this after PCSGA.
destrand = TRUE output.
Verifying DML in IGV
I returned to this notebook and used my new
bismark output to revise 5x and 10x coverage tracks. I moved the finished tracks to
gannet, then imported these tracks into IGV along with the revised BEDfiles from
For the most part, things lined up and looked good! Weird inconsistencies do exist…again, something to think about after PCSGA.
Figure 3. Tracks and CG motifs not lining up in IGV.
Characterizing DML locations
In this issue, Steven provided this link and this link to C. gigas feature tracks. I created this notebook to download feature tracks from
eagle and characterize DML locations (P.S. by “created,” I mean I copied my C. virginica notebook and deleted all the cells I no longer wanted. I feel so accomplished).
Table 2. Counts for each genome feature.
|Genome Feature Track||Length|
What’s interesting at first glance is that 1) there are not an equal number of genes and promoters and 2) there are 4 million less CG motifs in the C. gigas genome than the C. virginica genome.
Table 3. Number of overlaps with various genome feature tracks. For example, the 1.7% overlap between exons and CG motifs means that exons contained 1.7% of all CG motifs.
|Genome Feature Track||Overlaps with CG motifs||Overlaps with DML||Overlaps with TE|
|CG motifs||N/A||624 (99.4%)||82,554 (0.8%)|
|Exons||172,056 (1.7%)||157 (25%)||2,597 (2.2%)|
|Introns||150,884 (1.5%)||285 (45.4%)||18,989 (15.9%)|
|Genes||28,015 (0.3%)||442 (70.4%); 398 unique genes (61.9%%)||11,748 (9.8%)|
|Promoters||5,696 (0.06%)||24 (3.8%)||3,966 (3.3%)|
|Transposable Elements||82,554 (0.8%)||8 (1.3%)||N/A|
|No overlaps||5,118,363 (51%)||165 (26.3%)||N/A|
It’s interesting to see that most of the DML are actually in introns. This makes me thing that DML could be involved with alternative splicing. There are also slightly more DML in unannotated intergenic regions than exons. I put the overlap proportions in this file and compared overlap proportions in this R Markdown script. I didn’t know how to effectively put files together to generally characterize methylation trends, so I compared total CpGs with DML instead of methylated CpGs with DML. The distributions are different, and I generated a figure which can be found here in pdf form.
Bottom line, this was unsuccessful. I posted this issue to see if we had any annotated gene information since the GFF I’m using has CGI IDs and does not match up with what I see on GigaTON. Steven ran a
blast, and I used that output to try and annotate my DML-gene overlaps in this notebook. Something’s fishy though because I can’t do the annotation! I’m pretty sure I modified and joined all my files correctly until then, so there must be some reason why the IDs aren’t matching up. At this point, I think it’s time to make a talk instead of trying to work out analyses.
- PUT TOGETHER A TALK
- Present preliminary results at PCSGA
- Figure out how to clean up results for WSN and ASLO
- Potentially sequence more samples…?
- Return to C. virginica data and finish. that. paper.