WGBS Analysis Part 32
Miscellaneous analyses
After updating my methods and results sections with information from my 4 vs. 4 analysis, I realized there were a few things I needed to complete before sending my draft to my Reading Committee.
Modifying GOterm annotations
The first thing I wanted to do was add methylation difference information from methylKit
to the GOterm annotations. This way, I could see if certain genes or biological processes were more hyper- or hypomethylated in my low pH samples. I returned to this Jupyter notebook and modified my paste | awk
code to include the column with methylation difference information:
#Column 5 = meth.diff
!paste DML-pH-50-Cov5-All.GeneIDs ../../output/10_DML-characterization/DML-pH-50-Cov5-All-Gene-wb.bed \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $3, $4, $5}' \
| sort \
> DML-pH-50-Cov5-All.GeneIDs.geneOverlap
Since all of the files build off of eachother in my code, my final annotation file now included methylation difference information.
I then wanted to count the number of genes and transcripts associated with the enriched GOterms. I opened this R Markdown script, then created a dataframe with only the enriched GOterms (P-value < 0.01). I merged the enriched GOterms with the annotation information, then counted the number of unique genes or transcripts:
sigRes.allBP <- allRes.allBP[1:35,c(1, 6)] #Filter significantly enriched GOterms, only keep GOID and p-value
colnames(sigRes.allBP) <- c("GOID", "p.value") #Change column names
head(sigRes.allBP)
sigRes.allBPAnnot <- merge(sigRes.allBP, allDMLGOtermsBP, by = "GOID") #Additional annotations
sigRes.allBPAnnot <- unique(sigRes.allBPAnnot[,-11]) #Drop GOcat column and retain only unique lines
length(unique(sigRes.allBPAnnot$geneID)) #20 unique genes
length(unique(sigRes.allBPAnnot$transcript)) #42 unique genes
head(sigRes.allBPAnnot) #Confirm formatting
Interestingly, there were 20 genes associated with enriched biological process GOterms, but 42 transcripts! This tells me that the DML associated with enriched GOterms are found in genes that produce multiple transcripts. This could be important for alternative splicing as a way to control transcription. I repeated this process with the molecular function GOterms, and found that they were associated with 12 unique genes and 50 unique transcripts.
Visualizing enriched terms with simplifyEnrichment
In the same R Markdown script, I decided to create heatmaps similar to Chille et al. (2021) to visualize my enriched GOterms. I started by installing the package simplifyEnrichment
. When I tried installing the package, I realized it was only available for the latest version of R! I couldn’t update R on genefish, so instead I updated R on my own computer. After I got the latest R version, I was able to install simplifyEnrichment
with BiocManager
.
To make the heatmap, I needed to create a semantic similarity matrix of my enriched GOterms. Following the instruction manual, I calculated the semantic similarity matrix using the “Relevance” measure (default) using GO_similarity
. Then, I used simplifyGO
to create the plot! I had to dig into the manual to find how to change aesthetics related to the word cloud and text sizing:
matBP <- GO_similarity(go_id = sigRes.allBP$GOID, ont = "BP") #Calculate the semantic similarity matrix using the Rel method (default)
#pdf("figures/simplifyEnrichment-BP.pdf", width = 11, height = 8.5) #Save figure
simplifyGO(matBP,
column_title = "", col = rev(plotColors), fontsize_range = c(10,40),
word_cloud_grob_param = list(col = "black", max_width = 25)) #Plot GOterms based on semantic similarity. Do not include a column title. Set colors to be plot colors, and set fontsize to range from 10 to 40. Pass arguments to word_cloud_grob_param to dictate the colors of the words and maximum width
#dev.off()
Figures 1-2. Biological process and molecular function simplifyEnrichment
figure
I used the biological process figure in the main text, and kept the molecular function one in the supplement.
Sequencing summary table
The last thing I wanted to do was concatenate all the information related to sequence trimming and alignment. I looked through the fastqc
and bismark
alignment, deduplication, and methylation extraction MultiQC reports to create this monster table for the supplement.
Going forward
- Report
mc.cores
issue tomethylKit
- Perform randomization test
- Update
mox
handbook with R information - Determine if larval DNA/RNA should be extracted