Hawaii Gigas Methylation Analysis Part 21

Trying topGO for GOterm enrichment

I decided to use topGO because it was easier to figure out than goseq, and two different methylation analysis papers have used topGO: Li et al. (2018) and Chandra Rajan et al. (2021). Chandra Rajan et al. (2021) also published their script, found here, which I modeled my workflow off of.

To prepare for the enrichment analysis, I made the input files in this Jupyter notebook. The first thing I needed to create was a list of gene IDs and all associated GOterms. I extracted transcript IDs from the annotated union 5x BEDgraph. Then, I used the annotated blastx information before I unfolded the GOterms onto separate lines:

#Extract columns 2 and 3 to create list needed for topGO gene enrichment
#Convert ; to , for topGO formatting
!cut -f2,3 _blast-annot.tab \
| tr ";" "," \
> geneid2go.tab
!head geneid2go.tab

#Filter the gene ID-GO file based on transcript IDs in 1x union data
!awk 'BEGIN { while(getline <"union_1x.transcriptIDs.unique") id[$0]=1; } id[$1] ' geneid2go.tab \
> geneid2go-union1x.tab

In addition to making all of the input files, I also annotated the union 1x BEDfile (CpG background) and DML lists with Uniprot Accession codes, GOterms, and GOSlim information.

I then ran the topGO analysis in this R Markdown file. First, I imported the gene ID-to-GO term database, and extracted the transcript IDs (gene names) to use as the gene universe:

geneID2GO <- readMappings(file = "../Haws_08-GOterm-annotation/geneid2go-union1x.tab") #Loading the GO annotations and GeneIDs. Each line has one transcript ID and all associated GOterms
str(head(geneID2GO)) #Confirm file structure

Then, I imported my DML lists. I extracted the transcript IDs, then created a separate factor vector that indicated genes with DML in the gene name list. This factor would list genes containing DML as significant (1), and genes without DML would be not-significant (0):

ploidyGenes <- ploidyDMLGOterms$transcript #Extract transcript ID
ploidyGeneList <- factor(as.integer(geneNames %in% ploidyGenes))
names(ploidyGeneList) <- ploidyGenes

I ran the enrichment for biological processes, cellular components, and molecular functions. For each category, I created a topGO object that specified the ontology I was interested in, my gene list, annotation, and gene2GO database. Then, I ran a Fisher’s exact test to see if any processes were overrepresented:

ploidyGOdataBP <- new("topGOdata", ontology = "BP", allGenes = ploidyGeneList,
                      annot = annFUN.gene2GO, gene2GO = geneID2GO) #Create biological process topGO object
ploidyGOdataBP #Get summary of object

Available genes have annotations, but feasible ones are linked to the GO hierarchy that topGO uses (I think…see this link).

test.stat <- new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")
resultFisher.ploidyBP <- getSigGroups(ploidyGOdataBP, test.stat)
resultFisher.ploidyBP

I found enriched GOterms related to biological processes for ploidy-DML, and biological processes and molecular functions for pH-DML! I didn’t find any enriched GOterms for the interaction-DML, which makes sense considering how there was a smaller interaction effect to begin with.

Table 1. Enriched biological process and molecular function GOterms

Screen Shot 2021-06-10 at 2 56 25 AM

Going forward

  1. Update methods
  2. Update results
  3. Outline discussion
  4. Write discussion
  5. Write introduction
  6. Conduct randomization test with DSS
  7. Try EpiDiverse/snp for SNP extraction from WGBS data
  8. Transfer scripts used to a nextflow workflow
Written on June 8, 2021