DML Analysis Part 38

Gene enrichment and description

Now that I have annotated DML tables, I want to try a gene enrichment and find a good way to describe the functions of the genes in my annotations. Ideally, I can use the Uniprot Accession codes to get GOterms to do gene enrichment and to describe functions.

Obtaining GOterms

Turns out I had this exact issue last year (have I really not progressed…? shudders). After some back-and-forth with Sam and Shelly, I realized I could download the Uniprot-SwissProt database with additional Gene Ontology columns. I went to this website and added columns of interest. I then downloaded the database as a tab-delimited file. I initially tried downloading it as a FASTA, but Sam pointed out that I needed to download it as a .txt file if I wanted to keep the additional columns. My file had the following columns:

Entry (Uniprot-Accession)
Entry Name (Uniprot-ID)
Status (reviewed)
Protein names
Gene names
Organism
Length
Gene ontology IDs
Gene ontology (GO)
Gene ontology (biological processes)
Gene ontology (cellular component)
Gene ontology (molecular function)

I then imported the file into this R Markdown file. I dropped the columns I didn't need, so my final annotation tables (found in this folder) now include GO-ID, GO-BP, GO-CC, and GO-MF.
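For anyone who wants to reproduce the import-and-merge step outside of R, here is a minimal sketch in Python using only the standard library. It is not the code from my R Markdown file. The accession codes, GO IDs, DML row fields, and function names are all made-up placeholders; only the column headers come from the list above.

```python
import csv
import io

# Columns kept from the UniProt export, renamed to the short labels used
# in the annotation tables. Headers are copied from the list in this post.
KEEP = {
    "Gene ontology IDs": "GO-ID",
    "Gene ontology (biological processes)": "GO-BP",
    "Gene ontology (cellular component)": "GO-CC",
    "Gene ontology (molecular function)": "GO-MF",
}

def load_go_columns(handle):
    """Map each accession ("Entry") to its renamed GO columns."""
    # QUOTE_NONE: treat quote characters in protein names as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["Entry"]: {new: row[old] for old, new in KEEP.items()}
        for row in reader
    }

def annotate(dml_rows, go_table):
    """Left join: keep every DML row; GO columns are '' when unmatched."""
    empty = {name: "" for name in KEEP.values()}
    return [{**row, **go_table.get(row["accession"], empty)} for row in dml_rows]

# Toy stand-in for the downloaded file (placeholder accessions and GO IDs).
UNIPROT_TSV = (
    "Entry\tEntry Name\tGene ontology IDs\t"
    "Gene ontology (biological processes)\t"
    "Gene ontology (cellular component)\t"
    "Gene ontology (molecular function)\n"
    "ACC0001\tTOY1_EXAMPLE\tGO:9000001; GO:9000002\t"
    "toy process A [GO:9000001]\ttoy component [GO:9000002]\t\n"
)

# Toy stand-in for an annotated DML table (hypothetical fields).
DML_ROWS = [
    {"chr": "scaffold1", "pos": "100", "accession": "ACC0001"},
    {"chr": "scaffold2", "pos": "250", "accession": "ACC0999"},
]

go_table = load_go_columns(io.StringIO(UNIPROT_TSV))
annotated = annotate(DML_ROWS, go_table)
for row in annotated:
    print(row["accession"], "|", row["GO-BP"])
```

A left join keeps DML that hit an accession with no GO annotation, so they stay visible in the table instead of silently disappearing.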

Functional description

Now that I had GOterms assigned to genes, I could try grouping GOterms together to describe genes. For each Uniprot Accession code, I have three different GOterm categories: biological processes, cellular component, and molecular function. For my DML-exon and DML-intron annotations, I kept only Uniprot Accession codes with an e-value no larger than 10^-10 and isolated the first three GO-BP codes for each. I used count from the dplyr package to create summary tables, found here:

It’s really good information, but I’m not sure how to include such long tables in a paper. I think I’ll need to map the GOterms to parent (or grandparent) GOterms similar to what Shelly did. I’ll review her lab notebook to see how she did that.
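To make the filter-and-count step concrete, here is a rough Python sketch (my actual tables came from dplyr in R). It assumes the GO-BP column separates terms with semicolons, as in the UniProt export, and that each row carries an e-value. The field names, toy rows, and the choice to count once per row rather than once per unique accession are assumptions for illustration.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # keep matches with e-value no larger than this

def count_top_bp_terms(rows, max_terms=3, cutoff=E_VALUE_CUTOFF):
    """Tally the first `max_terms` GO-BP terms of every row passing the cutoff."""
    counts = Counter()
    for row in rows:
        if float(row["evalue"]) > cutoff:
            continue  # weak match: skip it
        # Split the semicolon-separated GO-BP string and drop empty pieces.
        terms = [t.strip() for t in row["GO-BP"].split(";") if t.strip()]
        counts.update(terms[:max_terms])
    return counts

# Toy rows with placeholder accessions and GO terms (hypothetical data).
toy_rows = [
    {"accession": "ACC0001", "evalue": "1e-50",
     "GO-BP": "toy process A [GO:9000001]; toy process B [GO:9000002]; "
              "toy process C [GO:9000003]; toy process D [GO:9000004]"},
    {"accession": "ACC0002", "evalue": "1e-12",
     "GO-BP": "toy process A [GO:9000001]"},
    {"accession": "ACC0003", "evalue": "1e-5",
     "GO-BP": "toy process B [GO:9000002]"},
    {"accession": "ACC0004", "evalue": "1e-20", "GO-BP": ""},
]

bp_counts = count_top_bp_terms(toy_rows)
for term, n in bp_counts.most_common():
    print(n, term)
```

Taking the first three terms simply follows the order in which they appear in the GO-BP column, so the tally depends on that ordering.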

Gene enrichment with DAVID

I know one method of obtaining GOterms from Uniprot Accession codes is to use DAVID. Before I could do this, I needed to match Uniprot Accession codes to my gene background file. I returned to this Jupyter notebook to use intersectBed and characterize DML background and mRNA overlaps. In my R Markdown file, I matched Uniprot Accession codes to DMLBackground-mRNA overlaps.

I then performed a gene enrichment with DAVID and put the output here. No surprise: nothing was significantly enriched.
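For intuition about what an enrichment test is asking, here is a small Python sketch of a plain one-sided hypergeometric (Fisher-style) over-representation test for a single GOterm. This is not DAVID's exact calculation: as I understand it, DAVID reports an EASE score, a more conservative modification of Fisher's exact test. The function name and all counts below are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of
    n genes drawn without replacement from a background of N genes,
    K of which carry the GOterm."""
    total = comb(N, n)
    upper = min(n, K)  # cannot see more hits than list size or annotated genes
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Hypothetical counts: 500 background genes, 40 carrying the term;
# a DML gene list of 50 genes, 6 of which carry the term.
p_value = hypergeom_upper_tail(k=6, n=50, K=40, N=500)
print(f"one-sided p-value: {p_value:.4f}")
```

With many GOterms tested at once, each raw p-value also needs a multiple-testing correction before a term can be called enriched.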

A new approach

During our NSF E20 meeting, Steven suggested I focus on enrichment instead of description. Katie pointed out a different gene enrichment tool, GO-MWU. I’ll tackle this next.

Going forward

Perform gene enrichment with GO-MWU
Work through gene-level analysis
Update methods and results
Update paper repository
Outline the discussion
Write the discussion
Write the introduction
Revise my abstract
Share the draft with collaborators and get feedback
Post the paper on bioRxiv
Prepare the manuscript for publication

Written on June 27, 2019