Hawaii Gigas Methylation Analysis Part 19
Restarting functional annotation with
TL;DR, I decided use standard
blastx instead of
DIAMOND blastx to annotate the new genome. More details on why I made that decision below.
A struggle recap
I initially started using
DIAMOND blastx to annotate the new C. gigas genome, but ran into several issues. Since I wasn’t getting any useful output, Sam suggested I use the existing genome annotation in this discussion. I looked at the annotation release and FTP site and found the annotation provided by the NCBI. While the annotation has protein ID and product information, I have no Uniprot Accession codes, which means I can’t obtain GOterms.
At this point, Steven suggested I retrieve accession information through Uniprot mapping. Sam had a script I could modify to retrieve Uniprot Accession codes from the genomic GFF file. I modified my Jupyter notebook to go through these steps based on Sam’s code. I ran into a few perl module installation issues, which I detailed in this discussion. Once I troubleshooted the perl issues, I still wasn’t getting any matches! Sam spot-checked a few gene IDs on the Uniprot website and didn’t get matches either. Turns out the genome is new enough that the information is not on the Uniprot website!
I spot-checked the protein IDs from the protein annotation, and saw those were returning results. I then pivoted my strategy: use protein IDs to get Uniprot information, then find a different way to map protein IDs to gene IDs. To do this programmatically, I tried using the RefSeq annotation but encountered the same no-match result, which I detailed in a new discussion. At this point, Steven suggested I just use the web interface for Uniprot mapping, since my spot-check worked. When I tried uploading all of the protein IDs to the GUI, the interface was no longer available! Same when I tried just copying and pasting instead of uploading a file. Steven tried this as well, and ended up only getting a ~1000 of the ~60,000 entries mapping. Even then, they only mapped to themselves! So,
blastx it is.
Moving on to standard
I reworked this script for
blastx instead of
DIAMOND blastx. It wasn’t too difficult, as the syntax is mostly similar between the two programs. I started running this on
mox, so hopefully it finishes in a day and I can use the output for enrichment analysis!
Exit script if any command fails
Create database from Uniprot-SwissProt information
Speciy protein database
Save to srlab/yaamini/blastdbs
Output format 6 produces a standard BLAST tab-delimited file