Hawaii Gigas Methylation Analysis Part 19

Restarting functional annotation with blastx

TL;DR, I decided use standard blastx instead of DIAMOND blastx to annotate the new genome. More details on why I made that decision below.

A struggle recap

I initially started using DIAMOND blastx to annotate the new C. gigas genome, but ran into several issues. Since I wasn’t getting any useful output, Sam suggested I use the existing genome annotation in this discussion. I looked at the annotation release and FTP site and found the annotation provided by the NCBI. While the annotation has protein ID and product information, I have no Uniprot Accession codes, which means I can’t obtain GOterms.

At this point, Steven suggested I retrieve accession information through Uniprot mapping. Sam had a script I could modify to retrieve Uniprot Accession codes from the genomic GFF file. I modified my Jupyter notebook to go through these steps based on Sam’s code. I ran into a few perl module installation issues, which I detailed in this discussion. Once I troubleshooted the perl issues, I still wasn’t getting any matches! Sam spot-checked a few gene IDs on the Uniprot website and didn’t get matches either. Turns out the genome is new enough that the information is not on the Uniprot website!

I spot-checked the protein IDs from the protein annotation, and saw those were returning results. I then pivoted my strategy: use protein IDs to get Uniprot information, then find a different way to map protein IDs to gene IDs. To do this programmatically, I tried using the RefSeq annotation but encountered the same no-match result, which I detailed in a new discussion. At this point, Steven suggested I just use the web interface for Uniprot mapping, since my spot-check worked. When I tried uploading all of the protein IDs to the GUI, the interface was no longer available! Same when I tried just copying and pasting instead of uploading a file. Steven tried this as well, and ended up only getting a ~1000 of the ~60,000 entries mapping. Even then, they only mapped to themselves! So, blastx it is.

Moving on to standard blastx

I reworked this script for blastx instead of DIAMOND blastx. It wasn’t too difficult, as the syntax is mostly similar between the two programs. I started running this on mox, so hopefully it finishes in a day and I can use the output for enrichment analysis!

Exit script if any command fails

set -e

Program paths


Create database from Uniprot-SwissProt information

Speciy protein database

Save to srlab/yaamini/blastdbs

-in /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot_sprot.fasta
-dbtype prot
-out /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb

Run blastx

Output format 6 produces a standard BLAST tab-delimited file

-db /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb
-query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa
-out 20210605-cgigas-roslin-mito-blastx.outfmt6
-outfmt 6
-evalue 1e-4
-max_target_seqs 1
-num_threads 28

Going forward

  1. Perform enrichment
  2. Update methods
  3. Update results
  4. Create figures
  5. Outline discussion
  6. Write discussion
  7. Write introduction
  8. Conduct randomization test with DSS
  9. Try EpiDiverse/snp for SNP extraction from WGBS data
  10. Run methylKit or randomization test on mox
  11. Transfer scripts used to a nextflow workflow
Written on June 5, 2021