Hawaii Gigas Methylation Analysis Part 19

Restarting functional annotation with `blastx`

TL;DR, I decided use standard blastx instead of DIAMOND blastx to annotate the new genome. More details on why I made that decision below.

A struggle recap

I initially started using DIAMOND blastx to annotate the new C. gigas genome, but ran into several issues. Since I wasn’t getting any useful output, Sam suggested I use the existing genome annotation in this discussion. I looked at the annotation release and FTP site and found the annotation provided by the NCBI. While the annotation has protein ID and product information, I have no Uniprot Accession codes, which means I can’t obtain GOterms.

At this point, Steven suggested I retrieve accession information through Uniprot mapping. Sam had a script I could modify to retrieve Uniprot Accession codes from the genomic GFF file. I modified my Jupyter notebook to go through these steps based on Sam’s code. I ran into a few perl module installation issues, which I detailed in this discussion. Once I troubleshooted the perl issues, I still wasn’t getting any matches! Sam spot-checked a few gene IDs on the Uniprot website and didn’t get matches either. Turns out the genome is new enough that the information is not on the Uniprot website!

I spot-checked the protein IDs from the protein annotation, and saw those were returning results. I then pivoted my strategy: use protein IDs to get Uniprot information, then find a different way to map protein IDs to gene IDs. To do this programmatically, I tried using the RefSeq annotation but encountered the same no-match result, which I detailed in a new discussion. At this point, Steven suggested I just use the web interface for Uniprot mapping, since my spot-check worked. When I tried uploading all of the protein IDs to the GUI, the interface was no longer available! Same when I tried just copying and pasting instead of uploading a file. Steven tried this as well, and ended up only getting a ~1000 of the ~60,000 entries mapping. Even then, they only mapped to themselves! So, blastx it is.

Moving on to standard `blastx`

I reworked this script for blastx instead of DIAMOND blastx. It wasn’t too difficult, as the syntax is mostly similar between the two programs. I started running this on mox, so hopefully it finishes in a day and I can use the output for enrichment analysis!

Exit script if any command fails

set -e

Program paths

blast=/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/

Create database from Uniprot-SwissProt information

Speciy protein database

Save to srlab/yaamini/blastdbs

${blast}makeblastdb
-in /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot_sprot.fasta
-dbtype prot
-out /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb

Run blastx

Output format 6 produces a standard BLAST tab-delimited file

${blast}blastx
-db /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb
-query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa
-out 20210605-cgigas-roslin-mito-blastx.outfmt6
-outfmt 6
-evalue 1e-4
-max_target_seqs 1
-num_threads 28

Going forward

Perform enrichment
Update methods
Update results
Create figures
Outline discussion
Write discussion
Write introduction
Conduct randomization test with DSS
Try EpiDiverse/snp for SNP extraction from WGBS data
Run methylKit or randomization test on mox
Transfer scripts used to a nextflow workflow

Written on June 5, 2021