Hawaii Gigas Methylation Analysis Part 19
Restarting functional annotation with blastx
TL;DR, I decided use standard blastx
instead of DIAMOND blastx
to annotate the new genome. More details on why I made that decision below.
A struggle recap
I initially started using DIAMOND blastx
to annotate the new C. gigas genome, but ran into several issues. Since I wasn’t getting any useful output, Sam suggested I use the existing genome annotation in this discussion. I looked at the annotation release and FTP site and found the annotation provided by the NCBI. While the annotation has protein ID and product information, I have no Uniprot Accession codes, which means I can’t obtain GOterms.
At this point, Steven suggested I retrieve accession information through Uniprot mapping. Sam had a script I could modify to retrieve Uniprot Accession codes from the genomic GFF file. I modified my Jupyter notebook to go through these steps based on Sam’s code. I ran into a few perl module installation issues, which I detailed in this discussion. Once I troubleshooted the perl issues, I still wasn’t getting any matches! Sam spot-checked a few gene IDs on the Uniprot website and didn’t get matches either. Turns out the genome is new enough that the information is not on the Uniprot website!
I spot-checked the protein IDs from the protein annotation, and saw those were returning results. I then pivoted my strategy: use protein IDs to get Uniprot information, then find a different way to map protein IDs to gene IDs. To do this programmatically, I tried using the RefSeq annotation but encountered the same no-match result, which I detailed in a new discussion. At this point, Steven suggested I just use the web interface for Uniprot mapping, since my spot-check worked. When I tried uploading all of the protein IDs to the GUI, the interface was no longer available! Same when I tried just copying and pasting instead of uploading a file. Steven tried this as well, and ended up only getting a ~1000 of the ~60,000 entries mapping. Even then, they only mapped to themselves! So, blastx
it is.
Moving on to standard blastx
I reworked this script for blastx
instead of DIAMOND blastx
. It wasn’t too difficult, as the syntax is mostly similar between the two programs. I started running this on mox
, so hopefully it finishes in a day and I can use the output for enrichment analysis!
Exit script if any command fails
set -e
Program paths
blast=/gscratch/srlab/programs/ncbi-blast-2.10.1+/bin/
Create database from Uniprot-SwissProt information
Speciy protein database
Save to srlab/yaamini/blastdbs
${blast}makeblastdb
-in /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot_sprot.fasta
-dbtype prot
-out /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb
Run blastx
Output format 6 produces a standard BLAST tab-delimited file
${blast}blastx
-db /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.blastdb
-query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa
-out 20210605-cgigas-roslin-mito-blastx.outfmt6
-outfmt 6
-evalue 1e-4
-max_target_seqs 1
-num_threads 28
Going forward
- Perform enrichment
- Update methods
- Update results
- Create figures
- Outline discussion
- Write discussion
- Write introduction
- Conduct randomization test with
DSS
- Try EpiDiverse/snp for SNP extraction from WGBS data
- Run
methylKit
or randomization test onmox
- Transfer scripts used to a nextflow workflow