Hawaii Gigas Methylation Analysis Part 18
Functional annotation for new C. gigas genome
Now that I have DML lists from methylKit
and DSS
, I need to determine if any biological processes associated with genic DML are enriched. The first step in that process is getting annotation information for the Roslin genome!
The game plan
But first, the larger game plan. I know gene enrichment methods aren’t standardized in our lab yet, so I did some digging to see what other papers have used. Here are some of the methods I saw:
- My C. virginica paper:
GO-MWU
, then examination of GO Slim terms - Li et al. 2018: Enrichment with
topGO
- Chile et al. 2021: Enrichment with
goseq
, visualization withsimplifyEnrichment
I liked the visualizations from Chile et al. (2021), so I dug into the workflow more. They used several databases to actually get GOterms: DIAMOND blastx
to the NCBI database, InterPro Scan for GO and KEGG terms, Blast2GO to combine DIAMOND
and InterPro Scan output, and Uniprot matching to pick up anything else. I didn’t think this was necessary for my immediate purposes, but it’s good to consider for later.
Here’s what I’ll do for functional annotation and gene enrichment in the meantime:
blastx
to Uniprot database- Match accession terms to GOterms using lab workflow
- Work through
goseq
andsimplifyEnrichment
to produce gene enrichment figures
I’ll tackle the functional annotation part in this notebook post.
DIAMOND blastx
struggles
Sam has been using DIAMOND
to run blastx
on mox
in a more time-efficient manner. I decided to do the same! I created this script in my Manchester analysis repository to annotate the Roslin genome. I noticed there were already Uniprot databases on mox
, so I didn’t download another one. I ran the following code based on some modifications Sam made in a script he ran previously:
# Exit script if any command fails
set -e
# Program paths
diamond=/gscratch/srlab/programs/diamond-2.0.4/diamond
# Run DIAMOND with blastx
# Output format 6 produces a standard BLAST tab-delimited file
${diamond} blastx \
--db /gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.dmnd \
--query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa \
--out 20210526-cgigas-roslin-blastx.outfmt6 \
--outfmt 6 \
--evalue 1e-4 \
--max-target-seqs 1 \
--block-size 15.0 \
--index-chunks 4
This took 20 minutes to run! When I counted the lines of the output, I only had 209 entries…This didn’t seem right to me, so I posted this issue. I also decided to try running DIAMOND blastx
using the just the somatic sequence and not the mitochondrial sequence. Since there was no methylation in the mitochondrial sequence to begin with, I thought there was no harm in just translating the original sequence in case the mitochondrial information I appended was causing the program to freak out. I got the Roslin genome sequence from NCBI and saved it here for future reference. I then modified my DIAMOND blastx
code to use the somatic FASTA as the query:
${diamond} blastx \
--db /gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.dmnd \
--query /gscratch/scrubbed/yaaminiv/data/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.fna \
--out 20210526-cgigas-roslin-blastx.outfmt6 \
--outfmt 6 \
--evalue 1e-4 \
--max-target-seqs 1 \
--block-size 15.0 \
--index-chunks 4
I realized too late that I didn’t modify the output file name! In any case, I moved the output to this gannet
location prior to the line count investigation. Once DIAMOND blastx
finished running, I renamed the output 20210601-cgigas-roslin-somatic-blastx.outfmt6
and saved both files in this gannet
folder. The new output also only had 208 lines. My guess is that there’s something in my database. I looked at the README file associated with it and saw the following:
Directory containing files for BLASTing against the Uniprot SwissProt database.
---
FILES
- uniprot_sprot.fasta: Uniprot SwissProt protein FastA file. Downloaded 20200123.
- uniprot_sprot.dmnd: DIAMOND Uniprot (Reviewed) BLAST database generated with the following command:
makedb --in uniprot_sprot.fasta --threads 27 --db uniprot_sprot --taxonmap ../ncbi-nr-20190925/prot.accession2taxid --taxonnodes ../ncbi-nr-20190925/nodes.dmp
Based on the --taxonmap
argument, my guess is that I shouldn’t be using this database for my needs! I decided to modify my DIAMOND blastx
code to create a database first, then proceed with blastx
:
# Create database from Uniprot-SwissProt information
# Use 27 threads to create database
# Save to srlab/yaamini/blastdbs
${diamond} makedb \
--in /gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.fasta \
--threads 27 \
-d /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.dmnd
# Run DIAMOND with blastx
# Output format 6 produces a standard BLAST tab-delimited file
${diamond} blastx \
--db /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.dmnd \
--query /gscratch/scrubbed/yaaminiv/data/cgigas_uk_roslin_v1_genomic-mito.fa \
--out 20210601-cgigas-roslin-mito-blastx.outfmt6 \
--outfmt 6 \
--evalue 1e-4 \
--max-target-seqs 1 \
--block-size 15.0 \
--index-chunks 4
I still only got 209 entries! So then I decided to go one step further back: downloading the Uniprot-SwissProt database from the website, then using that as input in DIAMOND blastx
. When I ran the code, I still only got 209 matches! Then I realized I may have been missing an important argument from my code to make the database. I tried this next:
# Create database from Uniprot-SwissProt information
# Use 27 threads to create database
# Save to srlab/yaamini/blastdbs
${diamond} makedb \
--in /gscratch/srlab/blastdbs/uniprot_sprot_20200123/uniprot_sprot.fasta \
-dbtype prot \
--threads 27 \
-d /gscratch/srlab/yaaminiv/blastdbs/20210601-uniprot-sprot.dmnd
That failed immediately because it’s not a valid argument for DIAMOND makedb
. Sam suggested I use -long-reads
, since DIAMOND blastx
is developed for shorter reads. This started a series of me trying --long-reads
along with different RAM usages, explicit --range-culling
, -F 15
and --top 10
, and changing --block-size
. None of those modifications worked! So this is when I give up.
Going forward
- Perform enrichment
- Conduct randomization test with
DSS
- Try EpiDiverse/snp for SNP extraction from WGBS data
- Run
methylKit
randomization test onmox
- Transfer scripts used to a nextflow workflow
- Update methods
- Update results
- Create figures