West Coast Green Crab Experiment Part 84
Transcriptome cleaning and annotation
Now that I have my transcriptome assembled and have assembly statistics, my next step in this workflow is to clean and annotate the transcriptome.
2026-02-02
According to Zac’s paper, I’ll first need to blastn the transcriptome and screen for contaminant taxa:
Contigs were then queried against the NCBI nt database with blastn using blast+ version 2.7.1 (Camacho et al., 2009) to screen for potential contaminant taxa including fungi, bacteria, platyhelminthes and nematoda. Contigs with alignments to contaminant taxa reference sequences with e-values of 1e-10 or lower were removed.
I modified Zac’s code, choosing to blastn the whole transcriptome as opposed to doing it in chunks since I don’t have a time limit:
# Blast transcriptome against the nt database (Dec 31 2018 snapshot) for initial round of contaminant filtering
mkdir ${OUTPUT_DIR}/blast-results
module load bio blast/2.7.1
blastn -query ${TRINITY_DIR}/trinity_out_dir/Trinity.fasta \
-db nt \
-evalue 1e-10 \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 35 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle staxids" \
-out ${OUTPUT_DIR}/blast-results/transcriptome-contam.tab
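Since the downstream scripts index into this table by column position, it helps to keep the column order from the -outfmt string straight. Here's a quick sketch for loading the table with named columns (the parsing helper and abbreviated example row are mine, not part of Zac's pipeline; standard library only):

```python
import csv
import io

# Column names in the same order as the -outfmt string in the blastn call above
BLAST_COLS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore",
              "stitle", "staxids"]

def read_blast_tab(handle):
    """Parse a tab-delimited outfmt 6 file into a list of dicts keyed by column name."""
    return [dict(zip(BLAST_COLS, row)) for row in csv.reader(handle, delimiter="\t")]

# Toy example with one hit (fields abbreviated from the real output)
example = ("TRINITY_DN11_c0_g1_i2\tgi|359551946|gb|JN869067.1|\t93.827\t891\t47\t8"
           "\t243\t1129\t1\t887\t0.0\t1334\tUncultured bacterium clone AN62\t77133\n")
hits = read_blast_tab(io.StringIO(example))
print(hits[0]["staxids"])  # the taxid column used later for contaminant mapping
```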
The script was pending after submission so I decided to look at the next steps of code I’d need to run:
rule id_contam:
    input:
        BLAST_CONTAM_MERGED,
        'scripts/map_contam_ids.py'
    output:
        BLAST_CONTAM_LIST
    conda:
        "envs/ete3.yaml"
    shell:
        """
        python {input[1]} {input[0]} {output}
        """
#remove contam contigs
rule remove_blast_contam:
    input:
        txm = TXM_LONG,
        contam_list = BLAST_CONTAM_LIST,
        script = {'scripts/fasta_subsetter.py'}
    output:
        txm_long = TXM_LONG_CLEAN
    shell:
        """
        python {input.script} {input.txm} {input.contam_list} REMOVE
        mv outputs/blast/contam_lis_REMOVE.fasta {output.txm_long} #'contam_lis' instead of 'contam_list' because .strip() bug in fasta_subsetter. ignore for now, fix later
        """
Alright, I’d need my transcriptome, a list of contaminant IDs, and a script to subset the FASTA into a clean transcriptome. I found fasta_subsetter.py in Zac’s Github repository; map_contam_ids.py is what generates the contaminant list from the blastn results. The contaminant IDs appear to be manually curated, so it will take some time to poke through the blastn results and verify the IDs.
It’s also probably worth creating a new script and subdirectory (06d-blast) for this step to avoid having the world’s longest script. So, I did that! The new script can be found here.
2026-02-03
…and my code errored out almost immediately because I didn’t give SLURM a path to a working directory that existed. I fixed that and confirmed blastn is running:
(base) [yaamini.venkataraman@poseidon-l1 ~]$ head /vortexfs1/scratch/yaamini.venkataraman/wc-green-crab/output/06d-blast/blast-results/transcriptome-contam.tab
TRINITY_DN11_c0_g1_i15 gi|1379041273|gb|CP028800.1| 95.569 1354 54 6 1 1352 2593208 2591859 0.0 2163 Acinetobacter junii strain WCHAJ59 chromosome, complete genome 40215
TRINITY_DN11_c0_g1_i2 gi|359551946|gb|JN869067.1| 93.827 891 47 8 243 1129 1 887 0.0 1334 Uncultured bacterium clone AN62 16S ribosomal RNA gene, partial sequence 77133
TRINITY_DN11_c0_g1_i21 gi|1445136029|ref|NR_158069.1| 92.111 1369 92 14 297 1659 1 1359 0.0 1916 Nocardioides agrisoli strain djl-8 16S ribosomal RNA, partial sequence 1882242
TRINITY_DN11_c0_g1_i24 gi|404428310|gb|JQ516465.1| 94.310 580 33 0 2 581 465 1044 0.0 889 Uncultured Rhizobiales bacterium clone 0907_Mf_HT2_B11 16S ribosomal RNA gene, partial sequence 208549
TRINITY_DN11_c0_g1_i25 gi|684780902|gb|KJ995964.1| 88.112 1388 151 14 297 1677 3 1383 0.0 1637 Uncultured bacterium clone C-147 16S ribosomal RNA gene, partial sequence 77133
TRINITY_DN11_c0_g1_i26 gi|1233268392|gb|KX771411.1| 99.163 239 2 0 1 239 903 1141 5.39e-117 431 Uncultured bacterium clone 01G_N5Kili 16S ribosomal RNA gene, partial sequence 77133
TRINITY_DN11_c0_g1_i28 gi|1379041273|gb|CP028800.1| 90.559 2288 174 23 533 2787 2594137 2591859 0.0 2990 Acinetobacter junii strain WCHAJ59 chromosome, complete genome 40215
TRINITY_DN11_c0_g1_i3 gi|359551675|gb|JN868796.1| 87.857 1367 125 23 297 1659 1 1330 0.0 1567 Uncultured bacterium clone MW25 16S ribosomal RNA gene, partial sequence 77133
TRINITY_DN11_c0_g1_i31 gi|459218310|gb|KC442851.1| 88.408 1087 86 20 297 1378 1 1052 0.0 1273 Uncultured bacterium clone A40 16S ribosomal RNA gene, partial sequence 77133
TRINITY_DN11_c0_g1_i35 gi|643391102|gb|KF070860.1| 88.406 1242 111 28 317 1543 1 1224 0.0 1465 Uncultured bacterium clone ncd135b03c1 16S ribosomal RNA gene, partial sequence 77133
(base) [yaamini.venkataraman@poseidon-l1 ~]$ wc -l /vortexfs1/scratch/yaamini.venkataraman/wc-green-crab/output/06d-blast/blast-results/transcriptome-contam.tab
7834 /vortexfs1/scratch/yaamini.venkataraman/wc-green-crab/output/06d-blast/blast-results/transcriptome-contam.tab
2026-02-06
My blastn job finished! There are 115,204 results. The next step is to use the python scripts to clean the transcriptome. I decided to just…run the scripts as is this first time around to see what would happen. I made copies of both python scripts. When I was looking at Zac’s Snakefile, I noticed that there was a specific conda environment he used to run the first python script. I needed to figure out how to replicate that conda environment before I could run the script.
A quick search provided this Stack Overflow link with the following solution:
conda env create --file=myfile.yaml
2026-02-11
Picking up where I left off last week!
I used wget to download the YAML file onto poseidon, then ran conda env create --file=ete3.yaml.
I got this error as it was solving the environment:
(base) [yaamini.venkataraman@poseidon-l1 ~]$ conda env create -f ete3.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
  - openssl==1.1.1=h7b6447c_0
I modified the YAML file to remove the build string “=h7b6447c_0” from the openssl entry, since the build hash may be specific to the system Zac built the file on (according to Gemini). Removing the build string from that one package was enough for the environment to solve! I then started running the first python script.
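As an aside, build strings like this could be stripped programmatically instead of by hand. A hypothetical helper (simple string handling on the dependency pins, not a full YAML parser, and not part of my actual workflow):

```python
import re

def strip_build_string(dep):
    """Turn a conda pin like 'name==version=build' (or 'name=version=build')
    into 'name==version', leaving pins without a build string untouched."""
    m = re.match(r"^([A-Za-z0-9_.-]+=+[^=]+)=[^=]+$", dep)
    return m.group(1) if m else dep

print(strip_build_string("openssl==1.1.1=h7b6447c_0"))  # openssl==1.1.1
print(strip_build_string("numpy==1.16.4"))              # unchanged
```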
Uploading to /vortexfs1/home/yaamini.venkataraman/.etetoolkit/taxa.sqlite
Traceback (most recent call last):
  File "/vortexfs1/home/yaamini.venkataraman/06d-map_contam_ids.py", line 4, in <module>
    ncbi = NCBITaxa()
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 110, in __init__
    self.update_taxonomy_database(taxdump_file)
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 129, in update_taxonomy_database
    update_db(self.dbfile)
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 760, in update_db
    upload_data(dbfile)
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 802, in upload_data
    db.execute("INSERT INTO synonym (taxid, spname) VALUES (?, ?);", (taxid, spname))
sqlite3.IntegrityError: UNIQUE constraint failed: synonym.spname, synonym.taxid
I used Gemini to help me troubleshoot. It seems the code errors out when it tries to update the local NCBI taxonomy database. Gemini suggested manually updating the database before running the script to avoid any timeout issues on the cluster:
python -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database()"
When I ran this code, I still got the error! It seems like the ete3 version I’m using cannot handle the current format of the NCBI taxonomy database without triggering a “Duplicate Entry” error. I need to first fix that issue:
sed -i 's/INSERT INTO synonym (taxid, spname) VALUES (?, ?);/INSERT OR IGNORE INTO synonym (taxid, spname) VALUES (?, ?);/' /vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py
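This patch works because SQLite’s INSERT OR IGNORE silently skips rows that would violate a UNIQUE constraint instead of raising an IntegrityError. A self-contained demonstration, using a toy table of my own that mimics ete3’s synonym schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE synonym (taxid INT, spname VARCHAR(50), UNIQUE(spname, taxid))")

# The NCBI dump now contains duplicate (taxid, name) pairs like this
rows = [(77133, "uncultured bacterium"), (77133, "uncultured bacterium")]

# Plain INSERT raises on the duplicate row, just like the unpatched ete3 code
try:
    for r in rows:
        db.execute("INSERT INTO synonym (taxid, spname) VALUES (?, ?);", r)
except sqlite3.IntegrityError as e:
    print("plain INSERT failed:", e)

# INSERT OR IGNORE drops the duplicate silently, which is what the sed patch enables
for r in rows:
    db.execute("INSERT OR IGNORE INTO synonym (taxid, spname) VALUES (?, ?);", r)
print(db.execute("SELECT COUNT(*) FROM synonym").fetchone()[0])  # one row kept
```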
Once I ran the sed command, I reran the manual python script update. It finished without error:
(ete3) [yaamini.venkataraman@poseidon-l1 ~]$ python -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database()"
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)…
Done. Parsing…
Loading node names…
2723150 names loaded.
428355 synonyms loaded.
Loading nodes…
2723150 nodes loaded.
Linking nodes…
Tree is loaded.
Updating database: /vortexfs1/home/yaamini.venkataraman/.etetoolkit/taxa.sqlite …
2723000 generating entries…
Uploading to /vortexfs1/home/yaamini.venkataraman/.etetoolkit/taxa.sqlite
Inserting synonyms:      425000
Inserting taxid merges:  95000
Inserting taxids:        2720000
Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)…
Done. Parsing…
Loading node names…
2723150 names loaded.
428355 synonyms loaded.
Loading nodes…
2723150 nodes loaded.
Linking nodes…
Tree is loaded.
Updating database: /vortexfs1/home/yaamini.venkataraman/.etetoolkit/taxa.sqlite …
2723000 generating entries…
Uploading to /vortexfs1/home/yaamini.venkataraman/.etetoolkit/taxa.sqlite
Inserting synonyms:      425000
Inserting taxid merges:  95000
Inserting taxids:        2720000
I then reran the bash script that runs the python script! I now got a new error:
Traceback (most recent call last):
  File "/vortexfs1/home/yaamini.venkataraman/06d-map_contam_ids.py", line 17, in <module>
    lineage = ncbi.get_lineage(taxid)
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 238, in get_lineage
    raise ValueError("%s taxid not found" %taxid)
ValueError: 1859514 taxid not found
Gemini suggested adding this to the python script:
contam = []
for i in hits.index:
    contig = hits.loc[i,0]
    taxid = hits.loc[i,13]
    try:
        lineage = ncbi.get_lineage(taxid)
    except ValueError:
        print(f"Warning: TaxID {taxid} not found in local database. Skipping.")
        lineage = []  # Or handle as "Unknown"
    if bool(set(lineage) & set(candidate_contam)):
        contam.append(contig)
Once again, I revised my script and now it’s running.
2026-02-23
I got another error and my script failed, but I didn’t have time until today to actually debug! Here’s the error:
Warning: TaxID 1859514 not found in local database. Skipping.
Warning: TaxID 6811 not found in local database. Skipping.
Warning: TaxID 288656 not found in local database. Skipping.
Traceback (most recent call last):
  File "/vortexfs1/home/yaamini.venkataraman/06d-map_contam_ids.py", line 18, in <module>
    lineage = ncbi.get_lineage(taxid)
  File "/vortexfs1/home/yaamini.venkataraman/.conda/envs/ete3/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 227, in get_lineage
    result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
sqlite3.Warning: You can only execute one statement at a time.
My understanding of python is so limited that I had to use Gemini to troubleshoot. The debugged code it suggested:
contam = []
for i in hits.index:
    contig = hits.loc[i, 0]
    taxid_raw = hits.loc[i, 13]
    try:
        # 1. Ensure taxid is a clean integer (removes decimals/whitespace)
        taxid = int(float(taxid_raw))
        # 2. Attempt to get lineage
        lineage = ncbi.get_lineage(taxid)
    except (ValueError, KeyError, TypeError):
        # This handles the "taxid not found" or bad data types
        print(f"Warning: TaxID {taxid_raw} not found in local database. Skipping.")
        continue  # <--- This stops the current loop iteration and moves to the next row
    # 3. Intersection check (only runs if lineage was successfully retrieved)
    if set(lineage).intersection(candidate_contam):
        contam.append(contig)
The reasoning behind it:
Key Changes Made: Added continue: Instead of setting lineage = [] and letting the script keep running, continue tells Python to abandon that specific contig immediately and start the next one. This prevents subsequent logic from tripping over empty variables.
int(float(taxid_raw)): Pandas often reads numbers as floats (e.g., 1859514.0), and blast’s staxids column can also hold several taxids separated by semicolons (e.g., 77133;12345); interpolating a value like that into ete3’s SQL string yields two statements, which is exactly the “one statement at a time” warning. Converting to a clean integer first, and skipping rows where the conversion fails, sidesteps both cases.
set().intersection(): While bool(set() & set()) works, intersection() is slightly more readable for this context.
Why this stops the SQLite error: rows whose taxid field can’t be converted to a single clean integer now fail inside the try block and are skipped with continue, so the malformed value never reaches ete3’s SQL query at all.
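If I ever want to keep multi-taxid hits instead of skipping them, the staxids field could be split first. A small helper along those lines (hypothetical, not in my script; per the BLAST+ docs, staxids holds subject taxonomy IDs separated by ';'):

```python
def parse_taxids(staxids):
    """Split a blast staxids field ('77133' or '77133;12345', sometimes
    float-formatted by pandas) into clean integer taxids.
    Unparseable parts are silently skipped."""
    taxids = []
    for part in str(staxids).split(";"):
        try:
            taxids.append(int(float(part.strip())))
        except (ValueError, TypeError):
            continue
    return taxids

print(parse_taxids("77133;12345"))  # [77133, 12345]
print(parse_taxids("1859514.0"))    # [1859514]
```

Each returned taxid could then be passed to ncbi.get_lineage() individually, with the same try/except guard as above.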
My script finished running! There are three output files:
(base) [yaamini.venkataraman@poseidon-l2 06d-blast]$ head merged.tab
12	74109
30	29
36	184914
37	42
46	39
57	3409587
67	32033
76	155892
77	74311
79	74313
(base) [yaamini.venkataraman@poseidon-l2 06d-blast]$ head taxa.tab
1	root	no rank	1
10239	1	Viruses	acellular root	10239,1
131567	1	cellular organisms	cellular root	131567,1
2787823	1	unclassified entries	no rank	2787823,1
2787854	1	other entries	no rank	2787854,1
12333	10239	unclassified bacterial viruses	no rank	12333,10239,1
12429	10239	unclassified viruses	no rank	12429,10239,1
186616	10239	environmental samples	no rank	186616,10239,1
2559587	10239	Riboviria	RNA viruses and retroviruses	realm	2559587,10239,1
2731341	10239	Duplodnaviria	double-stranded DNA viruses	realm	2731341,10239,1
(base) [yaamini.venkataraman@poseidon-l2 06d-blast]$ head contam_list.txt
TRINITY_DN11_c0_g1_i15
TRINITY_DN11_c0_g1_i2
TRINITY_DN11_c0_g1_i21
TRINITY_DN11_c0_g1_i24
TRINITY_DN11_c0_g1_i25
TRINITY_DN11_c0_g1_i26
TRINITY_DN11_c0_g1_i28
TRINITY_DN11_c0_g1_i3
TRINITY_DN11_c0_g1_i31
TRINITY_DN11_c0_g1_i35
The most important one is contam_list.txt, which lists the contaminant sequence IDs. There are ~55,000 contaminant sequences, which is pretty small potatoes compared to the 900k transcripts that I have. Now that I’ve identified the contaminant sequences, I can use the subsetting script to remove them. Zac’s code looks like this:
#remove contam contigs
rule remove_blast_contam:
    input:
        txm = TXM_LONG,
        contam_list = BLAST_CONTAM_LIST,
        script = {'scripts/fasta_subsetter.py'}
    output:
        txm_long = TXM_LONG_CLEAN
    shell:
        """
        python {input.script} {input.txm} {input.contam_list} REMOVE
        mv outputs/blast/contam_lis_REMOVE.fasta {output.txm_long} #'contam_lis' instead of 'contam_list' because .strip() bug in fasta_subsetter. ignore for now, fix later
        """
My input is the contam_list.txt file, and the txm file is the transcriptome. I added the following to my blast script:
# Remove contaminated sequences from the transcriptome
python ${HOME_DIR}/06d-fasta_subsetter.py ${TRINITY_DIR}/trinity_out_dir/Trinity.fasta ${OUTPUT_DIR}/contam_list.txt REMOVE
mv ${OUTPUT_DIR}/contam_lis_REMOVE.fasta ${OUTPUT_DIR}/transcriptome_contamRemoved.fasta
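For reference, the core of what the subsetter's REMOVE mode does can be sketched in a few lines: stream the FASTA and drop every record whose ID is on the contaminant list. This is my own minimal version, not Zac's script, and it assumes Trinity-style ">ID len=…" headers where the ID is the first whitespace-delimited token:

```python
def remove_contigs(fasta_lines, contam_ids):
    """Yield FASTA lines, skipping records whose ID (first header token)
    appears in the contam_ids set."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            contig_id = line[1:].split()[0]
            keep = contig_id not in contam_ids
        if keep:
            yield line

# Tiny worked example: one contaminant record, one clean record
fasta = [">TRINITY_DN11_c0_g1_i2 len=891\n", "ACGT\n",
         ">TRINITY_DN9999_c0_g1_i1 len=100\n", "TTTT\n"]
clean = list(remove_contigs(fasta, {"TRINITY_DN11_c0_g1_i2"}))
print(clean)  # only the DN9999 record survives
```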
I saved the script and started a new job. Thankfully, this one didn’t require any troubleshooting! My first round of cleaning is officially complete, and my resultant file is transcriptome_contamRemoved.fasta. My next step is to annotate the transcriptome with EnTap, which will end with an additional round of transcriptome cleaning.
Going forward
- Annotate transcriptome with EnTap
- Quantify transcripts with salmon
- Identify differentially expressed genes
- Additional strand-specific analysis in the supergene region
- Examine HOBO data from 2023 experiment
- Demographic data analysis for 2023 paper