West Coast Green Crab Experiment Part 74

RNA-Seq analysis plan

I need to figure out the best workflow for analyzing the RNA-Seq data! We don’t have a green crab transcriptome, so I’ll need to do de novo transcriptome assembly, which I am less familiar with. I’m going to collect some workflows and look at best practices before making a plan.

Research

Notes from Raghavan et al. (2022):

prominent graph-based assembler is Trinity -> do I have a Trinity script somewhere?
- try Trinity with multiple k-mer lengths on a subset of data
parameters of sequence assembly quality
- sequence length and fragmentation (calculate through SeqKit)
- read support: fraction of reads that map back to the assembly
- low proportion of reads mapping to multiple sequences
- assess strand-specificity of assembly for strand-specific sequencing
- ExN50 (included with the Trinity assembler)
- check composition: a good assembly should have recovered a large fraction of the sequenced transcriptome, assessed through BUSCO ortholog tests
- TransRate incorporates many of these metrics into one program and checks for chimeric sequences
- Bellerophon pipeline incorporates TransRate with CD-HIT and BUSCO
alignment
- two-step: Bowtie2 or STAR to align, then output fed through a tool to get abundance estimates
- pseudoaligner: kallisto and salmon -> recommended due to speed and comparably high accuracy
assembly thinning and redundancy reduction
- de novo transcriptome assembly produces more sequences than expected for the number of genes
- most aligners classify transcripts into isoform groups
- can remove transcripts below a certain threshold (TPM < 1)
- clustering approach is more rigorous
- assembly thinning should only be performed if necessary -> probably best to skip it, except for removing low-expression transcripts
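
To make the "sequence length and fragmentation" metrics concrete for myself, here is a minimal sketch of how contig count, mean length, and N50 could be calculated from an assembly FASTA. This is plain N50, not the expression-weighted ExN50 that Trinity reports, and it is not a replacement for SeqKit or TransRate. The function names are my own, and the toy sequences are made up for illustration.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token of the header
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Basic length statistics for a list of contig lengths."""
    return {
        "n_contigs": len(lengths),
        "total_bases": sum(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy FASTA (made-up sequences, illustration only)
    toy = [">contig1 example", "ACGTACGTAC", ">contig2", "ACGTAC", "GT", ">contig3", "ACGT"]
    print(summarize(list(read_fasta_lengths(toy).values())))
```

For a real assembly, the same functions could be fed an open file handle instead of the toy list.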

Notes from Tobias et al. (2021):

This is the most recent paper from the Tepolt Lab that includes transcriptome assembly information.

de novo assembly of R. harrisii (mud crab) with Trinity
select transcriptome based on ExN50 metrics generated by Trinity utility scripts
- longest isoform per gene retained
- contigs less than 200 bp discarded
contigs queried against NCBI nt database (blastn) to remove any contaminants (e-values ≤ 1e-10)
transcript quantification with salmon within Trinity, then differential expression analysis with DESeq2
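
The two filtering steps above (keep the longest isoform per gene, discard contigs under 200 bp) can be sketched roughly as follows. This is my own illustration, not the code from the paper. It assumes Trinity-style transcript IDs such as TRINITY_DN1000_c115_g5_i1, where everything before the trailing `_i<number>` identifies the gene. The function names and toy sequences are hypothetical.

```python
import re
from typing import Dict

# Assumption: Trinity-style IDs, where the trailing "_i<number>" marks the isoform
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id: str) -> str:
    """Strip the isoform suffix to get the gene-level ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_gene(seqs: Dict[str, str], min_length: int = 200) -> Dict[str, str]:
    """Drop contigs shorter than min_length, then keep the longest isoform of each gene."""
    best: Dict[str, str] = {}  # gene ID -> transcript ID of the longest isoform seen so far
    for tid, seq in seqs.items():
        if len(seq) < min_length:
            continue
        gene = gene_id(tid)
        if gene not in best or len(seq) > len(seqs[best[gene]]):
            best[gene] = tid
    return {tid: seqs[tid] for tid in best.values()}


if __name__ == "__main__":
    # Toy sequences (made up) showing both filters
    toy = {
        "TRINITY_DN1_c0_g1_i1": "A" * 250,
        "TRINITY_DN1_c0_g1_i2": "A" * 300,  # longest isoform of this gene, kept
        "TRINITY_DN2_c0_g1_i1": "A" * 150,  # under 200 bp, discarded
    }
    print(sorted(longest_isoform_per_gene(toy)))
```

If two isoforms of a gene tie on length, this sketch keeps whichever appears first in the input.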

Based on this reading, I think testing out Trinity assembly parameters is a good first step. Grace shared a Trinity script with me that I can use as a reference. There are GitHub repositories for Zac’s paper and a Hemigrapsus transcriptome assembly. Both of these workflows use Snakemake to run the Trinity transcriptome assembly, then clean the transcripts with EnTAP and blastn. EnTAP is used to identify mitochondrial sequences, while blastn is used to remove any contaminant sequences. Snakemake is a workflow management system that is good for reproducibility, and there seem to be a handful of tutorials available in the Snakemake user guide.

The plan

Trinity for transcriptome assembly
- Read Trinity documentation to identify strand-specific parameters and other options that can be used to test multiple transcriptome assembly specifications
- Try using Snakemake on the WHOI cluster to run Trinity
- If that doesn’t work, run Trinity directly without Snakemake
Clean transcriptome by removing mitochondrial sequences with EnTAP and contaminant sequences with blastn
Quantify transcripts with salmon
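
As a starting point for testing multiple assembly specifications, here is a rough sketch that only builds Trinity command lines for a small grid of settings. It does not run anything. I am assuming paired-end reads and the Trinity options `--SS_lib_type` and `--min_kmer_cov`, which I still need to confirm against the Trinity documentation. The file names, the RF library type, the CPU count, and the memory value are all placeholders.

```python
import shlex
from itertools import product
from typing import List, Optional


def trinity_command(left: str, right: str, out_dir: str,
                    ss_lib_type: Optional[str] = None,
                    min_kmer_cov: int = 1,
                    cpu: int = 8, max_memory: str = "50G") -> List[str]:
    """Build (but do not run) a paired-end Trinity command as an argument list."""
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--CPU", str(cpu),                   # placeholder resource values
        "--max_memory", max_memory,
        "--min_kmer_cov", str(min_kmer_cov),
        "--output", out_dir,
    ]
    if ss_lib_type is not None:
        # Only added for strand-specific runs; RF vs. FR depends on the library prep
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


def build_grid(left: str, right: str) -> List[List[str]]:
    """One command per combination of strandedness and minimum k-mer coverage."""
    commands = []
    for ss, cov in product([None, "RF"], [1, 2]):
        label = ss if ss is not None else "unstranded"
        # Output directory names include "trinity" to keep runs easy to identify
        out_dir = f"trinity_{label}_cov{cov}"
        commands.append(trinity_command(left, right, out_dir, ss, cov))
    return commands


if __name__ == "__main__":
    # Hypothetical file names for a subset of the reads
    for cmd in build_grid("subset_R1.fastq.gz", "subset_R2.fastq.gz"):
        print(shlex.join(cmd))
```

The printed commands could be pasted into a cluster job script, or the same settings could become wildcards in a Snakemake rule if I get that working.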

Going forward

Experiment with different methods for the temperature x genotype question
PLS-DA for temperature x time question
Transcriptome assembly with Trinity
Clean transcriptome with EnTAP and blastn
Quantify transcript counts with salmon
Identify differentially expressed genes
Additional strand-specific analysis in the supergene region
Examine HOBO data from 2023 experiment
Demographic data analysis for 2023 paper
Start methods and results of 2023 paper

Written on November 18, 2025