West Coast Green Crab Experiment Part 74
RNA-Seq analysis plan
I need to figure out the best workflow for analyzing the RNA-Seq data! We don’t have a green crab transcriptome, so I’ll need to do de novo transcriptome assembly, which I am less familiar with. I’m going to collect some workflows and look at best practices before making a plan.
Research
Notes from Raghavan et al. (2022):
- prominent graph-based assembler is Trinity -> do I have a Trinity script somewhere?
- try Trinity and multiple k-mer lengths on a subset of data
- parameters of sequence assembly quality
  - sequence length and fragmentation (calculate with SeqKit)
  - read support: fraction of reads that map back to the assembly
  - low proportion of reads mapping to multiple sequences
  - assess strand-specificity of assembly for strand-specific sequencing
  - ExN50 calculation is included with the Trinity assembler
  - check composition: BUSCO ortholog tests show whether the assembly recovered a large fraction of the sequenced transcriptome
  - TransRate incorporates a lot of these metrics into one program and checks for chimeric sequences
  - Bellerophon pipeline combines TransRate with CD-HIT and BUSCO
- alignment
  - two-step: Bowtie2 or STAR to align, then alignments are fed through another tool to get abundance estimates
  - pseudoaligners: kallisto and salmon -> recommended due to speed and comparably high accuracy
- assembly thinning and redundancy reduction
  - de novo transcriptome assembly produces more sequences than expected for the number of genes
  - most aligners classify transcripts into isoform groups
  - can remove transcripts below a certain expression threshold (TPM < 1)
  - clustering approach is more rigorous
  - assembly thinning should only be performed if necessary -> likely best to skip it unless removing low-expression transcripts
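To make the ExN50 idea concrete for myself, here is a minimal sketch of N50 and ExN50 in plain Python. It assumes a simplified transcript-level definition (N50 of the most highly expressed transcripts that together account for x% of total expression); the function names and the toy lengths and TPM values are made up for illustration, and real data should go through Trinity's own scripts.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def exn50(transcripts, x):
    """N50 of the top-expressed transcripts covering x percent of total expression.

    transcripts: list of (length_bp, expression) tuples, e.g. lengths with TPM.
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = sum(expr for _, expr in ranked) * x / 100
    running = 0.0
    kept = []
    for length, expr in ranked:
        if running >= target:
            break
        kept.append(length)
        running += expr
    return n50(kept)


# Made-up toy assembly: (length in bp, TPM). One long contig is barely expressed.
toy = [(1200, 500.0), (2000, 300.0), (1500, 150.0), (5000, 40.0), (250, 10.0)]

print(n50([length for length, _ in toy]))  # 5000
print(exn50(toy, 90))                      # 1500
```

In this toy case the single long, lowly expressed contig inflates the plain N50, while the E90N50 reflects the transcripts that carry most of the expression.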
Notes from Tobias et al. (2021):
This is the most recent paper from the Tepolt Lab that includes transcriptome assembly information.
- de novo assembly of R. harrisii (mud crab) with Trinity
- select transcriptome based on ExN50 metrics generated by Trinity utility scripts
- longest isoform per gene retained
- contigs less than 200 bp discarded
- contigs queried against the NCBI nt database (blastn), and contaminant hits (e-values ≤ 1e-10) removed
- transcript quantification with salmon within Trinity, then differential expression analysis with DESeq2
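The two filtering steps above (longest isoform per gene, contigs under 200 bp discarded) can be sketched in a few lines of Python. This is only an illustration, not the code from the paper: it assumes Trinity-style sequence IDs such as TRINITY_DN1000_c115_g5_i1, where the part before the trailing _i<number> is the gene, and the helper names and toy sequences are hypothetical.

```python
import re

# Assumption: Trinity-style IDs, where stripping the trailing _i<number>
# leaves the gene identifier.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID, dropping any description after whitespace.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(records, min_length=200):
    """Keep the longest isoform of each gene, skipping contigs under min_length."""
    best = {}
    for name, seq in records:
        if len(seq) < min_length:
            continue
        gene = ISOFORM_SUFFIX.sub("", name)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best


# Made-up toy input: two isoforms of one gene, plus one contig under 200 bp.
toy_fasta = [
    ">TRINITY_DN1_c0_g1_i1 len=250",
    "A" * 250,
    ">TRINITY_DN1_c0_g1_i2 len=400",
    "C" * 400,
    ">TRINITY_DN2_c0_g1_i1 len=150",
    "G" * 150,
]

kept = longest_isoform_per_gene(read_fasta(toy_fasta))
for gene, (name, seq) in kept.items():
    print(gene, name, len(seq))
```

On the toy input only the 400 bp isoform of the first gene survives; the 150 bp contig is dropped by the length filter.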
Based on this reading, I think testing out Trinity assembly parameters is a good first step. Grace shared a Trinity script with me that I can use as a reference. There are also GitHub repositories for Zac’s paper and a Hemigrapsus transcriptome assembly. Both workflows use Snakemake to run Trinity transcriptome assembly, then clean the transcripts with EnTAP and blastn: EnTAP identifies mitochondrial sequences, while blastn removes contaminant sequences. Snakemake is a workflow management system that is good for reproducibility, and there seem to be a handful of tutorials available in its user guide.
The plan
- Trinity for transcriptome assembly
  - Read Trinity documentation to identify strand-specific parameters and other options that can be used to test multiple transcriptome assembly specifications
  - Try using Snakemake on the WHOI cluster to run Trinity
  - If that doesn’t work, just run Trinity normally
- Clean transcriptome by removing mitochondrial sequences with EnTAP and contaminant sequences with blastn
- Quantify transcripts with salmon
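As a starting point for testing assembly specifications, here is a rough sketch that builds Trinity command lines for a few strand-specificity settings. It only prints the commands and does not run anything. The read file names, output directory names, CPU count, and memory value are placeholders I made up, and which --SS_lib_type value (if any) applies depends on the library prep, so that still needs to be confirmed against the Trinity documentation.

```python
import shlex


def trinity_command(left, right, out_dir, ss_lib_type=None, cpu=8, max_memory="50G"):
    """Build a paired-end Trinity command as an argument list."""
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", out_dir,
    ]
    # Only add the strand-specific flag when a library type is given.
    if ss_lib_type is not None:
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


# Hypothetical subset files; one command per strand-specificity setting to test.
commands = {}
for label, ss in [("unstranded", None), ("RF", "RF"), ("FR", "FR")]:
    commands[label] = trinity_command(
        "subset_R1.fastq.gz",
        "subset_R2.fastq.gz",
        f"trinity_out_{label}",
        ss_lib_type=ss,
    )
    print(shlex.join(commands[label]))
```

Keeping each command as a list means it could later be handed to subprocess.run or dropped into a Snakemake rule without re-quoting.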
Going forward
- Experiment with different methods for the temperature x genotype question
- PLS-DA for temperature x time question
- Transcriptome assembly with Trinity
- Clean transcriptome with EnTAP and blastn
- Quantify transcript counts with salmon
- Identify differentially expressed genes
- Additional strand-specific analysis in the supergene region
- Examine HOBO data from 2023 experiment
- Demographic data analysis for 2023 paper
- Start methods and results of 2023 paper