Gigas and Virginica Comparison Part 3
C. gigas vs. C. virginica: DML
Now the good stuff: looking at DML in response to ocean acidification in Crassostrea spp. gonad tissue!
DML quantity and locations
When I was first analyzing my C. gigas DML, I was pretty sure the numbers weren’t identical. I decided to look at the number of DML and DML-genome feature track overlaps.
Table 1. Comparison of DML number and overlaps with various genome feature files. C. gigas DML were created with 10x data and 2 pooled samples, and C. virginica DML were created with 5x data and 10 individual samples.
Feature | **C. virginica** | **C. gigas** |
---|---|---|
Quantity | 598 | 628 |
Overlaps with Exons | 368 (61.5%) | 157 (25.0%) |
Overlaps with Introns | 192 (32.1%) | 285 (45.4%) |
Overlaps with Genes | 560 (93.6%) | 442 (70.4%) |
Overlaps with Transposable Elements | 57 (9.5%) | 8 (1.3%) |
Overlaps with Putative Promoters | 42 (7.0%) | 24 (3.8%) |
Overlaps with Other Regions | 21 (3.5%) | 165 (26.3%) |
At first glance, there aren’t that many more C. gigas DML than there are C. virginica DML. However, they’re distributed differently! In C. gigas, only 70.4% of DML are in genes as opposed to 93.6% in C. virginica. Within genes, C. gigas DML are more heavily weighted towards introns, while C. virginica DML are primarily in exons. In this R Markdown script, I conducted a proportion test to see if the DML distributions truly were different between species. Based on a chi-squared test, the distributions of DML across genomic features are significantly different between species (χ2 = 38.516, df = 4, P-value = 8.767e-08).
Figure 1. Distribution of C. gigas and C. virginica DML in various genome features.
Comparing GOSlim functions of common genes with DML
Even though there are differences in where DML are located in their respective genomes, there may be some genes in both species that have DML in response to ocean acidification. The first thing I needed to do was identify if there were any overlaps. I decided to try matching the annotated gene products associated with each DML list in this Jupyter notebook. Late-night work delirium coupled with caffeine led me to believe the best way to do this would be to look for partial matches in the gene product columns. I tried some `awk` code I found online, but I couldn’t get it to work consistently. Depending on which list I used as my base, I got two different numbers of overlapping gene products! Annoyed, I posted this issue. Sam rightly pointed out that the best way to go about this would be to look at Uniprot Accession instead since they have one-to-one matches with proteins. While I briefly toyed with the idea of trying to use `diff`, I went back to Old Faithful, `join`:
#Join the 1st column in the first file with the 1st column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
2019-06-20-Cv-DML-Gene-Annotation-sorted-reduced.tab \
DML-Gene-annot-reduced.tab \
> 2020-02-11-Common-DML-Cv-Cg.txt
#Check head of output
!head 2020-02-11-Common-DML-Cv-Cg.txt
#Count number of overlaps
!wc -l 2020-02-11-Common-DML-Cv-Cg.txt
The output, found here, has Uniprot Accession, Genbank ID, and CGI ID. The inputs were reduced versions of the annotated C. virginica and annotated C. gigas DML lists that I made with `awk`. I created an expanded version of this common DML list here that has product annotations from both DML sets. Even though I figured the products would be identical since they are associated with Uniprot Accession, I still wanted one column from each dataset.
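One thing worth keeping in mind with `join`: it expects both inputs to be sorted on the join field with the same collation, otherwise matches can be silently dropped. Below is a minimal sketch of the same kind of join on toy data. The accessions, gene names, and file names are all made up for illustration and are not the real annotation files. It uses `printf '\t'` instead of `$'\t'` so it also runs in plain `sh`.

```shell
# Toy example: all accessions, gene names, and file names here are hypothetical
set -eu
tmp=$(mktemp -d)
tab=$(printf '\t')

# Two unsorted, tab-delimited tables keyed on an accession in column 1
printf 'Q00003\tcv_gene_C\nQ00001\tcv_gene_A\nQ00002\tcv_gene_B\n' > "$tmp/cv.tab"
printf 'Q00002\tcg_gene_X\nQ00004\tcg_gene_Y\nQ00001\tcg_gene_Z\n' > "$tmp/cg.tab"

# Sort both inputs on column 1 with the same locale that join will use
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/cv.tab" > "$tmp/cv.sorted.tab"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/cg.tab" > "$tmp/cg.sorted.tab"

# Join column 1 of the first file with column 1 of the second file
LC_ALL=C join -1 1 -2 1 -t "$tab" \
  "$tmp/cv.sorted.tab" "$tmp/cg.sorted.tab" > "$tmp/common.txt"

# Check the first line of the output and count the overlaps
first_line=$(head -n 1 "$tmp/common.txt")
n_common=$(wc -l < "$tmp/common.txt" | tr -d ' ')
echo "$first_line"
echo "$n_common"

rm -rf "$tmp"
```

Because both files are sorted the same way, the result does not depend on which list is treated as the "base" file. Only the column order of the output changes.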
Once I had the list of overlapping genes with DML, I eliminated duplicate entries, going from 117 to 81 overlaps. I then used CGI ID to match these entries with GOSlim terms. I extracted unique biological process entries that did not include “other biological processes” distinctions and saved the output here. In total I had 53 unique GOSlim terms representing 81 common genes with DML between C. gigas and C. virginica! I don’t know what I was expecting, but that really doesn’t seem like much. Interesting.
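The steps in the paragraph above (drop duplicate overlaps, match on CGI ID, keep unique biological process terms) could look something like the sketch below. This is not the exact code from the notebook: the two-column layouts, the IDs, and the file names are assumptions for illustration, and only the "other biological processes" label comes from the real data.

```shell
# Toy example: IDs, column layouts, and file names are hypothetical
set -eu
tmp=$(mktemp -d)
tab=$(printf '\t')

# Overlap list (accession, CGI ID) with duplicate rows
printf 'Q00001\tCGI_001\nQ00001\tCGI_001\nQ00002\tCGI_002\nQ00003\tCGI_003\nQ00003\tCGI_003\n' \
  > "$tmp/overlaps.tab"

# GOSlim table (CGI ID, biological process term)
printf 'CGI_001\ttransport\nCGI_001\tother biological processes\nCGI_002\ttransport\nCGI_003\tstress response\nCGI_003\tother biological processes\n' \
  > "$tmp/goslim.tab"

# 1. Eliminate duplicate overlap entries
LC_ALL=C sort -u "$tmp/overlaps.tab" > "$tmp/overlaps.unique.tab"
n_genes=$(wc -l < "$tmp/overlaps.unique.tab" | tr -d ' ')

# 2. Sort both tables on CGI ID, then join column 2 of the overlaps with column 1 of the GOSlim table
LC_ALL=C sort -t "$tab" -k2,2 "$tmp/overlaps.unique.tab" > "$tmp/overlaps.by-cgi.tab"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/goslim.tab" > "$tmp/goslim.sorted.tab"
LC_ALL=C join -1 2 -2 1 -t "$tab" \
  "$tmp/overlaps.by-cgi.tab" "$tmp/goslim.sorted.tab" > "$tmp/overlaps.goslim.tab"

# 3. Keep the term column (CGI ID, accession, term), drop the catch-all category, and de-duplicate
cut -f 3 "$tmp/overlaps.goslim.tab" \
  | grep -v -x -F 'other biological processes' \
  | LC_ALL=C sort -u > "$tmp/unique-terms.txt"
n_terms=$(wc -l < "$tmp/unique-terms.txt" | tr -d ' ')

echo "$n_genes genes, $n_terms unique GOSlim terms"

rm -rf "$tmp"
```

Here de-duplication is done on whole rows. If the duplicates in the real file differ in some other column, the `sort -u` step would need to be keyed on the ID column instead.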
In this R Markdown file, I made some figures to characterize the GOSlim terms in common genes and in the original datasets. I’m really glad I settled on this figure format during WSN because it has come in handy.
Figure 2. GOSlim terms associated with common genes containing DML in C. gigas and C. virginica.
Figure 3. GOSlim terms associated with C. virginica genes with DML.
Figure 4. GOSlim terms associated with C. gigas genes with DML.
It’s interesting that signal transduction, death, cell adhesion, and cell-cell signaling, all processes represented in both the C. gigas and C. virginica DML datasets, are not present in the GOSlim terms of common genes. Maybe this is due to differences in the experimental set-up, or to family- and/or species-specific stress tolerances.
Going forward
- Draft poster for ASLO and get feedback
- Finalize ASLO poster and send for printing by Thursday to get the ASLO discount!