Gigas and Virginica Comparison Part 4
Reframing my questions and making a poster
I have (almost) all analyses completed, so it’s time to make the poster! I showed Steven the analyses I did thus far, which helped me frame the sub-questions they correspond to.
Where in the genome are MI?
I calculated how much individual features overlapped with MI, but I don’t have any visual representation of the genomic location of MI. Steven suggested I create a stacked barplot similar to those I’ve made in the past.
C. virginica
To start, I returned to my Jupyter notebook. Since `intersectBed` output can change depending on which file you list first, I characterized two different kinds of overlaps:
- MI first: Where are MI in the genome?
- Features first: How much of a single feature overlaps with MI (see below)?
All output is saved in this folder. The naming conventions match which file was argument `-a` and which was `-b` (e.g., MI-Exon means that MI was listed as `-a`). For two features, it didn’t make sense to look at both kinds of overlaps. Understanding which MI are in intergenic regions is important, but I don’t know what it would tell me if there is an intergenic region inside my MI. Also, `bedtools` wouldn’t let me switch the inputs to even answer that question. Knowing which DML are in MI is important, but since DML are single loci, running the opposite command leads to the same output.
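To make the asymmetry concrete, here is a minimal Python sketch of why the count depends on which file is listed first. It is not the notebook's actual `intersectBed` command: the intervals are made up, and it assumes each record in the first file is counted once if it overlaps anything in the second.

```python
# Toy intervals as (chrom, start, end), half-open like BED.
# All coordinates below are hypothetical, for illustration only.

def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def count_with_hit(first, second):
    """Count records in `first` that overlap at least one record in `second`.

    Assumption: this mimics listing `first` as -a and reporting each
    -a record at most once.
    """
    return sum(any(overlaps(a, b) for b in second) for a in first)

mi = [("chr1", 0, 1000), ("chr1", 5000, 6000)]
exons = [("chr1", 100, 200), ("chr1", 300, 400),
         ("chr1", 900, 1100), ("chr2", 0, 50)]

mi_first = count_with_hit(mi, exons)    # how many MI overlap an exon
exon_first = count_with_hit(exons, mi)  # how many exons overlap an MI
print(mi_first, exon_first)
```

With these toy intervals one MI holds three exons, so the "MI first" count is smaller than the "features first" count. That is the same pattern as the Exons row in the tables below.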
Table 1. Characterizing two different kinds of overlaps between MI and various genome features in C. virginica
Feature | MI Location in Feature | Individual Feature Overlaps with MI |
---|---|---|
Exons | 22705 | 240133 |
Introns | 28730 | 92472 |
Genes | 30773 | 15009 |
mRNA | 29805 | 29483 |
Transposable Elements | 25085 | 107926 |
Putative Promoters | 4217 | 8846 |
Other | 1154 | N/A |
DML | N/A | 537 |
C. gigas
I rinsed and repeated with C. gigas MI in this Jupyter notebook.
Table 2. Characterizing two different kinds of overlaps between MI and various genome features in C. gigas
Feature | MI Location in Feature | Individual Feature Overlaps with MI |
---|---|---|
Exons | 16135 | 49607 |
Introns | 18322 | 51980 |
Genes | 19278 | 8761 |
Transposable Elements | 2988 | 5868 |
Putative Promoters | 2498 | 2746 |
Other | 3084 | N/A |
DML | N/A | 384 |
Stacked barplot
I updated this file with overlap information to include MI locations. I imported the revised file into this R Markdown document to analyze the distributions. The MI locations in various genomic features are significantly different between the two species (χ2 = 21.026, df = 4, p-value = 0.0003129).
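For reference, a chi-squared test like this can be sketched in plain Python. This is not the R Markdown code: the `toy` counts are hypothetical, and the 2 x 5 layout (two species, five feature categories) is only an assumption that happens to give df = 4. The closed-form p-value works only for even degrees of freedom, which is enough to check that the reported statistic and p-value agree.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_p_even_df(stat, df):
    """Upper-tail p-value using the closed form that holds for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Hypothetical counts: rows are species, columns are feature categories.
toy = [[30, 25, 20, 15, 10],
       [20, 25, 20, 20, 15]]
stat, df = chi_square_statistic(toy)
p_toy = chi_square_p_even_df(stat, df)

# Consistency check on the statistic reported in the text.
p_reported = chi_square_p_even_df(21.026, 4)
print(stat, df, p_toy, p_reported)
```

Plugging in χ2 = 21.026 with df = 4 gives a p-value of about 0.000313, which agrees with the value reported above.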
Figure 1. Location of MI in C. virginica and C. gigas genomes.
Are genes in MI conserved?
Which genes are in MI for both species? I started working on that question in this Jupyter notebook. My plan:
- Annotate Gene-MI overlaps for each species with Uniprot Accession codes
- Match Uniprot Accession codes!
Annotating the C. gigas overlaps with Uniprot Accession codes was easy, but C. virginica was a bit more complicated. I identified overlaps using a BEDfile that didn’t have gene information! Additionally, I have several files with C. virginica Uniprot Accession codes and matching Genbank IDs, but those are found in mRNA.
To fix the first issue, I went back to this Jupyter notebook and used the `gff` gene track to find the overlaps instead. Easy!
To fix the second issue, I matched gene IDs in the gene-MI overlap file with Genbank IDs in the mRNA track, then matched those Genbank IDs to Uniprot Accession codes (confusing, I know). When it came down to merging Uniprot Accession codes, I found that there were no common genes in MI between the two species! Wild.
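The two-step matching can be sketched in Python as a pair of lookups followed by a set intersection. Every ID, dictionary, and the `annotate` helper below is hypothetical. The real work happened in the Jupyter notebook with the actual annotation files.

```python
# All IDs below are invented placeholders, not real gene, Genbank, or Uniprot IDs.

def annotate(gene_ids, gene_to_genbank, genbank_to_uniprot):
    """Map gene IDs to Uniprot accessions via the Genbank IDs of their mRNAs."""
    accessions = set()
    for gene in gene_ids:
        for genbank_id in gene_to_genbank.get(gene, []):
            accession = genbank_to_uniprot.get(genbank_id)
            if accession is not None:  # skip mRNAs with no annotation
                accessions.add(accession)
    return accessions

# C. virginica: genes overlapping MI, then gene -> mRNA Genbank ID -> accession
cv_genes_in_mi = ["geneA", "geneB"]
cv_gene_to_genbank = {"geneA": ["mRNA_1"], "geneB": ["mRNA_2", "mRNA_3"]}
cv_genbank_to_uniprot = {"mRNA_1": "ACC_A", "mRNA_2": "ACC_B"}

# C. gigas: accessions already attached to the gene-MI overlaps
cg_accessions = {"ACC_B", "ACC_Z"}

cv_accessions = annotate(cv_genes_in_mi, cv_gene_to_genbank, cv_genbank_to_uniprot)
shared = cv_accessions & cg_accessions  # accessions found in MI for both species
print(sorted(shared))
```

In this toy case `shared` holds one accession. The real merge came back empty.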
How much of a single feature overlaps with MI?
Fixing `bedtools` output
Earlier, I tried finding overlaps between C. gigas MI and genome feature tracks in this Jupyter notebook. My output wasn’t correctly tab-delimited, so I posted this issue. After I tried various slight code modifications, Sam noticed I wasn’t using the most recent `bedtools` version. I followed the instructions here to download version 2.29.1, then reran my code. Updating the version was enough to fix my output issue! It turns out only the output formatting needed fixing, so the overlap counts I have don’t need to be revised.
Revising the figure
I returned to this R Markdown file to see if the distributions were different and fix my MI overlap boxplots. I found that the distribution of individual genome features in MI is significantly different between species (χ2 = 23.599, df = 2, p-value = 7.507e-06). I modified the y-axis label in the figure since it wasn’t accurate and added the revised C. gigas intron data.
Figure 2. Final boxplots showing genome feature overlaps with methylation islands.
What biological processes are represented in DML compared to the full genome?
C. virginica
Since I already made a back-to-back GOSlim plot for the paper submission, all I needed to do was recolor it to match the poster color scheme. I returned to this R Markdown file and did just that. Instead of grouping GOSlim terms by colors, I thought I’d just use different colors for the full genome and DML.
Figure 3. GOSlim terms in C. virginica genome and DML
C. gigas
For WSN, I examined GOSlim terms for genes with DML…OR SO I THOUGHT. When I went back to this Jupyter notebook, I found that I annotated genes containing DML with Uniprot Accession codes, but I didn’t actually subset my GOSlim list so that it only included those genes when making my figure in this R Markdown file. Since my goal is to make a back-to-back GOSlim plot for genes and genes with DML, I did that subsetting.
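The missing subsetting step amounts to a filter on accession codes before counting GOSlim terms. A minimal Python sketch, assuming the GOSlim list is a set of (accession, term) pairs; the accessions and pairings are hypothetical, and this is not the R Markdown code itself:

```python
from collections import Counter

# Hypothetical (Uniprot accession, GOSlim term) pairs for all annotated genes.
goslim_rows = [
    ("ACC_A", "transport"),
    ("ACC_B", "developmental processes"),
    ("ACC_C", "transport"),
    ("ACC_C", "cell adhesion"),
]

# Hypothetical accessions for genes that contain DML.
dml_accessions = {"ACC_B", "ACC_C"}

# Keep only the GOSlim rows whose gene contains a DML, then tally terms.
dml_goslim_rows = [row for row in goslim_rows if row[0] in dml_accessions]
all_term_counts = Counter(term for _, term in goslim_rows)
dml_term_counts = Counter(term for _, term in dml_goslim_rows)
print(all_term_counts, dml_term_counts)
```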
Figure 4. GOSlim terms in C. gigas genome and DML
Looking at both figures, there are two things that stand out:
- Both genomes have remarkably similar distributions of GOSlim terms
- DML processes are also similar, but there are potentially more developmental process GOSlim terms associated with C. gigas DML-gene overlaps.
While these figures are good, they take up too much space on the poster! Steven suggested I divide the percent of genes with DML by the percent of all genes for each GOSlim term to get relative proportions. I could then plot the C. virginica and C. gigas data back-to-back.
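As a worked example of that ratio, here is a short Python sketch with made-up gene counts per GOSlim term. The `relative_proportions` function name and all numbers are assumptions for illustration, and each percent is taken relative to the summed counts in its own set.

```python
def relative_proportions(all_counts, dml_counts):
    """Percent of genes with DML divided by percent of all genes, per GOSlim term."""
    total_all = sum(all_counts.values())
    total_dml = sum(dml_counts.values())
    ratios = {}
    for term, n_all in all_counts.items():
        pct_all = 100 * n_all / total_all
        pct_dml = 100 * dml_counts.get(term, 0) / total_dml
        ratios[term] = pct_dml / pct_all
    return ratios

# Hypothetical gene counts per GOSlim term.
all_genes = {"developmental processes": 200, "transport": 300, "cell adhesion": 500}
genes_with_dml = {"developmental processes": 30, "transport": 30, "cell adhesion": 40}

ratios = relative_proportions(all_genes, genes_with_dml)
for term, ratio in ratios.items():
    print(f"{term}: {ratio:.2f}")
```

A ratio above 1 means the term makes up a larger share of genes with DML than of all genes, and a ratio below 1 means a smaller share.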
I made the plot using the same code I used for the other plots, but for some reason it adds “0” to the x-axis label 3 times…unsure why that’s happening but I can fix it later.
Figure 5. Relative proportion of GOSlim terms for genes with DML vs. all genes for C. virginica and C. gigas
Final poster
I tried my hand at the #betterposter format! It was difficult to figure out what the main takeaway should be since there were a lot of interesting components, but we settled on some summary statistics from MI and DML as well as the GOSlim plot. Here’s hoping people will be interested at Ocean Sciences next week!
Going forward
- Send for printing!