Gigas and Virginica Comparison Part 4

Reframing my questions and making a poster

I have (almost) all analyses completed, so it’s time to make the poster! I showed Steven the analyses I did thus far, which helped me frame the sub-questions they correspond to.

Where in the genome are MI?

I calculated how much individual features overlapped with methylation islands (MI), but I don’t have any visual representation of where MI fall in the genome. Steven suggested I create a stacked barplot similar to those I’ve made in the past.

C. virginica

To start, I returned to my Jupyter notebook. Since intersectBed output can change depending on which file you list first, I characterized two different kinds of overlaps:

  • MI first: Where are MI in the genome?
  • Features first: How much of a single feature overlaps with MI (see below)?

All output is saved in this folder. The file names show which file was argument -a and which was -b (e.g., MI-Exon means that MI was listed as -a). For two features, it didn’t make sense to look at both kinds of overlaps. Understanding which MI are in intergenic regions is important, but I don’t know what it would tell me if there is an intergenic region inside an MI. Also, bedtools wouldn’t let me switch the inputs to even answer that question. Knowing which DML are in MI is important, but since DML are single loci, swapping the arguments gives the same output.
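
To make the -a/-b asymmetry concrete, here is a minimal sketch in plain Python. It is not bedtools itself: it only counts how many entries of the first list overlap anything in the second list, which is roughly what `bedtools intersect -u` reports. The coordinates and the helper names (`overlaps`, `count_with_overlap`) are made up for illustration.

```python
# Toy intervals as (start, end) pairs on one chromosome, half-open like BED.
# All coordinates are hypothetical.
mi = [(100, 1000)]  # one methylation island
exons = [(150, 200), (300, 400), (900, 1200), (2000, 2100)]

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def count_with_overlap(first, second):
    """Count entries in `first` that overlap any entry in `second`."""
    return sum(any(overlaps(a, b) for b in second) for a in first)

mi_first = count_with_overlap(mi, exons)    # "MI-Exon": how many MI touch an exon?
exon_first = count_with_overlap(exons, mi)  # "Exon-MI": how many exons touch an MI?
print(mi_first, exon_first)
```

The two counts differ because one large MI can contain several small exons, which is why the two columns in the tables below are not expected to match.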

Table 1. Characterizing two different kinds of overlaps between MI and various genome features in C. virginica

| Feature | MI Location in Feature | Individual Feature Overlaps with MI |
| --- | --- | --- |
| Exons | 22705 | 240133 |
| Introns | 28730 | 92472 |
| Genes | 30773 | 15009 |
| mRNA | 29805 | 29483 |
| Transposable Elements | 25085 | 107926 |
| Putative Promoters | 4217 | 8846 |
| Other | 1154 | N/A |
| DML | N/A | 537 |

C. gigas

I rinsed and repeated with C. gigas MI in this Jupyter notebook.

Table 2. Characterizing two different kinds of overlaps between MI and various genome features in C. gigas

| Feature | MI Location in Feature | Individual Feature Overlaps with MI |
| --- | --- | --- |
| Exons | 16135 | 49607 |
| Introns | 18322 | 51980 |
| Genes | 19278 | 8761 |
| Transposable Elements | 2988 | 5868 |
| Putative Promoters | 2498 | 2746 |
| Other | 3084 | N/A |
| DML | N/A | 384 |

Stacked barplot

I updated this file with overlap information to include MI locations. I imported the revised file into this R Markdown document to analyze the distributions. The MI locations in various genomic features are significantly different between the two species (χ2 = 21.026, df = 4, p-value = 0.0003129).
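
For reference, here is a minimal sketch of a Pearson chi-squared test on a species-by-feature count table, using only the Python standard library. It is not the code from the R Markdown document, and the toy counts are hypothetical. The p-value helper uses a closed form that only holds for even degrees of freedom, which happens to cover df = 4 here.

```python
import math

def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_pvalue_even_df(stat, df):
    """Upper-tail p-value; this closed form is only valid when df is even."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = stat / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical counts: rows are species, columns are five feature categories.
toy = [[30, 25, 20, 15, 10],
       [20, 25, 20, 15, 20]]
stat, df = chi2_statistic(toy)
print(stat, df, chi2_pvalue_even_df(stat, df))

# Sanity check: p-value implied by the statistic reported above (21.026, df = 4).
p_reported = chi2_pvalue_even_df(21.026, 4)
print(p_reported)
```

Plugging in the reported statistic is a quick way to confirm that the χ2, df, and p-value quoted in the text are consistent with each other.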


Figure 1. Location of MI in C. virginica and C. gigas genomes.

Are genes in MI conserved?

Which genes are in MI for both species? I started working on that question in this Jupyter notebook. My plan:

  1. Annotate Gene-MI overlaps for each species with Uniprot Accession codes
  2. Match Uniprot Accession codes!

Annotating the C. gigas overlaps with Uniprot Accession codes was easy, but C. virginica was a bit more complicated. I identified overlaps using a BED file that didn’t have gene information! Additionally, I have several files with C. virginica Uniprot Accession codes and matching Genbank IDs, but those IDs are for mRNA rather than genes.

To fix the first issue, I went back to this Jupyter notebook and used the gff gene track to find the overlaps instead. Easy!

To fix the second issue, I matched gene IDs in the gene-MI overlap file with Genbank IDs in the mRNA track, then matched those to Uniprot Accession codes (confusing, I know). When it came down to merging Uniprot Accession codes, I found that there were no common genes in MI between the two species! Wild.
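
The two-step matching can be sketched as a chained lookup followed by a set intersection. This is an illustration in plain Python, not the notebook code, and every ID and variable name below is invented.

```python
# Hypothetical lookup tables; every ID below is invented for illustration.
cv_gene_to_genbank = {"geneA": "XM_0001", "geneB": "XM_0002", "geneC": "XM_0003"}
cv_genbank_to_uniprot = {"XM_0001": "P11111", "XM_0002": "Q22222"}  # XM_0003 has no annotation
cg_gene_to_uniprot = {"CGI_1": "Q22222", "CGI_2": "O33333"}

def chain_lookup(first, second):
    """Map keys of `first` through `second`, dropping keys with no match."""
    return {key: second[value] for key, value in first.items() if value in second}

# Step 1: C. virginica gene ID -> Genbank ID -> Uniprot Accession code.
cv_gene_to_uniprot = chain_lookup(cv_gene_to_genbank, cv_genbank_to_uniprot)

# Step 2: accession codes present in the gene-MI overlaps of both species.
shared = set(cv_gene_to_uniprot.values()) & set(cg_gene_to_uniprot.values())
print(sorted(shared))
```

In this toy data one accession code is shared. An empty `shared` set corresponds to the result I got with the real files.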

How much of a single feature overlaps with MI?

Fixing bedtools output

Earlier, I tried finding overlaps between C. gigas MI and genome feature tracks in this Jupyter notebook. My output wasn’t correctly tab-delimited, so I posted this issue. After I tried various slight code modifications, Sam noticed I wasn’t using the latest bedtools version. I followed the instructions here to download version 2.29.1, then reran my code. Updating the version was enough to fix my output issue! It turns out only the output formatting was affected, so the overlap counts I have don’t need to be revised.

Revising the figure

I returned to this R Markdown file to see if the distributions were different and fix my MI overlap boxplots. I found that the distribution of individual genome features in MI is significantly different between species (χ2 = 23.599, df = 2, p-value = 7.507e-06). I modified the y-axis label in the figure since it wasn’t accurate and added the revised C. gigas intron data.


Figure 2. Final boxplots showing genome feature overlaps with methylation islands.

What biological processes are represented in DML compared to the full genome?

C. virginica

Since I already made a back-to-back GOSlim plot for the paper submission, all I needed to do was recolor it to match the poster color scheme. I returned to this R Markdown file and did just that. Instead of grouping GOSlim terms by colors, I thought I’d just use different colors for the full genome and DML.


Figure 3. GOSlim terms in C. virginica genome and DML

C. gigas

For WSN, I examined GOSlim terms for genes with DML…OR SO I THOUGHT. When I went back to this Jupyter notebook, I found that I annotated genes containing DML with Uniprot Accession codes, but I didn’t actually subset my GOSlim list so that it only included those genes when making my figure in this R Markdown file. Since my goal is to make a back-to-back GOSlim plot for genes and genes with DML, I did that subsetting.


Figure 4. GOSlim terms in C. gigas genome and DML

Looking at both figures, there are two things that stand out:

  1. Both genomes have remarkably similar distributions of GOSlim terms
  2. DML processes are also similar, but there are potentially more developmental process GOSlim terms associated with C. gigas DML-gene overlaps.

While these figures are good, they take up too much space on the poster! Steven suggested I divide the percent of genes with DML by the percent of all genes for each GOSlim term to get relative proportions. I could then plot the C. virginica and C. gigas data back-to-back.
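
The relative proportion calculation can be sketched as follows. This is plain Python with placeholder term names and made-up counts, not the R code behind the figure.

```python
# Hypothetical counts of genes per GOSlim term; names and numbers are placeholders.
all_genes = {"developmental processes": 200, "transport": 300, "stress response": 500}
dml_genes = {"developmental processes": 30, "transport": 30, "stress response": 40}

def percent(counts):
    """Convert raw counts per term into percent of the total."""
    total = sum(counts.values())
    return {term: 100 * n / total for term, n in counts.items()}

pct_all = percent(all_genes)
pct_dml = percent(dml_genes)

# A ratio above 1 means the term is over-represented among genes with DML.
relative = {term: pct_dml.get(term, 0.0) / pct_all[term] for term in pct_all}
for term, ratio in sorted(relative.items()):
    print(f"{term}: {ratio:.2f}")
```

Because each species is normalized against its own genome, the ratios for C. virginica and C. gigas can share a single axis.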

I made the plot using the same code I used for the other plots, but for some reason it adds “0” to the x-axis label 3 times…unsure why that’s happening but I can fix it later.


Figure 5. Relative proportion of GOSlim terms for genes with DML vs. all genes for C. virginica and C. gigas

Final poster

I tried my hand at the #betterposter format! It was difficult to figure out what the main takeaway should be since there were a lot of interesting components, but we settled on some summary statistics from MI and DML as well as the GOSlim plot. Here’s hoping people will be interested at Ocean Sciences next week!

Screen Shot 2020-02-15 at 1 37 30 PM

Going forward

  1. Send for printing!
Written on February 13, 2020