WGBS Analysis Part 8
Adding GOslim annotations to genes
Last time I worked on my talk, I had trouble annotating genes with GOslim information. I returned to this Jupyter notebook to get GOslim terms for my WSN talk.
I was able to get pretty far in the GOslim annotation pipeline, but I was unable to sort my blast output using sort -V. I posted this issue. Sam told me that I needed to install homebrew on genefish, the machine I was using. I was pretty sure I had already installed homebrew, but I did it again! Sam also pointed out that the user names in my C. gigas and C. virginica notebooks were different. Turns out I had installed homebrew on my laptop, which is what I used to annotate C. virginica genes with GOslim terms! Installing coreutils allowed me to use gsort -V and successfully sort my blast output.
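A minimal sketch of the sorting fix, assuming a shell where version sort (-V) may or may not be available in the default sort. The gene IDs are hypothetical stand-ins for a column of the blast output; gsort is the name Homebrew's coreutils gives to GNU sort.

```shell
# Use a sort that understands -V (version sort): the default sort if it
# supports the flag, otherwise gsort from Homebrew coreutils.
if printf '1\n' | sort -V >/dev/null 2>&1; then
  SORT=sort
elif command -v gsort >/dev/null 2>&1; then
  SORT=gsort
else
  echo "No sort with -V found; try: brew install coreutils" >&2
  exit 1
fi

# Hypothetical gene IDs standing in for the blast output.
sorted=$(printf 'gene10\ngene2\ngene1\n' | "$SORT" -V)
printf '%s\n' "$sorted"
```

Version sort compares the numeric parts as numbers, so gene2 comes before gene10, which a plain lexical sort would not guarantee.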
Once I annotated genes with GOslim terms, I separated out biological process and molecular function GOslim terms. Within each category, I noticed some repeated lines: if a gene has multiple DML, the same GOslim term is likely assigned to it more than once. I removed the duplicate lines using uniq, then saved the unique molecular function and biological process GOslim terms. Some genes with an “uncharacterized biological processes” GOslim term also had other, specific GOslim terms assigned.
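One thing worth keeping in mind for this step: uniq only collapses duplicate lines that are adjacent, so the input is usually sorted first. The sketch below assumes a two-column, tab-delimited layout (gene, GOslim term) with made-up rows; the real file layout in the notebook may differ.

```shell
# Hypothetical gene-to-GOslim table; the duplicate rows are NOT adjacent.
goslim=$(mktemp)
printf 'geneA\ttransport\ngeneB\tcell adhesion\ngeneA\ttransport\n' > "$goslim"

# Sort first so identical lines sit next to each other, then drop repeats.
unique=$(sort "$goslim" | uniq)
n_unique=$(printf '%s\n' "$unique" | wc -l | tr -d ' ')

printf '%s\n' "$unique"
rm -f "$goslim"
```

Running uniq alone on the unsorted file would keep all three rows, because the two geneA lines are separated by the geneB line.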
Figure 1. Gene with specific and unspecific GOslim term assigned.
I wanted to get rid of the uncharacterized lines for these genes but didn’t know how, so I posted this issue. Before I could figure out how to code this, Steven suggested I just remove all “uncharacterized biological processes” entries. Sam said I could use grep --invert-match to find these entries and exclude them from the output file. I saved unique, characterized GOslim entries for molecular function and biological processes.
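As a rough illustration of the filtering Sam suggested, the sketch below drops every line containing the uncharacterized label. The rows are invented, and the exact label text in the real annotation file is an assumption.

```shell
# Hypothetical entries: geneA has both an uncharacterized and a specific term.
entries=$(printf 'geneA\tuncharacterized biological processes\ngeneA\ttransport\ngeneB\tstress response\n')

# --invert-match (short form: -v) keeps only the lines that do NOT match.
characterized=$(printf '%s\n' "$entries" | grep --invert-match "uncharacterized biological processes")

printf '%s\n' "$characterized"
```

This removes the label everywhere, including for genes whose only term was uncharacterized, so those genes disappear from the output entirely.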
Recoloring DML location plot
On the plane to WSN, I realized my black and white plots were, scientifically speaking, blah compared to the rest of my color-coordinated presentation. I decided to pick a color scheme from R that matched my presentation, then use it for my plots.
The color scheme that best matched my PowerPoint was the Red-Purple palette from RColorBrewer. I selected five colors from this palette and reversed the order so the darkest color would be used first. I saved the palette as a new object. To see if the scheme was color-blind friendly, I used
Figure 2. Testing color scheme with
Looking at the plot, I could still distinguish between the five colors being used, so I decided to continue using it for my talk.
Figure 3. Recolored plot with location of all CpGs and DML.
Biological processes plot
Now that I had GOslim data and a color scheme, I was ready to make a GOslim plot! I started this R Markdown script to do just that while flying from Seattle to San Diego. There’s a bunch of code for GO-MWU in there from my C. virginica project I didn’t edit, but I figured I might as well keep it in case I decide to use a similar method with the C. gigas data.
I started by counting the frequency of each GOslim term in my dataset, then converted those frequencies to percentages. I saved this information here. I then grouped my GOslim terms into broader categories to assign colors:
- Metabolic Processes: DNA Metabolism, RNA Metabolism, Protein Metabolism, Other Metabolic Processes
- Stress Response: Stress Response
- Development: Cell Cycle and Proliferation, Developmental Processes, Cell Organization and Biogenesis
- Cell Activity: Cell-Cell Signaling, Cell Adhesion, Death, Signal Transduction, Transport
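The counting and grouping described above were done in R Markdown; the sketch below is only a shell approximation of the same idea, using awk to tally each term, convert counts to percentages, and attach the broader category from the list. The input terms are made up, and the category function is a hypothetical helper that treats anything not matched explicitly as Cell Activity.

```shell
# Hypothetical GOslim terms, one per annotated gene.
terms=$(printf 'Transport\nStress Response\nTransport\nDNA Metabolism\n')

# Output columns: term, category, count, percent of all terms.
summary=$(printf '%s\n' "$terms" | awk '
  # Map a GOslim term to one of the four broader categories.
  function category(term) {
    if (term ~ /Metabol/) return "Metabolic Processes"
    if (term == "Stress Response") return "Stress Response"
    if (term == "Cell Cycle and Proliferation" ||
        term == "Developmental Processes" ||
        term == "Cell Organization and Biogenesis") return "Development"
    return "Cell Activity"
  }
  { count[$0]++; total++ }
  END {
    for (t in count)
      printf "%s\t%s\t%d\t%.1f\n", t, category(t), count[t], 100 * count[t] / total
  }
' | sort)

printf '%s\n' "$summary"
```

A table in this shape (term, category, percent) is enough to order bars by category and then by percent, which is the layout used for the barplot.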
I plotted this initially as a pie chart, but it didn’t look that great. I decided to use a horizontal barplot instead, and organized the bars by category and by the percent of genes with DML in each category.
Figure 4. Biological processes represented by genes with DML.
I’m pretty happy with the format of this graph, so I think I’ll apply it to my C. virginica data too! Now it’s time to update my talk (found here).
- Present at WSN!
- Return to C. virginica data and finish. that. paper.
- Determine sequencing protocol for C. gigas adults and larvae