DML Analysis Part 42
Addressing collaborator feedback
I got some edits from Katie and Alan, so I figured I’d start to address that feedback. I’m also at the point where I don’t think I’ll make that Oct. 1 deadline, so I can take my time to make my paper as good as it can be.
CpG location figures
Now seems like a good time to tackle making better figures to describe DML or CpG locations in the C. virginica genome. The first thing I did was update overlap counts in this spreadsheet. In this R Markdown script, I followed Katie and Steven’s suggestions to make stacked barplots. I started by making figures to compare all CpG categories (found here).
Figure 2. Stacked barplot depicting location of various CpG categories (all CpGs in C. virginica genome, MBD-Enriched, methylated, sparsely methylated, and unmethylated) in specific genomic features (exons, introns, transposable elements, putative promoters, other).
I got feedback that this was too much information for one figure, so I condensed the barplot to just all CpGs and methylated CpGs. The rest of the information from the original figure is still reported in the text.
Figure 3. Stacked barplot depicting location of all CpGs and methylated CpGs in specific genomic features (exons, introns, transposable elements, putative promoters, other).
I also made a similar figure to compare MBD-enriched CpGs and DML.
Figure 4. Stacked barplot depicting location of CpG loci with 5x coverage and DML in specific genomic features (exons, introns, transposable elements, putative promoters, other).
Genes with multiple DML
Based on Katie’s feedback, I decided to look at trends in genes with multiple DML. I mentioned in the text that some genes had multiple DML, but I didn’t quantify how many DML each gene had, or whether they were hyper- or hypomethylated. I examined the trends in this R Markdown file.
Before looking into gene patterns, I thought I should go one step back and look at the distribution of DML across chromosomes. I created a barplot showing how many DML were found in each chromosome. I wanted to see if this tracked the number of genes in each chromosome, so I spent some (read: maybe too much) time learning how to add a second y-axis to a plot. I saved some useful code here. Essentially, I needed to save my original barplot as a new object, then add the second plot on top.
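For future reference, here is a minimal base R sketch of that save-the-barplot-then-overlay idea. It is not the exact code from my script: the data frame `chromCounts` and its values are made up for illustration, and I'm assuming base graphics rather than ggplot2. The key point is that `barplot()` returns the bar midpoints, which can be reused as x-coordinates for the second plot.

```r
# Hypothetical counts for three chromosomes (not the real C. virginica values)
chromCounts <- data.frame(chr = c("chr1", "chr2", "chr3"),
                          nDML = c(40, 25, 60),
                          nGenes = c(3000, 2100, 4500))

pdf(NULL) # Draw to a null device so the sketch runs without opening a window
par(mar = c(5, 4, 2, 5)) # Leave room on the right for the second axis

# Save the barplot as an object: barplot() returns the bar midpoints
barMidpoints <- barplot(chromCounts$nDML, names.arg = chromCounts$chr,
                        ylab = "Number of DML",
                        ylim = c(0, max(chromCounts$nDML) * 1.1))
xRange <- par("usr")[1:2] # x-axis limits of the barplot in user coordinates

# Add the second plot on top, reusing the same x-axis range so points sit over bars
par(new = TRUE)
plot(x = barMidpoints, y = chromCounts$nGenes, type = "b", pch = 16,
     xlim = xRange, xaxs = "i",
     ylim = c(0, max(chromCounts$nGenes) * 1.1),
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4) # Second y-axis on the right
mtext("Number of genes", side = 4, line = 3)
dev.off()
```

Setting `xaxs = "i"` together with the saved `xRange` keeps the overlay from padding the x-axis, so the gene counts line up with the bars underneath.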
Figure 5. Barplot with number of DML and number of genes for each chromosome.
The interesting thing about this plot is that the number of DML tracks the number of genes, which lends support to our idea that DML influence gene activity! While it isn’t shown in this plot, the chromosome with the most genes is not the largest chromosome, so it’s not just a chromosome size thing.
Then I dug into the number of DML in genes. The first thing I did was calculate the mean, median, and maximum number of DML in a gene. Most genes had only 1 DML, but there was a gene with 5 DML. For each gene with more than 1 DML, I wanted to see if the DML were all in the same direction (i.e. all hypermethylated vs. all hypomethylated). I found that not all DML went the same direction in the same gene! I created tables with the gene ID, number of DML, DML location, methylation difference, p-value, q-value, and annotation:
In Excel, I collated DML information for each gene, and counted how many DML were hypermethylated vs. hypomethylated:
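The per-gene tallies could also be done directly in R instead of Excel. This is a rough sketch on a toy table, not my actual workflow: the object `dmlToy`, the gene IDs, and the values are hypothetical, and I'm assuming a `meth.diff` column where a positive value means hypermethylated and a negative value means hypomethylated.

```r
# Toy DML table; gene IDs and methylation differences are made up for illustration
dmlToy <- data.frame(geneID = c("geneA", "geneA", "geneB", "geneC", "geneC", "geneC"),
                     meth.diff = c(55, -60, 70, 52, 58, 61))

# Assumption: positive difference = hypermethylated, negative = hypomethylated
dmlToy$direction <- factor(ifelse(dmlToy$meth.diff > 0, "hyper", "hypo"),
                           levels = c("hyper", "hypo"))

# Rows are genes, columns are directions, cells are DML counts
dirCounts <- table(dmlToy$geneID, dmlToy$direction)

# Number of DML per gene, then summary statistics (mean, median, maximum)
dmlPerGene <- rowSums(dirCounts)
summary(as.vector(dmlPerGene))

# Keep genes with more than 1 DML and flag those with DML in both directions
multiDML <- dirCounts[dmlPerGene > 1, , drop = FALSE]
mixedDirection <- multiDML[, "hyper"] > 0 & multiDML[, "hypo"] > 0
mixedDirection
```

In this toy example, geneA has one DML in each direction and gets flagged as mixed, while geneC has three hypermethylated DML and does not.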
I combined the chromosome information with the number of DML in each gene to create my pride and joy, a beautiful multipanel plot.
Figure 6. Number of DML in chromosomes, number of DML in genes, and number of DML in genes with more than 1 DML. There is only 1 gene with 5 DML, with 4 hypermethylated DML and 1 hypomethylated DML.
There’s no clear pattern of hyper/hypomethylation breakdowns in genes with multiple DML. Interesting find, but I have no idea what that could mean.
Scaled DML distributions
One thing Liew et al. did was look at percent methylation in each exon and intron type. While that seems really interesting, I realized I couldn’t do this for my data because 1) the C. virginica GFF formatting makes it hard to parse and determine which exon or intron is which, and 2) not all genes have the same genomic architecture! After brainstorming with Shelly, I decided to scale each gene from 0% to 100% and see where DML occur. This is useful to see if DML occur in any consistent location within each gene.
To start, I calculated the length of each gene. I then calculated the absolute position of the DML in the gene, and finally, the scaled position:
DMLGeneAnnotationNomRNA$geneLength <- DMLGeneAnnotationNomRNA$gene.end - DMLGeneAnnotationNomRNA$gene.start #Calculate gene length
DMLGeneAnnotationNomRNA$absPosition <- DMLGeneAnnotationNomRNA$start - DMLGeneAnnotationNomRNA$gene.start #Calculate the absolute position of the DML in the gene
DMLGeneAnnotationNomRNA$scaledPosition <- DMLGeneAnnotationNomRNA$absPosition / DMLGeneAnnotationNomRNA$geneLength #Calculate the scaled position of the DML in the gene
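As a quick sanity check of the arithmetic, here is the same calculation on a toy data frame. The object `toyDML` and its coordinates are made up; only the column names match my real table.

```r
# Two hypothetical DML, each inside a different hypothetical gene
toyDML <- data.frame(gene.start = c(1000, 5000),
                     gene.end = c(2000, 9000),
                     start = c(1250, 8000))

toyDML$geneLength <- toyDML$gene.end - toyDML$gene.start # 1000 and 4000
toyDML$absPosition <- toyDML$start - toyDML$gene.start # 250 and 3000
toyDML$scaledPosition <- toyDML$absPosition / toyDML$geneLength
toyDML$scaledPosition # 0.25 0.75
```

Multiplying `scaledPosition` by 100 gives the 0% to 100% scale. One caveat: if `gene.start` is always the smaller coordinate, as it is in GFF files, this measures distance from the lower coordinate regardless of strand.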
Next, I separated out hyper- and hypomethylated DML. My goal was to make a mirror plot, with information for hypermethylated DML on top and hypomethylated DML on the bottom. I found this code, and used it as a starting point for my second pride and joy.
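The general mirror plot idea can be sketched in base R as follows. This is a simplified stand-in, not the code I adapted: the positions are randomly generated for illustration, and the bin width of 10% is an arbitrary choice. Hypermethylated counts are drawn upward, and hypomethylated counts are negated so they are drawn downward from the same baseline.

```r
set.seed(42)
# Hypothetical scaled positions (0 to 1), for illustration only
hyperPos <- runif(60)
hypoPos <- runif(40)

# Count DML in ten equal bins across the scaled gene
binBreaks <- seq(0, 1, by = 0.1)
hyperCounts <- hist(hyperPos, breaks = binBreaks, plot = FALSE)$counts
hypoCounts <- hist(hypoPos, breaks = binBreaks, plot = FALSE)$counts

pdf(NULL) # Draw to a null device so the sketch runs without opening a window
yMax <- max(hyperCounts, hypoCounts)

# Top half: hypermethylated DML
barplot(hyperCounts, ylim = c(-yMax, yMax), space = 0, col = "grey30",
        axes = FALSE, xlab = "Scaled position in gene (%)", ylab = "Number of DML")
# Bottom half: hypomethylated DML, negated so the bars point down
barplot(-hypoCounts, add = TRUE, space = 0, col = "grey80", axes = FALSE)

# Label both halves of the y-axis with positive counts
yTicks <- pretty(c(-yMax, yMax))
axis(side = 2, at = yTicks, labels = abs(yTicks))
# With space = 0 the ten bars span x = 0 to 10, i.e. 0% to 100%
axis(side = 1, at = seq(0, 10, by = 2), labels = seq(0, 100, by = 20))
abline(h = 0)
dev.off()
```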
Figure 7. Distribution of hyper- and hypomethylated DML across a theoretical gene.
Again, no clear patterns! Hyper- and hypomethylated DML seem to occur across the gene and aren’t concentrated in any one area.
Now time to wrap my brain around more text revisions.
Going forward
- Update methods and results
- Update paper repository
- Revise the discussion
- Revise the introduction
- Revise my abstract
- Address any new edits and clean up the text
- Post the paper on bioRxiv
- Submit to the Special Issue