DML Analysis Part 45
Addressing reviewer comments
I didn’t expect reviewer comments a month after I submitted the C. virginica gonad methylation paper, but here we are. They looked relatively minor, so I figured it was okay if I couldn’t get to them until now.
Additional genomic feature overlaps
One reviewer asked us to consider the location of DML with respect to 5’ UTR, coding sequences, and 3’ UTR. Since I had plenty of genome feature tracks, I decided to take this one step further and look at UTR, coding sequence, non-coding sequence, lncRNA, and intergenic region with DML, methylated CpGs, and methylation islands (yup, I’m including this now…see below).
I returned to this Jupyter notebook to look at overlaps between methylated CpGs and these other genome feature tracks. I did similar things in this notebook for MI, but I moved the code over to the previous notebook because I think it makes more sense to include this analysis in the same notebook where I generated the MI. This notebook is where I characterized DML locations.
Table 1. Genome feature overlaps with different CpG categories, methylation islands, and DML
Genome Feature | All 5x CpGs | Methylated CpGs | Sparsely Methylated CpGs | Unmethylated CpGs | DML |
---|---|---|---|---|---|
Total | 4304257 | 3181904 | 481788 | 640565 | 598 |
Exons | 1366779 | 1013691 | 105871 | 247217 | 368 |
Introns | 1884429 | 1504791 | 211143 | 168495 | 192 |
Exon UTR | 192907 | 128585 | 19280 | 45042 | 27 |
mRNA | 3140744 | 2437901 | 303890 | 398953 | 549 |
Coding Sequences | 1174256 | 885327 | 86624 | 202305 | 341 |
Non-Coding Sequences | 2933517 | 2164988 | 375671 | 392858 | 230 |
Genes | 3255049 | 2521653 | 317249 | 416147 | 560 |
Putative Promoters | 176156 | 106111 | 22870 | 47175 | 42 |
Transposable Elements | 1011883 | 755222 | 155293 | 101368 | 57 |
lncRNA | 82671 | 63588 | 9337 | 9746 | 5 |
Intergenic Regions | 1049088 | 660197 | 164528 | 224363 | 38 |
No Overlaps | 603597 | 372047 | 84582 | 146968 | 21 |
Percent methylation of genome features
The biggest task I got for figures was plotting at overall percent methylation for various genome features instead of plotting overlaps. I think it’s important to still plot overlaps to show where methylation occurs — something I don’t find immediately apparent from pure percent methylation plots. I decided to calculate methylation for each genome feature globally and break it down by treatment. For global methylation, I could use the concatenated 5x CpG track. For each treatment, I can calculate the median percent methylation using the 5 control or 5 treatment samples.
Conceptually, getting these numbers involved a few steps I based off of my DMG analysis:
- Interect sample bedgraphs with genome feature tracks (promoters, UTRs, exons, introns, transposable elements, intergenic regions)
- Add percent methylation information
- Calculate median percent methylation for the entire matrix
I used this R Markdown file to carry out my plan. Since I did something similar beforehand, I transferred full sample bedgraphs, sample bedgraphs with only positions, and sample bedgraphs with a sorting key I could use to add percent methylation information to processed files. I reformatted the 5x CpG bedgraph to match these requirements. For each genome feature, I ran intersectBed
with position-only files to find where loci and various genome feature files overlap:
#For each file that ends in posOnly
#Use intersectBed to find where loci and promoters intersect
#wb: Print all lines in the second file
#a: file that ends in posOnly
#b: promoter track
#Only keep specified columns
#Save output in a new file that has the same base name and ends in -UTROverlap
for f in *posOnly
do
/Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
-wb \
-a ${f} \
-b ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_exonUTR_yrv.bed \
| awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6}' \
> ${f}-UTROverlap
done
Then, I needed to add the percent methylation information to these overlaps! I took the overlap files I created and added a sorting key:
#For each file that ends in UTROverlap
#Print the first three columns (chr, start, end) with dashes in between, then the rest of the columns in the file
#Save the file with the base name + .sorted
for f in *UTROverlap
do
awk '{print $1"-"$2"-"$3"\t"$0}' ${f} \
| sort -k1,1 \
> ${f}.sorted
done
I used join
to add the percent methylation information using the sorting key:
#For each file that ends in bedgraph
#Join 2 files using the first column. The files and tab-delimited and the output should also be tab-delimited
#The first file ends with .sorted
#The second file ends with .posOnly-UTROverlap.sorted
#Add .annotated.percentMeth to the base name of the output file
for f in *bedgraph
do
join -j 1 -t $'\t' \
${f}.sorted \
${f}.posOnly-UTROverlap.sorted \
| awk '{print $2"\t"$3"\t"$4"\t"$9"\t"$10"\t"$5}' \
> ${f}-UTROverlap-percentMeth
done
Finally I calculated median methylation for each sample and the concatenated file:
#For each file that ends in *UTROverlap-percentMeth
#Sort numerically
#Calculate the median of the sixth column (percent methylation) for each file
#Save output in a new file
(for f in *UTROverlap-percentMeth
do
sort -n ${f} \
| awk ' { a[i++]=$6; } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
done) > 2020-02-25-UTROverlap-percentMeth-Medians.txt
The order of the output is the global methylation file, sample 10, then samples 1-9. The output files can be found here:
- Promoter median methylation
- UTR median methylation
- Exon median methylation
- Intron median methylation
- Transposable element median methylation
- Intergenic median methylation
Now that I had this information, I needed to create a dataframe that I could use to create plots. I imported the data into R, then created some barplots (see below).
Adding methylation island analysis
Since I generated MI for my poster, Steven suggested I add the analysis to the revised paper. First, I characterzied any remaining overlaps bewteen genome feature tracks and MI in this notebook.
Table 2. MI and genome feature overlaps
Genome Feature | MI Overlaps with Features | Feature Overlaps with MI |
---|---|---|
Exons | 22705 | 240133 |
Introns | 28730 | 92472 |
Exon UTR | 8649 | 30827 |
mRNA | 29805 | 29483 |
Coding Sequences | 20872 | 226237 |
Non-Coding Sequences | 35932 | 98103 |
Genes | 30773 | 15009 |
Putative Promoters | 4217 | 8846 |
Transposable Elements | 25085 | 107926 |
lncRNA | 949 | 1108 |
Intergenic Regions | 10302 | 8526 |
No Overlaps | 1154 | N/A |
In this R Markdown file I recolored my methylation island overlap plot and added it to one with locations of all CpGs and methylated CpGs. I also revised previously-made boxplots in this R Markdown file. For the poster, I looked at how much each individual feature overlapped with respect to the length of the methylation island. For the revised boxplots, I thought it made more sense to look at the overlap length versus the length of the individual feature. For each genome feature — putative promoters, UTRs, exons, introns, transposable elements, and intergenic regions — I calculated the length of an individual feature (each line). I then divided the overlap length by the length of the individual feature.
Figure changes
Figure 2
Figure 2a: Location of all CpGs vs. methylated CpGs vs. methlyation islands
Now that I had additional overlap information, I could update my figures. I pulled additional CG motif overlap information from this Jupyter notebook and updated this document with overlap counts. In this R Markdown file, I updated figures to include UTR and intergenic region overlaps in addition to putative promoters, exons, introns, transposable elements, and other (no overlap between exons, introns, transposable elements or putative promoters…which may no longer make sense but I’ll wait for some feedback on that). I also included methylation island information!
Figure 2b: Individual feature overlap with methylation islands
After recalculating feature overlaps with methylation islands, I created a boxplot. It’s interesting to see that the entirety of exons and transposable elements overlap entirely with methylation islands, while introns, promoters, UTRs, and intergenic regions are more variable.
Figure 2c: Percent methylation across genome features
I first plotted global and sample methylation across genome features in one barplot…
…which was extremly busy. I then tried breaking it up by global vs. sample methylation.
While the global methylation plot was cleaner, the sample methylation plot was still pretty busy. Additionally, I think it’s important to have the global methylation information in the same plot since it’s the “background” for the sample methylation.
The plot version I settled on is a multipanel plot that compares different samples to global methylation for each genome feature:
Full revised figure
Figure 5
I was asked to normalize the number of DML in Figure 5A by the total number of nucleotides in each chromosome. I went to NCBI and pulled out the chromosome length (Mb). I then divided the number of DML in each chromosome by length (Mb) in [this R Markdown file. I considered using the full length instead, but then my y-axis was comprised of really small numbers. I figured I could just indicated in the y-axis text that I normalized by length (Mb) instead of length (bp).
Figure 6: Location of DML vs. enriched CpGs
Similar to Figure 2a, I added additional information for where DML were found!
Figure 8
No changes…I just didn’t include the attachment when I submitted the original paper RIP. It’ll be included this time.
Going forward
- Address remaining comments about discussion text
- Get feedback from Epigenetics Reading Group
- Update manuscript text
- Update response to reviewers
- Consolidate any co-author feedback
- Submit comment responses and reviesed manuscript
- Post revised paper on bioRXiv
- Update paper repository with new code and figures