DML Analysis Part 45

Addressing reviewer comments

I didn’t expect reviewer comments a month after I submitted the C. virginica gonad methylation paper, but here we are. They looked relatively minor, so I figured it was okay if I couldn’t get to them until now.

Additional genomic feature overlaps

One reviewer asked us to consider the location of DML with respect to 5’ UTR, coding sequences, and 3’ UTR. Since I had plenty of genome feature tracks, I decided to take this one step further and look at UTR, coding sequence, non-coding sequence, lncRNA, and intergenic region with DML, methylated CpGs, and methylation islands (yup, I’m including this now…see below).

I returned to this Jupyter notebook to look at overlaps between methylated CpGs and these other genome feature tracks. I did similar things in this notebook for MI, but I moved the code over to the previous notebook because I think it makes more sense to include this analysis in the same notebook where I generated the MI. This notebook is where I characterized DML locations.

Table 1. Genome feature overlaps with different CpG categories, methylation islands, and DML

Genome Feature	All 5x CpGs	Methylated CpGs	Sparsely Methylated CpGs	Unmethylated CpGs	DML
Total	4304257	3181904	481788	640565	598
Exons	1366779	1013691	105871	247217	368
Introns	1884429	1504791	211143	168495	192
Exon UTR	192907	128585	19280	45042	27
mRNA	3140744	2437901	303890	398953	549
Coding Sequences	1174256	885327	86624	202305	341
Non-Coding Sequences	2933517	2164988	375671	392858	230
Genes	3255049	2521653	317249	416147	560
Putative Promoters	176156	106111	22870	47175	42
Transposable Elements	1011883	755222	155293	101368	57
lncRNA	82671	63588	9337	9746	5
Intergenic Regions	1049088	660197	164528	224363	38
No Overlaps	603597	372047	84582	146968	21

Percent methylation of genome features

The biggest task I got for figures was plotting at overall percent methylation for various genome features instead of plotting overlaps. I think it’s important to still plot overlaps to show where methylation occurs — something I don’t find immediately apparent from pure percent methylation plots. I decided to calculate methylation for each genome feature globally and break it down by treatment. For global methylation, I could use the concatenated 5x CpG track. For each treatment, I can calculate the median percent methylation using the 5 control or 5 treatment samples.

Conceptually, getting these numbers involved a few steps I based off of my DMG analysis:

Interect sample bedgraphs with genome feature tracks (promoters, UTRs, exons, introns, transposable elements, intergenic regions)
Add percent methylation information
Calculate median percent methylation for the entire matrix

I used this R Markdown file to carry out my plan. Since I did something similar beforehand, I transferred full sample bedgraphs, sample bedgraphs with only positions, and sample bedgraphs with a sorting key I could use to add percent methylation information to processed files. I reformatted the 5x CpG bedgraph to match these requirements. For each genome feature, I ran intersectBed with position-only files to find where loci and various genome feature files overlap:

#For each file that ends in posOnly
#Use intersectBed to find where loci and promoters intersect
#wb: Print all lines in the second file
#a: file that ends in posOnly
#b: promoter track
#Only keep specified columns
#Save output in a new file that has the same base name and ends in -UTROverlap
for f in *posOnly
do
  /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
  -wb \
  -a ${f} \
  -b ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_exonUTR_yrv.bed \
    | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6}' \
  > ${f}-UTROverlap
done

Then, I needed to add the percent methylation information to these overlaps! I took the overlap files I created and added a sorting key:

#For each file that ends in UTROverlap
#Print the first three columns (chr, start, end) with dashes in between, then the rest of the columns in the file
#Save the file with the base name + .sorted
for f in *UTROverlap
do
  awk '{print $1"-"$2"-"$3"\t"$0}' ${f} \
  | sort -k1,1 \
  > ${f}.sorted
done

I used join to add the percent methylation information using the sorting key:

#For each file that ends in bedgraph
#Join 2 files using the first column. The files and tab-delimited and the output should also be tab-delimited
#The first file ends with .sorted
#The second file ends with .posOnly-UTROverlap.sorted
#Add .annotated.percentMeth to the base name of the output file
for f in *bedgraph
do
  join -j 1 -t $'\t' \
  ${f}.sorted \
  ${f}.posOnly-UTROverlap.sorted \
  | awk '{print $2"\t"$3"\t"$4"\t"$9"\t"$10"\t"$5}' \
  > ${f}-UTROverlap-percentMeth
done

Finally I calculated median methylation for each sample and the concatenated file:

#For each file that ends in *UTROverlap-percentMeth
#Sort numerically
#Calculate the median of the sixth column (percent methylation) for each file
#Save output in a new file
(for f in *UTROverlap-percentMeth
do
  sort -n ${f} \
  | awk ' { a[i++]=$6; } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
done) > 2020-02-25-UTROverlap-percentMeth-Medians.txt

The order of the output is the global methylation file, sample 10, then samples 1-9. The output files can be found here:

Now that I had this information, I needed to create a dataframe that I could use to create plots. I imported the data into R, then created some barplots (see below).

Adding methylation island analysis

Since I generated MI for my poster, Steven suggested I add the analysis to the revised paper. First, I characterzied any remaining overlaps bewteen genome feature tracks and MI in this notebook.

Table 2. MI and genome feature overlaps

Genome Feature	MI Overlaps with Features	Feature Overlaps with MI
Exons	22705	240133
Introns	28730	92472
Exon UTR	8649	30827
mRNA	29805	29483
Coding Sequences	20872	226237
Non-Coding Sequences	35932	98103
Genes	30773	15009
Putative Promoters	4217	8846
Transposable Elements	25085	107926
lncRNA	949	1108
Intergenic Regions	10302	8526
No Overlaps	1154	N/A

In this R Markdown file I recolored my methylation island overlap plot and added it to one with locations of all CpGs and methylated CpGs. I also revised previously-made boxplots in this R Markdown file. For the poster, I looked at how much each individual feature overlapped with respect to the length of the methylation island. For the revised boxplots, I thought it made more sense to look at the overlap length versus the length of the individual feature. For each genome feature — putative promoters, UTRs, exons, introns, transposable elements, and intergenic regions — I calculated the length of an individual feature (each line). I then divided the overlap length by the length of the individual feature.

Figure changes

Figure 2

Figure 2a: Location of all CpGs vs. methylated CpGs vs. methlyation islands

Now that I had additional overlap information, I could update my figures. I pulled additional CG motif overlap information from this Jupyter notebook and updated this document with overlap counts. In this R Markdown file, I updated figures to include UTR and intergenic region overlaps in addition to putative promoters, exons, introns, transposable elements, and other (no overlap between exons, introns, transposable elements or putative promoters…which may no longer make sense but I’ll wait for some feedback on that). I also included methylation island information!

Screen Shot 2020-02-24 at 1 20 05 PM

Figure 2b: Individual feature overlap with methylation islands

After recalculating feature overlaps with methylation islands, I created a boxplot. It’s interesting to see that the entirety of exons and transposable elements overlap entirely with methylation islands, while introns, promoters, UTRs, and intergenic regions are more variable.

Screen Shot 2020-02-24 at 1 42 16 PM

Figure 2c: Percent methylation across genome features

I first plotted global and sample methylation across genome features in one barplot…

Screen Shot 2020-02-25 at 11 51 44 PM

…which was extremly busy. I then tried breaking it up by global vs. sample methylation.

Screen Shot 2020-02-25 at 11 52 09 PM

Screen Shot 2020-02-25 at 11 52 22 PM

While the global methylation plot was cleaner, the sample methylation plot was still pretty busy. Additionally, I think it’s important to have the global methylation information in the same plot since it’s the “background” for the sample methylation.

The plot version I settled on is a multipanel plot that compares different samples to global methylation for each genome feature:

Screen Shot 2020-02-25 at 11 51 56 PM

Full revised figure

Figure 5

I was asked to normalize the number of DML in Figure 5A by the total number of nucleotides in each chromosome. I went to NCBI and pulled out the chromosome length (Mb). I then divided the number of DML in each chromosome by length (Mb) in [this R Markdown file. I considered using the full length instead, but then my y-axis was comprised of really small numbers. I figured I could just indicated in the y-axis text that I normalized by length (Mb) instead of length (bp).

Screen Shot 2020-02-19 at 2 40 40 PM

Figure 6: Location of DML vs. enriched CpGs

Similar to Figure 2a, I added additional information for where DML were found!

Screen Shot 2020-02-24 at 11 37 31 AM

Figure 8

No changes…I just didn’t include the attachment when I submitted the original paper RIP. It’ll be included this time.

Going forward

Address remaining comments about discussion text
Get feedback from Epigenetics Reading Group
Update manuscript text
Update response to reviewers
Consolidate any co-author feedback
Submit comment responses and reviesed manuscript
Post revised paper on bioRXiv
Update paper repository with new code and figures

Written on February 19, 2020