MethCompare Part 5
Characterizing full samples
We got the all-clear from Shelly and Steven, so I started working on the CpG characterization pipeline with full samples in this Jupyter notebook.
Running the full pipeline
M. capitata
After downloading the 10x bedgraphs from this link, I classified CpG in each sample as either methylated (> 50%), sparsely methylated (10-50%), or unmethylated (< 10%). I created a summary table that included counts and percentages for us to understand the data.
Table 1. Number and proportion of methylated (> 50%), sparsely methylated (10-50%), and unmethylated (< 10%) CpG dinucleotides in M. capitata samples.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
10 | WGBS | 470893 | 35488 (7.5%) | 43405 (9.2%) | 392000 (83.2%) |
11 | WGBS | 479520 | 49640 (10.3%) | 42459 (8.9%) | 387421 (80.8%) |
12 | WGBS | 1756997 | 173085 (9.9%) | 127160 (7.2%) | 1456752 (82.9%) |
13 | RRBS | 2945967 | 204797 (7.0%) | 129206 (4.4%) | 2611964 (88.7%) |
14 | RRBS | 2310457 | 136505 (5.9%) | 107375 (4.6%) | 2066577 (89.4%) |
15 | RRBS | 2874355 | 172957 (6.0%) | 145976 (5.1%) | 2555422 (88.9%) |
16 | MBD-BSSeq | 44091 | 16168 (36.7%) | 5283 (12.0%) | 22640 (51.3%) |
17 | MBD-BSSeq | 21797 | 6080 (27.9%) | 2282 (10.5%) | 13435 (61.6%) |
18 | MBD-BSSeq | 14818 | 4837 (32.6%) | 1433 (9.7%) | 8548 (57.7%) |
Looking at each method, it seems like MBD-BSSeq has more interindividual variation than WGBS and RRBS. I think this may be a factor of the sybiodinium contamination, but it would be interesting to actually investigate that variation statistically.
I then looked at the distribution of CpG in genes, CDS, introns, and intergenic regions. I used the genome tracks I created, and used intersectBed -v
with the gene track to look at intergenic regions. When I checked my output for intersections with genes, I noticed that the gene track still had CDS and intron information in it even though I fixed this error! I’m not sure what happened with Github commits and potential branch merging, but I went back to this Jupyter notebook to recreate my gene track using only AUGUSTUS information. I also went through and looked at overlaps between CG motifs and each track. Most of the CG motifs for M. capitata are found in intergenic regions.
Table 2. CG motif overlaps with M. capitata genome tracks.
M. capitata Genome Feature | Number individual features | Overlaps with CG Motifs | % Total CG Motifs |
---|---|---|---|
Genes | 458273 | 9450564 | 32.9 |
Coding Sequences | 283926 | 1953206 | 6.8 |
Introns | 221428 | 7503314 | 26.2 |
Intergenic Regions | N/A | 19224826 | 67.0 |
Once I fixed my tracks, I used intersectBED
to look at intersections between samples and variosu genome features. Again, I made tables that had count and percentage information. Similar to waht I saw in Table 1, there was more variation between MBD-BSSeq samples than the other methods. However, MBD-BSSeq still performed better at capturing methylated regions.
Table 3. Number and percent of CpGs that overlap with genes in M. capitata.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
10 | WGBS | 160102 | 17007 (10.6%) | 15647 (9.8%) | 127448 (79.6%) |
11 | WGBS | 166510 | 23163 (13.9%) | 15042 (9.0%) | 128305 (77.1%) |
12 | WGBS | 634252 | 82744 (13.0%) | 44107 (7.0%) | 507401 (80.0%) |
13 | RRBS | 988858 | 99753 (10.1%) | 46167 (4.7%) | 842938 (85.2%) |
14 | RRBS | 780718 | 67014 (8.6%) | 39102 (5.0%) | 674602 (86.4%) |
15 | RRBS | 964930 | 86291 (8.9%) | 53429 (5.5%) | 825210 (85.5%) |
16 | MBD-BSSeq | 11499 | 6390 (55.6%) | 1215 (10.6%) | 3894 (33.9%) |
17 | MBD-BSSeq | 5127 | 2373 (46.3%) | 511 (10.0%) | 2243 (43.7%) |
18 | MBD-BSSeq | 3278 | 2018 (61.6%) | 165 (5.0%) | 1095 (33.4%) |
Table 4. Number and percent of CpGs that overlap with CDS in M. capitata.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
10 | WGBS | 63618 | 5918 (9.3%) | 7652 (12.0%) | 50048 (78.7%) |
11 | WGBS | 67079 | 8085 (12.1%) | 7257 (10.8%) | 51737 (77.1%) |
12 | WGBS | 209101 | 25530 (12.2%) | 16812 (8.0%) | 166759 (79.8%) |
13 | RRBS | 190252 | 14708 (7.7%) | 10382 (5.5%) | 165162 (86.8%) |
14 | RRBS | 151699 | 9279 (6.1%) | 8654 (5.7%) | 133766 (8.8%) |
15 | RRBS | 186877 | 11132 (6.0%) | 12318 (6.6%) | 163427 (87.5%) |
16 | MBD-BSSeq | 4664 | 2046 (43.9%) | 623 (13.4%) | 1995 (42.8%) |
17 | MBD-BSSeq | 2307 | 758 (32.9%) | 220 (9.5%) | 1329 (57.6%) |
18 | MBD-BSSeq | 1282 | 750 (58.5) | 93 (7.3%) | 439 (34.2%) |
Table 5. Number and percent of CpGs that overlap with introns in M. capitata.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
10 | WGBS | 96590 | 11106 (11.5%) | 8003 (8.3%) | 77481 (80.2%) |
11 | WGBS | 99522 | 15088 (15.2%) | 7793 (7.8%) | 76641 (77.0%) |
12 | WGBS | 425558 | 57278 (13.5%) | 27323 (6.4%) | 340957 (80.1%) |
13 | RRBS | 798984 | 85069 (10.6%) | 35795 (4.5%) | 678120 (84.9%) |
14 | RRBS | 629356 | 57748 (9.2%) | 30463 (4.8%) | 541145 (86.0%) |
15 | RRBS | 778492 | 75177 (9.7%) | 41135 (5.3%) | 662180 (85.1%) |
16 | MBD-BSSeq | 6841 | 4347 (63.5%) | 595 (8.7%) | 1899 (27.8%) |
17 | MBD-BSSeq | 2824 | 1618 (57.3%) | 291 (10.3%) | 915 (32.4%) |
18 | MBD-BSSeq | 2000 | 1270 (63.5%) | 73 (3.7%) | 657 (32.9%) |
Table 6. Number and percent of CpGs that overlap with intergenic regions in M. capitata.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
10 | WGBS | 310791 | 18481 (5.9%) | 27758 (8.9%) | 264552 (85.1%) |
11 | WGBS | 313010 | 26477 (8.5%) | 27417 (8.8%) | 259116 (82.8%) |
12 | WGBS | 1122745 | 90341 (8.0%) | 83053 (7.4%) | 949351 (84.6%) |
13 | RRBS | 1957109 | 105044 (5.4%) | 83039 (4.2%) | 1769026 (90.4%) |
14 | RRBS | 1529739 | 69491 (4.5%) | 68273 (4.5%) | 1391975 (91.0%) |
15 | RRBS | 1909425 | 86666 (4.5%) | 92547 (4.5%) | 1730212 (90.6%) |
16 | MBD-BSSeq | 32592 | 9778 (30.0%) | 4068 (12.5%) | 18746 (57.5%) |
17 | MBD-BSSeq | 16670 | 3707 (22.2%) | 1771 (10.6%) | 11192 (67.1%) |
18 | MBD-BSSeq | 11540 | 2819 (24.2%) | 1268 (11.0%) | 7453 (64.6%) |
I generated these tables manually, but I really need a way to do this programmatically. That’s a problem for later.
P. acuta
I did the same thing for all P. acuta samples! Just to be safe, I double-checked my code to generate the genome feature tracks and look at overlaps with CG motifs. Similar to M. capitata, most of the CG motifs were present in intergenic regions.
Table 7. P. acuta genome feature overlaps with CG motifs.
P. acuta Genome Feature | Number individual features | Overlaps with CG Motifs | % Total CG Motifs |
---|---|---|---|
Genes | 64558 | 3434720 | 35.6 |
Coding Sequences | 318484 | 1455630 | 15.1 |
Introns | 241534 | 1999490 | 20.7 |
Intergenic Regions | N/A | 5720900 | 59.3 |
I then looked at the number of methylated, sparsely methylated, and unmethylated CpGs in each sample. Similar to M. capitata, there was more interindividual variation for MBD-BSSeq samples. This variation was more pronounced for P. acuta.
Table 8. Number and proportion of methylated (> 50%), sparsely methylated (10-50%), and unmethylated (< 10%) CpG dinucleotides in P. acuta samples.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
1 | WGBS | 2518069 | 37201 (1.5%) | 83999 (3.3%) | 2396869 (95.2%) |
2 | WGBS | 3926923 | 66524 (1.7%) | 109999 (2.8%) | 3750400 (95.5%) |
3 | WGBS | 3028012 | 51081 (1.7%) | 104860 (3.5%) | 2872071 (94.9%) |
4 | RRBS | 1184293 | 12021 (1.0%) | 65647 (5.5%) | 1106625 (93.4%) |
5 | RRBS | 992337 | 14557 (1.5%) | 27831 (2.8%) | 949949 (95.7%) |
6 | RRBS | 1014588 | 10621 (1.0%) | 38383 (3.8%) | 965584 (95.2%) |
7 | MBD-BSSeq | 744052 | 195284 (26.2%) | 106836 (14.3%) | 441932 (59.4%) |
8 | MBD-BSSeq | 250032 | 156098 (62.4%) | 39889 (16.0%) | 54045 (21.6%) |
9 | MBD-BSSeq | 725079 | 187956 (25.9%) | 112592 (15.5%) | 424531 (58.5%) |
Finally, I looked at the overlaps between CpGs and genes, CDS, introns, and intergenic regions with bedtools
. Making these tables really drove home the importance of generating these programatically instead of manually! I divided so many numbers in succession with Google that I had to cofirm I wasn’t a bot many times.
Table 9. Number and percent of CpGs that overlap with genes in P. acuta.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
1 | WGBS | 1179434 | 25899 (2.2%) | 34626 (2.9%) | 1118909 (94.9%) |
2 | WGBS | 1821034 | 47716 (2.6%) | 49984 (2.7%) | 1723334 (94.6%) |
3 | WGBS | 1426087 | 35357 (2.5%) | 44108 (3.1%) | 1346622 (94.4%) |
4 | RRBS | 502813 | 4988 (1.0%) | 26205 (5.2%) | 471620 (93.8%) |
5 | RRBS | 416016 | 5815 (1.4%) | 10664 (2.6%) | 399537 (96.0%) |
6 | RRBS | 428818 | 4293 (1.0%) | 14614 (3.4%) | 409911 (95.6%) |
7 | MBD-BSSeq | 325489 | 87468 (26.9%) | 31803 (9.8%) | 206218 (63.4%) |
8 | MBD-BSSeq | 88675 | 60278 (68.0%) | 11894 (13.4%) | 16503 (18.6%) |
9 | MBD-BSSeq | 314739 | 92131 (29.3%) | 34146 (10.8%) | 188462 (59.9%) |
Table 10. Number and percent of CpGs that overlap with CDS in P. acuta.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
1 | WGBS | 850489 | 23872 (2.8%) | 26408 (3.1%) | 800209 (94.1%) |
2 | WGBS | 1219826 | 41839 (3.4%) | 34947 (2.9%) | 1143040 (93.7%) |
3 | WGBS | 1010637 | 32069 (3.2%) | 32550 (3.2%) | 946018 (93.6%) |
4 | RRBS | 334839 | 3840 (1.1%) | 18125 (5.4%) | 312874 (93.4%) |
5 | RRBS | 275416 | 4217 (1.5%) | 7617 (2.8%) | 263582 (95.7%) |
6 | RRBS | 283255 | 3230 (1.1%) | 10420 (3.7%) | 269605 (95.2%) |
7 | MBD-BSSeq | 284856 | 72744 (25.5%) | 24809 (8.7%) | 187303 (65.8%) |
8 | MBD-BSSeq | 73240 | 51129 (69.8%) | 9109 (12.4%) | 13002 (17.8%) |
9 | MBD-BSSeq | 265810 | 75705 (28.5%) | 26175 (9.8%) | 163930 (61.7%) |
Table 11. Number and percent of CpGs that overlap with introns in P. acuta.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
1 | WGBS | 733890 | 11164 (1.5%) | 19112 (2.6%) | 703614 (95.9%) |
2 | WGBS | 1237263 | 23483 (1.9%) | 32016 (2.6%) | 1181764 (95.5%) |
3 | WGBS | 908646 | 15720 (1.7%) | 25760 (2.8%) | 867166 (95.4%) |
4 | RRBS | 336436 | 2726 (0.8%) | 16685 (5.0%) | 317025 (94.2%) |
5 | RRBS | 277680 | 3514 (1.3%) | 6409 (2.3%) | 267757 (96.4%) |
6 | RRBS | 287803 | 2348 (0.8%) | 8824 (3.1%) | 276631 (96.1%) |
7 | MBD-BSSeq | 140343 | 42347 (30.2%) | 15003 (10.7%) | 82993 (59.1%) |
8 | MBD-BSSeq | 40384 | 27138 (67.2%) | 5541 (13.7%) | 7705 (19.1%) |
9 | MBD-BSSeq | 146914 | 46592 (31.7%) | 16913 (11.5%) | 83409 (56.8%) |
Table 12. Number and percent of CpGs that overlap with intergenic regions in P. acuta.
Sample | Method | CpG with Data | Methylated CpG | Sparsely Methylated CpG | Unmethylated CpG |
---|---|---|---|---|---|
1 | WGBS | 1339417 | 11319 (0.8%) | 49388 (3.7%) | 1278710 (95.5%) |
2 | WGBS | 2107121 | 18848 (0.9%) | 60047 (2.8%) | 2028226 (96.3%) |
3 | WGBS | 1602822 | 15747 (1.0%) | 60783 (3.8%) | 1526292 (95.2%) |
4 | RRBS | 681870 | 7040 (1.0%) | 39459 (5.8%) | 635371 (93.2%) |
5 | RRBS | 576682 | 8749 (1.5%) | 17176 (3.0%) | 550757 (95.5%) |
6 | RRBS | 586105 | 6335 (1.1%) | 23780 (4.1%) | 555990 (94.9%) |
7 | MBD-BSSeq | 418826 | 107907 (25.8%) | 75067 (17.9%) | 235852 (56.3%) |
8 | MBD-BSSeq | 161464 | 95906 (59.4%) | 28006 (17.3%) | 37552 (23.3%) |
9 | MBD-BSSeq | 410614 | 95913 (23.4%) | 78494 (19.1%) | 236207 (57.5%) |
Lookinga t these tables, it seems like sample 8 is the outlier for the MBD-BSSeq data. That file has less CpGs than the other two MBD-BSSeq files, and performs better at capturing methyalted regions. This could mean that there are less mapped reads for the sample becuse of C1 contamination, but I’m not sure. It’s defienitely something to look into later.
Going forward
- Create figures for CpG characterization in various genome features
- Update code for methylation frequency distribution figure
- Look into exon annotations in Liew and Li papers
- Generate repeat tracks for each species
- Figure out how to meaningfully concatenate data for each method
- Figure out methylation island analysis