DML Analysis Part 35
Finalizing C. virginica tracks and revisiting analyses
Generating genome feature tracks
Two weeks ago, I was struggling to generate a feasible intron track. At FROGER, Jon shared the code he used to generate C. virginica feature tracks to the wiki. The code he used to generate tracks was similar to my code, which was comforting. Looking at his code for intron tracks, he used a combination of complementBed
and intersectBed
, which is what Sam suggested in this issue. I decided to use that approach with my own tracks.
Instead of just creating an intron track, I decided to create all of the additional tracks Jon created: intergenic regions, non-coding sequences, introns, and mtDNA. Before I could create tracks, I needed to create a bedtools
friendly genome file. The genome file needed to have chromosome name and lengths. I previously created this document, so I just copied it to my analysis folder (it can be found here). I also isolated just the chromosome names in this document.
To isolate intergenic regions, I first needed to sort the gene track. I used sortBed
to organize the listings by chromosme name:
!{bedtoolsDirectory}sortBed \
-faidx {chromNames} \
-i C_virginica-3.0_Gnomon_gene_yrv.gff3 \
> C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3
Jon also included a step where he used mergeBed
on the sorted file to merge any transcript variants. I skipped this step for now and sorted my exon track. I used complementBed
with the sorted gene track to find all possible locations of intergenic regions, then removed sorted exons with subtractBed
:
!{bedtoolsDirectory}complementBed \
-i {geneList} -sorted \
-g {chromLengths} \
| {bedtoolsDirectory}subtractBed \
-a - \
-b {exonList} \
> C_virginica-3.0_Gnomon_intergenic_yrv.gff3
Non-coding sequences were created using the complement of exons:
!{bedtoolsDirectory}complementBed \
-i {exonList} \
-g 2018-06-15-bedtools-Chromosome-Lengths.txt \
> C_virginica-3.0_Gnomon_noncoding_yrv.gff3
I then used intersectBed
with non-coding sequences and genes to find introns, since introns are non-coding sequences found in genes:
!{bedtoolsDirectory}intersectBed \
-a {nonCDS} \
-b {geneList} -sorted \
> C_virginica-3.0_Gnomon_intron_yrv.gff3
Coding sequences are those that are translated into proteins. By subtracting coding sequences from exons, I also derived untranslated regions of exons:
!{bedtoolsDirectory}subtractBed \
-a {exonList} \
-b {CDSList} \
-sorted \
-g {chromLengths} \
> C_virginica-3.0_Gnomon_exonUTR_yrv.gff3
Generating the mitochondrial DNA track was the easiest. The smallest chromosome, NC_007175.2, is mtDNA, so I just needed to isolate genes from that chromosome:
!grep "NC_007175.2" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_mtDNA_yrv.gff3
Visualizing tracks in IGV
Of course, no track generation is complete without looking at the products. I perused the tracks in this IGV session and was pretty happy with how they turned out! I am more confident in these tracks than the pre-generated ones because I understand how they were made.
Figures 1-3. All tracks in IGV.
Since my tracks looked good, I posted the .gff3
and .bed
verisions here.
Characterizing feature tracks
The reason I started generating new feature tracks is because Steven asked me to do so in this issue. I obtained the length of each feature track and counted overlaps with the pre-generated CG motif track.
Table 1. Length of feature tracks and number of overlaps with CG motifs. intersectBed
was used to obtain CG motif overlaps with various feature tracks, defining the CG motif track as -a
and the various feature tracks as -b
.
Feature | Number of Elements | CG Motif Overlaps with Feature |
---|---|---|
Genes | 38,929 | 7,914,842 |
Intergenic regions | 34,557 | 6,545,363 |
mRNA | 60,201 | 7,507,167 |
Exons | 731,279 | 2,330,546 |
Introns | 316,614 | 5,596,808 |
Coding sequences | 645,335 | 1,728,032 |
Non-coding sequences | 336,677 | 12,142,171 |
Untranslated regions of exons | 182,752 | 602,551 |
lncRNA | 4750 | 281,715 |
mtDNA | 78 | 431 |
I also posted Table 1 in the issue. I noticed that there are more CG motif overlaps in the introns over exons, and non-coding sequences over coding sequences. However, genes have more overlaps than intergenic regions.
Characterizing overlaps with DML
Now that I had better feature tracks, I figured I’d recharacterize my DML overlaps. Although I have more tracks I can use, I decided to focus on intersections with genes, exons, introns, transposable elements, and putative promoters. Anything that did not overlap with these categories could be considered “other” or “intergenic.” I opened this Jupyter notebook, downloaded the new tracks from gannet
, and changed the variable paths (again…so. forking. handy). I counted overlaps and wrote out files to this directory. I didn’t want to overwrite any of the files I generated with previous versions of my feature tracks, so I tagged everything with the date (2019-05-29). I also created a new directory for my revised flanking analysis, which can be found here. The previous version of my Jupyter notebook has overlap proportion calculations and extraneous chunks of code that were confusing, so I removed them from the notebook. It’s not much cleaner and easier to follow (I think).
Table 2. Overlaps between DML and various genome feature tracks.
Feature | Hypermethylated DML | Hypomethylated DML | All DML |
---|---|---|---|
Genes | 289 | 271 | 560 |
Unique Genes | 269 | 241 | 481 |
Exons | 190 | 178 | 368 |
Introns | 99 | 93 | 192 |
Transposable Elements (All) | 26 | 31 | 57 |
Transposable Elements (C. gigas only) | 16 | 23 | 39 |
Putative promoters | 44 | 23 | 67 |
Other | 9 | 6 | 15 |
One good thing from Table 2: the counts for the hypermethylated and hypomethylated DML always sum to the counts for all DML. For the most part, there’s an even split between hyper- and hypomethylation. There are twice as many hypermethylated DML in putative promoters than hypomethylated DML. This makes sense since promoters have “housekeeping” functions. There are more DML found in exons than introns, which is also interesting. Exons are the sections of the genes that are most likely retained in coding sequences. There are 481 unique genes with DML, and 15 in other (potentially intragenic) regions.
Characterizing general methylation trends
Keeping the bedtools
party going, I decided to recharacterize overlaps of feature tracks with methylated CpGs, sparsely methylated CpGs, and unmethylated CpGs. Again, I generated new files and tagged them with the date (2019-05-29).
Table 3. Overlaps between various genome feature tracks and all 5x CpGs, methylated CpGs, sparsely methylated CpGs, and unmethylated CpGs.
Feature | All 5x CpGs | Methylated CpGs | Sparsely Methylated CpGs | Unmethylated CpGs |
---|---|---|---|---|
Genes | 3,255,049 | 2,521,653 | 317,249 | 416,147 |
Unique Genes | 33,126 | 25,496 | 26,953 | 27,753 |
Exons | 1,366,779 | 1,013,691 | 105,871 | 247,217 |
Introns | 1,884,429 | 1,504,791 | 211,143 | 168,495 |
Transposable Elements (All) | 1,011,883 | 755,222 | 155,293 | 101,368 |
Transposable Elements (C. gigas only) | 767,604 | 610,208 | 108,858 | 48,538 |
Putative promoters | 203,330 | 134,468 | 27,429 | 41,433 |
Other | 579,159 | 349,064 | 81,168 | 148,927 |
The main concern Mac brought up in this issue that sent me on a saga of feature track generation was that overlap counts for methylated CpGs (methylatedCpG
) were greater than those for CG motifs (totalCpG
). That doesn’t seem to be the case this time!
Returning to the chi-squared test
Since I felt better about my overlap counts, I updated this document so I could them for statistical analyses. Mac said she talked to Loveday Conquest about statistical tests, and that after some discussion chi-squared tests were deemed appropriate. I’m just going to continue with the chi-squared tests in this R Markdown document and perhaps ask Katie Lotterhos if she has other ideas or knows of any post-hoc tests during our next meeting.
Table 4. Results of chi-squared tests between various CpG groupings. All CpGs refer to all CG motifs in the C. virginica genome. MBD-Enriched CpGs are those found in C. virginica samples with at least 5x sequencing depth. Of those CpG enriched by MBD treatment, methylated CpGs were those with at least 50% methylation. DML were at lest 50% different between treatment with q-values < 0.01. Comparisons were only conducted with a particular CpG grouping and it’s associated background (i.e. All CpGs vs. MBD-Enriched, MBD-Enriched vs. Methylated CpGs, and Methylated CpGs vs. DML), with N/A referring to tests that were not conducted. Values above the diagonal are χ² statistics with df = 4, and values below the diagonal are associated p-values.
CpG Category | All CpGs | MBD-Enriched | Methylated CpGs | DML |
---|---|---|---|---|
All CpGs | - | 906940 | N/A | N/A |
MBD-Enriched | < 2.2e-16 | - | 15016 | N/A |
Methylated CpGs | N/A | < 2.2e-16 | - | 356.7 |
DML | N/A | N/A | < 2.2e-16 | - |
Figures 4-7. Proportion CpGs located in various genomic features for different CpG group comparisons. The figure comparing MBD-Enriched, methylated, sparsely methylated, and unmethylated proportions may be overkill.
All comparisons were significant, indicating that the distribution of CpGs in various genomic features is likely nonrandom. For All CpGs vs. MBD-Enriched, there was a higher proportion of all CpGs located in other regions (likely intergenic), but there was a higher proportion of enriched CpGs in exons and introns. The MBD-Enriched and methylated CpG groups have similar distributions, even though the test said the distriutions were significantly different. My guess is that this was driven by the slight differences between proportion CpGs in exons, promoters, and other regions. There was a much higher proportion of DML in exons and promoters versus methylated CpGs. Higher proportions of methylated CpGs were in introns, transposable elements, and other regions. It’s interesting to see a higher proportion of DML in exons, since those are the regions that could get translated into proteins.
Going forward
- Figure out what’s going on with DMR
- Look at overlap proportions for DMR
- Generate figures for DMR analyses
- Conduct a gene enrichment for DML and DMR
- Work through gene-level analysis
- Update methods and results
- Update paper repository
- Write the discussion
- Write the introduction
- Revise my abstract
- Share draft paper at the next Eastern Oyster Project Meeting