Killifish Hypoxia RRBS Part 18
Digging into DMR
Now that I had a list of DMR, I wanted to understand the genomic location of DMR for future functional interpretation. Since I have so few DMR, it doesn’t make sense to do any follow-up gene enrichment analysis. Genomic locations and associated gene annotation information is all I need!
Finishing genome feature tracks
I returned to this Jupyter notebook to finish my genome feature tracks. I pulled out start and stop codon information from the GTF. For intergenic regions, I just took the complement of genes since I didn’t have any promoter information. I can always modify the tracks once I get additional information, but for now I was set.
Mystery of non-existent 16 DMR
When looking at my DMR, I noticed that the log file for 20_5_N
stated that 2 DMR were identified. This was weird, because I distinctly remember identifying 16 DMR with 10-500 reads per locus, q-value < 0.05, absolute methylation different of 0.1, and at least 10 CpGs per DMR. When I looked at the log file generated by my tests, it confirmed 16 DMR identified! I was confused as to why I only have 2 DMR instead of 16, so I repeated analyses by refiltering and running through BAT_summarize
and BAT_DMRcalling
.
When I repeated settings (MDP_max = 500, MDP_min = 10, c = 10
), I still only got 2 DMR for 20 vs. 5 within NBH! I tried BAT_filter_vcf
, BAT_summarize
, and BAT_DMRcalling
again with no MDP_max
to see if I identified 16 DMR with those settings.
My suspicion is that I didn’t clear files when repeating analyses with different settings, so the output from my run without MDP_max
never got overwritten. In any case, I’m now confident about the number of DMR identified with my chosen settings.
Table 1. Number of DMR identified with each setting
Contrast | Group 1 | Group 2 | c = 10, L4 included | MDP_max = 500, c = 10, L4 included | MDP_max = 500, c = 10, L4 excluded |
---|---|---|---|---|---|
All samples | N | S | 0 | 0 | 0 |
OC | N | S | 0 | 1 | 1 |
20 | N | S | 6 | 1 | 1 |
5 | N | S | 7 | 2 | 2 |
N | 20 | 5 | 16 | 2 | 2 |
N | 20 | OC | 2 | 0 | 0 |
S | 20 | 5 | 1 | 1 | 1 |
S | 20 | OC | 1 | 1 | 1 |
Overlaps between L4 and non-L4 DMR
Previously I identified DMR without L4 samples since there is a weird coverage pattern in the L4 samples. I wanted to know if there were any common DMR identified with and without L4 samples. To do this, I created this Jupyter notebook. I used intersectBed
to count overlaps between L4 and non-L4 DMR. Turns out all of my DMR overlapped, regardless if I included the L4 samples or not! That made me feel a lot better about including the L4 samples: they aren’t skewing DMR identification. I decided to use the DMR lists with L4 samples for downstream annotation.
Closest genome features to DMR
In the same Jupyter notebook, I used closestBed
to determine the closest genome feature to each DMR, including overlaps. Prior to using closestBed
, I sorted all DMR BEDfiles with sortBed
and removed empty files:
for f in ../06-DMR/*/*.bed
do
/opt/homebrew/bin/sortBed \
-i ${f} \
-faidx ../../genome-features/mummichog.chrom.length \
> $(basename ${f%_qval.0.05.bed})_DMR.sort.bed
done
!rm 20_OC_N_DMR.sort.bed all_pop_DMR.sort.bed
After sorting the files, I looked for closest gene, CDS, intron, 5’ UTR, 3’ UTR, start codon, stop codon, and intergenic region:
#For sorted BEDfiles
#Find the closest gene feature
#Report distances, using lower numbers for genome feature location as "upstream"
%%bash
for f in *DMR.sort.bed
do
/opt/homebrew/bin/closestBed \
-D "ref" \
-a ${f} \
-b ../../genome-features/Fundulus_heteroclitus-3.0.2.105-gene.gff \
-g ../../genome-features/mummichog.chrom.length \
> $(basename ${f%_DMR.sort.bed})_DMR.closestGene.bed
done
The output can be found here.
Table 2. DMR from each contrast and distance to closest genome feature
Contrast | Group 1 | Group 2 | Chr | Start | End | Meth Diff | Genes | 5’ UTR | CDS | 3’ UTR | Introns | Start Codon | Stop Codon | Intergenic |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OC | N | S | KN805540.1 | 286626 | 286713 | 0.724750 | 0 | 327344 | -4084 | 307386 | 0 | -121140 | -15569 | -15572 |
20 | N | S | KN806932.1 | 34456 | 34729 | -0.270357 | -24519 | -32826 | -24521 | -24519 | -24661 | -32823 | N/A | 0 |
5 | N | S | KN805688.1 | 500679 | 500775 | -0.565449 | -55168 | -55168 | -55211 | -58247 | -55391 | -55211 | -58244 | 0 |
5 | N | S | KN811384.1 | 437656 | 437724 | 0.517333 | 0 | 7047 | -2570 | -9991 | 0 | 7044 | -9988 | 10789 |
N | 20 | 5 | KN806487.1 | 130260 | 130476 | 0.241000 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0 |
N | 20 | 5 | KN811321.1 | 578486 | 578795 | -0.222821 | -22011 | 197537 | -22014 | -168583 | -22249 | -157801 | -22011 | 0 |
S | 20 | 5 | KN811550.1 | 428041 | 428146 | 0.332250 | 0 | 4662 | 3518 | 10922 | 0 | 5138 | 10919 | 11125 |
S | 20 | OC | KN811550.1 | 428041 | 428146 | 0.335750 | 0 | 4662 | 3518 | 10922 | 0 | 5138 | 10919 | 11125 |
Visualizing DMR
To visualize the DMR, I created a UCSC Genome Browser session. The session information is saved here. To add the DMR tracks, I used the bedGraphs with methylation differences for each DMR. I copied the data when adding a track, and included this heading based on example #4:
track name ="NAME" description="DESCRIPTION"
#chr chromStart chromEnd meth.diff
I viewed each DMR in the browser and took a screenshot since there were additional genome features in the browser that I didn’t have in IGV, like CpG islands.
Figures 1-9. DMR in UCSC Genome Browser
Going forward
- Revise methylation landscape information
- Update methods and results
- Match DMR with RNA-Seq information
- Start mapping with new genome
- Try DMR identification with
bismark
andmethylKit
- Create OSF repository for all intermediate files