Killifish Hypoxia RRBS Part 18

Digging into DMR

Now that I had a list of DMR, I wanted to understand the genomic location of DMR for future functional interpretation. Since I have so few DMR, it doesn’t make sense to do any follow-up gene enrichment analysis. Genomic locations and associated gene annotation information is all I need!

Finishing genome feature tracks

I returned to this Jupyter notebook to finish my genome feature tracks. I pulled out start and stop codon information from the GTF. For intergenic regions, I just took the complement of genes since I didn’t have any promoter information. I can always modify the tracks once I get additional information, but for now I was set.

Mystery of non-existent 16 DMR

When looking at my DMR, I noticed that the log file for 20_5_N stated that 2 DMR were identified. This was weird, because I distinctly remember identifying 16 DMR with 10-500 reads per locus, q-value < 0.05, absolute methylation different of 0.1, and at least 10 CpGs per DMR. When I looked at the log file generated by my tests, it confirmed 16 DMR identified! I was confused as to why I only have 2 DMR instead of 16, so I repeated analyses by refiltering and running through BAT_summarize and BAT_DMRcalling.

When I repeated settings (MDP_max = 500, MDP_min = 10, c = 10), I still only got 2 DMR for 20 vs. 5 within NBH! I tried BAT_filter_vcf, BAT_summarize, and BAT_DMRcalling again with no MDP_max to see if I identified 16 DMR with those settings.

My suspicion is that I didn’t clear files when repeating analyses with different settings, so the output from my run without MDP_max never got overwritten. In any case, I’m now confident about the number of DMR identified with my chosen settings.

Table 1. Number of DMR identified with each setting

Contrast	Group 1	Group 2	c = 10, L4 included	MDP_max = 500, c = 10, L4 included	MDP_max = 500, c = 10, L4 excluded
All samples	N	S	0	0	0
OC	N	S	0	1	1
20	N	S	6	1	1
5	N	S	7	2	2
N	20	5	16	2	2
N	20	OC	2	0	0
S	20	5	1	1	1
S	20	OC	1	1	1

Overlaps between L4 and non-L4 DMR

Previously I identified DMR without L4 samples since there is a weird coverage pattern in the L4 samples. I wanted to know if there were any common DMR identified with and without L4 samples. To do this, I created this Jupyter notebook. I used intersectBed to count overlaps between L4 and non-L4 DMR. Turns out all of my DMR overlapped, regardless if I included the L4 samples or not! That made me feel a lot better about including the L4 samples: they aren’t skewing DMR identification. I decided to use the DMR lists with L4 samples for downstream annotation.

Closest genome features to DMR

In the same Jupyter notebook, I used closestBed to determine the closest genome feature to each DMR, including overlaps. Prior to using closestBed, I sorted all DMR BEDfiles with sortBed and removed empty files:

for f in ../06-DMR/*/*.bed
do
   /opt/homebrew/bin/sortBed \
   -i ${f} \
   -faidx ../../genome-features/mummichog.chrom.length \
   > $(basename ${f%_qval.0.05.bed})_DMR.sort.bed
done

!rm 20_OC_N_DMR.sort.bed all_pop_DMR.sort.bed

After sorting the files, I looked for closest gene, CDS, intron, 5’ UTR, 3’ UTR, start codon, stop codon, and intergenic region:

#For sorted BEDfiles
#Find the closest gene feature
#Report distances, using lower numbers for genome feature location as "upstream"

%%bash

for f in *DMR.sort.bed
do
   /opt/homebrew/bin/closestBed \
   -D "ref" \
   -a ${f} \
   -b ../../genome-features/Fundulus_heteroclitus-3.0.2.105-gene.gff \
   -g ../../genome-features/mummichog.chrom.length \
   > $(basename ${f%_DMR.sort.bed})_DMR.closestGene.bed
done

The output can be found here.

Table 2. DMR from each contrast and distance to closest genome feature

Contrast	Group 1	Group 2	Chr	Start	End	Meth Diff	Genes	5’ UTR	CDS	3’ UTR	Introns	Start Codon	Stop Codon	Intergenic
OC	N	S	KN805540.1	286626	286713	0.724750	0	327344	-4084	307386	0	-121140	-15569	-15572
20	N	S	KN806932.1	34456	34729	-0.270357	-24519	-32826	-24521	-24519	-24661	-32823	N/A	0
5	N	S	KN805688.1	500679	500775	-0.565449	-55168	-55168	-55211	-58247	-55391	-55211	-58244	0
5	N	S	KN811384.1	437656	437724	0.517333	0	7047	-2570	-9991	0	7044	-9988	10789
N	20	5	KN806487.1	130260	130476	0.241000	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0
N	20	5	KN811321.1	578486	578795	-0.222821	-22011	197537	-22014	-168583	-22249	-157801	-22011	0
S	20	5	KN811550.1	428041	428146	0.332250	0	4662	3518	10922	0	5138	10919	11125
S	20	OC	KN811550.1	428041	428146	0.335750	0	4662	3518	10922	0	5138	10919	11125

Visualizing DMR

To visualize the DMR, I created a UCSC Genome Browser session. The session information is saved here. To add the DMR tracks, I used the bedGraphs with methylation differences for each DMR. I copied the data when adding a track, and included this heading based on example #4:

track name ="NAME" description="DESCRIPTION"
#chr chromStart chromEnd meth.diff

I viewed each DMR in the browser and took a screenshot since there were additional genome features in the browser that I didn’t have in IGV, like CpG islands.

Screen Shot 2022-06-07 at 10 22 10 AM