Killifish Hypoxia RRBS Part 18

Digging into DMR

Now that I had a list of DMR, I wanted to understand the genomic location of DMR for future functional interpretation. Since I have so few DMR, it doesn’t make sense to do any follow-up gene enrichment analysis. Genomic locations and associated gene annotation information is all I need!

Finishing genome feature tracks

I returned to this Jupyter notebook to finish my genome feature tracks. I pulled out start and stop codon information from the GTF. For intergenic regions, I just took the complement of genes since I didn’t have any promoter information. I can always modify the tracks once I get additional information, but for now I was set.

Mystery of non-existent 16 DMR

When looking at my DMR, I noticed that the log file for 20_5_N stated that 2 DMR were identified. This was weird, because I distinctly remember identifying 16 DMR with 10-500 reads per locus, q-value < 0.05, absolute methylation different of 0.1, and at least 10 CpGs per DMR. When I looked at the log file generated by my tests, it confirmed 16 DMR identified! I was confused as to why I only have 2 DMR instead of 16, so I repeated analyses by refiltering and running through BAT_summarize and BAT_DMRcalling.

When I repeated settings (MDP_max = 500, MDP_min = 10, c = 10), I still only got 2 DMR for 20 vs. 5 within NBH! I tried BAT_filter_vcf, BAT_summarize, and BAT_DMRcalling again with no MDP_max to see if I identified 16 DMR with those settings.

My suspicion is that I didn’t clear files when repeating analyses with different settings, so the output from my run without MDP_max never got overwritten. In any case, I’m now confident about the number of DMR identified with my chosen settings.

Table 1. Number of DMR identified with each setting

Contrast Group 1 Group 2 c = 10, L4 included MDP_max = 500, c = 10, L4 included MDP_max = 500, c = 10, L4 excluded
All samples N S 0 0 0
OC N S 0 1 1
20 N S 6 1 1
5 N S 7 2 2
N 20 5 16 2 2
N 20 OC 2 0 0
S 20 5 1 1 1
S 20 OC 1 1 1

Overlaps between L4 and non-L4 DMR

Previously I identified DMR without L4 samples since there is a weird coverage pattern in the L4 samples. I wanted to know if there were any common DMR identified with and without L4 samples. To do this, I created this Jupyter notebook. I used intersectBed to count overlaps between L4 and non-L4 DMR. Turns out all of my DMR overlapped, regardless if I included the L4 samples or not! That made me feel a lot better about including the L4 samples: they aren’t skewing DMR identification. I decided to use the DMR lists with L4 samples for downstream annotation.

Closest genome features to DMR

In the same Jupyter notebook, I used closestBed to determine the closest genome feature to each DMR, including overlaps. Prior to using closestBed, I sorted all DMR BEDfiles with sortBed and removed empty files:

for f in ../06-DMR/*/*.bed
do
   /opt/homebrew/bin/sortBed \
   -i ${f} \
   -faidx ../../genome-features/mummichog.chrom.length \
   > $(basename ${f%_qval.0.05.bed})_DMR.sort.bed
done

!rm 20_OC_N_DMR.sort.bed all_pop_DMR.sort.bed

After sorting the files, I looked for closest gene, CDS, intron, 5’ UTR, 3’ UTR, start codon, stop codon, and intergenic region:

#For sorted BEDfiles
#Find the closest gene feature
#Report distances, using lower numbers for genome feature location as "upstream"

%%bash

for f in *DMR.sort.bed
do
   /opt/homebrew/bin/closestBed \
   -D "ref" \
   -a ${f} \
   -b ../../genome-features/Fundulus_heteroclitus-3.0.2.105-gene.gff \
   -g ../../genome-features/mummichog.chrom.length \
   > $(basename ${f%_DMR.sort.bed})_DMR.closestGene.bed
done

The output can be found here.

Table 2. DMR from each contrast and distance to closest genome feature

Contrast Group 1 Group 2 Chr Start End Meth Diff Genes 5’ UTR CDS 3’ UTR Introns Start Codon Stop Codon Intergenic
OC N S KN805540.1 286626 286713 0.724750 0 327344 -4084 307386 0 -121140 -15569 -15572
20 N S KN806932.1 34456 34729 -0.270357 -24519 -32826 -24521 -24519 -24661 -32823 N/A 0
5 N S KN805688.1 500679 500775 -0.565449 -55168 -55168 -55211 -58247 -55391 -55211 -58244 0
5 N S KN811384.1 437656 437724 0.517333 0 7047 -2570 -9991 0 7044 -9988 10789
N 20 5 KN806487.1 130260 130476 0.241000 N/A N/A N/A N/A N/A N/A N/A 0
N 20 5 KN811321.1 578486 578795 -0.222821 -22011 197537 -22014 -168583 -22249 -157801 -22011 0
S 20 5 KN811550.1 428041 428146 0.332250 0 4662 3518 10922 0 5138 10919 11125
S 20 OC KN811550.1 428041 428146 0.335750 0 4662 3518 10922 0 5138 10919 11125

Visualizing DMR

To visualize the DMR, I created a UCSC Genome Browser session. The session information is saved here. To add the DMR tracks, I used the bedGraphs with methylation differences for each DMR. I copied the data when adding a track, and included this heading based on example #4:

track name ="NAME" description="DESCRIPTION"
#chr chromStart chromEnd meth.diff

I viewed each DMR in the browser and took a screenshot since there were additional genome features in the browser that I didn’t have in IGV, like CpG islands.

Screen Shot 2022-06-07 at 10 22 10 AM

Screen Shot 2022-06-07 at 10 23 04 AM

Screen Shot 2022-06-07 at 10 38 28 AM

Screen Shot 2022-06-07 at 10 40 46 AM

Screen Shot 2022-06-07 at 10 43 27 AM

Screen Shot 2022-06-07 at 10 44 55 AM

Screen Shot 2022-06-07 at 10 45 15 AM

Screen Shot 2022-06-07 at 10 46 50 AM

Screen Shot 2022-06-07 at 10 49 32 AM

Figures 1-9. DMR in UCSC Genome Browser

Going forward

  1. Revise methylation landscape information
  2. Update methods and results
  3. Match DMR with RNA-Seq information
  4. Start mapping with new genome
  5. Try DMR identification with bismark and methylKit
  6. Create OSF repository for all intermediate files
Written on June 1, 2022