Killifish Hypoxia RRBS Part 25

Characterizing the general methylation landscape with the new genome

When looking through the output from the new genome to characterize the methylation landscape in this Jupyter notebook, I noticed very few common loci used in comparisons. This was a red flag for me: if the algorithim doesn’t have >100,000 loci to run comparisons, then I wouldn’t trust its conclusions! So here we are, troubleshooting.

`BAT_summarize` and `BAT_overview`

There are two things I can mess around with: missingness and coverage. Right now, my coverage is 10 reads per locus. I could drop down to five, but I’d like to keep 10 if possible. So that bring us to missingness. The default for BAT_summarize is to only consider loci with data for all samples. I relaxed this to allow for one sample missing data per group for a comparison (--mis1 1 and --mis2 1) in this script. I can’t go above this since some of my population x oxygen treatment conditions only have three samples. I then reran my BAT_analysis script. I ended up with some weird extra files with incorrect naming conventions, like 5_OC_N/N_metilene_NO_OC.txt and 5_OC_S/S_metilene_NO_OC.txt. I deleted all my output and reran my script and didn’t end up with those files! Who knows what happened. With my remaining files, I looked at the number of common loci used for each comparison.

Table 1. Number of common loci for various comparisons

Population	Treatment	Old Genome	New Genome	New Genome, Missing 1
NB vs. SC	All	14896	2	148753
NB vs. SC	Outside Control	116535	6	782514
NB vs. SC	Normoxia	74029	5	852762
NB vs. SC	Hypoxia	197124	6	1008741
New Bedford	OC vs. NO	152299	8	913977
New Bedford	OC vs. HY	N/A	550768	1117814
New Bedford	NO vs. HY	176287	6	1071858
Scorton Creek	OC vs. NO	59358	3	758536
Scorton Creek	OC vs. HY	N/A	108815	729115
Scorton Creek	NO vs. HY	82306	5	871574

WAY more loci when allowing for missingness! As expected, the comparison with all samples has the smallest number of common loci. I’ll move forward with this output.

Methylation landscape

I then returned to this Jupyter notebook to recharacterize the methylation landscape with revised output. I also modified my approach. I created the union bedGraph to get an accurate count of unique loci used in the analysis. However, to actually calculate percent methylation, I decided to only look at the BAT_summarize output for the all sample comparison. This is a file with all common loci across all samples. By using this file, I can conservatively gauge the landscape, since the union bedgraph has loci with missing data for several samples (sometimes all but one!).

I imported the output file into pandas, then calculated average methylation for each population and treatment within each population at each locus:

#Import data into pandas
#Check head
df = pd.read_table("../../summarize/missing_1/all_pop/all_pop_metilene_N_S.txt")
df.head(5)

#Average all NBH samples for total genome methylation information and save as a new column
#Average all OC samples and save as a new column
#Average all NO samples and save as a new column
#Average all HY samples and save as a new column
#NA are not included in averages
#Check output
df['NBH.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4', 'N_20-N4', 'N_20-N2', 'N_20-N1', 'N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
df['NBH.OC.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4']].mean(axis=1)
df['NBH.NO.average'] = df[['N_20-N4', 'N_20-N2', 'N_20-N1']].mean(axis=1)
df['NBH.HY.average'] = df[['N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)

#Average all SC samples for total genome methylation information and save as a new column
#Average all OC samples and save as a new column
#Average all NO samples and save as a new column
#Average all HY samples and save as a new column
#NA are not included in averages
#Check output
df['SC.average'] = df['average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
df['SC.OC.average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df['SC.NO.average'] = df[['S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2']].mean(axis=1)
df['SC.HY.average'] = df[['S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)

I wrote out this file here. Then, I used a series of awk commands to calculate percent methylation for each population and population x treatment:

#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($26 > 0) { print $26 }}' ${f} \
| wc -l

#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $26; count++ } END { print total/count }'

Table 2. Average methylation for various treatments

Population	Treatment	Number of Methylated CpGs (%)	Percent Methylation
New Bedford	All	121017 (81.4%)	28.3
New Bedford	Outside Control	101532 (68.3%)	27.9
New Bedford	Normoxia	92498 (62.2%)	27.2
New Bedford	Hypoxia	96541 (64.9%)	29.7
Scorton Creek	All	124001 (83.3%)	28.8
Scorton Creek	Outside Control	124001 (83.3%)	28.8
Scorton Creek	Normoxia	99786 (67.1%)	28.8
Scorton Creek	Hypoxia	101036 (68.0%)	29.2

Overall, I wouldn’t say there’s a clear difference in average methylation by population and treatment. This doesn’t surprise me since methylation in this species seems so minimal to begin with when compared with humans or mammals. I’ll run this by Neel before I finalize any figures.

`BAT_DMRcalling`

Since it takes literally two minutes, I decided to call DMR with the revised output! Now that there were more loci, I wondered if there would algorithmically be more DMR. Apart from a small issue where I couldn’t call DMR comparing New Bedford and Scorton Creek outside control samples because the files were not in the right place, this step was pretty chill. The output can be found here.

Just as a reminder to myself: DMR have at least 10 CpGs, a minimum absolute difference ≥ 0.1, and an adjusted p-value < 0.05. Previously I also messed around with my BAT_filter_vcf settings such that my maximum read count was 500 instead of the default 100. This was because I initially got 0 DMR for all contrasts with the old genome so we wanted to have more lenient filtering restrictions.

Table 3. DMR for various comparisons

Population	Treatment	Old Genome	New Genome	New Genome, Missing 1
NB vs. SC	All	0	0	11
NB vs. SC	Outside Control	1	0	1
NB vs. SC	Normoxia	1	0	0
NB vs. SC	Hypoxia	1	0	13
New Bedford	OC vs. NO	2	0	10
New Bedford	OC vs. HY	16	59	59
New Bedford	NO vs. HY	N/A	0	23
Scorton Creek	OC vs. NO	2	0	0
Scorton Creek	OC vs. HY	1	0	0
Scorton Creek	NO vs. HY	1	0	0

It’s interesting to see that none of the oxygen treatment comparisons within Scorton Creek have any DMR! This could suggest that the more toxicant-sensitive Scorton Creek doesn’t mount an adaptive response to hypoxia, whereas the more toxicant-resistant New Bedford Harbor has a fine-tuned molecular regulation system that includes methylation responses. It will be interesting to see where the DMR are located and if they overlap with any interesting pathways or DEG.

Another thing I see is that there are 11 DMR that are different between all New Bedford and all Scorton Creek samples. These could shed light on additional population-level differences. There are also DMR associated with hypoxia exposure between the two populations. I wonder if any of these DMR are in interesting regions, or if they overlap with the within-population oxygen treatment DMR. Lots of good stuff to investigate!

Going forward

Update methods and results
Create methylation landscape figure
Determine the genomic locations of DMR
Create DMR figure
Update OSF repository for all intermediate files
Map STAR transcript names to Ensembl IDs from genome
Conduct pathway analysis for RNA-Seq data by population
Identify known SNP/DMR overlaps

Written on May 22, 2024