Killifish Hypoxia RRBS Part 25
Characterizing the general methylation landscape with the new genome
When looking through the output from the new genome to characterize the methylation landscape in this Jupyter notebook, I noticed very few common loci used in comparisons. This was a red flag for me: if the algorithim doesn’t have >100,000 loci to run comparisons, then I wouldn’t trust its conclusions! So here we are, troubleshooting.
BAT_summarize
and BAT_overview
There are two things I can mess around with: missingness and coverage. Right now, my coverage is 10 reads per locus. I could drop down to five, but I’d like to keep 10 if possible. So that bring us to missingness. The default for BAT_summarize
is to only consider loci with data for all samples. I relaxed this to allow for one sample missing data per group for a comparison (--mis1 1
and --mis2 1
) in this script. I can’t go above this since some of my population x oxygen treatment conditions only have three samples. I then reran my BAT_analysis
script. I ended up with some weird extra files with incorrect naming conventions, like 5_OC_N/N_metilene_NO_OC.txt
and 5_OC_S/S_metilene_NO_OC.txt
. I deleted all my output and reran my script and didn’t end up with those files! Who knows what happened. With my remaining files, I looked at the number of common loci used for each comparison.
Table 1. Number of common loci for various comparisons
Population | Treatment | Old Genome | New Genome | New Genome, Missing 1 |
---|---|---|---|---|
NB vs. SC | All | 14896 | 2 | 148753 |
NB vs. SC | Outside Control | 116535 | 6 | 782514 |
NB vs. SC | Normoxia | 74029 | 5 | 852762 |
NB vs. SC | Hypoxia | 197124 | 6 | 1008741 |
New Bedford | OC vs. NO | 152299 | 8 | 913977 |
New Bedford | OC vs. HY | N/A | 550768 | 1117814 |
New Bedford | NO vs. HY | 176287 | 6 | 1071858 |
Scorton Creek | OC vs. NO | 59358 | 3 | 758536 |
Scorton Creek | OC vs. HY | N/A | 108815 | 729115 |
Scorton Creek | NO vs. HY | 82306 | 5 | 871574 |
WAY more loci when allowing for missingness! As expected, the comparison with all samples has the smallest number of common loci. I’ll move forward with this output.
Methylation landscape
I then returned to this Jupyter notebook to recharacterize the methylation landscape with revised output. I also modified my approach. I created the union bedGraph to get an accurate count of unique loci used in the analysis. However, to actually calculate percent methylation, I decided to only look at the BAT_summarize
output for the all sample comparison. This is a file with all common loci across all samples. By using this file, I can conservatively gauge the landscape, since the union bedgraph has loci with missing data for several samples (sometimes all but one!).
I imported the output file into pandas
, then calculated average methylation for each population and treatment within each population at each locus:
#Import data into pandas
#Check head
df = pd.read_table("../../summarize/missing_1/all_pop/all_pop_metilene_N_S.txt")
df.head(5)
#Average all NBH samples for total genome methylation information and save as a new column
#Average all OC samples and save as a new column
#Average all NO samples and save as a new column
#Average all HY samples and save as a new column
#NA are not included in averages
#Check output
df['NBH.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4', 'N_20-N4', 'N_20-N2', 'N_20-N1', 'N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
df['NBH.OC.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4']].mean(axis=1)
df['NBH.NO.average'] = df[['N_20-N4', 'N_20-N2', 'N_20-N1']].mean(axis=1)
df['NBH.HY.average'] = df[['N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
#Average all SC samples for total genome methylation information and save as a new column
#Average all OC samples and save as a new column
#Average all NO samples and save as a new column
#Average all HY samples and save as a new column
#NA are not included in averages
#Check output
df['SC.average'] = df['average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
df['SC.OC.average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df['SC.NO.average'] = df[['S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2']].mean(axis=1)
df['SC.HY.average'] = df[['S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
I wrote out this file here. Then, I used a series of awk
commands to calculate percent methylation for each population and population x treatment:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($26 > 0) { print $26 }}' ${f} \
| wc -l
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $26; count++ } END { print total/count }'
Table 2. Average methylation for various treatments
Population | Treatment | Number of Methylated CpGs (%) | Percent Methylation |
---|---|---|---|
New Bedford | All | 121017 (81.4%) | 28.3 |
New Bedford | Outside Control | 101532 (68.3%) | 27.9 |
New Bedford | Normoxia | 92498 (62.2%) | 27.2 |
New Bedford | Hypoxia | 96541 (64.9%) | 29.7 |
Scorton Creek | All | 124001 (83.3%) | 28.8 |
Scorton Creek | Outside Control | 124001 (83.3%) | 28.8 |
Scorton Creek | Normoxia | 99786 (67.1%) | 28.8 |
Scorton Creek | Hypoxia | 101036 (68.0%) | 29.2 |
Overall, I wouldn’t say there’s a clear difference in average methylation by population and treatment. This doesn’t surprise me since methylation in this species seems so minimal to begin with when compared with humans or mammals. I’ll run this by Neel before I finalize any figures.
BAT_DMRcalling
Since it takes literally two minutes, I decided to call DMR with the revised output! Now that there were more loci, I wondered if there would algorithmically be more DMR. Apart from a small issue where I couldn’t call DMR comparing New Bedford and Scorton Creek outside control samples because the files were not in the right place, this step was pretty chill. The output can be found here.
Just as a reminder to myself: DMR have at least 10 CpGs, a minimum absolute difference ≥ 0.1, and an adjusted p-value < 0.05. Previously I also messed around with my BAT_filter_vcf
settings such that my maximum read count was 500 instead of the default 100. This was because I initially got 0 DMR for all contrasts with the old genome so we wanted to have more lenient filtering restrictions.
Table 3. DMR for various comparisons
Population | Treatment | Old Genome | New Genome | New Genome, Missing 1 |
---|---|---|---|---|
NB vs. SC | All | 0 | 0 | 11 |
NB vs. SC | Outside Control | 1 | 0 | 1 |
NB vs. SC | Normoxia | 1 | 0 | 0 |
NB vs. SC | Hypoxia | 1 | 0 | 13 |
New Bedford | OC vs. NO | 2 | 0 | 10 |
New Bedford | OC vs. HY | 16 | 59 | 59 |
New Bedford | NO vs. HY | N/A | 0 | 23 |
Scorton Creek | OC vs. NO | 2 | 0 | 0 |
Scorton Creek | OC vs. HY | 1 | 0 | 0 |
Scorton Creek | NO vs. HY | 1 | 0 | 0 |
It’s interesting to see that none of the oxygen treatment comparisons within Scorton Creek have any DMR! This could suggest that the more toxicant-sensitive Scorton Creek doesn’t mount an adaptive response to hypoxia, whereas the more toxicant-resistant New Bedford Harbor has a fine-tuned molecular regulation system that includes methylation responses. It will be interesting to see where the DMR are located and if they overlap with any interesting pathways or DEG.
Another thing I see is that there are 11 DMR that are different between all New Bedford and all Scorton Creek samples. These could shed light on additional population-level differences. There are also DMR associated with hypoxia exposure between the two populations. I wonder if any of these DMR are in interesting regions, or if they overlap with the within-population oxygen treatment DMR. Lots of good stuff to investigate!
Going forward
- Update methods and results
- Create methylation landscape figure
- Determine the genomic locations of DMR
- Create DMR figure
- Update OSF repository for all intermediate files
- Map STAR transcript names to Ensembl IDs from genome
- Conduct pathway analysis for RNA-Seq data by population
- Identify known SNP/DMR overlaps