Killifish Hypoxia RRBS Part 11
Figuring out my BAT_overview
issues
Since I had BAT_summarize
output for all contrasts, my next step is to identify DMR. I’m using the default settings for BAT_DMRcalling
:
- Report DMR with an adjusted p-value (q-value) less than 0.05
- At least 10 Cs per DMRs
- Minimum difference of 0.1 in mean methylation rate
- No minium nucleotide length
Source of issues
I ran the following code interactively to identify DMR between the two populations irrespective of oxygen condition:
BAT_DMRcalling \
-q /yaaminiv/killifish-hypoxia-RRBS/output/05-analysis/summarize/all_pop/all_pop_metilene_N_S.txt \
-o /scratch/06-DMR/all_pop \
-a N \
-b S
When I looked at the output, I had 0 DMR! This seemed ridiculous, so I did some digging into the input file. Similar to the output bedGraph from BAT_summarize
, there were three extra columns in data rows that weren’t present in the header row. I decided to create an output file that didn’t have these extra columns, similar to what I tried for the bedGraphs. For some reason my awk
code wasn’t running and I tried pulling out subsets of columns. Thank goodness, because I realized what the real issue was:
(base) [yaamini.venkataraman@pn035 all_pop]$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' all_pop_metilene_N_S.txt | head
chr pos N_20-N4 N_5-N1 N_5-N2
KN805535.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805535.1:1:799646:1 REF 414192
KN805610.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805610.1:1:629617:1 REF 285262
KN805610.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805610.1:1:629617:1 REF 285275
KN805610.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805610.1:1:629617:1 REF 285278
KN805697.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805697.1:1:580679:1 REF 375041
KN805750.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805750.1:1:549272:1 REF 122053
KN805750.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805750.1:1:549272:1 REF 122065
KN805750.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805750.1:1:549272:1 REF 122071
KN805750.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:KN805750.1:1:549272:1 REF 122078
My chromosome information was being broken up into four columns, which is why I had three additional columns in my BAT_summarize
output files! I tracked this down to the output of my filtered files:
(base) [yaamini.venkataraman@pn035 filtered]$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' 190626_I114_FCH7TVNBBXY_L2_5-S3_CG.sort.bedgraph | head
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 1872
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 2221
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 3092
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 6204
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 14978
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 14999
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15005
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15034
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15039
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15044
(base) [yaamini.venkataraman@pn035 filtered]$ head 190626_I114_FCH7TVNBBXY_L2_5-S3_CG.sort.bedgraph
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 1872 1873 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 2221 2222 0.15
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 3092 3093 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 6204 6205 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 14978 14979 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 14999 15000 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15005 15006 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15034 15035 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15039 15040 1.00
JXMV01047516.1 dna:primary_assembly primary_assembly:Fundulus_heteroclitus-3.0.2:JXMV01047516.1:1:31188:1 REF 15044 15045 1.00
Looks like I needed to remove the second, third, and fourth columns. It was at this point I missed working in a Jupyter notebook because it’s much easier to see that kind of stuff in chunks instead of in the Terminal output. BUT I DIGRESS. I added a line of code to my bedGraph sorting command and tested it on one file:
(base) [yaamini.venkataraman@pn035 filtered]$ /vortexfs1/home/yaamini.venkataraman/bedtools2/bin/sortBed \
> -i 190626_I114_FCH7TVNBBXY_L2_5-S3_CG.bedgraph \
> | awk '{print $1"\t"$5"\t"$6"\t"$7}' \
> | head
JXMV01047516.1 1872 1873 1.00
JXMV01047516.1 2221 2222 0.15
JXMV01047516.1 3092 3093 1.00
JXMV01047516.1 6204 6205 1.00
JXMV01047516.1 14978 14979 1.00
JXMV01047516.1 14999 15000 1.00
JXMV01047516.1 15005 15006 1.00
JXMV01047516.1 15034 15035 1.00
JXMV01047516.1 15039 15040 1.00
JXMV01047516.1 15044 15045 1.00
The output looked in-line with what the BAT_filtering
output looked like in the example, so that’s a good sign! I remade my sorted bedGraphs with this change.
#Sort bedGraphs
#Keep columns 1, 5, 6, 7 (remove extra chromosome columns)
for f in *bedgraph
do
/vortexfs1/home/yaamini.venkataraman/bedtools2/bin/sortBed \
-i ${f} \
| awk '{print $1"\t"$5"\t"$6"\t"$7}' \
> $(basename ${f%.bedgraph}).sort.bedgraph
done
(base) [yaamini.venkataraman@poseidon-l2 filtered]$ head 190626_I114_FCH7TVNBBXY_L3_OC-N4_CG.sort.bedgraph
JXMV01047516.1 356 357 1.00
JXMV01047516.1 371 372 1.00
JXMV01047516.1 382 383 1.00
JXMV01047516.1 6012 6013 1.00
JXMV01047516.1 15106 15107 1.00
JXMV01047516.1 15109 15110 1.00
JXMV01047516.1 15241 15242 1.00
JXMV01047516.1 19994 19995 1.00
JXMV01047516.1 19995 19996 1.00
JXMV01047516.1 20022 20023 1.00
The revised sorted bedGraphs can be found in this directory.
Running BAT_summarize
(again)
The next step was to run BAT_summarize
to see if summary bedGraph and text files would play well with BAT_overview.R
and BAT_DMRcalling
. I simply ran my script again and checked the number of columns in the output files:
(base) [yaamini.venkataraman@poseidon-l2 5_pop_diff]$ awk '{print NF}' *summary*bedgraph | head
10
10
10
10
10
10
10
10
10
10
There is no difference in the number of columns between the header and data rows! The output files can be found in this directory.
Trying BAT_overview
(again)
Moment of truth! I wanted to use one of the output bedGraphs to try BAT_overview
. If I end up with the same errors, then there is an another issue I am not aware of yet.
Singularity> BAT_overview.R \
> -i /scratch/05-analysis/summarize/all_pop/all_pop_summary_N_S.bedgraph \
> -o /scratch/05-analysis/overview/all_pop \
> --groups N,S
[INFO] Mon May 09, 11:37:42 AM, 2022 BAT_overview v0.1 started
[INFO] Mon May 09, 11:37:42 AM, 2022 Checking flags
[INFO] Mon May 09, 11:37:42 AM, 2022 Reading input /scratch/05-analysis/summarize/all_pop/all_pop_summary_N_S.bedgraph
[INFO] Mon May 09, 11:37:42 AM, 2022 Plot statistics to /scratch/05-analysis/overview/all_pop.pdf
IT WORKS. I drafted a new script for BAT_overview
and updated the environmental variable information. I then ran my SLURM script. Half of the contrasts produced output, while the other half didn’t:
Oxygen within NB: Normoxia vs. Hypoxia
[INFO] Mon May 09, 12:04:56 PM, 2022 BAT_overview v0.1 started
[INFO] Mon May 09, 12:04:56 PM, 2022 Checking flags
[INFO] Mon May 09, 12:04:56 PM, 2022 Reading input /scratch/05-analysis/summarize/20_5_N/N_summary_20_5.bedgraph
Error in main() :
Required group identfier for group 1 not found in input.
Execution halted
Oxygen within NB: Normoxia vs. OC
[INFO] Mon May 09, 12:04:56 PM, 2022 BAT_overview v0.1 started
[INFO] Mon May 09, 12:04:56 PM, 2022 Checking flags
[INFO] Mon May 09, 12:04:56 PM, 2022 Reading input /scratch/05-analysis/summarize/20_OC_N/N_summary_20_OC.bedgraph
Error in main() :
Required group identfier for group 1 not found in input.
Execution halted
Oxygen within SC: Normoxia vs. Hypoxia
[INFO] Mon May 09, 12:04:56 PM, 2022 BAT_overview v0.1 started
[INFO] Mon May 09, 12:04:56 PM, 2022 Checking flags
[INFO] Mon May 09, 12:04:56 PM, 2022 Reading input /scratch/05-analysis/summarize/20_5_S/S_summary_20_5.bedgraph
Error in main() :
Required group identfier for group 1 not found in input.
Execution halted
Oxygen within SC: Normoxia vs. OC
[INFO] Mon May 09, 12:04:57 PM, 2022 BAT_overview v0.1 started
[INFO] Mon May 09, 12:04:57 PM, 2022 Checking flags
[INFO] Mon May 09, 12:04:57 PM, 2022 Reading input /scratch/05-analysis/summarize/20_OC_S/S_summary_20_OC.bedgraph
Error in main() :
Required group identfier for group 1 not found in input.
Execution halted
It seems like “20” may not be an appropriate group identifier? I went back to my BAT_summarize
code and used “NO” and “HY” to specify normoxia and hypoxia, respectively. I made the same change in my BAT_overview
code. I reran BAT_summarize
and BAT_overview
for these contrasts and it worked! The BAT_overview
output can be found in this directory.
BAT_overview
results
I looked through the pdfs produced by BAT_overview
and there were some things I noticed:
- The hierarchical clustering didn’t perfectly distinguish between the groups I was comparing (population or oxygen condition). This isn’t a big problem when comparing normoxia vs. outside control, but strange when comparing the two populations or normoxia vs. hypoxia (Note: clustering was done using a euclidean distance matrix and Ward method).
- In OC and normoxia, NBH had more variable methylation rates when compared to SC, but less variable methylation in hypoxia.
- Both populations had higher methylation in response to hypoxia.
Questions I now have:
- Should I use the dendograms to identify outlier samples?
- What distribution of methylation should I expect?
Going forward
- Identify DMR
- Write methods and results
- Obtain other important methylation landscape information
- Start RNA-Seq analysis
- Create OSF repository for all intermediate files