MethCompare Part 14
Reworking flanking and intergenic regions
One of my tasks this week was to split the flanking region track into separate upstream and downstream flanking tracks. The main hypothesis associated with RRBS is that it captures methylation in promoter regions. Although my stacked barplots did not show any large amount of methylation occuring in flanking regions, we thought this was a necessary step to address the RRBS hypothesis.
Creating upstream and downstream flanks
In this Jupyter notebook I started to separate out entries in the combined flanking region track. I used the following code to parse out every other line as an upstream or downstream flank, depending on whether the strand was a forward or reverse strand:
#Select forward strands
#Separate upstream and downstream flanks
!grep "+" Mcap.GFFannotation.flanks.gff \
| awk '{ if (NR%2) print > "Mcap.GFFannotation.flanks.forwardUpstream.gff"; \
else print > "Mcap.GFFannotation.flanks.forwardDownstream.gff" }'
I then used cat
to combine up- or downstream flanks from forward and reverse strands. I used these revised tracks to characterize the locations of CpGs! Feeling good, I realized I needed to look at my tracks in IGV. I opened up the M. capitata and P. acuta IGV sessions realized that my code did not parse things correctly:
Figure 1. Error in M. capitata flank parsing.
Since I used the same code for both species, I assumed there was an error in the P. acuta tracks too. I looked at the flankBed
documentation and noticed I could specify a flank length for a left (-l
) or right (-r
) flank. I tried running the following code but got an error:
I posted this issue, and Shelly suggested I create new tracks using -l
and -r
specification. Instead of using these separately, I could set one to be 1000 and another to be 0. If I combined that wiht -s
, I could get upstream and downstream flanks correctly while incorporating strand direction:
#Create flanking regions
#Create upstream flanks (-l) based on strand (-s)
#Subtract existing genes so flanks do not have any overlap
!{bedtoolsDirectory}flankBed \
-i Pact.GFFannotation.Genes.gff \
-g Pact.genome_assembly-sequence-lengths.txt \
-l 1000 \
-r 0 \
-s \
| {bedtoolsDirectory}subtractBed \
-a - \
-b Pact.GFFannotation.Genes.gff \
> Pact.GFFannotation.flanks.Upstream.gff
#Create flanking regions
#Create downstream flanks (-r) based on strand (-s)
#Subtract existing genes so flanks do not have any overlap
!{bedtoolsDirectory}flankBed \
-i Pact.GFFannotation.Genes.gff \
-g Pact.genome_assembly-sequence-lengths.txt \
-l 0 \
-r 1000 \
-s \
| {bedtoolsDirectory}subtractBed \
-a - \
-b Pact.GFFannotation.Genes.gff \
> Pact.GFFannotation.flanks.Downstream.gff
I confirmed these tracks in IGV, then uploaded them to this gannet
folder. I used the URL version of the final tacks in the sessions I saved so they can be viewed from any machine. I then proceeded to characterize genomic locations of CpGs with my new flank tracks.
M. capitata:
P. acuta:
New intergenic tracks
After I characterized the location of CpGs with upstream and downstream flanks for the second time, I was ready to finish up for the day when I realized I needed to augment my intergenic region tracks. As defined with intersectBed -v
, the intergenic regions were any part of the genome that did not include genes. However, flanking regions now overlapped with my intergenic regions. It doesn’t make sense for something to be considered a flank and an intergenic region, so I decided to include flanking regions in my intersectBed -v
code. When I tried adding an additional filename to my bedtools
loops, it wouldn’t work! I then realized it would be better to just create intergenic region tracks, so that’s what I did.
Back in this Jupyter notebook, I created intergenic region tracks by finding the complement of genes and subtracting out flanks. I used the concatenated flanking region tracks instead of the separate tracks since they contained the same information.
#Find intergenic regions
#Subtract defined flanks from intergenic regions
!/usr/local/bin/complementBed \
-i Mcap.GFFannotation.gene.gff \
-g Mcap.genome_assembly-sequence-lengths.txt \
| /usr/local/bin/subtractBed \
-a - \
-b Mcap.GFFannotation.flanks.gff \
| awk '{print $1"\t"$2"\t"$3}' \
> Mcap.GFFannotation.intergenic.bed
I originally created GFF files, but when I viewed them in IGV they weren’t displaying. I remembered I needed to convert them to BEDfiles to be able to view in IGV (I encountered something similar with C. virginica tracks), which is why they are BEDfiles. The BEDfile for M. capitata is here and the P. acuta track is here.
Characterizing locations
Alright, now I’ll get into characterizing CpG locations. I used the new flank and intergenic region tracks to identify genomic locations for union data, DMC, and upset plot data. I created summary files that I will use for downstream analyses.
M. capitata:
- Union data upstream flanks
- Union data downstream flanks
- Union data intergenic regions
- Upset data upstream flanks
- Upset data downstream flanks
- Upset data intergenic regions
P. acuta:
- Union data upstream flanks
- Union data downstream flanks
- Union data intergenic regions
- Upset data upstream flanks
- Upset data downstream flanks
- Upset data intergenic regions
- DMC upstream flanks
- DMC downstream flanks
- DMC intergenic regions
Going forward
- Update methods
- Update results
- Locate TE tracks
- Characterize intersections between data and TE, and create summary tables
- Perform statistical tests for all stacked barplot data and correct for multiple comparisons
- Revise stacked barplots
- Replicate Liew et al. Figure 1B
- Look into program for mCpG identification