MethCompare Part 14

Reworking flanking and intergenic regions

One of my tasks this week was to split the flanking region track into separate upstream and downstream flanking tracks. The main hypothesis associated with RRBS is that it captures methylation in promoter regions. Although my stacked barplots did not show any large amount of methylation occuring in flanking regions, we thought this was a necessary step to address the RRBS hypothesis.

Creating upstream and downstream flanks

In this Jupyter notebook I started to separate out entries in the combined flanking region track. I used the following code to parse out every other line as an upstream or downstream flank, depending on whether the strand was a forward or reverse strand:

#Select forward strands
#Separate upstream and downstream flanks
!grep "+" Mcap.GFFannotation.flanks.gff \
| awk '{ if (NR%2) print > "Mcap.GFFannotation.flanks.forwardUpstream.gff"; \
else print > "Mcap.GFFannotation.flanks.forwardDownstream.gff" }'

I then used cat to combine up- or downstream flanks from forward and reverse strands. I used these revised tracks to characterize the locations of CpGs! Feeling good, I realized I needed to look at my tracks in IGV. I opened up the M. capitata and P. acuta IGV sessions realized that my code did not parse things correctly:

Screen Shot 2020-05-11 at 1 35 54 PM

Figure 1. Error in M. capitata flank parsing.

Since I used the same code for both species, I assumed there was an error in the P. acuta tracks too. I looked at the flankBed documentation and noticed I could specify a flank length for a left (-l) or right (-r) flank. I tried running the following code but got an error:

Screen Shot 2020-05-11 at 3 04 59 PM

I posted this issue, and Shelly suggested I create new tracks using -l and -r specification. Instead of using these separately, I could set one to be 1000 and another to be 0. If I combined that wiht -s, I could get upstream and downstream flanks correctly while incorporating strand direction:

#Create flanking regions
#Create upstream flanks (-l) based on strand (-s)
#Subtract existing genes so flanks do not have any overlap
!{bedtoolsDirectory}flankBed \
-i Pact.GFFannotation.Genes.gff \
-g Pact.genome_assembly-sequence-lengths.txt \
-l 1000 \
-r 0 \
-s \
| {bedtoolsDirectory}subtractBed \
-a - \
-b Pact.GFFannotation.Genes.gff \
> Pact.GFFannotation.flanks.Upstream.gff

#Create flanking regions
#Create downstream flanks (-r) based on strand (-s)
#Subtract existing genes so flanks do not have any overlap
!{bedtoolsDirectory}flankBed \
-i Pact.GFFannotation.Genes.gff \
-g Pact.genome_assembly-sequence-lengths.txt \
-l 0 \
-r 1000 \
-s \
| {bedtoolsDirectory}subtractBed \
-a - \
-b Pact.GFFannotation.Genes.gff \
> Pact.GFFannotation.flanks.Downstream.gff

I confirmed these tracks in IGV, then uploaded them to this gannet folder. I used the URL version of the final tacks in the sessions I saved so they can be viewed from any machine. I then proceeded to characterize genomic locations of CpGs with my new flank tracks.

M. capitata:

P. acuta:

New intergenic tracks

After I characterized the location of CpGs with upstream and downstream flanks for the second time, I was ready to finish up for the day when I realized I needed to augment my intergenic region tracks. As defined with intersectBed -v, the intergenic regions were any part of the genome that did not include genes. However, flanking regions now overlapped with my intergenic regions. It doesn’t make sense for something to be considered a flank and an intergenic region, so I decided to include flanking regions in my intersectBed -v code. When I tried adding an additional filename to my bedtools loops, it wouldn’t work! I then realized it would be better to just create intergenic region tracks, so that’s what I did.

Back in this Jupyter notebook, I created intergenic region tracks by finding the complement of genes and subtracting out flanks. I used the concatenated flanking region tracks instead of the separate tracks since they contained the same information.

#Find intergenic regions
#Subtract defined flanks from intergenic regions
!/usr/local/bin/complementBed \
-i Mcap.GFFannotation.gene.gff \
-g Mcap.genome_assembly-sequence-lengths.txt \
| /usr/local/bin/subtractBed \
-a - \
-b Mcap.GFFannotation.flanks.gff \
| awk '{print $1"\t"$2"\t"$3}' \
> Mcap.GFFannotation.intergenic.bed

I originally created GFF files, but when I viewed them in IGV they weren’t displaying. I remembered I needed to convert them to BEDfiles to be able to view in IGV (I encountered something similar with C. virginica tracks), which is why they are BEDfiles. The BEDfile for M. capitata is here and the P. acuta track is here.

Characterizing locations

Alright, now I’ll get into characterizing CpG locations. I used the new flank and intergenic region tracks to identify genomic locations for union data, DMC, and upset plot data. I created summary files that I will use for downstream analyses.

M. capitata:

P. acuta:

Going forward

Update methods
Update results
Locate TE tracks
Characterize intersections between data and TE, and create summary tables
Perform statistical tests for all stacked barplot data and correct for multiple comparisons
Revise stacked barplots
Replicate Liew et al. Figure 1B
Look into program for mCpG identification

Written on May 11, 2020