MethCompare Part 11

More feature tracks

My larger goal this week is to run various CpG categories and DMC through my genome feature characterization pipeline. Before doing this, I wanted to obtain flanking regions and transposable element tracks and add them to my pipeline.

Creating flanking regions

The easiest thing on my to-do list was to generate flanking region tracks. I could easily do this with flankBed, then use subtractBed to remove any overlap with annotated genes. My one hang-up was creating a genome information file, with a column for chromosome and another column with the length of the chromosome. To make this file for C. virginica, I simply looked at the NCBI website and copied down the information for the 11 chromosomes that existed. It wasn’t going to be that simple for M. capitata or P. acuta.

Before posting an issue, I dug through the FASTA Markdown in the code repository. I found some truly FASTAstic code (lol) to obtain sequence lengths from a FASTA file using awk. In this Jupyter notebook, I tested the code with the M. capitata genome FASTA file:

#Obtain sequence lengths for each chromosome
!awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' \
Mcap.genome_assembly.fa \
> Mcap.genome_assembly-sequence-lengths.txt

I checked the end of both the sequence length file and the genome to make sure that the last chromosome in each file matched. I repeated this process with P. acuta, and saved the M. capita file here and P. acuta file here.

I used these files as input for my bedtools code. To create the flanking regions, I specified that I wanted to create 1000 kb flanks on either side of annotated genes with flankBed. I then piped the output into subtractBed and subtracted gene regions from my flanking regions. I did this in case there were any genes that overlapped with my flanking regions:

#Create 1kb flanking regions
#Subtract existing genes so flanks do not have any overlap
!{bedtoolsDirectory}flankBed \
-i Pact.GFFannotation.Genes.gff \
-g Pact.genome_assembly-sequence-lengths.txt \
-b 1000 \
| {bedtoolsDirectory}subtractBed \
-a - \
-b Pact.GFFannotation.Genes.gff \
> Pact.GFFannotation.flanks.gff

I saved the M. capitata and P. acuta flanking region tracks. I also added these tracks to the M. capitata and P. acuta IGV sessions with the other feature tracks.

Screen Shot 2020-05-05 at 2 46 20 PM

Figure 1. M. capitata genome features

Screen Shot 2020-05-05 at 2 48 15 PM

Figure 2. P. acuta genome features

Intersections with flanking regions

In this script I characterized the overlap between flanking regions and individual M. capitata and P. acuta samples. Before doing this, I went through my local repository and deleted all of the intermediate files generated by this script. Since I can download bedgraphs from gannet, produce the intermediate files using the code in the script, and save any relevant count information in a text file, I didn’t need to host them in the repository. After removing files and adding them to the gitignore, I used intersectBed to characterize overlaps between the samples and flanking regions. I didn’t split this up by upstream or downstream flanks since I figured just knowing about the flanks is good for now:

I also looked at the intersection between flanks and CG motifs. Flanking regions in M. capitata overlapped with 9.6% of CG motifs, and 17.8% of CG motifs in P. acuta.

Going forward

  1. Update methods
  2. Update results
  3. Redo CpG distributions with union bedgraphs
  4. Locate TE tracks
  5. Characterize intersections between data and TE, and create summary tables
  6. Create figures for CpG characterization in various genome features
  7. Figure out methylation island analysis
Written on May 5, 2020