WGBS Analysis Part 25
Generating genome feature tracks
Before I could determine the location of DML, I needed to make genome feature tracks for the Roslin genome! I figured I could pull from my experience (and code) making these tracks for the C. virginica and the coral methylation papers.
RefSeq information
I started by creating this Jupyter notebook and downloading the NCBI RefSeq annotation. I briefly remember Steven mentioning the RefSeq annotation had more genome feature information than the Genbank version, so I figured I’d start there. After unzipping my downloaded file, I wanted to get an idea for which features were present in the file. I used cut
to extract the column with genome feature information (column 3), then got the unique entries with sort | uniq -c
:
!cut -f3 ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.gff | sort | uniq -c
There were a bunch of genome features in this annotation file! I decided to take the genes, CDS, exons, lncRNA, and mRNA tracks. I could then use that information for exon UTR, introns, intergenic regions, and flanking regions. The output also gave me the number of all of those features, so I could use that information to ensure my track extraction was correct! I started extracting the genes with the following grep
code:
#Isolate gene entries from multiple annotation databses. Tab must be included between database and feature
!grep "Gnomon gene" \
ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.gff \
> cgigas_uk_roslin_v1_gene.gff
When I used wc -l
to check the number of genes, but I was missing ~1000 genes! I then decided to see which databases were used to create the annotation file:
#Database identifiers for extracting features
!cut -f2 ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.gff | sort | uniq -c | tail
Then, I used all four databases to extract all genes:
#Isolate gene entries from multiple annotation databses. Tab mus be included between database and feature
!grep -e "Gnomon gene" -e "RefSeq gene" -e "cmsearch gene" -e "tRNAscan-SE gene" \
ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.gff \
> cgigas_uk_roslin_v1_gene.gff
Issues with bedtools
I proceeded to extract CDS, exon, lncRNA, and mRNA tracks. Before I could get the othere tracks, I needed to extract chromosome names and lengths. I used the following code to create the genome file required by bedtools
:
!awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' \
ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.fna \
| awk '{print $1"\t"$2}' \
> cgigas_uk_roslin_v1-sequence-lengths.txt
Unfortunately, the header for each sequence has extra descriptive text! I initially tried using tr " " "\t"
to separate out the descriptive text into columns, then only save the first and last columns (hopefully the chromosome name and sequence length)l. However, the descriptive text separated into different numbers of columns depending on the row, so that wasn’t an option. I used tr " " "\t"
with the awk
command and cut
in one chunk to extract the chromosome name and save it in a text file. Then, I used the awk
command and cut
only to extract the sequence length. Finally, I used paste
to combine the columns and create my genome file!
I then proceeded to use complementBed
with the exon track to create a non-coding sequence track. However I encountered an error!
For some reason, complementBed
wasn’t accepting the genome file I created, even though it had the two required fields. I posted this discussion to get input while I did some troubleshooting. In referring to my old notebooks, I realized that I was using an older version of bedtools
than I had previously! When I used the newer version, I was able to run complementBed
successfully. After I created the non-coding sequence, intron, and intergenic tracks, I decided to create the flanking tracks. I then encountered the same error with flankBed
!
Sam suggested I make genome file again in one chunk, instead of pasting two columns together. Instead of pasting two columns together, I went down the rabbit hole of trying to cut the text string “Crassostrea…….shotgun sequence”. I changed the awk
command so it pulls out characters 2-14 of the sequence header. This worked well for the end of the file, but not the beginning:
I ended up adding sed 's/Cr//g'
after the awk
code (take characters 2-14 of the header) to remove the remaining “Cr” in certain entries!
I still got the same error with flankBed
, so I was going to work from the index file like Sam suggested. Before I got around to doing that, I noticed that there was an empty first line in my file:
I added a final modification to my code to remove the first line of the file, and that worked!
!awk '$0 ~ ">" {print c; c=0;printf substr($0,2,14) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' \
ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_genomic.fna \
| sed 's/Cr//g' \
| awk '{print $1"\t"$2}' \
| tail -n +2 \
> cgigas_uk_roslin_v1-sequence-lengths.txt
Transposable elements
The last track I needed was for transposable elements. I initially posted this issue, and Sam produced a RepeatMasker track. Both of us also realized there’s RepeatMasker output on the NCBI annotation page! I downloaded the output. I noticed that there were several columns I didn’t need, and tried to create a BEDfile with just the chromosome, start, and stop information. When I tried to cut
the columns, I (of course) couldn’t get it to work!
I posted this discussion, and Sam suggested I use the following code (which of course worked):
#Convert RepeatMasker output to a BEDfile
#Skip the first 4 lines
#Print columns 5-7 as a tab-delimited output
!tail -n +4 cgigas_uk_roslin_v1_rm.te \
| awk 'BEGIN{OFS= "\t"} {print $5, $6, $7}' \
> cgigas_uk_roslin_v1_rm.te.bed
IGV
Finally, I wanted to look at these tracks in IGV. I created this IGV session, and breezed through the chromosomes to ensure the tracks were generated properly:
I uploaded the tracks to owl
and updated the Roberts Lab Handbook with the tracks. Now I think I’m ready to use these genome feature tracks!
Going forward
- Finish running
methylKit
randomization tests on R Studio server - Write methods
- Write results
- Update
mox
handbook with R information - Obtain relatedness matrix and SNPs with EpiDiverse/snp
- Determine genomic location of DML
- Identify overlaps between DML, SNP data, and other epigenomic data
- Identify significantly enriched GOterms associated with DML
- Identify methylation islands and non-methylated regions
- Determine if larval DNA/RNA should be extracted