WGBS Analysis Part 28

Working with `BS-Snper` SNPs

I’m returning to the BS-Snper analysis so I can characterize overlaps with SNPs and DML! This will help me understand if the differential methylation is solely attributable to treatment differences, or if there is also a genetic component.

Understanding VCF output

First, I wanted to understand (roughly) the VCF output I got from BS-Snper. I found this link that had good explanations of the header and columns. The first ~100 lines of the file are part of the VCF header, and I ignored it for the most part. Once I got to the column header information and actual output, I appreciated the clear information from the GATK website. Based on my workflow, I needed a way to filter SNPs with a C reference allele and T alternative allele.

Screen Shot 2021-05-15 at 8 12 06 PM

Filtering C/T SNPs

Steven suggested a few weeks ago that I could use bedtools intersect to look at overlaps between SNPs and CG sites or DML. I opened this Jupyter notebook to first filter C/T SNPs. I thought knowing how many C/T SNPs were present in my data would be helpful.

I knew I could use VCF files with intersectBed, and at first I tried intersecting the VCF with the CpG methylation output. However, intersectBed wouldn’t accept the format of the CpG tab-delimited output, since it only had one column with base pair position information. Instead, I decided to use the CG motif track I generated. Before writing out the file, I looked at the head and saw there were G/A SNPs that intersected with my CG motif track! This makes sense, because the CG motif track has positions for the Cs AND Gs that make up the CpG dinucleotide. Based on Mac’s comment on my previous discussion, she was looking at C/T SNPs only. I then used grep to filter the standard input so the reference and alternative allele columns were C and T, respectively:

#Intersect VCF with SNP locations and CG motif track
#BEDtools output has C/T and non-C/T SNPs
#Use grep to isolate C/T SNPs (tab between C and T required)

!{bedtoolsDirectory}intersectBed \
-u \
-a /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/SNP-results.vcf \
-b ../../genome-feature-files/cgigas_uk_roslin_v1_fuzznuc_CGmotif.gff \
| grep "C T" \
> /Volumes/web/spartina/project-gigas-oa-meth/output/BS-Snper/CT-SNPs.tab

My filtered SNP file can be found here! I proceeded to do something similar with SNPs from individual samples:

%%bash

for f in zr3616*vcf
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b /Users/yaamini/Documents/project-gigas-oa-meth/genome-feature-files/cgigas_uk_roslin_v1_fuzznuc_CGmotif.gff \
    | grep "C	T" \
    > ${f}_CT-SNPs.tab
    head ${f}_CT-SNPs.tab
    wc -l ${f}_CT-SNPs.tab
done

At this point, I ended up with some really ugly file names (ex. zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe.SNP-results.vcf_CT-SNPs.tab). I couldn’t figure out how to get xargs | basename to work within a loop, so I looked for another way to rename files en masse. Stack Overflow reminded me that mv existed beyond moving files from one place to another! The syntax looked a lot like sed, but it allowed me to remove the same pattern from all my filenames:

%%bash

for f in zr3616*CT-SNPs.tab
do
    [ -f ${f} ] || continue
    mv "${f}" "${f//_R1_val_1_val_1_val_1_bismark_bt2_pe.SNP-results.vcf/}"

done

The merged BAM file contained ~5 million SNPs, and ~120,000 of those were C/T SNPs (Table 1). Looking at individual files, there were ~3 million overall SNPs for each sample, with ~65,000-70,000 C/T SNPs. That means about 2% of SNPs in any BAM file were C/T SNPs. I meant to look at the C/T SNPs in IGV, but after I added my BAM files and was adding SNP files, the application crashed. I’ll try again later.

Overlaps with DML

I used intersectBed to look at overlaps between merged SNPs or individual SNPs, and either female-, indeterminate- or common DML in this notebook.

Table 1. Number of overall SNPs, C/T SNPs, and SNPs overlapping with various DML categories.

BAM	All SNPs	C/T SNPs	Female-DML	Indeterminate-DML
Merged	5,519,308	122,306	38	298
1	3,083,382	69,845	18	130
2	3,105,080	70,244	17	201
3	3,202,988	72,064	15	145
4	3,232,583	73,664	17	138
5	3,007,518	66,965	18	155
6	3,204,395	72,123	19	138
7	3,083,706	68,856	21	128
8	3,100,801	69,695	23	138

For the merged BAM file, ~12% of female-DML are SNPs, while ~4% of the indeterminate-DML are! I’m not really sure what this means, and how many of the individual DML overlap with eachother or the merged file. I think I’ll need to mess around with the SNP-DML overlap lists in R to understand how many unique locations are SNPs, and create some sort of visualization.

Going forward

Finish running methylKit randomization tests on R Studio server
Update mox handbook with R information
Obtain relatedness matrix and SNPs with EpiDiverse/snp
Identify overlaps between SNP data and other epigenomic data
Write methods
Write results
Identify significantly enriched GOterms associated with DML
Identify methylation islands and non-methylated regions
Determine if larval DNA/RNA should be extracted

Written on May 11, 2021