Hawaii Gigas Methylation Analysis Part 12

Identifying SNPs

Now that samtools is installed and I merged BAM files, I’m ready to identify SNPs with BS-Snper! I’m going to identify SNPs from the merged BAM file and from individual BAM files. Steven suggested that once I have SNP vcf files, I could use bedtools intersect to figure out where the SNPs overlap with the CG track and DML I’ll identify later, so I won’t need to filter SNPs once I identify them.

Merged SNPs

First, I decided to identify SNPs from the merged BAM file I created with samtools merge. Based on Mac’s suggestion, I ran BS-Snper with the default parameters. However, I set --mincov 5 instead of using the default coverage of 10 to be consistent with bismark and methylKit parameters.

#Defaults: --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
#Saving output to gannet
!perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
--fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
--input /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/merged-sorted-deduplicated.bam \
--output /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-candidates.txt \
--methcg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CpG-meth-info.tab \
--methchg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHG-meth-info.tab \
--methchh /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHH-meth-info.tab \
--mincover 5 \
> /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-results.vcf 2> merged.ERR.log

I wrote the output to this gannet folder instead of trying to deal with large files my Github repository.

Individual SNPs

Since I ran the script successfully, I moved onto SNP identification from individual files! I modified a loop from Mac to identify SNPs in my files. First, I changed my working directory to the one with BAM files. Within the same code chunk, I needed to create variables for the file names, then use the modified file names within the BS-Snper loop:

%%bash

FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}

for file in ${FILES}
do
    NAME=$(echo ${file} | awk -F "." '{print $1}')
    echo ${NAME}

    perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
    --fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
    --input ${NAME}.deduplicated.sorted.bam \
    --output ${NAME}.SNP-candidates.txt \
    --methcg ${NAME}.CpG-meth-info.tab \
    --methchg ${NAME}.CHG-meth-info.tab \
    --methchh ${NAME}.CHH-meth-info.tab \
    --mincover 5 \
    > ${NAME}.SNP-results.vcf 2> ${NAME}.ERR.log

done

Initially I specified --fa with a FASTA variable, but I kept running into an error! I then remembered that within Jupyter notebooks I had issues specifying variables within loops, so I used the absolute file path instead. Once I had all the files, I moved them to the BS-Snper directory on gannet.

# Move files to BS-Snper directory
!mv *SNP-candidates.txt *meth-info.tab *SNP-results.vcf *ERR.log ../BS-Snper/

Katherine suggested I merge with vcf-merge, but I don’t think I need to do that if I could run a loop with bedtools intersect to determine if any individual SNPs overlap with DML. Now that I identified SNPs for this dataset, I can do something similar with the Manchester data! I’m also going to try EpiDiverse/snp since it would give me a relatedness matrix in addition to SNP variants.

Going forward

Try EpiDiverse/snp for SNP extraction from WGBS data
Run methylKit script on mox
Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
Test-run DSS and ramwas
Transfer scripts used to a nextflow workflow
Update methods
Update results

Written on April 5, 2021