Hawaii Gigas Methylation Analysis Part 12
Identifying SNPs
Now that samtools
is installed and I merged BAM files, I’m ready to identify SNPs with BS-Snper
! I’m going to identify SNPs from the merged BAM file and from individual BAM files. Steven suggested that once I have SNP vcf files, I could use bedtools intersect
to figure out where the SNPs overlap with the CG track and DML I’ll identify later, so I won’t need to filter SNPs once I identify them.
Merged SNPs
First, I decided to identify SNPs from the merged BAM file I created with samtools merge
. Based on Mac’s suggestion, I ran BS-Snper
with the default parameters. However, I set --mincov 5
instead of using the default coverage of 10 to be consistent with bismark
and methylKit
parameters.
#Defaults: --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
#Saving output to gannet
!perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
--fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
--input /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/merged-sorted-deduplicated.bam \
--output /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-candidates.txt \
--methcg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CpG-meth-info.tab \
--methchg /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHG-meth-info.tab \
--methchh /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/CHH-meth-info.tab \
--mincover 5 \
> /Volumes/web/spartina/project-oyster-oa/Haws/BS-Snper/SNP-results.vcf 2> merged.ERR.log
I wrote the output to this gannet
folder instead of trying to deal with large files my Github repository.
Individual SNPs
Since I ran the script successfully, I moved onto SNP identification from individual files! I modified a loop from Mac to identify SNPs in my files. First, I changed my working directory to the one with BAM files. Within the same code chunk, I needed to create variables for the file names, then use the modified file names within the BS-Snper loop:
%%bash
FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}
for file in ${FILES}
do
NAME=$(echo ${file} | awk -F "." '{print $1}')
echo ${NAME}
perl /Users/Shared/Apps/BS-Snper-master/BS-Snper.pl \
--fa /Volumes/web/spartina/project-oyster-oa/data/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa \
--input ${NAME}.deduplicated.sorted.bam \
--output ${NAME}.SNP-candidates.txt \
--methcg ${NAME}.CpG-meth-info.tab \
--methchg ${NAME}.CHG-meth-info.tab \
--methchh ${NAME}.CHH-meth-info.tab \
--mincover 5 \
> ${NAME}.SNP-results.vcf 2> ${NAME}.ERR.log
done
Initially I specified --fa
with a FASTA variable, but I kept running into an error! I then remembered that within Jupyter notebooks I had issues specifying variables within loops, so I used the absolute file path instead. Once I had all the files, I moved them to the BS-Snper
directory on gannet
.
# Move files to BS-Snper directory
!mv *SNP-candidates.txt *meth-info.tab *SNP-results.vcf *ERR.log ../BS-Snper/
Katherine suggested I merge with vcf-merge
, but I don’t think I need to do that if I could run a loop with bedtools intersect
to determine if any individual SNPs overlap with DML. Now that I identified SNPs for this dataset, I can do something similar with the Manchester data! I’m also going to try EpiDiverse/snp since it would give me a relatedness matrix in addition to SNP variants.
Going forward
- Try EpiDiverse/snp for SNP extraction from WGBS data
- Run
methylKit
script onmox
- Investigate comparison mechanisms for samples with different ploidy in oysters and other taxa
- Test-run DSS and ramwas
- Transfer scripts used to a nextflow workflow
- Update methods
- Update results