WGBS Analysis Part 16

Reviewing revised alignments

I got my Roslin genome-aligned data back! Last week the paper for this genome was published as well, so it’ll be nice to have that resource while I work with this data. Steven noticed that they have their own version of the mitochondrial genome, which is different than the sequence I used. This shouldn’t be a problem since mitochondrial DNA isn’t methylated.

multiqc

I transferred all mox bismark output to this gannet folder. I didn’t run multiqc on mox, so I did that in the same folder. I then copied the HTML reports and multiqc output to this repo folder! Once I was done moving files around, I looked at the multiqc report.

Similar to what I saw with the Hawaii samples, alignment to the Roslin genome was slightly lower, ranging between 61.5% to 62.0%. Percent duplicate reads was also lower when aligning to the Roslin genome! Percent duplication is much higher for this dataset than the Hawaii one (22.6%-26.4%), but that’s probably because the histology DNA wasn’t high quality. I noticed that the samples with more M C’s/M Aligned had higher percent duplication, but overall percent aligned wasn’t higher for these samples.

Screen Shot 2021-03-30 at 11 02 30 AM

Figure 1. General alignment statistics for oyster_v9 genome (left) and Roslin genome (right)

IGV

Since my alignment statistics generally looked good, I got to move onto a more important step: checking the alignments in IGV to make sure there isn’t any methylation in the mitochondrial sequence! I opened a new IGV session and entered in the genome URL, but it wouldn’t load since I didn’t have an index file. I posted this discussion to determine how to create a .fai file. After posting the discussion, I did a quick search and found samtools faidx. I decided to try samtools to create the index file. I returned to this discussion post where I had some issues running samtools, and successfully installed the latest version of htslib using these instructions. However, I ran into an error installing the latest samtools version! I downloaded the latest version on genefish and ran ./configure --prefix=/Users/Shared/Apps/samtools-1.12. Got the following error:

config.status: creating config.mk
dyld: Library not loaded: /usr/local/opt/mpfr/lib/libmpfr.4.dylib
  Referenced from: /usr/local/bin/gawk
  Reason: image not found
./config.status: line 879: 83237 Done(141)               eval sed \"\$ac_sed_extra\" "$ac_file_inputs"
     83238 Abort trap: 6           | $AWK -f "$ac_tmp/subs.awk" > $ac_tmp/out
config.status: error: could not create config.mk

Sam suggested I upgrade gawk using brew upgrade gawk. When I tried that, I got another error:

Screen Shot 2021-03-30 at 11 47 14 AM

Sam then suggested I install gawk before running samtools. While I was going down that rabbit hole, Steven commented on my original discussion post and said I should make an index file with IGV. IGV didn’t give me an option to create an index file when I added the genome URL, but I found this page in their documentation that has instructions to create a genome. Under Genomes > Create .genome File, I added in the FASTA URL and named the genome “Roslin-gigas-mito.genome.” It worked! I saved the genome and index file in this directory with my combined FASTA sequence. At last, I was able to load in sample bedgraphs and look at the mitochondrial sequence.

Screen Shot 2021-03-30 at 1 20 52 PM

Figure 2. Mitochondrial sequence in IGV

The mitochondrial sequence was predominantly unmethylated, but there were some methylated CpGs in individual samples. I posted this discussion to determine what happened. Steven suggested I look only at 5x coverage. When he mentioned this, I realized I already created 5x bedgraphs, and should have been using those from the beginning! I loaded the 5x bedgraphs into IGV and looked at the mitochondrial genome.

Screen Shot 2021-04-04 at 1 14 16 PM

Screen Shot 2021-04-04 at 1 14 32 PM

Screen Shot 2021-04-04 at 1 14 51 PM

Figures 3-5. Mitochondrial sequence in IGV using 5x begraphs

Looking at the full sequencence with 5x begraphs is definitely an improvement, but there are two areas with very slight methylation. I don’t see a lot of consistency between samples either, so I don’t think it’s likely any of these areas will be called as DML. In any case, these methylation traces may be a sign of something else going on. I could remove these locations from my dataset, bump up coverage to 10x to see if that reduces methylation, or try mapping to the Roslin mitochondrial sequence. I saved my IGV session and commented on the discussion to get input on how to proceed.

Going forward

  1. Write methods
  2. Identify SNPs in WGBS data mapped to Roslin genome
  3. Revise methylKit study design: separate tests for indeterminate and female oysters
  4. Run R script on mox
  5. Write results
  6. Identify genomic location of DML
  7. Determine if RNA should be extracted
  8. Determine if larval DNA/RNA should be extracted
Written on March 29, 2021