WGBS Analysis Part 30

Finishing up analyses and figures!

The time has come to wrap this shit up (or at least…wrap it up enough for my defense). I’m going to perform necessary statistical tests, make figures, and conduct GOterm enrichment!

Principal Components Analysis

I started by making figures associated with my methylKit output in this R Markdown document. The first thing I wanted to do was make a PCA to visualize global methylation patterns. I used code from my C. virginica paper to do this.

Screen Shot 2021-06-06 at 12 43 58 PM

Figure 1. PCA

Based on this figure, it’s easy to see the lack of clear global methylation differences by treatment. In a revised version of this figure, I want to add different symbols that indicate maturation stage to see if that contributes to the commonalities in global methylation profiles.

DML heatmaps

To visualize methylation differences between treatments at DML, I wanted to create a multipanel figure with 4 different heatmaps: 1) female-DML that don’t overlap with SNPs, 2) female-DML that overlap with SNPs, 3) indeterminate-DML that don’t overlap with SNPs, and 4) indeterminate-DML that overlap with SNPs. My hope was that breaking down the heatmap by these categories will allow me to look at hyper- and hypomethylation differences. For example, maybe all my SNP DML are hypermethylated, and all of my non-SNP DML are hypomethylated! Well…maybe not that extreme but I thought I should investigate.

In this R Markdown document, I created four separate heatmaps using heatmap.2. Unfortunately, heatmap.2 calls plot.new() each time it’s run, so I can’t make a multipanel plot in R. Instead, I created a multipanel with this InDesign document, then saved it as a pdf.

Screen Shot 2021-06-06 at 4 50 06 PM

Figure 2. Multipanel heatmap

So no drastic differences in methylation patterns based on DML identity. I’ll revise this figure to only include one heatmap for female- or indeterminate-DML, and maybe add an asterisk to the rows associated with SNPs to see if they cluster together.

Chromosomal distribution

The last thing I wanted to do with in my methylKit R Markdown document is look at the distribution of DML in C. gigas chromosomes. Essentially, I want to know if all the DML are located in on chromosome, or if they are distributed relatively evenly. To do this, I counted the number of DML per chromosome. I also normalized the number of DML in each chromosome by the number of CpGs present, as DML can’t exist where there are no CpGs! I then created a stacked barplot with the number of DML in each chromosome normalized by the number of CpGs, then added a secondary axis with the number of genes in each chromosome.

Screen Shot 2021-06-06 at 11 47 57 PM

Figure 3. Chromosomal distribution of DML

The chromosomes with the most DML had the most genes, so that tracks.

Distribution of CpGs in genome features

To characterize the general methylation landscape, I created this R Markdown script. I used the 5x unionBED methylation data to create a frequency distribution of CpGs and methylation status. I also used the unionBED overlaps with genome feature files to conduct a contingency test and create a stacked barplot. The distribution of highly methylated CpGs was significantly different than all 5x CpGs detected by WGBS.

Screen Shot 2021-06-07 at 3 04 39 AM

Figure 4. Frequency distribution and genomic location of CpGs

Distribution of DML in genome features

Finally, I created a stacked barplot to visualize the distribution of DML in the genome! Based on my contingency test results, the number of female-DML in CDS and downstream flanks differed significantly from highly methylated CpGs, and the distribution of indeterminate-DML also differed significantly from highly methylated CpGs with respect to upstream flanks, lncRNA, and intergenic regions, in addition to CDS and downstream flanks.

Screen Shot 2021-06-07 at 4 05 59 AM

Figure 5. Genomic location of DML

topGO enrichment

Finally, I conducted a GOterm enrichment test using topGO, following the workflow I used for the Hawaii data in this Jupyter notebook. Unfortunately, I didn’t find any enriched terms in my data! I returned to this Jupyter notebook to add gene product information to my list of annotated genes with female-DML. To do this, I isolated the transcript ID and product information from the nucleotide sequence I used for blastx:

#Convert to tab-delimited file
!perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' \
../../genome-feature-files/ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_rna_from_genomic.fna \
> ../../genome-feature-files/ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_rna_from_genomic.tab

#Isolate annotation
!cut -f2 ../../genome-feature-files/ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_rna_from_genomic.tab \
| tr "[]" "\t" \
| cut -f6,8 \
| tr "=" "\t" \
| cut -f2 \
> cgigas_uk_roslin_v1_rna_from_genomic_annotation.tab

#Isolate transcript ID
!cut -f2 ../../genome-feature-files/ncbi-genomes-2021-05-04/GCF_902806645.1_cgigas_uk_roslin_v1_rna_from_genomic.tab \
| tr "[]" "\t" \
| cut -f6,8 \
| tr "=" "\t" \
| cut -f4 \
> cgigas_uk_roslin_v1_rna_from_genomic_transcript.tab

#Paste annotation and transcript together
#Extract lines with a transcript ID (XM)
#Sort
#Save output
!paste cgigas_uk_roslin_v1_rna_from_genomic_transcript.tab cgigas_uk_roslin_v1_rna_from_genomic_annotation.tab \
| grep "XM" \
| sort \
> cgigas_uk_roslin_v1_rna_from_genomic_annot.transcript.tab

#Join files to get product names for genes with DML
!join -1 1 -2 1 -t $'\t' \
DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs.GOAnnot \
cgigas_uk_roslin_v1_rna_from_genomic_annot.transcript.tab \
| uniq \
| sort \
> DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs.GOAnnot.productID

Going forward

  1. Run methylKit with 4 vs. 4
  2. Revise figures
  3. Perform randomization test
  4. Update mox handbook with R information
  5. Incorporate other epigenomic data
  6. Identify overlaps between SNP data and other epigenomic data
  7. Identify methylation islands and non-methylated regions
  8. Determine if larval DNA/RNA should be extracted
Written on June 9, 2021