# MethCompare Part 15

## Statistical comparisons for methods

Last time I attempted statistical comparisons I realized I needed to use chi-squared on count data, not proportions. I wanted a better way to make pairwise comparisons, so I posted this issue. Shelly suggested a method she used in this script, so I attempted to replicate that with my data over the past few days. I also clarified that I could use percent methylation data and that I averaged the union data correctly.

### Methylation status

I piloted everything with the methylation status information from union data in this R Markdown document. For each pairwise method comparison (WGBS vs. RRBS, WGBS vs. MBD-BS, RRBS vs. MBD-BS), I created contingency tables for each methylation status (strong methylation, moderate methylation, weak methylation). Each contingency table had one column for CpGs in that type, and another column for all other CpGs. The rows were the different sequencing methods. Once I created the tables, I ran a chi-squared test with `chisq.test`. I then extracted data from the test model using `broom::tidy` (aka MAGIC) and saved it as a dataframe! Finally, I added CpG type information to the column with test results and used `rbind` to merge results from one test with other test results:

``````McapCpGTypeWGRR <- data.frame() #Create empty dataframe to store chi-squared results
for(i in 1:ncol(McapCpGTypeStatTest)) { #For each CpG type
Method1CpG <- McapCpGTypeStatTest[1,i] #Variable for # CpG type for method 1
Method2CpG <- McapCpGTypeStatTest[2,i] #Variable for # CpG type for method 2
Method1NotCpG <- sum(McapCpGTypeStatTest[1,-i]) #Variable for # other CpG types for method 1
Method2NotCpG <- sum(McapCpGTypeStatTest[2,-i]) #Variable for # other CpG types for method 2
ct <- matrix(c(Method1CpG,Method2CpG,Method1NotCpG,Method2NotCpG), ncol = 2) #Create contingency table
colnames(ct) <- c(as.character(colnames(McapCpGTypeStatTest[i])), paste0("Not", colnames(McapCpGTypeStatTest[i]))) #Assign column names: type, not type
rownames(ct) <- c(as.character(row.names(McapCpGTypeStatTest)[1:2])) #Assign row names: method 1, method 2
print(ct) #Confirm table is correct
ctResults <- data.frame(broom::tidy(chisq.test(ct))) #Create dataframe storing chi-sq stats results. Use broom::tidy to extract results from test output
ctResults\$CpGType <- as.character(colnames(McapCpGTypeStatTest)[i]) #Add CpG type to results
McapCpGTypeWGRR <- rbind(McapCpGTypeWGRR, ctResults) #Add test statistics to master table
}
``````

I also used an FDR correction to correct for multiple comparisons and also included metadata on the pairwise comparison I did:

``````McapCpGTypeWGRR\$p.adj <- p.adjust(McapCpGTypeWGRR\$p.value, method = "fdr") #Correct p-value using FDR
McapCpGTypeWGRR\$comparison <- rep("WGBS vs. RRBS", times = 3) #Add methods compared
``````

I repeated this process for P. acuta, and used a similar loop to compare methods between species! I saved the output for the statistical tests for each species and for the interspecies comparisons:

Looking at the p-values, everything was still significant. I’ll keep this for now, but perhaps I need to use a stricter correction method.

The next thing I did is revise my stacked barplots! My initial goal was to add statistical difference information in the figures. Since every pairwise comparison was significant, I opted against this. I used “strong,” “moderate,” and “weak” language to describe methylation status instead of “methylated,” “sparsely methylated,” and “unmethylated.” I also added legends to each plot and modified axis labels! Modifying the legend took a bit of finagling, since I needed to match x and y coordinates for `text` with `legend`. I also created a multipanel plot to use in the paper.

Figures 1-3. Methylation status stacked barplots for M. capitata](https://github.com/hputnam/Meth_Compare/blob/master/Output/Mcap_union-CpG-Type.pdf), [P. acuta*, and a multipanel plot.

### Genomic location

Since everything worked, the next step was to repeat everything with CpG feature overlap data!

#### Union data

Before I could do pairwise comparisons, I needed to add counts for CG motif overlaps with upstream and downstream flanking regions. In this Jupyter notebook, I used `intersectBed` to get overlaps, updated the notebook’s summary tables, and saved the overlap counts:

I then imported the upstream and downstream flank overlap information into my R Markdown script. I modified the genome feature overlap counts for both species so it included this information:

The pairwise comparisons are basically the same for feature overlaps, but I also added pairwise comparisons between CpG locations in the genome and sequencing methods. Again, everything was significant:

I updated the percentages to include upstream and downstream flank information, and removed the total flank information since it was redundant:

Finally, I revised my the stacked barplots! Similar to the CpG methylation status plots, I updated the axes, added in legends, and created a multipanel plot. I didn’t conduct stastical comparisons for the plots that incorporated methylation status and genomic location, but I did add a legend.

Figures 4-6. Genomic location stacked barplots for M. capitata](https://github.com/hputnam/Meth_Compare/blob/master/Output/Mcap_union-CpG-Features.pdf), [P. acuta*, and a multipanel plot.

Figures 7-8. Genomic location by methylation status for M. capitata and *P. acuta.

I added all of the statistical output and bargraph links to this issue so I could keep track of my progress.

#### DMC

I didn’t need to revise barplots for the method-associated DMC, but I did want to run statistical comparisons. Before running any statistical tests, I updated the overlap counts and percents with up- and downstream flank information, and added information for the background — locations of all CpGs in the P. acuta genome. Next, I did three pairwise tests: all CpGs vs. RM DMC, all CpGs vs. WM DMC, and RM DMC vs. WM DMC. I performed the tests in this R Markdown script and saved the test ouptut here. Finally I updated this issue so Mac would know I ran the tests!

#### Upset plot data

In this R Markdown script I also reformatted upset plot data and stacked barplots. For each species, I added information on all genomic CpGs and up- and downstream flank overlap information to the count data. I removed the gene and flank data from the percent overlap information since it would be redudant with other features.

M. capitata:

P. acuta:

Figures 9-10. Genomic location of upset plot data for M. capitata and P. acuta.

When it came to the statistical comparisons, I wasn’t sure what pairwise tests I needed to do. Including genome information, there were nine categories of CpGs. If I did pairwise tests between all of them, that would be a lot. I commented in this issue and asked Shelly which comparisons she was most interested in.

### Replicating Liew et al. Figure 1B

Since this is a coral project, Hollie suggested we replicate Liew et al. (2018) Fig. 1B so we can easily compare results between our species and Stylaphora pistillata. The figure is a barplot that has genomic feature on the y-axis and a percent of methylated CpGs vs. all CpGs on the x-axis.

In this R Markdown script I created divided the number of strongly methylated CpGs in a feature by all of the genomic CpGs in that feature. I created two versions of the horizontal grouped barplot: one with flanking information, and one without. Liew et al. (2018) did not include flanking information, but I did since the data was easily accessible. I figured we could decide what kind of question we wanted to answer with the figure and pick a version accordingly.

Figures 10-11. Ratio of strongly methylated to all CpGs in a genomic feature for M. capitata and *P. acuta.

Figures 10-11. Ratio of strongly methylated to all CpGs in genic regions for M. capitata and *P. acuta.

Looking at these figures, it’s clear that RRBS detects the most strongly methylated CpGs in M. capitata and WGBS detects the most in P. acuta. Low coverage of MBD-BS could explain why it does poorly in this figure but looks like it is enriching for strongly methylated CpGs in the stacked barplots.

### Going forward

1. Update methods
2. Update results
3. Locate TE tracks
4. [Characterize intersections between data and TE, and create summary tables]
5. Perform statistical comparisons for upset plot data (https://github.com/hputnam/Meth_Compare/issues/59)
6. Look into program for mCpG identification
Written on May 14, 2020