CEABiGR Part 2
Trying to tune the sPLS
Based on my initial sPLS results, I needed to separate male and female samples and run individual analyses to understand the relationship between pH, exon methylation, and expression.
I extracted the female expression and methylation data and filtered them the same way I did the larger dataset: I removed exons with missing values for any sample, as well as lowly variable exons:
exonMethylationFem <- exonMethylationMod %>%
  filter(., grepl("F", rownames(.), fixed = TRUE)) %>% #Keep female samples
  t(.) %>%
  as.data.frame(.) %>%
  rownames_to_column("exon") %>%
  drop_na(.) %>% #Remove exons with missing values for any sample
  rowwise() %>%
  mutate(range = max(c_across(where(is.numeric))) - min(c_across(where(is.numeric)))) %>% #Calculate the range for each exon
  column_to_rownames("exon") %>%
  filter(., range > 10) %>% #Remove lowly variable exons
  select(., !range) %>%
  t(.) %>%
  as.data.frame(.)
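To make the filtering rule concrete, here is a minimal base R sketch on made-up numbers. The helper name filter_exons, its min_range argument, and the toy table are all hypothetical and not part of the actual analysis. It assumes samples are in rows and exons are in columns, and it mirrors the rule described above: drop exons with a missing value in any sample, then keep exons whose range across samples exceeds 10.

```r
# Hypothetical helper for illustration only: samples in rows, exons in columns
filter_exons <- function(dat, min_range = 10) {
  dat <- as.matrix(dat)
  # Drop exons (columns) with a missing value in any sample
  complete <- colSums(is.na(dat)) == 0
  dat <- dat[, complete, drop = FALSE]
  # Range of each remaining exon across samples
  exon_range <- apply(dat, 2, function(x) max(x) - min(x))
  # Keep exons whose range exceeds the threshold
  as.data.frame(dat[, exon_range > min_range, drop = FALSE])
}

# Made-up values for three female samples
toy <- data.frame(
  exon1 = c(5, 50, 95),   # range of 90: kept
  exon2 = c(40, 45, 48),  # range of 8: dropped as lowly variable
  exon3 = c(10, NA, 80),  # missing value: dropped
  row.names = c("F1", "F2", "F3")
)

toyFiltered <- filter_exons(toy)
print(colnames(toyFiltered))
```

Only exon1 survives both filters in this toy table, which is the behavior the dplyr pipeline above is meant to produce on the real data.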
After filtering the data, I went through the process of tuning parameters described in the mixOmics manual. Before I could run an sPLS, I needed to ensure that the parameters I was using were appropriate for the data. The first parameter I tested was the number of components to include in the analysis. I ran a PLS with the female data, specifying ncomp = 4 for four components.
exonResult.plsFem <- pls(exonMethylationFem, exonExpressionFem, ncomp = 4) #Run a PLS with a sufficient number of components
I then tried to test the significance of these components using perf, which runs repeated k-fold cross-validation:
exonPerf.plsFem <- perf(exonResult.plsFem, validation = "Mfold", folds = 10, progressBar = FALSE, nrepeat = 1000) #Use perf for repeated k-fold cross-validation to determine the number of components that should be included
Unfortunately, R Studio crashed! Around the time I started running the code, raven needed to be rebooted. I tried the code again and R Studio still crashed, so the reboot wasn’t the only cause of the issue. I tried running perf again with folds = 3 and nrepeat = 100 to see if I could reduce the processing power required, but I still ended up crashing R Studio. I tried these things over a few weeks when I had time, so the saga was really prolonged.
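For a rough sense of scale, repeated k-fold cross-validation refits the model once per fold in every repeat, so the number of fits grows as folds × nrepeat. The sketch below only does that arithmetic. count_cv_fits is a hypothetical helper, and it ignores the time and memory each fit needs, which may matter more than the raw count for tables this size.

```r
# Hypothetical helper: number of model fits in repeated k-fold cross-validation
count_cv_fits <- function(folds, nrepeat) {
  folds * nrepeat
}

original <- count_cv_fits(folds = 10, nrepeat = 1000) # Settings from the first perf call
reduced <- count_cv_fits(folds = 3, nrepeat = 100)    # Settings from the second attempt

print(c(original = original, reduced = reduced))
print(original / reduced) # How many times fewer fits the reduced run needs
```

The reduced settings cut the number of fits by more than an order of magnitude, yet that run still crashed, which hints that the cost of each individual fit matters at least as much as how many fits are requested.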
At this point, I posted this discussion to get insight from Sam and Steven. Sam confirmed that raven did not have enough processing power to handle the perf command, and suggested running the code on the R Studio server on mox.
Before moving over to mox completely, I decided to try running the same commands using a subset of the data to troubleshoot the code. This way, I could let the code run on mox without having to fix errors. Turns out using a subset didn’t work, since mixOmics needs a substantial number of overlapping exons in both methylation and expression datasets! I kept running into errors, so I tried messing around with tuning X and Y. This also led to R Studio crashing, even when using two components, two folds, and only ten repeats. Yikes.
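One way to sanity-check a subset before handing it to mixOmics is to count the exons shared by the two tables first. This is a minimal base R sketch, assuming exons are columns and samples are rows in both tables (as in exonMethylationFem above). The toy data frames and the subset_shared_exons helper are hypothetical, and the sketch does not test how many shared exons mixOmics actually needs.

```r
# Hypothetical helper: keep up to n exons present in both tables
subset_shared_exons <- function(methylation, expression, n = 100) {
  # PLS pairs the two tables by sample, so rows must match
  stopifnot(identical(rownames(methylation), rownames(expression)))
  shared <- intersect(colnames(methylation), colnames(expression))
  keep <- head(shared, n)
  list(
    nShared = length(shared),
    methylation = methylation[, keep, drop = FALSE],
    expression = expression[, keep, drop = FALSE]
  )
}

# Made-up values for two female samples
toyMeth <- data.frame(
  exonA = c(10, 60), exonB = c(20, 70), exonC = c(30, 90),
  row.names = c("F1", "F2")
)
toyExpr <- data.frame(
  exonB = c(1.5, 2.0), exonC = c(0.2, 3.1), exonD = c(4.0, 5.0),
  row.names = c("F1", "F2")
)

toySubset <- subset_shared_exons(toyMeth, toyExpr, n = 2)
print(toySubset$nShared)
print(colnames(toySubset$methylation))
```

Checking nShared up front would show whether a random subset leaves enough exons in common before any model is fit.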
Looks like I have to use mox no matter what! But first, I’ll focus on updating the paper with foundational methods and results so when I return to this analysis it has a clear purpose.
- Update foundational methods
- Update foundational results
- Revise with new expression data from Ariana
- Tune sPLS parameters
- Run sex-specific sPLS
- Identify drivers