CEABiGR Part 13
Counting DML in genes with a predominant isoform shift
At what I thought was the end of my predominant isoform analysis, I realized I should count DML in genes with a predominant isoform shift. I then created this issue to keep track of my revelation. So today, let’s tackle my revelation.
Creating input files
In the same issue, I pinged Steven to ask if there were already files with DML and gene annotations available. I poked around the repo and couldn’t find anything, so I decided to make my own. I returned to this Jupyter notebook to use bedtools
for annotations:
#Find overlaps between DML and gene list with annotations
#Save overlaps with annotations
!{bedtoolsDirectory}intersectBed \
-wb \
-a {femaleDML} \
-b ../../genome-features/C_virginica-3.0-gene.gff \
> female_dml-Gene-wb.bed
!{bedtoolsDirectory}intersectBed \
-wb \
-a {maleDML} \
-b ../../genome-features/C_virginica-3.0-gene.gff \
> male_dml-Gene-wb.bed
!head male_dml-Gene-wb.bed
The output files are in this folder.
DML presence
I counted the number of DML in relevant genes in this R Markdown file. First, I imported the DML data:
DMLLocationsFemale <- read.delim("../DML-characterization/female_dml-Gene-wb.bed", sep = "\t", header = FALSE, col.names = c("chr", "start", "end", "f_DML", "meth.diff", "gene.chr", "Gnomon", "gene", "gene.start", "gene.end", "V11", "strand", "V13", "V14")) %>%
dplyr::select(., "chr", "start", "end", "meth.diff", "gene.start", "gene.end", "V14") %>%
separate(., col = V14, into = c("gene_name", "misc"), sep = ";Dbxref=GeneID:") %>%
dplyr::select(., -misc) %>%
mutate(., gene_name = gsub(pattern = "ID=gene-", replacement = "", x = gene_name)) #Import bedfile with DML locations in genes with gene ID annotations. Rename columns as file is imported then select columns of interest. Separate gene ID annotation information from misc information, and remove misc column. Remove extra information before gene ID
DMLLocationsMale <- read.delim("../DML-characterization/male_dml-Gene-wb.bed", sep = "\t", header = FALSE, col.names = c("chr", "start", "end", "f_DML", "meth.diff", "gene.chr", "Gnomon", "gene", "gene.start", "gene.end", "V11", "strand", "V13", "V14")) %>%
dplyr::select(., "chr", "start", "end", "meth.diff", "gene.start", "gene.end", "V14") %>%
separate(., col = V14, into = c("gene_name", "misc"), sep = ";Dbxref=GeneID:") %>%
dplyr::select(., -misc) %>%
mutate(., gene_name = gsub(pattern = "ID=gene-", replacement = "", x = gene_name)) #Import bedfile with DML locations in genes with gene ID annotations. Rename columns as file is imported then select columns of interest. Separate gene ID annotation information from misc information, and remove misc column. Remove extra information before gene ID
I then joined the data with the wide-format data:
predIsoFemWideDML <- left_join(predIsoFemWide, DMLLocationsFemale, by = "gene_name") %>%
dplyr::select(., -c(chr, gene.start, gene.end)) %>%
group_by(., gene_name) %>%
mutate(., DMLcount = n()) %>%
mutate(., DMLcount = case_when(is.na(meth.diff) == TRUE ~ 0,
is.na(meth.diff) == FALSE ~ DMLcount)) #Join pred iso data with DML annotations and remove superfluous columns. Group by gene name and count the number of DML per gene. When there is no DML in the gene (meth.diff == NA), change the DML count to 0.
I saved the dataframe here. Then, I counted how many genes had a particular number of DML:
predIsoFemWideDML %>%
group_by(., DMLcount, change) %>%
summarise(., count = n()) #Group data by DML count and change, then count the number of genes with or without predominant isoform shifts with various DML counts
Table 1. Female genes with DML counts
DML count | Shift? | Gene Count |
---|---|---|
0 | 0 | 30253 |
0 | 1 | 2065 |
1 | 0 | 53 |
1 | 1 | 2 |
2 | 0 | 14 |
2 | 1 | 2 |
3 | 0 | 3 |
Majority of genes did not contain DML, irrespective of whether or not the predominant isoform shifted.
I did the same thing for the male data, which can be found here.
Table 2. Male genes with DML counts
DML Count | Shift? | Gene Count |
---|---|---|
0 | 0 | 7692 |
0 | 1 | 1541 |
1 | 0 | 429 |
1 | 1 | 71 |
2 | 0 | 212 |
2 | 1 | 46 |
3 | 0 | 84 |
3 | 1 | 27 |
4 | 0 | 76 |
4 | 1 | 8 |
5 | 0 | 35 |
5 | 1 | 5 |
6 | 0 | 24 |
7 | 0 | 14 |
8 | 0 | 16 |
9 | 1 | 18 |
13 | 0 | 13 |
14 | 1 | 14 |
15 | 1 | 15 |
17 | 0 | 17 |
23 | 0 | 23 |
25 | 1 | 25 |
My next move was to plot this very unexciting data.
predIsoFemWideDML %>%
dplyr::select(., -c(start, end, meth.diff)) %>%
unique(.) %>%
ggplot(aes(x = DMLcount, fill = as.factor(change))) +
geom_histogram() +
facet_wrap(.~change, scales = "free_x", labeller = labeller(change = facet_labels)) +
labs(x = "Number of DML", y = "Number of Genes") +
scale_fill_manual(values = c(plotColors[5], plotColors[8]),
labels = c("No Shift", "Shift"),
name = "") +
theme_classic(base_size = 15)
Figures 1-2. Distributions of DML counts for genes with and without a shift in the predominant isoform for females (top) and males (bottom)
Although I don’t think that DML count had any bearing on the shift in the predominant isoform, I needed to do my due diligence and add DML count into the model:
predIsoFemModel <- glm(I(change == 1) ~ change_meth + DMLcount +
log10(length) + log2(change_exp + 1),
family = binomial(),
data = (predIsoFemWideDML %>% dplyr::select(., -c(start, end, meth.diff)) %>% unique(.) %>% ungroup(.))) #Create a binomial model, where 1 = the predominant isoform shifted when comparing low and control pH conditions. Use change in methylation, gene length, and change in gene expression as explanatory variables
This variable was retained in the final model for both females and males. Guess I have some tables, methods, and results to update.
Going forward
- Update methods
- Update results
- Finish maximum transcript analysis
- Perform formal DML enrichment analysis
- Quantify DML in genes with a change in maximum transcript
- Update methods and results
- Compare with Aranda and Liew models in discussion section
- Read recent papers
- Write discussion
- Write abstract
- Outline introduction
- Write introduction
- Send to co-authors for approval
- Submit to journal special issue