DMG Analysis

Starting the differentially methylated gene analysis

After weeks of thinking about it, I’m finally looking at differentially methylated genes! I’m replicating the differentially methylated gene analysis from this paper. The authors used python scripts for the analysis, so I started by reviewing their steps:

  1. Annotate coverage files from bismark with gene and exon information
  2. Calculte median % methylation for each gene
  3. Check normality and variance assumptions
  4. Perform t-tests to see if median % methylation differs by treatment, and correct p-values for multiple comparisons

For my analysis, I created this repository folder and downloaded initial python scripts from this repository. I figured I could do the rest of the analysis in R, so I didn’t download any other python scripts.

Choosing a workflow

The first step is to annotate coverage files from bismark with gene information. I am using this python script, which I got from ). The majority of the script imports other functions and scripts.

I started running the script but encountered this error:

Screen Shot 2019-08-14 at 2 38 21 PM

I think the issue came from the .gff3 I downloaded from NCBI.

Screen Shot 2019-08-14 at 2 42 10 PM

I posted this issue, and Sam confirmed that the parse.gff3 script is the reason why I got an error. The script I downloaded isn’t a general GFF3 parser, so I would need to modify it for my own use. He asked me what I wanted to do, and I realized I don’t need to use the exact script. Talking with Shelly on Friday, I also found out that Hollie was working on something similar. In the issue’s comments, she shared this Markdown document and this R script with me. Since I’m more proficient in R, I decided to modify Hollie’s workflow for my use.

Annotating coverage files

Master gene annotation list

I created this R Markdown script for my analysis. The first thing I did was create a list of annotated genes. This was relatively straightforward for me because I’ve done it beforetwice!. I matched genes with mRNA, and ended up with a preliminary list of 60,201 entries — the exact number of mRNA. The file can be found here.

Then came the first tricky part: selecting the longest mRNA for each gene. When looking at differentially methylated genes, Liew et al. assigned a function to the gene by using information from the longest mRNA. I wanted to do the same thing. I initially used sort --unique to try and do this, but that didn’t work. I posted this issue, and Sam found a solution.

#For the 6th field [$6] (gene ID), remove all duplicate entries in a csv file (-F","). These entries are saved and can be extracted as a different output if needed. 
#Save output to a new file
awk -F"," '!x[$6]++' 2019-08-14-All-Annotated-Gene-and-mRNA.csv \
> 2019-08-14-All-Annotated-Gene-and-mRNA-Unique.csv

I was left with ~34,000 lines, so under half of all mRNA were from duplicate genes. I appended Uniprot Accession codes and GOterms to each line and saved the file here.

Issues with full_join

To match loci in coverage files with the genes they’re in, Hollie used full_join from dplyr, then %>% filter to retain loci that were within the gene’s start and end positions. I tried doing the same thing and got an error:

temp <- full_join(sample1, geneAnnotationsCondensedUniprotGOterms, by="gene.chr") %>% filter(start >= gene.start & start <= gene.end)
Error in full_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches, environment()) : 
  negative length vectors are not allowed

Baffled, I went to Stack Overflow and found this could happen because the resultant table has too many lines. I posted this issue, in which Sam suggested I try the same code with subsets of my input files, and try the code on ostrich since it has more RAM. On genefish, I tried running the full_join using data from one chromosome in a sample, but it didn’t work. I tried running full samples on ostrich but that didn’t work either. Finally, I got code to work on ostrich using one chromosome from a sample and the annotated gene list with Uniprot Accession codes and GOterms. I had to use merge instead of full_join because dplyr is not compatible with that version of R, which also meant I couldn’t use %>% to pipe the table into filter and needed a separate filter step. Two hours after I started running the merge code, it still wasn’t done! Fed up, I was reminded of my best friend: bedtools.

Using intersectBed

Why did I spend time trying to figure out an R workaround when I could use intersectBed — a tool made for this exact purpose! I quickly found there was one caveat with intersectBed. I couldn’t use the coverage file as is, since it contained numerical values that were not positions on the chromosome. I needed to remove the percent methylation column from each file, run intersectBed, then add the percent methylation information back in.

I used the following code to isolate the first three columns (chromosome, start, end) from each coverage file. I included \t to ensure the file will be tab-delimited (requirement for bedtools):

``{bash} #For each file that ends in bedgraph #Print the first three columns, tab-delimited, and save the columns in a new file that has the same base name and ends in posOnly for f in *bedgraph do awk ‘{print $1”\t”$2”\t”$3}’ ${f} > ${f}.posOnly done

With my newly created files, I was able to run `intersectBed`:

#For each file that ends in posOnly
#Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes
#wb: Print all lines in the second file
#a: file that ends in posOnly
#b: annotated gene list
#Save output in a new file that has the same base name and ends in -Annotated.txt
for f in *posOnly
  /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
  -wb \
  -a ${f} \
  -b 2019-08-14-Annotated-Gene-Longest-mRNA-Uniprot-GOterms.txt \
  > ${f}-Annotated.txt

Adding back percent methylation information

After annotating coverage files with intersectBed, I need to add percent methylation information back to the annotated files to calculate median percent methylation by gene. I could import 20 files into R, or I could use a series of bash commands. There are two steps, since Linux join does not allow you to join by multiple fields. The first step involves creating a new column with the information from the multiple columns you want to match in both files. The second step is to join.

Both my original coverage files and annotated coverage files have the same first three columns: chromosome, start, and end for each CpG locus. For all 20 files, I created a new column with information formatted as chromosome-start-end:

Then, I used join specifiying the columns I just created:

for f in *bedgraph
  join -j 1 -t $'\t' \
  ${f}.sorted \
  ${f}.posOnly-Annotated.txt.sorted \
  > ${f}.annotated-percentMeth

I used wc -l to check that the number of lines in the annotated files are the same as the number of lines in the files I just generated. There were no lines lost between the file pairs, so this part of my DMG analysis is done!

Going forward

  1. Calculate median percent methylation
  2. Conduct t-tests by gene to identify DMG
  3. Gene enrichment for DMG with GO-MWU
  4. Update methods and results
  5. Update paper repository
  6. Outline the discussion
  7. Write the discussion
  8. Write the introduction
  9. Revise my abstract
  10. Share the draft with collaborators and get feedback
  11. Post the paper on bioRXiv
  12. Prepare the manuscript for publication
Written on August 16, 2019