MethCompare Part 6
Retroactively obtaining checksums
Based on our conversations, we still don’t trust our data. While we run some alignments on a combined C1 and P. acuta genome to discard any reads we have that do not solely map to P. acuta, I decided to do something I should have done in the beginning: verify checksums.
Since I used `wget` to download the 10x bedgraphs for CpG classification, I need to verify checksums myself. When transferring files to or from `gannet` or Mox, I use `rsync`, which automatically verifies file integrity, but `wget` offers no such guarantee. I initially thought I would use Shelly's `awk` script to compare `md5sum` hashes for `gannet` files against hashes for the copies on my computer, but after some discussion at Science Hour I learned I could just use `md5sum -c` with a list of hashes and paths.
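For anyone unfamiliar with the format: `md5sum -c` expects each line of the checksum file to contain a 32-character hash, two spaces, and a path relative to wherever you run the command. A minimal sketch with a throwaway file:

```shell
# Make a small file, record its checksum, then verify it
printf 'hello\n' > demo.txt
md5sum demo.txt > demo.md5
cat demo.md5        # hash, two spaces, filename
md5sum -c demo.md5  # prints "demo.txt: OK"
```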
In this Jupyter notebook, I downloaded the master list of checksums that Shelly generated with `md5sum` for `gannet` files on `ostrich`. This file has hashes for files I am not using in my analysis. To see how `md5sum` would handle this, I navigated to my folder with M. capitata 10x bedgraphs and ran `md5sum -c all_031520-TG-bs_files_GANNET_md5sum.txt`. Since I was using `genefish`, which had `gannet` mounted, FASTQC files with `gannet` paths were being checked too! I realized I needed to pare down my list of files.
After navigating to the folder with the checksum file, the first thing I tried was using `grep` to extract the 10x bedgraph entries for each species, starting with M. capitata:
```
#Get all lines from original checksum document
#Extract information for 10x bedgraphs
#Extract information for Mcap data only
!cat all_031520-TG-bs_files_GANNET_md5sum.txt \
| grep 10x.bedgraph \
| grep Mcap \
> Mcap-10xbedgraph-GANNET-md5sum.txt
```
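One caveat I noticed later: in a `grep` pattern, an unescaped `.` matches any single character, so `10x.bedgraph` would also match a name like `10xZbedgraph`. That likely made no difference for these files, but escaping the dot makes the match literal:

```shell
# Unescaped dot matches any character; escaped dot matches only "."
printf '10x.bedgraph\n10xZbedgraph\n' | grep '10x.bedgraph'   # matches both lines
printf '10x.bedgraph\n10xZbedgraph\n' | grep '10x\.bedgraph'  # matches only the first
```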
I then ran `md5sum -c` with the file I generated. All of the paths were still `gannet` paths, so it did not verify checksums for the files on my machine. I unmounted `gannet` and tried again with a file that only had hashes. I initially thought I could use `cut -f1` to save the first column of data (the hashes), but `cut`'s default field delimiter is a tab, and `md5sum` separates hashes and paths with spaces, so everything landed in one field! Instead, I used `cut -c` to keep the characters I wanted by position.
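To see the difference between the two on `md5sum` output (an MD5 hash is always 32 hex characters, so slicing by position is safe here):

```shell
printf 'hello\n' > demo.txt
md5sum demo.txt | cut -f1     # no tab in the line, so the whole line is "field 1"
md5sum demo.txt | cut -c1-32  # just the 32-character hash
```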
```
#Get all lines from original checksum document
#Extract information for 10x bedgraphs
#Extract information for Mcap data only
#Only keep the first 32 characters in each line (md5 hashes)
#Save hashes
!cat all_031520-TG-bs_files_GANNET_md5sum.txt \
| grep 10x.bedgraph \
| grep Mcap \
| cut -c1-32 \
> Mcap-10xbedgraph-GANNET-md5sum-hashes.txt
```
When I ran `md5sum -c` with only the hashes, I got an error saying the file was not properly formatted for `md5sum`. That made sense: there were no file paths. The paths in the original file were `gannet` paths, but the files had the same names on `genefish`, so I just needed a way to get rid of the leading directories in each path. I did many searches for code that would remove characters from the middle of a line, keeping both the hashes and the correct file names. I tried variations of `sed` and `tr` that would delete `/Volumes/web/seashell/bu-mox/scrubbed/031520-TG-bs/Mcap_tg/dedup` or `/Volumes/web/seashell/bu-mox/scrubbed/031520-TG-bs/Mcap_tg/nodedup`, but neither command worked. I decided to try `cut -c` again to keep the last part of each path (the filename) and save it as a new file. I could then `paste` the hashes and filenames together to create a checksum file with only the files I was interested in.
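In hindsight, a single `sed` substitution would have done this in one step: greedily matching from the two spaces after the hash through the final `/` removes every directory component while leaving the hash and filename intact. A sketch with a made-up path:

```shell
# Delete everything between the hash and the final "/" (greedy .* reaches the last slash)
echo 'd41d8cd98f00b204e9800998ecf8427e  /Volumes/web/example/dir/sample.10x.bedgraph' \
| sed 's|  .*/|  |'
# → d41d8cd98f00b204e9800998ecf8427e  sample.10x.bedgraph
```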
First, I used a combination of `rev` and `cut -c` to keep the last 48 characters of each row:
```
#Get all lines from original checksum document
#Extract information for 10x bedgraphs
#Extract information for Mcap data only
#Reverse the characters in each line
#Keep the first 48 characters of each reversed line
#(i.e. the last 48 characters of the original line: the local filename)
#Reverse the characters back to normal order
#Save paths
!cat all_031520-TG-bs_files_GANNET_md5sum.txt \
| grep 10x.bedgraph \
| grep Mcap \
| rev \
| cut -c1-48 \
| rev \
> Mcap-10xbedgraph-GANNET-md5sum-paths.txt
```
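Worth noting: the `rev | cut -c1-48 | rev` trick relies on every filename being exactly 48 characters long. An `awk` one-liner that splits on `/` and prints the last field keeps whatever follows the final slash, regardless of length (sketch with a made-up path):

```shell
# $NF is the last /-separated field, i.e. the filename
echo '/Volumes/web/example/dir/sample.10x.bedgraph' | awk -F/ '{print $NF}'
# → sample.10x.bedgraph
```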
Then I pasted the paths with the previously-generated hashes:
```
#Paste hashes and paths to create a md5sum file
#Save checksum file
#Check output
!paste Mcap-10xbedgraph-GANNET-md5sum-hashes.txt Mcap-10xbedgraph-GANNET-md5sum-paths.txt \
> Mcap-10xbedgraph-GANNET-md5sum.txt
!head Mcap-10xbedgraph-GANNET-md5sum.txt
```
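One detail to watch with `paste`: its default delimiter is a tab, while `md5sum`'s own output puts two spaces between hash and filename. The tab-separated file worked for me, but if a stricter `md5sum -c` complains about formatting, a `sed` substitution can convert the tab to the canonical two-space separator (sketch with toy columns):

```shell
printf 'a\nb\n' > col1.txt
printf 'x\ny\n' > col2.txt
paste col1.txt col2.txt                   # tab-separated
paste col1.txt col2.txt | sed 's/\t/  /'  # two spaces, like md5sum output
```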
I navigated back to my M. capitata folder and ran `md5sum -c` with the new checksum file:

It worked! I repeated this process with the P. acuta files and got the all-clear:
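As a sanity check that `md5sum -c` really does catch corruption (a toy example, not my data), I can modify a file after recording its hash and re-verify:

```shell
printf 'original\n' > f.txt
md5sum f.txt > f.md5
md5sum -c f.md5   # f.txt: OK
printf 'tampered\n' > f.txt
md5sum -c f.md5 || echo 'verification failed as expected'
# second check reports "f.txt: FAILED" and exits non-zero
```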
Once we trust our data, I can run this pipeline knowing that there are some internal checks.
Going forward
- Look into exon annotations in Liew and Li papers
- Check code for union bedgraph files
- Figure out how to meaningfully concatenate data for each method
- Generate repeat tracks for each species
- Rerun the pipeline with full samples once pan-genome output is assessed and find a way to generate tables programmatically
- Create figures for CpG characterization in various genome features
- Update code for methylation frequency distribution figure
- Figure out methylation island analysis