Killifish Hypoxia RRBS Part 2
Trimming RRBS data
Before I ran any alignments, I wanted to trim adapter sequences and check post-trimming quality with
multiqc. Neel couln’t remember if the sequences were non-directional or directional, and from prior experience he said that he manually checked potential directionality in the alignment step. My plan is to get fully trimmed files using both methods, then troubleshoot alignment with a subset from each! All output from trimming can be found here.
I ran this script for
trimgalore, and made sure I included the
-rrbs argument necessary for hard-trimming base pairs off of the 3’ end. Everything was going great…until the penultimate sample. I got the following error for
Using Illumina adapter for trimming (count: 10552). Second best hit was smallRNA (count: 0) Writing report to '/vortexfs1/scratch/yaamini.venkataraman/02-trimgalore/190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz_trimming_report.txt' SUMMARISING RUN PARAMETERS ========================== Input filename: /vortexfs1/home/naluru/Killifish/WHOI-Mummichog_RRBS/190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz Trimming mode: paired-end Trim Galore version: 0.6.6 Cutadapt version: 3.5 Number of cores used for trimming: 1 Quality Phred score cutoff: 20 Quality encoding type selected: ASCII+33 Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected) Maximum trimming error rate: 0.1 (default) Minimum required adapter overlap (stringency): 1 bp Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp File was specified to be an MspI-digested RRBS sample. Read 1 sequences with adapter contamination will be trimmed a further 2 bp from their 3' end, and Read 2 sequences will be trimmed by 2 bp from their 5' end to remove potential methylation-biased bases from the end-repair reaction File was specified to be a non-directional MspI-digested RRBS sample. Sequences starting with either 'CAA' or 'CGA' will have the first 2 bp trimmed off to remove potential methylation-biased bases from the end-repair reaction Running FastQC on the data once trimming has completed Running FastQC with the following extra arguments: '--outdir /vortexfs1/scratch/yaamini.venkataraman/02-trimgalore --threads 28' Output file(s) will be GZIP compressed Cutadapt seems to be fairly up-to-date (version 3.5). Setting -j 1 >>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<< This is cutadapt 3.5 with Python 3.9.5 Command line parameters: -j 1 -e 0.1 -q 20 -a X /vortexfs1/home/naluru/Killifish/WHOI-Mummichog_RRBS/190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz Processing reads on 1 core in single-end mode ... 10000000 sequences processed 20000000 sequences processed cutadapt: error: Error in FASTQ file at line 91357035: Line expected to start with '+', but found '@' >>> Quality trimming completed <<< 22839258 sequences processed in total Unable to close QUAL filehandle:
I removed this sample and ran the rest of my script successfully, including
multiqc. I decided to figure out what was happening with the last sample! I extracted the sequence with the error, as well as upstream and downstream sequences:
zcat 190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz | sed -n '91357028,91357032p;91357041q' AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ @K00114:958:H7TVNBBXY:4:2218:9435:44535 1:N:0:TATAATAT_AGATCTCG TGGTAAGTTGAGAGTGATAGTTTTTTTTGGTTTGGTTTGTTTTTTGTTT + AAFFFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJFJJJFJJJJJJJJ @K00114:958:H7TVNBBXY:4:2218:10044:44535 1:N:0:TATAATAT_AGACCTCA T:2218:7FJFAFJJJFAJJJJJJFJFFJFFFJJJJJFJJJJJJA< @K00114:958:H7TVNBBXY:4:2218:32684:44517 1:N:0:TATAATAA_AAACCCCG # Line with issue GGAGGGTGTCAATCCTGACGGTTATTTCCTAGCCAACTTAGAGCCAATA + <AA-FF-A-AAFJJFA7AF7AFFJ-77FAFJ--A<--<77<7-7---7F @K00114:958:H7TVNBBXY:4:2218:1641:44535 1:N:0:NATAATAT_NGATCTCG NGAAAAAAATGGTATAAATTGGAATGATGTTTTTTCGTTATTATATTTA
It looks like the second sequence,
@K00114:958:H7TVNBBXY:4:2218:10044:44535 1:N:0:TATAATAT_AGACCTCA...T:2218:7FJFAFJJJFAJJJJJJFJFFJFFFJJJJJFJJJJJJA< may not have the correct balance of sequence and quality information that the other sequences do. I’m not exactly sure how to fix it, so I sent Neel an email.
While waiting for Neel’s response, I decided to get a start on directional trimming. Since it takes a while, I figured that I may have a solution about the file with the error before the script starts trimming that file. I created a new script, and was sure to remove the
As I predicted, I ran into the same issue with sample 190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz!
This is cutadapt 3.5 with Python 3.9.5 Command line parameters: -j 1 -e 0.1 -q 20 -a X /vortexfs1/home/naluru/Killifish/WHOI-Mummichog_RRBS/190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz Processing reads on 1 core in single-end mode ... 10000000 sequences processed 20000000 sequences processed cutadapt: error: Error in FASTQ file at line 91357035: Line expected to start with '+', but found '@' >>> Quality trimming completed <<< 22839258 sequences processed in total Unable to close QUAL filehandle:
Neel wasn’t sure what’s going on with the sequences either, so I just ran the rest of the script without that file and with
multiqc…and it kept failing because apparently sample 190626_I114_FCH7TVNBBXY_L4_OC-S5_1.fq.gz was empty:
Output will be written into the directory: /vortexfs1/scratch/yaamini.venkataraman/02-directional/ Setting the option '--clip_r2 2' (to remove methylation bias from the start of Read 2) gzip: /vortexfs1/home/naluru/Killifish/WHOI-Mummichog_RRBS/190626_I114_FCH7TVNBBXY_L4_OC-S5_1.fq.gz: Permission denied Input file '/vortexfs1/home/naluru/Killifish/WHOI-Mummichog_RRBS/190626_I114_FCH7TVNBBXY_L4_OC-S5_1.fq.gz' seems to be completely empty. Consider respecifying!
I know a copy of these files exists at
/vortexfs1/scratch/yaamini.venkataraman/01-fastqc from when I did my initial QC, so I changed the script parameters to see if I could run TrimGalore from there. Thankfully that worked!
Issues with sequences
Since Neel wasn’t sure what was going on and doing a deep dive on the TrimGalore! Github wasn’t helpful, I posted this discussion to see if Sam had seen anything like this before. He said the sequence looked corrupted, and to confirm checksums with where the files were originally downloaded and the sequencer. If the files are not corrupted, then I may need to try using a different trimming software. I emailed Neel to see if I can get the original checksums.
Looking at the non-directional
multiqc report, all samples looked good in terms of adapters! The per tile sequence qualtiy, per sequence base content, and sequence duplication failed, but that makes sense because it’s RRBS data and highly duplicated. When I looked at the directional
multiqc report, the per base N content looked worse for a few samples. This could be a slight indication that the data is non-directional, but I won’t know until I try mapping.
- Troubleshoot alignment code with a few samples
- Figure out what’s happening with sample 190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz
- Trim sample 190626_I114_FCH7TVNBBXY_L4_OC-N3_1.fq.gz appropriately
- Start alignment with all samples