Fixes #15
The barcode extraction from FASTQ headers was failing with non-standard formats like SRA headers (e.g., "@SRR20318439.1 ... length=111") where the extracted "barcode" contained spaces, breaking downstream shell commands.
Changes:
- Refactored barcode extraction to sample the first 10k reads and return the most frequent valid barcode (avoids single-read sequencing errors)
- Validate barcodes against pattern ^[ACGTN+-]+$ (nucleotides with optional dual-index separator)
- Fall back to "unknown" for files without valid barcodes
- Extracted shared function to eliminate code duplication between paired-end and single-end processes
- Added test case with SRA-style headers to verify the fix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR fixes barcode extraction from FASTQ files with non-standard header formats (e.g., SRA headers) that was causing failures due to spaces in extracted barcodes breaking downstream shell commands.
Changes:
- Refactored barcode extraction to use a shared shell function that samples 10k reads, validates barcodes against a nucleotide pattern, and returns the most frequent valid barcode
- Added test case with SRA-style headers to verify the fix
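The selection logic described above can be sketched in a few lines. The PR implements it as a shared shell function inside fastq_to_ubam.nf, which is not shown here, so the Python below is only an illustration of the idea, not the PR's code. The function names, the sample headers, and the assumption that the barcode is the text after the last ":" in the header are all hypothetical; only the pattern ^[ACGTN+-]+$, the 10k-read sample, and the "unknown" fallback come from the description.

```python
import re
from collections import Counter
from itertools import islice

# Pattern quoted in the PR description: nucleotides plus the
# dual-index separators "+" and "-".
VALID_BARCODE = re.compile(r"^[ACGTN+-]+$")


def fastq_headers(lines):
    """Yield the header line of each FASTQ record (every fourth line)."""
    return islice(lines, 0, None, 4)


def extract_barcode(headers, max_reads=10_000):
    """Return the most frequent valid barcode among the first max_reads headers.

    Assumption (not confirmed by the PR): the barcode is whatever follows
    the last ':' in the header, as in Illumina-style headers.
    """
    counts = Counter()
    for header in islice(headers, max_reads):
        candidate = header.strip().rsplit(":", 1)[-1]
        # Reject anything that is not a plain nucleotide string, e.g. an
        # SRA header with spaces that would break a shell command.
        if VALID_BARCODE.match(candidate):
            counts[candidate] += 1
    if not counts:
        return "unknown"  # fallback named in the PR description
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up headers for illustration only.
    illumina = ["@r1 1:N:0:ACGT+TTGA", "@r2 1:N:0:ACGT+TTGA", "@r3 1:N:0:ACGA+TTGA"]
    sra = ["@SRR0000001.1 1 length=111"]
    print(extract_barcode(illumina))
    print(extract_barcode(sra))
```

Counting over a sample rather than trusting the first read means one barcode with a sequencing error cannot decide the result, and the validation step guarantees that whatever is returned is safe to interpolate into later commands.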
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| fastq_to_ubam.nf | Introduced shared barcode extraction function and replaced direct barcode extraction in both paired-end and single-end processes |
| tests/fastq_to_ubam.nf.test | Added test case for non-standard SRA header format |
| tests/fastq_to_ubam.nf.test.snap | Added snapshot for the new SRA header test case |
This went much better than pull request #49 with Claude:
lnblum left a comment
I like the added sampling of reads to find the barcode.
It was interesting to read the prompt log. I wondered whether the agents would be confused by the fact that the issue referred to code that was substantially different from the current code, but it seems both identified that the barcode extraction code had been moved to fastq_to_ubam.
Fixes #15
The barcode extraction from FASTQ headers was failing with non-standard formats like SRA headers (e.g., "@SRR20318439.1 ... length=111") where the extracted "barcode" contained spaces, breaking downstream shell commands.
Changes: