Skip to content

Conversation

@clairenoble
Copy link

I found issues with some sequences where the blocks from different parents were being assigned different overhangs when I ran the generate_libraries function if the amino acids in different variants were different at the breakpoints, leading to incompatible assemblies. This was the case when I ran the example in the Quickstart Guide on my sequences.

The updated version takes into account the amino acids at the breakpoints for all the parents, so all the blocks for all the homologues now have matching overhangs.

I have added an assembly step to the test_dna_blocks unit test, to help confirm that the generated sequences match their parents if all the blocks from a single parent are re-assembled.

… the blocks from different parents were being assigned different overhangs when I ran generate_libraries function when the amino acids in different variants were different at the breakpoints, leading to incompatible assemblies. The updated function takes into account the amino acids at the breakpoints for all the parents, so all the blocks for all the homologues can be made cross-compatible
@bbremer
Copy link
Collaborator

bbremer commented Jul 14, 2025

Good catch! Thanks for the PR. I haven't looked at this code for awhile, so I'll need a bit to remember how it works.

There seems to be a lot of changes that don't seem immediately relevant to the bug. Can you explain why these changes are needed in this specific commit?

Regardless, it looks like this code needs some TLC with four years of hindsight/experience. I'll get around that eventually, but let me know if you'd like to chat about how you use this package.

@clairenoble
Copy link
Author

Hiya, thanks for getting back to me so quickly! Appreciate this is a slightly older codebase, but I found the idea of GG compatible schema fragments super useful so wanted to send over my fix when I noticed the issue with the overhangs.
You're right that I made a few adjustments beyond what was strictly necessary for the bug fix, they’re mostly to help me keep things clear when debugging, or simplify the logic of the fix.

The main change was making sure the overhangs were always compatible. Basically the original code was determining the overhang codon based on the amino acid of each parent sequence individually. This was leading to issues if the parents have different amino acids at a breakpoint, they were then generating different, incompatible overhangs for the same junction. I spotted this as it was the case for the assembly I was testing, and the output FASTA with the gene blocks was making a lot of fragments that were incompatible for assembly.

The updated function now searches for a single “design” codon that works for that junction across homologues. It checks all possible amino acids from all parent sequences at that specific breakpoint position until it finds one whose codons can satisfy the overhang pattern. This means every DNA block for that junction gets the same, compatible overhang.
I also replaced the custom DUMMY_OVERHANG object with the standard Python None. This was mostly just to simply the logic. I thought using if oh_start is None was a more standard and slightly cleaner way to check for the absence of an overhang. It also made implementing the core fix a little more straightforward, as I didn't have to handle a custom class for the no overhang case
 
I also added some more descriptive error handling messages and variable names, mostly to help myself when I was debugging. The new function also has a new possible failure point, which is when there isn’t a possible amino acid/codon combination at a given breakpoint. I guess ideally the code would then shift to look at a different breakpoint if this scenario was triggered, but I haven’t implemented that. But for now it just throws an error to let you know that that breakpoint isn’t possible for the given homologue combination.
 
Then I’ve added some tests – I think these are all fairly self explanatory, I basically check that if you recombine all the parent blocks you still get the parent. It’s loosely based off a reply you gave someone who had an issue a few years back. Now that I think about it, there should probably be a test to check that the overhangs are actually compatible throughout the assemblies as that’s the entire point of this bug fix, oops!

Thank you for the kind offer! I'm finding the package very useful. I’m trying to link it through to our original DNA sequences and some scripts we have for designing GoldenGate primers to save on ordering DNA, hence why it’s so useful compared to a lot of SCHEMA stuff which just assumes you want to order whole genes..
Let me know if you have any other questions!

@bbremer
Copy link
Collaborator

bbremer commented Jul 16, 2025

Thanks for the explanations! IIRC, shouldn't the overhangs already be determined by the time you call dna_blocks? The bug you're seeing probably happens earlier, as you've alluded.

I'll keep investigating. If you have the time, could you submit a minimal reproducible example? Either in a bug report issue or attached to this PR.

The extraneous changes make a lot of sense, but they're distracting from the fix itself. Please reduce your PR to the minimum change necessary. I'd be happy to accept your refactoring PRs if you're willing to submit them later, but I'll work on cleaning up the code anyways.

Alternatively, feel free to make whatever changes you'd like to your fork and make sure to double-check the output! The newer version should be much more understandable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants