@wwood (Owner) commented Jan 11, 2026

Motivation

  • Ensure matched protein sequence IDs produced by the supplement flow contain the originating SingleM/GraftM package basename for clearer provenance.
  • Validate that the code which replaces raw HMM names in output IDs with package basenames is working as expected.

Description

  • Updated the supplement worker to map HMM NAME entries to graftm_package_basename and to use that package basename when constructing matched protein sequence IDs.
  • Added validation when concatenating HMMs that fails on missing or duplicate HMM NAME entries, and passed an hmm_name_to_package mapping into the per-genome worker.
  • Strengthened test_output_matched_protein_sequences in test/test_supplement.py to acquire the metapackage, collect package basenames via graftm_package_basename(), and assert the matched protein record IDs include one of those basenames.
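
The mapping and validation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`read_hmm_name`, `build_hmm_name_to_package`), the example package basenames, and the choice of `ValueError` are assumptions. It relies only on HMMER3 profile files carrying a `NAME` header line.

```python
import re


def read_hmm_name(hmm_text):
    """Return the value of the first NAME line in HMMER3 profile text, or None."""
    for line in hmm_text.splitlines():
        match = re.match(r"^NAME\s+(\S+)", line)
        if match:
            return match.group(1)
    return None


def build_hmm_name_to_package(package_to_hmm_text):
    """Map each HMM NAME to its package basename.

    Hypothetical helper: raises ValueError when a NAME line is missing or
    when two packages declare the same NAME.
    """
    hmm_name_to_package = {}
    for package_basename, hmm_text in package_to_hmm_text.items():
        hmm_name = read_hmm_name(hmm_text)
        if hmm_name is None:
            raise ValueError(
                "No HMM NAME found for {}".format(package_basename))
        if hmm_name in hmm_name_to_package:
            raise ValueError(
                "Duplicate HMM NAME {} found for {}".format(
                    hmm_name, package_basename))
        hmm_name_to_package[hmm_name] = package_basename
    return hmm_name_to_package


# Example with made-up package basenames and truncated HMM headers.
example = {
    "pkg_a.spkg": "HMMER3/f [3.1b2 | February 2015]\nNAME  rplB\nLENG  10\n",
    "pkg_b.spkg": "HMMER3/f [3.1b2 | February 2015]\nNAME  rpsB\nLENG  12\n",
}
mapping = build_hmm_name_to_package(example)
```

A worker holding this mapping can look up the package basename from the HMM name reported by hmmsearch and use it when constructing the matched protein sequence ID.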

Testing

  • Ran the focused test file with pixi run -e dev pytest test/test_supplement.py.
  • Test results: 7 passed, 2 skipped, 1 warning; the test run completed successfully.

Codex Task

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3aea28f719

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +466 to +467
if hmm_name in hmm_name_to_package:
    raise Exception("Duplicate HMM NAME {} found for {}".format(hmm_name, hmm_path))

P2 Badge Allow metapackages with shared HMM NAMEs

The new duplicate check raises an exception whenever two packages share the same HMM NAME. Metapackages can legitimately include multiple SingleM packages that reuse the same GraftM alignment HMM (e.g., different window sizes or taxonomy variants backed by the same align_hmm), which means their NAME fields collide; in that case supplement will now abort before running hmmsearch even though it previously worked. If you need the package basename in matched protein IDs, consider disambiguating by rewriting HMM NAME entries to include the package basename (or falling back to the original HMM name when duplicates exist) instead of failing on duplicates.
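
One way the disambiguation suggested here might look is sketched below. It is only an illustration under stated assumptions: the function names (`qualify_hmm_name`, `concatenate_hmms`), the `__` separator, and the example basenames are hypothetical, and the sketch assumes package basenames are unique within a metapackage.

```python
import re

# Matches the NAME header line of a HMMER3 profile, keeping the original spacing.
NAME_LINE = re.compile(r"^(NAME\s+)(\S+)", flags=re.MULTILINE)


def qualify_hmm_name(hmm_text, package_basename, separator="__"):
    """Prefix the HMM NAME with the package basename so it is unique per package.

    Hypothetical helper; the separator is an arbitrary choice for illustration.
    """
    def _replace(match):
        return match.group(1) + package_basename + separator + match.group(2)

    new_text, count = NAME_LINE.subn(_replace, hmm_text, count=1)
    if count == 0:
        raise ValueError("No HMM NAME found for {}".format(package_basename))
    return new_text


def concatenate_hmms(package_to_hmm_text):
    """Concatenate profiles after qualifying each NAME with its package basename."""
    return "".join(
        qualify_hmm_name(hmm_text, package_basename)
        for package_basename, hmm_text in package_to_hmm_text.items()
    )


# Two made-up packages that reuse the same alignment HMM NAME.
shared = {
    "pkg_a.spkg": "HMMER3/f\nNAME  rplB\nLENG  10\n//\n",
    "pkg_b.spkg": "HMMER3/f\nNAME  rplB\nLENG  10\n//\n",
}
combined = concatenate_hmms(shared)
```

With names rewritten this way, the HMM name reported for each hit already identifies the package, so packages sharing an alignment HMM no longer collide and no duplicate check is needed at concatenation time.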
