Skip to content

feat: update CohereEmbedding to accept a customizable model name!#32

Merged
amindadgar merged 3 commits intomainfrom
feat/support-custom-cohere-models
Oct 14, 2025
Merged

feat: update CohereEmbedding to accept a customizable model name!#32
amindadgar merged 3 commits intomainfrom
feat/support-custom-cohere-models

Conversation

@amindadgar
Copy link
Member

@amindadgar amindadgar commented Oct 14, 2025

Summary by CodeRabbit

  • New Features

    • You can now provide a custom embedding model and specify its embedding dimension; system falls back to sensible defaults and test-mode mocks when not supplied.
    • Cohere embedding is configurable with a selectable model name (default remains the multilingual embed).
  • Chores

    • Package version bumped to 1.4.7.
    • CI workflows updated to reference a new reusable workflow source.

@coderabbitai
Copy link

coderabbitai bot commented Oct 14, 2025

Walkthrough

Bumped package version; made CohereEmbedding model name configurable; added optional embedding-model and embed-dimension parameters to the ingestion pipeline; and switched reusable CI workflow references from ci.yml@main to ci2.yml@main.

Changes

Cohort / File(s) Summary
Version bump
setup.py
Incremented package version from 1.4.6 to 1.4.7.
Cohere embedding configurability
tc_hivemind_backend/embeddings/cohere.py
CohereEmbedding.__init__ now accepts model_name: str = "embed-multilingual-v3.0" and uses self.model_name in co.embed calls instead of the hardcoded model name.
Ingestion pipeline — optional embedding model & dim
tc_hivemind_backend/ingest_qdrant.py
CustomIngestionPipeline.__init__ signature updated to include `embed_model: BaseEmbedding
CI workflow reference updates
.github/workflows/ci.yml, .github/workflows/publish.yml
Reusable workflow reference changed from ci.yml@main to ci2.yml@main in CI job definitions for pull_request and publish workflows.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CohereEmbed as CohereEmbedding
  participant CohereAPI as Cohere API

  User->>CohereEmbed: init(model_name?)
  note right of CohereEmbed #D6EAF8: stores self.model_name (default\n"embed-multilingual-v3.0")
  User->>CohereEmbed: embed(texts)
  CohereEmbed->>CohereAPI: co.embed(input, model=self.model_name)
  CohereAPI-->>CohereEmbed: embeddings
  CohereEmbed-->>User: vectors
Loading
sequenceDiagram
  autonumber
  actor Caller
  participant Ingest as CustomIngestionPipeline
  participant Emb as Embedding (provided / Mock / Cohere)

  Caller->>Ingest: init(..., embed_model?, embed_dim?, testing?)
  alt embed_model provided
    Ingest->>Ingest: require embed_dim or raise ValueError
    Ingest->>Emb: use provided embed_model (embed_dim must be set)
  else not provided
    alt testing == True
      Ingest->>Emb: use MockEmbedding(embed_dim=resolved_dim)
    else
      Ingest->>Emb: use CohereEmbedding() (default model_name)
    end
  end
  Caller->>Ingest: ingest(...)
  Ingest->>Emb: embed(texts)
  Emb-->>Ingest: vectors
  Ingest-->>Caller: store/send vectors
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I’m a rabbit in a coder’s glen,
I nudged the version, hopped again.
A model name now bounces free,
The pipeline asks which seed to be.
Carrot-powered embeddings, snug as can be 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title clearly and concisely describes a key enhancement: making the CohereEmbedding model name configurable. It uses conventional commit style and avoids extraneous details. Although the PR also includes CI workflow adjustments and pipeline modifications, the title remains focused on a significant functional update that teammates can quickly understand.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/support-custom-cohere-models

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73e909c and 710ff0e.

📒 Files selected for processing (2)
  • setup.py (1 hunks)
  • tc_hivemind_backend/embeddings/cohere.py (3 hunks)
🔇 Additional comments (2)
setup.py (1)

9-9: LGTM! Version bump is appropriate.

The minor version increment from 1.4.6 to 1.4.7 correctly reflects the addition of the new model_name parameter to the CohereEmbedding class.

tc_hivemind_backend/embeddings/cohere.py (1)

43-43: LGTM! Consistent usage of the configurable model name.

Both embedding paths (single-text and batch-text) correctly use self.model_name instead of the hardcoded value, ensuring consistent behavior across all embedding operations.

Also applies to: 55-55

CodeClimate service is unavailable for now.
…ng model and dimension!

Added parameters for embed_model and embed_dim to allow users to specify the embedding model and its dimension during ingestion. Improved error handling for cases where the embed model is provided without a corresponding dimension.
@amindadgar amindadgar merged commit 84d2e8d into main Oct 14, 2025
3 checks passed
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tc_hivemind_backend/ingest_qdrant.py (2)

53-59: Clarify docstring on embed_dim and vector store dimension

Consider noting that embed_dim must match the Qdrant collection vector size to avoid ingestion errors.

         embed_dim: int | None
-            the embedding dimension to use for the ingestion pipeline
-            default is `None`
+            the embedding dimension to use for the ingestion pipeline
+            default is `None`
+            Note: This must match the Qdrant collection vector size; otherwise inserts will fail.

77-84: Default CohereEmbedding: allow env‑driven model name (optional)

To align with “customizable model name,” consider reading an optional COHERE_EMBED_MODEL_NAME env var when constructing the default CohereEmbedding.

Example:

import os  # at top

model_name = os.getenv("COHERE_EMBED_MODEL_NAME")  # e.g., "embed-english-v3.0"
self.embed_model = CohereEmbedding(model_name=model_name) if model_name else CohereEmbedding()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 710ff0e and 907a160.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/publish.yml (1 hunks)
  • tc_hivemind_backend/ingest_qdrant.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tc_hivemind_backend/ingest_qdrant.py (2)
tc_hivemind_backend/db/utils/model_hyperparams.py (1)
  • load_model_hyperparams (6-33)
tc_hivemind_backend/embeddings/cohere.py (1)
  • CohereEmbedding (9-100)
🪛 Ruff (0.14.0)
tc_hivemind_backend/ingest_qdrant.py

63-63: Local variable credentials is assigned to but never used

Remove assignment to unused variable credentials

(F841)


71-71: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)
.github/workflows/publish.yml (1)

10-10: Switch to ci2.yml: verify workflow_call and secrets compatibility

Looks good. Please confirm TogetherCrew/operations/.github/workflows/ci2.yml@main exists, is configured with workflow_call, and expects CC_TEST_REPORTER_ID as provided.

.github/workflows/ci.yml (1)

7-7: Reusable workflow updated to ci2.yml

LGTM. Please verify TogetherCrew/operations/.github/workflows/ci2.yml@main exposes workflow_call and that secret CC_TEST_REPORTER_ID is declared/consumed as expected.

tc_hivemind_backend/ingest_qdrant.py (2)

12-12: Import of BaseEmbedding

LGTM.


35-36: Extensible constructor signature

Adding embed_model and embed_dim is a good extension point.

self.community_id = community_id
self.qdrant_client = QdrantSingleton.get_instance().client

credentials = Credentials()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Remove unused variable (Ruff F841)

credentials = Credentials() is never used; drop it unless you rely on side effects.

-        credentials = Credentials()
+        # credentials removed: no longer needed here

If side effects are required (e.g., loading env), call the specific loader method and name it accordingly.


🏁 Script executed:

#!/usr/bin/env bash
set -e
rg -n 'credentials' -C2 tc_hivemind_backend/ingest_qdrant.py || true

Length of output: 479


Remove unused credentials assignment
Line 63: credentials = Credentials() is never used; delete this line.

🧰 Tools
🪛 Ruff (0.14.0)

63-63: Local variable credentials is assigned to but never used

Remove assignment to unused variable credentials

(F841)

🤖 Prompt for AI Agents
In tc_hivemind_backend/ingest_qdrant.py around line 63, remove the unused
assignment `credentials = Credentials()` which is never referenced elsewhere;
delete this line to avoid dead/unused code and ensure no further references to
`credentials` remain in the function or module.

Comment on lines +64 to +73
if embed_model is None:
if embed_dim is None:
_, self.embedding_dim = load_model_hyperparams()
else:
self.embedding_dim = embed_dim
else:
if embed_dim is None:
raise ValueError("Embed dimension should be provided if embed model is provided!")
self.embedding_dim = embed_dim

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Potential dimension mismatch: embed_dim not propagated to Qdrant vector store

You compute self.embedding_dim (or require it when a custom embed_model is supplied), but it’s not passed to QDrantVectorAccess/setup_qdrant_vector_store(). If that component derives dimension from env (EMBEDDING_DIM) or a default, it can diverge from the actual embedder’s output, leading to runtime errors on insert.

Action:

  • Ensure QDrantVectorAccess creates/opens the collection with the same dimension as self.embedding_dim.
  • Pass it explicitly if supported; otherwise extend the API.

Proposed change here (if setup_qdrant_vector_store accepts embedding_dim):

-        vector_store = vector_access.setup_qdrant_vector_store()
+        vector_store = vector_access.setup_qdrant_vector_store(embedding_dim=self.embedding_dim)

If it doesn’t, update QDrantVectorAccess accordingly. For example:

# tc_hivemind_backend/qdrant_vector_access.py
class QDrantVectorAccess:
    def setup_qdrant_vector_store(self, embedding_dim: int, ...):
        # use embedding_dim to create/open the collection with correct vector size
        ...

Optional safeguard: assert the embedder’s output length equals self.embedding_dim before first upsert (may call external API).

Also applies to: 113-116


🏁 Script executed:

#!/bin/bash
rg -n "class QDrantVectorAccess" -n .
rg -n "def setup_qdrant_vector_store" -n .
rg -n "setup_qdrant_vector_store" -n tc_hivemind_backend/ingest_qdrant.py

Length of output: 253


🏁 Script executed:

rg -n "EMBEDDING_DIM" -n tc_hivemind_backend/qdrant_vector_access.py
rg -n "vector_size" -n tc_hivemind_backend/qdrant_vector_access.py
rg -n "dim" -n tc_hivemind_backend/qdrant_vector_access.py

Length of output: 63


Propagate embedding_dim to QDrantVectorAccess.setup_qdrant_vector_store

The call in ingest_qdrant.py must pass self.embedding_dim to setup_qdrant_vector_store, and the method signature in tc_hivemind_backend/qdrant_vector_access.py should be updated to accept and use this parameter when creating/opening the collection to avoid runtime dimension mismatches.

🧰 Tools
🪛 Ruff (0.14.0)

71-71: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In tc_hivemind_backend/ingest_qdrant.py around lines 64-73, the code sets
self.embedding_dim but does not pass it to
QDrantVectorAccess.setup_qdrant_vector_store; update the call where
setup_qdrant_vector_store is invoked to pass self.embedding_dim. In
tc_hivemind_backend/qdrant_vector_access.py update the setup_qdrant_vector_store
signature to accept an embedding_dim parameter, use that value when creating or
opening the Qdrant collection (set vector size for new collections and validate
against existing collection's vector size), and update any other call sites to
pass the new parameter. Ensure the code raises a clear error if the provided
embedding_dim mismatches an existing collection's dimension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant