feat: update CohereEmbedding to accept a customizable model name! by amindadgar · Pull Request #32 · TogetherCrew/hivemind-backend

amindadgar · 2025-10-14T07:10:22Z

Summary by CodeRabbit

New Features
- You can now provide a custom embedding model and specify its embedding dimension; system falls back to sensible defaults and test-mode mocks when not supplied.
- Cohere embedding is configurable with a selectable model name (default remains the multilingual embed).
Chores
- Package version bumped to 1.4.7.
- CI workflows updated to reference a new reusable workflow source.

…ustomizable model name

coderabbitai · 2025-10-14T07:11:16Z

Walkthrough

Bumped package version; made CohereEmbedding model name configurable; added optional embedding-model and embed-dimension parameters to the ingestion pipeline; and switched reusable CI workflow references from ci.yml@main to ci2.yml@main.

Changes

Cohort / File(s)	Summary
Version bump `setup.py`	Incremented package version from `1.4.6` to `1.4.7`.
Cohere embedding configurability `tc_hivemind_backend/embeddings/cohere.py`	`CohereEmbedding.__init__` now accepts `model_name: str = "embed-multilingual-v3.0"` and uses `self.model_name` in `co.embed` calls instead of the hardcoded model name.
Ingestion pipeline — optional embedding model & dim `tc_hivemind_backend/ingest_qdrant.py`	`CustomIngestionPipeline.__init__` signature updated to include `embed_model: BaseEmbedding
CI workflow reference updates `.github/workflows/ci.yml`, `.github/workflows/publish.yml`	Reusable workflow reference changed from `ci.yml@main` to `ci2.yml@main` in CI job definitions for pull_request and publish workflows.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CohereEmbed as CohereEmbedding
  participant CohereAPI as Cohere API

  User->>CohereEmbed: init(model_name?)
  note right of CohereEmbed #D6EAF8: stores self.model_name (default\n"embed-multilingual-v3.0")
  User->>CohereEmbed: embed(texts)
  CohereEmbed->>CohereAPI: co.embed(input, model=self.model_name)
  CohereAPI-->>CohereEmbed: embeddings
  CohereEmbed-->>User: vectors

sequenceDiagram
  autonumber
  actor Caller
  participant Ingest as CustomIngestionPipeline
  participant Emb as Embedding (provided / Mock / Cohere)

  Caller->>Ingest: init(..., embed_model?, embed_dim?, testing?)
  alt embed_model provided
    Ingest->>Ingest: require embed_dim or raise ValueError
    Ingest->>Emb: use provided embed_model (embed_dim must be set)
  else not provided
    alt testing == True
      Ingest->>Emb: use MockEmbedding(embed_dim=resolved_dim)
    else
      Ingest->>Emb: use CohereEmbedding() (default model_name)
    end
  end
  Caller->>Ingest: ingest(...)
  Ingest->>Emb: embed(texts)
  Emb-->>Ingest: vectors
  Ingest-->>Caller: store/send vectors

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I’m a rabbit in a coder’s glen,
I nudged the version, hopped again.
A model name now bounces free,
The pipeline asks which seed to be.
Carrot-powered embeddings, snug as can be 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title clearly and concisely describes a key enhancement: making the CohereEmbedding model name configurable. It uses conventional commit style and avoids extraneous details. Although the PR also includes CI workflow adjustments and pipeline modifications, the title remains focused on a significant functional update that teammates can quickly understand.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/support-custom-cohere-models

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73e909c and 710ff0e.

📒 Files selected for processing (2)

setup.py (1 hunks)
tc_hivemind_backend/embeddings/cohere.py (3 hunks)

🔇 Additional comments (2)

setup.py (1)

9-9: LGTM! Version bump is appropriate.

The minor version increment from 1.4.6 to 1.4.7 correctly reflects the addition of the new model_name parameter to the CohereEmbedding class.

tc_hivemind_backend/embeddings/cohere.py (1)

43-43: LGTM! Consistent usage of the configurable model name.

Both embedding paths (single-text and batch-text) correctly use self.model_name instead of the hardcoded value, ensuring consistent behavior across all embedding operations.

Also applies to: 55-55

tc_hivemind_backend/embeddings/cohere.py

CodeClimate service is unavailable for now.

…ng model and dimension! Added parameters for embed_model and embed_dim to allow users to specify the embedding model and its dimension during ingestion. Improved error handling for cases where the embed model is provided without a corresponding dimension.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

tc_hivemind_backend/ingest_qdrant.py (2)
53-59: Clarify docstring on embed_dim and vector store dimension

Consider noting that embed_dim must match the Qdrant collection vector size to avoid ingestion errors.
         embed_dim: int | None
-            the embedding dimension to use for the ingestion pipeline
-            default is `None`
+            the embedding dimension to use for the ingestion pipeline
+            default is `None`
+            Note: This must match the Qdrant collection vector size; otherwise inserts will fail.
77-84: Default CohereEmbedding: allow env‑driven model name (optional)

To align with “customizable model name,” consider reading an optional COHERE_EMBED_MODEL_NAME env var when constructing the default CohereEmbedding.

Example:
import os  # at top

model_name = os.getenv("COHERE_EMBED_MODEL_NAME")  # e.g., "embed-english-v3.0"
self.embed_model = CohereEmbedding(model_name=model_name) if model_name else CohereEmbedding()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 710ff0e and 907a160.

📒 Files selected for processing (3)

.github/workflows/ci.yml (1 hunks)
.github/workflows/publish.yml (1 hunks)
tc_hivemind_backend/ingest_qdrant.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tc_hivemind_backend/ingest_qdrant.py (2)

tc_hivemind_backend/db/utils/model_hyperparams.py (1)

load_model_hyperparams (6-33)

tc_hivemind_backend/embeddings/cohere.py (1)

CohereEmbedding (9-100)

🪛 Ruff (0.14.0)

tc_hivemind_backend/ingest_qdrant.py

63-63: Local variable credentials is assigned to but never used

Remove assignment to unused variable credentials

(F841)

71-71: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)

.github/workflows/publish.yml (1)

10-10: Switch to ci2.yml: verify workflow_call and secrets compatibility

Looks good. Please confirm TogetherCrew/operations/.github/workflows/ci2.yml@main exists, is configured with workflow_call, and expects CC_TEST_REPORTER_ID as provided.

.github/workflows/ci.yml (1)

7-7: Reusable workflow updated to ci2.yml

LGTM. Please verify TogetherCrew/operations/.github/workflows/ci2.yml@main exposes workflow_call and that secret CC_TEST_REPORTER_ID is declared/consumed as expected.

tc_hivemind_backend/ingest_qdrant.py (2)

12-12: Import of BaseEmbedding

LGTM.

35-36: Extensible constructor signature

Adding embed_model and embed_dim is a good extension point.

coderabbitai · 2025-10-14T07:27:38Z

tc_hivemind_backend/ingest_qdrant.py

        self.community_id = community_id
        self.qdrant_client = QdrantSingleton.get_instance().client

        credentials = Credentials()


🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Remove unused variable (Ruff F841)

credentials = Credentials() is never used; drop it unless you rely on side effects.

- credentials = Credentials() + # credentials removed: no longer needed here

If side effects are required (e.g., loading env), call the specific loader method and name it accordingly.

🏁 Script executed:

#!/usr/bin/env bash set -e rg -n 'credentials' -C2 tc_hivemind_backend/ingest_qdrant.py || true

Length of output: 479

Remove unused credentials assignment
Line 63: credentials = Credentials() is never used; delete this line.

🧰 Tools

🪛 Ruff (0.14.0)

63-63: Local variable credentials is assigned to but never used

Remove assignment to unused variable credentials

(F841)

🤖 Prompt for AI Agents

In tc_hivemind_backend/ingest_qdrant.py around line 63, remove the unused assignment `credentials = Credentials()` which is never referenced elsewhere; delete this line to avoid dead/unused code and ensure no further references to `credentials` remain in the function or module.

coderabbitai · 2025-10-14T07:27:38Z

tc_hivemind_backend/ingest_qdrant.py

+        if embed_model is None:
+            if embed_dim is None:
+                _, self.embedding_dim = load_model_hyperparams()
+            else:
+                self.embedding_dim = embed_dim
+        else:
+            if embed_dim is None:
+                raise ValueError("Embed dimension should be provided if embed model is provided!")
+            self.embedding_dim = embed_dim
+


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Potential dimension mismatch: embed_dim not propagated to Qdrant vector store

You compute self.embedding_dim (or require it when a custom embed_model is supplied), but it’s not passed to QDrantVectorAccess/setup_qdrant_vector_store(). If that component derives dimension from env (EMBEDDING_DIM) or a default, it can diverge from the actual embedder’s output, leading to runtime errors on insert.

Action:

Ensure QDrantVectorAccess creates/opens the collection with the same dimension as self.embedding_dim.

Pass it explicitly if supported; otherwise extend the API.

Proposed change here (if setup_qdrant_vector_store accepts embedding_dim):

- vector_store = vector_access.setup_qdrant_vector_store() + vector_store = vector_access.setup_qdrant_vector_store(embedding_dim=self.embedding_dim)

If it doesn’t, update QDrantVectorAccess accordingly. For example:

# tc_hivemind_backend/qdrant_vector_access.py class QDrantVectorAccess: def setup_qdrant_vector_store(self, embedding_dim: int, ...): # use embedding_dim to create/open the collection with correct vector size ...

Optional safeguard: assert the embedder’s output length equals self.embedding_dim before first upsert (may call external API).

Also applies to: 113-116

🏁 Script executed:

#!/bin/bash rg -n "class QDrantVectorAccess" -n . rg -n "def setup_qdrant_vector_store" -n . rg -n "setup_qdrant_vector_store" -n tc_hivemind_backend/ingest_qdrant.py

Length of output: 253

🏁 Script executed:

rg -n "EMBEDDING_DIM" -n tc_hivemind_backend/qdrant_vector_access.py rg -n "vector_size" -n tc_hivemind_backend/qdrant_vector_access.py rg -n "dim" -n tc_hivemind_backend/qdrant_vector_access.py

Length of output: 63

Propagate embedding_dim to QDrantVectorAccess.setup_qdrant_vector_store

The call in ingest_qdrant.py must pass self.embedding_dim to setup_qdrant_vector_store, and the method signature in tc_hivemind_backend/qdrant_vector_access.py should be updated to accept and use this parameter when creating/opening the collection to avoid runtime dimension mismatches.

🧰 Tools

🪛 Ruff (0.14.0)

71-71: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In tc_hivemind_backend/ingest_qdrant.py around lines 64-73, the code sets self.embedding_dim but does not pass it to QDrantVectorAccess.setup_qdrant_vector_store; update the call where setup_qdrant_vector_store is invoked to pass self.embedding_dim. In tc_hivemind_backend/qdrant_vector_access.py update the setup_qdrant_vector_store signature to accept an embedding_dim parameter, use that value when creating or opening the Qdrant collection (set vector size for new collections and validate against existing collection's vector size), and update any other call sites to pass the new parameter. Ensure the code raises a clear error if the provided embedding_dim mismatches an existing collection's dimension.

chore: bump version to 1.4.7 and update CohereEmbedding to accept a c…

710ff0e

…ustomizable model name

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

tc_hivemind_backend/embeddings/cohere.py Show resolved Hide resolved

amindadgar added 2 commits October 14, 2025 10:48

feat: remove codeClimate from the project!

c2d1e07

CodeClimate service is unavailable for now.

amindadgar merged commit 84d2e8d into main Oct 14, 2025
3 checks passed

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: update CohereEmbedding to accept a customizable model name!#32

feat: update CohereEmbedding to accept a customizable model name!#32
amindadgar merged 3 commits intomainfrom
feat/support-custom-cohere-models

amindadgar commented Oct 14, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 14, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 14, 2025

Uh oh!

coderabbitai bot Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

amindadgar commented Oct 14, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

amindadgar commented Oct 14, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 14, 2025 •

edited

Loading