feat: update CohereEmbedding to accept a customizable model name!#32
feat: update CohereEmbedding to accept a customizable model name!#32amindadgar merged 3 commits intomainfrom
Conversation
…ustomizable model name
WalkthroughBumped package version; made CohereEmbedding model name configurable; added optional embedding-model and embed-dimension parameters to the ingestion pipeline; and switched reusable CI workflow references from Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CohereEmbed as CohereEmbedding
participant CohereAPI as Cohere API
User->>CohereEmbed: init(model_name?)
note right of CohereEmbed #D6EAF8: stores self.model_name (default\n"embed-multilingual-v3.0")
User->>CohereEmbed: embed(texts)
CohereEmbed->>CohereAPI: co.embed(input, model=self.model_name)
CohereAPI-->>CohereEmbed: embeddings
CohereEmbed-->>User: vectors
sequenceDiagram
autonumber
actor Caller
participant Ingest as CustomIngestionPipeline
participant Emb as Embedding (provided / Mock / Cohere)
Caller->>Ingest: init(..., embed_model?, embed_dim?, testing?)
alt embed_model provided
Ingest->>Ingest: require embed_dim or raise ValueError
Ingest->>Emb: use provided embed_model (embed_dim must be set)
else not provided
alt testing == True
Ingest->>Emb: use MockEmbedding(embed_dim=resolved_dim)
else
Ingest->>Emb: use CohereEmbedding() (default model_name)
end
end
Caller->>Ingest: ingest(...)
Ingest->>Emb: embed(texts)
Emb-->>Ingest: vectors
Ingest-->>Caller: store/send vectors
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
setup.py(1 hunks)tc_hivemind_backend/embeddings/cohere.py(3 hunks)
🔇 Additional comments (2)
setup.py (1)
9-9: LGTM! Version bump is appropriate.The minor version increment from 1.4.6 to 1.4.7 correctly reflects the addition of the new
model_nameparameter to theCohereEmbeddingclass.tc_hivemind_backend/embeddings/cohere.py (1)
43-43: LGTM! Consistent usage of the configurable model name.Both embedding paths (single-text and batch-text) correctly use
self.model_nameinstead of the hardcoded value, ensuring consistent behavior across all embedding operations.Also applies to: 55-55
CodeClimate service is unavailable for now.
…ng model and dimension! Added parameters for embed_model and embed_dim to allow users to specify the embedding model and its dimension during ingestion. Improved error handling for cases where the embed model is provided without a corresponding dimension.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tc_hivemind_backend/ingest_qdrant.py (2)
53-59: Clarify docstring on embed_dim and vector store dimensionConsider noting that embed_dim must match the Qdrant collection vector size to avoid ingestion errors.
embed_dim: int | None - the embedding dimension to use for the ingestion pipeline - default is `None` + the embedding dimension to use for the ingestion pipeline + default is `None` + Note: This must match the Qdrant collection vector size; otherwise inserts will fail.
77-84: Default CohereEmbedding: allow env‑driven model name (optional)To align with “customizable model name,” consider reading an optional COHERE_EMBED_MODEL_NAME env var when constructing the default CohereEmbedding.
Example:
import os # at top model_name = os.getenv("COHERE_EMBED_MODEL_NAME") # e.g., "embed-english-v3.0" self.embed_model = CohereEmbedding(model_name=model_name) if model_name else CohereEmbedding()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/ci.yml(1 hunks).github/workflows/publish.yml(1 hunks)tc_hivemind_backend/ingest_qdrant.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tc_hivemind_backend/ingest_qdrant.py (2)
tc_hivemind_backend/db/utils/model_hyperparams.py (1)
load_model_hyperparams(6-33)tc_hivemind_backend/embeddings/cohere.py (1)
CohereEmbedding(9-100)
🪛 Ruff (0.14.0)
tc_hivemind_backend/ingest_qdrant.py
63-63: Local variable credentials is assigned to but never used
Remove assignment to unused variable credentials
(F841)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (4)
.github/workflows/publish.yml (1)
10-10: Switch to ci2.yml: verify workflow_call and secrets compatibilityLooks good. Please confirm TogetherCrew/operations/.github/workflows/ci2.yml@main exists, is configured with workflow_call, and expects CC_TEST_REPORTER_ID as provided.
.github/workflows/ci.yml (1)
7-7: Reusable workflow updated to ci2.ymlLGTM. Please verify TogetherCrew/operations/.github/workflows/ci2.yml@main exposes workflow_call and that secret CC_TEST_REPORTER_ID is declared/consumed as expected.
tc_hivemind_backend/ingest_qdrant.py (2)
12-12: Import of BaseEmbeddingLGTM.
35-36: Extensible constructor signatureAdding embed_model and embed_dim is a good extension point.
| self.community_id = community_id | ||
| self.qdrant_client = QdrantSingleton.get_instance().client | ||
|
|
||
| credentials = Credentials() |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Remove unused variable (Ruff F841)
credentials = Credentials() is never used; drop it unless you rely on side effects.
- credentials = Credentials()
+ # credentials removed: no longer needed hereIf side effects are required (e.g., loading env), call the specific loader method and name it accordingly.
🏁 Script executed:
#!/usr/bin/env bash
set -e
rg -n 'credentials' -C2 tc_hivemind_backend/ingest_qdrant.py || trueLength of output: 479
Remove unused credentials assignment
Line 63: credentials = Credentials() is never used; delete this line.
🧰 Tools
🪛 Ruff (0.14.0)
63-63: Local variable credentials is assigned to but never used
Remove assignment to unused variable credentials
(F841)
🤖 Prompt for AI Agents
In tc_hivemind_backend/ingest_qdrant.py around line 63, remove the unused
assignment `credentials = Credentials()` which is never referenced elsewhere;
delete this line to avoid dead/unused code and ensure no further references to
`credentials` remain in the function or module.
| if embed_model is None: | ||
| if embed_dim is None: | ||
| _, self.embedding_dim = load_model_hyperparams() | ||
| else: | ||
| self.embedding_dim = embed_dim | ||
| else: | ||
| if embed_dim is None: | ||
| raise ValueError("Embed dimension should be provided if embed model is provided!") | ||
| self.embedding_dim = embed_dim | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
Potential dimension mismatch: embed_dim not propagated to Qdrant vector store
You compute self.embedding_dim (or require it when a custom embed_model is supplied), but it’s not passed to QDrantVectorAccess/setup_qdrant_vector_store(). If that component derives dimension from env (EMBEDDING_DIM) or a default, it can diverge from the actual embedder’s output, leading to runtime errors on insert.
Action:
- Ensure QDrantVectorAccess creates/opens the collection with the same dimension as self.embedding_dim.
- Pass it explicitly if supported; otherwise extend the API.
Proposed change here (if setup_qdrant_vector_store accepts embedding_dim):
- vector_store = vector_access.setup_qdrant_vector_store()
+ vector_store = vector_access.setup_qdrant_vector_store(embedding_dim=self.embedding_dim)If it doesn’t, update QDrantVectorAccess accordingly. For example:
# tc_hivemind_backend/qdrant_vector_access.py
class QDrantVectorAccess:
def setup_qdrant_vector_store(self, embedding_dim: int, ...):
# use embedding_dim to create/open the collection with correct vector size
...Optional safeguard: assert the embedder’s output length equals self.embedding_dim before first upsert (may call external API).
Also applies to: 113-116
🏁 Script executed:
#!/bin/bash
rg -n "class QDrantVectorAccess" -n .
rg -n "def setup_qdrant_vector_store" -n .
rg -n "setup_qdrant_vector_store" -n tc_hivemind_backend/ingest_qdrant.pyLength of output: 253
🏁 Script executed:
rg -n "EMBEDDING_DIM" -n tc_hivemind_backend/qdrant_vector_access.py
rg -n "vector_size" -n tc_hivemind_backend/qdrant_vector_access.py
rg -n "dim" -n tc_hivemind_backend/qdrant_vector_access.pyLength of output: 63
Propagate embedding_dim to QDrantVectorAccess.setup_qdrant_vector_store
The call in ingest_qdrant.py must pass self.embedding_dim to setup_qdrant_vector_store, and the method signature in tc_hivemind_backend/qdrant_vector_access.py should be updated to accept and use this parameter when creating/opening the collection to avoid runtime dimension mismatches.
🧰 Tools
🪛 Ruff (0.14.0)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In tc_hivemind_backend/ingest_qdrant.py around lines 64-73, the code sets
self.embedding_dim but does not pass it to
QDrantVectorAccess.setup_qdrant_vector_store; update the call where
setup_qdrant_vector_store is invoked to pass self.embedding_dim. In
tc_hivemind_backend/qdrant_vector_access.py update the setup_qdrant_vector_store
signature to accept an embedding_dim parameter, use that value when creating or
opening the Qdrant collection (set vector size for new collections and validate
against existing collection's vector size), and update any other call sites to
pass the new parameter. Ensure the code raises a clear error if the provided
embedding_dim mismatches an existing collection's dimension.
Summary by CodeRabbit
New Features
Chores