-
Notifications
You must be signed in to change notification settings - Fork 18
Support more embedding function platform #110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds a new OpenAI-compatible embedding base class and three provider implementations (OpenAI, Qwen, LiteLLM), exposes them via a new package initializer, updates DefaultEmbeddingFunction to be generically typed, removes an unused import in the SentenceTransformer embedding module, and adds unit tests for OpenAI and Qwen embedding functions. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Base as OpenAIBaseEmbeddingFunction
participant OpenAIClient as OpenAI Client Library
participant OpenAIAPI as OpenAI API
Client->>Base: __init__(model_name, api_key_env?, api_base?, dimensions?)
Base->>Base: verify openai installed
Base->>Base: resolve API key via _get_default_api_key_env()
Base->>Base: resolve api_base via _get_default_api_base()
Base->>OpenAIClient: initialize client with api_key, api_base, kwargs
Client->>Base: __call__(input)
Base->>Base: normalize input (str or list)
alt dimensions not provided and model unknown
Base->>OpenAIClient: request sample embedding to probe dimension
OpenAIClient->>OpenAIAPI: POST /embeddings (probe)
OpenAIAPI-->>OpenAIClient: sample embedding
OpenAIClient-->>Base: sample embedding
Base->>Base: infer dimension
end
Base->>OpenAIClient: embeddings.create(model, input, dimensions?)
OpenAIClient->>OpenAIAPI: POST /embeddings
OpenAIAPI-->>OpenAIClient: embeddings response
OpenAIClient-->>Base: response
Base->>Base: extract embeddings, validate count
Base-->>Client: return list of embedding vectors
sequenceDiagram
participant Client
participant Lite as LiteLLMEmbeddingFunction
participant LiteLLMLib as LiteLLM Library
participant Provider as External Provider
Client->>Lite: __init__(model_name, api_key_env?, **kwargs)
Lite->>Lite: validate primitive kwargs, store config, defer import
Client->>Lite: __call__(input)
Lite->>Lite: normalize input (str or list)
alt api_key_env provided
Lite->>Lite: read API key from environment (raise if missing)
end
Lite->>Lite: build embedding_kwargs (model, input, user kwargs)
Lite->>LiteLLMLib: lazy import and call embedding function
LiteLLMLib->>Provider: route request to configured provider
Provider-->>LiteLLMLib: return embeddings response (various shapes)
LiteLLMLib-->>Lite: response
Lite->>Lite: parse multiple response shapes, validate count
Lite-->>Client: return list of embedding vectors
Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used🧬 Code graph analysis (2)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (2)
✏️ Tip: You can disable this entire section by setting Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
68-83: Bug:__call__breaks for single-string input (list("hello")). This violates theDocuments = Union[str, List[str]]contract.Proposed diff
@@ def __call__(self, input: Documents) -> Embeddings: @@ - embeddings = self._model.encode( - list(input), + documents = [input] if isinstance(input, str) else list(input) + embeddings = self._model.encode( + documents, convert_to_numpy=True, normalize_embeddings=self.normalize_embeddings, )
🤖 Fix all issues with AI agents
In @src/pyseekdb/utils/embedding_functions/openai_embedding_function.py:
- Around line 5-80: The collection creation currently probes the embedding
function (causing paid API calls); add a model-dimension mapping and expose a
dimension attribute/parameter so probing is avoided: update
LiteLLMEmbeddingFunction to accept an optional dimension argument and/or provide
a read-only property (e.g., self.dimension) that subclasses can override, then
implement model-specific defaults in OpenAIEmbeddingFunction (and
QwenEmbeddingFunction) by mapping model_name -> dimension (e.g.,
"text-embedding-ada-002":1536, "text-embedding-3-small":1536,
"text-embedding-3-large":3072) and set that.dimension when not explicitly
provided; ensure client_base.py's create_collection checks for
embedding_function.dimension before doing the test probe.
- Around line 45-80: Update the OpenAIEmbeddingFunction __init__ to use
"text-embedding-3-small" as the default model_name instead of
"text-embedding-ada-002" and update the docstring to reflect the new default and
available alternatives ("text-embedding-3-small" and "text-embedding-3-large");
additionally, when an OpenAI-compatible proxy is used (detect via the presence
of api_base in kwargs), ensure the LiteLLM model name is prefixed with "openai/"
(e.g., "openai/text-embedding-3-small") or explicitly document that callers must
supply the "openai/" prefix so requests route to the proxy rather than the real
OpenAI service (make the change in the __init__ method handling
model_name/api_base and update the docstring accordingly).
In
@src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py:
- Around line 12-25: The example uses collection.query("How are you?", top_k=1)
which no longer matches the current API; update the doc example to call
collection.query(query_texts=["How are you?"], n_results=1) (or the exact
parameter names used elsewhere) and ensure the rest of the snippet
(ids/documents/metadatas, SentenceTransformerEmbeddingFunction instantiation)
remains unchanged so it runs without copy/paste errors.
In @tests/integration_tests/test_openai_embedding_function.py:
- Around line 28-40: In test_litellm_env replace any use of bare `assert False`
with `pytest.fail(<message>)` or explicitly `raise AssertionError(<message>)` so
failures aren’t skipped under Python -O; also convert the stray triple-quoted
comment inside the function (the second docstring-like line) into a normal
inline or block comment to avoid confusion; update messages to include context
(e.g., "litellm package is not installed" and "OPENAI_API_KEY environment
variable is not set") and keep changes localized to the test_litellm_env
function.
In @tests/integration_tests/test_qwen_embedding_function.py:
- Around line 173-178: The query string uses a fullwidth question mark which can
break linters/tools; update the variable query_text (used in the
collection.query call) to use an ASCII question mark instead (e.g., change
"什么是嵌入?" to "什么是嵌入?") so the test avoids tooling surprises.
- Line 210: Remove the unnecessary f-string prefix in the test print call:
replace the print(f" Embedding generated successfully with custom parameters")
invocation in tests/integration_tests/test_qwen_embedding_function.py with a
plain string print (no interpolation) so it becomes print(" Embedding
generated successfully with custom parameters"); this updates the print
statement that has no placeholders to avoid misleading f-string usage.
- Around line 31-42: The test_litellm_env function uses "assert False" which can
be stripped with python -O; replace those with explicit raises of AssertionError
including the same messages (e.g., raise AssertionError("litellm package is not
installed") and raise AssertionError("QWEN_API_KEY environment variable is not
set")) so failures are always raised regardless of optimization flags.
🧹 Nitpick comments (7)
src/pyseekdb/client/embedding_function.py (1)
176-259: Avoid broadexcept Exception+ uselogger.exception()for download failures. The current pattern can hide root causes and loses stack traces.Proposed diff
@@ - logger.info("Successfully downloaded all model files from Hugging Face") - return True + logger.info("Successfully downloaded all model files from Hugging Face") + return True @@ - except Exception as e: - logger.error(f"Error downloading from Hugging Face: {e}") - return False + except Exception: + logger.exception("Error downloading from Hugging Face") + return Falsesrc/pyseekdb/utils/embedding_functions/__init__.py (1)
8-18: Export module looks good; consider satisfyingRUF022(__all__sort) if lint is enforced.tests/integration_tests/test_qwen_embedding_function.py (2)
67-108: Avoid printing full embeddings to logs. This is very noisy (and can get huge); logging the dimension is typically enough.
172-191: Cleanupexcept Exceptionis fine for test hygiene, but at least log the traceback for failures.tests/integration_tests/test_openai_embedding_function.py (2)
41-64: Consider removing redundanttest_litellm_env()calls.Since the entire test class is already decorated with
@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), ...), the API key check intest_litellm_env()is redundant when called from other test methods. The litellm import check could be moved to a class-level fixture orsetup_classmethod for cleaner test organization.
181-186: Consider catching a more specific exception.Catching bare
Exceptioncan mask unexpected errors. Consider catching specific exceptions thatdelete_collectionmight raise.Proposed fix
# Cleanup try: db_client.delete_collection(name=collection_name) print(f" Cleaned up collection: {collection_name}") - except Exception as e: + except (ValueError, RuntimeError) as e: # Adjust based on actual exceptions raised print(f" Warning: Failed to cleanup collection: {e}")src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
82-95: Chain the exception and useTypeErrorfor type validation.Two improvements for exception handling:
- Use
raise ... from errto preserve the exception chain forImportError- Use
TypeErrorinstead ofValueErrorwhen validating argument types (line 93)Proposed fix
try: from litellm import embedding - except ImportError: + except ImportError as err: raise ValueError( "The litellm python package is not installed. Please install it with `pip install litellm`" - ) + ) from err self.model_name = model_name self.api_key_env = api_key_env for key, value in kwargs.items(): if not isinstance(value, (str, int, float, bool, list, dict, tuple, type(None))): - raise ValueError(f"Keyword argument {key} is not a primitive type") + raise TypeError(f"Keyword argument '{key}' must be a primitive type, got {type(value).__name__}") self.kwargs = kwargs self._embedding_func = embedding
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
src/pyseekdb/client/embedding_function.pysrc/pyseekdb/utils/embedding_functions/__init__.pysrc/pyseekdb/utils/embedding_functions/litellm_embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_embedding_function.pysrc/pyseekdb/utils/embedding_functions/qwen_embedding_function.pysrc/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.pytests/integration_tests/test_openai_embedding_function.pytests/integration_tests/test_qwen_embedding_function.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-22T12:44:52.456Z
Learnt from: longbingljw
Repo: oceanbase/pyseekdb PR: 91
File: tests/integration_tests/test_collection_dml.py:36-36
Timestamp: 2025-12-22T12:44:52.456Z
Learning: In integration tests under tests/integration_tests, it is acceptable to use the internal db_client._server._execute() to run raw SQL (e.g., CREATE TABLE) when needing custom table setup with specific vector index configurations. Prefer using public/test harness APIs for common setup, and clearly document and limit the use of internal calls to ensure test reliability. Ensure proper cleanup and isolation for each test, and avoid leaking internal APIs into production or non-test code.
Applied to files:
tests/integration_tests/test_openai_embedding_function.pytests/integration_tests/test_qwen_embedding_function.py
📚 Learning: 2025-12-22T12:45:51.412Z
Learnt from: longbingljw
Repo: oceanbase/pyseekdb PR: 91
File: tests/integration_tests/test_collection_hybrid_search_builder_integration.py:127-132
Timestamp: 2025-12-22T12:45:51.412Z
Learning: In integration test Python files under tests/integration_tests, it is acceptable to use string interpolation to construct SQL INSERT statements for test data when the data is controlled and internal to the test. This pattern should only be used for trusted, test-only data and not with untrusted input; ensure the injection risk is mitigated by keeping inputs deterministic, non-user-supplied, and isolated to the test environment. If possible, prefer parameterized queries in non-test code, but in these tests, interpolate values selectively when you can guarantee safety.
Applied to files:
tests/integration_tests/test_openai_embedding_function.pytests/integration_tests/test_qwen_embedding_function.py
🧬 Code graph analysis (7)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
LiteLLMEmbeddingFunction(6-179)
tests/integration_tests/test_openai_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
OpenAIEmbeddingFunction(5-80)src/pyseekdb/client/embedding_function.py (1)
dimension_of(67-82)
src/pyseekdb/utils/embedding_functions/__init__.py (4)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
SentenceTransformerEmbeddingFunction(5-83)src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
LiteLLMEmbeddingFunction(6-179)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
OpenAIEmbeddingFunction(5-80)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
QwenEmbeddingFunction(5-92)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
LiteLLMEmbeddingFunction(6-179)
tests/integration_tests/test_qwen_embedding_function.py (5)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
QwenEmbeddingFunction(5-92)src/pyseekdb/client/collection.py (5)
client(78-80)embedding_function(88-90)get(364-428)name(63-65)add(103-149)src/pyseekdb/client/embedding_function.py (1)
dimension_of(67-82)tests/integration_tests/conftest.py (1)
db_client(147-181)src/pyseekdb/client/meta_info.py (1)
collection_name(22-26)
src/pyseekdb/client/embedding_function.py (1)
src/pyseekdb/client/collection.py (1)
dimension(73-75)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (2)
src/pyseekdb/client/collection.py (2)
client(78-80)embedding_function(88-90)src/pyseekdb/client/embedding_function.py (1)
EmbeddingFunction(38-65)
🪛 Ruff (0.14.10)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py
85-87: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
85-87: Avoid specifying long messages outside the exception class
(TRY003)
93-93: Prefer TypeError exception for invalid type
(TRY004)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
125-128: Avoid specifying long messages outside the exception class
(TRY003)
150-150: Avoid specifying long messages outside the exception class
(TRY003)
160-160: Avoid specifying long messages outside the exception class
(TRY003)
169-169: Avoid specifying long messages outside the exception class
(TRY003)
171-171: Avoid specifying long messages outside the exception class
(TRY003)
175-177: Avoid specifying long messages outside the exception class
(TRY003)
tests/integration_tests/test_openai_embedding_function.py
34-34: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
39-39: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
185-185: Do not catch blind exception: Exception
(BLE001)
src/pyseekdb/utils/embedding_functions/__init__.py
13-18: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tests/integration_tests/test_qwen_embedding_function.py
37-37: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
41-41: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
173-173: String contains ambiguous ? (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?
(RUF001)
189-189: Do not catch blind exception: Exception
(BLE001)
210-210: f-string without any placeholders
Remove extraneous f prefix
(F541)
src/pyseekdb/client/embedding_function.py
254-254: Consider moving this statement to an else block
(TRY300)
256-256: Do not catch blind exception: Exception
(BLE001)
257-257: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-test (embedded)
- GitHub Check: integration-test (oceanbase)
- GitHub Check: integration-test (server)
🔇 Additional comments (13)
src/pyseekdb/client/embedding_function.py (2)
37-66: GenericEmbeddingFunctionprotocol update looks consistent.
84-95:DefaultEmbeddingFunction(EmbeddingFunction[Documents])aligns typing with the rest of the embedding ecosystem.src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
1-1: Import surface change (droppingSpace) is fine.src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
1-2: LGTM!Imports are appropriate and minimal.
5-40: LGTM!The class docstring is comprehensive with clear examples demonstrating installation, environment variable setup, and usage patterns with collections.
42-92: LGTM!The
__init__method correctly sets Qwen-specific defaults:
api_key_envdefaults to"QWEN_API_KEY"api_basedefaults to Qwen's DashScope endpointcustom_llm_providerset to"openai"for OpenAI-compatible APIencoding_formatdefaults to"float"The implementation follows the same pattern as
OpenAIEmbeddingFunctionwhile adding necessary Qwen-specific configuration.Minor: The variable
api_base_providedon line 73 could be inlined into the condition, but this is a stylistic preference and the current form is readable.tests/integration_tests/test_openai_embedding_function.py (3)
65-106: LGTM!Good coverage of single document, multiple documents, and empty input scenarios. The assertions properly validate the embedding structure and dimensions.
107-120: LGTM!Properly tests dimension detection using the
dimension_ofhelper and validates the expected 1536 dimensions fortext-embedding-ada-002.
188-214: LGTM!Good coverage of different OpenAI embedding models with correct dimension assertions:
text-embedding-ada-002: 1536 dimensionstext-embedding-3-small: 1536 dimensionstext-embedding-3-large: 3072 dimensionssrc/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (4)
1-4: LGTM!Imports are appropriate and minimal.
6-51: LGTM!Comprehensive docstring with clear examples covering multiple providers (OpenAI, Hugging Face) and common configuration options.
97-132: LGTM!The
__call__method properly:
- Normalizes single string input to a list
- Handles empty input gracefully
- Builds embedding kwargs with model, input, and additional parameters
- Validates API key presence when
api_key_envis specified
134-179: LGTM!The response parsing logic is robust, handling multiple response formats:
EmbeddingResponseobject with.dataattribute (most common)- Dict response with
"data"key (backward compatibility)- List response (backward compatibility)
The validation at the end ensures the number of embeddings matches the input count, which is important for data integrity.
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
Outdated
Show resolved
Hide resolved
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for OpenAI and Qwen embedding platforms through LiteLLM integration. The implementation introduces a base LiteLLMEmbeddingFunction class that provides unified access to multiple embedding model providers, with specialized wrapper classes for OpenAI and Qwen that set provider-specific defaults.
Changes:
- Added LiteLLMEmbeddingFunction base class supporting 100+ LLM platforms through LiteLLM SDK
- Implemented OpenAIEmbeddingFunction and QwenEmbeddingFunction wrapper classes with platform-specific configurations
- Added comprehensive integration tests for both platforms (skipped by default due to API key requirements)
- Updated documentation formatting and removed unused imports in existing embedding function code
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py | Base implementation for LiteLLM-based embedding functions with comprehensive response parsing |
| src/pyseekdb/utils/embedding_functions/openai_embedding_function.py | OpenAI-specific wrapper with default model and configuration |
| src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py | Qwen-specific wrapper with Alibaba Cloud endpoint and OpenAI-compatible API settings |
| src/pyseekdb/utils/embedding_functions/init.py | Module exports for new embedding function classes |
| src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py | Removed unused Space import and improved documentation formatting |
| src/pyseekdb/client/embedding_function.py | Formatting improvements and type annotation additions to DefaultEmbeddingFunction |
| tests/integration_tests/test_openai_embedding_function.py | Comprehensive integration tests for OpenAI embedding function with multiple model variants |
| tests/integration_tests/test_qwen_embedding_function.py | Comprehensive integration tests for Qwen embedding function with Chinese text examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @tests/unit_tests/test_openai_embedding_function.py:
- Around line 181-193: The test docstring and intent mismatch:
test_dimension_property_unknown_model expects an API fallback but uses a known
model from _OPENAI_MODEL_DIMENSIONS; update the test to use an unknown model
name (e.g., any string not present in _OPENAI_MODEL_DIMENSIONS) so
OpenAIEmbeddingFunction(model_name=...) triggers the API call for dimension
detection, or alternatively change the docstring/name to reflect that it is
testing a known model; modify test_dimension_property_unknown_model to reference
OpenAIEmbeddingFunction and _OPENAI_MODEL_DIMENSIONS accordingly so the behavior
matches the test intent.
In @tests/unit_tests/test_qwen_embedding_function.py:
- Around line 173-185: The test method test_dimension_property_unknown_model is
mislabeled: it claims to exercise the API fallback for unknown models but
constructs QwenEmbeddingFunction(model_name="text-embedding-v1"), which is
present in _QWEN_MODEL_DIMENSIONS so no API call occurs; either change the
docstring to describe that it verifies a known model dimension, or replace the
model_name with a truly unknown identifier (e.g., "unknown-embedding-model") to
force QwenEmbeddingFunction.dimension to call the API and validate the fallback
behavior.
🧹 Nitpick comments (10)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)
56-61: Chain the exception for better debugging context.When re-raising an exception within an
exceptblock, useraise ... from errto preserve the original traceback.♻️ Proposed fix
try: from openai import OpenAI except ImportError: - raise ValueError( + raise ValueError( "The openai python package is not installed. Please install it with `pip install openai`" - ) + ) from None
77-92: Remove duplicate storage ofmodel_name.
model_nameis stored twice: asself.model_name(line 78) andself._model_name(line 92). The_model_nameassignment appears redundant sinceself.model_nameis already available and used in__call__.♻️ Proposed fix
# Store configuration self.model_name = model_name self.api_key_env = api_key_env self.api_base = api_base self._dimensions_param = dimensions self._client_kwargs = kwargs # Initialize OpenAI client self._client = OpenAI( api_key=api_key, base_url=api_base, **kwargs ) - - # Store original model name for dimension lookup - self._model_name = model_nameThen update the
dimensionproperty to useself.model_name:- if self._model_name in model_dimensions: - return model_dimensions[self._model_name] + if self.model_name in model_dimensions: + return model_dimensions[self.model_name]
154-160: Chain the exception for debugging context.When catching an exception and raising a new one, preserve the original cause using
raise ... from e.♻️ Proposed fix
try: embeddings = self([test_input]) except Exception as e: - raise RuntimeError(f"Failed to determine embedding dimension via API call: {e}") + raise RuntimeError(f"Failed to determine embedding dimension via API call: {e}") from etests/unit_tests/test_qwen_embedding_function.py (4)
27-38: Replaceassert Falsewithpytest.fail()for proper test failures.
assert Falseis removed when Python runs with optimization (-O), making these assertions ineffective. Usepytest.fail()instead.♻️ Proposed fix
def test_qwen_env(self): """Test if openai package is installed and required environment variables are set.""" try: import openai except ImportError: print("openai package is not installed") - assert False, "openai package is not installed" + pytest.fail("openai package is not installed") if not os.environ.get("DASHSCOPE_API_KEY"): print("DASHSCOPE_API_KEY environment variable is not set") - assert False, "DASHSCOPE_API_KEY environment variable is not set" + pytest.fail("DASHSCOPE_API_KEY environment variable is not set")
72-87: Avoid polluting environment with fake API key.Setting a fake API key (
"your-custom-key") in the environment could cause issues if subsequent tests attempt real API calls. Consider usingmonkeypatchfixture for isolated environment changes.♻️ Proposed fix using monkeypatch
- def test_initialization_with_custom_api_key_env(self): + def test_initialization_with_custom_api_key_env(self, monkeypatch): """Test QwenEmbeddingFunction initialization with custom API key env""" print("\n✅ Testing QwenEmbeddingFunction initialization with custom API key env") self.test_qwen_env() custom_key_env = "CUSTOM_QWEN_KEY" - if not os.environ.get(custom_key_env): - os.environ[custom_key_env] = "your-custom-key" + monkeypatch.setenv(custom_key_env, "test-key-value") ef = QwenEmbeddingFunction( model_name="text-embedding-v1", api_key_env=custom_key_env ) assert ef.api_key_env == custom_key_env print(f" Custom API key env: {ef.api_key_env}")
219-222: Minor: Unused loop variable and f-string without placeholders.Per static analysis hints, rename
ito_iand remove the extraneousfprefix.♻️ Proposed fix
- for i, emb in enumerate(embeddings): + for _i, emb in enumerate(embeddings): assert isinstance(emb, list) - assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension" + assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"
117-129: Minor: f-string without placeholders.Remove the extraneous
fprefix on line 129.♻️ Proposed fix
assert ef is not None - print(f" Initialized with timeout and max_retries") + print(" Initialized with timeout and max_retries")tests/unit_tests/test_openai_embedding_function.py (3)
27-38: Replaceassert Falsewithpytest.fail()for proper test failures.Same issue as Qwen tests -
assert Falseis removed with Python optimization flag.♻️ Proposed fix
def test_openai_env(self): """Test if openai package is installed and required environment variables are set.""" try: import openai except ImportError: print("openai package is not installed") - assert False, "openai package is not installed" + pytest.fail("openai package is not installed") if not os.environ.get("OPENAI_API_KEY"): print("OPENAI_API_KEY environment variable is not set") - assert False, "OPENAI_API_KEY environment variable is not set" + pytest.fail("OPENAI_API_KEY environment variable is not set")
227-231: Minor: Unused loop variable and f-string without placeholders.Per static analysis hints, rename
ito_iand remove the extraneousfprefix.♻️ Proposed fix
- for i, emb in enumerate(embeddings): + for _i, emb in enumerate(embeddings): assert isinstance(emb, list) - assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension" + assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"
109-120: Minor: f-string without placeholders.Remove the extraneous
fprefix on line 120.♻️ Proposed fix
assert ef is not None - print(f" Initialized with timeout and max_retries") + print(" Initialized with timeout and max_retries")
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_embedding_function.pysrc/pyseekdb/utils/embedding_functions/qwen_embedding_function.pytests/unit_tests/test_openai_embedding_function.pytests/unit_tests/test_qwen_embedding_function.py
🧰 Additional context used
🧬 Code graph analysis (5)
tests/unit_tests/test_qwen_embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
_get_default_api_base(95-101)_get_default_api_key_env(103-109)_get_model_dimensions(111-117)src/pyseekdb/client/embedding_function.py (2)
dimension_of(67-82)model(351-394)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
dimension(129-160)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
_get_default_api_base(97-103)_get_default_api_key_env(105-111)_get_model_dimensions(113-119)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
OpenAIBaseEmbeddingFunction(6-201)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)
tests/unit_tests/test_openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
OpenAIEmbeddingFunction(13-119)_get_default_api_base(97-103)_get_default_api_key_env(105-111)_get_model_dimensions(113-119)src/pyseekdb/client/embedding_function.py (1)
dimension_of(67-82)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
dimension(129-160)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
_get_default_api_key_env(105-111)_get_default_api_base(97-103)_get_model_dimensions(113-119)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
OpenAIBaseEmbeddingFunction(6-201)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
_get_default_api_base(97-103)_get_default_api_key_env(105-111)_get_model_dimensions(113-119)
🪛 Ruff (0.14.10)
tests/unit_tests/test_qwen_embedding_function.py
33-33: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
37-37: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
129-129: f-string without any placeholders
Remove extraneous f prefix
(F541)
219-219: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
221-221: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/unit_tests/test_openai_embedding_function.py
33-33: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
37-37: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
120-120: f-string without any placeholders
Remove extraneous f prefix
(F541)
227-227: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
229-229: f-string without any placeholders
Remove extraneous f prefix
(F541)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py
59-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
59-61: Avoid specifying long messages outside the exception class
(TRY003)
72-75: Avoid specifying long messages outside the exception class
(TRY003)
156-156: Do not catch blind exception: Exception
(BLE001)
157-157: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
157-157: Avoid specifying long messages outside the exception class
(TRY003)
159-159: Avoid specifying long messages outside the exception class
(TRY003)
197-199: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-test (server)
- GitHub Check: integration-test (oceanbase)
- GitHub Check: integration-test (embedded)
🔇 Additional comments (15)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)
1-4: LGTM! Clean imports and module setup.The imports are appropriate for the base class functionality.
94-127: LGTM! Well-designed abstract hooks.The abstract method pattern with
NotImplementedErrorfor required overrides and a default empty dict for optional_get_model_dimensionsis appropriate for the extensibility goal.
162-201: LGTM! Robust implementation with proper input handling and validation.The
__call__method correctly handles single strings, empty inputs, and validates the response count. The implementation follows theEmbeddingFunctionprotocol appropriately.src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
1-11: LGTM! Clean imports and model dimension mapping.The
_OPENAI_MODEL_DIMENSIONSdictionary correctly maps known OpenAI embedding models to their default dimensions, with proper source documentation.
13-56: LGTM! Comprehensive documentation with practical examples.The class docstring provides clear usage examples including default initialization, custom models, and dimension reduction scenarios.
58-95: LGTM! Clean delegation to base class.The constructor properly delegates all parameters to the base class while providing sensible defaults (model_name defaulting to "text-embedding-ada-002").
97-119: LGTM! Hook methods correctly implement OpenAI-specific configuration.The
_get_default_api_base,_get_default_api_key_env, and_get_model_dimensionsmethods provide the correct OpenAI-specific values.src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
1-12: LGTM! Appropriate model dimension mapping for Qwen models.The
_QWEN_MODEL_DIMENSIONSdictionary correctly maps Qwen embedding model versions to their dimensions.
14-55: LGTM! Clear documentation with Qwen-specific examples.The docstring provides appropriate examples demonstrating initialization with required model_name and optional parameters.
57-93: LGTM! Constructor properly requires model_name.Unlike OpenAIEmbeddingFunction,
model_nameis a required parameter here, which is appropriate since Qwen doesn't have a single canonical default model.
95-117: LGTM! Qwen-specific hook implementations.The hook methods correctly return Qwen's DashScope API base URL, API key environment variable name, and model dimension mapping.
tests/unit_tests/test_qwen_embedding_function.py (2)
131-143: LGTM! Good test for missing API key validation.The test correctly saves, removes, tests, and restores the environment variable to validate the error handling.
225-234: LGTM! Good empty input edge case coverage.Testing empty input returns empty list validates the base class behavior correctly.
tests/unit_tests/test_openai_embedding_function.py (2)
70-82: Good: Usespytest.skipfor missing custom env var.Unlike the Qwen test, this correctly skips when the custom environment variable isn't set rather than polluting the environment with a fake value.
336-351: LGTM! Good parametric test for model dimensions.The test efficiently validates all known models return correct dimensions using a test case list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 28-30: The example comment above the OpenAIEmbeddingFunction
instantiation references the old default model "text-embedding-ada-002"; update
that comment to match the actual default model used by OpenAIEmbeddingFunction
(text-embedding-3-small) or make the comment generic. Specifically, edit the
doc/example lines where ef = OpenAIEmbeddingFunction() is shown to either state
"Using default model (text-embedding-3-small)" or remove the model name, and
ensure consistency with the default defined in OpenAIEmbeddingFunction (the
default_model value inside that class/constructor).
In `@src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py`:
- Around line 4-11: Update the comment near _QWEN_MODEL_DIMENSIONS to document
that text-embedding-v3 and text-embedding-v4 support variable output dimensions
(not just the defaults currently listed); specifically note that v3 supports
1024, 768, and 512 and v4 supports 1024, 2048, 1536, 768, 512, 256, 128, and 64,
and that the output size can be configured via the DashScope API—leave the
default values in the mapping but add a clarifying comment referencing
_QWEN_MODEL_DIMENSIONS and the DashScope configuration option.
In `@tests/unit_tests/test_openai_embedding_function.py`:
- Around line 70-82: The test test_initialization_with_custom_api_key_env leaks
the CUSTOM_OPENAI_KEY environment variable; fix it by saving the original
os.environ.get("CUSTOM_OPENAI_KEY") before setting it, then wrap the setup and
assertion in a try/finally (or use a context manager) and in the finally block
restore the original value (reassign if not None, or delete the key if it was
absent) so OpenAIEmbeddingFunction(api_key_env=custom_key_env) runs but the
environment is restored afterward.
- Around line 27-38: The test_openai_env function uses `assert False` which can
be ignored when Python runs with -O; replace each `assert False, "message"` with
an explicit `raise AssertionError("message")` so failures are always raised
(update both the ImportError branch and the OPENAI_API_KEY check in
`test_openai_env`).
♻️ Duplicate comments (1)
tests/unit_tests/test_openai_embedding_function.py (1)
181-193: Test name and docstring are misleading - model is known, not unknown.The test
test_dimension_property_unknown_modeluses"text-embedding-ada-002", which is in_OPENAI_MODEL_DIMENSIONS. This won't trigger the API fallback for dimension detection. Either rename the test to reflect it tests a known model, or use an actually unknown model name.📝 Suggested fix (option 1: rename to reflect actual behavior)
def test_dimension_property_unknown_model(self): - """Test dimension property for unknown model (should make API call)""" - print("\n✅ Testing OpenAIEmbeddingFunction dimension property for unknown model") + """Test dimension property for known model returns expected value""" + print("\n✅ Testing OpenAIEmbeddingFunction dimension property for known model") self.test_openai_env() - # This will make an actual API call to get dimension ef = OpenAIEmbeddingFunction(model_name="text-embedding-ada-002") dim = ef.dimension # Should have a valid dimension - assert dim > 0 - print(f" Unknown model dimension (via API call): {dim}") + assert dim == 1536 + print(f" Known model dimension: {dim}")
🧹 Nitpick comments (2)
tests/unit_tests/test_openai_embedding_function.py (2)
120-120: Remove extraneous f-string prefix.This f-string has no placeholders.
- print(f" Initialized with timeout and max_retries") + print(" Initialized with timeout and max_retries")
227-230: Minor static analysis issues: unused loop variable and f-string without placeholders.📝 Suggested fix
- for i, emb in enumerate(embeddings): + for _i, emb in enumerate(embeddings): assert isinstance(emb, list) - assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension" + assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"Alternatively, if you don't need the index at all:
- for i, emb in enumerate(embeddings): + for emb in embeddings: assert isinstance(emb, list) - assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension" + assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.pysrc/pyseekdb/utils/embedding_functions/qwen_embedding_function.pytests/unit_tests/test_openai_embedding_function.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
OpenAIBaseEmbeddingFunction(6-201)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
_get_default_api_base(97-103)_get_default_api_key_env(105-111)_get_model_dimensions(113-119)
tests/unit_tests/test_openai_embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
OpenAIEmbeddingFunction(13-119)_get_default_api_base(97-103)_get_default_api_key_env(105-111)_get_model_dimensions(113-119)src/pyseekdb/client/embedding_function.py (1)
dimension_of(67-82)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
dimension(129-160)_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
_get_default_api_base(95-101)_get_default_api_key_env(103-109)_get_model_dimensions(111-117)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)
_get_default_api_base(94-104)_get_default_api_key_env(106-116)_get_model_dimensions(118-126)
🪛 Ruff (0.14.11)
tests/unit_tests/test_openai_embedding_function.py
33-33: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
37-37: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
120-120: f-string without any placeholders
Remove extraneous f prefix
(F541)
227-227: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
229-229: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-test (oceanbase)
- GitHub Check: integration-test (server)
- GitHub Check: integration-test (embedded)
🔇 Additional comments (10)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
1-11: Model dimensions mapping looks good.The
_OPENAI_MODEL_DIMENSIONSconstant correctly maps known OpenAI embedding models to their dimensions, which avoids unnecessary API calls during dimension discovery for known models. This addresses the prior concern about paid API calls during collection creation.
58-95: Implementation correctly wires base class with OpenAI-specific defaults.The constructor properly delegates to the base class and the default model is now
"text-embedding-3-small"(the current recommended model). The parameter documentation is thorough and accurate.
97-119: Helper methods correctly implement provider-specific overrides.The
_get_default_api_base,_get_default_api_key_env, and_get_model_dimensionsmethods properly implement the abstract methods fromOpenAIBaseEmbeddingFunction, returning OpenAI-specific defaults.tests/unit_tests/test_openai_embedding_function.py (3)
1-24: Test structure and skip logic are well-designed.The test suite correctly skips when
OPENAI_API_KEYis not set, preventing CI failures while allowing manual testing with real API calls. The module docstring clearly explains how to run the tests.
136-159: Dimension tests for known models are thorough.Good coverage of all three known OpenAI models with correct expected dimensions.
244-270: Dimension parameter tests validate actual API behavior.These tests make real API calls to verify that the
dimensionsparameter correctly reduces embedding dimensions for text-embedding-3 models. This is valuable for ensuring the feature works end-to-end.src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
1-2: LGTM!Imports are correct and minimal.
14-55: LGTM!Well-documented class with comprehensive usage examples covering default initialization, custom parameters, and integration with the collection API.
57-93: LGTM!Constructor correctly delegates to the base class and documentation is thorough. The template method pattern works correctly here—Python's dynamic dispatch ensures the subclass overrides (
_get_default_api_base,_get_default_api_key_env) are called from the parent's__init__.
95-117: LGTM!Helper methods correctly implement the template method hooks with appropriate Qwen/DashScope defaults. The implementation is consistent with
OpenAIEmbeddingFunction.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Summary
close #109
Support OpenAI and Qwen embedding platform based on OpenAI.
Solution Description
Qwen is a famous LLM platform in China and it has a type of API compatible with OpenAI.
So, I implemented an OpenAIBaseEmbeddingFunction class and supporting OpenAI and Qwen based on it.
It cost tokens when we test the embedding functions, so we'll not test them in github action.
Sorry that I can't reach openai.com from my compute, I can't test the OpenAIEmbeddingFunction now.
Another information:
LiteLLM is a Python SDK supporting more than 100 LLM platforms, including OpenAI. But when I Qwen with it, I got some stupid errors, such as
dimensionsparameter is not supported,float-formatis not supported. So I changed the base implementation to OpenAI. I can't find Qwen/dashscope embedding provider from LiteLLM official document yet even though there's a completion provide of dashscope.Qwen test result:
Summary by CodeRabbit
New Features
Behavior / Reliability
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.