-
Notifications
You must be signed in to change notification settings - Fork 18
[Draft] support embedding function persistence #114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThe changes implement embedding function serialization and reconstruction capabilities by introducing Changes
Sequence DiagramsequenceDiagram
participant Client
participant EmbeddingFunction as EmbeddingFunction<br/>(Protocol)
participant Concrete as Concrete Implementation<br/>(e.g., OpenAI)
participant Storage as Configuration<br/>Storage
rect rgba(100, 150, 255, 0.5)
Note over Client,Storage: Serialization Flow
Client->>Concrete: instance = OpenAIEmbeddingFunction(model="gpt-4")
Client->>Concrete: name = instance.name()
Concrete-->>Client: "openai"
Client->>Concrete: config = instance.get_config()
Concrete-->>Client: {model_name, api_key_env, ...}
Client->>Storage: save(name, config)
end
rect rgba(255, 150, 100, 0.5)
Note over Client,Storage: Deserialization Flow
Client->>Storage: (name, config) = load()
Storage-->>Client: ("openai", {model_name, ...})
Client->>EmbeddingFunction: instance = EmbeddingFunction.build_from_config(config)
EmbeddingFunction->>Concrete: build_from_config(config)
Concrete-->>Client: OpenAIEmbeddingFunction instance
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Line 8: The requires-python constraint currently uses ">=3.11,<=3.13" which
wrongly excludes 3.13.x patch releases; update the requires-python entry (the
requires-python key in pyproject.toml) to allow all 3.13 patch versions (for
example use a range like ">=3.11,<3.14") so 3.13.1+ are accepted while still
capping before 3.14.
In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 138-166: The build_from_config in OpenAIEmbeddingFunction should
validate and normalize client_kwargs before unpacking; update build_from_config
to retrieve client_kwargs = config.get("client_kwargs", {}) then check its type
(isinstance(client_kwargs, dict)), if not, either coerce to {} or raise a
ValueError with a clear message, and only then pass **client_kwargs into the
OpenAIEmbeddingFunction constructor (reference:
OpenAIEmbeddingFunction.build_from_config and the client_kwargs variable).
🧹 Nitpick comments (3)
src/pyseekdb/client/embedding_function.py (2)
46-96: Addname()andbuild_from_config()to the Protocol for type-safety.The docstring documents these methods, but the Protocol doesn’t declare them, so type checking won’t enforce the persistence API. Consider adding them as abstract static methods.
♻️ Suggested interface addition
class EmbeddingFunction(Protocol[D]): @@ + `@staticmethod` + `@abstractmethod` + def name() -> str: + ... + @@ `@abstractmethod` def get_config(self) -> Dict[str, Any]: ... + + `@staticmethod` + `@abstractmethod` + def build_from_config(config: Dict[str, Any]) -> "EmbeddingFunction": + ...
506-508: Silence unused-argument lint inbuild_from_config.
configis intentionally unused here, but lint will flag it. Rename or explicitly ignore to avoid noise.✅ Minimal lint fix
- def build_from_config(config: Dict[str, Any]) -> "DefaultEmbeddingFunction": + def build_from_config(_config: Dict[str, Any]) -> "DefaultEmbeddingFunction": return DefaultEmbeddingFunction()tests/unit_tests/test_openai_embedding_function.py (1)
17-31: Centralize theis_openai_available()helper intest_utils.py.This helper is duplicated identically in both
test_qwen_embedding_function.pyandtest_openai_embedding_function.py. Moving it totest_utils.pywhere other test utilities likeEnvGuardalready live will eliminate duplication and prevent future drift.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.github/workflows/ci.ymlpyproject.tomlsrc/pyseekdb/client/embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_base_embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_embedding_function.pysrc/pyseekdb/utils/embedding_functions/qwen_embedding_function.pysrc/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.pytests/unit_tests/test_default_embedding_function.pytests/unit_tests/test_env_guard.pytests/unit_tests/test_openai_embedding_function.pytests/unit_tests/test_qwen_embedding_function.pytests/unit_tests/test_sentence_transformer_embedding_function.pytests/unit_tests/test_utils.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
name(120-126)get_config(128-134)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
name(86-92)get_config(94-105)build_from_config(108-130)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
tests/unit_tests/test_sentence_transformer_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (4)
SentenceTransformerEmbeddingFunction(5-130)name(86-92)get_config(94-105)build_from_config(108-130)
tests/unit_tests/test_openai_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
env_guard(134-152)tests/unit_tests/test_qwen_embedding_function.py (1)
is_openai_available(20-27)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
OpenAIEmbeddingFunction(13-166)name(122-128)get_config(130-136)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
tests/unit_tests/test_env_guard.py (1)
tests/unit_tests/test_utils.py (4)
EnvGuard(11-130)env_guard(134-152)save(51-67)restore(96-117)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
src/pyseekdb/client/embedding_function.py (2)
get_config(85-96)get_config(503-504)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
get_config(130-136)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
get_config(128-134)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
get_config(94-105)
tests/unit_tests/test_qwen_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
env_guard(134-152)tests/unit_tests/test_openai_embedding_function.py (7)
is_openai_available(20-27)test_name(370-372)test_get_config_with_defaults(374-387)test_get_config_with_dimensions(409-418)test_build_from_config_with_defaults(420-437)test_build_from_config_with_custom_values(439-458)test_persistence_roundtrip(460-475)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
QwenEmbeddingFunction(14-164)name(120-126)get_config(128-134)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
src/pyseekdb/client/embedding_function.py (4)
name(500-501)get_config(85-96)get_config(503-504)build_from_config(507-508)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
name(122-128)get_config(130-136)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
name(120-126)get_config(128-134)build_from_config(137-164)
tests/unit_tests/test_default_embedding_function.py (1)
src/pyseekdb/client/embedding_function.py (5)
DefaultEmbeddingFunction(115-511)name(500-501)get_config(85-96)get_config(503-504)build_from_config(507-508)
src/pyseekdb/client/embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
get_config(130-136)name(122-128)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
get_config(128-134)name(120-126)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
get_config(94-105)name(86-92)build_from_config(108-130)
🪛 Ruff (0.14.11)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
153-153: Avoid specifying long messages outside the exception class
(TRY003)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
151-151: Avoid specifying long messages outside the exception class
(TRY003)
tests/unit_tests/test_sentence_transformer_embedding_function.py
93-93: Possible hardcoded password assigned to argument: "use_auth_token"
(S106)
98-98: Possible hardcoded password assigned to: "use_auth_token"
(S105)
src/pyseekdb/client/embedding_function.py
507-507: Unused static method argument: config
(ARG004)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: integration-test (server)
- GitHub Check: integration-test (oceanbase)
- GitHub Check: integration-test (embedded)
- GitHub Check: unit-test
🔇 Additional comments (16)
.github/workflows/ci.yml (2)
28-30: No additional concerns for this change.
90-90: No additional concerns for this change.src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
119-163: LGTM — clean persistence hooks added.
name(),get_config(), andbuild_from_config()align with the new embedding function protocol and keep configuration restoration consistent.src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
85-130: LGTM — consistent serialization/restore API.Defaults and config round‑trip look correct and match the new protocol.
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
203-218: Client_kwargs may contain non-serializable objects; however, no JSON serialization of configs occurs in the current implementation.The current
get_config()implementation returns_client_kwargsdirectly, which could theoretically contain non-serializable objects likehttpx.Timeout(the OpenAI client does accept these). However, based on a comprehensive review of the codebase, embedding function configs are not persisted as JSON. Configs are only used in the in-memory roundtrip pattern:get_config()→build_from_config(). Both operations work with Python dictionaries and do not require JSON serialization. JSON serialization in the client code is limited to metadata and document fields, not embedding function configurations. The suggested guard is not needed for the current architecture.Likely an incorrect or invalid review comment.
tests/unit_tests/test_default_embedding_function.py (1)
18-58: Good persistence coverage.Covers identifier, config shape, build_from_config, and roundtrip, which matches the new contract.
tests/unit_tests/test_env_guard.py (1)
15-112: Nice coverage of EnvGuard usage patterns.The scenarios exercise restore semantics, nesting, and convenience helpers clearly.
tests/unit_tests/test_utils.py (1)
39-152: EnvGuard utility looks solid.Clear save/set/restore semantics and a lightweight contextmanager wrapper.
tests/unit_tests/test_openai_embedding_function.py (1)
363-475: Nice coverage for config persistence and round‑trip behavior.
The tests cover name/get_config/build_from_config plus round‑trip. Please verify thatOpenAIEmbeddingFunctioninitialization here remains side‑effect free (no network calls) when using dummy keys.tests/unit_tests/test_qwen_embedding_function.py (2)
17-31: Guarding Qwen tests with the OpenAI client check is sensible.
This avoids import errors when the openai-compatible client isn’t installed. Please confirm the guard matches the actual import path used byQwenEmbeddingFunction.
340-449: Persistence test coverage for Qwen looks solid.
Defaults, custom values, build_from_config, and round‑trip are well covered. Please verify these remain API‑call free withenv_guardso they can run offline.tests/unit_tests/test_sentence_transformer_embedding_function.py (5)
1-40: Module setup and availability helpers are clear and robust.
The guards for sentence‑transformers and CUDA keep the suite resilient in limited environments. Please confirm CI images that should run these tests include the required deps.
42-85: get_config defaults and custom CUDA values are well exercised.
Good coverage for name/defaults and CUDA-path config serialization.
100-186: build_from_config defaults, partial configs, and round‑trip are nicely covered.
These tests make the persistence contract explicit.
187-302: CUDA‑specific persistence and embedding checks are well gated.
Skip conditions keep these tests safe on non‑CUDA systems while still validating behavior when available.
86-99: [Your rewritten review comment text here]
[Exactly ONE classification tag]
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| license = { text = "Apache-2.0" } | ||
| authors = [{ name = "OceanBase", email = "open_oceanbase@oceanbase.com" }] | ||
| requires-python = ">=3.11,<4.0" | ||
| requires-python = ">=3.11,<=3.13" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# Check if pyproject.toml exists and read line 8
if [ -f pyproject.toml ]; then
echo "=== pyproject.toml line 8 ==="
sed -n '8p' pyproject.toml
echo ""
echo "=== Context (lines 1-15) ==="
head -15 pyproject.toml
else
echo "pyproject.toml not found"
fiRepository: oceanbase/pyseekdb
Length of output: 634
🏁 Script executed:
# Verify PEP 440 version comparison semantics
pip install packaging > /dev/null 2>&1
python3 << 'EOF'
from packaging.specifiers import SpecifierSet
from packaging.version import Version
# Test version comparisons
spec = SpecifierSet(">=3.11,<=3.13")
test_versions = ["3.11.0", "3.12.0", "3.13.0", "3.13.1", "3.14.0"]
print("=== Version Specifier: >=3.11,<=3.13 ===")
for v in test_versions:
version = Version(v)
result = version in spec
print(f" {v}: {result}")
print("\n=== Alternative: >=3.11,<3.14 ===")
spec_alt = SpecifierSet(">=3.11,<3.14")
for v in test_versions:
version = Version(v)
result = version in spec_alt
print(f" {v}: {result}")
EOFRepository: python
Length of output: 278
Python cap excludes 3.13.x patch releases.
<=3.13 blocks installs on 3.13.1+ (since 3.13.1 > 3.13). If the intent is to allow all 3.13.x, use <3.14 instead.
🛠️ Suggested fix
-requires-python = ">=3.11,<=3.13"
+requires-python = ">=3.11,<3.14"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| requires-python = ">=3.11,<=3.13" | |
| requires-python = ">=3.11,<3.14" |
🤖 Prompt for AI Agents
In `@pyproject.toml` at line 8, The requires-python constraint currently uses
">=3.11,<=3.13" which wrongly excludes 3.13.x patch releases; update the
requires-python entry (the requires-python key in pyproject.toml) to allow all
3.13 patch versions (for example use a range like ">=3.11,<3.14") so 3.13.1+ are
accepted while still capping before 3.14.
| @staticmethod | ||
| def build_from_config(config: dict[str, Any]) -> "OpenAIEmbeddingFunction": | ||
| """Build an OpenAIEmbeddingFunction from its configuration dictionary. | ||
| Args: | ||
| config: Dictionary containing the embedding function's configuration | ||
| Returns: | ||
| Restored OpenAIEmbeddingFunction instance | ||
| Raises: | ||
| ValueError: If the configuration is invalid or missing required fields | ||
| """ | ||
| model_name = config.get("model_name") | ||
| if model_name is None: | ||
| raise ValueError("Missing required field 'model_name' in configuration") | ||
|
|
||
| api_key_env = config.get("api_key_env") | ||
| api_base = config.get("api_base") | ||
| dimensions = config.get("dimensions") | ||
| client_kwargs = config.get("client_kwargs", {}) | ||
|
|
||
| return OpenAIEmbeddingFunction( | ||
| model_name=model_name, | ||
| api_key_env=api_key_env, | ||
| api_base=api_base, | ||
| dimensions=dimensions, | ||
| **client_kwargs | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate client_kwargs before unpacking.
If persisted metadata is corrupted or legacy data stores client_kwargs=None/non-dict, **client_kwargs will raise a TypeError and hide the root cause. Normalize and validate first.
🔧 Proposed fix
- client_kwargs = config.get("client_kwargs", {})
+ client_kwargs = config.get("client_kwargs") or {}
+ if not isinstance(client_kwargs, dict):
+ raise ValueError("client_kwargs must be a dict in configuration")🧰 Tools
🪛 Ruff (0.14.11)
153-153: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py` around
lines 138 - 166, The build_from_config in OpenAIEmbeddingFunction should
validate and normalize client_kwargs before unpacking; update build_from_config
to retrieve client_kwargs = config.get("client_kwargs", {}) then check its type
(isinstance(client_kwargs, dict)), if not, either coerce to {} or raise a
ValueError with a clear message, and only then pass **client_kwargs into the
OpenAIEmbeddingFunction constructor (reference:
OpenAIEmbeddingFunction.build_from_config and the client_kwargs variable).
Summary
close #50
Solution Description
Summary by CodeRabbit
New Features
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.