@hnwyllmm hnwyllmm commented Jan 16, 2026

Summary

close #50

Solution Description

Summary by CodeRabbit

  • New Features

    • Added configuration capabilities to embedding functions, enabling serialization and reconstruction of settings across all embedding function types.
  • Chores

    • Updated Python version requirement to explicitly support Python 3.13.
    • Enhanced CI pipeline with additional test dependencies for ML and embedding libraries.
  • Tests

    • Added comprehensive test coverage for embedding function configuration persistence and environment variable management.


coderabbitai bot commented Jan 16, 2026

Walkthrough

The changes implement embedding function serialization and reconstruction capabilities by introducing name(), get_config(), and build_from_config() methods across the embedding function protocol and concrete implementations (OpenAI, Qwen, SentenceTransformer, Default). CI dependencies and Python version constraints are updated, with comprehensive persistence tests and test utilities added.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| CI and Project Configuration | .github/workflows/ci.yml, pyproject.toml | Adds pip install step for test dependencies (openai, sentence-transformers, torch, torchvision, torchaudio); tightens Python requirement from >=3.11,<4.0 to >=3.11,<=3.13 |
| Embedding Function Protocol | src/pyseekdb/client/embedding_function.py | Adds abstract methods name(), get_config(), build_from_config() to EmbeddingFunction protocol and implements them in DefaultEmbeddingFunction for configuration routing and reconstruction |
| Embedding Function Base Implementation | src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py | Adds get_config() method returning model configuration dict (model_name, api_key_env, api_base, dimensions, client_kwargs) |
| OpenAI Embedding Implementation | src/pyseekdb/utils/embedding_functions/openai_embedding_function.py | Implements name() returning "openai", get_config(), and build_from_config() with model_name validation for configuration-based instantiation |
| Qwen Embedding Implementation | src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py | Implements name() returning "qwen", get_config(), and build_from_config() with model_name validation for configuration-based instantiation |
| SentenceTransformer Embedding Implementation | src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py | Implements name() returning "sentence_transformer", get_config(), and build_from_config() with support for device, normalize_embeddings, and kwargs serialization |
| Test Utilities | tests/unit_tests/test_utils.py | Introduces EnvGuard class and env_guard context manager for managing environment variables during test execution with save/restore functionality |
| Environment Variable Tests | tests/unit_tests/test_env_guard.py | Adds unit tests for EnvGuard demonstrating context manager usage, chaining, nesting, and real-world API key scenarios |
| Default Embedding Function Tests | tests/unit_tests/test_default_embedding_function.py | Adds persistence tests verifying name(), get_config(), build_from_config(), and roundtrip behavior |
| OpenAI Embedding Function Tests | tests/unit_tests/test_openai_embedding_function.py | Adds is_openai_available() check and TestOpenAIEmbeddingFunctionPersistence class with configuration roundtrip tests; note: test class appears duplicated in file |
| Qwen Embedding Function Tests | tests/unit_tests/test_qwen_embedding_function.py | Adds OpenAI availability check, extends skip condition, and TestQwenEmbeddingFunctionPersistence class with persistence tests; note: test class appears duplicated in file |
| SentenceTransformer Embedding Function Tests | tests/unit_tests/test_sentence_transformer_embedding_function.py | Adds comprehensive TestSentenceTransformerEmbeddingFunctionPersistence class covering default/custom configs, CUDA-specific paths, and persistence roundtrips |
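
The EnvGuard implementation in tests/unit_tests/test_utils.py is not shown in this review. As a rough illustration of the save/restore idea described in the table, here is a minimal sketch. The name `env_guard_sketch` and its keyword-argument interface are assumptions for illustration only, not the PR's actual API.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def env_guard_sketch(**overrides: Optional[str]) -> Iterator[None]:
    """Temporarily set environment variables; a value of None unsets the key."""
    # Remember the previous value of every key we touch (None means "was unset").
    saved = {key: os.environ.get(key) for key in overrides}
    try:
        for key, value in overrides.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        yield
    finally:
        # Restore the original state even if the body raised.
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


# Usage: a dummy API key is visible only inside the block.
os.environ.pop("SKETCH_API_KEY", None)
with env_guard_sketch(SKETCH_API_KEY="dummy-key"):
    assert os.environ["SKETCH_API_KEY"] == "dummy-key"
assert "SKETCH_API_KEY" not in os.environ
```

Restoring in a `finally` block is what keeps a failing test from leaking a dummy key into later tests.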

Sequence Diagram

sequenceDiagram
    participant Client
    participant EmbeddingFunction as EmbeddingFunction<br/>(Protocol)
    participant Concrete as Concrete Implementation<br/>(e.g., OpenAI)
    participant Storage as Configuration<br/>Storage

    rect rgba(100, 150, 255, 0.5)
    Note over Client,Storage: Serialization Flow
    Client->>Concrete: instance = OpenAIEmbeddingFunction(model_name="text-embedding-3-small")
    Client->>Concrete: name = instance.name()
    Concrete-->>Client: "openai"
    Client->>Concrete: config = instance.get_config()
    Concrete-->>Client: {model_name, api_key_env, ...}
    Client->>Storage: save(name, config)
    end

    rect rgba(255, 150, 100, 0.5)
    Note over Client,Storage: Deserialization Flow
    Client->>Storage: (name, config) = load()
    Storage-->>Client: ("openai", {model_name, ...})
    Client->>EmbeddingFunction: instance = EmbeddingFunction.build_from_config(config)
    EmbeddingFunction->>Concrete: build_from_config(config)
    Concrete-->>Client: OpenAIEmbeddingFunction instance
    end
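
The two flows in the diagram can be sketched in a few lines of Python. This review does not show how a stored name is mapped back to a concrete class, so the sketch assumes a plain dictionary registry keyed by name() and a JSON string as the storage format. `ToyEmbeddingFunction`, `REGISTRY`, `save`, and `load` are hypothetical names, not part of pyseekdb.

```python
import json
from typing import Any, Dict, Type


class ToyEmbeddingFunction:
    """Hypothetical stand-in for a concrete embedding function."""

    def __init__(self, model_name: str, dimensions: int = 4) -> None:
        self.model_name = model_name
        self.dimensions = dimensions

    @staticmethod
    def name() -> str:
        return "toy"

    def get_config(self) -> Dict[str, Any]:
        return {"model_name": self.model_name, "dimensions": self.dimensions}

    @staticmethod
    def build_from_config(config: Dict[str, Any]) -> "ToyEmbeddingFunction":
        model_name = config.get("model_name")
        if model_name is None:
            raise ValueError("Missing required field 'model_name' in configuration")
        return ToyEmbeddingFunction(model_name, config.get("dimensions", 4))


# Assumed registry: maps the persisted name to the class that can rebuild it.
REGISTRY: Dict[str, Type[ToyEmbeddingFunction]] = {
    ToyEmbeddingFunction.name(): ToyEmbeddingFunction,
}


def save(instance: ToyEmbeddingFunction) -> str:
    """Serialization flow: store the (name, config) pair as a JSON string."""
    return json.dumps({"name": instance.name(), "config": instance.get_config()})


def load(payload: str) -> ToyEmbeddingFunction:
    """Deserialization flow: look up the class by name, then rebuild from config."""
    record = json.loads(payload)
    cls = REGISTRY[record["name"]]
    return cls.build_from_config(record["config"])


original = ToyEmbeddingFunction("toy-model", dimensions=8)
restored = load(save(original))
assert restored.get_config() == original.get_config()
```

Keeping the name separate from the config lets the loader choose the class before it interprets any class-specific fields.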

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A rabbit hops with joy today,
Config functions come to play!
Serialize, reconstruct with glee,
Embedding functions saved and free!
Test utilities, environment care—
PySeekDB's magic's everywhere! 🌟

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR adds name(), get_config(), and build_from_config() methods to all embedding function classes, enabling serialization and reconstruction. However, it lacks implementation of kv_attributes field on table, retrieval on get_collection, and actual recreation logic [#50]. | Implement the remaining requirements: add kv_attributes field to table metadata, retrieve it in get_collection, and instantiate embedding functions from stored configuration. |
| Out of Scope Changes check | ⚠️ Warning | The PR includes multiple out-of-scope changes: Python version constraint tightening (3.13), new test utilities (EnvGuard), CI test dependency installation, and duplicate test classes in test_openai and test_qwen files. | Remove unrelated changes: revert Python version constraint, remove EnvGuard unless essential, clean up CI configuration, and deduplicate test classes. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[Draft] support embedding function persistence' is specific and describes the main objective of adding persistence capabilities to embedding functions. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 94.87% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Line 8: The requires-python constraint currently uses ">=3.11,<=3.13" which
wrongly excludes 3.13.x patch releases; update the requires-python entry (the
requires-python key in pyproject.toml) to allow all 3.13 patch versions (for
example use a range like ">=3.11,<3.14") so 3.13.1+ are accepted while still
capping before 3.14.

In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 138-166: The build_from_config in OpenAIEmbeddingFunction should
validate and normalize client_kwargs before unpacking; update build_from_config
to retrieve client_kwargs = config.get("client_kwargs", {}) then check its type
(isinstance(client_kwargs, dict)), if not, either coerce to {} or raise a
ValueError with a clear message, and only then pass **client_kwargs into the
OpenAIEmbeddingFunction constructor (reference:
OpenAIEmbeddingFunction.build_from_config and the client_kwargs variable).
🧹 Nitpick comments (3)
src/pyseekdb/client/embedding_function.py (2)

46-96: Add name() and build_from_config() to the Protocol for type-safety.

The docstring documents these methods, but the Protocol doesn’t declare them, so type checking won’t enforce the persistence API. Consider adding them as abstract static methods.

♻️ Suggested interface addition
 class EmbeddingFunction(Protocol[D]):
@@
+    @staticmethod
+    @abstractmethod
+    def name() -> str:
+        ...
+
@@
     @abstractmethod
     def get_config(self) -> Dict[str, Any]:
         ...
+
+    @staticmethod
+    @abstractmethod
+    def build_from_config(config: Dict[str, Any]) -> "EmbeddingFunction":
+        ...
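
To show what declaring these methods on the Protocol buys, here is a small self-contained sketch. `PersistableEmbeddingFunction`, `Complete`, and `MissingBuild` are hypothetical names. The sketch drops the generic parameter `D` for brevity and adds `runtime_checkable` so the check can be demonstrated without a static type checker, which the suggestion above does not require.

```python
from abc import abstractmethod
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class PersistableEmbeddingFunction(Protocol):
    """Hypothetical protocol declaring only the three persistence methods."""

    @staticmethod
    @abstractmethod
    def name() -> str: ...

    @abstractmethod
    def get_config(self) -> Dict[str, Any]: ...

    @staticmethod
    @abstractmethod
    def build_from_config(config: Dict[str, Any]) -> "PersistableEmbeddingFunction": ...


class Complete:
    """Implements all three methods without inheriting from the protocol."""

    @staticmethod
    def name() -> str:
        return "complete"

    def get_config(self) -> Dict[str, Any]:
        return {}

    @staticmethod
    def build_from_config(config: Dict[str, Any]) -> "Complete":
        return Complete()


class MissingBuild:
    """Lacks build_from_config, so it does not satisfy the protocol."""

    @staticmethod
    def name() -> str:
        return "missing"

    def get_config(self) -> Dict[str, Any]:
        return {}


assert isinstance(Complete(), PersistableEmbeddingFunction)
assert not isinstance(MissingBuild(), PersistableEmbeddingFunction)
```

The runtime `isinstance` check only verifies that the method names exist. Checking signatures and return types still needs a static type checker.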

506-508: Silence unused-argument lint in build_from_config.

config is intentionally unused here, but lint will flag it. Rename or explicitly ignore to avoid noise.

✅ Minimal lint fix
-    def build_from_config(config: Dict[str, Any]) -> "DefaultEmbeddingFunction":
+    def build_from_config(_config: Dict[str, Any]) -> "DefaultEmbeddingFunction":
         return DefaultEmbeddingFunction()
tests/unit_tests/test_openai_embedding_function.py (1)

17-31: Centralize the is_openai_available() helper in test_utils.py.

This helper is duplicated identically in both test_qwen_embedding_function.py and test_openai_embedding_function.py. Moving it to test_utils.py where other test utilities like EnvGuard already live will eliminate duplication and prevent future drift.
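
The body of is_openai_available() is not quoted in this review, so the following is only a sketch of what a shared helper in test_utils.py might look like. It assumes that availability means "the package can be found on the import path", and `is_package_available` is a hypothetical generalization.

```python
import importlib.util


def is_package_available(package: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(package) is not None


def is_openai_available() -> bool:
    """Shared helper that both test modules could import from one place."""
    return is_package_available("openai")


# In a test module this would typically feed a skip marker, for example:
#   pytestmark = pytest.mark.skipif(
#       not is_openai_available(), reason="openai package not installed"
#   )
```

Using `find_spec` avoids paying the import cost at collection time. A try/except around `import openai` would work as well if the tests also need to catch broken installations.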

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 702f93c and 5c483b9.

📒 Files selected for processing (13)
  • .github/workflows/ci.yml
  • pyproject.toml
  • src/pyseekdb/client/embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
  • tests/unit_tests/test_default_embedding_function.py
  • tests/unit_tests/test_env_guard.py
  • tests/unit_tests/test_openai_embedding_function.py
  • tests/unit_tests/test_qwen_embedding_function.py
  • tests/unit_tests/test_sentence_transformer_embedding_function.py
  • tests/unit_tests/test_utils.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
  • name (86-92)
  • get_config (94-105)
  • build_from_config (108-130)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
tests/unit_tests/test_sentence_transformer_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (4)
  • SentenceTransformerEmbeddingFunction (5-130)
  • name (86-92)
  • get_config (94-105)
  • build_from_config (108-130)
tests/unit_tests/test_openai_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
  • env_guard (134-152)
tests/unit_tests/test_qwen_embedding_function.py (1)
  • is_openai_available (20-27)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
  • OpenAIEmbeddingFunction (13-166)
  • name (122-128)
  • get_config (130-136)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
tests/unit_tests/test_env_guard.py (1)
tests/unit_tests/test_utils.py (4)
  • EnvGuard (11-130)
  • env_guard (134-152)
  • save (51-67)
  • restore (96-117)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
src/pyseekdb/client/embedding_function.py (2)
  • get_config (85-96)
  • get_config (503-504)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
  • get_config (130-136)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
  • get_config (128-134)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
  • get_config (94-105)
tests/unit_tests/test_qwen_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
  • env_guard (134-152)
tests/unit_tests/test_openai_embedding_function.py (7)
  • is_openai_available (20-27)
  • test_name (370-372)
  • test_get_config_with_defaults (374-387)
  • test_get_config_with_dimensions (409-418)
  • test_build_from_config_with_defaults (420-437)
  • test_build_from_config_with_custom_values (439-458)
  • test_persistence_roundtrip (460-475)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
  • QwenEmbeddingFunction (14-164)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
src/pyseekdb/client/embedding_function.py (4)
  • name (500-501)
  • get_config (85-96)
  • get_config (503-504)
  • build_from_config (507-508)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • name (122-128)
  • get_config (130-136)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
tests/unit_tests/test_default_embedding_function.py (1)
src/pyseekdb/client/embedding_function.py (5)
  • DefaultEmbeddingFunction (115-511)
  • name (500-501)
  • get_config (85-96)
  • get_config (503-504)
  • build_from_config (507-508)
src/pyseekdb/client/embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • get_config (130-136)
  • name (122-128)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • get_config (128-134)
  • name (120-126)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
  • get_config (94-105)
  • name (86-92)
  • build_from_config (108-130)
🪛 Ruff (0.14.11)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py

151-151: Avoid specifying long messages outside the exception class

(TRY003)

tests/unit_tests/test_sentence_transformer_embedding_function.py

93-93: Possible hardcoded password assigned to argument: "use_auth_token"

(S106)


98-98: Possible hardcoded password assigned to: "use_auth_token"

(S105)

src/pyseekdb/client/embedding_function.py

507-507: Unused static method argument: config

(ARG004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration-test (server)
  • GitHub Check: integration-test (oceanbase)
  • GitHub Check: integration-test (embedded)
  • GitHub Check: unit-test
🔇 Additional comments (16)
.github/workflows/ci.yml (2)

28-30: No additional concerns for this change.


90-90: No additional concerns for this change.

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)

119-163: LGTM — clean persistence hooks added.

name(), get_config(), and build_from_config() align with the new embedding function protocol and keep configuration restoration consistent.

src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)

85-130: LGTM — consistent serialization/restore API.

Defaults and config round‑trip look correct and match the new protocol.

src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)

203-218: client_kwargs may contain non-serializable objects, but configs are not currently serialized to JSON.

The current get_config() implementation returns _client_kwargs directly, which could contain non-serializable objects such as httpx.Timeout (the OpenAI client accepts these). However, a review of the codebase shows that embedding function configs are not persisted as JSON. Configs are used only in the in-memory roundtrip get_config() → build_from_config(), and both operations work with Python dictionaries, so no JSON serialization is required. JSON serialization in the client code is limited to metadata and document fields, not embedding function configurations. The suggested guard is not needed for the current architecture.

Likely an incorrect or invalid review comment.
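
If configs are later written to storage as JSON, a check along the following lines could flag problem values before they are saved. This is a sketch under that assumption only. `non_json_serializable_keys` and `FakeTimeout` are hypothetical, and `FakeTimeout` merely stands in for a client option object.

```python
import json
from typing import Any, Dict, List


def non_json_serializable_keys(config: Dict[str, Any]) -> List[str]:
    """Return the top-level keys whose values json.dumps cannot encode."""
    bad = []
    for key, value in config.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # TypeError: unsupported object type; ValueError: e.g. circular reference.
            bad.append(key)
    return bad


class FakeTimeout:
    """Hypothetical stand-in for a client option object such as a timeout."""


config = {
    "model_name": "example-model",
    "client_kwargs": {"timeout": FakeTimeout(), "max_retries": 2},
}
assert non_json_serializable_keys(config) == ["client_kwargs"]
```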

tests/unit_tests/test_default_embedding_function.py (1)

18-58: Good persistence coverage.

Covers identifier, config shape, build_from_config, and roundtrip, which matches the new contract.

tests/unit_tests/test_env_guard.py (1)

15-112: Nice coverage of EnvGuard usage patterns.

The scenarios exercise restore semantics, nesting, and convenience helpers clearly.

tests/unit_tests/test_utils.py (1)

39-152: EnvGuard utility looks solid.

Clear save/set/restore semantics and a lightweight contextmanager wrapper.

tests/unit_tests/test_openai_embedding_function.py (1)

363-475: Nice coverage for config persistence and round‑trip behavior.
The tests cover name/get_config/build_from_config plus round‑trip. Please verify that OpenAIEmbeddingFunction initialization here remains side‑effect free (no network calls) when using dummy keys.

tests/unit_tests/test_qwen_embedding_function.py (2)

17-31: Guarding Qwen tests with the OpenAI client check is sensible.
This avoids import errors when the openai-compatible client isn’t installed. Please confirm the guard matches the actual import path used by QwenEmbeddingFunction.


340-449: Persistence test coverage for Qwen looks solid.
Defaults, custom values, build_from_config, and round‑trip are well covered. Please verify these remain API‑call free with env_guard so they can run offline.

tests/unit_tests/test_sentence_transformer_embedding_function.py (5)

1-40: Module setup and availability helpers are clear and robust.
The guards for sentence‑transformers and CUDA keep the suite resilient in limited environments. Please confirm CI images that should run these tests include the required deps.


42-85: get_config defaults and custom CUDA values are well exercised.
Good coverage for name/defaults and CUDA-path config serialization.


100-186: build_from_config defaults, partial configs, and round‑trip are nicely covered.
These tests make the persistence contract explicit.


187-302: CUDA‑specific persistence and embedding checks are well gated.
Skip conditions keep these tests safe on non‑CUDA systems while still validating behavior when available.


86-99: [Your rewritten review comment text here]
[Exactly ONE classification tag]


 license = { text = "Apache-2.0" }
 authors = [{ name = "OceanBase", email = "open_oceanbase@oceanbase.com" }]
-requires-python = ">=3.11,<4.0"
+requires-python = ">=3.11,<=3.13"
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if pyproject.toml exists and read line 8
if [ -f pyproject.toml ]; then
  echo "=== pyproject.toml line 8 ==="
  sed -n '8p' pyproject.toml
  echo ""
  echo "=== Context (lines 1-15) ==="
  head -15 pyproject.toml
else
  echo "pyproject.toml not found"
fi

Repository: oceanbase/pyseekdb

Length of output: 634


🏁 Script executed:

# Verify PEP 440 version comparison semantics
pip install packaging > /dev/null 2>&1
python3 << 'EOF'
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Test version comparisons
spec = SpecifierSet(">=3.11,<=3.13")

test_versions = ["3.11.0", "3.12.0", "3.13.0", "3.13.1", "3.14.0"]
print("=== Version Specifier: >=3.11,<=3.13 ===")
for v in test_versions:
    version = Version(v)
    result = version in spec
    print(f"  {v}: {result}")

print("\n=== Alternative: >=3.11,<3.14 ===")
spec_alt = SpecifierSet(">=3.11,<3.14")
for v in test_versions:
    version = Version(v)
    result = version in spec_alt
    print(f"  {v}: {result}")
EOF

Repository: python

Length of output: 278


The Python cap excludes 3.13.1 and later patch releases.

<=3.13 blocks installs on 3.13.1+ (under PEP 440, 3.13 is compared as 3.13.0, so 3.13.1 > 3.13). If the intent is to allow all 3.13.x, use <3.14 instead.

🛠️ Suggested fix
-requires-python = ">=3.11,<=3.13"
+requires-python = ">=3.11,<3.14"

Comment on lines +138 to +166
    @staticmethod
    def build_from_config(config: dict[str, Any]) -> "OpenAIEmbeddingFunction":
        """Build an OpenAIEmbeddingFunction from its configuration dictionary.

        Args:
            config: Dictionary containing the embedding function's configuration

        Returns:
            Restored OpenAIEmbeddingFunction instance

        Raises:
            ValueError: If the configuration is invalid or missing required fields
        """
        model_name = config.get("model_name")
        if model_name is None:
            raise ValueError("Missing required field 'model_name' in configuration")

        api_key_env = config.get("api_key_env")
        api_base = config.get("api_base")
        dimensions = config.get("dimensions")
        client_kwargs = config.get("client_kwargs", {})

        return OpenAIEmbeddingFunction(
            model_name=model_name,
            api_key_env=api_key_env,
            api_base=api_base,
            dimensions=dimensions,
            **client_kwargs
        )
⚠️ Potential issue | 🟡 Minor

Validate client_kwargs before unpacking.

If persisted metadata is corrupted or legacy data stores client_kwargs=None/non-dict, **client_kwargs will raise a TypeError and hide the root cause. Normalize and validate first.

🔧 Proposed fix
-        client_kwargs = config.get("client_kwargs", {})
+        client_kwargs = config.get("client_kwargs") or {}
+        if not isinstance(client_kwargs, dict):
+            raise ValueError("client_kwargs must be a dict in configuration")
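
The same validation can be exercised in isolation. The sketch below is a variant of the proposed fix rather than a copy of it: the hypothetical helper `normalize_client_kwargs` treats only a missing or None value as empty, whereas `or {}` would also silently replace other falsy values such as an empty list.

```python
from typing import Any, Dict


def normalize_client_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return client_kwargs as a dict, treating a missing or None value as empty."""
    client_kwargs = config.get("client_kwargs")
    if client_kwargs is None:
        return {}
    if not isinstance(client_kwargs, dict):
        raise ValueError(
            "client_kwargs must be a dict in configuration, "
            f"got {type(client_kwargs).__name__}"
        )
    return client_kwargs


# A missing key and an explicit None both normalize to an empty dict.
assert normalize_client_kwargs({}) == {}
assert normalize_client_kwargs({"client_kwargs": None}) == {}
# A real dict passes through unchanged and is safe to unpack with **.
assert normalize_client_kwargs({"client_kwargs": {"timeout": 5}}) == {"timeout": 5}
```

Raising a ValueError that names the field points at the corrupted config directly, instead of surfacing as a TypeError from the `**` unpacking.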
🧰 Tools
🪛 Ruff (0.14.11)

153-153: Avoid specifying long messages outside the exception class

(TRY003)


@hnwyllmm hnwyllmm marked this pull request as draft January 16, 2026 05:29
Development

Successfully merging this pull request may close these issues.

[Feature]: Remove the embedding_function argument in get_collection API
