Skip to content

Conversation

@hnwyllmm
Copy link
Member

@hnwyllmm hnwyllmm commented Jan 12, 2026

Summary

close #109

Support OpenAI and Qwen embedding platform based on OpenAI.

Solution Description

Qwen is a famous LLM platform in China and it has a type of API compatible with OpenAI.
So, I implemented an OpenAIBaseEmbeddingFunction class and supporting OpenAI and Qwen based on it.

It cost tokens when we test the embedding functions, so we'll not test them in github action.

Sorry that I can't reach openai.com from my compute, I can't test the OpenAIEmbeddingFunction now.

Another information:
LiteLLM is a Python SDK supporting more than 100 LLM platforms, including OpenAI. But when I Qwen with it, I got some stupid errors, such as dimensions parameter is not supported, float-format is not supported. So I changed the base implementation to OpenAI. I can't find Qwen/dashscope embedding provider from LiteLLM official document yet even though there's a completion provide of dashscope.

Qwen test result:

platform darwin -- Python 3.13.0, pytest-9.0.2, pluggy-1.6.0 -- /Users/wangyunlai.wyl/Documents/project/github/pyseekdb/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/wangyunlai.wyl/Documents/project/github/pyseekdb
configfile: pyproject.toml
plugins: anyio-4.12.1
collected 18 items                                                                                                                                   

tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_qwen_env PASSED                                              [  5%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_defaults PASSED                          [ 11%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_different_models PASSED                  [ 16%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_custom_api_key_env PASSED                [ 22%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_custom_api_base PASSED                   [ 27%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_dimensions PASSED                        [ 33%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_with_kwargs PASSED                            [ 38%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_initialization_missing_api_key PASSED                        [ 44%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_dimension_property_known_models PASSED                       [ 50%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_dimension_property_unknown_model PASSED                      [ 55%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_call_single_document PASSED                                  [ 61%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_call_multiple_documents PASSED                               [ 66%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_call_empty_input PASSED                                      [ 72%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_call_with_dimensions_parameter PASSED                        [ 77%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_dimension_of_function PASSED                                 [ 83%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_get_default_api_base PASSED                                  [ 88%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_get_default_api_key_env PASSED                               [ 94%]
tests/unit_tests/test_qwen_embedding_function.py::TestQwenEmbeddingFunction::test_get_model_dimensions PASSED                                  [100%]

================================================================= 18 passed in 2.54s =================================================================

Summary by CodeRabbit

  • New Features

    • Added multiple new embedding providers (OpenAI-compatible, LiteLLM, Qwen) and a shared OpenAI-compatible embedding base.
  • Behavior / Reliability

    • Improved API key handling, model-dimension detection and validation, and clearer errors for misconfiguration.
  • Documentation

    • Updated docstrings and usage examples for embedding providers.
  • Tests

    • Added comprehensive tests for OpenAI and Qwen embedding flows (initialization, dimension handling, and calls).

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 12, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a new OpenAI-compatible embedding base class and three provider implementations (OpenAI, Qwen, LiteLLM), exposes them via a new package initializer, updates DefaultEmbeddingFunction to be generically typed, removes an unused import in the SentenceTransformer embedding module, and adds unit tests for OpenAI and Qwen embedding functions.

Changes

Cohort / File(s) Summary
Base Infrastructure
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py
New OpenAIBaseEmbeddingFunction(EmbeddingFunction[Documents]) with API key/base resolution, model→dimension mapping, runtime probing for unknown dimensions, input normalization, embedding request wiring, and response validation.
OpenAI Implementation
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
New OpenAIEmbeddingFunction subclass providing OpenAI defaults, _get_default_api_base, _get_default_api_key_env, and model-dimension mapping.
Qwen Implementation
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
New QwenEmbeddingFunction subclass with DashScope/Qwen defaults, _get_default_api_base, _get_default_api_key_env, and _QWEN_MODEL_DIMENSIONS mapping.
LiteLLM Implementation
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py
New LiteLLMEmbeddingFunction(EmbeddingFunction[Documents]) with lazy litellm import, optional env-key handling, flexible response parsing for multiple shapes, input normalization, and count validation.
Module Organization
src/pyseekdb/utils/embedding_functions/__init__.py
New package initializer exporting SentenceTransformerEmbeddingFunction, LiteLLMEmbeddingFunction, OpenAIEmbeddingFunction, QwenEmbeddingFunction via __all__.
DefaultEmbeddingFunction signature
src/pyseekdb/client/embedding_function.py
Updated DefaultEmbeddingFunction signature to inherit EmbeddingFunction[Documents] (now generic).
SentenceTransformer minor
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
Removed unused Space import and minor docstring/path formatting changes.
Tests — OpenAI
tests/unit_tests/test_openai_embedding_function.py
New comprehensive test suite for OpenAIEmbeddingFunction covering initialization, API key/base handling, dimension logic (known/custom/probe), and __call__ behavior (skipped unless OPENAI_API_KEY set).
Tests — Qwen
tests/unit_tests/test_qwen_embedding_function.py
New comprehensive test suite for QwenEmbeddingFunction covering initialization, api key/base handling, dimension logic, and __call__ behavior (skipped unless DASHSCOPE_API_KEY set).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Base as OpenAIBaseEmbeddingFunction
    participant OpenAIClient as OpenAI Client Library
    participant OpenAIAPI as OpenAI API

    Client->>Base: __init__(model_name, api_key_env?, api_base?, dimensions?)
    Base->>Base: verify openai installed
    Base->>Base: resolve API key via _get_default_api_key_env()
    Base->>Base: resolve api_base via _get_default_api_base()
    Base->>OpenAIClient: initialize client with api_key, api_base, kwargs

    Client->>Base: __call__(input)
    Base->>Base: normalize input (str or list)
    alt dimensions not provided and model unknown
        Base->>OpenAIClient: request sample embedding to probe dimension
        OpenAIClient->>OpenAIAPI: POST /embeddings (probe)
        OpenAIAPI-->>OpenAIClient: sample embedding
        OpenAIClient-->>Base: sample embedding
        Base->>Base: infer dimension
    end
    Base->>OpenAIClient: embeddings.create(model, input, dimensions?)
    OpenAIClient->>OpenAIAPI: POST /embeddings
    OpenAIAPI-->>OpenAIClient: embeddings response
    OpenAIClient-->>Base: response
    Base->>Base: extract embeddings, validate count
    Base-->>Client: return list of embedding vectors
Loading
sequenceDiagram
    participant Client
    participant Lite as LiteLLMEmbeddingFunction
    participant LiteLLMLib as LiteLLM Library
    participant Provider as External Provider

    Client->>Lite: __init__(model_name, api_key_env?, **kwargs)
    Lite->>Lite: validate primitive kwargs, store config, defer import

    Client->>Lite: __call__(input)
    Lite->>Lite: normalize input (str or list)
    alt api_key_env provided
        Lite->>Lite: read API key from environment (raise if missing)
    end
    Lite->>Lite: build embedding_kwargs (model, input, user kwargs)
    Lite->>LiteLLMLib: lazy import and call embedding function
    LiteLLMLib->>Provider: route request to configured provider
    Provider-->>LiteLLMLib: return embeddings response (various shapes)
    LiteLLMLib-->>Lite: response
    Lite->>Lite: parse multiple response shapes, validate count
    Lite-->>Client: return list of embedding vectors
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Poem

🐰 New embeddings sprout from many ends,
OpenAI, Qwen, LiteLLM — three friends,
A base to guide them, types set true,
Tests hop in to prove what's new,
Vectors bloom — a rabbit cheers for you!

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title "Support more embedding function platform" is vague and generic, using imprecise language without clearly conveying the specific implementation of OpenAI and Qwen support. Revise title to be more specific, e.g., "Add OpenAI and Qwen embedding function implementations with OpenAI-compatible base class" to clearly reflect the main changes.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR successfully implements OpenAI and Qwen embedding support with an OpenAI-compatible base class as required by #109, including proper initialization, dimension handling, and API integration with comprehensive tests.
Out of Scope Changes check ✅ Passed All changes directly support the linked issue objectives: new embedding function classes (OpenAI, Qwen, LiteLLM), base class implementation, helper utilities, module initialization, and comprehensive tests—all aligned with the feature request.
Docstring Coverage ✅ Passed Docstring coverage is 98.78% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dab341 and 887ec00.

📒 Files selected for processing (2)
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
🧰 Additional context used
🧬 Code graph analysis (2)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • OpenAIBaseEmbeddingFunction (6-201)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-test (embedded)
  • GitHub Check: integration-test (server)
  • GitHub Check: integration-test (oceanbase)
🔇 Additional comments (2)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)

1-117: LGTM! Clean implementation following the established base class pattern.

The QwenEmbeddingFunction correctly extends OpenAIBaseEmbeddingFunction and properly implements all required methods:

  • _get_default_api_base() returns the DashScope compatible endpoint
  • _get_default_api_key_env() returns "DASHSCOPE_API_KEY"
  • _get_model_dimensions() returns the model-to-dimension mapping

The documentation is comprehensive with clear examples, and the dimension comments for v3/v4 appropriately note that these are defaults that can be changed.

src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)

1-119: LGTM! Well-structured implementation with accurate documentation.

The OpenAIEmbeddingFunction correctly:

  • Defaults to text-embedding-3-small (OpenAI's current recommended model)
  • Properly documents dimension reduction capability for text-embedding-3 models (lines 80-82)
  • Provides clear examples including custom dimensions usage
  • Implements all required base class methods consistently

The model dimensions mapping and documentation align with OpenAI's official specifications.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)

68-83: Bug: __call__ breaks for single-string input (list("hello")). This violates the Documents = Union[str, List[str]] contract.

Proposed diff
@@
     def __call__(self, input: Documents) -> Embeddings:
@@
-        embeddings = self._model.encode(
-            list(input),
+        documents = [input] if isinstance(input, str) else list(input)
+        embeddings = self._model.encode(
+            documents,
             convert_to_numpy=True,
             normalize_embeddings=self.normalize_embeddings,
         )
🤖 Fix all issues with AI agents
In @src/pyseekdb/utils/embedding_functions/openai_embedding_function.py:
- Around line 5-80: The collection creation currently probes the embedding
function (causing paid API calls); add a model-dimension mapping and expose a
dimension attribute/parameter so probing is avoided: update
LiteLLMEmbeddingFunction to accept an optional dimension argument and/or provide
a read-only property (e.g., self.dimension) that subclasses can override, then
implement model-specific defaults in OpenAIEmbeddingFunction (and
QwenEmbeddingFunction) by mapping model_name -> dimension (e.g.,
"text-embedding-ada-002":1536, "text-embedding-3-small":1536,
"text-embedding-3-large":3072) and set that.dimension when not explicitly
provided; ensure client_base.py's create_collection checks for
embedding_function.dimension before doing the test probe.
- Around line 45-80: Update the OpenAIEmbeddingFunction __init__ to use
"text-embedding-3-small" as the default model_name instead of
"text-embedding-ada-002" and update the docstring to reflect the new default and
available alternatives ("text-embedding-3-small" and "text-embedding-3-large");
additionally, when an OpenAI-compatible proxy is used (detect via the presence
of api_base in kwargs), ensure the LiteLLM model name is prefixed with "openai/"
(e.g., "openai/text-embedding-3-small") or explicitly document that callers must
supply the "openai/" prefix so requests route to the proxy rather than the real
OpenAI service (make the change in the __init__ method handling
model_name/api_base and update the docstring accordingly).

In
@src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py:
- Around line 12-25: The example uses collection.query("How are you?", top_k=1)
which no longer matches the current API; update the doc example to call
collection.query(query_texts=["How are you?"], n_results=1) (or the exact
parameter names used elsewhere) and ensure the rest of the snippet
(ids/documents/metadatas, SentenceTransformerEmbeddingFunction instantiation)
remains unchanged so it runs without copy/paste errors.

In @tests/integration_tests/test_openai_embedding_function.py:
- Around line 28-40: In test_litellm_env replace any use of bare `assert False`
with `pytest.fail(<message>)` or explicitly `raise AssertionError(<message>)` so
failures aren’t skipped under Python -O; also convert the stray triple-quoted
comment inside the function (the second docstring-like line) into a normal
inline or block comment to avoid confusion; update messages to include context
(e.g., "litellm package is not installed" and "OPENAI_API_KEY environment
variable is not set") and keep changes localized to the test_litellm_env
function.

In @tests/integration_tests/test_qwen_embedding_function.py:
- Around line 173-178: The query string uses a fullwidth question mark which can
break linters/tools; update the variable query_text (used in the
collection.query call) to use an ASCII question mark instead (e.g., change
"什么是嵌入?" to "什么是嵌入?") so the test avoids tooling surprises.
- Line 210: Remove the unnecessary f-string prefix in the test print call:
replace the print(f"   Embedding generated successfully with custom parameters")
invocation in tests/integration_tests/test_qwen_embedding_function.py with a
plain string print (no interpolation) so it becomes print("   Embedding
generated successfully with custom parameters"); this updates the print
statement that has no placeholders to avoid misleading f-string usage.
- Around line 31-42: The test_litellm_env function uses "assert False" which can
be stripped with python -O; replace those with explicit raises of AssertionError
including the same messages (e.g., raise AssertionError("litellm package is not
installed") and raise AssertionError("QWEN_API_KEY environment variable is not
set")) so failures are always raised regardless of optimization flags.
🧹 Nitpick comments (7)
src/pyseekdb/client/embedding_function.py (1)

176-259: Avoid broad except Exception + use logger.exception() for download failures. The current pattern can hide root causes and loses stack traces.

Proposed diff
@@
-            logger.info("Successfully downloaded all model files from Hugging Face")
-            return True
+            logger.info("Successfully downloaded all model files from Hugging Face")
+            return True
@@
-        except Exception as e:
-            logger.error(f"Error downloading from Hugging Face: {e}")
-            return False
+        except Exception:
+            logger.exception("Error downloading from Hugging Face")
+            return False
src/pyseekdb/utils/embedding_functions/__init__.py (1)

8-18: Export module looks good; consider satisfying RUF022 (__all__ sort) if lint is enforced.

tests/integration_tests/test_qwen_embedding_function.py (2)

67-108: Avoid printing full embeddings to logs. This is very noisy (and can get huge); logging the dimension is typically enough.


172-191: Cleanup except Exception is fine for test hygiene, but at least log the traceback for failures.

tests/integration_tests/test_openai_embedding_function.py (2)

41-64: Consider removing redundant test_litellm_env() calls.

Since the entire test class is already decorated with @pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), ...), the API key check in test_litellm_env() is redundant when called from other test methods. The litellm import check could be moved to a class-level fixture or setup_class method for cleaner test organization.


181-186: Consider catching a more specific exception.

Catching bare Exception can mask unexpected errors. Consider catching specific exceptions that delete_collection might raise.

Proposed fix
         # Cleanup
         try:
             db_client.delete_collection(name=collection_name)
             print(f"   Cleaned up collection: {collection_name}")
-        except Exception as e:
+        except (ValueError, RuntimeError) as e:  # Adjust based on actual exceptions raised
             print(f"   Warning: Failed to cleanup collection: {e}")
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)

82-95: Chain the exception and use TypeError for type validation.

Two improvements for exception handling:

  1. Use raise ... from err to preserve the exception chain for ImportError
  2. Use TypeError instead of ValueError when validating argument types (line 93)
Proposed fix
         try:
             from litellm import embedding
-        except ImportError:
+        except ImportError as err:
             raise ValueError(
                 "The litellm python package is not installed. Please install it with `pip install litellm`"
-            )
+            ) from err

         self.model_name = model_name
         self.api_key_env = api_key_env
         for key, value in kwargs.items():
             if not isinstance(value, (str, int, float, bool, list, dict, tuple, type(None))):
-                raise ValueError(f"Keyword argument {key} is not a primitive type")
+                raise TypeError(f"Keyword argument '{key}' must be a primitive type, got {type(value).__name__}")
         self.kwargs = kwargs
         self._embedding_func = embedding
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e3f34d and 13e2b29.

📒 Files selected for processing (8)
  • src/pyseekdb/client/embedding_function.py
  • src/pyseekdb/utils/embedding_functions/__init__.py
  • src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
  • tests/integration_tests/test_openai_embedding_function.py
  • tests/integration_tests/test_qwen_embedding_function.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-22T12:44:52.456Z
Learnt from: longbingljw
Repo: oceanbase/pyseekdb PR: 91
File: tests/integration_tests/test_collection_dml.py:36-36
Timestamp: 2025-12-22T12:44:52.456Z
Learning: In integration tests under tests/integration_tests, it is acceptable to use the internal db_client._server._execute() to run raw SQL (e.g., CREATE TABLE) when needing custom table setup with specific vector index configurations. Prefer using public/test harness APIs for common setup, and clearly document and limit the use of internal calls to ensure test reliability. Ensure proper cleanup and isolation for each test, and avoid leaking internal APIs into production or non-test code.

Applied to files:

  • tests/integration_tests/test_openai_embedding_function.py
  • tests/integration_tests/test_qwen_embedding_function.py
📚 Learning: 2025-12-22T12:45:51.412Z
Learnt from: longbingljw
Repo: oceanbase/pyseekdb PR: 91
File: tests/integration_tests/test_collection_hybrid_search_builder_integration.py:127-132
Timestamp: 2025-12-22T12:45:51.412Z
Learning: In integration test Python files under tests/integration_tests, it is acceptable to use string interpolation to construct SQL INSERT statements for test data when the data is controlled and internal to the test. This pattern should only be used for trusted, test-only data and not with untrusted input; ensure the injection risk is mitigated by keeping inputs deterministic, non-user-supplied, and isolated to the test environment. If possible, prefer parameterized queries in non-test code, but in these tests, interpolate values selectively when you can guarantee safety.

Applied to files:

  • tests/integration_tests/test_openai_embedding_function.py
  • tests/integration_tests/test_qwen_embedding_function.py
🧬 Code graph analysis (7)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
  • LiteLLMEmbeddingFunction (6-179)
tests/integration_tests/test_openai_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
  • OpenAIEmbeddingFunction (5-80)
src/pyseekdb/client/embedding_function.py (1)
  • dimension_of (67-82)
src/pyseekdb/utils/embedding_functions/__init__.py (4)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
  • SentenceTransformerEmbeddingFunction (5-83)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
  • LiteLLMEmbeddingFunction (6-179)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
  • OpenAIEmbeddingFunction (5-80)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
  • QwenEmbeddingFunction (5-92)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (1)
  • LiteLLMEmbeddingFunction (6-179)
tests/integration_tests/test_qwen_embedding_function.py (5)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
  • QwenEmbeddingFunction (5-92)
src/pyseekdb/client/collection.py (5)
  • client (78-80)
  • embedding_function (88-90)
  • get (364-428)
  • name (63-65)
  • add (103-149)
src/pyseekdb/client/embedding_function.py (1)
  • dimension_of (67-82)
tests/integration_tests/conftest.py (1)
  • db_client (147-181)
src/pyseekdb/client/meta_info.py (1)
  • collection_name (22-26)
src/pyseekdb/client/embedding_function.py (1)
src/pyseekdb/client/collection.py (1)
  • dimension (73-75)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (2)
src/pyseekdb/client/collection.py (2)
  • client (78-80)
  • embedding_function (88-90)
src/pyseekdb/client/embedding_function.py (1)
  • EmbeddingFunction (38-65)
🪛 Ruff (0.14.10)
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py

85-87: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


85-87: Avoid specifying long messages outside the exception class

(TRY003)


93-93: Prefer TypeError exception for invalid type

(TRY004)


93-93: Avoid specifying long messages outside the exception class

(TRY003)


125-128: Avoid specifying long messages outside the exception class

(TRY003)


150-150: Avoid specifying long messages outside the exception class

(TRY003)


160-160: Avoid specifying long messages outside the exception class

(TRY003)


169-169: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Avoid specifying long messages outside the exception class

(TRY003)


175-177: Avoid specifying long messages outside the exception class

(TRY003)

tests/integration_tests/test_openai_embedding_function.py

34-34: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


39-39: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


185-185: Do not catch blind exception: Exception

(BLE001)

src/pyseekdb/utils/embedding_functions/__init__.py

13-18: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tests/integration_tests/test_qwen_embedding_function.py

37-37: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


41-41: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


173-173: String contains ambiguous (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?

(RUF001)


189-189: Do not catch blind exception: Exception

(BLE001)


210-210: f-string without any placeholders

Remove extraneous f prefix

(F541)

src/pyseekdb/client/embedding_function.py

254-254: Consider moving this statement to an else block

(TRY300)


256-256: Do not catch blind exception: Exception

(BLE001)


257-257: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-test (embedded)
  • GitHub Check: integration-test (oceanbase)
  • GitHub Check: integration-test (server)
🔇 Additional comments (13)
src/pyseekdb/client/embedding_function.py (2)

37-66: Generic EmbeddingFunction protocol update looks consistent.


84-95: DefaultEmbeddingFunction(EmbeddingFunction[Documents]) aligns typing with the rest of the embedding ecosystem.

src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)

1-1: Import surface change (dropping Space) is fine.

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)

1-2: LGTM!

Imports are appropriate and minimal.


5-40: LGTM!

The class docstring is comprehensive with clear examples demonstrating installation, environment variable setup, and usage patterns with collections.


42-92: LGTM!

The __init__ method correctly sets Qwen-specific defaults:

  • api_key_env defaults to "QWEN_API_KEY"
  • api_base defaults to Qwen's DashScope endpoint
  • custom_llm_provider set to "openai" for OpenAI-compatible API
  • encoding_format defaults to "float"

The implementation follows the same pattern as OpenAIEmbeddingFunction while adding necessary Qwen-specific configuration.

Minor: The variable api_base_provided on line 73 could be inlined into the condition, but this is a stylistic preference and the current form is readable.

tests/integration_tests/test_openai_embedding_function.py (3)

65-106: LGTM!

Good coverage of single document, multiple documents, and empty input scenarios. The assertions properly validate the embedding structure and dimensions.


107-120: LGTM!

Properly tests dimension detection using the dimension_of helper and validates the expected 1536 dimensions for text-embedding-ada-002.


188-214: LGTM!

Good coverage of different OpenAI embedding models with correct dimension assertions:

  • text-embedding-ada-002: 1536 dimensions
  • text-embedding-3-small: 1536 dimensions
  • text-embedding-3-large: 3072 dimensions
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py (4)

1-4: LGTM!

Imports are appropriate and minimal.


6-51: LGTM!

Comprehensive docstring with clear examples covering multiple providers (OpenAI, Hugging Face) and common configuration options.


97-132: LGTM!

The __call__ method properly:

  • Normalizes single string input to a list
  • Handles empty input gracefully
  • Builds embedding kwargs with model, input, and additional parameters
  • Validates API key presence when api_key_env is specified

134-179: LGTM!

The response parsing logic is robust, handling multiple response formats:

  • EmbeddingResponse object with .data attribute (most common)
  • Dict response with "data" key (backward compatibility)
  • List response (backward compatibility)

The validation at the end ensures the number of embeddings matches the input count, which is important for data integrity.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for OpenAI and Qwen embedding platforms through LiteLLM integration. The implementation introduces a base LiteLLMEmbeddingFunction class that provides unified access to multiple embedding model providers, with specialized wrapper classes for OpenAI and Qwen that set provider-specific defaults.

Changes:

  • Added LiteLLMEmbeddingFunction base class supporting 100+ LLM platforms through LiteLLM SDK
  • Implemented OpenAIEmbeddingFunction and QwenEmbeddingFunction wrapper classes with platform-specific configurations
  • Added comprehensive integration tests for both platforms (skipped by default due to API key requirements)
  • Updated documentation formatting and removed unused imports in existing embedding function code

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/pyseekdb/utils/embedding_functions/litellm_embedding_function.py Base implementation for LiteLLM-based embedding functions with comprehensive response parsing
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py OpenAI-specific wrapper with default model and configuration
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py Qwen-specific wrapper with Alibaba Cloud endpoint and OpenAI-compatible API settings
src/pyseekdb/utils/embedding_functions/init.py Module exports for new embedding function classes
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py Removed unused Space import and improved documentation formatting
src/pyseekdb/client/embedding_function.py Formatting improvements and type annotation additions to DefaultEmbeddingFunction
tests/integration_tests/test_openai_embedding_function.py Comprehensive integration tests for OpenAI embedding function with multiple model variants
tests/integration_tests/test_qwen_embedding_function.py Comprehensive integration tests for Qwen embedding function with Chinese text examples

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tests/unit_tests/test_openai_embedding_function.py:
- Around line 181-193: The test docstring and intent mismatch:
test_dimension_property_unknown_model expects an API fallback but uses a known
model from _OPENAI_MODEL_DIMENSIONS; update the test to use an unknown model
name (e.g., any string not present in _OPENAI_MODEL_DIMENSIONS) so
OpenAIEmbeddingFunction(model_name=...) triggers the API call for dimension
detection, or alternatively change the docstring/name to reflect that it is
testing a known model; modify test_dimension_property_unknown_model to reference
OpenAIEmbeddingFunction and _OPENAI_MODEL_DIMENSIONS accordingly so the behavior
matches the test intent.

In @tests/unit_tests/test_qwen_embedding_function.py:
- Around line 173-185: The test method test_dimension_property_unknown_model is
mislabeled: it claims to exercise the API fallback for unknown models but
constructs QwenEmbeddingFunction(model_name="text-embedding-v1"), which is
present in _QWEN_MODEL_DIMENSIONS so no API call occurs; either change the
docstring to describe that it verifies a known model dimension, or replace the
model_name with a truly unknown identifier (e.g., "unknown-embedding-model") to
force QwenEmbeddingFunction.dimension to call the API and validate the fallback
behavior.
🧹 Nitpick comments (10)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)

56-61: Chain the exception for better debugging context.

When re-raising an exception within an except block, use raise ... from err to preserve the original traceback.

♻️ Proposed fix
         try:
             from openai import OpenAI
         except ImportError:
-            raise ValueError(
+            raise ValueError(
                 "The openai python package is not installed. Please install it with `pip install openai`"
-            )
+            ) from None

77-92: Remove duplicate storage of model_name.

model_name is stored twice: as self.model_name (line 78) and self._model_name (line 92). The _model_name assignment appears redundant since self.model_name is already available and used in __call__.

♻️ Proposed fix
         # Store configuration
         self.model_name = model_name
         self.api_key_env = api_key_env
         self.api_base = api_base
         self._dimensions_param = dimensions
         self._client_kwargs = kwargs

         # Initialize OpenAI client
         self._client = OpenAI(
             api_key=api_key,
             base_url=api_base,
             **kwargs
         )
-
-        # Store original model name for dimension lookup
-        self._model_name = model_name

Then update the dimension property to use self.model_name:

-        if self._model_name in model_dimensions:
-            return model_dimensions[self._model_name]
+        if self.model_name in model_dimensions:
+            return model_dimensions[self.model_name]

154-160: Chain the exception for debugging context.

When catching an exception and raising a new one, preserve the original cause using raise ... from e.

♻️ Proposed fix
         try:
             embeddings = self([test_input])
         except Exception as e:
-            raise RuntimeError(f"Failed to determine embedding dimension via API call: {e}")
+            raise RuntimeError(f"Failed to determine embedding dimension via API call: {e}") from e
tests/unit_tests/test_qwen_embedding_function.py (4)

27-38: Replace assert False with pytest.fail() for proper test failures.

assert False is removed when Python runs with optimization (-O), making these assertions ineffective. Use pytest.fail() instead.

♻️ Proposed fix
     def test_qwen_env(self):
         """Test if openai package is installed and required environment variables are set."""
         try:
             import openai
         except ImportError:
             print("openai package is not installed")
-            assert False, "openai package is not installed"
+            pytest.fail("openai package is not installed")

         if not os.environ.get("DASHSCOPE_API_KEY"):
             print("DASHSCOPE_API_KEY environment variable is not set")
-            assert False, "DASHSCOPE_API_KEY environment variable is not set"
+            pytest.fail("DASHSCOPE_API_KEY environment variable is not set")

72-87: Avoid polluting environment with fake API key.

Setting a fake API key ("your-custom-key") in the environment could cause issues if subsequent tests attempt real API calls. Consider using monkeypatch fixture for isolated environment changes.

♻️ Proposed fix using monkeypatch
-    def test_initialization_with_custom_api_key_env(self):
+    def test_initialization_with_custom_api_key_env(self, monkeypatch):
         """Test QwenEmbeddingFunction initialization with custom API key env"""
         print("\n✅ Testing QwenEmbeddingFunction initialization with custom API key env")

         self.test_qwen_env()

         custom_key_env = "CUSTOM_QWEN_KEY"
-        if not os.environ.get(custom_key_env):
-            os.environ[custom_key_env] = "your-custom-key"
+        monkeypatch.setenv(custom_key_env, "test-key-value")

         ef = QwenEmbeddingFunction(
             model_name="text-embedding-v1",
             api_key_env=custom_key_env
         )
         assert ef.api_key_env == custom_key_env
         print(f"   Custom API key env: {ef.api_key_env}")

219-222: Minor: Unused loop variable and f-string without placeholders.

Per static analysis hints, rename i to _i and remove the extraneous f prefix.

♻️ Proposed fix
-        for i, emb in enumerate(embeddings):
+        for _i, emb in enumerate(embeddings):
             assert isinstance(emb, list)
-            assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension"
+            assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"

117-129: Minor: f-string without placeholders.

Remove the extraneous f prefix on line 129.

♻️ Proposed fix
         assert ef is not None
-        print(f"   Initialized with timeout and max_retries")
+        print("   Initialized with timeout and max_retries")
tests/unit_tests/test_openai_embedding_function.py (3)

27-38: Replace assert False with pytest.fail() for proper test failures.

Same issue as Qwen tests - assert False is removed with Python optimization flag.

♻️ Proposed fix
     def test_openai_env(self):
         """Test if openai package is installed and required environment variables are set."""
         try:
             import openai
         except ImportError:
             print("openai package is not installed")
-            assert False, "openai package is not installed"
+            pytest.fail("openai package is not installed")

         if not os.environ.get("OPENAI_API_KEY"):
             print("OPENAI_API_KEY environment variable is not set")
-            assert False, "OPENAI_API_KEY environment variable is not set"
+            pytest.fail("OPENAI_API_KEY environment variable is not set")

227-231: Minor: Unused loop variable and f-string without placeholders.

Per static analysis hints, rename i to _i and remove the extraneous f prefix.

♻️ Proposed fix
-        for i, emb in enumerate(embeddings):
+        for _i, emb in enumerate(embeddings):
             assert isinstance(emb, list)
-            assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension"
+            assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"

109-120: Minor: f-string without placeholders.

Remove the extraneous f prefix on line 120.

♻️ Proposed fix
         assert ef is not None
-        print(f"   Initialized with timeout and max_retries")
+        print("   Initialized with timeout and max_retries")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13e2b29 and 98a76cc.

📒 Files selected for processing (5)
  • src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
  • tests/unit_tests/test_openai_embedding_function.py
  • tests/unit_tests/test_qwen_embedding_function.py
🧰 Additional context used
🧬 Code graph analysis (5)
tests/unit_tests/test_qwen_embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • _get_default_api_base (95-101)
  • _get_default_api_key_env (103-109)
  • _get_model_dimensions (111-117)
src/pyseekdb/client/embedding_function.py (2)
  • dimension_of (67-82)
  • model (351-394)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • dimension (129-160)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • OpenAIBaseEmbeddingFunction (6-201)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
tests/unit_tests/test_openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
  • OpenAIEmbeddingFunction (13-119)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
src/pyseekdb/client/embedding_function.py (1)
  • dimension_of (67-82)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • dimension (129-160)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • _get_default_api_key_env (105-111)
  • _get_default_api_base (97-103)
  • _get_model_dimensions (113-119)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • OpenAIBaseEmbeddingFunction (6-201)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
🪛 Ruff (0.14.10)
tests/unit_tests/test_qwen_embedding_function.py

33-33: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


37-37: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


129-129: f-string without any placeholders

Remove extraneous f prefix

(F541)


219-219: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


221-221: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/unit_tests/test_openai_embedding_function.py

33-33: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


37-37: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


120-120: f-string without any placeholders

Remove extraneous f prefix

(F541)


227-227: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


229-229: f-string without any placeholders

Remove extraneous f prefix

(F541)

src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py

59-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


59-61: Avoid specifying long messages outside the exception class

(TRY003)


72-75: Avoid specifying long messages outside the exception class

(TRY003)


156-156: Do not catch blind exception: Exception

(BLE001)


157-157: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


157-157: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


197-199: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-test (server)
  • GitHub Check: integration-test (oceanbase)
  • GitHub Check: integration-test (embedded)
🔇 Additional comments (15)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)

1-4: LGTM! Clean imports and module setup.

The imports are appropriate for the base class functionality.


94-127: LGTM! Well-designed abstract hooks.

The abstract method pattern with NotImplementedError for required overrides and a default empty dict for optional _get_model_dimensions is appropriate for the extensibility goal.


162-201: LGTM! Robust implementation with proper input handling and validation.

The __call__ method correctly handles single strings, empty inputs, and validates the response count. The implementation follows the EmbeddingFunction protocol appropriately.

src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)

1-11: LGTM! Clean imports and model dimension mapping.

The _OPENAI_MODEL_DIMENSIONS dictionary correctly maps known OpenAI embedding models to their default dimensions, with proper source documentation.


13-56: LGTM! Comprehensive documentation with practical examples.

The class docstring provides clear usage examples including default initialization, custom models, and dimension reduction scenarios.


58-95: LGTM! Clean delegation to base class.

The constructor properly delegates all parameters to the base class while providing sensible defaults (model_name defaulting to "text-embedding-ada-002").


97-119: LGTM! Hook methods correctly implement OpenAI-specific configuration.

The _get_default_api_base, _get_default_api_key_env, and _get_model_dimensions methods provide the correct OpenAI-specific values.

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)

1-12: LGTM! Appropriate model dimension mapping for Qwen models.

The _QWEN_MODEL_DIMENSIONS dictionary correctly maps Qwen embedding model versions to their dimensions.


14-55: LGTM! Clear documentation with Qwen-specific examples.

The docstring provides appropriate examples demonstrating initialization with required model_name and optional parameters.


57-93: LGTM! Constructor properly requires model_name.

Unlike OpenAIEmbeddingFunction, model_name is a required parameter here, which is appropriate since Qwen doesn't have a single canonical default model.


95-117: LGTM! Qwen-specific hook implementations.

The hook methods correctly return Qwen's DashScope API base URL, API key environment variable name, and model dimension mapping.

tests/unit_tests/test_qwen_embedding_function.py (2)

131-143: LGTM! Good test for missing API key validation.

The test correctly saves, removes, tests, and restores the environment variable to validate the error handling.


225-234: LGTM! Good empty input edge case coverage.

Testing empty input returns empty list validates the base class behavior correctly.

tests/unit_tests/test_openai_embedding_function.py (2)

70-82: Good: Uses pytest.skip for missing custom env var.

Unlike the Qwen test, this correctly skips when the custom environment variable isn't set rather than polluting the environment with a fake value.


336-351: LGTM! Good parametric test for model dimensions.

The test efficiently validates all known models return correct dimensions using a test case list.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 28-30: The example comment above the OpenAIEmbeddingFunction
instantiation references the old default model "text-embedding-ada-002"; update
that comment to match the actual default model used by OpenAIEmbeddingFunction
(text-embedding-3-small) or make the comment generic. Specifically, edit the
doc/example lines where ef = OpenAIEmbeddingFunction() is shown to either state
"Using default model (text-embedding-3-small)" or remove the model name, and
ensure consistency with the default defined in OpenAIEmbeddingFunction (the
default_model value inside that class/constructor).

In `@src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py`:
- Around line 4-11: Update the comment near _QWEN_MODEL_DIMENSIONS to document
that text-embedding-v3 and text-embedding-v4 support variable output dimensions
(not just the defaults currently listed); specifically note that v3 supports
1024, 768, and 512 and v4 supports 1024, 2048, 1536, 768, 512, 256, 128, and 64,
and that the output size can be configured via the DashScope API—leave the
default values in the mapping but add a clarifying comment referencing
_QWEN_MODEL_DIMENSIONS and the DashScope configuration option.

In `@tests/unit_tests/test_openai_embedding_function.py`:
- Around line 70-82: The test test_initialization_with_custom_api_key_env leaks
the CUSTOM_OPENAI_KEY environment variable; fix it by saving the original
os.environ.get("CUSTOM_OPENAI_KEY") before setting it, then wrap the setup and
assertion in a try/finally (or use a context manager) and in the finally block
restore the original value (reassign if not None, or delete the key if it was
absent) so OpenAIEmbeddingFunction(api_key_env=custom_key_env) runs but the
environment is restored afterward.
- Around line 27-38: The test_openai_env function uses `assert False` which can
be ignored when Python runs with -O; replace each `assert False, "message"` with
an explicit `raise AssertionError("message")` so failures are always raised
(update both the ImportError branch and the OPENAI_API_KEY check in
`test_openai_env`).
♻️ Duplicate comments (1)
tests/unit_tests/test_openai_embedding_function.py (1)

181-193: Test name and docstring are misleading - model is known, not unknown.

The test test_dimension_property_unknown_model uses "text-embedding-ada-002", which is in _OPENAI_MODEL_DIMENSIONS. This won't trigger the API fallback for dimension detection. Either rename the test to reflect it tests a known model, or use an actually unknown model name.

📝 Suggested fix (option 1: rename to reflect actual behavior)
     def test_dimension_property_unknown_model(self):
-        """Test dimension property for unknown model (should make API call)"""
-        print("\n✅ Testing OpenAIEmbeddingFunction dimension property for unknown model")
+        """Test dimension property for known model returns expected value"""
+        print("\n✅ Testing OpenAIEmbeddingFunction dimension property for known model")
         
         self.test_openai_env()

-        # This will make an actual API call to get dimension
         ef = OpenAIEmbeddingFunction(model_name="text-embedding-ada-002")
         dim = ef.dimension
         
         # Should have a valid dimension
-        assert dim > 0
-        print(f"   Unknown model dimension (via API call): {dim}")
+        assert dim == 1536
+        print(f"   Known model dimension: {dim}")
🧹 Nitpick comments (2)
tests/unit_tests/test_openai_embedding_function.py (2)

120-120: Remove extraneous f-string prefix.

This f-string has no placeholders.

-        print(f"   Initialized with timeout and max_retries")
+        print("   Initialized with timeout and max_retries")

227-230: Minor static analysis issues: unused loop variable and f-string without placeholders.

📝 Suggested fix
-        for i, emb in enumerate(embeddings):
+        for _i, emb in enumerate(embeddings):
             assert isinstance(emb, list)
-            assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension"
+            assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"

Alternatively, if you don't need the index at all:

-        for i, emb in enumerate(embeddings):
+        for emb in embeddings:
             assert isinstance(emb, list)
-            assert len(emb) == len(embeddings[0]), f"All embeddings should have same dimension"
+            assert len(emb) == len(embeddings[0]), "All embeddings should have same dimension"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98a76cc and 7dab341.

📒 Files selected for processing (3)
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
  • tests/unit_tests/test_openai_embedding_function.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (2)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • OpenAIBaseEmbeddingFunction (6-201)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
tests/unit_tests/test_openai_embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
  • OpenAIEmbeddingFunction (13-119)
  • _get_default_api_base (97-103)
  • _get_default_api_key_env (105-111)
  • _get_model_dimensions (113-119)
src/pyseekdb/client/embedding_function.py (1)
  • dimension_of (67-82)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
  • dimension (129-160)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • _get_default_api_base (95-101)
  • _get_default_api_key_env (103-109)
  • _get_model_dimensions (111-117)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (3)
  • _get_default_api_base (94-104)
  • _get_default_api_key_env (106-116)
  • _get_model_dimensions (118-126)
🪛 Ruff (0.14.11)
tests/unit_tests/test_openai_embedding_function.py

33-33: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


37-37: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


120-120: f-string without any placeholders

Remove extraneous f prefix

(F541)


227-227: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


229-229: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-test (oceanbase)
  • GitHub Check: integration-test (server)
  • GitHub Check: integration-test (embedded)
🔇 Additional comments (10)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)

1-11: Model dimensions mapping looks good.

The _OPENAI_MODEL_DIMENSIONS constant correctly maps known OpenAI embedding models to their dimensions, which avoids unnecessary API calls during dimension discovery for known models. This addresses the prior concern about paid API calls during collection creation.


58-95: Implementation correctly wires base class with OpenAI-specific defaults.

The constructor properly delegates to the base class and the default model is now "text-embedding-3-small" (the current recommended model). The parameter documentation is thorough and accurate.


97-119: Helper methods correctly implement provider-specific overrides.

The _get_default_api_base, _get_default_api_key_env, and _get_model_dimensions methods properly implement the abstract methods from OpenAIBaseEmbeddingFunction, returning OpenAI-specific defaults.

tests/unit_tests/test_openai_embedding_function.py (3)

1-24: Test structure and skip logic are well-designed.

The test suite correctly skips when OPENAI_API_KEY is not set, preventing CI failures while allowing manual testing with real API calls. The module docstring clearly explains how to run the tests.


136-159: Dimension tests for known models are thorough.

Good coverage of all three known OpenAI models with correct expected dimensions.


244-270: Dimension parameter tests validate actual API behavior.

These tests make real API calls to verify that the dimensions parameter correctly reduces embedding dimensions for text-embedding-3 models. This is valuable for ensuring the feature works end-to-end.

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)

1-2: LGTM!

Imports are correct and minimal.


14-55: LGTM!

Well-documented class with comprehensive usage examples covering default initialization, custom parameters, and integration with the collection API.


57-93: LGTM!

Constructor correctly delegates to the base class and documentation is thorough. The template method pattern works correctly here—Python's dynamic dispatch ensures the subclass overrides (_get_default_api_base, _get_default_api_key_env) are called from the parent's __init__.


95-117: LGTM!

Helper methods correctly implement the template method hooks with appropriate Qwen/DashScope defaults. The implementation is consistent with OpenAIEmbeddingFunction.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@hnwyllmm hnwyllmm merged commit 702f93c into oceanbase:develop Jan 14, 2026
6 checks passed
@hnwyllmm hnwyllmm deleted the embedding-function-base branch January 14, 2026 03:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support openai and qwen embedding platform

1 participant