Refactor: Refactor Scheduler to Support Dynamic Workflow Scheduling and Pipeline Pooling #301
weijinglin wants to merge 9 commits into apache:main
Conversation
Pull Request Overview
This PR refactors the Scheduler to support dynamic workflow scheduling through pipeline pooling, enabling more flexible and extensible workflow management. The changes separate workflow logic into preparation, execution, and post-processing phases for better modularity.
Key changes:
- Introduced a Scheduler singleton with pipeline pooling for build_vector_index and graph_extract workflows
- Created new flow classes (BuildVectorIndexFlow, GraphExtractFlow) and node-based operators for pipeline execution
- Refactored utility functions to use the new scheduler.schedule_flow() interface instead of direct builder invocations
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| hugegraph-llm/src/hugegraph_llm/flows/scheduler.py | Implements the core Scheduler and SchedulerSingleton classes for workflow management |
| hugegraph-llm/src/hugegraph_llm/flows/build_vector_index.py | Defines BuildVectorIndexFlow with prepare/build/post-processing logic |
| hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py | Defines GraphExtractFlow with schema import and extraction logic |
| hugegraph-llm/src/hugegraph_llm/flows/common.py | Provides BaseFlow abstract class defining flow interface |
| hugegraph-llm/src/hugegraph_llm/state/ai_state.py | Introduces WkFlowInput and WkFlowState parameter classes for workflow state management |
| hugegraph-llm/src/hugegraph_llm/operators/util.py | Adds init_context helper for workflow context initialization |
| hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py | Adds ChunkSplitNode implementing GNode-based chunk splitting |
| hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py | Adds BuildVectorIndexNode for vector index construction in pipeline |
| hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py | Adds PropertyGraphExtractNode for pipeline-based property graph extraction |
| hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py | Adds InfoExtractNode for pipeline-based triple extraction |
| hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py | Adds SchemaManagerNode for pipeline-based schema management |
| hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py | Adds CheckSchemaNode for pipeline-based schema validation |
| hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py | Adds standalone get_chat_llm, get_extract_llm, get_text2gql_llm functions |
| hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py | Adds model_map and get_embedding function for embedding initialization |
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Updates to use scheduler.schedule_flow() and model_map |
| hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py | Updates extract_graph to use scheduler and renames original to extract_graph_origin |
| hugegraph-llm/pyproject.toml | Adds pycgraph dependency with git source specification |
| language: str = None # language configuration used by ChunkSplit Node |
| split_type: str = None # split type used by ChunkSplit Node |
| example_prompt: str = None # need by graph information extract |
| schema: str = None # Schema information requeired by SchemaNode |
Corrected spelling of 'requeired' to 'required'.
| schema: str = None # Schema information requeired by SchemaNode |
| schema: str = None # Schema information required by SchemaNode |
| flow: BaseFlow = self.pipeline_pool[flow]["flow"] |
| pipeline = manager.fetch() |
| if pipeline is None: |
| # call coresponding flow_func to create new workflow |
Corrected spelling of 'coresponding' to 'corresponding'.
| # call coresponding flow_func to create new workflow |
| # call corresponding flow_func to create new workflow |
| [tool.uv.sources] |
| hugegraph-python-client = { workspace = true } |
| pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'" } |
The pycgraph dependency is marked with marker = "sys_platform == 'linux'", meaning it will only install on Linux systems. However, the code imports and uses PyCGraph unconditionally without any platform checks or error handling.
Impact: The application will crash immediately on non-Linux systems (macOS, Windows) when trying to import modules from hugegraph_llm.flows.
Recommendation:
- Add platform compatibility checks and graceful degradation
- Provide clear error messages for unsupported platforms
- Consider making CGraph support optional with a feature flag
- Document platform requirements in README/docs
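A minimal sketch of the graceful-degradation approach described above, assuming the guarded import lives wherever the flows package first touches PyCGraph (the exact module path and the `require_pycgraph` helper name are illustrative, not from the PR):

```python
# Sketch: guard the platform-specific PyCGraph import instead of crashing
# at import time on macOS/Windows. HAS_PYCGRAPH and require_pycgraph are
# hypothetical names for illustration.
import platform

try:
    from PyCGraph import GPipeline  # only installable on Linux per the marker
    HAS_PYCGRAPH = True
except ImportError:
    GPipeline = None
    HAS_PYCGRAPH = False


def require_pycgraph() -> None:
    """Raise a clear, actionable error instead of a bare ImportError."""
    if not HAS_PYCGRAPH:
        raise RuntimeError(
            "Pipeline-based flows require pycgraph, which is only packaged "
            f"for Linux (current platform: {platform.system()}). "
            "Install it manually or use the legacy non-pipeline code path."
        )
```

Call `require_pycgraph()` at the top of `schedule_flow` (or flow construction) so unsupported platforms fail with a descriptive message rather than an import traceback.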
| def agentic_flow(self): |
| pass |
| def schedule_flow(self, flow: str, *args, **kwargs): |
The schedule_flow method accesses and modifies the pipeline pool without proper synchronization. While SchedulerSingleton uses a lock for instance creation, the schedule_flow method itself doesn't protect concurrent access to the shared pipeline_pool dictionary.
Race condition scenarios:
- Multiple threads calling schedule_flow for the same flow type
- Concurrent fetch() and release() operations on the same manager
- Pipeline state corruption during concurrent runs
Recommendation:
Add proper locking around pipeline operations. Note that `with threading.Lock():` inside the method would create a fresh lock on every call and protect nothing; the lock must be created once (e.g. in `__init__`) and shared:
def schedule_flow(self, flow: str, *args, **kwargs):
    if flow not in self.pipeline_pool:
        raise ValueError(f"Unsupported workflow {flow}")
    with self._pool_lock:  # single lock created once in __init__
        manager = self.pipeline_pool[flow]["manager"]
        # ... rest of the logic

| def __init__(self): |
| pass |
| def prepare(self, prepared_input: WkFlowInput, texts): |
The prepare method hardcodes language = "zh" (Chinese) and split_type = "paragraph". This contradicts the PR's goal of "flexible and extensible workflow scheduling."
Issues:
- Non-Chinese users cannot use this flow without code modification
- No way to customize split strategy for different document types
- Breaks the principle of configuration over convention
Recommendation:
Make these configurable parameters:
def prepare(self, prepared_input: WkFlowInput, texts, language="zh", split_type="paragraph"):
    prepared_input.texts = texts
    prepared_input.language = language
    prepared_input.split_type = split_type

| pipeline = manager.fetch() |
| if pipeline is None: |
| # call coresponding flow_func to create new workflow |
| pipeline = flow.build_flow(*args, **kwargs) |
When pipeline.init() or pipeline.run() fails, the code raises an exception but never calls manager.add(pipeline) to return the pipeline to the pool. This creates a resource leak where failed pipelines are never recycled.
Impact: After max_pipeline failures, the system runs out of pipelines and cannot process new requests.
Recommendation:
Use try-finally to ensure pipelines are always returned:
pipeline = flow.build_flow(*args, **kwargs)
try:
    status = pipeline.init()
    if status.isErr():
        raise RuntimeError(f"Error in flow init: {status.getInfo()}")
    status = pipeline.run()
    if status.isErr():
        raise RuntimeError(f"Error in flow execution: {status.getInfo()}")
    res = flow.post_deal(pipeline)
    return res
finally:
    manager.add(pipeline)

| @@ -17,10 +17,40 @@ |
The get_embedding function is defined twice in this file - once as a standalone function (lines 18-51) and once as a method in the Embeddings class (lines 60-78). This creates maintenance issues and confusion.
Recommendation:
- Remove the duplicate standalone function if it's not being used
- Or deprecate the Embeddings class if migrating to the new pattern
- Update all callers to use the consistent API
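One possible shape for the deprecation option, assuming the standalone `get_embedding` is the API going forward (the function body and model name below are stand-ins, not the file's real implementation):

```python
# Sketch: keep the Embeddings class as a thin deprecated shim that delegates
# to the standalone factory, so callers migrate without breaking.
import warnings


def get_embedding(model_name: str):
    """Stand-in for the real standalone factory in init_embedding.py."""
    return f"embedding-for-{model_name}"


class Embeddings:
    def get_embedding(self, model_name: str):
        warnings.warn(
            "Embeddings.get_embedding is deprecated; use the standalone "
            "get_embedding() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return get_embedding(model_name)  # single source of truth
```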
| edge_label["target_label"], |
| str, |
| "'target_label' in edge_label is not of correct type.", |
| ) |
The _process_keys method silently filters out keys that don't exist in the properties list (new_keys = [key for key in keys if key in label['properties']]). This could hide configuration errors where users specify invalid keys.
Issues:
- Typos in key names go undetected
- No feedback when configured keys are ignored
- Could lead to unexpected schema behavior
Recommendation:
Add validation to warn or error on invalid keys:
def _process_keys(self, label: Dict[str, Any], key_type: str, default_keys: list) -> list:
    keys = label.get(key_type, default_keys)
    check_type(keys, list, f"'{key_type}' in {label['name']} is not of correct type.")
    invalid_keys = [key for key in keys if key not in label['properties']]
    if invalid_keys:
        log.warning(f"Keys {invalid_keys} in {key_type} are not present in properties for {label['name']}")
    new_keys = [key for key in keys if key in label['properties']]
    return new_keys
| class Scheduler: |
| pipeline_pool: Dict[str, Any] = None |
| max_pipeline: int |
The max_pipeline parameter is stored in __init__ but never used anywhere in the class. The GPipelineManager is created without any size limits, so this parameter has no effect.
Recommendation:
- Either implement pipeline pool size limits using this parameter
- Or remove it if not needed yet
- Document the intended behavior for future implementation
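If the first option is chosen, one way to make `max_pipeline` actually bound the pool is a semaphore around fetch/release. This is a sketch under the assumption that GPipelineManager enforces no limit itself; the `BoundedPipelinePool` class and its method names are illustrative, not from the PR:

```python
# Sketch: cap the number of live pipelines at max_pipeline using a semaphore,
# while recycling released pipelines through a queue.
import queue
import threading


class BoundedPipelinePool:
    def __init__(self, factory, max_pipeline: int):
        self._factory = factory                           # builds a new pipeline
        self._slots = threading.Semaphore(max_pipeline)   # caps live pipelines
        self._idle = queue.SimpleQueue()                  # recycled pipelines

    def fetch(self):
        self._slots.acquire()             # blocks once max_pipeline are in use
        try:
            return self._idle.get_nowait()  # prefer a recycled pipeline
        except queue.Empty:
            return self._factory()          # under the cap: build a fresh one

    def release(self, pipeline) -> None:
        self._idle.put(pipeline)
        self._slots.release()
```

`fetch()` blocking at the cap gives natural backpressure; a timeout on `acquire` could be added if callers should fail fast instead.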
| error_msg = f"Error in flow execution: {status.getInfo()}" |
| log.error(error_msg) |
| raise RuntimeError(error_msg) |
| res = flow.post_deal(pipeline) |
The code uses both raise ValueError and raise RuntimeError inconsistently. Also, the post_deal method can fail (e.g., JSON serialization errors) but is not wrapped in error handling.
Recommendation:
- Define a clear exception hierarchy for the flows module
- Wrap post_deal in try-catch to handle serialization errors
- Document which exceptions callers should expect
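A sketch of what such an exception hierarchy plus a guarded `post_deal` could look like; all class names here are illustrative suggestions, not code from the PR:

```python
# Sketch: one flows-module exception hierarchy and a post_deal wrapper that
# converts serialization failures into a typed, documented error.
class FlowError(Exception):
    """Base class for all flow-related failures."""


class FlowConfigError(FlowError):
    """Bad or unsupported flow configuration (replaces bare ValueError)."""


class FlowExecutionError(FlowError):
    """Pipeline init/run/post-processing failure (replaces bare RuntimeError)."""


def safe_post_deal(flow, pipeline):
    try:
        return flow.post_deal(pipeline)
    except Exception as exc:  # e.g. JSON serialization errors
        raise FlowExecutionError(f"post_deal failed: {exc}") from exc
```

Callers then only need to catch `FlowError` (or a subclass) instead of guessing between `ValueError` and `RuntimeError`.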
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| # KIND, either express or implied. See the License for the |
| # specific language governing permissions and limitations |
| # under the License. |
🧹 Minor: Empty __init__.py with only license header
This file contains only the Apache license header with no actual code. While functional, consider adding a brief module docstring to describe the flows package purpose.
Suggestion:
# ... (license header) ...
"""
Workflow orchestration module for HugeGraph AI.
This package provides flexible workflow scheduling and pipeline management
for various AI tasks including vector indexing and graph extraction.
"""| """ | ||
|
|
||
| @abstractmethod | ||
| def prepare(self, prepared_input: WkFlowInput, *args, **kwargs): |
🧹 Minor: Missing type hints
The BaseFlow abstract class and its implementations lack type hints for parameters and return values. This reduces code maintainability and IDE support.
Suggestion:
from typing import Any
from PyCGraph import GPipeline

@abstractmethod
def prepare(self, prepared_input: WkFlowInput, *args: Any, **kwargs: Any) -> None:
    """Pre-processing interface."""
    pass

@abstractmethod
def build_flow(self, *args: Any, **kwargs: Any) -> GPipeline:
    """Interface for building the flow."""

@abstractmethod
def post_deal(self, pipeline: GPipeline | None = None) -> str:
    """Post-processing interface."""
This PR refactors the Scheduler class to introduce a more flexible and extensible workflow scheduling mechanism.
These changes lay the foundation for supporting more complex and agentic workflows in the future, while also improving the efficiency and scalability of the current pipeline execution framework.