Refactor: Refactor Scheduler to Support Dynamic Workflow Scheduling and Pipeline Pooling #301
Changes from all commits: d0ec820, 73033f2, 226d805, 5e129c0, b497d14, a353eec, bba237c, b31eca3, f5f6f8e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -58,6 +58,7 @@ dependencies = [ | |
| "apscheduler", | ||
| "litellm", | ||
| "hugegraph-python-client", | ||
| "pycgraph", | ||
| ] | ||
| [project.urls] | ||
| homepage = "https://hugegraph.apache.org/" | ||
|
|
@@ -85,3 +86,4 @@ allow-direct-references = true | |
|
|
||
| [tool.uv.sources] | ||
| hugegraph-python-client = { workspace = true } | ||
| pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'" } | ||
|
Member
Impact: The application will crash immediately on non-Linux systems (macOS, Windows) when trying to import modules from `PyCGraph`, since the `sys_platform == 'linux'` marker means the dependency is never installed there.
Recommendation: Either make the dependency cross-platform or guard every `PyCGraph` import behind a platform check that fails with a clear error message.
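One way to act on that recommendation is to centralize a guarded import (a sketch; the `HAS_PYCGRAPH` flag and `require_pycgraph` helper are hypothetical names, not part of the PR):

```python
import sys

# pycgraph is only installed on Linux (see the sys_platform marker above),
# so guard the import instead of letting module import crash the app.
if sys.platform == "linux":
    try:
        from PyCGraph import GPipeline  # noqa: F401
        HAS_PYCGRAPH = True
    except ImportError:
        HAS_PYCGRAPH = False
else:
    HAS_PYCGRAPH = False


def require_pycgraph() -> None:
    # Hypothetical helper: call this before building any pipeline so the
    # failure is an explicit, actionable error rather than an ImportError.
    if not HAS_PYCGRAPH:
        raise RuntimeError(
            "PyCGraph is required for workflow scheduling but is only "
            "available on Linux; install it or run on a Linux host."
        )
```

Callers that cannot run workflows can then degrade gracefully instead of failing at import time.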
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
Member
🧹 Minor: Empty `__init__.py`. This file contains only the Apache license header with no actual code. While functional, consider adding a brief module docstring to describe the flows package purpose.
Suggestion:
# ... (license header) ...
"""
Workflow orchestration module for HugeGraph AI.

This package provides flexible workflow scheduling and pipeline management
for various AI tasks including vector indexing and graph extraction.
"""
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one or more | ||
| # contributor license agreements. See the NOTICE file distributed with | ||
| # this work for additional information regarding copyright ownership. | ||
| # The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| # (the "License"); you may not use this file except in compliance with | ||
| # the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| from hugegraph_llm.flows.common import BaseFlow | ||
| from hugegraph_llm.state.ai_state import WkFlowInput | ||
|
|
||
| import json | ||
| from PyCGraph import GPipeline | ||
|
|
||
| from hugegraph_llm.operators.document_op.chunk_split import ChunkSplitNode | ||
| from hugegraph_llm.operators.index_op.build_vector_index import BuildVectorIndexNode | ||
| from hugegraph_llm.state.ai_state import WkFlowState | ||
|
|
||
|
|
||
| class BuildVectorIndexFlow(BaseFlow): | ||
| def __init__(self): | ||
| pass | ||
|
|
||
| def prepare(self, prepared_input: WkFlowInput, texts): | ||
|
Member
Issue: The `prepare` method hardcodes `language="zh"` and `split_type="paragraph"`, so callers cannot configure chunking for other languages or split strategies.
Recommendation:
def prepare(self, prepared_input: WkFlowInput, texts, language="zh", split_type="paragraph"):
    prepared_input.texts = texts
    prepared_input.language = language
    prepared_input.split_type = split_type
||
| prepared_input.texts = texts | ||
| prepared_input.language = "zh" | ||
| prepared_input.split_type = "paragraph" | ||
| return | ||
|
|
||
| def build_flow(self, texts): | ||
| pipeline = GPipeline() | ||
| # prepare for workflow input | ||
| prepared_input = WkFlowInput() | ||
| self.prepare(prepared_input, texts) | ||
|
|
||
| pipeline.createGParam(prepared_input, "wkflow_input") | ||
| pipeline.createGParam(WkFlowState(), "wkflow_state") | ||
|
|
||
| chunk_split_node = ChunkSplitNode() | ||
| build_vector_node = BuildVectorIndexNode() | ||
| pipeline.registerGElement(chunk_split_node, set(), "chunk_split") | ||
| pipeline.registerGElement(build_vector_node, {chunk_split_node}, "build_vector") | ||
|
|
||
| return pipeline | ||
|
|
||
| def post_deal(self, pipeline=None): | ||
| res = pipeline.getGParamWithNoEmpty("wkflow_state").to_json() | ||
| return json.dumps(res, ensure_ascii=False, indent=2) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one or more | ||
| # contributor license agreements. See the NOTICE file distributed with | ||
| # this work for additional information regarding copyright ownership. | ||
| # The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| # (the "License"); you may not use this file except in compliance with | ||
| # the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| from abc import ABC, abstractmethod | ||
|
|
||
| from hugegraph_llm.state.ai_state import WkFlowInput | ||
|
|
||
|
|
||
| class BaseFlow(ABC): | ||
| """ | ||
| Base class for flows, defines three interface methods: prepare, build_flow, and post_deal. | ||
| """ | ||
|
|
||
| @abstractmethod | ||
| def prepare(self, prepared_input: WkFlowInput, *args, **kwargs): | ||
|
Member
🧹 Minor: Missing type hints. The abstract methods take untyped `*args`/`**kwargs` and declare no return types, leaving the interface contract implicit.
Suggestion:
from typing import Any

from PyCGraph import GPipeline

@abstractmethod
def prepare(self, prepared_input: WkFlowInput, *args: Any, **kwargs: Any) -> None:
    """Pre-processing interface."""
    pass

@abstractmethod
def build_flow(self, *args: Any, **kwargs: Any) -> GPipeline:
    """Interface for building the flow."""
    pass

@abstractmethod
def post_deal(self, pipeline: GPipeline | None = None) -> str:
    """Post-processing interface."""
    pass
||
| """ | ||
| Pre-processing interface. | ||
| """ | ||
| pass | ||
|
|
||
| @abstractmethod | ||
| def build_flow(self, *args, **kwargs): | ||
| """ | ||
| Interface for building the flow. | ||
| """ | ||
| pass | ||
|
|
||
| @abstractmethod | ||
| def post_deal(self, *args, **kwargs): | ||
| """ | ||
| Post-processing interface. | ||
| """ | ||
| pass | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one or more | ||
| # contributor license agreements. See the NOTICE file distributed with | ||
| # this work for additional information regarding copyright ownership. | ||
| # The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| # (the "License"); you may not use this file except in compliance with | ||
| # the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| import json | ||
| from PyCGraph import GPipeline | ||
| from hugegraph_llm.flows.common import BaseFlow | ||
| from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState | ||
| from hugegraph_llm.operators.common_op.check_schema import CheckSchemaNode | ||
| from hugegraph_llm.operators.document_op.chunk_split import ChunkSplitNode | ||
| from hugegraph_llm.operators.hugegraph_op.schema_manager import SchemaManagerNode | ||
| from hugegraph_llm.operators.llm_op.info_extract import InfoExtractNode | ||
| from hugegraph_llm.operators.llm_op.property_graph_extract import ( | ||
| PropertyGraphExtractNode, | ||
| ) | ||
| from hugegraph_llm.utils.log import log | ||
|
|
||
|
|
||
| class GraphExtractFlow(BaseFlow): | ||
| def __init__(self): | ||
| pass | ||
|
|
||
| def _import_schema( | ||
| self, | ||
| from_hugegraph=None, | ||
| from_extraction=None, | ||
| from_user_defined=None, | ||
| ): | ||
| if from_hugegraph: | ||
| return SchemaManagerNode() | ||
| elif from_user_defined: | ||
| return CheckSchemaNode() | ||
| elif from_extraction: | ||
| raise NotImplementedError("Not implemented yet") | ||
| else: | ||
| raise ValueError("No input data / invalid schema type") | ||
|
|
||
| def prepare( | ||
| self, prepared_input: WkFlowInput, schema, texts, example_prompt, extract_type | ||
| ): | ||
| # prepare input data | ||
| prepared_input.texts = texts | ||
| prepared_input.language = "zh" | ||
| prepared_input.split_type = "document" | ||
| prepared_input.example_prompt = example_prompt | ||
| prepared_input.schema = schema | ||
| schema = schema.strip() | ||
| if schema.startswith("{"): | ||
| try: | ||
| schema = json.loads(schema) | ||
| prepared_input.schema = schema | ||
| except json.JSONDecodeError as exc: | ||
| log.error("Invalid JSON format in schema. Please check it again.") | ||
| raise ValueError("Invalid JSON format in schema.") from exc | ||
| else: | ||
| log.info("Get schema '%s' from graphdb.", schema) | ||
| prepared_input.graph_name = schema | ||
| return | ||
|
|
||
| def build_flow(self, schema, texts, example_prompt, extract_type): | ||
| pipeline = GPipeline() | ||
| prepared_input = WkFlowInput() | ||
| # prepare input data | ||
| self.prepare(prepared_input, schema, texts, example_prompt, extract_type) | ||
|
|
||
| pipeline.createGParam(prepared_input, "wkflow_input") | ||
| pipeline.createGParam(WkFlowState(), "wkflow_state") | ||
| schema = schema.strip() | ||
| schema_node = None | ||
| if schema.startswith("{"): | ||
| try: | ||
| schema = json.loads(schema) | ||
| schema_node = self._import_schema(from_user_defined=schema) | ||
| except json.JSONDecodeError as exc: | ||
| log.error("Invalid JSON format in schema. Please check it again.") | ||
| raise ValueError("Invalid JSON format in schema.") from exc | ||
| else: | ||
| log.info("Get schema '%s' from graphdb.", schema) | ||
| schema_node = self._import_schema(from_hugegraph=schema) | ||
|
|
||
| chunk_split_node = ChunkSplitNode() | ||
| graph_extract_node = None | ||
| if extract_type == "triples": | ||
| graph_extract_node = InfoExtractNode() | ||
| elif extract_type == "property_graph": | ||
| graph_extract_node = PropertyGraphExtractNode() | ||
| else: | ||
| raise ValueError(f"Unsupported extract_type: {extract_type}") | ||
| pipeline.registerGElement(schema_node, set(), "schema_node") | ||
| pipeline.registerGElement(chunk_split_node, set(), "chunk_split") | ||
| pipeline.registerGElement( | ||
| graph_extract_node, {schema_node, chunk_split_node}, "graph_extract" | ||
| ) | ||
|
|
||
| return pipeline | ||
|
|
||
| def post_deal(self, pipeline=None): | ||
| res = pipeline.getGParamWithNoEmpty("wkflow_state").to_json() | ||
| vertices = res.get("vertices", []) | ||
| edges = res.get("edges", []) | ||
| if not vertices and not edges: | ||
| log.info("Please check the schema.(The schema may not match the Doc)") | ||
| return json.dumps( | ||
| { | ||
| "vertices": vertices, | ||
| "edges": edges, | ||
| "warning": "The schema may not match the Doc", | ||
| }, | ||
| ensure_ascii=False, | ||
| indent=2, | ||
| ) | ||
| return json.dumps( | ||
| {"vertices": vertices, "edges": edges}, | ||
| ensure_ascii=False, | ||
| indent=2, | ||
| ) |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,90 @@ | ||||||
| # Licensed to the Apache Software Foundation (ASF) under one or more | ||||||
| # contributor license agreements. See the NOTICE file distributed with | ||||||
| # this work for additional information regarding copyright ownership. | ||||||
| # The ASF licenses this file to You under the Apache License, Version 2.0 | ||||||
| # (the "License"); you may not use this file except in compliance with | ||||||
| # the License. You may obtain a copy of the License at | ||||||
| # | ||||||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| # | ||||||
| # Unless required by applicable law or agreed to in writing, software | ||||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| # See the License for the specific language governing permissions and | ||||||
| # limitations under the License. | ||||||
|
|
||||||
| import threading | ||||||
| from typing import Dict, Any | ||||||
| from PyCGraph import GPipelineManager | ||||||
| from hugegraph_llm.flows.build_vector_index import BuildVectorIndexFlow | ||||||
| from hugegraph_llm.flows.common import BaseFlow | ||||||
| from hugegraph_llm.flows.graph_extract import GraphExtractFlow | ||||||
| from hugegraph_llm.utils.log import log | ||||||
|
|
||||||
|
|
||||||
| class Scheduler: | ||||||
| pipeline_pool: Dict[str, Any] = None | ||||||
| max_pipeline: int | ||||||
|
Member
Issue: `pipeline_pool` and `max_pipeline` are declared as class-level attributes, and `pipeline_pool: Dict[str, Any] = None` contradicts its own annotation (`None` is not a `Dict`). Class-level mutable defaults are error-prone and easy to share across instances by accident.
Recommendation: Drop the class-level declarations and initialize both as typed instance attributes in `__init__`.
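A minimal sketch of that recommendation, declaring the pool as typed instance attributes (stripped down to the two fields in question; the `GPipelineManager`/flow wiring from the diff is omitted):

```python
from typing import Any, Dict


class Scheduler:
    """Sketch: typed instance attributes set in __init__, no class-level defaults."""

    def __init__(self, max_pipeline: int = 10) -> None:
        # Each instance owns its own pool, and the annotation now matches
        # the actual initialized value (a dict, never None).
        self.pipeline_pool: Dict[str, Dict[str, Any]] = {}
        self.max_pipeline: int = max_pipeline
```

With this shape, mutating one scheduler's pool cannot leak into another instance.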
|
||||||
|
|
||||||
| def __init__(self, max_pipeline: int = 10): | ||||||
| self.pipeline_pool = {} | ||||||
| # pipeline_pool act as a manager of GPipelineManager which used for pipeline management | ||||||
| self.pipeline_pool["build_vector_index"] = { | ||||||
| "manager": GPipelineManager(), | ||||||
| "flow": BuildVectorIndexFlow(), | ||||||
| } | ||||||
| self.pipeline_pool["graph_extract"] = { | ||||||
| "manager": GPipelineManager(), | ||||||
| "flow": GraphExtractFlow(), | ||||||
| } | ||||||
| self.max_pipeline = max_pipeline | ||||||
|
|
||||||
| # TODO: Implement Agentic Workflow | ||||||
| def agentic_flow(self): | ||||||
| pass | ||||||
|
|
||||||
| def schedule_flow(self, flow: str, *args, **kwargs): | ||||||
|
Member
Issue: `schedule_flow` is not thread-safe. Race condition scenario: two threads calling `manager.fetch()` concurrently can both see an empty pool and each build a new pipeline, exceeding `max_pipeline`.
Recommendation: create a single lock in `__init__` (`self._lock = threading.Lock()`) and hold it around the fetch-or-create section. (A per-call `with threading.Lock():` would construct a fresh lock on every invocation and provide no mutual exclusion.)
def schedule_flow(self, flow: str, *args, **kwargs):
    if flow not in self.pipeline_pool:
        raise ValueError(f"Unsupported workflow {flow}")
    with self._lock:  # lock created once in __init__
        manager = self.pipeline_pool[flow]["manager"]
        # ... rest of the logic
||||||
| if flow not in self.pipeline_pool: | ||||||
| raise ValueError(f"Unsupported workflow {flow}") | ||||||
| manager = self.pipeline_pool[flow]["manager"] | ||||||
| flow: BaseFlow = self.pipeline_pool[flow]["flow"] | ||||||
| pipeline = manager.fetch() | ||||||
| if pipeline is None: | ||||||
| # call coresponding flow_func to create new workflow | ||||||
|
||||||
| # call coresponding flow_func to create new workflow | |
| # call corresponding flow_func to create new workflow |
When pipeline.init() or pipeline.run() fails, the code raises an exception but never calls manager.add(pipeline) to return the pipeline to the pool. This creates a resource leak where failed pipelines are never recycled.
Impact: After max_pipeline failures, the system runs out of pipelines and cannot process new requests.
Recommendation:
Use try-finally to ensure pipelines are always returned:
pipeline = flow.build_flow(*args, **kwargs)
try:
    status = pipeline.init()
    if status.isErr():
        raise RuntimeError(f"Error in flow init: {status.getInfo()}")
    status = pipeline.run()
    if status.isErr():
        raise RuntimeError(f"Error in flow execution: {status.getInfo()}")
    res = flow.post_deal(pipeline)
    return res
finally:
    manager.add(pipeline)
The code uses both `raise ValueError` and `raise RuntimeError` inconsistently. Also, the `post_deal` method can fail (e.g., JSON serialization errors) but is not wrapped in error handling.
Recommendation:
- Define a clear exception hierarchy for the flows module
- Wrap `post_deal` in try-catch to handle serialization errors
- Document which exceptions callers should expect
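A hedged sketch of such an exception hierarchy (all class and helper names here are hypothetical, not part of the PR):

```python
class FlowError(Exception):
    """Base class for all flows-module errors (hypothetical name)."""


class FlowInitError(FlowError):
    """Pipeline initialization failed."""


class FlowRunError(FlowError):
    """Pipeline execution failed."""


class FlowResultError(FlowError):
    """Post-processing (e.g. JSON serialization) failed."""


def safe_post_deal(flow, pipeline):
    # Wrap post_deal so serialization failures surface as a documented,
    # flows-specific error; json.dumps raises TypeError on non-serializable
    # objects and ValueError for some invalid inputs.
    try:
        return flow.post_deal(pipeline)
    except (TypeError, ValueError) as exc:
        raise FlowResultError(f"post_deal failed: {exc}") from exc
```

Callers can then catch `FlowError` alone, and the scheduler's docstring can list the three concrete subclasses as its public failure modes.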