feat(llm): update keyword extraction method (BREAKING CHANGE) #282
imbajin merged 48 commits into apache:main
Conversation
Fixes apache#224; updates the UI to support changing the keyword extraction method.
Pull Request Overview
This PR fixes issue #224 by updating the keyword extraction mechanism and UI options to support both TextRank and LLM-based extraction methods.
- Adds new extraction method parameters (extract_method, textrank_kwargs, language, mask_words, window_size) in the keyword extraction operator.
- Updates the RAG UI components to pass these new parameters.
- Updates dependencies (networkx and scipy) in pyproject.toml to support the new functionality.
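As a rough illustration of the options listed above, the new per-call settings might be assembled like this (the parameter names come from the PR; the dict shape and values are assumptions for illustration):

```python
# Hypothetical sketch of the new keyword-extraction options; the
# parameter names come from this PR, the dict layout is an assumption.
textrank_kwargs = {
    "language": "english",             # segmentation / stopword language
    "mask_words": ["graph database"],  # phrases protected from splitting
    "window_size": 3,                  # TextRank co-occurrence window
}

extraction_params = {
    "extract_method": "textrank",      # one of "llm", "textrank", "hybrid"
    "textrank_kwargs": textrank_kwargs,
}
```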
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py | Added new extraction method settings and implemented a MultiLingualTextRank class. |
| hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py | Updated parameter passing to support the new keyword extraction options. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py | Revised UI inputs to include extraction method, language, mask words, and window size. |
| hugegraph-llm/pyproject.toml | Added networkx and scipy dependencies. |
Comments suppressed due to low confidence (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:190
- The 're' module is used in the _preprocess method but is not imported in the file, which may cause a runtime error. Please add 'import re' at the top of the file.
escaped_words = [re.escape(word) for word in self.mask_words]
fix the pylint check bug
imbajin
left a comment
Code Review Summary
Overall Assessment: Changes Required
Key Findings
- 🚨 Critical: 3 issues found (breaking changes, error handling, security concerns)
- ⚠️ Medium: 4 improvements suggested
- 🧹 Minor: 2 optimizations noted
Main Concerns
1. 🚨 Breaking API Changes Without Migration Path
The PR removes parameters (language, max_keywords) from method signatures in graph_rag_task.py without providing backward compatibility. This will break all existing code using these parameters.
File: hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (lines 57-82)
Recommendation: Add deprecated parameter support with warnings to maintain backward compatibility during transition period.
2. 🚨 Insufficient Error Handling in TextRank Implementation
The new MultiLingualTextRank class lacks proper error handling for edge cases.
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 133)
Issue: Division by zero possible when max(pagerank_scores) is 0
Fix: Add zero check before normalization
3. 🚨 Regex Injection Vulnerability
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 55)
The dynamic regex compilation from self.rules without proper sanitization could lead to ReDoS attacks.
Fix: Validate and sanitize regex patterns before compilation.
Medium Priority Issues
1. ⚠️ Uncaught Network Errors in NLTK Downloads
File: hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (lines 53-57)
Only specific errors are caught; timeouts and other network failures are not handled.
2. ⚠️ Missing Input Validation
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (lines 62-66)
The integer conversion for max_keywords catches exceptions but doesn't log warnings for invalid input.
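A minimal sketch of the suggested fix; the helper name and default value are assumptions, not the PR's code:

```python
import logging

log = logging.getLogger(__name__)

DEFAULT_MAX_KEYWORDS = 5  # assumed default, not taken from the PR

def parse_max_keywords(raw, default=DEFAULT_MAX_KEYWORDS):
    """Convert user input to an int, logging a warning on fallback
    instead of swallowing the error silently."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        log.warning("Invalid max_keywords value %r; falling back to %d", raw, default)
        return default
```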
3. ⚠️ Hardcoded Magic Numbers
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 31)
Window size validation uses hardcoded max value (10) without explanation.
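One way to address this; the constant names and the stated rationale are illustrative — only the value 10 comes from the PR:

```python
# Named bounds with a stated rationale, replacing the bare literal 10.
MIN_WINDOW_SIZE = 2   # TextRank needs at least two co-occurring words
MAX_WINDOW_SIZE = 10  # very large windows blur co-occurrence structure

def validate_window_size(window_size: int) -> int:
    """Reject out-of-range window sizes with an explanatory message."""
    if not MIN_WINDOW_SIZE <= window_size <= MAX_WINDOW_SIZE:
        raise ValueError(
            f"window_size must be in [{MIN_WINDOW_SIZE}, {MAX_WINDOW_SIZE}], "
            f"got {window_size}"
        )
    return window_size
```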
4. ⚠️ Missing Tests
No test files included in this PR for the new TextRank functionality and hybrid extraction method.
Minor Issues
1. 🧹 Import Organization
Multiple files have import reorganization that seems unnecessary and makes the diff harder to read.
2. 🧹 Inconsistent Language Handling
The language determination logic appears in multiple places - should be centralized.
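A sketch of what centralizing could look like; the helper name and alias table are hypothetical, not the project's API:

```python
# Hypothetical single source of truth for language normalization,
# shared by segmentation, stopword lookup, and the UI layer.
_LANGUAGE_ALIASES = {
    "en": "english",
    "english": "english",
    "zh": "chinese",
    "chinese": "chinese",
}

def resolve_language(raw=None, default="english"):
    """Normalize a user-supplied language tag in one place."""
    if not raw:
        return default
    return _LANGUAGE_ALIASES.get(raw.strip().lower(), default)
```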
Positive Aspects
✅ Good addition of multiple extraction methods (LLM, TextRank, Hybrid)
✅ Proper logging added for debugging
✅ Config documentation updated appropriately
Required Actions Before Approval
- Add migration guide for breaking changes
- Fix the division by zero issue in TextRank
- Add input validation for regex patterns
- Include comprehensive tests for new functionality
- Consider adding backward compatibility layer
Detailed Code Review - Additional Comments

Critical Issue Details

1. Division by Zero in TextRank PageRank Normalization
In `textrank_word_extract.py`:
```python
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if max(pagerank_scores) > 0:
    pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
```
Problem: The code already checks for a zero maximum, but `max()` still raises `ValueError` on an empty score list.
Suggested Fix:
```python
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if pagerank_scores and max(pagerank_scores) > 0:
    max_score = max(pagerank_scores)
    pagerank_scores = [score/max_score for score in pagerank_scores]
else:
    pagerank_scores = [0.0] * len(pagerank_scores) if pagerank_scores else []
```

2. Breaking Changes in graph_rag_task.py
The removal of the `max_keywords` and `language` parameters breaks all existing callers without a backward-compatibility path.
Current problematic change (lines 57-82):
```python
# Before:
def extract_keywords(self, text=None, max_keywords=5, language="english", extract_template=None):

# After:
def extract_keywords(self, text=None, extract_template=None):
```
Suggested backward-compatible implementation:
```python
def extract_keywords(
    self,
    text: Optional[str] = None,
    extract_template: Optional[str] = None,
    max_keywords: Optional[int] = None,  # Deprecated
    language: Optional[str] = None,  # Deprecated
    **kwargs
):
    """
    Extract keywords from text.

    Args:
        text: Text to extract keywords from
        extract_template: Template for extraction
        max_keywords: [DEPRECATED] Use context['max_keywords'] instead
        language: [DEPRECATED] Use llm_settings.language instead
    """
    import warnings
    if max_keywords is not None:
        warnings.warn(
            "max_keywords parameter is deprecated and will be removed in v2.0. "
            "Please pass it via context dictionary instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store in context for backward compatibility
        self._context['max_keywords'] = max_keywords
    if language is not None:
        warnings.warn(
            "language parameter is deprecated and will be removed in v2.0. "
            "Please configure it via llm_settings instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store for backward compatibility
        self._context['language'] = language
    self._operators.append(
        KeywordExtract(
            text=text,
            extract_template=extract_template,
            **kwargs
        )
    )
    return self
```

3. Regex Injection Prevention
In `textrank_word_extract.py`.
Current vulnerable code:
```python
self.rules = [r"https?://\S+|www\.\S+",
              r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
              r"\b\w+(?:[-'\']\w+)+\b",
              r"\b\d+[,.]\d+\b"]
```
Suggested secure implementation:
```python
import re

class MultiLingualTextRank:
    # Make rules immutable
    _SAFE_RULES = tuple([
        r"https?://\S+|www\.\S+",
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        r"\b\w+(?:[-'\']\w+)+\b",
        r"\b\d+[,.]\d+\b"
    ])

    def __init__(self, keyword_num: int = 5, window_size: int = 3):
        # Validate and compile patterns at initialization
        try:
            self._compiled_rules = re.compile('|'.join(self._SAFE_RULES), re.IGNORECASE)
        except re.error as e:
            raise ValueError(f"Invalid regex pattern: {e}")
```

4. NLTK Error Handling Enhancement
In `nltk_helper.py`:
```python
def check_nltk_data(self):
    """Check and download required NLTK data packages."""
    import socket
    # Set timeout for network operations
    original_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(30)
    try:
        # ... existing code ...
        for package in required_packages:
            try:
                # ... existing check ...
            except LookupError:
                max_retries = 3
                for attempt in range(max_retries):
                    try:
                        log.info("Downloading NLTK package %s (attempt %d/%d)",
                                 package, attempt + 1, max_retries)
                        nltk.download(package, download_dir=nltk_data_dir, quiet=False)
                        break
                    except (URLError, HTTPError, PermissionError, socket.timeout) as e:
                        if attempt == max_retries - 1:
                            log.error("Failed to download %s after %d attempts: %s",
                                      package, max_retries, e)
                        else:
                            time.sleep(2 ** attempt)  # Exponential backoff
    finally:
        socket.setdefaulttimeout(original_timeout)
```

Testing Requirements
The PR is missing test coverage for the new TextRank functionality and the hybrid extraction method.
Example test structure:
```python
def test_textrank_empty_input():
    extractor = MultiLingualTextRank()
    result = extractor.extract_keywords("")
    assert result == {}

def test_textrank_division_by_zero():
    extractor = MultiLingualTextRank()
    # Create scenario where all PageRank scores are 0
    result = extractor.extract_keywords("a a a")
    assert all(score >= 0 for score in result.values())

def test_deprecated_parameters():
    with warnings.catch_warnings(record=True) as w:
        pipeline = RAGPipeline()
        pipeline.extract_keywords("test", max_keywords=5)
        assert len(w) == 1
        assert "deprecated" in str(w[0].message).lower()
```
Referenced code:
```python
try:
    self._stopwords[lang] = stopwords.words(lang)
except LookupError as e:
    log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
```
The current error handling allows the system to continue with an empty stopwords list when NLTK data is unavailable, which could significantly degrade keyword extraction quality without clear user notification.
Problem:
```python
except LookupError as e:
    log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
    self._stopwords[lang] = []
```
Recommendation:
Consider throwing an exception or providing a more prominent error (e.g., log.error), since operating without stopwords fundamentally changes extraction behavior. At minimum, this warning should be surfaced to the user through the API response.
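A sketch of surfacing the failure as recommended; the `loader` injection and `warnings_out` channel are illustrative, not the project's API:

```python
import logging

log = logging.getLogger(__name__)

def load_stopwords(lang, loader, warnings_out):
    """Load stopwords, logging at error level and recording a
    user-visible warning on failure instead of degrading silently.

    `loader` stands in for nltk.corpus.stopwords.words; `warnings_out`
    is a list that the caller surfaces in the API response.
    """
    try:
        return loader(lang)
    except LookupError as e:
        msg = (f"NLTK stopwords for lang={lang} unavailable ({e}); "
               "keyword extraction quality may degrade")
        log.error(msg)
        warnings_out.append(msg)
        return []
```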
Referenced code:
```python
def _create_placeholder(match_obj):
    nonlocal placeholder_id_counter
    original_word = match_obj.group(0)
    _placeholder = f" __shieldword_{placeholder_id_counter}__ "
```
The placeholder format `__shieldword_{counter}__` could collide with actual text content containing similar patterns, causing incorrect word masking/unmasking.
Example failure case:
If the input text already contains "shieldword_0", the system would incorrectly treat it as a placeholder.
Recommendation:
Use UUID-based placeholders or add a unique session prefix:
```python
import uuid

session_id = uuid.uuid4().hex[:8]
_placeholder = f" __shield_{session_id}_{placeholder_id_counter}__ "
```
Referenced code:
```python
start_time = time.perf_counter()
ranks = {}
try:
    ranks = self._textrank_model.extract_keywords(self._query)
```
The code assumes LLM output is perfectly formatted but doesn't handle malformed responses, which will cause runtime failures.
Problem:
```python
keyword, score = item.split(":")
llm_ranks[keyword.strip()] = float(score)
```
If the LLM returns `keyword1::0.95` or `keyword1` (no colon), this will crash.
Recommendation:
Add comprehensive error handling:
```python
try:
    parts = item.split(":", 1)
    if len(parts) != 2:
        log.warning("Skipping malformed item: %s", item)
        continue
    keyword, score_str = parts
    keyword = keyword.strip()
    if not keyword:
        continue
    score = float(score_str.strip())
    if not 0.0 <= score <= 1.0:
        log.warning("Score out of range for %s: %s", keyword, score)
        score = max(0.0, min(1.0, score))
    llm_ranks[keyword] = score
except (ValueError, AttributeError) as e:
    log.warning("Failed to parse item '%s': %s", item, e)
    continue
```
Referenced code:
```python
placeholder_id_counter += 1
return _placeholder

special_regex = regex.compile('|'.join(self.rules), regex.V1)
```
The regex pattern is recompiled on every call to _word_mask, which is inefficient for repeated extractions.
Recommendation:
Compile once during initialization:
```python
def __init__(self, keyword_num: int = 5, window_size: int = 3):
    # ... existing code ...
    self.rules = [r"https?://\S+|www\.\S+", ...]
    self.special_regex = regex.compile('|'.join(self.rules), regex.V1)

def _word_mask(self, text):
    # ... existing code ...
    text = self.special_regex.sub(_create_placeholder, text)
```
Referenced code:
```python
reranker_type: Optional[Literal["cohere", "siliconflow"]] = None
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
window_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
```
The config accepts hybrid_llm_weights but doesn't validate the 0.0-1.0 range at initialization.
Recommendation:
Add Pydantic field validation:
```python
from pydantic import Field

hybrid_llm_weights: Optional[float] = Field(
    default=0.5,
    ge=0.0,
    le=1.0,
    description="LLM weight in hybrid mode (0.0-1.0)"
)
```
Referenced code:
```python
log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
self._stopwords[lang] = []

# final check
```
The NLTK package download logic is duplicated between the stopwords() and check_nltk_data() methods, violating the DRY principle.
Recommendation:
Extract common download logic:
```python
def _download_nltk_package(self, package: str, path: str, nltk_data_dir: str) -> bool:
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        log.info("Downloading NLTK package: %s", package)
        try:
            return nltk.download(package, download_dir=nltk_data_dir, quiet=False)
        except (URLError, HTTPError, PermissionError) as e:
            log.warning("Failed to download %s: %s", package, e)
            return False
```
BREAKING CHANGE
PLEASE UPDATE YOUR KEYWORD EXTRACT PROMPT
Fixes #224; updates the UI to support changing the keyword extraction method.
Main changes
Added options to the RAG interface for selecting the keyword extraction method (LLM, TextRank, or Hybrid) and the maximum number of keywords.

A 'TextRank mask words' setting has also been added: users can manually enter specific phrases composed of letters and symbols to prevent them from being split during word segmentation, and the input is also saved.

Test results
TextRank Method:

- Input:
- Result:

Hybrid Method:
