feat(llm): update keyword extraction method (BREAKING CHANGE) by Gfreely · Pull Request #282 · apache/hugegraph-ai

Gfreely · 2025-06-27T08:20:44Z

BREAKING CHANGE
PLEASE UPDATE YOUR KEYWORD EXTRACT PROMPT
Fixes #224 and updates the UI to support changing the keyword extraction method.

Main changes

Added options to the RAG interface for selecting the keyword extraction method (LLM, TextRank, or Hybrid) and the maximum number of keywords.

A 'TextRank mask words' setting has also been added. It lets users manually enter specific phrases composed of letters and symbols so that they are not split during word segmentation. The input is also saved.

Test results

TextRank method:
- Input: [screenshot]
- Result: [screenshot]

Hybrid method: [screenshot]

fix apache#224 problem, update new UI to support change keyword extracion method

Copilot

Pull Request Overview

This PR fixes issue #224 by updating the keyword extraction mechanism and UI options to support both TextRank and LLM-based extraction methods.

Adds new extraction method parameters (extract_method, textrank_kwargs, language, mask_words, window_size) in the keyword extraction operator.
Updates the RAG UI components to pass these new parameters.
Updates dependencies (networkx and scipy) in pyproject.toml to support the new functionality.
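
To make the TextRank option more concrete, here is a minimal sketch of the general technique: build a weighted co-occurrence graph over a sliding window, then rank words with PageRank by power iteration. It is pure Python for illustration and is not the PR's MultiLingualTextRank class; the function name `textrank_keywords`, the pre-tokenized input, and all default values are assumptions.

```python
from collections import defaultdict


def textrank_keywords(words, window_size=3, top_k=5, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected, weighted co-occurrence graph."""
    # Edge weight = number of times two distinct words co-occur within the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window_size]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    nodes = list(weights)
    if not nodes:
        return []  # empty input or no co-occurring pairs

    scores = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(weights[n].values()) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbour m passes on a share of its score proportional to the edge weight.
            incoming = sum(scores[m] * w / strength[m] for m, w in weights[n].items())
            new_scores[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

A real implementation would also tokenize by language, drop stopwords, and filter by part of speech before building the graph.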

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py	Added new extraction method settings and implemented a MultiLingualTextRank class.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py	Updated parameter passing to support the new keyword extraction options.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py	Revised UI inputs to include extraction method, language, mask words, and window size.
hugegraph-llm/pyproject.toml	Added networkx and scipy dependencies.

Comments suppressed due to low confidence (1)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:190

The 're' module is used in the _preprocess method but is not imported in the file, which may cause a runtime error. Please add 'import re' at the top of the file.

                escaped_words = [re.escape(word) for word in self.mask_words]

fix the pylint check bug

imbajin

Code Review Summary

Overall Assessment: Changes Required

Key Findings

🚨 Critical: 3 issues found (breaking changes, error handling, security concerns)
⚠️ Medium: 4 improvements suggested
🧹 Minor: 2 optimizations noted

Main Concerns

1. 🚨 Breaking API Changes Without Migration Path

The PR removes parameters (language, max_keywords) from method signatures in graph_rag_task.py without providing backward compatibility. This will break all existing code using these parameters.

File: hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (lines 57-82)
Recommendation: Add deprecated parameter support with warnings to maintain backward compatibility during transition period.

2. 🚨 Insufficient Error Handling in TextRank Implementation

The new MultiLingualTextRank class lacks proper error handling for edge cases.

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 133)
Issue: Division by zero possible when max(pagerank_scores) is 0
Fix: Add zero check before normalization

3. 🚨 Regex Injection Vulnerability

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 55)
The dynamic regex compilation from self.rules without proper sanitization could lead to ReDoS attacks.
Fix: Validate and sanitize regex patterns before compilation.

Medium Priority Issues

1. ⚠️ Uncaught Network Errors in NLTK Downloads

File: hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (lines 53-57)
Only catching specific errors but not handling timeout or other network issues.

2. ⚠️ Missing Input Validation

File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (lines 62-66)
The integer conversion for max_keywords catches exceptions but doesn't log warnings for invalid input.

3. ⚠️ Hardcoded Magic Numbers

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 31)
Window size validation uses hardcoded max value (10) without explanation.

4. ⚠️ Missing Tests

No test files included in this PR for the new TextRank functionality and hybrid extraction method.

Minor Issues

1. 🧹 Import Organization

Multiple files have import reorganization that seems unnecessary and makes the diff harder to read.

2. 🧹 Inconsistent Language Handling

The language determination logic appears in multiple places - should be centralized.

Positive Aspects

✅ Good addition of multiple extraction methods (LLM, TextRank, Hybrid)
✅ Proper logging added for debugging
✅ Config documentation updated appropriately

Required Actions Before Approval

Add migration guide for breaking changes
Fix the division by zero issue in TextRank
Add input validation for regex patterns
Include comprehensive tests for new functionality
Consider adding backward compatibility layer

imbajin · 2025-09-29T10:56:09Z

Detailed Code Review - Additional Comments

Critical Issue Details

1. Division by Zero in TextRank PageRank Normalization

In textrank_word_extract.py, lines 129-131:

pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if max(pagerank_scores) > 0:
    pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]

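Problem: The code already checks for max(pagerank_scores) > 0, which is good, but it recomputes max() for every element and has no explicit else branch; max() would also raise ValueError if pagerank_scores were ever empty.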

Suggested Fix:

pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if pagerank_scores and max(pagerank_scores) > 0:
    max_score = max(pagerank_scores)
    pagerank_scores = [score/max_score for score in pagerank_scores]
else:
    pagerank_scores = [0.0] * len(pagerank_scores) if pagerank_scores else []
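
The same guard can be written as a small standalone helper. This is only a sketch: `normalize_scores` is a hypothetical name, and it takes a plain list of floats rather than the result of the graph library's pagerank call.

```python
def normalize_scores(scores):
    """Scale scores so the maximum becomes 1.0; empty or all-zero input is returned as zeros."""
    if not scores:
        return []
    max_score = max(scores)  # computed once, not once per element
    if max_score <= 0:
        return [0.0] * len(scores)
    return [score / max_score for score in scores]
```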

2. Breaking Changes in graph_rag_task.py

The removal of language and max_keywords parameters needs careful handling:

Current problematic change (lines 57-82):

# Before:
def extract_keywords(self, text=None, max_keywords=5, language="english", extract_template=None): ...

# After:
def extract_keywords(self, text=None, extract_template=None): ...

Suggested backward-compatible implementation:

import warnings
from typing import Optional

def extract_keywords(
    self,
    text: Optional[str] = None,
    extract_template: Optional[str] = None,
    max_keywords: Optional[int] = None,  # Deprecated
    language: Optional[str] = None,  # Deprecated
    **kwargs
):
    """
    Extract keywords from text.

    Args:
        text: Text to extract keywords from
        extract_template: Template for extraction
        max_keywords: [DEPRECATED] Use context['max_keywords'] instead
        language: [DEPRECATED] Use llm_settings.language instead
    """
    if max_keywords is not None:
        warnings.warn(
            "max_keywords parameter is deprecated and will be removed in v2.0. "
            "Please pass it via context dictionary instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store in context for backward compatibility
        self._context['max_keywords'] = max_keywords
    
    if language is not None:
        warnings.warn(
            "language parameter is deprecated and will be removed in v2.0. "
            "Please configure it via llm_settings instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store for backward compatibility
        self._context['language'] = language
        
    self._operators.append(
        KeywordExtract(
            text=text,
            extract_template=extract_template,
            **kwargs
        )
    )
    return self
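
As a self-contained illustration of how such a deprecation shim behaves at call time, the sketch below uses a toy class. `Pipeline`, its `context` attribute, and `call_with_recorded_warnings` are hypothetical names invented for this example and do not correspond to the project's real classes.

```python
import warnings
from typing import Optional


class Pipeline:
    """Toy stand-in for the real pipeline class (hypothetical, for illustration only)."""

    def __init__(self):
        self.context = {}

    def extract_keywords(self, text: Optional[str] = None, max_keywords: Optional[int] = None):
        if max_keywords is not None:
            warnings.warn(
                "max_keywords is deprecated; pass it via the context instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not at this method
            )
            # Keep honouring the old argument during the transition period.
            self.context["max_keywords"] = max_keywords
        self.context["text"] = text
        return self


def call_with_recorded_warnings(func, *args, **kwargs):
    """Run func and return (result, list of warnings it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # DeprecationWarning is often hidden by default
        result = func(*args, **kwargs)
    return result, caught
```

Old call sites keep working but emit one DeprecationWarning each; new call sites that omit the argument emit none.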

3. Regex Injection Prevention

In textrank_word_extract.py, the rules list should be immutable and validated:

Current vulnerable code:

self.rules = [r"https?://\S+|www\.\S+",
              r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
              r"\b\w+(?:[-'\']\w+)+\b",
              r"\b\d+[,.]\d+\b"]

Suggested secure implementation:

import re

class MultiLingualTextRank:
    # Keep the built-in rules immutable
    _SAFE_RULES = (
        r"https?://\S+|www\.\S+",
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        r"\b\w+(?:[-'\']\w+)+\b",
        r"\b\d+[,.]\d+\b",
    )

    def __init__(self, keyword_num: int = 5, window_size: int = 3):
        # Validate and compile patterns at initialization
        try:
            self._compiled_rules = re.compile('|'.join(self._SAFE_RULES), re.IGNORECASE)
        except re.error as e:
            # Chain the original error so the failing pattern detail is not lost
            raise ValueError(f"Invalid regex pattern: {e}") from e
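
If rules can ever come from outside the class, one possible approach is to compile each pattern separately so that a bad one is reported by position. This is a sketch under that assumption: `compile_rules` and `MAX_PATTERN_LENGTH` are hypothetical, and the length cap is only a crude guard. A pattern that compiles cleanly can still backtrack catastrophically, so this does not by itself rule out ReDoS.

```python
import re

MAX_PATTERN_LENGTH = 200  # arbitrary cap chosen for this example


def compile_rules(rules):
    """Compile each rule on its own and raise ValueError naming the offending index."""
    compiled = []
    for index, rule in enumerate(rules):
        if len(rule) > MAX_PATTERN_LENGTH:
            raise ValueError(f"Rule {index} is longer than {MAX_PATTERN_LENGTH} characters")
        try:
            compiled.append(re.compile(rule))
        except re.error as exc:
            raise ValueError(f"Rule {index} is not a valid regex: {exc}") from exc
    return compiled
```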

4. NLTK Error Handling Enhancement

In nltk_helper.py, add more comprehensive error handling:

import socket
import time
from urllib.error import HTTPError, URLError

def check_nltk_data(self):
    """Check and download required NLTK data packages."""
    # Set a timeout for network operations; it is restored in the finally block
    original_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(30)

    try:
        # ... existing code ...
        for package in required_packages:
            try:
                ...  # existing check
            except LookupError:
                max_retries = 3
                for attempt in range(max_retries):
                    try:
                        log.info("Downloading NLTK package %s (attempt %d/%d)",
                                 package, attempt + 1, max_retries)
                        nltk.download(package, download_dir=nltk_data_dir, quiet=False)
                        break
                    except (URLError, HTTPError, PermissionError, socket.timeout) as e:
                        if attempt == max_retries - 1:
                            log.error("Failed to download %s after %d attempts: %s",
                                      package, max_retries, e)
                        else:
                            time.sleep(2 ** attempt)  # Exponential backoff
    finally:
        socket.setdefaulttimeout(original_timeout)
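
The retry loop above can be factored into a generic helper, sketched here with the standard library only. `retry_with_backoff` and its parameters are hypothetical names; the default `retry_on=(OSError,)` is chosen because URLError, PermissionError, and socket.timeout are all OSError subclasses.

```python
import time


def retry_with_backoff(action, max_retries=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call action(); on a listed exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return action()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.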

Testing Requirements

The PR is missing test coverage for:

TextRank algorithm tests:
- Test with empty text
- Test with single word
- Test with non-ASCII characters
- Test window size edge cases
- Test PageRank score normalization
Hybrid extraction tests:
- Test weight balancing
- Test fallback when one method fails
- Test score combination logic
Migration tests:
- Test deprecated parameter warnings
- Test backward compatibility

Example test structure:

import warnings

def test_textrank_empty_input():
    extractor = MultiLingualTextRank()
    result = extractor.extract_keywords("")
    assert result == {}

def test_textrank_division_by_zero():
    extractor = MultiLingualTextRank()
    # Create scenario where all PageRank scores are 0
    result = extractor.extract_keywords("a a a")
    assert all(score >= 0 for score in result.values())

def test_deprecated_parameters():
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")  # make sure DeprecationWarning is recorded
        pipeline = RAGPipeline()
        pipeline.extract_keywords("test", max_keywords=5)
        assert len(w) == 1
        assert "deprecated" in str(w[0].message).lower()

imbajin · 2025-10-20T11:49:52Z

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py

+        try:
            self._stopwords[lang] = stopwords.words(lang)
+        except LookupError as e:
+            log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)


‼️ Critical Issue: Silent failure in stopwords handling

The current error handling allows the system to continue with an empty stopwords list when NLTK data is unavailable, which could significantly degrade keyword extraction quality without clear user notification.

Problem:

except LookupError as e:
    log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
    self._stopwords[lang] = []

Recommendation:
Consider throwing an exception or providing a more prominent error (e.g., log.error) since operating without stopwords fundamentally changes extraction behavior. At minimum, this warning should be surfaced to the user through the API response.

imbajin · 2025-10-20T11:50:03Z

hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py

+        def _create_placeholder(match_obj):
+            nonlocal placeholder_id_counter
+            original_word = match_obj.group(0)
+            _placeholder = f" __shieldword_{placeholder_id_counter}__ "


‼️ Critical Issue: Placeholder collision vulnerability

The placeholder format __shieldword_{counter}__ could collide with actual text content containing similar patterns, causing incorrect word masking/unmasking.

Example failure case:
If the input text already contains "__shieldword_0__", the system would incorrectly treat it as a placeholder.

Recommendation:
Use UUID-based placeholders or add a unique session prefix:

import uuid

session_id = uuid.uuid4().hex[:8]
_placeholder = f" __shield_{session_id}_{placeholder_id_counter}__ "
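
A fuller round trip with session-prefixed placeholders might look like the sketch below. `mask_phrases` and `unmask_tokens` are hypothetical helpers, not functions from the PR, and whitespace splitting stands in for real word segmentation. The random prefix makes a collision with input text very unlikely rather than impossible.

```python
import re
import uuid


def mask_phrases(text, phrases):
    """Replace each phrase with a per-call unique placeholder; return (masked_text, mapping)."""
    session_id = uuid.uuid4().hex[:8]
    mapping = {}

    def _create_placeholder(match):
        placeholder = f"__shield_{session_id}_{len(mapping)}__"
        mapping[placeholder] = match.group(0)
        return f" {placeholder} "

    # Longest phrases first so "C++17" is preferred over "C++"; skip empty strings.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return text, mapping
    # re.escape keeps symbols such as "+" and "." literal.
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    return pattern.sub(_create_placeholder, text), mapping


def unmask_tokens(tokens, mapping):
    """Map placeholder tokens back to the original phrases; other tokens pass through."""
    return [mapping.get(token, token) for token in tokens]
```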

imbajin · 2025-10-20T11:50:14Z

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py

+        start_time = time.perf_counter()
+        ranks = {}
+        try:
+            ranks = self._textrank_model.extract_keywords(self._query)


‼️ Critical Issue: Unhandled exception in score parsing

The code assumes LLM output is perfectly formatted but doesn't handle malformed responses, which will cause runtime failures.

Problem:

keyword, score = item.split(":")
llm_ranks[keyword.strip()] = float(score)

If LLM returns keyword1::0.95 or keyword1 (no colon), this will crash.

Recommendation:
Add comprehensive error handling:

try:
    parts = item.split(":", 1)
    if len(parts) != 2:
        log.warning("Skipping malformed item: %s", item)
        continue
    keyword, score_str = parts
    keyword = keyword.strip()
    if not keyword:
        continue
    score = float(score_str.strip())
    if not 0.0 <= score <= 1.0:
        log.warning("Score out of range for %s: %s", keyword, score)
        score = max(0.0, min(1.0, score))
    llm_ranks[keyword] = score
except (ValueError, AttributeError) as e:
    log.warning("Failed to parse item '%s': %s", item, e)
    continue
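
Wrapped in a function, the same parsing logic can be exercised directly. This sketch assumes the LLM returns comma-separated `keyword:score` items, which may not match the actual prompt format; `parse_keyword_scores` is a hypothetical name.

```python
import logging
import math

log = logging.getLogger(__name__)


def parse_keyword_scores(raw, separator=","):
    """Parse 'keyword:score' items into a dict, skipping anything malformed."""
    ranks = {}
    for item in raw.split(separator):
        if not item.strip():
            continue  # ignore blanks, e.g. from a trailing separator
        parts = item.split(":", 1)
        if len(parts) != 2:
            log.warning("Skipping malformed item: %r", item)
            continue
        keyword, score_str = parts[0].strip(), parts[1].strip()
        if not keyword:
            continue
        try:
            score = float(score_str)
        except ValueError:
            log.warning("Failed to parse score for %r: %r", keyword, score_str)
            continue
        if not math.isfinite(score):
            continue  # reject nan and inf
        # Clamp into the expected 0.0-1.0 range.
        ranks[keyword] = max(0.0, min(1.0, score))
    return ranks
```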

imbajin · 2025-10-20T11:50:26Z

hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py

+            placeholder_id_counter += 1
+            return _placeholder
+
+        special_regex = regex.compile('|'.join(self.rules), regex.V1)


⚠️ Performance Issue: Repeated regex compilation

The regex pattern is recompiled on every call to _word_mask, which is inefficient for repeated extractions.

Recommendation:
Compile once during initialization:

def __init__(self, keyword_num: int = 5, window_size: int = 3):
    # ... existing code ...
    self.rules = [r"https?://\S+|www\.\S+", ...]
    self.special_regex = regex.compile('|'.join(self.rules), regex.V1)

def _word_mask(self, text):
    # ... existing code ...
    text = self.special_regex.sub(_create_placeholder, text)

imbajin · 2025-10-20T11:50:38Z

hugegraph-llm/src/hugegraph_llm/config/llm_config.py

    reranker_type: Optional[Literal["cohere", "siliconflow"]] = None
+    keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
+    window_size: Optional[int] = 3
+    hybrid_llm_weights: Optional[float] = 0.5


⚠️ Important: Missing validation for hybrid_llm_weights

The config accepts hybrid_llm_weights but doesn't validate the 0.0-1.0 range at initialization.

Recommendation:
Add Pydantic field validation:

from pydantic import Field

hybrid_llm_weights: Optional[float] = Field(
    default=0.5,
    ge=0.0,
    le=1.0,
    description="LLM weight in hybrid mode (0.0-1.0)"
)
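
To show why the range matters, here is one way a weight like hybrid_llm_weights could be applied when merging the two score sets. The linear formula `w * llm + (1 - w) * textrank` is an assumption made for illustration and may differ from what the PR actually does; `combine_scores` is a hypothetical name.

```python
def combine_scores(llm_scores, textrank_scores, llm_weight=0.5, top_k=5):
    """Blend two keyword->score dicts linearly; a keyword missing from one side scores 0.0 there."""
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must be between 0.0 and 1.0")
    combined = {}
    for keyword in set(llm_scores) | set(textrank_scores):
        combined[keyword] = (
            llm_weight * llm_scores.get(keyword, 0.0)
            + (1.0 - llm_weight) * textrank_scores.get(keyword, 0.0)
        )
    # Highest combined score first; ties broken alphabetically.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

With a weight outside 0.0-1.0, one of the two terms would get a negative coefficient, which is why validating the value at configuration time is worthwhile.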

imbajin · 2025-10-20T11:50:48Z

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py

+            log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
+            self._stopwords[lang] = []
+
+        # final check


⚠️ Code Quality: Duplicate NLTK download logic

The NLTK package download logic is duplicated between stopwords() and check_nltk_data() methods, violating DRY principle.

Recommendation:
Extract common download logic:

def _download_nltk_package(self, package: str, path: str, nltk_data_dir: str) -> bool:
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        log.info("Downloading NLTK package: %s", package)
        try:
            return nltk.download(package, download_dir=nltk_data_dir, quiet=False)
        except (URLError, HTTPError, PermissionError) as e:
            log.warning("Failed to download %s: %s", package, e)
            return False

hugegraph-llm/pyproject.toml

TextRank-fix

11c211d

fix apache#224 problem, update new UI to support change keyword extracion method

github-actions bot added the llm label Jun 27, 2025

dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels Jun 27, 2025

imbajin requested a review from Copilot June 30, 2025 12:29

Copilot AI reviewed Jun 30, 2025

feat(llm):TextRank fix

4e3fa9b

fix the pylint check bug

Gfreely changed the title ~~TextRank-fix~~ feat(llm): support TextRank Jun 30, 2025

Gfreely added 17 commits July 1, 2025 00:29

fix

a8313df

pylint bug fix

a4180ea

fix Potential issue

98471a4

fix default num

750d338

fix spilt

29ddeb1

fix bug

d2e846c

Update keyword_extract.py

9530dfb

support regular expression

f994411

Update keyword_extract.py

5c66bff

Update keyword_extract.py

f305f6c

fix language bug

78f9356

pylint fix

777589e

python-igraph version

2da9054

Merge remote-tracking branch 'origin/main' into test

79383bf

fix pyproject

9aae252

Update pyproject.toml

8b4884c

mark todo

0131563

This was referenced Jul 25, 2025

keyword_extract_copy, support textrank keyword extract #224

Closed

changetextrank #232

Closed

Gfreely added 3 commits August 6, 2025 21:52

merge main branch

6b6bfe5

Update keyword_extract.py

960481a

Update textrank

108caa5

Gfreely added 7 commits August 29, 2025 21:22

fix bug

b7f4136

fix bug

38064c3

Update keyword_extract.py

4379456

update language

61f91de

update language

27b048e

Update graph_rag_task.py

66c7ea8

Update word_extract.py

00edd28

Gfreely changed the title ~~feat(llm): support TextRank~~ feat(llm): (BREAKING CHANGE) update keyword extraction method Sep 12, 2025

Gfreely added 4 commits September 12, 2025 16:19

fix bug

7f1ce87

fix bug

f31a500

Update nltk_helper.py

9423bb4

Update nltk_helper.py

3ce504d

imbajin reviewed Sep 29, 2025

imbajin and others added 4 commits October 11, 2025 18:31

Merge branch 'main' into pr/282

d928fd8

fix bug

289ec96

Merge remote-tracking branch 'origin/TextRank-fix' into test

4083ae0

fix pylint

ff00016

imbajin reviewed Oct 20, 2025

hugegraph-llm/pyproject.toml (resolved)

fix bug

caf3156

imbajin approved these changes Oct 21, 2025

dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 21, 2025

imbajin changed the title ~~feat(llm): (BREAKING CHANGE) update keyword extraction method~~ feat(llm): update keyword extraction method (BREAKING CHANGE) Oct 21, 2025

imbajin merged commit 2eb7834 into apache:main Oct 21, 2025
10 checks passed

imbajin merged 48 commits into apache:main from Gfreely:TextRank-fix
