Skip to content

feat(llm): update keyword extraction method (BREAKING CHANGE) #282

Merged
imbajin merged 48 commits intoapache:mainfrom
Gfreely:TextRank-fix
Oct 21, 2025
Merged

feat(llm): update keyword extraction method (BREAKING CHANGE) #282
imbajin merged 48 commits intoapache:mainfrom
Gfreely:TextRank-fix

Conversation

@Gfreely
Copy link
Contributor

@Gfreely Gfreely commented Jun 27, 2025

BREAKING CHANGE
PLEASE UPDATE YOUR KEYWORD EXTRACT PROMPT
fix #224 problem, update new UI to support change keyword extraction method.

Main changes

Added options to the RAG interface for selecting the keyword extraction method(including LLM, TextRank, Hybrid) and the max number of keywords.
QQ20250818-193453

A 'TextRank mask words' setting has also been added. It allows users to manually input specific phrases composed of letters and symbols to prevent them from being split during word segmentation. And the input will also be saved.
QQ20250818-193518

Test results

TextRank Method:
-Input
image

-Result:
image

Hybrid Method:
QQ20250818-193508

fix apache#224 problem, update new UI to support change keyword extracion method
@github-actions github-actions bot added the llm label Jun 27, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jun 27, 2025
@imbajin imbajin requested a review from Copilot June 30, 2025 12:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes issue #224 by updating the keyword extraction mechanism and UI options to support both TextRank and LLM-based extraction methods.

  • Adds new extraction method parameters (extract_method, textrank_kwargs, language, mask_words, window_size) in the keyword extraction operator.
  • Updates the RAG UI components to pass these new parameters.
  • Updates dependencies (networkx and scipy) in pyproject.toml to support the new functionality.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py Added new extraction method settings and implemented a MultiLingualTextRank class.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py Updated parameter passing to support the new keyword extraction options.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py Revised UI inputs to include extraction method, language, mask words, and window size.
hugegraph-llm/pyproject.toml Added networkx and scipy dependencies.
Comments suppressed due to low confidence (1)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:190

  • The 're' module is used in the _preprocess method but is not imported in the file, which may cause a runtime error. Please add 'import re' at the top of the file.
                escaped_words = [re.escape(word) for word in self.mask_words]

fix the pylint check bug
@Gfreely Gfreely changed the title TextRank-fix feat(llm): support TextRank Jun 30, 2025
@Gfreely Gfreely changed the title feat(llm): support TextRank feat(llm): (BREAKING CHANGE) update keyword extraction method Sep 12, 2025
Copy link
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Overall Assessment: Changes Required

Key Findings

  • 🚨 Critical: 3 issues found (breaking changes, error handling, security concerns)
  • ⚠️ Medium: 4 improvements suggested
  • 🧹 Minor: 2 optimizations noted

Main Concerns

1. 🚨 Breaking API Changes Without Migration Path

The PR removes parameters (language, max_keywords) from method signatures in graph_rag_task.py without providing backward compatibility. This will break all existing code using these parameters.

File: hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (lines 57-82)
Recommendation: Add deprecated parameter support with warnings to maintain backward compatibility during transition period.

2. 🚨 Insufficient Error Handling in TextRank Implementation

The new MultiLingualTextRank class lacks proper error handling for edge cases.

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 133)
Issue: Division by zero possible when max(pagerank_scores) is 0
Fix: Add zero check before normalization

3. 🚨 Regex Injection Vulnerability

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 55)
The dynamic regex compilation from self.rules without proper sanitization could lead to ReDoS attacks.
Fix: Validate and sanitize regex patterns before compilation.

Medium Priority Issues

1. ⚠️ Uncaught Network Errors in NLTK Downloads

File: hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (lines 53-57)
Only catching specific errors but not handling timeout or other network issues.

2. ⚠️ Missing Input Validation

File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (lines 62-66)
The integer conversion for max_keywords catches exceptions but doesn't log warnings for invalid input.

3. ⚠️ Hardcoded Magic Numbers

File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (line 31)
Window size validation uses hardcoded max value (10) without explanation.

4. ⚠️ Missing Tests

No test files included in this PR for the new TextRank functionality and hybrid extraction method.

Minor Issues

1. 🧹 Import Organization

Multiple files have import reorganization that seems unnecessary and makes the diff harder to read.

2. 🧹 Inconsistent Language Handling

The language determination logic appears in multiple places - should be centralized.

Positive Aspects

✅ Good addition of multiple extraction methods (LLM, TextRank, Hybrid)
✅ Proper logging added for debugging
✅ Config documentation updated appropriately

Required Actions Before Approval

  1. Add migration guide for breaking changes
  2. Fix the division by zero issue in TextRank
  3. Add input validation for regex patterns
  4. Include comprehensive tests for new functionality
  5. Consider adding backward compatibility layer

@imbajin
Copy link
Member

imbajin commented Sep 29, 2025

Detailed Code Review - Additional Comments

Critical Issue Details

1. Division by Zero in TextRank PageRank Normalization

In textrank_word_extract.py, line 129-131:

pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if max(pagerank_scores) > 0:
    pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]

Problem: The code already checks for max(pagerank_scores) > 0, which is good, but doesn't handle the else case properly.

Suggested Fix:

pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
if pagerank_scores and max(pagerank_scores) > 0:
    max_score = max(pagerank_scores)
    pagerank_scores = [score/max_score for score in pagerank_scores]
else:
    pagerank_scores = [0.0] * len(pagerank_scores) if pagerank_scores else []

2. Breaking Changes in graph_rag_task.py

The removal of language and max_keywords parameters needs careful handling:

Current problematic change (lines 57-82):

# Before:
def extract_keywords(self, text=None, max_keywords=5, language="english", extract_template=None):

# After: 
def extract_keywords(self, text=None, extract_template=None):

Suggested backward-compatible implementation:

def extract_keywords(
    self,
    text: Optional[str] = None,
    extract_template: Optional[str] = None,
    max_keywords: Optional[int] = None,  # Deprecated
    language: Optional[str] = None,  # Deprecated
    **kwargs
):
    """
    Extract keywords from text.
    
    Args:
        text: Text to extract keywords from
        extract_template: Template for extraction
        max_keywords: [DEPRECATED] Use context['max_keywords'] instead
        language: [DEPRECATED] Use llm_settings.language instead
    """
    import warnings
    
    if max_keywords is not None:
        warnings.warn(
            "max_keywords parameter is deprecated and will be removed in v2.0. "
            "Please pass it via context dictionary instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store in context for backward compatibility
        self._context['max_keywords'] = max_keywords
    
    if language is not None:
        warnings.warn(
            "language parameter is deprecated and will be removed in v2.0. "
            "Please configure it via llm_settings instead.",
            DeprecationWarning,
            stacklevel=2
        )
        # Store for backward compatibility
        self._context['language'] = language
        
    self._operators.append(
        KeywordExtract(
            text=text,
            extract_template=extract_template,
            **kwargs
        )
    )
    return self

3. Regex Injection Prevention

In textrank_word_extract.py, the rules list should be immutable and validated:

Current vulnerable code:

self.rules = [r"https?://\\S+|www\\.\\S+",
              r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
              r"\\b\\w+(?:[-'\\']\\w+)+\\b",
              r"\\b\\d+[,.]\\d+\\b"]

Suggested secure implementation:

import re

class MultiLingualTextRank:
    # Make rules immutable
    _SAFE_RULES = tuple([
        r"https?://\\S+|www\\.\\S+",
        r"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b",
        r"\\b\\w+(?:[-'\\']\\w+)+\\b",
        r"\\b\\d+[,.]\\d+\\b"
    ])
    
    def __init__(self, keyword_num: int = 5, window_size: int = 3):
        # Validate and compile patterns at initialization
        try:
            self._compiled_rules = re.compile('|'.join(self._SAFE_RULES), re.IGNORECASE)
        except re.error as e:
            raise ValueError(f"Invalid regex pattern: {e}")

4. NLTK Error Handling Enhancement

In nltk_helper.py, add more comprehensive error handling:

def check_nltk_data(self):
    """Check and download required NLTK data packages."""
    import socket
    
    # Set timeout for network operations
    original_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(30)
    
    try:
        # ... existing code ...
        for package in required_packages:
            try:
                # ... existing check ...
            except LookupError:
                max_retries = 3
                for attempt in range(max_retries):
                    try:
                        log.info("Downloading NLTK package %s (attempt %d/%d)", 
                                package, attempt + 1, max_retries)
                        nltk.download(package, download_dir=nltk_data_dir, quiet=False)
                        break
                    except (URLError, HTTPError, PermissionError, socket.timeout) as e:
                        if attempt == max_retries - 1:
                            log.error("Failed to download %s after %d attempts: %s", 
                                    package, max_retries, e)
                        else:
                            time.sleep(2 ** attempt)  # Exponential backoff
    finally:
        socket.setdefaulttimeout(original_timeout)

Testing Requirements

The PR is missing test coverage for:

  1. TextRank algorithm tests:

    • Test with empty text
    • Test with single word
    • Test with non-ASCII characters
    • Test window size edge cases
    • Test PageRank score normalization
  2. Hybrid extraction tests:

    • Test weight balancing
    • Test fallback when one method fails
    • Test score combination logic
  3. Migration tests:

    • Test deprecated parameter warnings
    • Test backward compatibility

Example test structure:

def test_textrank_empty_input():
    extractor = MultiLingualTextRank()
    result = extractor.extract_keywords("")
    assert result == {}

def test_textrank_division_by_zero():
    extractor = MultiLingualTextRank()
    # Create scenario where all PageRank scores are 0
    result = extractor.extract_keywords("a a a")  
    assert all(score >= 0 for score in result.values())

def test_deprecated_parameters():
    with warnings.catch_warnings(record=True) as w:
        pipeline = RAGPipeline()
        pipeline.extract_keywords("test", max_keywords=5)
        assert len(w) == 1
        assert "deprecated" in str(w[0].message).lower()

try:
self._stopwords[lang] = stopwords.words(lang)
except LookupError as e:
log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Critical Issue: Silent failure in stopwords handling

The current error handling allows the system to continue with an empty stopwords list when NLTK data is unavailable, which could significantly degrade keyword extraction quality without clear user notification.

Problem:

except LookupError as e:
    log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
    self._stopwords[lang] = []

Recommendation:
Consider throwing an exception or providing a more prominent error (e.g., log.error) since operating without stopwords fundamentally changes extraction behavior. At minimum, this warning should be surfaced to the user through the API response.

def _create_placeholder(match_obj):
nonlocal placeholder_id_counter
original_word = match_obj.group(0)
_placeholder = f" __shieldword_{placeholder_id_counter}__ "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Critical Issue: Placeholder collision vulnerability

The placeholder format __shieldword_{counter}__ could collide with actual text content containing similar patterns, causing incorrect word masking/unmasking.

Example failure case:
If the input text already contains "shieldword_0", the system would incorrectly treat it as a placeholder.

Recommendation:
Use UUID-based placeholders or add a unique session prefix:

import uuid
session_id = uuid.uuid4().hex[:8]
_placeholder = f" __shield_{session_id}_{placeholder_id_counter}__ "

start_time = time.perf_counter()
ranks = {}
try:
ranks = self._textrank_model.extract_keywords(self._query)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Critical Issue: Unhandled exception in score parsing

The code assumes LLM output is perfectly formatted but doesn't handle malformed responses, which will cause runtime failures.

Problem:

keyword, score = item.split(":")
llm_ranks[keyword.strip()] = float(score)

If LLM returns keyword1::0.95 or keyword1 (no colon), this will crash.

Recommendation:
Add comprehensive error handling:

try:
    parts = item.split(":", 1)
    if len(parts) != 2:
        log.warning("Skipping malformed item: %s", item)
        continue
    keyword, score_str = parts
    keyword = keyword.strip()
    if not keyword:
        continue
    score = float(score_str.strip())
    if not 0.0 <= score <= 1.0:
        log.warning("Score out of range for %s: %s", keyword, score)
        score = max(0.0, min(1.0, score))
    llm_ranks[keyword] = score
except (ValueError, AttributeError) as e:
    log.warning("Failed to parse item '%s': %s", item, e)
    continue

placeholder_id_counter += 1
return _placeholder

special_regex = regex.compile('|'.join(self.rules), regex.V1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Issue: Repeated regex compilation

The regex pattern is recompiled on every call to _word_mask, which is inefficient for repeated extractions.

Recommendation:
Compile once during initialization:

def __init__(self, keyword_num: int = 5, window_size: int = 3):
    # ... existing code ...
    self.rules = [r"https?://\S+|www\.\S+", ...]
    self.special_regex = regex.compile('|'.join(self.rules), regex.V1)

def _word_mask(self, text):
    # ... existing code ...
    text = self.special_regex.sub(_create_placeholder, text)

reranker_type: Optional[Literal["cohere", "siliconflow"]] = None
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
window_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Important: Missing validation for hybrid_llm_weights

The config accepts hybrid_llm_weights but doesn't validate the 0.0-1.0 range at initialization.

Recommendation:
Add Pydantic field validation:

from pydantic import Field

hybrid_llm_weights: Optional[float] = Field(
    default=0.5,
    ge=0.0,
    le=1.0,
    description="LLM weight in hybrid mode (0.0-1.0)"
)

log.warning("NLTK stopwords for lang=%s not found: %s; using empty list", lang, e)
self._stopwords[lang] = []

# final check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code Quality: Duplicate NLTK download logic

The NLTK package download logic is duplicated between stopwords() and check_nltk_data() methods, violating DRY principle.

Recommendation:
Extract common download logic:

def _download_nltk_package(self, package: str, path: str, nltk_data_dir: str) -> bool:
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        log.info("Downloading NLTK package: %s", package)
        try:
            return nltk.download(package, download_dir=nltk_data_dir, quiet=False)
        except (URLError, HTTPError, PermissionError) as e:
            log.warning("Failed to download %s: %s", package, e)
            return False

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 21, 2025
@imbajin imbajin changed the title feat(llm): (BREAKING CHANGE) update keyword extraction method feat(llm): update keyword extraction method (BREAKING CHANGE) Oct 21, 2025
@imbajin imbajin merged commit 2eb7834 into apache:main Oct 21, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request lgtm This PR has been approved by a maintainer llm size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants