keyword_extract_copy, support textrank keyword extract#224

Closed
LoskiClaw wants to merge 3 commits into apache:main from LoskiClaw:addtextrank

Conversation

@LoskiClaw

No description provided.

@dosubot (bot) added the size:L label ("This PR changes 100-499 lines, ignoring generated files.") on May 9, 2025
@github-actions (bot) added the llm label on May 9, 2025
@dosubot (bot) added the enhancement label ("New feature or request") on May 9, 2025
@imbajin requested a review from Copilot on May 9, 2025, 10:53
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new keyword extraction module with support for TextRank-based extraction alongside LLM-based extraction.

  • Adds a new module "keyword_extract_copy.py" with support for both English and Chinese text.
  • Implements language-specific pre-processing and includes basic test functions.
Comments suppressed due to low confidence (2)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract_copy.py:159

  • Replace print statements with formal assertions in test cases to ensure that failures are automatically detected during testing.
print( any(k in ["processing", "language", "human"] for k in result["keywords"]))

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract_copy.py:173

  • Consider using assertion statements instead of print statements in the test function for a more robust and automated testing approach.
print( any(k in expected_keywords for k in result["keywords"]))
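A minimal sketch of what the two suggestions above could look like. The helper name `has_expected_keyword` and the stand-in `result` dictionary are assumptions for illustration; the real test would call the extractor from this PR instead.

```python
def has_expected_keyword(result, expected_keywords):
    """Return True if at least one extracted keyword is in the expected list."""
    return any(k in expected_keywords for k in result["keywords"])


def test_english_keywords():
    # Stand-in for the extractor output (hypothetical data, not from the PR).
    result = {"keywords": ["language", "model", "text"]}
    expected = ["processing", "language", "human"]
    # An assertion fails the test run automatically; a print only shows True/False.
    assert has_expected_keyword(result, expected), (
        f"none of {expected} found in {result['keywords']}"
    )
```

With an assertion, a test runner such as pytest reports the failure and the message, whereas a printed `False` is easy to miss.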

Comment on lines +26 to +27
sys.path.append('/mnt/WD4T/workspace/hs/incubator-hugegraph-ai/hugegraph-llm/src')

Copilot AI May 9, 2025

Avoid using hardcoded absolute paths to modify the module search path; consider configuring paths through environment variables or project configuration to ensure portability.

Suggested change
- sys.path.append('/mnt/WD4T/workspace/hs/incubator-hugegraph-ai/hugegraph-llm/src')
+ import os
+ # Dynamically determine the base directory of the project
+ base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../../.."))
+ sys.path.append(base_dir)
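As a possible variation on this suggestion, the directory can be located by package name rather than by counting `..` segments. Going by the file path cited in this review, the directory that has to be importable is the parent of the `hugegraph_llm` package (that is, `src`). The helper names `find_src_root` and `ensure_on_sys_path` are hypothetical and are not part of the PR.

```python
import sys
from pathlib import Path


def find_src_root(module_file, package_name="hugegraph_llm"):
    """Return the directory that contains `package_name`, searching upward."""
    path = Path(module_file).resolve()
    for parent in path.parents:
        if parent.name == package_name:
            # The package's parent is what must be on sys.path.
            return parent.parent
    raise FileNotFoundError(f"{package_name!r} not found above {module_file}")


def ensure_on_sys_path(directory):
    """Put `directory` at the front of sys.path unless it is already listed."""
    entry = str(directory)
    if entry not in sys.path:
        sys.path.insert(0, entry)
```

Inside the module this would be called as `ensure_on_sys_path(find_src_root(__file__))`. Installing the package in editable mode would avoid touching `sys.path` at all.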

@MrJs133
Contributor

MrJs133 commented May 14, 2025

Do we need to install gensim < 4.0.0?

@LoskiClaw
Author

> Do we need to install gensim < 4.0.0?

yes, I use version 3.8.1.

@MrJs133
Contributor

MrJs133 commented May 14, 2025

> > Do we need to install gensim < 4.0.0?
>
> yes, I use version 3.8.1.

I want to run this code, but I'm having issues related to package versions. What are the versions of scipy, numpy, and Python that you're using?

@LoskiClaw
Author

> > > Do we need to install gensim < 4.0.0?
> >
> > yes, I use version 3.8.1.
>
> I want to run this code, but I'm having issues related to package versions. What are the versions of scipy, numpy, and Python that you're using?

scipy=1.12.0, numpy=1.26.4, python=3.10.16
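For anyone trying to reproduce this environment, a small sketch that compares the installed packages against the versions reported in this thread. The function name `compare_versions` is an assumption for illustration; `importlib.metadata` is part of the standard library from Python 3.8.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in this thread (Python itself was reported as 3.10.16).
REPORTED_VERSIONS = {"gensim": "3.8.1", "scipy": "1.12.0", "numpy": "1.26.4"}


def compare_versions(expected):
    """Map each package to (installed_version_or_None, expected_version, matches)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, wanted, ok) in compare_versions(REPORTED_VERSIONS).items():
        status = "ok" if ok else "differs"
        print(f"{package}: installed={installed} reported={wanted} ({status})")
```

A mismatch does not necessarily mean the code will fail; it only shows where the local environment differs from the author's.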

Gfreely added a commit to Gfreely/incubator-hugegraph-ai that referenced this pull request Jun 27, 2025
fix apache#224 problem, update new UI to support changing the keyword extraction method
@imbajin
Member

imbajin commented Jul 25, 2025

addressed by #282

@imbajin closed this on Jul 25, 2025
imbajin added a commit that referenced this pull request Oct 21, 2025
BREAKING CHANGE
**MUST**: UPDATE YOUR "KEYWORD EXTRACT PROMPT" TO THE LATEST VERSION

Fixes the problem from #224 and updates the UI to support changing the
keyword extraction method.

**Main changes**

Added options to the RAG interface for selecting the keyword extraction
method (LLM, TextRank, or Hybrid) and the maximum number of keywords.
<img width="619" height="145" alt="QQ20250818-193453"
src="https://github.com/user-attachments/assets/3c0d21f0-82bb-4176-bfe2-1b0744c06b6d"
/>

A 'TextRank mask words' setting has also been added. It allows users to
manually enter specific phrases composed of letters and symbols so that
they are not split during word segmentation. The entered phrases are
saved for later use.
<img width="1207" height="263" alt="QQ20250818-193518"
src="https://github.com/user-attachments/assets/6366789a-f87d-46a4-a85a-9f3b4d9ce9a5"
/>
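
One way such mask words could work, shown as a hedged sketch rather than the project's implementation: try the protected phrases before the generic word pattern during tokenization, so a phrase like `C++` stays in one piece. The function name `tokenize_with_mask` is hypothetical.

```python
import re


def tokenize_with_mask(text, mask_words):
    """Split text into word tokens while keeping each mask phrase intact."""
    # Longest phrases first, so a longer mask wins over a shorter one it contains.
    protected = sorted({p for p in mask_words if p}, key=len, reverse=True)
    # Escape the phrases so symbols such as "+" are matched literally.
    alternatives = [re.escape(p) for p in protected] + [r"\w+"]
    return re.findall("|".join(alternatives), text)
```

For example, with the mask words `C++` and `gpt-4`, the text `I use C++ and gpt-4 daily` keeps both phrases as single tokens instead of splitting them at the symbols.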


**Test results**

TextRank Method:
- Input:
<img width="363" height="144" alt="image"
src="https://github.com/user-attachments/assets/4a6267f7-3982-4fca-82df-60cd55bed6af"
/>

- Result:
<img width="232" height="118" alt="image"
src="https://github.com/user-attachments/assets/54a34d00-e588-44ad-9eff-d7281d7d93e5"
/>


Hybrid Method:
<img width="710" height="129" alt="QQ20250818-193508"
src="https://github.com/user-attachments/assets/541534fd-cec0-4002-9967-e49954a6c19e"
/>

---------

Co-authored-by: imbajin <jin@apache.org>

Labels

enhancement New feature or request llm size:L This PR changes 100-499 lines, ignoring generated files.

5 participants