feat: update mediaWiki to be a search agent! #227
Conversation
- Updated `tc-hivemind-backend` to version 1.4.8 and added several `langchain` packages to `requirements.txt`.
- Modified the `serialize_references` method in `temporal_tasks.py` to accept a broader type for references.
- Implemented a new `MediaWikiQueryEngine` in `media_wiki.py` that uses an agent-based approach for querying MediaWiki, including improved error handling and content fetching.
- Updated the prompt generation logic in `subquery_gen_prompt.py` to ensure specific handling for MediaWiki queries.
Walkthrough
This PR updates dependencies, including LangChain ecosystem packages and tc-hivemind-backend; converts the MediaWiki query engine to an agent-based architecture using LangChain tools; updates serialize_references to handle both dict and SubQuestionAnswerPair types; and adds MediaWiki-specific prompt handling for subquery generation.
Sequence Diagram
sequenceDiagram
participant User
participant MediaWikiQueryEngine
participant Agent
participant MediaWikiSearchTool
participant MediaWikiAPI
participant LLM
User->>MediaWikiQueryEngine: query(query_str)
MediaWikiQueryEngine->>MediaWikiQueryEngine: prepare() if needed
MediaWikiQueryEngine->>Agent: Initialize with mediawiki_search tool
Agent->>LLM: Process query with system prompt
LLM->>Agent: Determine need for mediawiki_search
Agent->>MediaWikiSearchTool: Call mediawiki_search(query)
MediaWikiSearchTool->>MediaWikiAPI: Search & paginate results
MediaWikiAPI-->>MediaWikiSearchTool: Search results
loop For each page result
MediaWikiSearchTool->>MediaWikiAPI: _fetch_page_content(title)
MediaWikiAPI-->>MediaWikiSearchTool: Page content
end
MediaWikiSearchTool-->>Agent: Page titles → content mapping
Agent->>LLM: Generate response with References section
LLM-->>Agent: Response text with citations
Agent-->>MediaWikiQueryEngine: Agent response
MediaWikiQueryEngine->>MediaWikiQueryEngine: _extract_source_nodes_from_response()
MediaWikiQueryEngine-->>User: Response with NodeWithScore sources
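The tool step in the diagram above (search, fetch each page, return a title-to-content mapping) can be sketched roughly as follows. This is only an illustration: `search_and_fetch` and the two injected callables are hypothetical stand-ins, not the PR's code, which calls the MediaWiki API and is wired into a LangChain agent.

```python
from typing import Callable, Dict, List


def search_and_fetch(
    query: str,
    search_titles: Callable[[str], List[str]],
    fetch_content: Callable[[str], str],
    max_pages: int = 10,
) -> Dict[str, str]:
    """Map each matching page title to its content, capped at max_pages."""
    titles = search_titles(query)[:max_pages]
    return {title: fetch_content(title) for title in titles}


# Stand-in data and callables so the sketch runs without network access.
fake_index = {
    "Commons": "Text about the commons.",
    "Peer Production": "Text about peer production.",
}
result = search_and_fetch(
    "commons",
    search_titles=lambda q: [t for t in fake_index if q.lower() in t.lower()],
    fetch_content=lambda t: fake_index[t],
)
print(result)
```

Injecting the search and fetch functions keeps the mapping logic testable without hitting a live wiki.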
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (4)
utils/query_engine/media_wiki.py (4)
67-80: Refactor to use else clause and more specific exception handling.
The return statement on line 70 should be in an else block for better readability. Additionally, catching bare Exception on line 78 is too broad and may hide unexpected errors.
Apply this diff:
-            if not extract:
-                return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
-
-            return extract
+            if extract:
+                return extract
+            else:
+                return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
         except requests.Timeout:
             logger.warning(f"Timeout error while fetching content for '{title}'")
             return "(Error: Request timeout while fetching page content.)"
         except requests.RequestException as e:
             logger.warning(f"Request error while fetching content for '{title}': {e}")
-            return f"(Error: Request failed - {str(e)})"
-        except Exception as e:
+            return f"(Error: Request failed - {e!s})"
+        except (KeyError, ValueError, AttributeError) as e:
             logger.warning(f"Unexpected error while fetching content for '{title}': {e}")
-            return f"(Error: {str(e)})"
+            return f"(Error: {e!s})"
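For comparison, a minimal sketch of the `try`/`except`/`else` shape that Ruff's TRY300 points toward, with narrow exception types. `extract_from_payload` and `NO_EXTRACT` are hypothetical names, and the nested `query` / `pages` / `extract` payload shape is assumed here rather than taken from the PR.

```python
from typing import Any, Dict

NO_EXTRACT = "(No extract available.)"


def extract_from_payload(payload: Dict[str, Any]) -> str:
    """Pull the first page extract out of an already-parsed API payload."""
    try:
        # Only the lookups that can fail on a malformed payload sit in the try body.
        pages = payload["query"]["pages"]
        page = next(iter(pages.values()))
    except (KeyError, StopIteration, AttributeError) as e:
        return f"(Error: {e!r})"
    else:
        # Runs only when the try body succeeded; errors here are not swallowed.
        return page.get("extract") or NO_EXTRACT
```

Keeping the success path in `else` means an unexpected bug in that path surfaces as a real exception instead of being reported as a fetch error.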
120-131: Use more specific exception handling.
Similar to the previous function, catching bare Exception on line 129 is too broad. Consider catching specific exceptions that might occur during JSON parsing or response handling.
Apply this diff:
         except requests.RequestException as e:
             logger.warning(f"Request error during search: {e}")
-            return {"error": f"Search request failed - {str(e)}"}
-        except Exception as e:
+            return {"error": f"Search request failed - {e!s}"}
+        except (ValueError, KeyError) as e:
             logger.warning(f"Unexpected error during search: {e}")
-            return {"error": f"Search failed - {str(e)}"}
+            return {"error": f"Search failed - {e!s}"}
226-227: Consider documenting unused parameters with leading underscore.
The parameters enable_answer_skipping and testing are intentionally unused for interface compatibility. Consider prefixing them with an underscore (_enable_answer_skipping, _testing) to indicate they're intentionally unused, or add explicit documentation in the docstring.
Apply this diff:
-    def prepare(self, enable_answer_skipping: bool = False, testing: bool = False):
+    def prepare(self, _enable_answer_skipping: bool = False, _testing: bool = False):
         """
         Prepare the query engine by initializing the agent.
         This method maintains interface compatibility with the old BaseQdrantEngine.
+
+        Parameters:
+        -----------
+        _enable_answer_skipping : bool
+            Unused - kept for interface compatibility
+        _testing : bool
+            Unused - kept for interface compatibility
         Returns:
344-346: Use robust URL construction.
The URL construction on line 346 assumes api_url ends with /api.php and simply replaces it. This is brittle and may fail for different URL formats.
Apply this diff:
+        from urllib.parse import urljoin, urlparse
+
         # Create URL for the page
         url_route = title.replace(" ", "_")
-        url = f"{self.api_url.replace('/api.php', '')}/{url_route}"
+        # Get base URL by removing path from api_url
+        parsed = urlparse(self.api_url)
+        base_url = f"{parsed.scheme}://{parsed.netloc}"
+        url = urljoin(base_url, f"/{url_route}")
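One possible alternative, sketched under the same assumption as the original code (that pages live alongside `api.php`): drop only the last path segment so a prefix such as `/w` survives, and percent-encode the title. `build_page_url` is a hypothetical helper, not part of the PR, and a wiki whose article path differs from its API path would still need explicit configuration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_page_url(api_url: str, title: str) -> str:
    """Derive a page URL from the API endpoint URL and a page title."""
    parts = urlsplit(api_url)
    # Remove only the final path segment (e.g. "api.php"), keeping any prefix.
    base_path = parts.path.rsplit("/", 1)[0]
    # MediaWiki-style titles use underscores; quote() escapes characters like "?".
    page = quote(title.replace(" ", "_"))
    return urlunsplit((parts.scheme, parts.netloc, f"{base_path}/{page}", "", ""))


print(build_page_url("https://wiki.p2pfoundation.net/api.php", "Peer Production"))
```

Note that joining against scheme and host alone, as in the suggested diff, discards any path prefix, which changes behavior for endpoints such as `/w/api.php`.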
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
requirements.txt (2 hunks)
temporal_tasks.py (1 hunk)
utils/query_engine/media_wiki.py (1 hunk)
utils/query_engine/subquery_gen_prompt.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
utils/query_engine/media_wiki.py (2)
utils/query_engine/base_qdrant_engine.py (1)
prepare (25-36)
temporal_tasks.py (1)
run (81-103)
🪛 Ruff (0.14.1)
utils/query_engine/media_wiki.py
70-70: Consider moving this statement to an else block (TRY300)
77-77: Use explicit conversion flag; replace with conversion flag (RUF010)
78-78: Do not catch blind exception: Exception (BLE001)
80-80: Use explicit conversion flag; replace with conversion flag (RUF010)
128-128: Use explicit conversion flag; replace with conversion flag (RUF010)
129-129: Do not catch blind exception: Exception (BLE001)
131-131: Use explicit conversion flag; replace with conversion flag (RUF010)
182-182: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
185-185: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
226-226: Unused method argument: enable_answer_skipping (ARG002)
226-226: Unused method argument: testing (ARG002)
305-305: Consider moving this statement to an else block (TRY300)
307-307: Do not catch blind exception: Exception (BLE001)
308-308: Use logging.exception instead of logging.error; replace with exception (TRY400)
311-311: Use explicit conversion flag; replace with conversion flag (RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (2)
temporal_tasks.py (1)
105-112: LGTM! Good defensive handling of mixed reference types.
The broadened signature and isinstance check properly handle both the legacy dict format and pre-constructed SubQuestionAnswerPair objects. This maintains backward compatibility while supporting the new agent-based workflow.
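The pattern being approved here can be illustrated with a small sketch. `SubQuestionAnswerPairStub`, its fields, and `normalize_references` are hypothetical stand-ins chosen for the example; the actual `SubQuestionAnswerPair` class and `serialize_references` body in the PR may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class SubQuestionAnswerPairStub:
    """Stand-in for the real pair type; field names are assumptions."""
    question: str
    answer: str


Reference = Union[Dict[str, str], SubQuestionAnswerPairStub]


def normalize_references(refs: List[Reference]) -> List[SubQuestionAnswerPairStub]:
    """Accept legacy dicts and pre-built objects, returning objects only."""
    normalized = []
    for ref in refs:
        if isinstance(ref, dict):
            # Legacy format: build the object from the dict's keys.
            ref = SubQuestionAnswerPairStub(**ref)
        normalized.append(ref)
    return normalized


mixed = [
    {"question": "What is P2P?", "answer": "Peer to peer."},
    SubQuestionAnswerPairStub(question="What is a commons?", answer="A shared resource."),
]
print(normalize_references(mixed))
```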
requirements.txt (1)
8-8: LangChain packages are significantly outdated; verify compatibility before merging.
The versions pinned in requirements.txt (lines 26-31) are current and free from documented security vulnerabilities. However, the LangChain packages are at 0.3.x while the latest stable releases are 1.0.x (langchain, langchain-core, langchain-openai, langchain-text-splitters), a major version gap. No known 2024 security vulnerabilities affect langchain 0.3.27 or langchain-community 0.3.30; the known 2024 CVEs targeted earlier versions (< 0.2.x). If the versions are intentionally pinned for compatibility, confirm that the dependency suite works together as expected, especially given the OpenTelemetry instrumentation version gap (0.31.0 vs latest 0.47.5).
def __init__(
    self,
    community_id: str,
    platform_id: str = None,
    api_url: str = "https://wiki.p2pfoundation.net/api.php",
    max_pages: int = 10,
    system_prompt: str = None,
    llm_model: str = "gpt-4o-mini",
    temperature: float = 0,
) -> None:
Fix implicit Optional type annotations.
Parameters platform_id and system_prompt have default value None but are not explicitly typed as optional.
Apply this diff:
 def __init__(
     self,
     community_id: str,
-    platform_id: str = None,
+    platform_id: str | None = None,
     api_url: str = "https://wiki.p2pfoundation.net/api.php",
     max_pages: int = 10,
-    system_prompt: str = None,
+    system_prompt: str | None = None,
     llm_model: str = "gpt-4o-mini",
     temperature: float = 0,
 ) -> None:
🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 179 to 188, the parameters
platform_id and system_prompt are given default None but typed as plain str;
update their annotations to Optional[str] and add "from typing import Optional"
at the top of the file if not already imported so the signatures become
platform_id: Optional[str] = None and system_prompt: Optional[str] = None.
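Both spellings discussed in this comment describe the same type: `str | None` in an evaluated annotation needs Python 3.10 or later, while `Optional[str]` also works on older interpreters. A small sketch using the `Optional` form follows; `make_label` is a hypothetical function used only to show the annotation.

```python
from typing import Optional, get_type_hints


def make_label(community_id: str, platform_id: Optional[str] = None) -> str:
    """Build a label, treating a missing platform_id explicitly."""
    if platform_id is None:
        return community_id
    return f"{community_id}/{platform_id}"


# The explicit annotation is visible to type checkers and to runtime introspection.
hints = get_type_hints(make_label)
print(hints["platform_id"])
print(make_label("community-1"), make_label("community-1", "platform-9"))
```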
except Exception as e:
    logger.error(f"Error during MediaWiki query: {e}")
    # Return error response
    error_response = Response(
        response=f"Error querying MediaWiki: {str(e)}",
🛠️ Refactor suggestion | 🟠 Major
Use logging.exception for better error diagnostics.
When logging in an exception handler, use logging.exception() instead of logging.error() to automatically include the stack trace.
Apply this diff:
 except Exception as e:
-    logger.error(f"Error during MediaWiki query: {e}")
+    logger.exception(f"Error during MediaWiki query: {e}")
     # Return error response
     error_response = Response(
-        response=f"Error querying MediaWiki: {str(e)}",
+        response=f"Error querying MediaWiki: {e!s}",
         source_nodes=[]
     )
     return error_response
🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 307 to 311, the exception
handler currently calls logger.error which omits the stack trace; replace
logger.error(f"Error during MediaWiki query: {e}") with logger.exception("Error
during MediaWiki query") so the stack trace is automatically included, and keep
constructing and returning the same error_response afterwards (no change to the
Response construction).
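To see the difference this change makes, here is a minimal, self-contained sketch that logs into an in-memory stream. The logger name `mediawiki_sketch` and the failing `risky_query` function are made up for the example; only the standard `logging` calls are real.

```python
import io
import logging

# Send log records to an in-memory buffer so the output can be inspected.
stream = io.StringIO()
logger = logging.getLogger("mediawiki_sketch")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def risky_query() -> str:
    raise ValueError("boom")


try:
    risky_query()
except ValueError:
    # Inside an except block, exception() logs at ERROR level and appends the traceback.
    logger.exception("Error during MediaWiki query")

output = stream.getvalue()
print(output)
```

Because the traceback already carries the exception text, the message itself does not need to interpolate `e`.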
- If the question is for MediaWiki or Wikipedia, don't change the question, just use the question as exactly as it is.
- If the question is for MediaWiki or Wikipedia, just return 1 subquestion.
Clarify the conflict with the "include every tool" requirement.
Line 24 states "just return 1 subquestion" for MediaWiki/Wikipedia queries, but line 22 requires "Include every tool at least once." This creates a logical conflict when MediaWiki is one of multiple tools.
Additionally, these instructions rely on the LLM to detect MediaWiki/Wikipedia questions without explicit criteria. Consider adding clear detection rules or handling this case programmatically before prompt generation.
Consider either:
- Removing the "include every tool" constraint when MediaWiki is detected, or
- Handling MediaWiki queries in a separate code path before reaching this prompt
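The second option could look roughly like the sketch below: the MediaWiki tool gets exactly one subquestion, with the user's question unchanged, before any prompt is built, and only the remaining tools go through LLM subquestion generation. The tool names and `route_tools` are hypothetical; the PR's actual tool identifiers are not shown in this review.

```python
from typing import List, Tuple

MEDIAWIKI_TOOL = "MediaWiki"  # assumed tool name, for illustration only


def route_tools(question: str, tools: List[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split tools into fixed (tool, subquestion) pairs and tools left for the LLM."""
    # MediaWiki receives the question verbatim, exactly once.
    fixed = [(tool, question) for tool in tools if tool == MEDIAWIKI_TOOL]
    # Everything else still goes through prompt-based subquestion generation.
    remaining = [tool for tool in tools if tool != MEDIAWIKI_TOOL]
    return fixed, remaining


fixed, remaining = route_tools("What is peer production?", ["MediaWiki", "Discord"])
print(fixed, remaining)
```

With this split, the "include every tool at least once" rule applies only to the remaining tools, so the two prompt instructions no longer conflict.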