feat: update mediaWiki to be a search agent! by amindadgar · Pull Request #227 · TogetherCrew/hivemind-bot

amindadgar · 2025-10-28T09:32:37Z

Updated tc-hivemind-backend to version 1.4.8 and added several langchain packages to requirements.txt.
Modified the serialize_references method in temporal_tasks.py to accept a broader type for references.
Implemented a new MediaWikiQueryEngine in media_wiki.py that utilizes an agent-based approach for querying MediaWiki, including improved error handling and content fetching.
Updated the prompt generation logic in subquery_gen_prompt.py to ensure specific handling for MediaWiki queries.

Summary by CodeRabbit

Release Notes

New Features
- Introduced agent-based MediaWiki query engine with improved content retrieval and result processing.
- Added intelligent routing for MediaWiki/Wikipedia queries to optimize performance.
Chores
- Updated backend dependencies and added LangChain framework integration with OpenTelemetry instrumentation support.

coderabbitai · 2025-10-28T09:33:03Z

Walkthrough

PR updates dependencies including LangChain ecosystem packages and tc-hivemind-backend, converts MediaWiki query engine to agent-based architecture using LangChain tools, updates serialize_references to handle both dict and SubQuestionAnswerPair types, and adds MediaWiki-specific prompt handling for subquery generation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates `requirements.txt` | Updated `tc-hivemind-backend` to 1.4.8; added LangChain ecosystem dependencies (`langchain`, `langchain-community`, `langchain-core`, `langchain-openai`, `langchain-text-splitters`) and OpenTelemetry instrumentation for LangChain; preserved `sentence-transformers>=2.0.0` |
| Reference Serialization `temporal_tasks.py` | Updated `serialize_references()` signature to accept `list[dict] \| list[SubQuestionAnswerPair]`; added runtime branching to handle pre-constructed SubQuestionAnswerPair objects directly without dict-key access |
| MediaWiki Query Engine Refactor `utils/query_engine/media_wiki.py` | Replaced BaseQdrantEngine subclass with agent-based implementation; added helper functions (`_fetch_page_content`, `_create_mediawiki_search_tool`); introduced new `MediaWikiQueryEngine` class with agent initialization, tool configuration, query execution, and reference extraction from agent responses |
| Prompt Generation Logic `utils/query_engine/subquery_gen_prompt.py` | Added special-case handling: MediaWiki/Wikipedia questions bypass modification and generate exactly 1 subquestion instead of multiple |

Sequence Diagram

sequenceDiagram
    participant User
    participant MediaWikiQueryEngine
    participant Agent
    participant MediaWikiSearchTool
    participant MediaWikiAPI
    participant LLM
    
    User->>MediaWikiQueryEngine: query(query_str)
    MediaWikiQueryEngine->>MediaWikiQueryEngine: prepare() if needed
    MediaWikiQueryEngine->>Agent: Initialize with mediawiki_search tool
    
    Agent->>LLM: Process query with system prompt
    LLM->>Agent: Determine need for mediawiki_search
    Agent->>MediaWikiSearchTool: Call mediawiki_search(query)
    
    MediaWikiSearchTool->>MediaWikiAPI: Search & paginate results
    MediaWikiAPI-->>MediaWikiSearchTool: Search results
    
    loop For each page result
        MediaWikiSearchTool->>MediaWikiAPI: _fetch_page_content(title)
        MediaWikiAPI-->>MediaWikiSearchTool: Page content
    end
    
    MediaWikiSearchTool-->>Agent: Page titles → content mapping
    Agent->>LLM: Generate response with References section
    LLM-->>Agent: Response text with citations
    
    Agent-->>MediaWikiQueryEngine: Agent response
    MediaWikiQueryEngine->>MediaWikiQueryEngine: _extract_source_nodes_from_response()
    MediaWikiQueryEngine-->>User: Response with NodeWithScore sources
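
The search-then-fetch loop in the diagram can be sketched without LangChain as two request builders and one response parser. This is a minimal sketch, not the code from this PR: the function names are hypothetical, and it assumes the target wiki exposes the standard MediaWiki Action API with the TextExtracts extension enabled (`prop=extracts`).

```python
from __future__ import annotations

from typing import Any


def build_search_params(query: str, limit: int = 10, offset: int = 0) -> dict[str, Any]:
    """Query-string parameters for a full-text search (list=search)."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "sroffset": offset,  # advance this value to paginate through results
        "format": "json",
    }


def build_extract_params(title: str) -> dict[str, Any]:
    """Query-string parameters for fetching one page's plain-text extract."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }


def parse_extract(payload: dict[str, Any]) -> str:
    """Pull the first non-empty extract out of a prop=extracts response."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        extract = page.get("extract")
        if extract:
            return extract
    return "(No extract available.)"
```

Each parameter dict would be passed to an HTTP client, for example `requests.get(api_url, params=..., timeout=...)`, and the decoded JSON handed to `parse_extract`.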

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

utils/query_engine/media_wiki.py: Substantial refactor replacing inheritance with agent-based architecture; requires verification of LangChain integration, tool configuration, and reference extraction logic
temporal_tasks.py: Type union handling needs validation of conditional branching and dict vs. object attribute access patterns
utils/query_engine/subquery_gen_prompt.py: MediaWiki-specific prompt logic should be verified against expected behavior and edge cases

Possibly related issues

feat: update MediaWiki to be an agent instead of a RAG pipeline! #226: Directly addresses the conversion of MediaWiki query engine to an agent-based implementation as outlined in this PR

Possibly related PRs

feat: Added retrieved nodes evaluation! #178: Updates serialize_references and SubQuestionAnswerPair type handling that align with the serialization changes in this PR
feat: Added references to answers! #103: Modifies reference-handling pipeline and query return types that relate to the MediaWikiQueryEngine response extraction
Update tc-hivemind-backend to 1.2.2 #81: Updates tc-hivemind-backend dependency version similarly addressed in this PR

Poem

🐰 Hops through the agent-based maze,
LangChain tools light the way ablaze,
MediaWiki searches, references grow,
SubQuestions serialized—on we go!
Dependencies bundled, the work is done,
Another hop forward—what fun! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "feat: update mediaWiki to be a search agent!" directly aligns with the primary architectural change in the changeset. The main modification is the complete refactoring of the MediaWikiQueryEngine in utils/query_engine/media_wiki.py from a BaseQdrantEngine subclass to an agent-based approach with improved content fetching and error handling. The title is specific, concise, and clearly communicates this core change without vague terminology. While the PR also includes supporting changes like dependency updates and modifications to serialize_references, the title appropriately focuses on the most significant refactoring effort. |


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

utils/query_engine/media_wiki.py (4)

67-80: Refactor to use else clause and more specific exception handling.

The return statement on line 70 should be in an else block for better readability. Additionally, catching bare Exception on line 78 is too broad and may hide unexpected errors.

Apply this diff:

-        if not extract:
-            return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
-
-        return extract
+        if extract:
+            return extract
+        else:
+            return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
     
     except requests.Timeout:
         logger.warning(f"Timeout error while fetching content for '{title}'")
         return "(Error: Request timeout while fetching page content.)"
     except requests.RequestException as e:
         logger.warning(f"Request error while fetching content for '{title}': {e}")
-        return f"(Error: Request failed - {str(e)})"
-    except Exception as e:
+        return f"(Error: Request failed - {e!s})"
+    except (KeyError, ValueError, AttributeError) as e:
         logger.warning(f"Unexpected error while fetching content for '{title}': {e}")
-        return f"(Error: {str(e)})"
+        return f"(Error: {e!s})"
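
Ruff's TRY300 is about the `else` clause of a `try` statement, which the `if`/`else` in the diff above does not use. A minimal sketch of that pattern follows. It is not the PR's `_fetch_page_content`: `fetch_json` is a hypothetical injected callable standing in for the HTTP call, and the `"extract"` key is an assumed response shape.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

NO_EXTRACT = "(No extract available.)"


def fetch_extract(title: str, fetch_json: Callable[[str], dict]) -> str:
    """Return the extract for `title`, or a parenthesized error message."""
    try:
        payload = fetch_json(title)
        extract = payload["extract"]
    except TimeoutError:
        logger.warning("Timeout while fetching content for %r", title)
        return "(Error: Request timeout while fetching page content.)"
    except (KeyError, ValueError) as e:
        # Narrow handlers: anything else propagates instead of being hidden.
        logger.warning("Malformed response for %r: %s", title, e)
        return f"(Error: {e!s})"
    else:
        # Runs only when the try body raised nothing, so the success
        # path stays outside the region guarded by the handlers.
        return extract or NO_EXTRACT
```

Keeping the success return in `else` means an exception raised while building the return value cannot be mistaken for a fetch failure.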

120-131: Use more specific exception handling.

Similar to the previous function, catching bare Exception on line 129 is too broad. Consider catching specific exceptions that might occur during JSON parsing or response handling.

Apply this diff:

         except requests.RequestException as e:
             logger.warning(f"Request error during search: {e}")
-            return {"error": f"Search request failed - {str(e)}"}
-        except Exception as e:
+            return {"error": f"Search request failed - {e!s}"}
+        except (ValueError, KeyError) as e:
             logger.warning(f"Unexpected error during search: {e}")
-            return {"error": f"Search failed - {str(e)}"}
+            return {"error": f"Search failed - {e!s}"}

226-227: Consider documenting unused parameters with leading underscore.

The parameters enable_answer_skipping and testing are intentionally unused for interface compatibility. Consider prefixing them with underscore (_enable_answer_skipping, _testing) to indicate they're intentionally unused, or add explicit documentation in the docstring.

Apply this diff:

-    def prepare(self, enable_answer_skipping: bool = False, testing: bool = False):
+    def prepare(self, _enable_answer_skipping: bool = False, _testing: bool = False):
         """
         Prepare the query engine by initializing the agent.
         
         This method maintains interface compatibility with the old BaseQdrantEngine.
+        
+        Parameters:
+        -----------
+        _enable_answer_skipping : bool
+            Unused - kept for interface compatibility
+        _testing : bool
+            Unused - kept for interface compatibility
         
         Returns:

344-346: Use robust URL construction.

The URL construction on line 346 assumes api_url ends with /api.php and simply replaces it. This is brittle and may fail for different URL formats.

Apply this diff:

+                from urllib.parse import urljoin, urlparse
+                
                 # Create URL for the page
                 url_route = title.replace(" ", "_")
-                url = f"{self.api_url.replace('/api.php', '')}/{url_route}"
+                # Get base URL by removing path from api_url
+                parsed = urlparse(self.api_url)
+                base_url = f"{parsed.scheme}://{parsed.netloc}"
+                url = urljoin(base_url, f"/{url_route}")
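
The suggestion above can also be read as a small standalone helper. The sketch below is hypothetical (`page_url` is not a function in this PR) and assumes, like the original `replace`-based code, that article pages are served from the site root; a wiki that serves articles under a prefix such as `/wiki/` would need that path supplied separately.

```python
from urllib.parse import quote, urlparse


def page_url(api_url: str, title: str) -> str:
    """Build a page URL from the API endpoint's scheme and host."""
    parsed = urlparse(api_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    # MediaWiki uses underscores for spaces; percent-encode everything else
    # that is not safe in a URL path.
    route = quote(title.replace(" ", "_"), safe="_:()/")
    return f"{base_url}/{route}"
```

For example, `page_url("https://wiki.p2pfoundation.net/api.php", "Peer to Peer")` yields `https://wiki.p2pfoundation.net/Peer_to_Peer`, and the result does not depend on the endpoint path ending in `/api.php`.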

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 736c835 and 20e4e00.

📒 Files selected for processing (4)

requirements.txt (2 hunks)
temporal_tasks.py (1 hunks)
utils/query_engine/media_wiki.py (1 hunks)
utils/query_engine/subquery_gen_prompt.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

utils/query_engine/media_wiki.py (2)

utils/query_engine/base_qdrant_engine.py (1)

prepare (25-36)

temporal_tasks.py (1)

run (81-103)

🪛 Ruff (0.14.1)

utils/query_engine/media_wiki.py

70-70: Consider moving this statement to an else block

(TRY300)

77-77: Use explicit conversion flag

Replace with conversion flag

(RUF010)

78-78: Do not catch blind exception: Exception

(BLE001)

80-80: Use explicit conversion flag

Replace with conversion flag

(RUF010)

128-128: Use explicit conversion flag

Replace with conversion flag

(RUF010)

129-129: Do not catch blind exception: Exception

(BLE001)

131-131: Use explicit conversion flag

Replace with conversion flag

(RUF010)

182-182: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

185-185: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

226-226: Unused method argument: enable_answer_skipping

(ARG002)

226-226: Unused method argument: testing

(ARG002)

305-305: Consider moving this statement to an else block

(TRY300)

307-307: Do not catch blind exception: Exception

(BLE001)

308-308: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

311-311: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: ci / test / Test

🔇 Additional comments (2)

temporal_tasks.py (1)

105-112: LGTM! Good defensive handling of mixed reference types.

The broadened signature and isinstance check properly handle both the legacy dict format and pre-constructed SubQuestionAnswerPair objects. This maintains backward compatibility while supporting the new agent-based workflow.
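
The branching described here can be illustrated with a minimal sketch. It is not the code in `temporal_tasks.py`: `AnswerPair` is a stand-in dataclass for `SubQuestionAnswerPair`, and the dict keys `question` and `answer` are assumptions made only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AnswerPair:
    """Stand-in for SubQuestionAnswerPair (hypothetical fields)."""

    question: str
    answer: str


def normalize_references(references: list[dict] | list[AnswerPair]) -> list[AnswerPair]:
    """Accept legacy dicts or pre-built pair objects and return pair objects."""
    normalized: list[AnswerPair] = []
    for ref in references:
        if isinstance(ref, AnswerPair):
            # Already constructed: use it directly, no dict-key access.
            normalized.append(ref)
        else:
            normalized.append(AnswerPair(question=ref["question"], answer=ref["answer"]))
    return normalized
```

Checking `isinstance` per item, rather than once per list, also tolerates a list that mixes both shapes.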

requirements.txt (1)

8-8: LangChain packages are significantly outdated; verify compatibility before merging.

The versions pinned in requirements.txt (lines 26-31) have no documented security vulnerabilities, but they trail the latest releases by a major version: the LangChain packages are pinned at 0.3.x while the latest stable releases of langchain, langchain-core, langchain-openai, and langchain-text-splitters are 1.0.x. No known 2024 security vulnerabilities affect langchain 0.3.27 or langchain-community 0.3.30; the known 2024 CVEs targeted earlier versions (< 0.2.x). If the pins are intentional for compatibility, confirm that the dependency suite works together as expected, especially given the version gap in the OpenTelemetry instrumentation package (0.31.0 vs. latest 0.47.5).

coderabbitai · 2025-10-28T09:41:56Z

utils/query_engine/media_wiki.py

+    def __init__(
+        self, 
+        community_id: str, 
+        platform_id: str = None,
+        api_url: str = "https://wiki.p2pfoundation.net/api.php",
+        max_pages: int = 10,
+        system_prompt: str = None,
+        llm_model: str = "gpt-4o-mini",
+        temperature: float = 0,
+    ) -> None:


⚠️ Potential issue | 🟡 Minor

Fix implicit Optional type annotations.

Parameters platform_id and system_prompt have default value None but are not explicitly typed as optional.

Apply this diff:

     def __init__(
         self, 
         community_id: str, 
-        platform_id: str = None,
+        platform_id: str | None = None,
         api_url: str = "https://wiki.p2pfoundation.net/api.php",
         max_pages: int = 10,
-        system_prompt: str = None,
+        system_prompt: str | None = None,
         llm_model: str = "gpt-4o-mini",
         temperature: float = 0,
     ) -> None:


🧰 Tools

🪛 Ruff (0.14.1)

182-182: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

185-185: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents

In utils/query_engine/media_wiki.py around lines 179 to 188, the parameters platform_id and system_prompt are given default None but typed as plain str; update their annotations to Optional[str] and add "from typing import Optional" at the top of the file if not already imported so the signatures become platform_id: Optional[str] = None and system_prompt: Optional[str] = None.
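
The diff for this comment spells the annotation `str | None`, while the prompt above asks for `Optional[str]`; both describe the same type. A minimal sketch with hypothetical function names, assuming nothing about the rest of the class:

```python
from __future__ import annotations  # lets `str | None` parse on Python < 3.10

from typing import Optional


def describe_modern(platform_id: str | None = None) -> str:
    """PEP 604 spelling, as used in the suggested diff."""
    return platform_id or "<no platform>"


def describe_legacy(platform_id: Optional[str] = None) -> str:
    """typing.Optional spelling, as used in the agent prompt."""
    return platform_id or "<no platform>"
```

Either spelling satisfies RUF013; what the rule rejects is `platform_id: str = None`, where the annotation leaves out `None`.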

coderabbitai · 2025-10-28T09:41:56Z

utils/query_engine/media_wiki.py

+        except Exception as e:
+            logger.error(f"Error during MediaWiki query: {e}")
+            # Return error response
+            error_response = Response(
+                response=f"Error querying MediaWiki: {str(e)}",


🛠️ Refactor suggestion | 🟠 Major

Use logging.exception for better error diagnostics.

When logging in an exception handler, use logging.exception() instead of logging.error() to automatically include the stack trace.

Apply this diff:

         except Exception as e:
-            logger.error(f"Error during MediaWiki query: {e}")
+            logger.exception(f"Error during MediaWiki query: {e}")
             # Return error response
             error_response = Response(
-                response=f"Error querying MediaWiki: {str(e)}",
+                response=f"Error querying MediaWiki: {e!s}",
                 source_nodes=[]
             )
             return error_response


🧰 Tools

🪛 Ruff (0.14.1)

307-307: Do not catch blind exception: Exception

(BLE001)

308-308: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

311-311: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

In utils/query_engine/media_wiki.py around lines 307 to 311, the exception handler currently calls logger.error which omits the stack trace; replace logger.error(f"Error during MediaWiki query: {e}") with logger.exception("Error during MediaWiki query") so the stack trace is automatically included, and keep constructing and returning the same error_response afterwards (no change to the Response construction).
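
The effect of `logger.exception` can be seen in a minimal sketch. `run_query` and the logger name are hypothetical, and a plain string stands in for the `Response` object used in the PR.

```python
import logging

logger = logging.getLogger("mediawiki_sketch")


def run_query(query_fn, query_str: str) -> str:
    """Call `query_fn`; on failure, log with traceback and return an error string."""
    try:
        return query_fn(query_str)
    except Exception as e:  # broad on purpose: outermost boundary of the query path
        # Logs at ERROR level and attaches the active traceback (exc_info).
        logger.exception("Error during MediaWiki query")
        return f"Error querying MediaWiki: {e!s}"
```

`logger.exception` is only meaningful inside an exception handler, since it records the exception that is currently being handled.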

coderabbitai · 2025-10-28T09:41:56Z

utils/query_engine/subquery_gen_prompt.py

+- If the question is for MediaWiki or Wikipedia, don't change the question, just use the question as exactly as it is.
+- If the question is for MediaWiki or Wikipedia, just return 1 subquestion.


⚠️ Potential issue | 🟠 Major

Clarify the conflict with the "include every tool" requirement.

Line 24 states "just return 1 subquestion" for MediaWiki/Wikipedia queries, but line 22 requires "Include every tool at least once." This creates a logical conflict when MediaWiki is one of multiple tools.

Additionally, these instructions rely on the LLM to detect MediaWiki/Wikipedia questions without explicit criteria. Consider adding clear detection rules or handling this case programmatically before prompt generation.

Consider either:

Removing the "include every tool" constraint when MediaWiki is detected, or

Handling MediaWiki queries in a separate code path before reaching this prompt
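
The second option could look roughly like the sketch below, which handles MediaWiki tools before any prompt is built. Everything in it is hypothetical: the tool name `"MediaWiki"`, the function `split_tools`, and the dict shape of a subquestion are assumptions for illustration, not names from this repository.

```python
from __future__ import annotations

MEDIAWIKI_TOOLS = frozenset({"MediaWiki"})  # hypothetical tool name


def split_tools(question: str, tool_names: list[str]) -> tuple[list[dict], list[str]]:
    """Return (fixed subquestions, tools left for LLM-driven generation)."""
    # MediaWiki tools get exactly one subquestion: the user's question, verbatim.
    fixed = [
        {"tool_name": name, "sub_question": question}
        for name in tool_names
        if name in MEDIAWIKI_TOOLS
    ]
    # Every other tool still goes through the subquery-generation prompt.
    remaining = [name for name in tool_names if name not in MEDIAWIKI_TOOLS]
    return fixed, remaining
```

With this split, the prompt's "include every tool" rule applies only to the remaining tools, so it no longer conflicts with the one-subquestion rule for MediaWiki.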

amindadgar changed the title ~~chore: update mediaWiki to be a search agent!~~ feat: update mediaWiki to be a search agent! Oct 28, 2025

amindadgar linked an issue Oct 28, 2025 that may be closed by this pull request

feat: update MediaWiki to be an agent instead of a RAG pipeline! #226

Closed

coderabbitai bot reviewed Oct 28, 2025


amindadgar merged commit b7f783d into main Oct 28, 2025
3 checks passed

coderabbitai bot mentioned this pull request Oct 29, 2025

feat: making the mediaWiki engine prompts conciser! #228

Merged

amindadgar merged 1 commit into main from feat/226-mediawiki-search-agent
