feat: update mediaWiki to be a search agent! #227

Merged
amindadgar merged 1 commit into main from feat/226-mediawiki-search-agent
Oct 28, 2025
Conversation

@amindadgar
Member

@amindadgar amindadgar commented Oct 28, 2025

  • Updated tc-hivemind-backend to version 1.4.8 and added several langchain packages to requirements.txt.
  • Modified the serialize_references method in temporal_tasks.py to accept a broader type for references.
  • Implemented a new MediaWikiQueryEngine in media_wiki.py that utilizes an agent-based approach for querying MediaWiki, including improved error handling and content fetching.
  • Updated the prompt generation logic in subquery_gen_prompt.py to ensure specific handling for MediaWiki queries.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced agent-based MediaWiki query engine with improved content retrieval and result processing.
    • Added intelligent routing for MediaWiki/Wikipedia queries to optimize performance.
  • Chores

    • Updated backend dependencies and added LangChain framework integration with OpenTelemetry instrumentation support.

@amindadgar amindadgar changed the title from "chore: update mediaWiki to be a search agent!" to "feat: update mediaWiki to be a search agent!" on Oct 28, 2025
@amindadgar amindadgar linked an issue Oct 28, 2025 that may be closed by this pull request
@coderabbitai
Contributor

coderabbitai bot commented Oct 28, 2025

Walkthrough

PR updates dependencies including LangChain ecosystem packages and tc-hivemind-backend, converts MediaWiki query engine to agent-based architecture using LangChain tools, updates serialize_references to handle both dict and SubQuestionAnswerPair types, and adds MediaWiki-specific prompt handling for subquery generation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Updates: requirements.txt | Updated tc-hivemind-backend to 1.4.8; added LangChain ecosystem dependencies (langchain, langchain-community, langchain-core, langchain-openai, langchain-text-splitters) and OpenTelemetry instrumentation for LangChain; preserved sentence-transformers>=2.0.0 |
| Reference Serialization: temporal_tasks.py | Updated serialize_references() signature to accept list[dict] \| list[SubQuestionAnswerPair]; added runtime branching to handle pre-constructed SubQuestionAnswerPair objects directly without dict-key access |
| MediaWiki Query Engine Refactor: utils/query_engine/media_wiki.py | Replaced BaseQdrantEngine subclass with agent-based implementation; added helper functions (_fetch_page_content, _create_mediawiki_search_tool); introduced new MediaWikiQueryEngine class with agent initialization, tool configuration, query execution, and reference extraction from agent responses |
| Prompt Generation Logic: utils/query_engine/subquery_gen_prompt.py | Added special-case handling: MediaWiki/Wikipedia questions bypass modification and generate exactly 1 subquestion instead of multiple |

Sequence Diagram

sequenceDiagram
    participant User
    participant MediaWikiQueryEngine
    participant Agent
    participant MediaWikiSearchTool
    participant MediaWikiAPI
    participant LLM
    
    User->>MediaWikiQueryEngine: query(query_str)
    MediaWikiQueryEngine->>MediaWikiQueryEngine: prepare() if needed
    MediaWikiQueryEngine->>Agent: Initialize with mediawiki_search tool
    
    Agent->>LLM: Process query with system prompt
    LLM->>Agent: Determine need for mediawiki_search
    Agent->>MediaWikiSearchTool: Call mediawiki_search(query)
    
    MediaWikiSearchTool->>MediaWikiAPI: Search & paginate results
    MediaWikiAPI-->>MediaWikiSearchTool: Search results
    
    loop For each page result
        MediaWikiSearchTool->>MediaWikiAPI: _fetch_page_content(title)
        MediaWikiAPI-->>MediaWikiSearchTool: Page content
    end
    
    MediaWikiSearchTool-->>Agent: Page titles → content mapping
    Agent->>LLM: Generate response with References section
    LLM-->>Agent: Response text with citations
    
    Agent-->>MediaWikiQueryEngine: Agent response
    MediaWikiQueryEngine->>MediaWikiQueryEngine: _extract_source_nodes_from_response()
    MediaWikiQueryEngine-->>User: Response with NodeWithScore sources
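
The search-then-fetch portion of the diagram can be sketched as plain functions, without LangChain. This is a minimal sketch, not the PR's code: it assumes the wiki exposes the MediaWiki Action API search module (`list=search`) and the TextExtracts `prop=extracts` module, and the names `search_titles`, `fetch_extract`, `search_and_fetch`, and the injected `http_get` callable are hypothetical. It also omits the pagination step shown in the diagram.

```python
from typing import Any, Callable

# Hypothetical HTTP getter: takes (url, params) and returns the decoded JSON body.
# Injecting it keeps the sketch testable without network access.
HttpGet = Callable[[str, dict[str, Any]], dict[str, Any]]


def search_titles(http_get: HttpGet, api_url: str, query: str, max_pages: int) -> list[str]:
    """Return up to max_pages page titles matching the query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": max_pages,
        "format": "json",
    }
    data = http_get(api_url, params)
    hits = data.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits][:max_pages]


def fetch_extract(http_get: HttpGet, api_url: str, title: str) -> str:
    """Return the plain-text extract for one page, or a placeholder string."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    data = http_get(api_url, params)
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        extract = page.get("extract")
        if extract:
            return extract
    return "(No extract available.)"


def search_and_fetch(
    http_get: HttpGet, api_url: str, query: str, max_pages: int = 10
) -> dict[str, str]:
    """Build the 'page title -> content' mapping that the tool hands back to the agent."""
    titles = search_titles(http_get, api_url, query, max_pages)
    return {title: fetch_extract(http_get, api_url, title) for title in titles}
```

In a real deployment `http_get` would wrap an HTTP client call with a timeout; here it is left abstract so the mapping logic can be exercised with a stub.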

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • utils/query_engine/media_wiki.py: Substantial refactor replacing inheritance with agent-based architecture; requires verification of LangChain integration, tool configuration, and reference extraction logic
  • temporal_tasks.py: Type union handling needs validation of conditional branching and dict vs. object attribute access patterns
  • utils/query_engine/subquery_gen_prompt.py: MediaWiki-specific prompt logic should be verified against expected behavior and edge cases

Poem

🐰 Hops through the agent-based maze,
LangChain tools light the way ablaze,
MediaWiki searches, references grow,
SubQuestions serialized—on we go!
Dependencies bundled, the work is done,
Another hop forward—what fun!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "feat: update mediaWiki to be a search agent!" directly aligns with the primary architectural change in the changeset. The main modification is the complete refactoring of the MediaWikiQueryEngine in utils/query_engine/media_wiki.py from a BaseQdrantEngine subclass to an agent-based approach with improved content fetching and error handling. The title is specific, concise, and clearly communicates this core change without vague terminology. While the PR also includes supporting changes like dependency updates and modifications to serialize_references, the title appropriately focuses on the most significant refactoring effort. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
utils/query_engine/media_wiki.py (4)

67-80: Refactor to use else clause and more specific exception handling.

The return statement on line 70 should move into an else block for readability. Additionally, catching Exception on line 78 is too broad and may hide unexpected errors.

Apply this diff:

-        if not extract:
-            return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
-
-        return extract
+        if extract:
+            return extract
+        else:
+            return "(No extract available. The page may be a redirect, disambiguation, or non-extractable.)"
     
     except requests.Timeout:
         logger.warning(f"Timeout error while fetching content for '{title}'")
         return "(Error: Request timeout while fetching page content.)"
     except requests.RequestException as e:
         logger.warning(f"Request error while fetching content for '{title}': {e}")
-        return f"(Error: Request failed - {str(e)})"
-    except Exception as e:
+        return f"(Error: Request failed - {e!s})"
+    except (KeyError, ValueError, AttributeError) as e:
         logger.warning(f"Unexpected error while fetching content for '{title}': {e}")
-        return f"(Error: {str(e)})"
+        return f"(Error: {e!s})"
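
The try/except/else shape suggested above can be sketched on its own. This is a minimal sketch under stated assumptions: `extract_from_body` is a hypothetical helper that parses an already-downloaded response body, and the exception tuple adds `TypeError` and `StopIteration` to the reviewer's list because of how this particular parsing code is written.

```python
import json
import logging

logger = logging.getLogger(__name__)

NO_EXTRACT = (
    "(No extract available. The page may be a redirect, "
    "disambiguation, or non-extractable.)"
)


def extract_from_body(body: str, title: str) -> str:
    """Parse a MediaWiki-style JSON body and return the first page's extract."""
    try:
        payload = json.loads(body)           # JSONDecodeError is a ValueError
        pages = payload["query"]["pages"]    # KeyError / TypeError on odd shapes
        first_page = next(iter(pages.values()))  # StopIteration if no pages
        extract = first_page.get("extract", "")
    except (KeyError, ValueError, AttributeError, TypeError, StopIteration) as e:
        # Only the failures this parsing code can produce are caught;
        # anything else propagates to the caller.
        logger.warning("Could not parse content for %r: %s", title, e)
        return f"(Error: {e!s})"
    else:
        # Runs only when the try block raised nothing.
        return extract or NO_EXTRACT
```

Keeping the success-path return in the else block makes it clear that it is not itself guarded by the except clauses.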

120-131: Use more specific exception handling.

As in the previous function, catching Exception on line 129 is too broad. Consider catching the specific exceptions that can occur during JSON parsing or response handling.

Apply this diff:

         except requests.RequestException as e:
             logger.warning(f"Request error during search: {e}")
-            return {"error": f"Search request failed - {str(e)}"}
-        except Exception as e:
+            return {"error": f"Search request failed - {e!s}"}
+        except (ValueError, KeyError) as e:
             logger.warning(f"Unexpected error during search: {e}")
-            return {"error": f"Search failed - {str(e)}"}
+            return {"error": f"Search failed - {e!s}"}

226-227: Consider documenting unused parameters with leading underscore.

The parameters enable_answer_skipping and testing are intentionally unused for interface compatibility. Consider prefixing them with underscore (_enable_answer_skipping, _testing) to indicate they're intentionally unused, or add explicit documentation in the docstring.

Apply this diff:

-    def prepare(self, enable_answer_skipping: bool = False, testing: bool = False):
+    def prepare(self, _enable_answer_skipping: bool = False, _testing: bool = False):
         """
         Prepare the query engine by initializing the agent.
         
         This method maintains interface compatibility with the old BaseQdrantEngine.
+        
+        Parameters:
+        -----------
+        _enable_answer_skipping : bool
+            Unused - kept for interface compatibility
+        _testing : bool
+            Unused - kept for interface compatibility
         
         Returns:
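
One caveat with the rename is worth illustrating: Python matches keyword arguments by name, so any caller that passes `enable_answer_skipping=...` by keyword would fail after the parameters gain an underscore prefix. The sketch below uses two hypothetical stand-in classes (`OldStyleEngine`, `RenamedEngine`) rather than the PR's engine; using `del` to mark the arguments as intentionally unused is one possible convention, not something the review prescribes.

```python
class OldStyleEngine:
    """Stand-in that keeps the original parameter names."""

    def prepare(self, enable_answer_skipping: bool = False, testing: bool = False):
        # Explicitly discard the arguments to signal they are unused on purpose.
        del enable_answer_skipping, testing
        return self


class RenamedEngine:
    """Stand-in that applies the underscore-prefixed names from the suggestion."""

    def prepare(self, _enable_answer_skipping: bool = False, _testing: bool = False):
        return self


def call_with_keywords(engine) -> bool:
    """Return True if a keyword-style call written for the old interface still works."""
    try:
        engine.prepare(enable_answer_skipping=True, testing=False)
    except TypeError:
        # Raised for an unexpected keyword argument.
        return False
    return True
```

If existing callers pass these arguments positionally or not at all, the rename is harmless; documenting the unused parameters in the docstring avoids the question entirely.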

344-346: Use robust URL construction.

The URL construction on line 346 assumes api_url ends with /api.php and simply replaces it. This is brittle and may fail for different URL formats.

Apply this diff:

+                from urllib.parse import urljoin, urlparse
+                
                 # Create URL for the page
                 url_route = title.replace(" ", "_")
-                url = f"{self.api_url.replace('/api.php', '')}/{url_route}"
+                # Get base URL by removing path from api_url
+                parsed = urlparse(self.api_url)
+                base_url = f"{parsed.scheme}://{parsed.netloc}"
+                url = urljoin(base_url, f"/{url_route}")
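
A standalone version of this idea, using only `urllib.parse`, might look like the following. It is a sketch under assumptions: `page_url` and its `article_path` parameter are hypothetical, the default `"/"` mirrors the diff above (pages served from the site root), and many MediaWiki installations serve articles under a prefix such as `/wiki/` instead.

```python
from urllib.parse import quote, urlparse


def page_url(api_url: str, title: str, article_path: str = "/") -> str:
    """Build a page URL from the API endpoint's scheme and host plus the title.

    article_path is an assumption about where the wiki serves articles.
    """
    parsed = urlparse(api_url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    # MediaWiki uses underscores for spaces; percent-encode everything else unsafe.
    route = quote(title.replace(" ", "_"), safe=":/()")
    return f"{base}{article_path.rstrip('/')}/{route}"


print(page_url("https://wiki.p2pfoundation.net/api.php", "Peer to Peer"))
print(page_url("https://example.org/w/api.php", "Main Page", article_path="/wiki/"))
```

Because only the scheme and host are taken from `api_url`, the result no longer depends on the endpoint path ending in `/api.php`.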
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 736c835 and 20e4e00.

📒 Files selected for processing (4)
  • requirements.txt (2 hunks)
  • temporal_tasks.py (1 hunks)
  • utils/query_engine/media_wiki.py (1 hunks)
  • utils/query_engine/subquery_gen_prompt.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
utils/query_engine/media_wiki.py (2)
utils/query_engine/base_qdrant_engine.py (1)
  • prepare (25-36)
temporal_tasks.py (1)
  • run (81-103)
🪛 Ruff (0.14.1)
utils/query_engine/media_wiki.py

  • 70: Consider moving this statement to an else block (TRY300)
  • 77: Use explicit conversion flag (RUF010)
  • 78: Do not catch blind exception: Exception (BLE001)
  • 80: Use explicit conversion flag (RUF010)
  • 128: Use explicit conversion flag (RUF010)
  • 129: Do not catch blind exception: Exception (BLE001)
  • 131: Use explicit conversion flag (RUF010)
  • 182: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
  • 185: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
  • 226: Unused method argument: enable_answer_skipping (ARG002)
  • 226: Unused method argument: testing (ARG002)
  • 305: Consider moving this statement to an else block (TRY300)
  • 307: Do not catch blind exception: Exception (BLE001)
  • 308: Use logging.exception instead of logging.error (TRY400)
  • 311: Use explicit conversion flag (RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (2)
temporal_tasks.py (1)

105-112: LGTM! Good defensive handling of mixed reference types.

The broadened signature and isinstance check properly handle both the legacy dict format and pre-constructed SubQuestionAnswerPair objects. This maintains backward compatibility while supporting the new agent-based workflow.

requirements.txt (1)

8-8: LangChain packages are a major version behind; verify compatibility before merging.

The versions pinned in requirements.txt (lines 26-31) have no documented security vulnerabilities, but they are not the latest: the LangChain packages are pinned at 0.3.x while the latest stable releases are 1.0.x (langchain, langchain-core, langchain-openai, langchain-text-splitters), a major version gap. No known 2024 security vulnerabilities affect langchain 0.3.27 or langchain-community 0.3.30; the known 2024 CVEs targeted earlier versions (< 0.2.x). If the pins are intentional for compatibility, confirm that the dependency suite works together as expected, especially given the version gap in the OpenTelemetry instrumentation (0.31.0 vs. latest 0.47.5).

Comment on lines +179 to +188
    def __init__(
        self,
        community_id: str,
        platform_id: str = None,
        api_url: str = "https://wiki.p2pfoundation.net/api.php",
        max_pages: int = 10,
        system_prompt: str = None,
        llm_model: str = "gpt-4o-mini",
        temperature: float = 0,
    ) -> None:

⚠️ Potential issue | 🟡 Minor

Fix implicit Optional type annotations.

Parameters platform_id and system_prompt have default value None but are not explicitly typed as optional.

Apply this diff:

     def __init__(
         self, 
         community_id: str, 
-        platform_id: str = None,
+        platform_id: str | None = None,
         api_url: str = "https://wiki.p2pfoundation.net/api.php",
         max_pages: int = 10,
-        system_prompt: str = None,
+        system_prompt: str | None = None,
         llm_model: str = "gpt-4o-mini",
         temperature: float = 0,
     ) -> None:
🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 179 to 188, the parameters
platform_id and system_prompt are given default None but typed as plain str;
make their annotations explicitly optional so the signatures become
platform_id: str | None = None and system_prompt: str | None = None, matching
the diff above. On Python versions older than 3.10, use Optional[str] instead
and add "from typing import Optional" at the top of the file if it is not
already imported.

Comment on lines +307 to +311
        except Exception as e:
            logger.error(f"Error during MediaWiki query: {e}")
            # Return error response
            error_response = Response(
                response=f"Error querying MediaWiki: {str(e)}",

🛠️ Refactor suggestion | 🟠 Major

Use logging.exception for better error diagnostics.

When logging in an exception handler, use logging.exception() instead of logging.error() to automatically include the stack trace.

Apply this diff:

         except Exception as e:
-            logger.error(f"Error during MediaWiki query: {e}")
+            logger.exception(f"Error during MediaWiki query: {e}")
             # Return error response
             error_response = Response(
-                response=f"Error querying MediaWiki: {str(e)}",
+                response=f"Error querying MediaWiki: {e!s}",
                 source_nodes=[]
             )
             return error_response
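
The difference is easy to observe with the standard library alone. In the sketch below, `run_query`, `failing_query`, and `capture_failure_log` are hypothetical names used for the demonstration; the logger is given a temporary in-memory handler so the emitted text can be inspected.

```python
import io
import logging

logger = logging.getLogger("mediawiki_sketch")


def run_query(query_fn, query: str) -> str:
    """Run a query callable and turn any failure into an error string."""
    try:
        return query_fn(query)
    except Exception as e:  # broad on purpose: last-resort boundary for the engine
        # logger.exception logs at ERROR level and appends the active traceback.
        logger.exception("Error during MediaWiki query")
        return f"Error querying MediaWiki: {e!s}"


def failing_query(query: str) -> str:
    """Stand-in for an agent call that blows up."""
    raise ValueError("boom")


def capture_failure_log() -> tuple[str, str]:
    """Return (error response, captured log text) for a failing query."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        result = run_query(failing_query, "anything")
    finally:
        logger.removeHandler(handler)
    return result, stream.getvalue()
```

The captured text contains the message followed by the full traceback, which `logger.error` would not include unless `exc_info=True` were passed.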
🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 307 to 311, the exception
handler currently calls logger.error which omits the stack trace; replace
logger.error(f"Error during MediaWiki query: {e}") with logger.exception("Error
during MediaWiki query") so the stack trace is automatically included, and keep
constructing and returning the same error_response afterwards (no change to the
Response construction).

Comment on lines +23 to +24
- If the question is for MediaWiki or Wikipedia, don't change the question, just use the question as exactly as it is.
- If the question is for MediaWiki or Wikipedia, just return 1 subquestion.

⚠️ Potential issue | 🟠 Major

Clarify the conflict with the "include every tool" requirement.

Line 24 states "just return 1 subquestion" for MediaWiki/Wikipedia queries, but line 22 requires "Include every tool at least once." This creates a logical conflict when MediaWiki is one of multiple tools.

Additionally, these instructions rely on the LLM to detect MediaWiki/Wikipedia questions without explicit criteria. Consider adding clear detection rules or handling this case programmatically before prompt generation.

Consider either:

  1. Removing the "include every tool" constraint when MediaWiki is detected, or
  2. Handling MediaWiki queries in a separate code path before reaching this prompt
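
The second option could be sketched roughly as below. Everything here is hypothetical: the detection rule (matching on the tool name rather than on the question text), the names `is_wiki_tool` and `plan_subquestions`, and the `generate` callable that stands in for the existing LLM-based subquestion generator.

```python
from typing import Callable

# Assumed detection rule: a tool counts as a wiki tool if its name mentions one of these.
WIKI_KEYWORDS = ("mediawiki", "wikipedia")

# (tool_name, sub_question) pairs.
Plan = list[tuple[str, str]]


def is_wiki_tool(tool_name: str) -> bool:
    """Explicit, programmatic check instead of leaving detection to the LLM."""
    name = tool_name.lower()
    return any(keyword in name for keyword in WIKI_KEYWORDS)


def plan_subquestions(
    question: str,
    tool_names: list[str],
    generate: Callable[[str, list[str]], Plan],
) -> Plan:
    """Give each wiki tool exactly one verbatim subquestion; delegate the rest."""
    wiki_tools = [t for t in tool_names if is_wiki_tool(t)]
    other_tools = [t for t in tool_names if not is_wiki_tool(t)]

    # Wiki tools bypass the prompt entirely: the question is passed through unchanged.
    plan: Plan = [(tool, question) for tool in wiki_tools]

    # Only the remaining tools go through the prompt, so "include every tool
    # at least once" no longer conflicts with "just return 1 subquestion".
    if other_tools:
        plan.extend(generate(question, other_tools))
    return plan
```

With this split, the two MediaWiki-specific prompt lines would become unnecessary, since the prompt never sees a wiki tool.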

@amindadgar amindadgar merged commit b7f783d into main Oct 28, 2025
3 checks passed
Development

Successfully merging this pull request may close these issues.

feat: update MediaWiki to be an agent instead of a RAG pipeline!

1 participant