
feat: updated the max returned reference counts! #229

Merged
amindadgar merged 2 commits into main from feat/226-mediawiki-search-agent on Oct 29, 2025

Conversation

@amindadgar (Member) commented Oct 29, 2025

Summary by CodeRabbit

  • Chores

    • Default number of returned references increased from 3 to 5.
  • Refactor

    • MediaWiki search now fetches pages in parallel and raises the default page limit (10 → 30) for broader, faster results.

@coderabbitai bot (Contributor) commented Oct 29, 2025

Walkthrough

Added a global constant MAX_RETURNED_REFERENCES = 5; switched PrepareAnswerSources to use that constant as its default max_references; refactored MediaWiki search to fetch page contents in parallel and increased its max_pages default from 10 to 30.

Changes

Cohort / File(s) | Summary
Global constant definition (utils/globals.py) | Added new public constant MAX_RETURNED_REFERENCES = 5.
PrepareAnswerSources default (utils/query_engine/prepare_answer_sources.py) | Imported MAX_RETURNED_REFERENCES and updated the PrepareAnswerSources.__init__ default max_references to use it; docstring updated from 3 to 5.
MediaWiki parallel fetch & config (utils/query_engine/media_wiki.py) | Replaced sequential page-content fetches with concurrent fetching via ThreadPoolExecutor and as_completed; added logging around fetches and errors; changed the _create_mediawiki_search_tool default max_pages from 10 to 30; preserved per-title error recording.
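
To make the first two rows concrete, here is a minimal sketch of how the new constant and the PrepareAnswerSources default fit together. Only MAX_RETURNED_REFERENCES = 5, the utils/globals.py location, and the max_references parameter name come from this changeset; the import path, constructor signature, and docstring below are illustrative assumptions.

# utils/globals.py (constant added in this PR)
MAX_RETURNED_REFERENCES = 5


# utils/query_engine/prepare_answer_sources.py -- shape only; the real class
# carries more parameters and logic than shown in this sketch.
from utils.globals import MAX_RETURNED_REFERENCES  # assumed import path


class PrepareAnswerSources:
    def __init__(self, max_references: int = MAX_RETURNED_REFERENCES) -> None:
        """Prepare the reference list appended to an answer.

        Parameters
        ----------
        max_references : int
            Maximum number of references to return (default: 5).
        """
        self.max_references = max_references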

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant MediaWikiTool as MediaWikiSearchTool
    participant Executor as ThreadPoolExecutor
    participant Fetch as _fetch_page_content

    Note over Client,MediaWikiTool: New flow: collect titles, fetch concurrently
    Client->>MediaWikiTool: mediawiki_search(query)
    MediaWikiTool->>MediaWikiTool: gather page titles (up to max_pages)
    MediaWikiTool->>Executor: submit fetch tasks for each title
    Executor->>Fetch: run _fetch_page_content(title) [concurrent]
    Fetch-->>Executor: return content or error
    Executor-->>MediaWikiTool: as_completed yields results (title -> content/error)
    MediaWikiTool->>Client: aggregated results (content OR error markers)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Changes span 3 files with both API default centralization and control-flow concurrency changes.
  • Review focus:
    • ThreadPoolExecutor usage and correct mapping of futures to titles.
    • Error handling and logging for concurrent fetches.
    • Ensure MAX_RETURNED_REFERENCES used consistently and docstring matches behavior.


Poem

🐰 Five seeds of truth I hop to store,
I pulled a constant from the floor,
Prepared answers now count to five,
Pages fetched in parallel jive,
Hopping through logs with a joyful roar 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Title Check: ⚠️ Warning
Explanation: The PR title "feat: updated the max returned reference counts!" accurately describes changes to the MAX_RETURNED_REFERENCES constant in utils/globals.py and its usage in utils/query_engine/prepare_answer_sources.py. However, the changeset also includes substantial modifications to utils/query_engine/media_wiki.py that introduce parallel fetching using ThreadPoolExecutor, increase the default max_pages from 10 to 30, and add comprehensive logging around fetch operations. Given that the source branch is named "feat/226-mediawiki-search-agent," the parallel fetching enhancements appear to be the primary feature, making the title incomplete and misleading about the true scope of the PR.
Resolution: Consider revising the title to capture the main changes across all modified files, such as "feat: implement parallel MediaWiki fetching and consolidate reference limits" or similar, to accurately reflect both the performance improvements in media_wiki.py and the reference count configuration changes. This would give reviewers a clearer understanding of the PR's complete scope.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab48311 and cbc944f.

📒 Files selected for processing (1)
  • utils/query_engine/media_wiki.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
utils/query_engine/media_wiki.py

162-162: Do not catch blind exception: Exception

(BLE001)


163-163: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


164-164: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (1)
utils/query_engine/media_wiki.py (1)

6-6: LGTM! Clean import for parallel processing.

The import of ThreadPoolExecutor and as_completed properly supports the parallel fetching implementation introduced later in the file.
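
For reference, the import in question is presumably the standard-library concurrent.futures pair; the exact line is an assumption since the full file is not quoted here.

from concurrent.futures import ThreadPoolExecutor, as_completed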



-def _create_mediawiki_search_tool(api_url: str, max_pages: int = 10):
+def _create_mediawiki_search_tool(api_url: str, max_pages: int = 30):

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the impact of the 3x increase in default max_pages.

The increase from 10 to 30 pages aligns with the PR objectives and is reasonable given the new parallel fetching. However, ensure this doesn't hit MediaWiki API rate limits or cause performance issues in production.


🏁 Script executed:

#!/bin/bash
# Description: Check if there are any rate limiting configurations or concerns in the codebase

# Search for rate limit related code or configuration
rg -n -i -C3 'rate.?limit|throttle|api.?limit' --type=py

# Check for any MediaWiki API configuration or documentation
fd -e md -e txt -e yaml -e yml | xargs rg -l -i 'mediawiki|api.?url'

Length of output: 512


🏁 Script executed:

#!/bin/bash
# Description: Examine MediaWiki implementation and understand max_pages usage

# Find and display the media_wiki.py file content, focusing on how max_pages is used
fd -n 'media_wiki\.py' --type f -x cat {}

# Search for any MediaWiki API documentation or configuration files
fd -e md -e rst -e txt | xargs grep -l -i 'mediawiki.*api.*limit\|api.*rate\|max.*page' 2>/dev/null | head -20

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Description: Examine MediaWiki implementation and understand max_pages usage

# Find and display the media_wiki.py file
fd 'media_wiki\.py' --type f -exec cat {} \;

# Also check how max_pages is referenced in the codebase
rg -n 'max_pages' --type py -B2 -A5

Length of output: 3632


🏁 Script executed:

#!/bin/bash
# Description: Examine parallel fetching implementation and check for API concerns

# Look for parallel fetching implementation
rg -n 'parallel|asyncio|concurrent|ThreadPool|ProcessPool' utils/query_engine/media_wiki.py -A3 -B1

# Check the actual fetching logic to see how many concurrent requests are made
rg -n 'fetch_content|gather|map_async' utils/query_engine/media_wiki.py -A5 -B1

# Check if MediaWikiQueryEngine class max_pages default should also be updated
rg -n 'class MediaWikiQueryEngine' utils/query_engine/media_wiki.py -A30 | head -50

Length of output: 2030


Update MediaWikiQueryEngine class to match function default for consistency.

The increase to max_pages=30 in the function (line 84) is safe because parallel fetches are capped at 10 concurrent workers (line 148). However, the MediaWikiQueryEngine class (line 195) still defaults to max_pages=10, creating an inconsistency. Update the class default to 30 to match the function and ensure consistent behavior across the codebase.

🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 84 and 195, the top-level
function _create_mediawiki_search_tool sets max_pages=30 while the
MediaWikiQueryEngine class still defaults max_pages to 10, causing inconsistent
behavior; update the MediaWikiQueryEngine class default for max_pages to 30 to
match the function, and ensure any constructor/type hints/docstrings reflect the
new default so both code paths use the same max_pages value.
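
A minimal sketch of the suggested alignment is shown below; _create_mediawiki_search_tool, MediaWikiQueryEngine, and max_pages come from the PR, while the DEFAULT_MAX_PAGES name and the constructor shape are illustrative assumptions.

# utils/query_engine/media_wiki.py -- sketch only; the real function and class
# carry more parameters and logic than shown here.

DEFAULT_MAX_PAGES = 30  # hypothetical shared constant to keep both defaults in sync


def _create_mediawiki_search_tool(api_url: str, max_pages: int = DEFAULT_MAX_PAGES):
    ...


class MediaWikiQueryEngine:
    def __init__(self, api_url: str, max_pages: int = DEFAULT_MAX_PAGES) -> None:
        # Sharing one default avoids the 10-vs-30 drift flagged in the review.
        self.api_url = api_url
        self.max_pages = max_pages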

Comment on lines +137 to 165
         # 2. Fetch content for each page in parallel
         pages_content = {}
-        for result in results[:max_pages]:
-            title = result.get("title")
-            if not title:
-                continue
+        titles_to_fetch = [
+            result.get("title")
+            for result in results[:max_pages]
+            if result.get("title")
+        ]
+
+        logger.info(f"Fetching content for {len(titles_to_fetch)} pages in parallel...")
+
+        # Use ThreadPoolExecutor for parallel fetching
+        with ThreadPoolExecutor(max_workers=min(10, len(titles_to_fetch))) as executor:
+            # Submit all fetch tasks
+            future_to_title = {
+                executor.submit(_fetch_page_content, api_url, title): title
+                for title in titles_to_fetch
+            }
+
-            logger.info(f"Fetching content for: {title}")
-            content = _fetch_page_content(api_url, title)
-            pages_content[title] = content
+            # Collect results as they complete
+            for future in as_completed(future_to_title):
+                title = future_to_title[future]
+                try:
+                    content = future.result()
+                    pages_content[title] = content
+                    logger.info(f"Successfully fetched content for: {title}")
+                except Exception as e:
+                    logger.error(f"Error fetching content for '{title}': {e}")
+                    pages_content[title] = f"(Error: Failed to fetch content - {str(e)})"


🛠️ Refactor suggestion | 🟠 Major

Use logging.exception for better error diagnostics.

The parallel fetching implementation is solid and follows best practices with proper resource management and error handling. However, line 163 should use logging.exception instead of logging.error to automatically include the stack trace, which aids debugging.

Apply this diff:

                 try:
                     content = future.result()
                     pages_content[title] = content
                     logger.info(f"Successfully fetched content for: {title}")
                 except Exception as e:
-                    logger.error(f"Error fetching content for '{title}': {e}")
+                    logger.exception(f"Error fetching content for '{title}'")
                     pages_content[title] = f"(Error: Failed to fetch content - {str(e)})"

Optional: Consider extracting the magic number for max_workers.

The hardcoded max_workers=10 on line 148 could be extracted as a module-level or class-level constant for better maintainability.

For example, at the module level:

MAX_PARALLEL_FETCH_WORKERS = 10

Then use it as:

with ThreadPoolExecutor(max_workers=min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch))) as executor:

Based on static analysis hints.

🧰 Tools
🪛 Ruff (0.14.2)

162-162: Do not catch blind exception: Exception

(BLE001)


163-163: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


164-164: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 137 to 165, change the error
logging inside the parallel fetch loop to use logger.exception(...) instead of
logger.error(...) so the stack trace is captured automatically when an exception
occurs; also consider extracting the hardcoded max_workers=10 into a
module-level constant (e.g., MAX_PARALLEL_FETCH_WORKERS) and use
min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch)) when creating the
ThreadPoolExecutor for better maintainability.
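
For reference, a self-contained sketch of the fetch loop with both suggestions applied (logging.exception plus a module-level worker cap). The fetch function is passed in as a parameter only to keep the sketch runnable; in the PR it would be _fetch_page_content, and the helper name fetch_pages_in_parallel is hypothetical.

from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

logger = logging.getLogger(__name__)

MAX_PARALLEL_FETCH_WORKERS = 10  # suggested module-level constant for the worker cap


def fetch_pages_in_parallel(
    api_url: str,
    titles_to_fetch: list[str],
    fetch_fn: Callable[[str, str], str],
) -> dict[str, str]:
    """Fetch page contents concurrently, isolating per-title failures."""
    pages_content: dict[str, str] = {}
    if not titles_to_fetch:
        return pages_content
    with ThreadPoolExecutor(
        max_workers=min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch))
    ) as executor:
        # Map each submitted future back to its page title
        future_to_title = {
            executor.submit(fetch_fn, api_url, title): title
            for title in titles_to_fetch
        }
        for future in as_completed(future_to_title):
            title = future_to_title[future]
            try:
                pages_content[title] = future.result()
            except Exception as exc:  # noqa: BLE001 - keep one bad page from failing the batch
                # logging.exception records the traceback automatically (addresses TRY400)
                logger.exception("Error fetching content for '%s'", title)
                pages_content[title] = f"(Error: Failed to fetch content - {exc})"
    return pages_content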

@amindadgar amindadgar merged commit fb22815 into main Oct 29, 2025
3 checks passed