
feat: updated the max returned reference counts! #229

Merged
amindadgar merged 2 commits into main from feat/226-mediawiki-search-agent on Oct 29, 2025

Conversation

@amindadgar (Member) commented Oct 29, 2025

Summary by CodeRabbit

  • Chores

    • Default number of returned references increased from 3 to 5.
  • Refactor

    • MediaWiki search now fetches pages in parallel and raises the default page limit (10 → 30) for broader, faster results.

@coderabbitai bot (Contributor) commented Oct 29, 2025

Walkthrough

Added a global constant MAX_RETURNED_REFERENCES = 5; switched PrepareAnswerSources to use that constant as its default max_references; refactored MediaWiki search to fetch page contents in parallel and increased its max_pages default from 10 to 30.

Changes

Cohort / File(s) | Summary
Global constant definition (utils/globals.py) | Added new public constant MAX_RETURNED_REFERENCES = 5.
PrepareAnswerSources default (utils/query_engine/prepare_answer_sources.py) | Imported MAX_RETURNED_REFERENCES and updated the PrepareAnswerSources.__init__ default max_references to use it; docstring updated from 3 to 5.
MediaWiki parallel fetch & config (utils/query_engine/media_wiki.py) | Replaced sequential page-content fetches with concurrent fetching via ThreadPoolExecutor and as_completed; added logging around fetches and errors; changed the _create_mediawiki_search_tool default max_pages from 10 to 30; preserved per-title error recording.
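
To make the first two rows concrete, here is a minimal sketch of how the new constant and the PrepareAnswerSources default fit together. Only MAX_RETURNED_REFERENCES = 5, the utils/globals.py location, and the max_references parameter name come from this changeset; the import path, constructor signature, and docstring below are illustrative assumptions.

# utils/globals.py (constant added in this PR)
MAX_RETURNED_REFERENCES = 5


# utils/query_engine/prepare_answer_sources.py -- shape only; the real class
# carries more parameters and logic than shown in this sketch.
from utils.globals import MAX_RETURNED_REFERENCES  # assumed import path


class PrepareAnswerSources:
    def __init__(self, max_references: int = MAX_RETURNED_REFERENCES) -> None:
        """Prepare the reference list appended to an answer.

        Parameters
        ----------
        max_references : int
            Maximum number of references to return (default: 5).
        """
        self.max_references = max_references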

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant MediaWikiTool as MediaWikiSearchTool
    participant Executor as ThreadPoolExecutor
    participant Fetch as _fetch_page_content

    Note over Client,MediaWikiTool: New flow: collect titles, fetch concurrently
    Client->>MediaWikiTool: mediawiki_search(query)
    MediaWikiTool->>MediaWikiTool: gather page titles (up to max_pages)
    MediaWikiTool->>Executor: submit fetch tasks for each title
    Executor->>Fetch: run _fetch_page_content(title) [concurrent]
    Fetch-->>Executor: return content or error
    Executor-->>MediaWikiTool: as_completed yields results (title -> content/error)
    MediaWikiTool->>Client: aggregated results (content OR error markers)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Changes span 3 files with both API default centralization and control-flow concurrency changes.
  • Review focus:
    • ThreadPoolExecutor usage and correct mapping of futures to titles.
    • Error handling and logging for concurrent fetches.
    • Ensure MAX_RETURNED_REFERENCES used consistently and docstring matches behavior.


Poem

🐰 Five seeds of truth I hop to store,
I pulled a constant from the floor,
Prepared answers now count to five,
Pages fetched in parallel jive,
Hopping through logs with a joyful roar 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Title Check: ⚠️ Warning
Explanation: The PR title "feat: updated the max returned reference counts!" accurately describes changes to the MAX_RETURNED_REFERENCES constant in utils/globals.py and its usage in utils/query_engine/prepare_answer_sources.py. However, the changeset also includes substantial modifications to utils/query_engine/media_wiki.py that introduce parallel fetching using ThreadPoolExecutor, increase the default max_pages from 10 to 30, and add comprehensive logging around fetch operations. Given that the source branch is named "feat/226-mediawiki-search-agent," the parallel fetching enhancements appear to be the primary feature, making the title incomplete and misleading about the true scope of the PR.
Resolution: Consider revising the title to capture the main changes across all modified files, such as "feat: implement parallel MediaWiki fetching and consolidate reference limits" or similar, to accurately reflect both the performance improvements in media_wiki.py and the reference count configuration changes. This would give reviewers a clearer understanding of the PR's complete scope.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab48311 and cbc944f.

📒 Files selected for processing (1)
  • utils/query_engine/media_wiki.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
utils/query_engine/media_wiki.py

162-162: Do not catch blind exception: Exception

(BLE001)


163-163: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


164-164: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (1)
utils/query_engine/media_wiki.py (1)

6-6: LGTM! Clean import for parallel processing.

The import of ThreadPoolExecutor and as_completed properly supports the parallel fetching implementation introduced later in the file.
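
For reference, the import in question is presumably the standard-library concurrent.futures pair; the exact line is an assumption since the full file is not quoted here.

from concurrent.futures import ThreadPoolExecutor, as_completed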



-def _create_mediawiki_search_tool(api_url: str, max_pages: int = 10):
+def _create_mediawiki_search_tool(api_url: str, max_pages: int = 30):

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the impact of the 3x increase in default max_pages.

The increase from 10 to 30 pages aligns with the PR objectives and is reasonable given the new parallel fetching. However, ensure this doesn't hit MediaWiki API rate limits or cause performance issues in production.


🏁 Script executed:

#!/bin/bash
# Description: Check if there are any rate limiting configurations or concerns in the codebase

# Search for rate limit related code or configuration
rg -n -i -C3 'rate.?limit|throttle|api.?limit' --type=py

# Check for any MediaWiki API configuration or documentation
fd -e md -e txt -e yaml -e yml | xargs rg -l -i 'mediawiki|api.?url'

Length of output: 512


🏁 Script executed:

#!/bin/bash
# Description: Examine MediaWiki implementation and understand max_pages usage

# Find and display the media_wiki.py file content, focusing on how max_pages is used
fd -n 'media_wiki\.py' --type f -x cat {}

# Search for any MediaWiki API documentation or configuration files
fd -e md -e rst -e txt | xargs grep -l -i 'mediawiki.*api.*limit\|api.*rate\|max.*page' 2>/dev/null | head -20

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Description: Examine MediaWiki implementation and understand max_pages usage

# Find and display the media_wiki.py file
fd 'media_wiki\.py' --type f -exec cat {} \;

# Also check how max_pages is referenced in the codebase
rg -n 'max_pages' --type py -B2 -A5

Length of output: 3632


🏁 Script executed:

#!/bin/bash
# Description: Examine parallel fetching implementation and check for API concerns

# Look for parallel fetching implementation
rg -n 'parallel|asyncio|concurrent|ThreadPool|ProcessPool' utils/query_engine/media_wiki.py -A3 -B1

# Check the actual fetching logic to see how many concurrent requests are made
rg -n 'fetch_content|gather|map_async' utils/query_engine/media_wiki.py -A5 -B1

# Check if MediaWikiQueryEngine class max_pages default should also be updated
rg -n 'class MediaWikiQueryEngine' utils/query_engine/media_wiki.py -A30 | head -50

Length of output: 2030


Update MediaWikiQueryEngine class to match function default for consistency.

The increase to max_pages=30 in the function (line 84) is safe because parallel fetches are capped at 10 concurrent workers (line 148). However, the MediaWikiQueryEngine class (line 195) still defaults to max_pages=10, creating an inconsistency. Update the class default to 30 to match the function and ensure consistent behavior across the codebase.

🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 84 and 195, the top-level
function _create_mediawiki_search_tool sets max_pages=30 while the
MediaWikiQueryEngine class still defaults max_pages to 10, causing inconsistent
behavior; update the MediaWikiQueryEngine class default for max_pages to 30 to
match the function, and ensure any constructor/type hints/docstrings reflect the
new default so both code paths use the same max_pages value.
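
A minimal sketch of the suggested alignment is shown below; _create_mediawiki_search_tool, MediaWikiQueryEngine, and max_pages come from the PR, while the DEFAULT_MAX_PAGES name and the constructor shape are illustrative assumptions.

# utils/query_engine/media_wiki.py -- sketch only; the real function and class
# carry more parameters and logic than shown here.

DEFAULT_MAX_PAGES = 30  # hypothetical shared constant to keep both defaults in sync


def _create_mediawiki_search_tool(api_url: str, max_pages: int = DEFAULT_MAX_PAGES):
    ...


class MediaWikiQueryEngine:
    def __init__(self, api_url: str, max_pages: int = DEFAULT_MAX_PAGES) -> None:
        # Sharing one default avoids the 10-vs-30 drift flagged in the review.
        self.api_url = api_url
        self.max_pages = max_pages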

Comment on lines +137 to 165
         # 2. Fetch content for each page in parallel
         pages_content = {}
-        for result in results[:max_pages]:
-            title = result.get("title")
-            if not title:
-                continue
+        titles_to_fetch = [
+            result.get("title")
+            for result in results[:max_pages]
+            if result.get("title")
+        ]
+
+        logger.info(f"Fetching content for {len(titles_to_fetch)} pages in parallel...")
+
+        # Use ThreadPoolExecutor for parallel fetching
+        with ThreadPoolExecutor(max_workers=min(10, len(titles_to_fetch))) as executor:
+            # Submit all fetch tasks
+            future_to_title = {
+                executor.submit(_fetch_page_content, api_url, title): title
+                for title in titles_to_fetch
+            }
+
-            logger.info(f"Fetching content for: {title}")
-            content = _fetch_page_content(api_url, title)
-            pages_content[title] = content
+            # Collect results as they complete
+            for future in as_completed(future_to_title):
+                title = future_to_title[future]
+                try:
+                    content = future.result()
+                    pages_content[title] = content
+                    logger.info(f"Successfully fetched content for: {title}")
+                except Exception as e:
+                    logger.error(f"Error fetching content for '{title}': {e}")
+                    pages_content[title] = f"(Error: Failed to fetch content - {str(e)})"


🛠️ Refactor suggestion | 🟠 Major

Use logging.exception for better error diagnostics.

The parallel fetching implementation is solid and follows best practices with proper resource management and error handling. However, line 163 should use logging.exception instead of logging.error to automatically include the stack trace, which aids debugging.

Apply this diff:

                 try:
                     content = future.result()
                     pages_content[title] = content
                     logger.info(f"Successfully fetched content for: {title}")
                 except Exception as e:
-                    logger.error(f"Error fetching content for '{title}': {e}")
+                    logger.exception(f"Error fetching content for '{title}'")
                     pages_content[title] = f"(Error: Failed to fetch content - {str(e)})"

Optional: Consider extracting the magic number for max_workers.

The hardcoded max_workers=10 on line 148 could be extracted as a module-level or class-level constant for better maintainability.

For example, at the module level:

MAX_PARALLEL_FETCH_WORKERS = 10

Then use it as:

with ThreadPoolExecutor(max_workers=min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch))) as executor:

Based on static analysis hints.

🧰 Tools
🪛 Ruff (0.14.2)

162-162: Do not catch blind exception: Exception

(BLE001)


163-163: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


164-164: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents
In utils/query_engine/media_wiki.py around lines 137 to 165, change the error
logging inside the parallel fetch loop to use logger.exception(...) instead of
logger.error(...) so the stack trace is captured automatically when an exception
occurs; also consider extracting the hardcoded max_workers=10 into a
module-level constant (e.g., MAX_PARALLEL_FETCH_WORKERS) and use
min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch)) when creating the
ThreadPoolExecutor for better maintainability.
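
For reference, a self-contained sketch of the fetch loop with both suggestions applied (logging.exception plus a module-level worker cap). The fetch function is passed in as a parameter only to keep the sketch runnable; in the PR it would be _fetch_page_content, and the helper name fetch_pages_in_parallel is hypothetical.

from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

logger = logging.getLogger(__name__)

MAX_PARALLEL_FETCH_WORKERS = 10  # suggested module-level constant for the worker cap


def fetch_pages_in_parallel(
    api_url: str,
    titles_to_fetch: list[str],
    fetch_fn: Callable[[str, str], str],
) -> dict[str, str]:
    """Fetch page contents concurrently, isolating per-title failures."""
    pages_content: dict[str, str] = {}
    if not titles_to_fetch:
        return pages_content
    with ThreadPoolExecutor(
        max_workers=min(MAX_PARALLEL_FETCH_WORKERS, len(titles_to_fetch))
    ) as executor:
        # Map each submitted future back to its page title
        future_to_title = {
            executor.submit(fetch_fn, api_url, title): title
            for title in titles_to_fetch
        }
        for future in as_completed(future_to_title):
            title = future_to_title[future]
            try:
                pages_content[title] = future.result()
            except Exception as exc:  # noqa: BLE001 - keep one bad page from failing the batch
                # logging.exception records the traceback automatically (addresses TRY400)
                logger.exception("Error fetching content for '%s'", title)
                pages_content[title] = f"(Error: Failed to fetch content - {exc})"
    return pages_content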

@amindadgar amindadgar merged commit fb22815 into main Oct 29, 2025
3 checks passed