fix: update URL handling in DualQdrantRetrievalEngine #225
Conversation
- Enhanced the processing of nodes by adding URL generation for MediaWiki titles when the URL is not present in the metadata.
- This change ensures that nodes are properly linked to their corresponding MediaWiki pages, improving the overall functionality of the retrieval engine.
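For instance, under this scheme a node carrying the (hypothetical) title `Peer Production` with no `url` in its metadata would be linked to `https://wiki.p2pfoundation.net/Peer_Production`.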
Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Engine as dual_qdrant<br/>retrieval_engine
    participant LLM
    participant Response as Response<br/>Object

    User->>Engine: _process_basic_query(query)
    Engine->>LLM: Request response
    LLM-->>Engine: LLM response + nodes

    rect rgb(230, 245, 230)
        Note over Engine: NEW: Metadata Enrichment
        Engine->>Engine: For each node without<br/>metadata.url:<br/>Derive URL from title<br/>(replace spaces → _)<br/>Prefix with wiki domain
    end

    Engine->>Response: Construct Response<br/>with enriched nodes
    Response-->>Engine: final_response
    Engine-->>User: Return enriched Response
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
```python
# TODO: cleanup this
# this is for MediaWiki platform which should be handled over the ETL and not here
for node in nodes:
    if node.metadata.get("url") is None:
        url_route = node.metadata.get("title")
        if url_route is not None:
            url_route = url_route.replace(" ", "_")
            node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```
🧩 Analysis chain
Move MediaWiki-specific logic to ETL or use a configuration-based approach.
The TODO comment correctly identifies that platform-specific URL generation should be handled in the ETL layer, not in the retrieval engine. Hardcoding MediaWiki logic here tightly couples this generic engine to a specific platform and violates separation of concerns.
Additionally, this logic introduces several concerns:

- **Metadata mutation risk**: Modifying `node.metadata["url"]` in-place could affect cached or shared node instances elsewhere in the codebase, potentially causing unexpected side effects.
- **Inconsistent behavior**: This URL generation only occurs in `_process_basic_query` but is missing from `_process_summary_query` (lines 313-364), where `source_nodes` (line 360) won't have URLs populated. The fallback to `_process_basic_query` at line 333 would apply the logic, but direct summary query paths won't.
- **Simplistic URL encoding**: Using `replace(" ", "_")` doesn't handle special characters like `#`, `?`, `&`, or international characters that require proper URL encoding for MediaWiki (see the sketch after this list).
Consider one of these approaches:
Option 1 (Recommended): Move URL generation to the ETL pipeline so nodes arrive with populated URLs.
Option 2: If runtime generation is necessary, use a strategy pattern:
```python
from typing import Callable

from llama_index.core.schema import NodeWithScore  # llama-index >= 0.10 import path

# At class level
url_generator: Callable[[NodeWithScore], None] | None = None

# In the method, after line 294
if self.url_generator:
    for node in nodes:
        self.url_generator(node)
```

Then inject platform-specific logic during setup:
```python
from urllib.parse import quote

from llama_index.core.schema import NodeWithScore


def mediawiki_url_generator(node: NodeWithScore) -> None:
    if node.metadata.get("url") is None:
        title = node.metadata.get("title")
        if title:
            url_route = quote(title.replace(" ", "_"), safe="")
            node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```
Run the following script to verify URL population inconsistency between query modes:

🏁 Script executed:

```shell
#!/bin/bash
# Check if _process_summary_query has similar URL generation logic
ast-grep --pattern $'def _process_summary_query($$$) {
  $$$
  metadata["url"]
  $$$
}'
```

Length of output: 0
---
🏁 Script executed:

```shell
#!/bin/bash
# Extract the _process_summary_query method and surrounding context
# to verify URL generation handling and response construction
rg -n -A50 'def _process_summary_query' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 2392
🏁 Script executed:

```shell
#!/bin/bash
# Search for any URL-related code in the summary query method
# to confirm whether URL generation is applied there
rg -n 'metadata.*url|url.*metadata' utils/query_engine/dual_qdrant_retrieval_engine.py -C3
```

Length of output: 565
🏁 Script executed:

```shell
#!/bin/bash
# Check where nodes from _process_basic_query are reused/cached
# to assess metadata mutation risk
rg -n 'node\.metadata\[' utils/query_engine/dual_qdrant_retrieval_engine.py -B2 -A2
```

Length of output: 572
🏁 Script executed:

```shell
#!/bin/bash
# Verify the commented-out code at line 310
# and confirm hardcoded URL patterns
sed -n '295,315p' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 835
Move MediaWiki-specific logic to ETL or configuration, and fix inconsistent URL handling between query modes.
Platform-specific URL generation is hardcoded in a generic retrieval engine, violating separation of concerns. The TODO comment (lines 299-300) correctly identifies this as technical debt that belongs in the ETL layer.
Verification confirms several actionable issues:

- **Inconsistent URL handling**: URLs are only generated in `_process_basic_query` (lines 301-306). In `_process_summary_query`, the main code path (lines 335-360) returns `raw_nodes` as `source_nodes` without URL population. The fallback at line 333 would apply URLs only if no dates are found, making behavior inconsistent.
- **In-place metadata mutation (line 306)**: Modifying `node.metadata["url"]` mutates node objects. If nodes are cached or referenced elsewhere, this could cause unexpected side effects (see the aliasing sketch after this list).
- **Dead code (line 310)**: Remove the commented-out `# return final_response`.
- **Simplistic URL encoding**: Using `replace(" ", "_")` doesn't handle special characters like `#`, `?`, `&`, or non-ASCII characters.
Required fixes:

- Remove the commented-out line 310.
- Apply URL generation consistently to both `_process_basic_query` and `_process_summary_query` paths, or move the logic to ETL/configuration.
- Use proper URL encoding (e.g., `urllib.parse.quote`) instead of simple space replacement.
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around lines 299-306: the
MediaWiki-specific URL generation and the commented-out dead code should be
removed or relocated to ETL/configuration; instead create a small helper that
takes a Node and returns a new metadata dict (do not mutate node.metadata
in-place), use urllib.parse.quote to safely encode the title (handling spaces
and special/non-ASCII chars), remove the commented-out "return final_response"
line, and call that helper consistently from both _process_basic_query and
_process_summary_query (or call it once where nodes are produced) so URLs are
applied deterministically across query modes.
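A minimal sketch of the non-mutating helper that prompt describes; the function name, the configuration constant, and the import path are assumptions, not part of the PR:

```python
from urllib.parse import quote

from llama_index.core.schema import NodeWithScore

WIKI_BASE_URL = "https://wiki.p2pfoundation.net"  # would come from configuration


def with_wiki_url(node: NodeWithScore) -> dict:
    """Return a copy of the node's metadata, deriving a MediaWiki URL when absent.

    The node itself is never mutated, sidestepping the shared-instance risk.
    """
    metadata = dict(node.metadata)  # shallow copy; leaves node.metadata untouched
    if metadata.get("url") is None and metadata.get("title"):
        route = quote(metadata["title"].replace(" ", "_"), safe="")
        metadata["url"] = f"{WIKI_BASE_URL}/{route}"
    return metadata
```

Both `_process_basic_query` and `_process_summary_query` could then call it on the nodes they are about to attach, so URLs appear consistently in either path.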
```python
final_response = Response(response=str(response), source_nodes=nodes)

# return final_response
```
Remove commented-out code.
The commented-out return statement serves no purpose and should be deleted to maintain code cleanliness.
Apply this diff:

```diff
- # return final_response
  return final_response
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
return final_response
```
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around line 310, remove
the leftover commented-out line "# return final_response" as it is dead code;
delete that comment so the file contains no unused commented return statements
and remains clean.
Summary by CodeRabbit
Release Notes