fix: update URL handling in DualQdrantRetrievalEngine #225
Conversation
- Enhanced the processing of nodes by adding URL generation for MediaWiki titles when the URL is not present in the metadata.
- This change ensures that nodes are properly linked to their corresponding MediaWiki pages, improving the overall functionality of the retrieval engine.
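For instance, under this scheme a node carrying the (hypothetical) title `Peer Production` with no `url` in its metadata would be linked to `https://wiki.p2pfoundation.net/Peer_Production`.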
Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Engine as dual_qdrant<br/>retrieval_engine
    participant LLM
    participant Response as Response<br/>Object

    User->>Engine: _process_basic_query(query)
    Engine->>LLM: Request response
    LLM-->>Engine: LLM response + nodes

    rect rgb(230, 245, 230)
        Note over Engine: NEW: Metadata Enrichment
        Engine->>Engine: For each node without<br/>metadata.url:<br/>Derive URL from title<br/>(replace spaces → _)<br/>Prefix with wiki domain
    end

    Engine->>Response: Construct Response<br/>with enriched nodes
    Response-->>Engine: final_response
    Engine-->>User: Return enriched Response
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
```python
# TODO: cleanup this
# this is for MediaWiki platform which should be handled over the ETL and not here
for node in nodes:
    if node.metadata.get("url") is None:
        url_route = node.metadata.get("title")
        if url_route is not None:
            url_route = url_route.replace(" ", "_")
            node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```
🧩 Analysis chain
Move MediaWiki-specific logic to ETL or use a configuration-based approach.
The TODO comment correctly identifies that platform-specific URL generation should be handled in the ETL layer, not in the retrieval engine. Hardcoding MediaWiki logic here tightly couples this generic engine to a specific platform and violates separation of concerns.
Additionally, this logic introduces several concerns:

- **Metadata mutation risk**: Modifying `node.metadata["url"]` in-place could affect cached or shared node instances elsewhere in the codebase, potentially causing unexpected side effects.
- **Inconsistent behavior**: This URL generation only occurs in `_process_basic_query` but is missing from `_process_summary_query` (lines 313-364), where `source_nodes` (line 360) won't have URLs populated. The fallback to `_process_basic_query` at line 333 would apply the logic, but direct summary query paths won't.
- **Simplistic URL encoding**: Using `replace(" ", "_")` doesn't handle special characters like `#`, `?`, `&`, or international characters that require proper URL encoding for MediaWiki (see the sketch after this list).
Consider one of these approaches:
Option 1 (Recommended): Move URL generation to the ETL pipeline so nodes arrive with populated URLs.
Option 2: If runtime generation is necessary, use a strategy pattern:
```python
from typing import Callable

from llama_index.core.schema import NodeWithScore  # llama-index >= 0.10 import path

# At class level
url_generator: Callable[[NodeWithScore], None] | None = None

# In the method, after line 294
if self.url_generator:
    for node in nodes:
        self.url_generator(node)
```

Then inject platform-specific logic during setup:
```python
from urllib.parse import quote

from llama_index.core.schema import NodeWithScore


def mediawiki_url_generator(node: NodeWithScore) -> None:
    if node.metadata.get("url") is None:
        title = node.metadata.get("title")
        if title:
            url_route = quote(title.replace(" ", "_"), safe="")
            node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```
Run the following script to verify URL population inconsistency between query modes:

🏁 Script executed:

```shell
#!/bin/bash
# Check if _process_summary_query has similar URL generation logic
ast-grep --pattern $'def _process_summary_query($$$) {
  $$$
  metadata["url"]
  $$$
}'
```

Length of output: 0
---
🏁 Script executed:

```shell
#!/bin/bash
# Extract the _process_summary_query method and surrounding context
# to verify URL generation handling and response construction
rg -n -A50 'def _process_summary_query' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 2392
🏁 Script executed:

```shell
#!/bin/bash
# Search for any URL-related code in the summary query method
# to confirm whether URL generation is applied there
rg -n 'metadata.*url|url.*metadata' utils/query_engine/dual_qdrant_retrieval_engine.py -C3
```

Length of output: 565
🏁 Script executed:

```shell
#!/bin/bash
# Check where nodes from _process_basic_query are reused/cached
# to assess metadata mutation risk
rg -n 'node\.metadata\[' utils/query_engine/dual_qdrant_retrieval_engine.py -B2 -A2
```

Length of output: 572
🏁 Script executed:

```shell
#!/bin/bash
# Verify the commented-out code at line 310
# and confirm hardcoded URL patterns
sed -n '295,315p' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 835
Move MediaWiki-specific logic to ETL or configuration, and fix inconsistent URL handling between query modes.
Platform-specific URL generation is hardcoded in a generic retrieval engine, violating separation of concerns. The TODO comment (lines 299-300) correctly identifies this as technical debt that belongs in the ETL layer.
Verification confirms several actionable issues:

- **Inconsistent URL handling**: URLs are only generated in `_process_basic_query` (lines 301-306). In `_process_summary_query`, the main code path (lines 335-360) returns `raw_nodes` as `source_nodes` without URL population. The fallback at line 333 would apply URLs only if no dates are found, making behavior inconsistent.
- **In-place metadata mutation (line 306)**: Modifying `node.metadata["url"]` mutates node objects. If nodes are cached or referenced elsewhere, this could cause unexpected side effects (see the aliasing sketch after this list).
- **Dead code (line 310)**: Remove the commented-out `# return final_response`.
- **Simplistic URL encoding**: Using `replace(" ", "_")` doesn't handle special characters like `#`, `?`, `&`, or non-ASCII characters.
Required fixes:

- Remove the commented-out line 310.
- Apply URL generation consistently to both `_process_basic_query` and `_process_summary_query` paths, or move the logic to ETL/configuration.
- Use proper URL encoding (e.g., `urllib.parse.quote`) instead of simple space replacement.
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around lines 299-306: the
MediaWiki-specific URL generation and the commented-out dead code should be
removed or relocated to ETL/configuration; instead create a small helper that
takes a Node and returns a new metadata dict (do not mutate node.metadata
in-place), use urllib.parse.quote to safely encode the title (handling spaces
and special/non-ASCII chars), remove the commented-out "return final_response"
line, and call that helper consistently from both _process_basic_query and
_process_summary_query (or call it once where nodes are produced) so URLs are
applied deterministically across query modes.
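A minimal sketch of the non-mutating helper that prompt describes; the function name, the configuration constant, and the import path are assumptions, not part of the PR:

```python
from urllib.parse import quote

from llama_index.core.schema import NodeWithScore

WIKI_BASE_URL = "https://wiki.p2pfoundation.net"  # would come from configuration


def with_wiki_url(node: NodeWithScore) -> dict:
    """Return a copy of the node's metadata, deriving a MediaWiki URL when absent.

    The node itself is never mutated, sidestepping the shared-instance risk.
    """
    metadata = dict(node.metadata)  # shallow copy; leaves node.metadata untouched
    if metadata.get("url") is None and metadata.get("title"):
        route = quote(metadata["title"].replace(" ", "_"), safe="")
        metadata["url"] = f"{WIKI_BASE_URL}/{route}"
    return metadata
```

Both `_process_basic_query` and `_process_summary_query` could then call it on the nodes they are about to attach, so URLs appear consistently in either path.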
```python
final_response = Response(response=str(response), source_nodes=nodes)

# return final_response
```
Remove commented-out code.
The commented-out return statement serves no purpose and should be deleted to maintain code cleanliness.
Apply this diff:

```diff
- # return final_response
  return final_response
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
return final_response
```
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around line 310, remove
the leftover commented-out line "# return final_response" as it is dead code;
delete that comment so the file contains no unused commented return statements
and remains clean.
Summary by CodeRabbit
Release Notes