
fix: update URL handling in DualQdrantRetrievalEngine! #225

Merged
amindadgar merged 1 commit into main from feat/update-mediaWiki-url-reference-handling on Oct 26, 2025

Conversation

amindadgar (Member) commented on Oct 26, 2025

  • Enhanced the processing of nodes by adding URL generation for MediaWiki titles when the URL is not present in the metadata.
  • This change ensures that nodes are properly linked to their corresponding MediaWiki pages, improving the overall functionality of the retrieval engine.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Search results for wiki-related content now include properly populated URLs, providing direct links to referenced pages.

coderabbitai bot (Contributor) commented on Oct 26, 2025

Walkthrough

The _process_basic_query method now enriches source node metadata by populating missing URL fields. For nodes lacking a metadata.url entry, the code derives a URL by transforming the node title (replacing spaces with underscores) and prefixing it with https://wiki.p2pfoundation.net/ before returning the Response object.
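
For illustration, the derivation amounts to the following minimal sketch (the title value is hypothetical; the actual code operates on node metadata, as shown in the review below):

```python
title = "Peer Production"  # hypothetical MediaWiki page title
url = "https://wiki.p2pfoundation.net/" + title.replace(" ", "_")
# -> "https://wiki.p2pfoundation.net/Peer_Production"
```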

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Metadata enrichment for retrieval nodes (`utils/query_engine/dual_qdrant_retrieval_engine.py`) | Added URL population logic in `_process_basic_query` to assign MediaWiki-based URLs to source nodes missing `metadata.url`; introduced a `final_response` variable to hold the constructed `Response` before return. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Engine as dual_qdrant<br/>retrieval_engine
    participant LLM
    participant Response as Response<br/>Object

    User->>Engine: _process_basic_query(query)
    Engine->>LLM: Request response
    LLM-->>Engine: LLM response + nodes

    rect rgb(230, 245, 230)
    Note over Engine: NEW: Metadata Enrichment
    Engine->>Engine: For each node without<br/>metadata.url:<br/>Derive URL from title<br/>(replace spaces → _)<br/>Prefix with wiki domain
    end

    Engine->>Response: Construct Response<br/>with enriched nodes
    Response-->>Engine: final_response
    Engine-->>User: Return enriched Response
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single file modification with localized logic change
  • URL derivation pattern requires verification (space-to-underscore conversion, domain prefix accuracy)
  • Ensure the metadata enrichment doesn't override existing URLs or cause side effects
  • Confirm the node iteration and metadata.url existence check are correct (a minimal test sketch follows this list)
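
As a sketch of that verification (the `enrich` helper below mirrors the logic in the diff and is illustrative, not part of the PR; runnable with pytest):

```python
BASE = "https://wiki.p2pfoundation.net/"

def enrich(metadata: dict) -> dict:
    # Mirrors the diff: fill in a URL only when one is absent.
    if metadata.get("url") is None:
        title = metadata.get("title")
        if title is not None:
            metadata["url"] = BASE + title.replace(" ", "_")
    return metadata

def test_fills_missing_url():
    assert enrich({"title": "Peer Production"})["url"] == BASE + "Peer_Production"

def test_keeps_existing_url():
    assert enrich({"title": "X", "url": "https://example.org/x"})["url"] == "https://example.org/x"
```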

Poem

🐰 A rabbit hops through wiki lanes,
Adding URLs to node domains,
With underscores where spaces rest,
Metadata flows, now nicely dressed! 🌐

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title "fix: update URL handling in DualQdrantRetrievalEngine!" clearly and specifically describes the main change: generating MediaWiki URLs for nodes missing metadata URLs. It is concise, tied to the modified component, and specific enough that a developer scanning commit history would understand it. |
✨ Finishing touches
  • 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/update-mediaWiki-url-reference-handling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8452b3 and 35e378f.

📒 Files selected for processing (1)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (1 hunks)

Comment on lines +299 to +306
```python
        # TODO: cleanup this
        # this is for MediaWiki platform which should be handled over the ETL and not here
        for node in nodes:
            if node.metadata.get("url") is None:
                url_route = node.metadata.get("title")
                if url_route is not None:
                    url_route = url_route.replace(" ", "_")
                    node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Move MediaWiki-specific logic to ETL or use a configuration-based approach.

The TODO comment correctly identifies that platform-specific URL generation should be handled in the ETL layer, not in the retrieval engine. Hardcoding MediaWiki logic here tightly couples this generic engine to a specific platform and violates separation of concerns.

Additionally, this logic introduces several concerns:

  1. Metadata mutation risk: Modifying node.metadata["url"] in-place could affect cached or shared node instances elsewhere in the codebase, potentially causing unexpected side effects.

  2. Inconsistent behavior: This URL generation only occurs in _process_basic_query but is missing from _process_summary_query (lines 313-364), where source_nodes (line 360) won't have URLs populated. The fallback to _process_basic_query at line 333 would apply the logic, but direct summary query paths won't.

  3. Simplistic URL encoding: Using replace(" ", "_") doesn't handle special characters like #, ?, &, or international characters that require proper URL encoding for MediaWiki.

Consider one of these approaches:

Option 1 (Recommended): Move URL generation to the ETL pipeline so nodes arrive with populated URLs.

Option 2: If runtime generation is necessary, use a strategy pattern:

```python
from typing import Callable

from llama_index.core.schema import NodeWithScore  # assumed import path

# At class level
url_generator: Callable[[NodeWithScore], None] | None = None

# In the method, after line 294
if self.url_generator:
    for node in nodes:
        self.url_generator(node)
```

Then inject platform-specific logic during setup:

```python
from urllib.parse import quote

def mediawiki_url_generator(node: NodeWithScore) -> None:
    if node.metadata.get("url") is None:
        title = node.metadata.get("title")
        if title:
            url_route = quote(title.replace(" ", "_"), safe="")
            node.metadata["url"] = f"https://wiki.p2pfoundation.net/{url_route}"
```

Run the following script to verify URL population inconsistency between query modes:


🏁 Script executed:

```shell
#!/bin/bash
# Check if _process_summary_query has similar URL generation logic

ast-grep --pattern $'def _process_summary_query($$$) {
  $$$
  metadata["url"]
  $$$
}'
```

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract the _process_summary_query method and surrounding context
# to verify URL generation handling and response construction

rg -n -A50 'def _process_summary_query' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 2392


🏁 Script executed:

```shell
#!/bin/bash
# Search for any URL-related code in the summary query method
# to confirm whether URL generation is applied there

rg -n 'metadata.*url|url.*metadata' utils/query_engine/dual_qdrant_retrieval_engine.py -C3
```

Length of output: 565


🏁 Script executed:

```shell
#!/bin/bash
# Check where nodes from _process_basic_query are reused/cached
# to assess metadata mutation risk

rg -n 'node\.metadata\[' utils/query_engine/dual_qdrant_retrieval_engine.py -B2 -A2
```

Length of output: 572


🏁 Script executed:

```shell
#!/bin/bash
# Verify the commented-out code at line 310
# and confirm hardcoded URL patterns

sed -n '295,315p' utils/query_engine/dual_qdrant_retrieval_engine.py
```

Length of output: 835


Move MediaWiki-specific logic to ETL or configuration, and fix inconsistent URL handling between query modes.

Platform-specific URL generation is hardcoded in a generic retrieval engine, violating separation of concerns. The TODO comment (lines 299-300) correctly identifies this as technical debt that belongs in the ETL layer.

Verification confirms several actionable issues:

  1. Inconsistent URL handling: URLs are only generated in _process_basic_query (lines 301-306). In _process_summary_query, the main code path (lines 335-360) returns raw_nodes as source_nodes without URL population. The fallback at line 333 would apply URLs only if no dates are found, making behavior inconsistent.

  2. In-place metadata mutation (line 306): Modifying node.metadata["url"] mutates node objects. If nodes are cached or referenced elsewhere, this could cause unexpected side effects.

  3. Dead code (line 310): Remove the commented-out # return final_response.

  4. Simplistic URL encoding: Using replace(" ", "_") doesn't handle special characters like #, ?, &, or non-ASCII characters.

Required fixes:

  • Remove commented-out line 310
  • Apply URL generation consistently to both _process_basic_query and _process_summary_query paths, or move logic to ETL/configuration
  • Use proper URL encoding (e.g., urllib.parse.quote) instead of simple space replacement
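
A minimal sketch of the non-mutating helper these fixes describe (helper name, constant, and dict-based metadata are illustrative assumptions, not part of the PR):

```python
from urllib.parse import quote

WIKI_BASE_URL = "https://wiki.p2pfoundation.net/"  # illustrative constant

def enriched_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata with a MediaWiki URL filled in when missing."""
    enriched = dict(metadata)  # copy rather than mutating the shared node in-place
    if enriched.get("url") is None:
        title = enriched.get("title")
        if title:
            # quote() percent-encodes '#', '&', '?', and non-ASCII characters
            enriched["url"] = WIKI_BASE_URL + quote(title.replace(" ", "_"))
    return enriched
```

Both query paths could then build their Response objects from nodes carrying copies enriched by such a helper, keeping URL behavior consistent without side effects.
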
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around lines 299-306: the
MediaWiki-specific URL generation and the commented-out dead code should be
removed or relocated to ETL/configuration; instead create a small helper that
takes a Node and returns a new metadata dict (do not mutate node.metadata
in-place), use urllib.parse.quote to safely encode the title (handling spaces
and special/non-ASCII chars), remove the commented-out "return final_response"
line, and call that helper consistently from both _process_basic_query and
_process_summary_query (or call it once where nodes are produced) so URLs are
applied deterministically across query modes.


```python
        final_response = Response(response=str(response), source_nodes=nodes)

        # return final_response
```

⚠️ Potential issue | 🟡 Minor

Remove commented-out code.

The commented-out return statement serves no purpose and should be deleted to maintain code cleanliness.

Apply this diff:

```diff
-        # return final_response
         return final_response
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-        # return final_response
         return final_response
```
🤖 Prompt for AI Agents
In utils/query_engine/dual_qdrant_retrieval_engine.py around line 310, remove
the leftover commented-out line "# return final_response" as it is dead code;
delete that comment so the file contains no unused commented return statements
and remains clean.

amindadgar merged commit 736c835 into main on Oct 26, 2025
3 checks passed