
fix: scrape issues#698

Open
leo-notte wants to merge 2 commits into main from fix/scrape-issues

Conversation


@leo-notte leo-notte commented Feb 3, 2026

  • allow RootModel, but recommend a BaseModel wrapper in the validation error message
  • disable the main content extractor by default
  • default to URL link placeholders when token counts are large and requires_schema is set, but allow disabling

Summary by CodeRabbit

Release Notes

  • New Features

    • Smart link placeholder optimization: Links are now automatically handled based on content composition, enabling placeholders when appropriate for better processing.
  • Improvements

    • Scraper now includes more page content by default.
    • Enhanced error reporting for schema validation mismatches.

Greptile Overview

Greptile Summary

This PR fixes several scraping-related issues by improving defaults and error handling:

  • Smart link placeholder defaults: Changed use_link_placeholders from False to None (auto-detection). When URLs account for >= 50% of content in documents > 10k chars, placeholders are automatically enabled to reduce token usage and improve LLM performance
  • More permissive main content default: Changed only_main_content from True to False, giving users the full page content by default instead of extracting only main content
  • Better RootModel support: Added validation to allow RootModel[list[...]] schemas while providing helpful error messages when users pass non-list schemas for list responses, guiding them to use a BaseModel wrapper instead
  • Minor formatting: Added blank line in space.py and pyright ignore comments to suppress type checking warnings

The changes improve the developer experience by making the scraping behavior more intuitive and providing better error messages when schema mismatches occur.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk; the changes are mostly to default behavior
  • The changes are well-structured and improve the user experience. The auto-detection logic for link placeholders is sound, and the RootModel validation is correct. The main concern is the breaking change to the only_main_content default (True → False), which may surprise existing users, though the new default is more intuitive
  • Pay attention to packages/notte-browser/src/notte_browser/scraping/schema.py for the RootModel validation logic

Important Files Changed

Filename Overview
packages/notte-sdk/src/notte_sdk/types.py Changed use_link_placeholders default from False to None (auto-detection), and only_main_content default from True to False
packages/notte-browser/src/notte_browser/scraping/pipe.py Replaced warning with auto-enable logic for use_link_placeholders when URLs exceed 50% of content
packages/notte-browser/src/notte_browser/scraping/schema.py Added RootModel support with list validation, improved error messages, added pyright ignore comments
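To illustrate the schema change in schema.py: both a RootModel[list[...]] and a BaseModel wrapper can describe a list response, and the improved error message steers users toward the latter. A minimal sketch with illustrative model names (these are not from the PR):

```python
from pydantic import BaseModel, RootModel


class Product(BaseModel):
    name: str
    price: float


# Accepted after this PR: a RootModel whose root annotation is a list.
class ProductList(RootModel[list[Product]]):
    pass


# The error message now points users toward this style: a BaseModel wrapper.
class ProductsResponse(BaseModel):
    products: list[Product]


items = ProductList.model_validate([{"name": "pen", "price": 1.5}])
wrapped = ProductsResponse.model_validate({"products": [{"name": "pen", "price": 1.5}]})
```

The wrapper style is generally easier to extend later (e.g. adding pagination metadata alongside `products`), which may be why the error message recommends it.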


coderabbitai bot commented Feb 3, 2026

Walkthrough

The PR implements coordinated updates across the scraping pipeline to introduce auto-determination of link placeholder handling and enhance schema validation. Key changes include: auto-resolution logic in the pipe layer that analyzes content URL density to conditionally enable placeholders, support for Pydantic RootModel schemas with list-structure validation, flipped defaults for main content extraction from True to False, and addition of a runtime guard to raise errors on extraction failures. Type signatures were updated to permit None values for auto-configuration paths and reflect new default behaviors across SDK and browser packages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 20.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): the title 'fix: scrape issues' is vague and doesn't convey specific information about the changeset. Resolution: consider a more specific title, such as 'fix: auto-resolve link placeholders and update default scrape behavior'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.






@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment


Comment on packages/notte-browser/src/notte_browser/scraping/schema.py, lines +134 to +137:

```python
if (
    not issubclass(_response_format, RootModel)
    or get_origin(_response_format.model_fields["root"].annotation) is not list
):
```

logic could fail if _response_format is RootModel but doesn't have a root field in model_fields (though unlikely with valid Pydantic models)

Suggested change:

```python
if not issubclass(_response_format, RootModel):
    # err message hints at using a basemodel instead of rootmodel
```
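The concern about a missing `root` field can be handled with a defensive lookup rather than direct indexing. A hedged sketch (the helper name is an assumption, and the surrounding validation pipeline is omitted):

```python
from typing import get_origin

from pydantic import RootModel


def is_list_root_model(response_format: type) -> bool:
    """Return True only when response_format is a RootModel wrapping a list."""
    if not issubclass(response_format, RootModel):
        return False
    root_field = response_format.model_fields.get("root")
    # .get() avoids a KeyError on a malformed RootModel with no "root" field,
    # which is the edge case the review comment points out.
    return root_field is not None and get_origin(root_field.annotation) is list
```

The caller can then raise the BaseModel-recommending error whenever this returns False, keeping the original behavior without the potential KeyError.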

