
fix: scrape issues#698

Open
leo-notte wants to merge 2 commits into main from fix/scrape-issues

Conversation


@leo-notte leo-notte commented Feb 3, 2026

  • allow RootModel, but recommend a BaseModel wrapper in the validation error message
  • disable the main content extractor by default
  • default to URL link placeholders when token counts are large and requires_schema is set, but allow disabling

Summary by CodeRabbit

Release Notes

  • New Features

    • Smart link placeholder optimization: Links are now automatically handled based on content composition, enabling placeholders when appropriate for better processing.
  • Improvements

    • Scraper now includes more page content by default.
    • Enhanced error reporting for schema validation mismatches.

Greptile Overview

Greptile Summary

This PR fixes several scraping-related issues by improving defaults and error handling:

  • Smart link placeholder defaults: Changed use_link_placeholders from False to None (auto-detection). When URLs account for >= 50% of content in documents > 10k chars, placeholders are automatically enabled to reduce token usage and improve LLM performance
  • More permissive main content default: Changed only_main_content from True to False, giving users the full page content by default instead of extracting only main content
  • Better RootModel support: Added validation to allow RootModel[list[...]] schemas while providing helpful error messages when users pass non-list schemas for list responses, guiding them to use a BaseModel wrapper instead
  • Minor formatting: Added blank line in space.py and pyright ignore comments to suppress type checking warnings

The changes improve the developer experience by making the scraping behavior more intuitive and providing better error messages when schema mismatches occur.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk; the changes are mostly to default behavior
  • The changes are well-structured and improve the user experience. The auto-detection logic for link placeholders is sound, and the RootModel validation is correct. The main concern is the breaking change to the only_main_content default (True → False), which may surprise existing users, though the new default is more intuitive
  • Pay attention to packages/notte-browser/src/notte_browser/scraping/schema.py for the RootModel validation logic

Important Files Changed

Filename Overview
packages/notte-sdk/src/notte_sdk/types.py Changed use_link_placeholders default from False to None (auto-detection), and only_main_content default from True to False
packages/notte-browser/src/notte_browser/scraping/pipe.py Replaced warning with auto-enable logic for use_link_placeholders when URLs exceed 50% of content
packages/notte-browser/src/notte_browser/scraping/schema.py Added RootModel support with list validation, improved error messages, added pyright ignore comments
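To illustrate the schema change in schema.py: both a RootModel[list[...]] and a BaseModel wrapper can describe a list response, and the improved error message steers users toward the latter. A minimal sketch with illustrative model names (these are not from the PR):

```python
from pydantic import BaseModel, RootModel


class Product(BaseModel):
    name: str
    price: float


# Accepted after this PR: a RootModel whose root annotation is a list.
class ProductList(RootModel[list[Product]]):
    pass


# The error message now points users toward this style: a BaseModel wrapper.
class ProductsResponse(BaseModel):
    products: list[Product]


items = ProductList.model_validate([{"name": "pen", "price": 1.5}])
wrapped = ProductsResponse.model_validate({"products": [{"name": "pen", "price": 1.5}]})
```

The wrapper style is generally easier to extend later (e.g. adding pagination metadata alongside `products`), which may be why the error message recommends it.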


coderabbitai bot commented Feb 3, 2026

Walkthrough

The PR implements coordinated updates across the scraping pipeline to introduce auto-determination of link placeholder handling and enhance schema validation. Key changes include: auto-resolution logic in the pipe layer that analyzes content URL density to conditionally enable placeholders, support for Pydantic RootModel schemas with list-structure validation, flipped defaults for main content extraction from True to False, and addition of a runtime guard to raise errors on extraction failures. Type signatures were updated to permit None values for auto-configuration paths and reflect new default behaviors across SDK and browser packages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 20.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): the title 'fix: scrape issues' is vague and doesn't convey specific information about the changeset. Resolution: consider a more specific title, such as 'fix: auto-resolve link placeholders and update default scrape behavior'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.






@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment


Comment on packages/notte-browser/src/notte_browser/scraping/schema.py, lines +134 to +137:

```python
if (
    not issubclass(_response_format, RootModel)
    or get_origin(_response_format.model_fields["root"].annotation) is not list
):
```

logic could fail if _response_format is RootModel but doesn't have a root field in model_fields (though unlikely with valid Pydantic models)

Suggested change:

```python
if not issubclass(_response_format, RootModel):
    # err message hints at using a basemodel instead of rootmodel
```
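The concern about a missing `root` field can be handled with a defensive lookup rather than direct indexing. A hedged sketch (the helper name is an assumption, and the surrounding validation pipeline is omitted):

```python
from typing import get_origin

from pydantic import RootModel


def is_list_root_model(response_format: type) -> bool:
    """Return True only when response_format is a RootModel wrapping a list."""
    if not issubclass(response_format, RootModel):
        return False
    root_field = response_format.model_fields.get("root")
    # .get() avoids a KeyError on a malformed RootModel with no "root" field,
    # which is the edge case the review comment points out.
    return root_field is not None and get_origin(root_field.annotation) is list
```

The caller can then raise the BaseModel-recommending error whenever this returns False, keeping the original behavior without the potential KeyError.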

