Walkthrough
The PR implements coordinated updates across the scraping pipeline to introduce auto-determination of link placeholder handling and enhance schema validation. Key changes include: auto-resolution logic in the pipe layer that analyzes content URL density to conditionally enable placeholders, support for Pydantic RootModel schemas with list-structure validation, flipped defaults for main content extraction from True to False, and addition of a runtime guard to raise errors on extraction failures. Type signatures were updated to permit None values for auto-configuration paths and reflect new default behaviors across SDK and browser packages.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
    if (
        not issubclass(_response_format, RootModel)
        or get_origin(_response_format.model_fields["root"].annotation) is not list
    ):
This check could fail if `_response_format` is a `RootModel` but has no `root` field in `model_fields` (though this is unlikely with valid Pydantic models).
Suggested change
    - if (
    -     not issubclass(_response_format, RootModel)
    -     or get_origin(_response_format.model_fields["root"].annotation) is not list
    - ):
    + if not issubclass(_response_format, RootModel):
    +     # err message hints at using a basemodel instead of rootmodel
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/notte-browser/src/notte_browser/scraping/schema.py
Line: 134:137
Comment:
logic could fail if `_response_format` is `RootModel` but doesn't have a `root` field in `model_fields` (though unlikely with valid Pydantic models)
```suggestion
if not issubclass(_response_format, RootModel):
# err message hints at using a basemodel instead of rootmodel
```
How can I resolve this? If you propose a fix, please make it concise.
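One defensive way to address the concern raised in this comment is to look up `root` with `.get()` before inspecting its annotation. The following is a minimal, stdlib-only sketch and not the code in `schema.py`: `root_is_list` and `is_list_annotation` are hypothetical helper names, and `SimpleNamespace` stands in for Pydantic's field objects, on the assumption that they expose an `annotation` attribute as in the snippet under review.

```python
from types import SimpleNamespace
from typing import Any, Dict, get_origin


def is_list_annotation(annotation: Any) -> bool:
    """True for a bare `list` or a parameterized list such as `list[int]`."""
    return annotation is list or get_origin(annotation) is list


def root_is_list(model_fields: Dict[str, Any]) -> bool:
    """Check the `root` field without raising when it is absent.

    `model_fields` is assumed to map field names to objects that carry an
    `annotation` attribute (a stand-in for Pydantic's field metadata).
    """
    field = model_fields.get("root")  # avoids KeyError if "root" is missing
    if field is None:
        return False
    return is_list_annotation(getattr(field, "annotation", None))


# Stand-in field mappings, for illustration only.
list_fields = {"root": SimpleNamespace(annotation=list[int])}
dict_fields = {"root": SimpleNamespace(annotation=dict[str, int])}
no_root_fields: Dict[str, Any] = {}
```

With this shape, a model without a `root` entry is treated the same as a non-list schema, so the caller can fall through to the existing error message rather than hitting a lookup failure.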
Greptile Overview
Greptile Summary
This PR fixes several scraping-related issues by improving defaults and error handling:
- Changes the `use_link_placeholders` default from `False` to `None` (auto-detection). When URLs account for >= 50% of content in documents > 10k chars, placeholders are automatically enabled to reduce token usage and improve LLM performance
- Changes the `only_main_content` default from `True` to `False`, giving users the full page content by default instead of extracting only main content
- Supports `RootModel[list[...]]` schemas while providing helpful error messages when users pass non-list schemas for list responses, guiding them to use a BaseModel wrapper instead
- Updates `space.py` and adds pyright ignore comments to suppress type-checking warnings

The changes improve the developer experience by making the scraping behavior more intuitive and providing better error messages when schema mismatches occur.
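The auto-detection rule in the summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name, the constants, and the URL regex are hypothetical, and only the 50% ratio and the 10k-character threshold come from the summary.

```python
import re
from typing import Optional

# Hypothetical constants; the values mirror the thresholds quoted in the summary.
MIN_CONTENT_CHARS = 10_000
MIN_URL_RATIO = 0.5

# Assumed, simplified URL matcher: "http(s)://" followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def resolve_use_link_placeholders(
    content: str, use_link_placeholders: Optional[bool] = None
) -> bool:
    """Honor an explicit True/False; decide from URL density when None."""
    if use_link_placeholders is not None:
        return use_link_placeholders
    if len(content) <= MIN_CONTENT_CHARS:
        return False  # short documents keep their raw links
    url_chars = sum(len(url) for url in URL_PATTERN.findall(content))
    return url_chars / len(content) >= MIN_URL_RATIO


# Illustrative inputs: a link-heavy document, plain prose, and a short page.
link_heavy = ("https://example.com/" + "a" * 80 + " ") * 200
prose = "word " * 3000
short_page = "https://example.com/a " * 10
```

Passing `None` keeps the decision automatic, while an explicit `True` or `False` still overrides the heuristic, which matches the motivation for moving the default from `False` to `None`.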
Confidence Score: 4/5
- The `only_main_content` default change (True → False) may surprise existing users, though the new default is more intuitive
- See `packages/notte-browser/src/notte_browser/scraping/schema.py` for the RootModel validation logic

Important Files Changed
- Changes the `use_link_placeholders` default from `False` to `None` (auto-detection), and the `only_main_content` default from `True` to `False`
- Auto-enables `use_link_placeholders` when URLs exceed 50% of content