feat: tree-sitter scoring robustness and edge cases #235
Open
dive2tech wants to merge 2 commits into entrius:test from
- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so invalid or malformed content does not crash validators and is handled in a predictable way.
Changes
- `MAX_AST_DEPTH`: New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.
- Content normalization and safe encoding/decoding: Helpers handle `None`, non-str, empty/whitespace, and invalid UTF-8 (using `errors='replace'`) without raising:
  - `_normalize_content`: Returns `None` for invalid or empty content; otherwise the stripped string.
  - `_safe_encode_content`: UTF-8 encode with replace for invalid codepoints.
  - `_safe_decode_node_text`: Decode node bytes with replace for malformed UTF-8.
  - `_safe_content_byte_size`: Byte length for size checks; 0 on error or non-str.
- `parse_code`: Uses normalization and safe encode; returns `None` for invalid or unsupported content instead of raising.
- `collect_node_signatures`: Depth-limited walk (default `MAX_AST_DEPTH`); decodes node text with `_safe_decode_node_text` so bad UTF-8 in source does not crash.
- `score_tree_diff`: Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.
- `calculate_token_score_from_file_changes`: Normalizes content from `FileContentPair`; if new content is empty or whitespace-only after normalization, skips with `skipped-empty` (score 0); uses `_safe_content_byte_size` for the size check so invalid UTF-8 does not raise.

Tests
- `_normalize_content`, `_safe_encode_content`, `_safe_decode_node_text`, `_safe_content_byte_size`
- `parse_code` (None, empty, non-str, invalid UTF-8, unknown language)
- `collect_node_signatures` (depth limit)
- `score_tree_diff` (None/empty/invalid UTF-8)
- `calculate_token_score_from_file_changes` (empty and whitespace-only new content → skipped-empty)

Testing
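For reviewers, the depth-limit case listed for `collect_node_signatures` can be illustrated with a small standalone sketch. It does not use tree-sitter or the PR's code: `FakeNode`, `collect_types`, `build_chain`, and `DEMO_MAX_DEPTH` are hypothetical names, and the cap value is an assumption, since the actual value of `MAX_AST_DEPTH` is not stated in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DEMO_MAX_DEPTH = 50  # hypothetical cap; kept well below Python's recursion limit


@dataclass
class FakeNode:
    """Stand-in for a parse-tree node, exposing only `type` and `children`."""
    type: str
    children: List["FakeNode"] = field(default_factory=list)


def collect_types(
    node: FakeNode,
    max_depth: int = DEMO_MAX_DEPTH,
    _depth: int = 0,
    _out: Optional[List[str]] = None,
) -> List[str]:
    """Collect node types in pre-order, ignoring anything deeper than max_depth."""
    if _out is None:
        _out = []
    if _depth > max_depth:
        return _out  # stop descending: this bounds the recursion depth
    _out.append(node.type)
    for child in node.children:
        collect_types(child, max_depth, _depth + 1, _out)
    return _out


def build_chain(length: int) -> FakeNode:
    """Build a degenerate tree iteratively: one chain of `length` nested nodes."""
    root = FakeNode("n0")
    current = root
    for i in range(1, length):
        child = FakeNode(f"n{i}")
        current.children.append(child)
        current = child
    return root


def test_depth_limit_truncates_deep_chain() -> None:
    deep = build_chain(5_000)  # deeper than the default recursion limit
    assert collect_types(deep, max_depth=5) == ["n0", "n1", "n2", "n3", "n4", "n5"]
```

The point of the check is that a pathologically deep input yields a truncated result instead of a `RecursionError`: nodes at depths 0 through `max_depth` are visited, and everything below is skipped.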