feat: tree-sitter scoring robustness and edge cases #235

Open
dive2tech wants to merge 2 commits into entrius:test from dive2tech:feat/tree-sitter-robustness-edge-cases

Summary

Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so that invalid or malformed content is handled predictably instead of crashing validators.

Changes

  • MAX_AST_DEPTH — New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.

  • Content normalization and safe encoding/decoding — Helpers handle None, non-str, empty/whitespace, and invalid UTF-8 (using errors='replace') without raising:

    • _normalize_content — Returns None for invalid or empty content; otherwise stripped string.
    • _safe_encode_content — UTF-8 encode with replace for invalid codepoints.
    • _safe_decode_node_text — Decode node bytes with replace for malformed UTF-8.
    • _safe_content_byte_size — Byte length for size checks; 0 on error or non-str.
  • parse_code — Uses normalization and safe encode; returns None for invalid or unsupported content instead of raising.

  • collect_node_signatures — Depth-limited walk (default MAX_AST_DEPTH); decodes node text with _safe_decode_node_text so bad UTF-8 in source does not crash.

  • score_tree_diff — Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.

  • calculate_token_score_from_file_changes — Normalizes content from FileContentPair; if new content is empty or whitespace-only after normalization, skips with skipped-empty (score 0); uses _safe_content_byte_size for the size check so invalid UTF-8 does not raise.
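
To make the normalization and safe encode/decode behavior listed above concrete, here is a minimal sketch. It is not the code from this PR: the function names mirror the helpers without the leading underscore and the bodies are assumptions inferred from the descriptions in the list.

```python
from typing import Optional


def normalize_content(content: object) -> Optional[str]:
    """Return stripped text, or None for None, non-str, empty, or whitespace-only input."""
    if not isinstance(content, str):
        return None
    stripped = content.strip()
    return stripped or None


def safe_encode_content(content: str) -> bytes:
    """Encode to UTF-8; code points that cannot be encoded (e.g. lone surrogates) become b'?'."""
    return content.encode("utf-8", errors="replace")


def safe_decode_node_text(raw: Optional[bytes]) -> str:
    """Decode node bytes; malformed UTF-8 sequences become U+FFFD instead of raising."""
    if not raw:
        return ""
    return raw.decode("utf-8", errors="replace")


def safe_content_byte_size(content: object) -> int:
    """Byte length of the UTF-8 encoding, or 0 for non-str input (assumed behavior)."""
    if not isinstance(content, str):
        return 0
    return len(safe_encode_content(content))
```

With helpers shaped like these, a caller can treat a `None` result from normalization as "nothing to score" and skip the file, which matches the skipped-empty behavior described for calculate_token_score_from_file_changes.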

Tests

  • tests/validator/test_tree_sitter_scoring.py (35 tests) — Covers:
    • _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size
    • parse_code (None, empty, non-str, invalid UTF-8, unknown language)
    • collect_node_signatures (depth limit)
    • score_tree_diff (None/empty/invalid UTF-8)
    • calculate_token_score_from_file_changes (empty and whitespace-only new content → skipped-empty)
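
As an illustration of the depth-limit case, the sketch below walks a stand-in tree rather than a real tree-sitter parse. Everything in it is hypothetical: `FakeNode`, `collect_signatures`, `build_chain`, and the value 500 for `MAX_AST_DEPTH` are assumptions for the example, and the actual collect_node_signatures may record different signatures or handle the cutoff differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

MAX_AST_DEPTH = 500  # assumed value; the PR does not state the real constant


@dataclass
class FakeNode:
    """Stand-in exposing the attributes the walk reads: type, text, children."""
    type: str
    text: bytes = b""
    children: List["FakeNode"] = field(default_factory=list)


def collect_signatures(
    node: FakeNode,
    max_depth: int = MAX_AST_DEPTH,
    _depth: int = 0,
    _out: Optional[Set[Tuple[str, str]]] = None,
) -> Set[Tuple[str, str]]:
    """Collect (type, text) pairs, ignoring anything deeper than max_depth."""
    out = set() if _out is None else _out
    if _depth > max_depth:
        return out  # stop descending instead of recursing further
    out.add((node.type, node.text.decode("utf-8", errors="replace")))
    for child in node.children:
        collect_signatures(child, max_depth, _depth + 1, out)
    return out


def build_chain(depth: int) -> FakeNode:
    """Build a single-child chain whose root is n0 and whose leaf is n{depth}."""
    node = FakeNode(f"n{depth}")
    for i in range(depth - 1, -1, -1):
        node = FakeNode(f"n{i}", children=[node])
    return node


def test_depth_limit_truncates_walk() -> None:
    sigs = collect_signatures(build_chain(50), max_depth=10)
    assert len(sigs) == 11          # depths 0 through 10 inclusive
    assert ("n10", "") in sigs
    assert ("n11", "") not in sigs  # first node past the cap is skipped
```

The cap matters because a recursive walk in Python raises RecursionError once the interpreter's recursion limit is exceeded, so a pathologically deep tree would otherwise abort scoring.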

Testing

  • pytest tests/validator/test_tree_sitter_scoring.py
  • pytest tests/validator/test_token_scoring_integration.py

dive2tech and others added 2 commits February 25, 2026 12:44
- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>