feat: tree-sitter scoring robustness and edge cases by dive2tech · Pull Request #235 · entrius/gittensor

dive2tech · 2026-02-25T10:47:00Z

Summary

Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so invalid or malformed content does not crash validators and is handled in a predictable way.

Changes

MAX_AST_DEPTH — New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.
Content normalization and safe encoding/decoding — Helpers handle None, non-str, empty/whitespace, and invalid UTF-8 (using errors='replace') without raising:
- _normalize_content — Returns None for invalid or empty content; otherwise stripped string.
- _safe_encode_content — UTF-8 encode with replace for invalid codepoints.
- _safe_decode_node_text — Decode node bytes with replace for malformed UTF-8.
- _safe_content_byte_size — Byte length for size checks; 0 on error or non-str.
parse_code — Uses normalization and safe encode; returns None for invalid or unsupported content instead of raising.
collect_node_signatures — Depth-limited walk (default MAX_AST_DEPTH); decodes node text with _safe_decode_node_text so bad UTF-8 in source does not crash.
score_tree_diff — Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.
calculate_token_score_from_file_changes — Normalizes content from FileContentPair; if new content is empty or whitespace-only after normalization, skips with skipped-empty (score 0); uses _safe_content_byte_size for the size check so invalid UTF-8 does not raise.

Tests

tests/validator/test_tree_sitter_scoring.py (35 tests) — Covers:
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size
- parse_code (None, empty, non-str, invalid UTF-8, unknown language)
- collect_node_signatures (depth limit)
- score_tree_diff (None/empty/invalid UTF-8)
- calculate_token_score_from_file_changes (empty and whitespace-only new content → skipped-empty)
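For illustration, a depth-limit test could be shaped like the sketch below. It does not use the PR's code: FakeNode, collect_signatures, and make_chain are hypothetical stand-ins for a tree-sitter node and the real walker, and the MAX_AST_DEPTH value is an assumed placeholder because the PR text does not state it.

```python
from dataclasses import dataclass, field
from typing import List

MAX_AST_DEPTH = 100  # hypothetical value; the real constant is not given in the PR text


@dataclass
class FakeNode:
    """Stand-in exposing only the attributes the walk needs."""
    type: str
    text: bytes = b""
    children: List["FakeNode"] = field(default_factory=list)


def collect_signatures(root: FakeNode, max_depth: int = MAX_AST_DEPTH) -> List[str]:
    """Iterative pre-order walk that ignores nodes deeper than max_depth."""
    signatures: List[str] = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            continue  # cap the walk instead of descending further
        text = node.text.decode("utf-8", errors="replace")
        signatures.append(f"{node.type}:{text}")
        # Push children in reverse so they are visited left to right.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return signatures


def make_chain(length: int) -> FakeNode:
    """Build a single-child chain of `length` nodes ending in a leaf with invalid UTF-8."""
    node = FakeNode("leaf", b"\xff")
    for _ in range(length - 1):
        node = FakeNode("wrapper", b"", [node])
    return node


def test_depth_limit_truncates_deep_chain():
    # Depths 0..10 inclusive are visited, so 11 signatures are expected.
    assert len(collect_signatures(make_chain(2000), max_depth=10)) == 11


def test_invalid_utf8_leaf_is_replaced():
    assert collect_signatures(make_chain(3))[-1] == "leaf:\ufffd"
```

The sketch uses an explicit stack rather than recursion, so the chain of 2000 nodes cannot hit Python's recursion limit even before the depth cap applies. Whether the PR's walker is recursive or iterative is not stated.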

Testing

pytest tests/validator/test_tree_sitter_scoring.py
pytest tests/validator/test_token_scoring_integration.py

- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>

dive2tech and others added 2 commits February 25, 2026 12:44

Merge branch 'test' into feat/tree-sitter-robustness-edge-cases

dd40d80

dive2tech wants to merge 2 commits into entrius:test from dive2tech:feat/tree-sitter-robustness-edge-cases
