fix(templater): handle multi-byte UTF-8 characters in Python bridge by ota2000 · Pull Request #2319 · quarylabs/sqruff

ota2000 · 2026-02-14T03:33:21Z

Summary

Add char_to_byte_indices() and char_idx_to_byte_idx() helper functions to lib-core for converting Python's character-based indices to Rust's byte-based indices
Update PythonTemplatedFile::to_templated_file() to convert all source_idx, source_slice, and templated_slice indices from character coordinates to byte coordinates before constructing the TemplatedFile
Remove unused to_templated_file_slice() and to_raw_file_slice() methods
Add regression tests in lib-core that verify TemplatedFile consistency checks pass with multi-byte characters, including a should_panic test that proves unconverted char indices cause the bug
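
The two helpers named above could look roughly like the sketch below. The names char_to_byte_indices and char_idx_to_byte_idx come from this summary, but the signatures, the clamping behaviour, and the char_range_to_byte_range helper are assumptions made for illustration, not the code in the diff.

```rust
use std::ops::Range;

/// Build a table where entry `i` is the byte offset of the `i`-th character.
/// One extra entry (the total byte length) is appended so that an exclusive
/// range end equal to the character count can also be converted.
/// NOTE: this signature is an assumption, not taken from the PR diff.
fn char_to_byte_indices(s: &str) -> Vec<usize> {
    let mut table: Vec<usize> = s.char_indices().map(|(byte_idx, _)| byte_idx).collect();
    table.push(s.len());
    table
}

/// Look up one character index in the table. Out-of-range indices clamp to
/// the last entry (the byte length); that clamping policy is an assumption.
fn char_idx_to_byte_idx(table: &[usize], char_idx: usize) -> usize {
    table.get(char_idx).or(table.last()).copied().unwrap_or(0)
}

/// Hypothetical helper: convert a character-based half-open range
/// (as Python would report it) into a byte-based one for Rust slicing.
fn char_range_to_byte_range(table: &[usize], range: Range<usize>) -> Range<usize> {
    char_idx_to_byte_idx(table, range.start)..char_idx_to_byte_idx(table, range.end)
}
```

For the string "aあb" the table is [0, 1, 4, 5], so the character range 1..2 becomes the byte range 1..4, which slices out exactly あ. Building the table once per source string keeps each later conversion a constant-time lookup.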

Context

When SQL files contain multi-byte UTF-8 characters (e.g. Japanese comments, accented characters), the Python templater (dbt/jinja/python) panics with:

TemplatedFile. Consistency fail on running source length. 1321 != 1281

Root cause: Python's len() returns the character count (Unicode code points), while Rust's String::len() returns the byte count (UTF-8). For example, あ is 1 character in Python but 3 bytes in Rust. PythonTemplatedFile::to_templated_file() was passing Python's character-based indices directly to Rust without conversion, causing the consistency check to fail.
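
The mismatch can be reproduced in plain Rust without any Python involved. In this minimal sketch, chars().count() stands in for what Python's len() would report; the function names are illustrative only.

```rust
/// Number of Unicode scalar values, which matches Python's len() for
/// any string decoded from valid UTF-8.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// Number of UTF-8 bytes, which is what str::len() and String::len() return.
fn byte_len(s: &str) -> usize {
    s.len()
}
```

For "あ" these return 1 and 3 respectively, and for "caf\u{e9}" (café with a precomposed é) they return 4 and 5. ASCII-only input gives equal values, which is why the bug only appears once a file contains multi-byte characters.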

Closes #2318
Related: #1328, #1431

Test plan

cargo test --package sqruff-lib-core — all 28 tests pass (including 10 new: 7 for index conversion helpers, 3 for TemplatedFile multi-byte regression)
cargo test --package sqruff-lib --lib — all 49 tests pass (no regressions)
cargo fmt --all -- --check — passes
should_panic test confirms that unconverted Python char indices trigger the exact panic being fixed
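
The shape of such a should_panic test can be sketched as follows. check_running_source_length is a hypothetical stand-in for the TemplatedFile consistency check, written only so the example is self-contained; the real check, its signature, and the actual test bodies in lib-core are not reproduced here.

```rust
/// Hypothetical stand-in for the consistency check: the slices must be
/// contiguous and must end exactly at the source's byte length.
fn check_running_source_length(source: &str, slices: &[(usize, usize)]) {
    let mut running = 0usize;
    for &(start, end) in slices {
        assert_eq!(start, running, "slices must be contiguous");
        running = end;
    }
    assert_eq!(running, source.len(), "Consistency fail on running source length");
}

#[cfg(test)]
mod tests {
    use super::*;

    // "-- あ" is 4 characters but 6 bytes; the whole source is 13 characters, 15 bytes.
    const SOURCE: &str = "-- あ\nSELECT 1";

    #[test]
    fn byte_indices_pass() {
        check_running_source_length(SOURCE, &[(0, 6), (6, 15)]);
    }

    #[test]
    #[should_panic(expected = "Consistency fail on running source length")]
    fn char_indices_cause_panic() {
        // Unconverted character indices, as Python would report them.
        check_running_source_length(SOURCE, &[(0, 4), (4, 13)]);
    }
}
```

Pinning the expected panic message means the test fails if the panic comes from somewhere else, so it documents the specific failure mode rather than any crash.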

…idge

Python uses character-based (Unicode code point) indices for string positions, while Rust's String::len() returns byte length (UTF-8). When SQL files contain multi-byte characters (Japanese, accented chars, etc.), this mismatch causes a panic in TemplatedFile's consistency check:

"TemplatedFile. Consistency fail on running source length"

The fix adds char-to-byte index conversion in PythonTemplatedFile::to_templated_file(), converting all source_idx and slice ranges from Python's character coordinates to Rust's byte coordinates before constructing the TemplatedFile.

Closes quarylabs#2318

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add three tests to lib-core that verify TemplatedFile consistency checks work correctly with multi-byte characters:

- test_templated_file_multibyte_consistency_check: Japanese comment
- test_templated_file_multibyte_multiple_raw_slices: accented chars
- test_templated_file_char_indices_cause_panic: proves that passing unconverted Python char indices causes the panic (should_panic)

These tests run without the Python feature and validate the core invariant that was being violated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ota2000 changed the title ~~fix(templater): convert Python char indices to Rust byte indices for multi-byte UTF-8 support~~ fix(templater): handle multi-byte UTF-8 characters in Python bridge Feb 14, 2026

ota2000 marked this pull request as ready for review February 14, 2026 03:43

ota2000 wants to merge 2 commits into quarylabs:main from ota2000:ota2000/fix/multibyte-char-index-conversion
