fix(templater): handle multi-byte UTF-8 characters in Python bridge#2319

Open
ota2000 wants to merge 2 commits into quarylabs:main from ota2000:ota2000/fix/multibyte-char-index-conversion
ota2000 (Contributor) commented Feb 14, 2026

Summary

  • Add char_to_byte_indices() and char_idx_to_byte_idx() helper functions to lib-core for converting Python's character-based indices to Rust's byte-based indices
  • Update PythonTemplatedFile::to_templated_file() to convert all source_idx, source_slice, and templated_slice indices from character coordinates to byte coordinates before constructing the TemplatedFile
  • Remove unused to_templated_file_slice() and to_raw_file_slice() methods
  • Add regression tests in lib-core that verify TemplatedFile consistency checks pass with multi-byte characters, including a should_panic test that proves unconverted char indices cause the bug
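
As a rough illustration of the conversion described above, here is a minimal sketch. The names `char_idx_to_byte_idx` and `char_to_byte_indices` come from this PR, but the signatures and bodies below are assumptions, and `char_range_to_byte_range` is a hypothetical helper added only for illustration. This is not the code in the diff.

```rust
use std::ops::Range;

/// Assumed signature. Maps a character (Unicode scalar value) index in `s`
/// to its byte offset. An index at or past the end maps to `s.len()`.
fn char_idx_to_byte_idx(s: &str, char_idx: usize) -> usize {
    s.char_indices()
        .nth(char_idx)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len())
}

/// Assumed signature. Builds a lookup table where entry `i` is the byte
/// offset of character `i`. The final entry equals `s.len()` so that an
/// exclusive range end can be looked up as well.
fn char_to_byte_indices(s: &str) -> Vec<usize> {
    let mut table: Vec<usize> = s.char_indices().map(|(byte_idx, _)| byte_idx).collect();
    table.push(s.len());
    table
}

/// Hypothetical helper. Converts a character range into a byte range using
/// a table produced by `char_to_byte_indices` (which is never empty).
/// Out-of-range indices are clamped to the end of the string.
fn char_range_to_byte_range(table: &[usize], chars: Range<usize>) -> Range<usize> {
    let last = table.len() - 1;
    table[chars.start.min(last)]..table[chars.end.min(last)]
}
```

Building the table once per file keeps the conversion of many slices linear overall, whereas calling the single-index helper for every slice boundary rescans the string each time.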

Context

When SQL files contain multi-byte UTF-8 characters (e.g. Japanese comments, accented characters), the Python templater (dbt/jinja/python) panics with:

TemplatedFile. Consistency fail on running source length. 1321 != 1281

Root cause: Python's len() returns the character count (Unicode code points), while Rust's String::len() returns the byte count (UTF-8). For example, a single Japanese kana is 1 character in Python but 3 bytes in Rust. PythonTemplatedFile::to_templated_file() was passing Python's character-based indices directly to Rust without conversion, so the running source length computed from those indices no longer matched the byte length of the source and the consistency check failed.
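
The mismatch can be reproduced in Rust alone. The sketch below uses a hypothetical helper, `char_and_byte_len`, which is not part of this PR. It returns both measurements for a string: the first value corresponds to what Python's len() reports, the second to what String::len() reports.

```rust
/// Hypothetical helper for illustration only.
/// Returns (character count, UTF-8 byte length) for `s`.
fn char_and_byte_len(s: &str) -> (usize, usize) {
    // `chars()` iterates Unicode scalar values, matching Python's len()
    // for ordinary text; `len()` counts UTF-8 bytes.
    (s.chars().count(), s.len())
}
```

For pure ASCII input such as `"SELECT 1"` both values are 8. For `"-- \u{3042}"` (a comment containing one hiragana character) the result is `(4, 6)`, and every index after that character is off by 2 if left unconverted.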

Closes #2318
Related: #1328, #1431

Test plan

  • cargo test --package sqruff-lib-core — all 28 tests pass (including 10 new: 7 for index conversion helpers, 3 for TemplatedFile multi-byte regression)
  • cargo test --package sqruff-lib --lib — all 49 tests pass (no regressions)
  • cargo fmt --all -- --check — passes
  • should_panic test confirms that unconverted Python char indices trigger the exact panic being fixed
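
To show what the should_panic test exercises, here is a simplified model of the running-length invariant. It is an assumption-laden sketch, not sqruff's actual TemplatedFile check: the function name `check_running_source_length`, its signature, and its error text are all hypothetical.

```rust
use std::ops::Range;

/// Hypothetical, simplified model of the consistency check.
/// Source slices must be contiguous, start at 0, and end exactly at the
/// byte length of `source`.
fn check_running_source_length(source: &str, slices: &[Range<usize>]) -> Result<(), String> {
    let mut pos = 0;
    for slice in slices {
        if slice.start != pos {
            return Err(format!(
                "slice starts at {} but previous slice ended at {}",
                slice.start, pos
            ));
        }
        pos = slice.end;
    }
    if pos != source.len() {
        return Err(format!("running source length {} != {}", pos, source.len()));
    }
    Ok(())
}
```

With the source `"-- \u{3042}\nSELECT 1"` (13 characters, 15 bytes), the character-based slices `0..5` and `5..13` fail this model with a length mismatch, while the byte-based slices `0..7` and `7..15` pass.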

…idge

Python uses character-based (Unicode code point) indices for string
positions, while Rust's String::len() returns byte length (UTF-8).
When SQL files contain multi-byte characters (Japanese, accented chars,
etc.), this mismatch causes a panic in TemplatedFile's consistency check:

  "TemplatedFile. Consistency fail on running source length"

The fix adds char-to-byte index conversion in
PythonTemplatedFile::to_templated_file(), converting all source_idx
and slice ranges from Python's character coordinates to Rust's byte
coordinates before constructing the TemplatedFile.

Closes quarylabs#2318

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ota2000 changed the title from "fix(templater): convert Python char indices to Rust byte indices for multi-byte UTF-8 support" to "fix(templater): handle multi-byte UTF-8 characters in Python bridge" Feb 14, 2026
ota2000 marked this pull request as ready for review February 14, 2026 03:43
Add three tests to lib-core that verify TemplatedFile consistency
checks work correctly with multi-byte characters:

- test_templated_file_multibyte_consistency_check: Japanese comment
- test_templated_file_multibyte_multiple_raw_slices: accented chars
- test_templated_file_char_indices_cause_panic: proves that passing
  unconverted Python char indices causes the panic (should_panic)

These tests run without the Python feature and validate the core
invariant that was being violated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[Bug]: TemplatedFile panic with multi-byte UTF-8 characters (char index vs byte index mismatch)