fix(templater): handle multi-byte UTF-8 characters in Python bridge #2319
Open
ota2000 wants to merge 2 commits into quarylabs:main from
…idge

Python uses character-based (Unicode code point) indices for string positions, while Rust's String::len() returns byte length (UTF-8). When SQL files contain multi-byte characters (Japanese, accented chars, etc.), this mismatch causes a panic in TemplatedFile's consistency check:

"TemplatedFile. Consistency fail on running source length"

The fix adds char-to-byte index conversion in PythonTemplatedFile::to_templated_file(), converting all source_idx and slice ranges from Python's character coordinates to Rust's byte coordinates before constructing the TemplatedFile.

Closes quarylabs#2318

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
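The conversion described in this commit message can be sketched in a few lines of Rust. This is a minimal illustration, not the PR's code: the function names `char_idx_to_byte_idx_sketch` and `char_range_to_byte_range_sketch` and their signatures are assumptions, and clamping out-of-range indices to the byte length is a choice made here that the real helpers may not share.

```rust
use std::ops::Range;

/// Hypothetical helper (not the PR's implementation): map a character
/// (Unicode code point) index to the corresponding UTF-8 byte offset.
/// An index at or beyond the character count maps to `s.len()`.
fn char_idx_to_byte_idx_sketch(s: &str, char_idx: usize) -> usize {
    s.char_indices()
        .nth(char_idx)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len())
}

/// Hypothetical helper: convert a character range into the equivalent byte range.
fn char_range_to_byte_range_sketch(s: &str, r: Range<usize>) -> Range<usize> {
    char_idx_to_byte_idx_sketch(s, r.start)..char_idx_to_byte_idx_sketch(s, r.end)
}
```

Each lookup walks the string from the start, so converting many indices this way costs time proportional to the string length per index. A table of byte offsets built in one pass would avoid that when a file has many slices.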
Add three tests to lib-core that verify TemplatedFile consistency checks work correctly with multi-byte characters:

- test_templated_file_multibyte_consistency_check: Japanese comment
- test_templated_file_multibyte_multiple_raw_slices: accented chars
- test_templated_file_char_indices_cause_panic: proves that passing unconverted Python char indices causes the panic (should_panic)

These tests run without the Python feature and validate the core invariant that was being violated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
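To illustrate the invariant these tests exercise, here is a self-contained stand-in. It is not sqruff's TemplatedFile: the function `check_running_length` and its panic messages are hypothetical, and it only mimics the idea that slice ranges must be contiguous and must end at the source's byte length.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a running-length consistency check:
/// slices must be contiguous and end exactly at the source's byte length.
fn check_running_length(source: &str, slices: &[Range<usize>]) {
    let mut pos = 0;
    for s in slices {
        assert_eq!(s.start, pos, "slice does not start where the previous one ended");
        pos = s.end;
    }
    assert_eq!(pos, source.len(), "running length does not match source byte length");
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn byte_indices_pass() {
        // "-- 日本語\n" is 7 characters but 13 bytes.
        let source = "-- 日本語\nSELECT 1";
        check_running_length(source, &[0..13, 13..source.len()]);
    }

    #[test]
    #[should_panic]
    fn char_indices_panic() {
        // Character-based ranges (as Python would report them) stop short
        // of the byte length, so the final assertion fails.
        let source = "-- 日本語\nSELECT 1";
        let n_chars = source.chars().count();
        check_running_length(source, &[0..7, 7..n_chars]);
    }
}
```

The second test mirrors the role of a should_panic regression test: it documents that character indices alone are enough to trip the check.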
Summary
- Add char_to_byte_indices() and char_idx_to_byte_idx() helper functions to lib-core for converting Python's character-based indices to Rust's byte-based indices
- Update PythonTemplatedFile::to_templated_file() to convert all source_idx, source_slice, and templated_slice indices from character coordinates to byte coordinates before constructing the TemplatedFile
- to_templated_file_slice() and to_raw_file_slice() methods
- Add tests to lib-core that verify TemplatedFile consistency checks pass with multi-byte characters, including a should_panic test that proves unconverted char indices cause the bug

Context
When SQL files contain multi-byte UTF-8 characters (e.g. Japanese comments, accented characters), the Python templater (dbt/jinja/python) panics with:

"TemplatedFile. Consistency fail on running source length"

Root cause: Python's len() returns character count (Unicode code points), while Rust's String::len() returns byte count (UTF-8). For example, あ is 1 character in Python but 3 bytes in Rust. PythonTemplatedFile::to_templated_file() was passing Python's character-based indices directly to Rust without conversion, causing the consistency check to fail.

Closes #2318
Related: #1328, #1431
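
The mismatch is easy to reproduce on the Rust side alone. The sketch below uses only standard-library calls; the function names are illustrative and do not come from sqruff. Here `chars().count()` plays the role of Python's `len()`.

```rust
/// Length as Python's `len()` would report it: Unicode code points.
fn python_style_len(s: &str) -> usize {
    s.chars().count()
}

/// Length as Rust's `str::len()` reports it: UTF-8 bytes.
fn rust_byte_len(s: &str) -> usize {
    s.len()
}

/// Reproduces the bug pattern: a Python character index used directly as a
/// byte offset. Returns `None` when the offset is out of range or falls
/// inside a multi-byte character; otherwise it may return the wrong prefix.
fn slice_at_naive(s: &str, idx_from_python: usize) -> Option<&str> {
    s.get(..idx_from_python)
}
```

For あ the two lengths are 1 and 3, so any index taken after the first multi-byte character drifts further from the byte offset Rust expects.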
Test plan
- cargo test --package sqruff-lib-core: all 28 tests pass (10 of them new: 7 for index conversion helpers, 3 for TemplatedFile multi-byte regression)
- cargo test --package sqruff-lib --lib: all 49 tests pass (no regressions)
- cargo fmt --all -- --check: passes
- should_panic test confirms that unconverted Python char indices trigger the exact panic being fixed