fix: support unquoted Unicode identifiers in ANSI and Postgres dialects #2282

Open
benfdking wants to merge 2 commits into main from fix/unicode-identifiers

Conversation

@benfdking
Collaborator

Summary

  • Replace ASCII-only character classes ([a-zA-Z0-9_]) with Unicode-aware classes (\p{L}, \p{N}) in lexer word patterns and parser identifier regexes for the ANSI and Postgres dialects
  • Fixes panics when linting SQL with unquoted multibyte identifiers (e.g. 日本語, café, über)
  • Adds test fixtures for Unicode identifiers across ANSI, Postgres, and DuckDB dialects
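
To illustrate the difference between the two kinds of character class, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: the function names are hypothetical, and `char::is_alphabetic` follows the Unicode `Alphabetic` property, which is slightly broader than `\p{L}`, so treat it as an approximation of the regex change described above.

```rust
/// ASCII-only check, analogous to the character class [a-zA-Z0-9_].
fn is_ascii_word_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Unicode-aware check, roughly analogous to [\p{L}\p{N}_].
/// Assumption: std's `is_alphabetic` (Unicode `Alphabetic` property) stands in
/// for `\p{L}`; `is_numeric` covers the Nd, Nl and No categories, like `\p{N}`.
fn is_unicode_word_char(c: char) -> bool {
    c.is_alphabetic() || c.is_numeric() || c == '_'
}

/// Returns true when every character of a non-empty `word` satisfies `pred`.
fn matches_whole(word: &str, pred: fn(char) -> bool) -> bool {
    !word.is_empty() && word.chars().all(pred)
}

fn main() {
    for ident in ["users", "café", "über", "日本語"] {
        println!(
            "{ident}: ascii={} unicode={}",
            matches_whole(ident, is_ascii_word_char),
            matches_whole(ident, is_unicode_word_char),
        );
    }
}
```

With the ASCII-only predicate, only `users` is accepted as a whole word; the Unicode-aware predicate accepts all four identifiers.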

Closes #2067

@github-actions

github-actions bot commented Feb 5, 2026

Benchmark for cdcdbfe

| Test | Base | PR | % |
|------|------|----|---|
| DepthMap::from_parent | 53.9±0.83µs | 54.4±1.26µs | +0.93% |
| fix_complex_query | 12.5±0.26ms | 12.6±0.30ms | +0.80% |
| fix_superlong | 197.3±9.91ms | 199.6±8.56ms | +1.17% |
| parse_complex_query | 4.1±0.04µs | 4.2±0.03µs | +2.44% |
| parse_expression_recursion | 7.1±0.09µs | 7.3±0.07µs | +2.82% |
| parse_simple_query | 1057.8±17.41ns | 1053.3±16.45ns | -0.43% |

@benfdking benfdking force-pushed the fix/unicode-identifiers branch from cefa1ee to f87ada2 Compare February 5, 2026 21:35
@github-actions

github-actions bot commented Feb 5, 2026

Benchmark for f9743fa

| Test | Base | PR | % |
|------|------|----|---|
| DepthMap::from_parent | 52.6±0.66µs | 54.0±2.61µs | +2.66% |
| fix_complex_query | 12.9±0.10ms | 12.9±0.39ms | 0.00% |
| fix_superlong | 225.6±8.02ms | 232.3±10.61ms | +2.97% |
| parse_complex_query | 4.1±0.06µs | 4.2±0.06µs | +2.44% |
| parse_expression_recursion | 7.0±0.10µs | 7.2±0.15µs | +2.86% |
| parse_simple_query | 1039.2±19.62ns | 1039.4±36.25ns | +0.02% |

Replace ASCII-only character classes ([a-zA-Z0-9_]) with Unicode-aware
classes (\p{L}, \p{N}) in lexer word patterns and parser identifier
regexes for the ANSI and Postgres dialects. This fixes panics when
linting SQL with unquoted multibyte identifiers (e.g. Japanese, French,
German characters).

Closes #2067
@benfdking benfdking force-pushed the fix/unicode-identifiers branch from f87ada2 to ce4af88 Compare February 11, 2026 09:02
@github-actions

Benchmark for aaaa31f

| Test | Base | PR | % |
|------|------|----|---|
| DepthMap::from_parent | 52.3±0.70µs | 52.9±0.84µs | +1.15% |
| fix_complex_query | 12.4±0.19ms | 12.3±0.11ms | -0.81% |
| fix_superlong | 174.8±4.72ms | 168.5±9.49ms | -3.60% |
| parse_complex_query | 4.1±0.04µs | 4.2±0.04µs | +2.44% |
| parse_expression_recursion | 7.1±0.08µs | 7.1±0.11µs | 0.00% |
| parse_simple_query | 1074.6±22.27ns | 1071.5±19.80ns | -0.29% |

Development

Successfully merging this pull request may close these issues.

[Bug]: Unexpected exception with unquoted multibyte identifier
