Conversation
There was a problem hiding this comment.
For the most part, I tried to leverage/follow the hstore implementation as an example, as it seemed like the closest. Granted, it wasn't a perfect 1:1 but overall I think it tracks.
In the tests, I tried to be as comprehensive as possible. Initially I had considered incorporating the many different permutations found in the core PG test cases. Ultimately, though, settled on what I felt was a good cross section, trying to hit the core while also including sufficient coverage of the edges.
There were some cases that were not supported by CRDB, which have been appropriately noted and skipped. These were entirely related to how CRDB parses/handles escapes. These differences were confirmed via testing with CRDB as well as identifying how they are handled in the CRDB source.
I intentionally did not want to include tsquery as part of these changes. While I think it's obviously important that tsvector and tsquery both be available. It didn't seem like tsvector would be completely useless in isolation. And it's obviously a prerequisite for working with tsquery anyway. I'm happy to follow up with a second PR to introduce the other type if that's desired.
| _, err = pgconn.ConnectConfig(context.Background(), config) | ||
| require.Error(t, err, "connect should return error for invalid token") | ||
| } | ||
|
|
There was a problem hiding this comment.
This was caught/fixed via the linter.
| t.Run("PostgreSQL", func(t *testing.T) { | ||
| skipCockroachDB(t, "CockroachDB does not support these escape sequences in tsvector") |
There was a problem hiding this comment.
So, interestingly, the CRDB parser does not handle the '' case for escaping single quotes. As I was looking in to it, I found that it's because the parser doesn't do any kind of look ahead to check for the value of the next character. Therefore, if it encounters a second ' it'll assume that it's completed the parsing of the word portion of the lexeme. We do, however, make sure to handle/support that case as well as the \' case as Postgres allows for both.
| type TSVector struct { | ||
| Lexemes []TSVectorLexeme | ||
| Valid bool | ||
| } |
There was a problem hiding this comment.
The one thing that I waffled a bit back and forth on was building out a composite type like this. Candidly, it felt a little clumsy to work with in practice, but after some chewing I figured it made sense.
Initially, I thought perhaps going with type TSVector []TSVectorLexeme might have been a better approach, but ultimately decided against it to ensure the explicit inclusion of Valid.
| case '\'': | ||
| // Escaped quote ('') — write a literal single quote | ||
| if !p.atEnd() && p.peek() == '\'' { | ||
| p.consume() | ||
| buf.WriteByte('\'') | ||
| } else { | ||
| // Closing quote — lexeme is complete | ||
| return buf.String(), nil | ||
| } |
There was a problem hiding this comment.
Here is where we handle that look ahead that CRDB doesn't to determine if we are at the end of the lexeme or escaping a single-quote.
|
It seems reasonable. Though to be honest, I lack context for how |
Implement PostgreSQL `tsvector` type with support for: - Lexemes with positions and weights (A, B, C, D) - Binary and text format encoding/decoding - Quote and backslash escape handling - Array type support - CopyFrom operations Note: Some escape sequences (doubled quotes, backslash escapes) are PostgreSQL-specific and not supported by CockroachDB. Resolves jackc#2483
29c0f91 to
ea6b093
Compare
|
Yeah, to be fair, the application side of it isn't always clear to me either. Though, in the context of serialize/deserialize support for replication/snapshotting/etc. minimally supporting |
|
LGTM - thanks |

Implement PostgreSQL
tsvectortype with support for:Note: Some escape sequences (doubled quotes, backslash escapes) are PostgreSQL-specific and not supported by CockroachDB.
Resolves #2483