Add tsvector type support #2510

Merged
jackc merged 1 commit into jackc:master from abrightwell:abrightwell-tsvector
Mar 9, 2026

@abrightwell
Contributor

Implement PostgreSQL tsvector type with support for:

  • Lexemes with positions and weights (A, B, C, D)
  • Binary and text format encoding/decoding
  • Quote and backslash escape handling
  • Array type support
  • CopyFrom operations

Note: Some escape sequences (doubled quotes, backslash escapes) are PostgreSQL-specific and not supported by CockroachDB.

Resolves #2483
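
As a rough illustration of the text format described above, here is a minimal sketch of rendering lexemes with positions and weights. It is not the code from this PR: the `Lexeme` and `Position` types and both function names are made up for the example. It assumes PostgreSQL-style output, where single quotes and backslashes inside a word are doubled and the default weight D is left off.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Position and Lexeme are hypothetical types for this sketch only;
// they are not the types introduced by the PR.
type Position struct {
	Pos    uint16 // 1-based position of the word in the document
	Weight byte   // 'A', 'B', 'C', or 'D' ('D' is the default)
}

type Lexeme struct {
	Word      string
	Positions []Position
}

// formatLexeme renders one lexeme in tsvector text form, e.g. 'fat':2A,4.
func formatLexeme(l Lexeme) string {
	var sb strings.Builder
	sb.WriteByte('\'')
	for i := 0; i < len(l.Word); i++ {
		c := l.Word[i]
		// Double quotes and backslashes so the word can be parsed back.
		if c == '\'' || c == '\\' {
			sb.WriteByte(c)
		}
		sb.WriteByte(c)
	}
	sb.WriteByte('\'')
	for i, p := range l.Positions {
		if i == 0 {
			sb.WriteByte(':')
		} else {
			sb.WriteByte(',')
		}
		sb.WriteString(strconv.Itoa(int(p.Pos)))
		// The default weight D is omitted from the text form.
		if p.Weight != 'D' {
			sb.WriteByte(p.Weight)
		}
	}
	return sb.String()
}

// formatTSVector joins the rendered lexemes with single spaces.
func formatTSVector(lexemes []Lexeme) string {
	parts := make([]string, len(lexemes))
	for i, l := range lexemes {
		parts[i] = formatLexeme(l)
	}
	return strings.Join(parts, " ")
}

func main() {
	v := []Lexeme{
		{Word: "cat", Positions: []Position{{Pos: 3, Weight: 'D'}}},
		{Word: "fat", Positions: []Position{{Pos: 2, Weight: 'A'}, {Pos: 4, Weight: 'B'}}},
		{Word: "it's"},
	}
	fmt.Println(formatTSVector(v))
}
```

The doubled-quote form in the last lexeme is one of the escape sequences the note above flags as PostgreSQL-specific.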

Contributor Author

@abrightwell abrightwell left a comment

For the most part, I tried to follow the hstore implementation as an example, as it seemed like the closest analogue. Granted, it wasn't a perfect 1:1, but overall I think it tracks.

In the tests, I tried to be as comprehensive as possible. Initially I had considered incorporating the many different permutations found in the core PG test cases. Ultimately, though, I settled on what I felt was a good cross section, trying to hit the core while also including sufficient coverage of the edges.

Some cases are not supported by CRDB; these have been noted and skipped. They all relate to how CRDB parses/handles escapes. I confirmed the differences by testing against CRDB and by checking how the CRDB source handles them.

I intentionally did not include tsquery as part of these changes. While I think it's obviously important that both tsvector and tsquery be available, tsvector didn't seem useless in isolation, and it's a prerequisite for working with tsquery anyway. I'm happy to follow up with a second PR to introduce the other type if that's desired.

	_, err = pgconn.ConnectConfig(context.Background(), config)
	require.Error(t, err, "connect should return error for invalid token")
}

Contributor Author

This was caught/fixed via the linter.

Comment on lines +429 to +430
t.Run("PostgreSQL", func(t *testing.T) {
	skipCockroachDB(t, "CockroachDB does not support these escape sequences in tsvector")
Contributor Author

So, interestingly, the CRDB parser does not handle the '' case for escaping single quotes. As I was looking into it, I found that it's because the parser doesn't do any kind of lookahead to check the value of the next character. Therefore, if it encounters a second ', it assumes that it has finished parsing the word portion of the lexeme. We do, however, make sure to support that case as well as the \' case, as Postgres allows both.

Comment on lines +23 to +26
type TSVector struct {
	Lexemes []TSVectorLexeme
	Valid   bool
}
Contributor Author

The one thing that I waffled a bit back and forth on was building out a composite type like this. Candidly, it felt a little clumsy to work with in practice, but after some chewing I figured it made sense.

Initially, I thought perhaps going with type TSVector []TSVectorLexeme might have been a better approach, but ultimately decided against it to ensure the explicit inclusion of Valid.
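
To make the trade-off concrete, here is a small sketch of what the explicit `Valid` field buys. The `TSVector` struct matches the hunk above; the single `Word` field on `TSVectorLexeme` and the `describe` helper are assumptions for illustration only. With a bare slice type, NULL and an empty tsvector could only be told apart by nil versus empty slice, which is easy to lose in Go.

```go
package main

import "fmt"

// TSVectorLexeme's fields are hypothetical here; only the TSVector
// struct itself appears in the hunk under review.
type TSVectorLexeme struct {
	Word string
}

type TSVector struct {
	Lexemes []TSVectorLexeme
	Valid   bool
}

// describe is a made-up helper that distinguishes SQL NULL from an
// empty, non-NULL tsvector.
func describe(v TSVector) string {
	switch {
	case !v.Valid:
		return "NULL"
	case len(v.Lexemes) == 0:
		return "empty tsvector"
	default:
		return fmt.Sprintf("%d lexeme(s)", len(v.Lexemes))
	}
}

func main() {
	fmt.Println(describe(TSVector{}))            // zero value is NULL
	fmt.Println(describe(TSVector{Valid: true})) // non-NULL but empty
	fmt.Println(describe(TSVector{
		Lexemes: []TSVectorLexeme{{Word: "cat"}},
		Valid:   true,
	}))
}
```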

Comment on lines +387 to +395
case '\'':
	// Escaped quote (''): write a literal single quote
	if !p.atEnd() && p.peek() == '\'' {
		p.consume()
		buf.WriteByte('\'')
	} else {
		// Closing quote: lexeme is complete
		return buf.String(), nil
	}
Contributor Author

This is where we do the lookahead that CRDB doesn't, to determine whether we are at the end of the lexeme or escaping a single quote.
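
For anyone reading along without the full diff, the lookahead idea can be sketched as a standalone function. This is an illustration under assumptions, not the parser from the PR: `parseQuotedWord` is a made-up name, and it works on a plain string rather than the `p.peek()`/`p.consume()` parser shown in the hunk. It accepts both the '' and \' escapes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseQuotedWord is a hypothetical helper. It reads a single-quoted
// lexeme word from the start of s and returns the unescaped word along
// with whatever input follows the closing quote.
func parseQuotedWord(s string) (word, rest string, err error) {
	if len(s) == 0 || s[0] != '\'' {
		return "", "", errors.New("expected opening quote")
	}
	var buf strings.Builder
	for i := 1; i < len(s); i++ {
		switch c := s[i]; c {
		case '\\':
			// Backslash escape: take the next byte literally.
			if i+1 >= len(s) {
				return "", "", errors.New("dangling backslash")
			}
			i++
			buf.WriteByte(s[i])
		case '\'':
			// Look ahead one byte: '' is an escaped quote,
			// while a lone ' closes the word.
			if i+1 < len(s) && s[i+1] == '\'' {
				i++
				buf.WriteByte('\'')
			} else {
				return buf.String(), s[i+1:], nil
			}
		default:
			buf.WriteByte(c)
		}
	}
	return "", "", errors.New("unterminated quoted lexeme")
}

func main() {
	for _, in := range []string{"'it''s':1", `'a\'b' 'c'`, "'abc"} {
		word, rest, err := parseQuotedWord(in)
		fmt.Printf("%q -> word=%q rest=%q err=%v\n", in, word, rest, err)
	}
}
```

A parser without the one-byte lookahead would stop at the first of the two quotes in `'it''s'` and treat `it` as the whole word, which is the CRDB behavior described above.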

@jackc
Owner

jackc commented Mar 7, 2026

It seems reasonable. Though to be honest, I lack context for how tsvector and tsquery are used outside of the database. Whenever I've used them, values of those types were only used internally to the PG server.

@abrightwell abrightwell force-pushed the abrightwell-tsvector branch from 29c0f91 to ea6b093 on March 8, 2026 13:43

@abrightwell
Contributor Author

Yeah, to be fair, the application side of it isn't always clear to me either. Though, in the context of serialize/deserialize support for replication, snapshotting, etc., minimally supporting tsvector seemed like a good starting point for the whole 'tsearch' feature. With tsquery, honestly, I'm not convinced there are many use cases where it would be a column type, as my interactions with it have always been entirely at query time and never persisted.

@abrightwell
Contributor Author

The initial workflow run failed on the Check formatting step. That's been corrected.

Then, on my fork, it made it through formatting but failed on Test (1.24, 15). Basically, the run was SIGTERMed during that particular job. My suspicion is that it's an OOM issue in the container, as I know that the -race flag can drive memory usage quite high.

Locally, I was able to reproduce the SIGTERM case while observing docker stats. Sometimes memory would max out and the run would fail; other times it would get VERY close to the limit and eventually pass:

❯ devcontainer exec --workspace-folder . go version
go version go1.24.12 linux/arm64

❯ devcontainer exec --workspace-folder . ./test.sh pg15 -parallel=1 -race ./...
==> Testing against PostgreSQL 15 (port 5415)
ok      github.com/jackc/pgx/v5 17.651s
?       github.com/jackc/pgx/v5/examples/chat   [no test files]
?       github.com/jackc/pgx/v5/examples/todo   [no test files]
?       github.com/jackc/pgx/v5/examples/url_shortener  [no test files]
?       github.com/jackc/pgx/v5/internal/faultyconn     [no test files]
ok      github.com/jackc/pgx/v5/internal/iobufpool      1.148s
ok      github.com/jackc/pgx/v5/internal/pgio   1.007s
ok      github.com/jackc/pgx/v5/internal/pgmock 1.014s
ok      github.com/jackc/pgx/v5/internal/sanitize       1.009s
?       github.com/jackc/pgx/v5/internal/stmtcache      [no test files]
?       github.com/jackc/pgx/v5/log/testingadapter      [no test files]
ok      github.com/jackc/pgx/v5/multitracer     1.009s
ok      github.com/jackc/pgx/v5/pgconn  55.439s
ok      github.com/jackc/pgx/v5/pgconn/ctxwatch 2.103s
ok      github.com/jackc/pgx/v5/pgconn/internal/bgreader        5.001s
ok      github.com/jackc/pgx/v5/pgproto3        86.007s
?       github.com/jackc/pgx/v5/pgproto3/example/pgfortune      [no test files]
ok      github.com/jackc/pgx/v5/pgtype  8.784s
ok      github.com/jackc/pgx/v5/pgtype/zeronull 1.086s
ok      github.com/jackc/pgx/v5/pgxpool 23.783s
?       github.com/jackc/pgx/v5/pgxtest [no test files]
ok      github.com/jackc/pgx/v5/stdlib  7.174s
?       github.com/jackc/pgx/v5/testsetup       [no test files]
ok      github.com/jackc/pgx/v5/tracelog        2.684s
==> Tests passed against PostgreSQL 15

Interestingly, these passed for pretty much every other configuration in the CI matrix. So it's likely just one of those 'things'... 🤷

@jackc jackc merged commit 6e1e9eb into jackc:master Mar 9, 2026
14 checks passed
@jackc
Owner

jackc commented Mar 9, 2026

LGTM - thanks

Development

Successfully merging this pull request may close these issues.

pgx.CopyFromRows does not work with tsvector column in destination table