dolthub/doltgresql#1648: fix VALUES clause type inference #2187
codeaucafe wants to merge 12 commits into dolthub:main
Conversation
Add ResolveValuesTypes analyzer rule to compute common types across all VALUES rows, not just the first row. Previously, DoltgreSQL would incorrectly use only the first value to determine column types, causing errors when subsequent values had different types, as in VALUES(1),(2.01),(3).

Changes:
- Two-pass transformation strategy: the first pass transforms VDT nodes with unified types, the second pass updates GetField expressions in parent nodes
- Use FindCommonType() to resolve types per PostgreSQL rules
- Apply ImplicitCast for type conversions and UnknownCoercion for unknown-typed literals
- Handle aggregates via getSourceSchema()
- Add UnknownCoercion expression type for unknown -> target coercion without conversion

Tests:
- Add 4 bats integration tests for mixed int/decimal VALUES
- Add 3 Go test cases covering int-first, decimal-first, SUM aggregate, and multi-column scenarios

Refs: dolthub#1648
Greetings! I'll dig into this for the review. As for the comment regarding documentation: in general, the majority of the code is written to Postgres 15 specifications, though we still add features from newer versions when applicable. Chasing the newest version is a never-ending target, but we'll definitely update to a newer target relatively soon. Regarding documentation pinning, we should never reference the current docs, as those change over time. We definitely don't want critical information changing unknowingly. As long as the version is pinned for the documentation, that should be fine.
Thank you. Yeah, I saw the code was mostly written toward the psql15 spec, but I wasn't sure if I missed any updates or other docs indicating we should write toward a later version's spec (did much even change from the psql15 spec to later versions' specs?). Thank you for clarifying. Also, thanks in advance for reviewing the PR.
Hydrocharged
left a comment
Within my PR comments, I tend to be a bit wordy to try and help with the general direction for future PRs. For this one though, it's a good first draft! I stepped through and modified the code locally to make sure I fully understood what was happening, and the intention behind most of the changes so I could give a decent review. I ended up making some modifications to reduce the overall size of the PR as well (with a lot of the comments left here added in), which you can view here:
Let me know once the feedback has been addressed, and I'll take another look at it.
```go
// Unknown type can be coerced to any type without explicit cast
// Use UnknownCoercion to report the target type while passing through values
newTuples[rowIdx][colIdx] = pgexprs.NewUnknownCoercion(expr, toType)
```
This seems like an error. The value returned by the expression must align with the reported type. If expr returns a decimal.Decimal, then the reported type must be NUMERIC, otherwise we will panic. We always make the assumption that the correct type is reported for the value (assuming we're working with DoltgresType), and this breaks that assumption, which will cause a panic (potentially far removed from the actual source of incongruity).
In addition, from debugging through the tests, this branch is only called with an expression of type unknown that converts to type text, which both happen to have the same underlying data type string.
After looking through the code a bit more, it looks like this new expression type was added due to a bad practice that we are using throughout the codebase. It is true that we can interpret the unknown type as any type (specifically through an I/O cast), and we're correctly doing that in every place that retrieves a cast by checking if the returned cast function was nil. Considering we are doing it everywhere immediately after the function call, and it's universal everywhere that we want a cast, the logic should be in the function call itself. This extends not just to implicit casts, but assignment and explicit too.
server/expression/implicit_cast.go
Outdated
```go
// UnknownCoercion wraps an expression with unknown type to coerce it to a target type.
// Unlike ImplicitCast, this doesn't perform any actual conversion - it just changes the
// reported type since unknown type literals can coerce to any type in PostgreSQL.
type UnknownCoercion struct {
```
This should be in a different file. We sometimes put multiple nodes, expressions, or passes in a single file, but only when they are deeply tied (something like Value and ValueParam for example). In other projects (such as go-mysql-server) we are far looser on this and made it quite hard to parse the project without the search functions of an IDE, so with Doltgres we're attempting to right some of those wrongs.
```go
// This ensures VALUES(1),(2.01),(3) correctly infers numeric type, not integer.
func ResolveValuesTypes(ctx *sql.Context, a *analyzer.Analyzer, node sql.Node, scope *plan.Scope, selector analyzer.RuleSelector, qFlags *sql.QueryFlags) (sql.Node, transform.TreeIdentity, error) {
	// Track which VDTs we transform so we can update parent nodes
	transformedVDTs := make(map[*plan.ValueDerivedTable]sql.Schema)
```
As a map value, this is never used. updateGetFieldTypes does not use the parameter at all. This is only used for the len(transformedVDTs) > 0 check below.
```go
func getSourceSchema(n sql.Node) sql.Schema {
	switch node := n.(type) {
	case *plan.GroupBy:
		// GroupBy's Schema() returns aggregate output, but we need the source schema
		return getSourceSchema(node.Child)
	case *plan.Filter:
		return getSourceSchema(node.Child)
	case *plan.Sort:
		return getSourceSchema(node.Child)
	case *plan.Limit:
		return getSourceSchema(node.Child)
	case *plan.Offset:
		return getSourceSchema(node.Child)
	case *plan.Distinct:
		return getSourceSchema(node.Child)
	case *plan.SubqueryAlias:
		// SubqueryAlias wraps a VDT - get the child's schema
		return node.Child.Schema()
	case *plan.ValueDerivedTable:
		return node.Schema()
	default:
		// For other nodes, return their schema directly
		return n.Schema()
	}
}
```
Functions like this are a bit iffy, as they rely on us to update this function whenever we add a new node in go-mysql-server (GMS). There's no indication in GMS that we need to head to Doltgres to update anything, so this could suddenly fail for untested queries on a seemingly unrelated change. This kind of function should reside in GMS, unless we are extremely confident that we will never need to update this function in the future.
For this specific function, since we seem to be looking for either a *plan.SubqueryAlias or *plan.ValueDerivedTable node, and everything else seems to embed a UnaryNode, we could instead just search for those two, and check if it's a UnaryNode, navigating through the child if it is.
```go
} else {
	// candidateType can convert to typ, but not vice versa, so typ is more general
	// Per PostgreSQL docs: "If the resolution type can be implicitly converted to the
	// other type but not vice-versa, select the other type as the new resolution type."
	candidateType = typ
	if candidateType.IsPreferred {
		candidateType = typ
		// "Then, if the new resolution type is preferred, stop considering further inputs."
		preferredTypeFound = true
	}
} else {
	return nil, errors.Errorf("found another preferred candidate type")
}
if preferredTypeFound {
	break
```
Since we're breaking if we've found the preferred type, it makes more sense to eliminate the boolean and return once we've found the preferred.
In the old code, we continued to go through the remaining types and check for implicit casts, which could return an error early. For this code, once we've finished the loop, we should do a 2nd loop that ensures that all types have an implicit cast (skipping unknown, the same types will return an identity cast).
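The restructuring suggested above — return (or break) as soon as the preferred type is found, then run a second loop verifying that every input implicitly casts to the resolution type — can be sketched with a toy model. The `Type` struct, `canImplicitCast` oracle, and the int4/numeric rule below are illustrative assumptions, not the real Doltgres implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Type is a toy stand-in for a DoltgresType.
type Type struct {
	Name      string
	Preferred bool
}

// canImplicitCast is a hypothetical oracle: can `from` implicitly cast to `to`?
// In this toy model, int4 widens to numeric but not the reverse, and unknown
// casts to anything.
func canImplicitCast(from, to Type) bool {
	if from.Name == to.Name || from.Name == "unknown" {
		return true
	}
	return from.Name == "int4" && to.Name == "numeric"
}

// findCommonType stops considering inputs once a preferred resolution type is
// found, then verifies in a second loop that all inputs cast to the result.
func findCommonType(types []Type) (Type, error) {
	candidate := types[0]
	for _, typ := range types[1:] {
		if typ.Name == candidate.Name {
			continue
		}
		if canImplicitCast(typ, candidate) && !canImplicitCast(candidate, typ) {
			continue // candidate is already the more general type
		}
		if canImplicitCast(candidate, typ) {
			candidate = typ
			if candidate.Preferred {
				break // "stop considering further inputs"
			}
		}
	}
	// Second pass: every input must implicitly cast to the resolution type
	// (unknown is skipped; identical types resolve via the identity cast).
	for _, typ := range types {
		if typ.Name == "unknown" {
			continue
		}
		if !canImplicitCast(typ, candidate) {
			return Type{}, errors.New(typ.Name + " cannot be cast to " + candidate.Name)
		}
	}
	return candidate, nil
}

func main() {
	common, err := findCommonType([]Type{{Name: "int4"}, {Name: "numeric"}, {Name: "int4"}})
	fmt.Println(common.Name, err)
}
```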
```go
		columnTypes[colIdx][rowIdx] = pgType
	} else {
		// Non-DoltgresType encountered - should have been sanitized
		// Return unchanged and let TypeSanitizer handle it
```
This analyzer step comes after the type sanitizer, so it's an error if we still have GMS types here.
```go
case *plan.Distinct:
	return getSourceSchema(node.Child)
case *plan.SubqueryAlias:
	// SubqueryAlias wraps a VDT - get the child's schema
```
Is this always true? This seems like too general of a node for this assumption to hold true for all possible queries.
```go
}

// updateGetFieldExpr updates a GetField expression to use the correct type from the child schema
func updateGetFieldExpr(expr sql.Expression, childSchema sql.Schema) (sql.Expression, bool, error) {
```
An easier (and more robust) way of finding a GetField expression would be to use the transform functions (such as pgtransform.NodeExprsWithOpaque).
```go
if isVDT {
	// Use WithExpressions to preserve all VDT fields (name, columns, id, cols)
	// while updating the expressions and recalculating the schema
	newNode, err := vdt.WithExpressions(flatExprs...)
	if err != nil {
		return nil, transform.NewTree, err
	}
	return newNode, transform.NewTree, nil
}

// For standalone Values node, use WithExpressions as well
newNode, err := values.WithExpressions(flatExprs...)
if err != nil {
	return nil, transform.NewTree, err
}
return newNode, transform.NewTree, nil
```
Since both ValueDerivedTable and Values implement sql.Expressioner, we can simplify this by assigning them to a single Expressioner variable and keeping just one code path.
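The collapse into one path can be sketched with a toy interface. `Expressioner`, the node structs, and the string return values below are illustrative stand-ins for the GMS types, not the real signatures:

```go
package main

import "fmt"

// Expressioner is a toy stand-in for sql.Expressioner: anything that can be
// rebuilt from a new expression list.
type Expressioner interface {
	WithExpressions(exprs ...string) (string, error)
}

type ValueDerivedTable struct{}

func (ValueDerivedTable) WithExpressions(exprs ...string) (string, error) {
	return fmt.Sprintf("VDT%v", exprs), nil
}

type Values struct{}

func (Values) WithExpressions(exprs ...string) (string, error) {
	return fmt.Sprintf("Values%v", exprs), nil
}

// rebuild replaces the separate VDT / Values branches with a single type
// assertion against the shared interface.
func rebuild(n any, exprs []string) (string, error) {
	if e, ok := n.(Expressioner); ok {
		return e.WithExpressions(exprs...)
	}
	return "", fmt.Errorf("node does not implement Expressioner")
}

func main() {
	out, _ := rebuild(ValueDerivedTable{}, []string{"1", "2.01"})
	fmt.Println(out)
	out, _ = rebuild(Values{}, []string{"3"})
	fmt.Println(out)
}
```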
```go
		{Numeric("3")},
	},
},
{
```
We can never have too many tests. These existing tests are fine, but there are permutations mentioned in the comments and code of resolve_values_types.go that we aren't testing here, such as GROUP BY. There's also DISTINCT, LIMIT, etc. Subqueries should be tested too since there is specific logic to handle SubqueryAlias, and that should also be checked to ensure that LIMIT, ORDER BY, etc. work within the subquery. It's possible that some of these things may not work in general (I don't think we have a test with a LIMIT within a subquery), but in those cases they should just be skipped tests so we know to fix it later (and that's assuming that other logic is the cause, if it's due to the PR then it should be fixed or the appropriate TODO should be left within the code itself explaining what's missing).
Also, regarding the first question:
I didn't want to answer before reviewing, to make sure that I understood everything correctly, but the approach of adding another analyzer step works perfectly for this case. In general, if we're using the
@Hydrocharged the detailed comments are fantastic (what I wish all reviews were like). I'll get to addressing them sometime this week. Thank you!
Update: still going through these requested changes. Had other things I needed to handle first, so I'll probably finish this sometime during the next weekend. Thank you. Cheers.
Update: got sniped by work and life, so I didn't finish as much as I thought I would last weekend, but I'm still slowly going through your change reqs/comments. Should get something up by end of weekend, I hope.
Update: made it about halfway through over the weekend/today, but didn't finish since I got sick over the weekend from a concert. Should actually be finished this week though, haha. Thanks for being flexible and still watching this PR (as seen per the response emojis). Thanks again!
No worries, no rush on our end!
Refactor ResolveValuesTypes analyzer rule to use a simpler implementation based on PR review feedback. Changes centralize unknown type handling and eliminate fragile tree traversal logic.

Changes:
- Use TableId-based lookup instead of recursive tree traversal to update GetField types, eliminating dependency on specific node types like SubqueryAlias
- Leverage pgtransform.NodeExprsWithOpaque for expression updates instead of manual recursion through four helper functions
- Move unknown type handling into cast functions (GetExplicitCast, GetAssignmentCast, GetImplicitCast) to eliminate scattered checks across call sites
- Add requiresCasts return value to FindCommonType to optimize the case where no type conversion is needed
- Simplify VALUES node transformation using the sql.Expressioner interface to handle both ValueDerivedTable and Values uniformly
- Add comprehensive test coverage for VALUES with GROUP BY, DISTINCT, LIMIT/OFFSET, ORDER BY, subqueries, WHERE clause, aggregates, and combined operations

This refactoring reduces code complexity from ~300 lines to ~180 lines while improving maintainability and eliminating potential bugs from manual tree walking.

Refs: dolthub#1648
Update error messages in resolve_values_types.go to follow existing error messaging conventions in the analyzer:
- Add a "VALUES:" prefix to match the pattern used in analyzer code files, such as in assign_update_casts.go (UPDATE:) and assign_insert_casts.go (INSERT:)
- Also fix the return value of n to nil when returning an error

Refs: dolthub#1648
FYI, pushed some changes. Will next work on adding/fixing tests, as well as fixing a shallow-copy issue in resolve_values_types.go that I noticed right before signing off for tonight.
I think most of the work is done; I just need to go through the tests (add and/or refactor). Then I can take time to review to make sure I shouldn't refactor any of the implementations.
Add more tests for VALUES clause resolution following PR review comments, as well as additional edge cases. Tests here verify mixed-type column inference, NULL handling, error cases, and integration with SQL operations like GROUP BY, DISTINCT, LIMIT, ORDER BY, WHERE, aggregates, CTEs, and JOINs. Refs: dolthub#1648
FYI, pushed my tests, but I just now realized my original comment didn't get entered, I guess. I believe it was about the 3 tests* I added that seemed to be failing from my code changes. Will verify tomorrow and provide options. Cheers.
Make sure you pull in the latest changes from
Will do. Also realized I had a typo: "3 years" should have said "3 tests" 🤦. I think I figured out why and have a fix. I'll try to get the new code pushed tonight.
Fix two bugs in the `ResolveValuesTypes` func that were introduced by our initial implementation. Both bugs only showed up when VALUES type inference interacted with JOINs or aggregates:

- Bug 1: JOIN GetField index: The original code used gf.Index() - 1 to look up columns in VDT schemas, but GetField indices are global across joined tables (e.g., a.n=0, b.id=1, b.label=2), not per-table offsets. This caused out-of-bounds errors in JOINs. Fixed by matching columns by name instead of calculating indices.
- Bug 2: Aggregate type propagation: The first pass updates GetFields that read directly from a VDT, BUT when a type change ripples through an aggregate (e.g., int4 to numeric inside MIN), the aggregate return type changes while parent nodes still have GetFields with the old type. This can cause runtime panics from type/value mismatches. Fixed by adding a second pass that syncs each GetField type with the child node's actual schema.

Test updates: SUM now returns numeric instead of float64 when operating on numeric inputs (matches PostgreSQL behavior). Unskipped 3 tests (2 JOIN, 1 MIN/MAX) that now pass.

Refs: dolthub#1648
…1648-values-clause-type-inference
# Conflicts:
#	server/expression/explicit_cast.go
FYI, I think this is a bit better to review now: I fixed those 3 tests broken by my implementation (2 JOIN, 1 MIN/MAX).

Note, there are 3 skipped tests, but those are the pre-existing subquery bugs I found on main prior to my code. Given the 3 skipped tests appear to be out of scope of the GH issue this PR attempts to fix (resolving VALUES types), I avoided working on them in this PR. Specifically, the 3 skipped tests have the same problem, which is that ops inside a subquery over VALUES (arithmetic, LIMIT, ORDER BY) are silently ignored (no error), so I guess the nodes are not getting applied. From existing GitHub issues, I don't think this issue has been created yet, so I'll make an issue once this PR is good and merged, and then I can reference those skipped tests for repro.

For info on the latest commit to fix the 3 tests I realized were failing due to my implementation (after fixing according to your PR comments): The JOIN tests were failing because GetField indices are global across all tables in a query, not per-table offsets into the VDT schema (I think this is the case). The original code did

The MIN/MAX test was panicking because the 1st pass updated GetField types inside the aggregate (e.g., int4 to numeric); this changed the aggregate's return type, but the parent Project node's GetFields still had the old type. So, I added a second pass that walks the tree and syncs each GetField type with the child node's actual schema.
Would you say that it's in a complete state? I was waiting until it was fully ready before giving it the second pass.
Remove inline cast logic from ExplicitCast.Eval since it's now being handled by getRecordCast() in cast.go, and called from GetExplicitCast before returning nil. Also, remove duplicate UnknownLiteralCast fallback in GetImplicitCast and unused core import from explicit_cast.go. Last, clean up test name; don't include GH issue number. Refs: dolthub#1648
codeaucafe
left a comment
@Hydrocharged I think this PR is good to review now. Cleaned up some small/basic things I missed 🤦♂️, so it should be complete enough for another pass.
I think it would be good to add more exhaustive BATS tests like the unit test additions, so I'm going to work on that. Feel free to wait to review until that's done, unless you think I shouldn't even write more bats tests for this PR.
I should be able to push up more bats tests tomorrow, FYI.
I'll wait for the additional tests, since they may uncover extra points that should be addressed. Also, we can never have too many tests!
Thanks @Hydrocharged, I just pushed the BATS tests, so I think it's in a state to finally be reviewed again. Thank you for your patience. No rush on reviewing this since it's the weekend.
One final thing. I just noticed that tests weren't yet enabled on the repository (that's my mistake), so I just enabled them. I'll make sure those are all passing before the final review.
No problem. I just pushed up another change to fix the format workflow failure. Thanks again for taking the time to review this and answer all my questions.
Hydrocharged
left a comment
Almost there! Just a few small corrections
```go
// Find the matching column by name and update if the type changed
gfNameLower := strings.ToLower(gf.Name())
for _, col := range childSchema {
	if strings.ToLower(col.Name) == gfNameLower && gf.Type() != col.Type {
```
We shouldn't use strings.ToLower as names are case-sensitive. In Postgres, if you type CoLnAmE and cOlNaMe, then they both resolve to colname. However, "COLname" will preserve the case since it's inside of quotes. This could be confirmed with a test using the same name and multiple different casings. This goes for all strings.ToLower uses in this file.
Reorg VALUES bats tests into their own values.bats, out of types.bats. Also, inline the getFieldWithType helper func, improve error messages in transformValuesNode, and add test cases for case-sensitive quoted column names and case-differing aggregate columns. Refs: dolthub#1648
@Hydrocharged thank you for reviewing. I refactored according to all your suggestions EXCEPT the one regarding not using

I hope to get something up before end of week, so feel free to hold off on reviewing again until I get that pushed up. Thanks again for taking the time to review my PR.
Also, make sure to resolve conversations on GitHub as they're resolved in your code.
@Hydrocharged I did some more digging into the

So, in the second pass of ResolveValuesTypes, when the owning node is a Project, for example, we collect the child schema via:

The Project's child is the GroupBy node, so I confirmed this with runtime logging:
I also tried a

Recall that the second pass can't use non-name matching, because the two things we need to match are:

sql.Column (sql/column.go) appears to have no ColumnId or any ID field. GetField (sql/expression/get_field.go) has a separate exprId (ColumnId) field, but I believe there's nothing on the child schema side to match it against. So, the only shared identifier between a GetField and its corresponding child schema column is the name, and the name has the casing asymmetry from GMS. Maybe I'm missing something and there is something better to match on? I think getting this second-pass part of the code working properly would require either:

Is this casing handling in GMS correct? Should I work on fixing this part in a corresponding PR to GMS?
So GMS (short for |
Summary
VALUES clause type inference now uses PostgreSQL's common type resolution algorithm instead of using only the first row's type.
existing issue:
SELECT * FROM (VALUES(1),(2.01),(3)) v(n) failed with: integer: unhandled type: decimal.Decimal

Changes
- Add ResolveValuesTypes analyzer rule that:
  - uses FindCommonType() to resolve types per PostgreSQL's [UNION/CASE type resolution](https://www.postgresql.org/docs/15/typeconv-union-case.html) (per the FindCommonType doc comment)
  - adds an UnknownCoercion expression for unknown-to-target-type coercion

Questions for Reviewers

- Analyzer rule approach: Is adding this as an analyzer rule (after TypeSanitizer) the right approach? I considered alternatives, but this seemed cleanest for handling the two-pass transformation needed for GetField updates. Open to feedback if there's a better pattern.
- PostgreSQL version: The code references PostgreSQL 15 docs. Should this stay as-is since doltgresql targets PG 15, or should it use /docs/current/?

Fixes: #1648