[yugabyte/yugabyte-db#28166] Add test to reproduce error and validate fix#182
Open
fourpointfour wants to merge 2 commits intoyugabyte:ybdb-debezium-2.5.2from
Open
Conversation
fourpointfour
added a commit
to yugabyte/yugabyte-db
that referenced
this pull request
Aug 19, 2025
Summary: After generating unique record IDs, for comparing 2 records, we have checks to compare them using commit_time, record_time, write_id, etc to ascertain which record comes before in a sorted order. The check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. Now when the GUC `yb_disable_transactional_writes` is set, we have multiple records inserted in a single `WRITE_OP` batch for a single shard transaction, there's a possibility that the records end up having the same `commit_time`, `record_time`, `write_id` and `table_id` so we fallback to comparing the primary keys. The core issue was that the VWAL expects each individual tablet to send records in order as per the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This was being violated and the situation leads us to a data loss scenario where we end up losing multiple records inserted in the same batch of a single shard transaction. The fix to this requires a mechanism to reliably have a fixed sorting order when all the other parameters end up with the same value. This PR adds the same mechanism by assigning a `write_id` to every single shard record based on the index of the wal records batch we are processing. By doing this, we'll be breaking the tie using `write_id` and we can reliably sort the records without filtering them. Jira: DB-17813 Test Plan: The tests to reproduce the error and validate the fix has been added as a part of the logical replication connector's test suite in the following PR: yugabyte/debezium#182 Additionally, even though the issue will not be applicable to the gRPC connector, we are also adding a test prudently to gRPC connector as well: yugabyte/debezium-connector-yugabytedb#379 Reviewers: asrinivasan, sumukh.phalgaonkar, skumar Reviewed By: skumar Subscribers: ycdcxcluster Differential Revision: https://phorge.dev.yugabyte.com/D45857
fourpointfour
added a commit
to yugabyte/yugabyte-db
that referenced
this pull request
Aug 21, 2025
…d record Summary: **Backport description:** No merge conflicts were encountered. After generating unique record IDs, for comparing 2 records, we have checks to compare them using commit_time, record_time, write_id, etc to ascertain which record comes before in a sorted order. The check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. Now when the GUC `yb_disable_transactional_writes` is set, we have multiple records inserted in a single `WRITE_OP` batch for a single shard transaction, there's a possibility that the records end up having the same `commit_time`, `record_time`, `write_id` and `table_id` so we fallback to comparing the primary keys. The core issue was that the VWAL expects each individual tablet to send records in order as per the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This was being violated and the situation leads us to a data loss scenario where we end up losing multiple records inserted in the same batch of a single shard transaction. The fix to this requires a mechanism to reliably have a fixed sorting order when all the other parameters end up with the same value. This PR adds the same mechanism by assigning a `write_id` to every single shard record based on the index of the wal records batch we are processing. By doing this, we'll be breaking the tie using `write_id` and we can reliably sort the records without filtering them. Jira: DB-17813 Original commit: c397d2f / D45857 Test Plan: The tests to reproduce the error and validate the fix has been added as a part of the logical replication connector's test suite in the following PR: yugabyte/debezium#182 Additionally, even though the issue will not be applicable to the gRPC connector, we are also adding a test prudently to gRPC connector as well: yugabyte/debezium-connector-yugabytedb#379 Reviewers: asrinivasan, sumukh.phalgaonkar, skumar Reviewed By: sumukh.phalgaonkar Subscribers: ycdcxcluster Differential Revision: https://phorge.dev.yugabyte.com/D46157
fourpointfour
added a commit
to yugabyte/yugabyte-db
that referenced
this pull request
Aug 21, 2025
…d record Summary: **Backport description:** No merge conflicts were encountered. After generating unique record IDs, for comparing 2 records, we have checks to compare them using commit_time, record_time, write_id, etc to ascertain which record comes before in a sorted order. The check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. Now when the GUC `yb_disable_transactional_writes` is set, we have multiple records inserted in a single `WRITE_OP` batch for a single shard transaction, there's a possibility that the records end up having the same `commit_time`, `record_time`, `write_id` and `table_id` so we fallback to comparing the primary keys. The core issue was that the VWAL expects each individual tablet to send records in order as per the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This was being violated and the situation leads us to a data loss scenario where we end up losing multiple records inserted in the same batch of a single shard transaction. The fix to this requires a mechanism to reliably have a fixed sorting order when all the other parameters end up with the same value. This PR adds the same mechanism by assigning a `write_id` to every single shard record based on the index of the wal records batch we are processing. By doing this, we'll be breaking the tie using `write_id` and we can reliably sort the records without filtering them. Jira: DB-17813 Original commit: c397d2f / D45857 Test Plan: The tests to reproduce the error and validate the fix has been added as a part of the logical replication connector's test suite in the following PR: yugabyte/debezium#182 Additionally, even though the issue will not be applicable to the gRPC connector, we are also adding a test prudently to gRPC connector as well: yugabyte/debezium-connector-yugabytedb#379 Reviewers: asrinivasan, sumukh.phalgaonkar, skumar Reviewed By: sumukh.phalgaonkar Subscribers: ycdcxcluster Differential Revision: https://phorge.dev.yugabyte.com/D46189
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR adds a test which reliably reproduces the error reported in yugabyte/yugabyte-db#28166 and also validates that the fix landed on the service side fixes the issue.