
[yugabyte/yugabyte-db#28166] Add test to reproduce error and validate fix #182

Open
fourpointfour wants to merge 2 commits into yugabyte:ybdb-debezium-2.5.2 from fourpointfour:db-17813-single-shard-data-loss

Conversation

@fourpointfour

This PR adds a test which reliably reproduces the error reported in yugabyte/yugabyte-db#28166 and also validates that the fix landed on the service side resolves the issue.

@fourpointfour fourpointfour self-assigned this Aug 7, 2025
@fourpointfour fourpointfour added the bug Something isn't working label Aug 7, 2025
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 19, 2025
Summary:
After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
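The ordering described above can be sketched as follows. This is a hypothetical Python illustration, not the actual C++ implementation: the class name, field names, and values are assumptions modeled on the fields the commit message lists, and `sort_key` stands in for the comparison done by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecordID:
    """Hypothetical stand-in for a CDCSDK unique record ID."""
    commit_time: int   # commit hybrid time
    record_time: int   # hybrid time of the individual write
    write_id: int      # with the fix: index within the WRITE_OP batch
    table_id: str
    primary_key: str

    def sort_key(self):
        # Before the fix, records from one non-transactional batch could
        # share commit_time, record_time, write_id, and table_id, leaving
        # only the primary key to compare; with write_id assigned from the
        # batch index, the tie is broken deterministically at that field.
        return (self.commit_time, self.record_time, self.write_id,
                self.table_id, self.primary_key)

# Two records from the same single-shard WRITE_OP batch: identical times
# and table, distinguished only by the batch-index write_id.
r0 = RecordID(100, 100, 0, "t1", "pk_b")
r1 = RecordID(100, 100, 1, "t1", "pk_a")
assert sorted([r1, r0], key=RecordID.sort_key) == [r0, r1]
```

Note that without the `write_id` tie-break, the same two records would sort by primary key (`pk_a` before `pk_b`), which need not match the order in which the tablet emitted them.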
Jira: DB-17813

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D45857
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 21, 2025
…d record

Summary:
**Backport description:**
No merge conflicts were encountered.

After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
Jira: DB-17813

Original commit: c397d2f / D45857

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: sumukh.phalgaonkar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D46157
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 21, 2025
…d record

Summary:
**Backport description:**
No merge conflicts were encountered.

After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
Jira: DB-17813

Original commit: c397d2f / D45857

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: sumukh.phalgaonkar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D46189