
[yugabyte/yugabyte-db#28166] Add test to reproduce error and validate fix #182

Open
fourpointfour wants to merge 2 commits into yugabyte:ybdb-debezium-2.5.2 from fourpointfour:db-17813-single-shard-data-loss

Conversation

@fourpointfour

This PR adds a test which reliably reproduces the error reported in yugabyte/yugabyte-db#28166 and also validates that the fix landed on the service side resolves the issue.

@fourpointfour fourpointfour self-assigned this Aug 7, 2025
@fourpointfour fourpointfour added the bug Something isn't working label Aug 7, 2025
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 19, 2025
Summary:
After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
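The ordering described above can be sketched as follows. This is a hypothetical Python illustration, not the actual C++ implementation: the class name, field names, and values are assumptions modeled on the fields the commit message lists, and `sort_key` stands in for the comparison done by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecordID:
    """Hypothetical stand-in for a CDCSDK unique record ID."""
    commit_time: int   # commit hybrid time
    record_time: int   # hybrid time of the individual write
    write_id: int      # with the fix: index within the WRITE_OP batch
    table_id: str
    primary_key: str

    def sort_key(self):
        # Before the fix, records from one non-transactional batch could
        # share commit_time, record_time, write_id, and table_id, leaving
        # only the primary key to compare; with write_id assigned from the
        # batch index, the tie is broken deterministically at that field.
        return (self.commit_time, self.record_time, self.write_id,
                self.table_id, self.primary_key)

# Two records from the same single-shard WRITE_OP batch: identical times
# and table, distinguished only by the batch-index write_id.
r0 = RecordID(100, 100, 0, "t1", "pk_b")
r1 = RecordID(100, 100, 1, "t1", "pk_a")
assert sorted([r1, r0], key=RecordID.sort_key) == [r0, r1]
```

Note that without the `write_id` tie-break, the same two records would sort by primary key (`pk_a` before `pk_b`), which need not match the order in which the tablet emitted them.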
Jira: DB-17813

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D45857
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 21, 2025
…d record

Summary:
**Backport description:**
No merge conflicts were encountered.

After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
Jira: DB-17813

Original commit: c397d2f / D45857

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: sumukh.phalgaonkar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D46157
fourpointfour added a commit to yugabyte/yugabyte-db that referenced this pull request Aug 21, 2025
…d record

Summary:
**Backport description:**
No merge conflicts were encountered.

After generating unique record IDs, we compare two records using `commit_time`, `record_time`, `write_id`, etc. to determine which one comes first in sorted order. This check is performed in the method `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`.

When the GUC `yb_disable_transactional_writes` is set, multiple records are inserted in a single `WRITE_OP` batch for a single-shard transaction, so there is a possibility that the records end up with the same `commit_time`, `record_time`, `write_id`, and `table_id`, in which case we fall back to comparing primary keys. The core issue was that the VWAL expects each individual tablet to send records in the order determined by `CDCSDKUniqueRecordID::GreaterThanDistributedLSN`. This expectation was being violated, leading to a data-loss scenario in which multiple records inserted in the same batch of a single-shard transaction are lost.

The fix requires a mechanism that guarantees a fixed sort order when all the other parameters end up with the same value. This change adds that mechanism by assigning a `write_id` to every single-shard record based on its index within the WAL record batch being processed. The tie is then broken by `write_id`, so the records can be reliably sorted without filtering any of them out.
Jira: DB-17813

Original commit: c397d2f / D45857

Test Plan:
The tests to reproduce the error and validate the fix have been added as part of the logical replication connector's test suite in the following PR:
yugabyte/debezium#182

Additionally, even though the issue does not apply to the gRPC connector, we are prudently adding a test to the gRPC connector as well:
yugabyte/debezium-connector-yugabytedb#379

Reviewers: asrinivasan, sumukh.phalgaonkar, skumar

Reviewed By: sumukh.phalgaonkar

Subscribers: ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D46189