fix(*): move, add, remove, split operator inconsistent after maintainer move #3691
…e, split after maintainer restart
Summary of Changes
Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical consistency issue within the scheduler where the state of active dispatcher operators could be lost or become inconsistent following a maintainer restart. By enhancing the Protocol Buffer definitions to include operator-specific metadata and implementing robust tracking mechanisms within the …
Code Review
This pull request introduces changes to ensure span consistency for operators after a maintainer restart. The main changes involve adding an enabledSplit flag to dispatchers and tracking ongoing operators to restore them during bootstrap. My review focuses on ensuring correctness, consistency, and robustness of these new mechanisms. I've identified a few areas for improvement, including a typo, a misleading comment, a copy-paste error in a log message, and an unhandled error. Overall, the changes are well-structured and address the intended problem.
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
/test pull-integration-test
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-light
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-heavy
// or just a part of the table (span). When true, the dispatcher handles the entire table;
// when false, it only handles a portion of the table.
isCompleteTable bool
enabledSplit    bool
Why do we need this field?
…st-due-to-maintainer-move-operator-lost
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-light
/retest
	)
	redoInfos[dispatcherID] = info
} else {
	dispatcherManager.currentOperatorMap.Store(operatorKey, req)
This can be moved outside the if/else block.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wk989898
/retest
if isRedo && (!dispatcherManager.RedoEnable || dispatcherManager.redoDispatcherMap == nil) {
	return common.DispatcherID{}, false
}
if _, operatorExists := dispatcherManager.currentOperatorMap.Load(operatorKey); operatorExists {
If a dispatcher is in a move operator and the maintainer then generates a remove operator for it, you seem to just discard the remove action here. That may cause some strange problems.
// - The task is removed (for example, due to DDL).
removed        atomic.Bool
spanController *span.Controller
// This add operator may be a part of move/split operator
For the comment, we should probably first explain what "operatorType" means, and why operatorType could be "move" in an AddDispatcherOperator.
Besides, you could explain why some operators have this field but others don't.
finished       atomic.Bool
postFinish     func()
spanController *span.Controller
// This remove operator may be a part of move/split operator
Ditto
@@ -0,0 +1,362 @@
#!/bin/bash
Could you add a description here explaining the main purpose of this test and its main steps?
Test code is often more complex and harder to read and maintain. Having a comprehensive explanation and step-by-step instructions makes it easier to check and fix test failures later on.
if span == nil {
	span = spanInfo.Span
}
if schemaID == 0 {
When will schemaID == 0, and why do we need to set schemaID = spanInfo.SchemaID here?
	return spanInfoByID
}

func (c *Controller) restoreCurrentWorkingOperators(
I find this section difficult to understand, mainly because it involves many special logic checks. This might stem from the numerous special cases reported by Bootstrap, such as when a span has a value but the operator doesn't, or vice versa. I think it would be better to provide an overview of the information it receives, the special cases, and their origins before explaining the logic, and then explain the underlying logic. This would make it easier to understand.
From your perspective, all the special processing logic within a function might seem clear, but those unfamiliar with it might wonder why there are different processing checks and whether these special checks are truly necessary or for some other purpose. Therefore, the best approach as an author is to clearly explain the logic through specific case statements in the comments.
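As an illustration of the kind of case-by-case overview requested above, the sketch below names each combination of "span reported" and "operator reported" explicitly before any handling logic runs. It is only a shape for the documentation, not the PR's logic: `report`, `restoreCase`, and `classify` are hypothetical names, and the real bootstrap response carries more information than two booleans.

```go
package main

import "fmt"

// report is a hypothetical summary of what bootstrap says about one
// dispatcher. The fields are illustrative, not taken from the PR.
type report struct {
	HasSpan     bool // a span was reported for the dispatcher
	HasOperator bool // an in-flight operator was reported for the dispatcher
}

// restoreCase enumerates the combinations a restore function has to handle.
type restoreCase int

const (
	caseNothing         restoreCase = iota // neither span nor operator
	caseSpanOnly                           // span reported, operator missing
	caseOperatorOnly                       // operator reported, span missing
	caseSpanAndOperator                    // both reported
)

// String gives each case a readable name for logs and comments.
func (c restoreCase) String() string {
	switch c {
	case caseSpanOnly:
		return "span-only"
	case caseOperatorOnly:
		return "operator-only"
	case caseSpanAndOperator:
		return "span-and-operator"
	default:
		return "nothing"
	}
}

// classify maps a report to exactly one named case.
func classify(r report) restoreCase {
	switch {
	case r.HasSpan && r.HasOperator:
		return caseSpanAndOperator
	case r.HasSpan:
		return caseSpanOnly
	case r.HasOperator:
		return caseOperatorOnly
	default:
		return caseNothing
	}
}

func main() {
	for _, r := range []report{
		{HasSpan: true, HasOperator: true},
		{HasSpan: true},
		{HasOperator: true},
		{},
	} {
		fmt.Printf("%+v -> %s\n", r, classify(r))
	}
}
```

With the cases named up front, each branch of the restore logic can carry a comment saying which case it serves and where that case comes from.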
		spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID]),
	)
	spanController.AddReplicatingSpan(replicaSet)
} else if replicaSet.GetNodeID() == "" {
When will the node ID be ""?
  splitEnabled := spanController.ShouldEnableSplit(table.Splitable)
  // Add new table if not working
- if isTableWorking {
+ if isTableWorking || isTableSpanExists {
When will isTableWorking == false but isTableSpanExists == true?
And the related logic is a little confusing to me. Maybe you can explain it here.
// If a dispatcher becomes non-working but there's no operator handling it,
// it means the dispatcher is removed unexpectedly (e.g. maintainer failover loses the operator),
// and we must reschedule it to avoid the dispatcher being lost forever.
if status.ComponentStatus == heartbeatpb.ComponentState_Stopped ||
I understand this is a fallback logic, but I'm a little unsure if we really won't reach this point under all normal logic. Could you help me confirm this? We're considering various scenarios such as message resending, multiple sending, late arrival, and uncertain message order across multiple nodes.
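The resending and duplicate-delivery scenarios raised above can be explored with a small model of the fallback. This is a sketch under stated assumptions, not the PR's implementation: `scheduler`, `handleStatus`, and the `state` constants are hypothetical, and the model only covers repeated reports for the same dispatcher, not reordering across nodes.

```go
package main

import "fmt"

// state is a hypothetical stand-in for a dispatcher's component status.
type state int

const (
	stateWorking state = iota
	stateStopped
	stateRemoved
)

// scheduler is a hypothetical model of the maintainer-side bookkeeping.
type scheduler struct {
	operators map[string]bool // dispatcher ID -> an operator is handling it
	queued    map[string]bool // dispatcher ID -> already queued for rescheduling
	queue     []string        // dispatchers waiting to be rescheduled
}

func newScheduler() *scheduler {
	return &scheduler{operators: map[string]bool{}, queued: map[string]bool{}}
}

// handleStatus applies the fallback: a non-working dispatcher with no operator
// is queued for rescheduling exactly once. It reports whether this particular
// status message triggered the reschedule.
func (s *scheduler) handleStatus(id string, st state) bool {
	if st != stateStopped && st != stateRemoved {
		return false // still working, nothing to do
	}
	if s.operators[id] {
		return false // an operator owns this dispatcher and will handle it
	}
	if s.queued[id] {
		return false // resent or duplicated report, already handled
	}
	s.queued[id] = true
	s.queue = append(s.queue, id)
	return true
}

func main() {
	s := newScheduler()
	s.operators["d2"] = true

	fmt.Println(s.handleStatus("d1", stateStopped)) // first report triggers the fallback
	fmt.Println(s.handleStatus("d1", stateStopped)) // duplicate report is ignored
	fmt.Println(s.handleStatus("d2", stateRemoved)) // an operator exists, so no fallback
	fmt.Println(s.queue)
}
```

The model makes duplicates harmless but does not address a late report that arrives after the dispatcher has been recreated; handling that would need some notion of version or epoch on the status message, which is outside this sketch.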
…ainer-move-operator-lost' into fix/dispatcher-lost-due-to-maintainer-move-operator-lost
/test pull-cdc-mysql-integration-light
/retest
What problem does this PR solve?
Issue Number: close #3411
During maintainer failover/restart, in-flight dispatcher operators (add/remove/move/split) could be lost. If a dispatcher becomes Stopped/Removed while the corresponding operator state is missing, the span may never be rescheduled, leading to “lost dispatcher” and stalled replication. This is reproducible when dispatcher creation/close is blocked (e.g. move-table/split/remove in progress) and affects both default and redo modes.
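To make the failure mode easier to picture, here is a minimal sketch of one possible recovery shape: each node remembers its in-flight operators and reports them during bootstrap, and a restarted maintainer rebuilds its operator table from those reports. All names (`inFlight`, `dispatcherManager`, `bootstrapSnapshot`, `restoreOperators`) are hypothetical, and the actual message fields and restore rules in this PR may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// inFlight is hypothetical metadata a node keeps for an operator that has
// started but not yet finished.
type inFlight struct {
	DispatcherID string
	OperatorType string // for example "add", "remove", "move", or "split"
}

// dispatcherManager is a hypothetical node-side component that tracks
// in-flight operators by dispatcher ID.
type dispatcherManager struct {
	pending map[string]inFlight
}

// bootstrapSnapshot returns the in-flight operators in a stable order, as a
// node might report them to a newly started maintainer.
func (m *dispatcherManager) bootstrapSnapshot() []inFlight {
	out := make([]inFlight, 0, len(m.pending))
	for _, op := range m.pending {
		out = append(out, op)
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].DispatcherID < out[j].DispatcherID
	})
	return out
}

// restoreOperators rebuilds a dispatcher ID -> operator type table from the
// snapshots of all nodes, so that no in-flight operator is forgotten.
func restoreOperators(snapshots ...[]inFlight) map[string]string {
	restored := make(map[string]string)
	for _, snap := range snapshots {
		for _, op := range snap {
			restored[op.DispatcherID] = op.OperatorType
		}
	}
	return restored
}

func main() {
	nodeA := &dispatcherManager{pending: map[string]inFlight{
		"d1": {DispatcherID: "d1", OperatorType: "move"},
	}}
	nodeB := &dispatcherManager{pending: map[string]inFlight{
		"d2": {DispatcherID: "d2", OperatorType: "remove"},
	}}

	restored := restoreOperators(nodeA.bootstrapSnapshot(), nodeB.bootstrapSnapshot())
	fmt.Println(len(restored), restored["d1"], restored["d2"])
}
```

Without such a report, a restarted maintainer would see only dispatcher states and have no record that an operator was still responsible for them, which is the gap the description above refers to.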
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
None
Do you need to update user documentation, design documentation or monitoring documentation?
None
Release note