
Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Jan 12, 2026

What problem does this PR solve?

Issue Number: close #3962

What is changed and how it works?

This pull request improves message handling in the DispatcherOrchestrator by de-duplicating incoming messages. Dropping redundant retry messages keeps the dispatcher responsive and prevents critical operations from being delayed or starved by floods of duplicate requests, improving the overall stability and efficiency of message dispatching.

Highlights

  • Duplicate Message Handling: Introduced a new pendingMessageQueue to de-duplicate messages based on (changefeedID, messageType), preventing floods of retry messages from blocking or starving other requests.
  • Asynchronous Message Processing Improvement: The DispatcherOrchestrator now utilizes the pendingMessageQueue instead of a simple buffered channel for asynchronous message processing, enhancing robustness and efficiency.
  • New Data Structure: A pendingMessageKey struct was added to uniquely identify messages for de-duplication purposes.
  • Unit Tests: Comprehensive unit tests have been added for the pendingMessageQueue to ensure its correctness in handling duplicates, preserving order, and proper closure.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jan 12, 2026

@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 12, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a pendingMessageQueue to de-duplicate messages in the DispatcherOrchestrator, preventing floods of retry messages. The implementation replaces the previous channel-based approach with this new queue, which correctly drops duplicate messages while one is being processed. The changes include updating the message handling logic, the shutdown sequence, and adding comprehensive unit tests for the new queue.

My review focuses on the new queue's implementation and its integration. I've identified a couple of areas for potential improvement:

  • Refactoring getPendingMessageKey for better conciseness.
  • Removing a seemingly redundant nil check in handleMessages for improved clarity.

Overall, this is a solid improvement that addresses the issue of message floods effectively.

Comment on lines 170 to 190
func getPendingMessageKey(msg *messaging.TargetMessage) (pendingMessageKey, bool) {
	switch req := msg.Message[0].(type) {
	case *heartbeatpb.MaintainerBootstrapRequest:
		return pendingMessageKey{
			changefeedID: common.NewChangefeedIDFromPB(req.ChangefeedID),
			msgType:      msg.Type,
		}, true
	case *heartbeatpb.MaintainerPostBootstrapRequest:
		return pendingMessageKey{
			changefeedID: common.NewChangefeedIDFromPB(req.ChangefeedID),
			msgType:      msg.Type,
		}, true
	case *heartbeatpb.MaintainerCloseRequest:
		return pendingMessageKey{
			changefeedID: common.NewChangefeedIDFromPB(req.ChangefeedID),
			msgType:      msg.Type,
		}, true
	default:
		return pendingMessageKey{}, false
	}
}


Severity: medium

The switch statement in this function contains duplicated code for extracting req.ChangefeedID and constructing the pendingMessageKey. You can refactor this to be more concise by extracting the ChangefeedID into a variable within the switch, and then constructing the pendingMessageKey once after the switch.

func getPendingMessageKey(msg *messaging.TargetMessage) (pendingMessageKey, bool) {
	var changefeedID *heartbeatpb.ChangefeedID
	switch req := msg.Message[0].(type) {
	case *heartbeatpb.MaintainerBootstrapRequest:
		changefeedID = req.ChangefeedID
	case *heartbeatpb.MaintainerPostBootstrapRequest:
		changefeedID = req.ChangefeedID
	case *heartbeatpb.MaintainerCloseRequest:
		changefeedID = req.ChangefeedID
	default:
		return pendingMessageKey{}, false
	}
	return pendingMessageKey{
		changefeedID: common.NewChangefeedIDFromPB(changefeedID),
		msgType:      msg.Type,
	}, true
}

Comment on lines 202 to 205
if msg == nil {
m.msgQueue.Done(key)
continue
}


Severity: medium

This check for msg == nil appears to be unnecessary. Given the logic of pendingMessageQueue, a key is only pushed to the queue after being added to the pending map. Since handleMessages processes messages serially from the queue in a single goroutine, m.msgQueue.Get(key) should not return nil for a key that was just successfully popped. Removing this block would simplify the code. If there's a subtle race condition this is protecting against, it would be beneficial to add a comment explaining it.

"go.uber.org/zap"
)

type pendingMessageKey struct {
Collaborator


Please move this struct to a new file to keep the code clean.

Comment on lines 66 to 72
q.mu.Lock()
if _, ok := q.pending[key]; ok {
	q.mu.Unlock()
	return false
}
q.pending[key] = msg
q.mu.Unlock()
Collaborator


It is better to extract the code as a function

func (m *DispatcherOrchestrator) handleMessages() {
	for {
		select {
		case <-ctx.Done():
Collaborator


Why remove this?

Collaborator Author


We close the msgQueue instead of using ctx cancellation here,
because the msgQueue blocks when there is no data, making it impossible to observe changes in ctx; an explicit close operation is required to unblock it.

}

// De-duplicate by (changefeedID, messageType) to avoid floods of retry messages.
_ = m.msgQueue.TryEnqueue(key, msg)
Collaborator


Is it possible that a message with the same key as an existing message is not a retry message?

Collaborator Author


I don't think so. Besides, according to the original logic, if the channel is full, the message will be dropped directly, so I think the behavior will be the same.

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I pause the changefeed and remove the changefeed, the key is the same, but the request is not the same.

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jan 13, 2026
@hongyunyan
Collaborator Author

/retest

@ti-chi-bot

ti-chi-bot bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flowbehappy, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [flowbehappy,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 13, 2026
@ti-chi-bot

ti-chi-bot bot commented Jan 13, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-13 08:54:29 UTC: ☑️ agreed by wk989898.
  • 2026-01-13 09:15:02 UTC: ☑️ agreed by flowbehappy.

@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-light-next-gen

@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@hongyunyan
Collaborator Author

/retest

1 similar comment
@hongyunyan
Collaborator Author

/retest


Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Changefeed deletion can hang when downstream is unreachable due to dispatcher bootstrap flooding and bounded message channel drops

3 participants