
discovery: fix gossiper shutdown deadlock #10540

Merged
Roasbeef merged 2 commits into lightningnetwork:master from morehouse:gossiper_shutdown_deadlock
Feb 4, 2026

Conversation

@morehouse
Collaborator

When processing a remote network announcement, it is possible for two error messages to be sent back on the errChan. Since Brontide doesn't actually read from errChan, and since errChan only buffered one error message, the sending goroutine would deadlock forever. This would only become apparent when the gossiper attempted to shut down and got hung up.

For now, we can fix this simply by buffering up to two error messages on errChan. There is an existing TODO to restructure this logic entirely to use the actor model, and we can do a more thorough fix as part of that work.

This bug was discovered while doing full node fuzz testing and was triggered by sending a specific channel_announcement message and then shutting down LND.
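
For a concrete picture of the failure mode, here is a minimal, self-contained Go sketch (illustrative names only, not the actual gossiper code): a goroutine that sends two errors into a channel nobody reads blocks forever unless the channel is buffered for both sends.

package main

import (
	"errors"
	"fmt"
)

// processAnnouncement stands in for the gossiper's validation path, which
// in the buggy scenario emits two errors for a single announcement.
func processAnnouncement(errChan chan error) {
	errChan <- errors.New("announcement for wrong chain")
	// With a buffer of 1 and no reader, this second send blocks forever,
	// so the goroutine never exits and shutdown hangs waiting for it.
	errChan <- errors.New("secondary validation error")
}

func main() {
	// The fix: size the buffer for the maximum number of errors the
	// path can emit, so both sends complete even with no reader.
	errChan := make(chan error, 2)

	done := make(chan struct{})
	go func() {
		processAnnouncement(errChan)
		close(done)
	}()

	<-done
	fmt.Println("goroutine exited; shutdown can proceed")
}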

@gemini-code-assist

Summary of Changes

Hello @morehouse, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical deadlock issue within the gossiper component, specifically during its shutdown sequence. The fix involves a targeted adjustment to an internal error channel's buffering capacity, preventing goroutines from blocking indefinitely when multiple errors are generated. This enhancement significantly improves the robustness and reliability of the gossiper's shutdown process, ensuring cleaner termination even under specific error conditions.

Highlights

  • Deadlock Fix: Resolved a potential deadlock in the gossiper's ProcessRemoteAnnouncement function during shutdown. This occurred when two error messages were sent to a single-buffered errChan and not read by Brontide, causing the sending goroutine to block indefinitely.
  • Error Channel Buffering: Increased the buffer size of the errChan in ProcessRemoteAnnouncement from 1 to 2. This change accommodates scenarios where up to two error messages might be sent, preventing goroutine blocking and ensuring clean shutdown.
  • Regression Test: Added a new test TestGossiperShutdownWrongChainAnnouncement to specifically reproduce the deadlock condition. This test verifies that the gossiper can now shut down cleanly after processing problematic channel announcements that previously led to deadlocks.


Changelog
  • discovery/gossiper.go
    • Increased the buffer size of the errChan in ProcessRemoteAnnouncement from 1 to 2.
    • Added comments explaining the rationale for the buffer increase and a TODO for future refactoring using an actor model.
  • discovery/gossiper_test.go
    • Introduced TestGossiperShutdownWrongChainAnnouncement, a new regression test.
    • This test simulates a scenario where a channel announcement with a wrong chain hash and identical node IDs triggers two error messages, verifying that the gossiper can now shut down without deadlocking.
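
The essential property the new test checks is that Stop returns instead of hanging. A generic, self-contained Go pattern for asserting that (not the actual lnd test code; the timeout value below is arbitrary) looks roughly like this:

package main

import (
	"fmt"
	"time"
)

// assertStops runs stop in a goroutine and reports an error if it does
// not return within the given timeout, i.e. if shutdown deadlocks.
func assertStops(stop func(), timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		stop()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("shutdown did not complete within %v", timeout)
	}
}

func main() {
	// A stop function that returns promptly passes the check.
	err := assertStops(func() { time.Sleep(10 * time.Millisecond) }, time.Second)
	fmt.Println("clean shutdown:", err == nil)
}
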
Activity
  • The author identified and fixed a deadlock issue during full node fuzz testing.
  • A regression test was added to ensure the fix prevents future occurrences of the deadlock.


@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a potential deadlock in the gossiper during shutdown. The fix is simple and effective: increasing the buffer size of an error channel to prevent blocking when multiple errors are sent without a reader. The change is well-commented, explaining the reason for the fix and noting it as a temporary solution with a TODO for a future redesign. A comprehensive regression test has been added that accurately reproduces the deadlock scenario, ensuring the fix is effective and preventing future regressions. The test is clear and well-documented. Overall, this is a solid contribution that improves the stability of the gossiper.

@lightninglabs-deploy
Collaborator

🟠 PR Severity: HIGH

Gossip protocol deadlock fix | 2 files | 9 lines changed

🟠 High (1 file)
  • discovery/gossiper.go - Gossip protocol component, deadlock fix affecting peer connection handling
🟢 Tests (1 file)
  • discovery/gossiper_test.go - Test coverage for the fix

Analysis

This PR fixes a shutdown deadlock in the gossiper component. The discovery/* package handles the Lightning Network gossip protocol, which is critical infrastructure for network graph maintenance and peer communication.

Why HIGH severity:

  • Changes to discovery/gossiper.go affect the gossip protocol subsystem
  • Deadlocks in shutdown paths can cause node hangs and require manual intervention
  • While the change is small (9 lines), it touches concurrency control in a core networking component
  • According to classification rules, discovery/* falls under HIGH severity tier

Review considerations:

  • Verify the deadlock scenario is properly resolved
  • Check that the fix doesn't introduce new race conditions
  • Ensure graceful shutdown behavior is maintained
  • Test coverage appears adequate with 61 lines of new tests

To override, add a `severity-override-{critical,high,medium,low}` label.

@saubyk added this to v0.21 Feb 3, 2026
@saubyk added this to the v0.21.0 milestone Feb 3, 2026
@saubyk moved this to In progress in v0.21 Feb 3, 2026
@ziggie1984
Collaborator

Thank you for the fix; please make sure the linter passes.

Please also add an entry to the LND 0.21.0 release notes.

@morehouse force-pushed the gossiper_shutdown_deadlock branch from 5d1d71c to 3b7b8b0 February 3, 2026 15:13
@morehouse
Collaborator Author

Linter fixed and release note added.

@morehouse force-pushed the gossiper_shutdown_deadlock branch from 3b7b8b0 to 5fa025e February 3, 2026 15:16
@morehouse
Collaborator Author

Rebased

Collaborator

@ziggie1984 left a comment


LGTM (pending CI)

//
// TODO(ziggie): Redesign this once the actor model pattern becomes
// available. See https://github.com/lightningnetwork/lnd/pull/9820.
errChan := make(chan error, 2)
Contributor

@NishantBansal2003 Feb 4, 2026


Just wanted to share a few thoughts here. Even if the actor model pattern is not being used, it doesn’t feel intuitive to me that a single announcement message from a peer would result in two errors being sent on the error channel. Once we encounter the first error, it seems reasonable to return immediately.

If the current focus is just on fixing the issue rather than revisiting the flow, that’s totally fine. I mainly wanted to understand the reasoning behind emitting a second error here (especially since a blocked slot in the gossiper seems to have a much larger impact than a deadlock during shutdown).

Collaborator


This PR should focus only on solving the problem found during fuzzing. However, I think we can revisit this area in the future to make sure the design follows proper coding principles.

Member


Generally I consider the pattern of sending errors over a channel in a function like this an anti-pattern. It's caused many bugs (like this) in the past, and can also make the control flow a bit harder to track as well.

Once we encounter the first error, it seems reasonable to return immediately.

Agreed. fn.Future can work nicely here for that, as it will just return that value instead of messing around with channels.
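
As a rough illustration of that direction (a generic future-style sketch, not the actual fn package API), a result that resolves at most once lets the producer report the first error without ever blocking, while the caller can await it or ignore it:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// errFuture resolves exactly once; later errors are dropped rather than
// blocking the sender, and a caller may await the result or ignore it.
type errFuture struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newErrFuture() *errFuture {
	return &errFuture{done: make(chan struct{})}
}

// resolve records the first error and never blocks.
func (f *errFuture) resolve(err error) {
	f.once.Do(func() {
		f.err = err
		close(f.done)
	})
}

// await blocks until the future is resolved and returns its error.
func (f *errFuture) await() error {
	<-f.done
	return f.err
}

func main() {
	fut := newErrFuture()
	go func() {
		fut.resolve(errors.New("announcement for wrong chain"))
		fut.resolve(errors.New("secondary error")) // safely dropped
	}()
	fmt.Println(fut.await())
}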

Member

@Roasbeef left a comment


LGTM 🐔


@Roasbeef merged commit 5c7d3ad into lightningnetwork:master Feb 4, 2026
48 of 49 checks passed
@github-project-automation bot moved this from In progress to Done in v0.21 Feb 4, 2026
@morehouse deleted the gossiper_shutdown_deadlock branch February 4, 2026 20:28
@Roasbeef added the backport-v0.20.x-branch label Feb 5, 2026 (this label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`).
@github-actions

github-actions bot commented Feb 5, 2026

Created backport PR for v0.20.x-branch:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-10540-to-v0.20.x-branch
git worktree add --checkout .worktree/backport-10540-to-v0.20.x-branch backport-10540-to-v0.20.x-branch
cd .worktree/backport-10540-to-v0.20.x-branch
git reset --hard HEAD^
git cherry-pick -x 5fa025e9e7a180de0a81750757fcc20eb37b28a9
git push --force-with-lease

Roasbeef pushed a commit that referenced this pull request Feb 5, 2026
yyforyongyu added a commit that referenced this pull request Feb 6, 2026

[v0.20.x-branch] Backport #10540: discovery: fix gossiper shutdown deadlock

Labels

backport-v0.20.x-branch (used to trigger the creation of a backport PR to the branch `v0.20.x-branch`), bug fix, gossip

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants