generated from MetaMask/metamask-module-template
-
Notifications
You must be signed in to change notification settings - Fork 6
feat(ocap-kernel): add permanent failure detection for reconnection #789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
sirtimid
wants to merge
6
commits into
main
Choose a base branch
from
sirtimid/permanent-failure-detection
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+798
−15
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add error pattern tracking to ReconnectionManager to detect when a peer is permanently unreachable. When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed and reconnection attempts stop. Changes: - Add error history tracking per peer in ReconnectionManager - Add isPermanentlyFailed() and clearPermanentFailure() methods - Add getNetworkErrorCode() helper to extract error codes - Integrate error recording into reconnection lifecycle - Check permanent failure status before attempting reconnection Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive unit tests for: - getNetworkErrorCode helper function - Error tracking in ReconnectionManager (recordError, getErrorHistory) - Permanent failure detection (isPermanentlyFailed) - Clearing permanent failure state (clearPermanentFailure) - Custom consecutive error threshold - Integration with startReconnection, clearPeer, and clear Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add integration tests for permanent failure detection in the reconnection lifecycle: - Gives up when peer is permanently failed at start of loop - Records errors after failed dial attempts - Gives up when error triggers permanent failure - Continues retrying when error does not trigger failure - handleConnectionLoss skips reconnection for permanently failed peers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update existing tests to work with permanent failure detection changes: - Add getNetworkErrorCode export to index test - Add getNetworkErrorCode mock to transport tests - Update startReconnection mocks to return true - Add isPermanentlyFailed and recordError mocks to ReconnectionManager Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor
Coverage Report
File Coverage
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Fix issues identified in code review: - Add bounds validation for consecutiveErrorThreshold (must be >= 1) - Cap error history to prevent unbounded memory growth The error history is now limited to the threshold size since we only need the last N errors for pattern detection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
When a user explicitly calls reconnectPeer, clear the permanent failure status so the reconnection can proceed. Previously, permanently failed peers could not be manually reconnected because startReconnection would return false without attempting any connection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #688
Summary
ReconnectionManagerto detect permanently unreachable peersChanges
kernel-errors:
getNetworkErrorCode()helper to extract error codes from errorsocap-kernel:
ReconnectionStatewitherrorHistoryandpermanentlyFailedfieldsrecordError(),isPermanentlyFailed(),clearPermanentFailure()methods toReconnectionManagerstartReconnection()to returnfalsefor permanently failed peers and reset error historyTest plan
getNetworkErrorCodehelperReconnectionManager🤖 Generated with Claude Code
Note
Medium Risk
Changes reconnection control flow to short-circuit retries and alter
startReconnectionsemantics, which could affect connectivity and recovery behavior for peers. Risk is moderated by extensive new unit/integration tests around error classification and give-up paths.Overview
Adds
getNetworkErrorCodeto@metamask/kernel-errorsand exports it to consistently derive a stable network error identifier fromerror.code,error.name, or relayNO_RESERVATIONmessages.Updates
ocap-kernelreconnection to track recent dial failures per peer and mark peers as permanently failed after N consecutive identical “unreachable” errors (ECONNREFUSED,EHOSTUNREACH,ENOTFOUND,ENETUNREACH). The reconnection lifecycle now records error codes, bails out immediately for permanently failed peers, andstartReconnectionreturnsfalsewhen reconnection should be skipped; manualreconnectPeerclears permanent-failure state before retrying.Written by Cursor Bugbot for commit a1c76bb. This will update automatically on new commits. Configure here.