sirtimid commented Jan 29, 2026

Closes #688

Summary

  • Add error pattern tracking to ReconnectionManager to detect permanently unreachable peers
  • When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed
  • Permanent failures stop reconnection attempts to avoid wasted retries for unreachable peers
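
The detection rule in the summary can be sketched as a pure predicate over a peer's recent error codes. This is a minimal illustration, not the code from this PR: the name `isPermanentFailurePattern` and its signature are hypothetical, and only the four error codes and the default of 5 come from the description above.

```typescript
// Error codes the PR description treats as "unreachable".
const UNREACHABLE_CODES: ReadonlySet<string> = new Set([
  "ECONNREFUSED",
  "EHOSTUNREACH",
  "ENOTFOUND",
  "ENETUNREACH",
]);

// Hypothetical helper: true when the last `threshold` codes are all the
// same code and that code is one of the unreachable codes.
function isPermanentFailurePattern(
  history: readonly string[],
  threshold = 5,
): boolean {
  if (threshold < 1 || history.length < threshold) {
    return false;
  }
  const recent = history.slice(-threshold);
  const first = recent[0];
  return (
    first !== undefined &&
    UNREACHABLE_CODES.has(first) &&
    recent.every((code) => code === first)
  );
}
```

Under this sketch, five `ECONNREFUSED` codes in a row match the pattern, while a run interrupted by a different code (for example a timeout) does not, because the errors must be consecutive and identical.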

Changes

kernel-errors:

  • Add getNetworkErrorCode() helper to extract error codes from errors

ocap-kernel:

  • Extend ReconnectionState with errorHistory and permanentlyFailed fields
  • Add recordError(), isPermanentlyFailed(), clearPermanentFailure() methods to ReconnectionManager
  • Update startReconnection() to return false for permanently failed peers and reset error history
  • Integrate error recording into reconnection lifecycle
  • Check permanent failure status before attempting reconnection
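
To make the changed `startReconnection()` semantics concrete, here is a rough sketch of how the new state fields and methods might fit together. The method and field names `errorHistory`, `permanentlyFailed`, `recordError`, `isPermanentlyFailed`, `clearPermanentFailure`, and `startReconnection` come from the list above. Everything else is an assumption: the class name `ReconnectionManagerSketch`, the `isReconnecting` field, the internal map, and the choice to reset the history inside `clearPermanentFailure`.

```typescript
const UNREACHABLE = new Set(["ECONNREFUSED", "EHOSTUNREACH", "ENOTFOUND", "ENETUNREACH"]);

type ReconnectionState = {
  isReconnecting: boolean; // assumed pre-existing field
  errorHistory: string[]; // new in this PR
  permanentlyFailed: boolean; // new in this PR
};

class ReconnectionManagerSketch {
  private readonly states = new Map<string, ReconnectionState>();

  private readonly threshold: number;

  constructor(threshold = 5) {
    this.threshold = threshold;
  }

  private stateFor(peerId: string): ReconnectionState {
    let state = this.states.get(peerId);
    if (state === undefined) {
      state = { isReconnecting: false, errorHistory: [], permanentlyFailed: false };
      this.states.set(peerId, state);
    }
    return state;
  }

  // Record a dial failure and flag the peer once the pattern is reached.
  recordError(peerId: string, code: string): void {
    const state = this.stateFor(peerId);
    state.errorHistory.push(code);
    const recent = state.errorHistory.slice(-this.threshold);
    if (
      recent.length === this.threshold &&
      UNREACHABLE.has(code) &&
      recent.every((entry) => entry === code)
    ) {
      state.permanentlyFailed = true;
    }
  }

  isPermanentlyFailed(peerId: string): boolean {
    return this.stateFor(peerId).permanentlyFailed;
  }

  // Assumption: clearing the flag also discards the old history.
  clearPermanentFailure(peerId: string): void {
    const state = this.stateFor(peerId);
    state.permanentlyFailed = false;
    state.errorHistory = [];
  }

  // Returns false (and does nothing) for permanently failed peers.
  startReconnection(peerId: string): boolean {
    const state = this.stateFor(peerId);
    if (state.permanentlyFailed) {
      return false;
    }
    state.errorHistory = [];
    state.isReconnecting = true;
    return true;
  }
}
```

Callers that previously ignored the result of `startReconnection()` would need to check it, which is why the existing test mocks were updated to return `true`.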

Test plan

  • Unit tests for getNetworkErrorCode helper
  • Unit tests for error tracking in ReconnectionManager
  • Unit tests for permanent failure detection (consecutive identical errors)
  • Unit tests for clearing permanent failure state
  • Integration tests for reconnection lifecycle with permanent failure
  • All existing tests pass

🤖 Generated with Claude Code


Note

Medium Risk
Changes reconnection control flow to short-circuit retries and alter startReconnection semantics, which could affect connectivity and recovery behavior for peers. Risk is moderated by extensive new unit/integration tests around error classification and give-up paths.

Overview
Adds getNetworkErrorCode to @metamask/kernel-errors and exports it to consistently derive a stable network error identifier from error.code, error.name, or relay NO_RESERVATION messages.
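
As a rough illustration of the lookup described above, the sketch below derives a code from `error.code`, a `NO_RESERVATION` marker in the message, or `error.name`. It is not the helper from `@metamask/kernel-errors`: the name `getNetworkErrorCodeSketch`, the order of the checks, and the `"UNKNOWN"` fallback are all assumptions made for this example.

```typescript
// Hypothetical stand-in for getNetworkErrorCode; precedence and fallback
// value are assumptions, not taken from the PR.
function getNetworkErrorCodeSketch(error: unknown): string {
  if (typeof error !== "object" || error === null) {
    return "UNKNOWN";
  }
  const { code, name, message } = error as {
    code?: unknown;
    name?: unknown;
    message?: unknown;
  };
  // Node-style system errors carry a string `code` such as ECONNREFUSED.
  if (typeof code === "string" && code.length > 0) {
    return code;
  }
  // Checked before `name` because every Error has a name ("Error"),
  // which would otherwise shadow the relay message check.
  if (typeof message === "string" && message.includes("NO_RESERVATION")) {
    return "NO_RESERVATION";
  }
  if (typeof name === "string" && name.length > 0) {
    return name;
  }
  return "UNKNOWN";
}
```

Returning a plain string keeps the result easy to compare across attempts, which is what the consecutive-error check needs.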

Updates ocap-kernel reconnection to track recent dial failures per peer and mark peers as permanently failed after N consecutive identical “unreachable” errors (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, ENETUNREACH). The reconnection lifecycle now records error codes, bails out immediately for permanently failed peers, and startReconnection returns false when reconnection should be skipped; manual reconnectPeer clears permanent-failure state before retrying.
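
The give-up paths in the lifecycle can be sketched as a loop that checks the permanent-failure flag before each dial and again after each recorded error. This is an illustrative outline only: `reconnectLoop`, the `dial` and `getCode` callbacks, `maxAttempts`, and the outcome labels are hypothetical, and the real lifecycle presumably also applies backoff between attempts, which is omitted here.

```typescript
type FailureTracker = {
  isPermanentlyFailed(peerId: string): boolean;
  recordError(peerId: string, code: string): void;
};

type Outcome = "connected" | "gave-up" | "exhausted";

async function reconnectLoop(
  peerId: string,
  tracker: FailureTracker,
  dial: (peerId: string) => Promise<void>,
  getCode: (error: unknown) => string,
  maxAttempts: number,
): Promise<Outcome> {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    // Bail out immediately if the peer is already marked as failed.
    if (tracker.isPermanentlyFailed(peerId)) {
      return "gave-up";
    }
    try {
      await dial(peerId);
      return "connected";
    } catch (error) {
      // Record the code, then stop if this error completed the pattern.
      tracker.recordError(peerId, getCode(error));
      if (tracker.isPermanentlyFailed(peerId)) {
        return "gave-up";
      }
    }
    // A real loop would wait (backoff) here before the next attempt.
  }
  return "exhausted";
}
```

Errors that do not complete the pattern fall through to the next iteration, so ordinary transient failures keep retrying as before.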

Written by Cursor Bugbot for commit a1c76bb.

sirtimid and others added 4 commits January 29, 2026 15:07
Add error pattern tracking to ReconnectionManager to detect when a peer
is permanently unreachable. When the same error code (ECONNREFUSED,
EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times
(default 5), the peer is marked as permanently failed and reconnection
attempts stop.

Changes:
- Add error history tracking per peer in ReconnectionManager
- Add isPermanentlyFailed() and clearPermanentFailure() methods
- Add getNetworkErrorCode() helper to extract error codes
- Integrate error recording into reconnection lifecycle
- Check permanent failure status before attempting reconnection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive unit tests for:
- getNetworkErrorCode helper function
- Error tracking in ReconnectionManager (recordError, getErrorHistory)
- Permanent failure detection (isPermanentlyFailed)
- Clearing permanent failure state (clearPermanentFailure)
- Custom consecutive error threshold
- Integration with startReconnection, clearPeer, and clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add integration tests for permanent failure detection in the
reconnection lifecycle:
- Gives up when peer is permanently failed at start of loop
- Records errors after failed dial attempts
- Gives up when error triggers permanent failure
- Continues retrying when error does not trigger failure
- handleConnectionLoss skips reconnection for permanently failed peers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update existing tests to work with permanent failure detection changes:
- Add getNetworkErrorCode export to index test
- Add getNetworkErrorCode mock to transport tests
- Update startReconnection mocks to return true
- Add isPermanentlyFailed and recordError mocks to ReconnectionManager

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions bot commented Jan 29, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 88.67% (⬆️ +0.08%) | 5833 / 6578 |
| 🔵 | Statements | 88.55% (⬆️ +0.08%) | 5926 / 6692 |
| 🔵 | Functions | 87.72% (⬆️ +0.06%) | 1515 / 1727 |
| 🔵 | Branches | 84.99% (⬆️ +0.16%) | 2119 / 2493 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/kernel-errors/src/index.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-errors/src/utils/getNetworkErrorCode.ts | 100% | 100% | 100% | 100% | |
| packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts | 89.18% (⬆️ +2.98%) | 88.57% (⬆️ +2.37%) | 80% (🟰 ±0%) | 89.18% (⬆️ +2.98%) | 113-117, 137-138, 231-232 |
| packages/ocap-kernel/src/remotes/platform/reconnection.ts | 98.36% (⬇️ -1.64%) | 95.83% (⬇️ -4.17%) | 100% (🟰 ±0%) | 98.33% (⬇️ -1.67%) | 287 |
| packages/ocap-kernel/src/remotes/platform/transport.ts | 88.27% (⬆️ +0.08%) | 80% (🟰 ±0%) | 80% (🟰 ±0%) | 88.27% (⬆️ +0.08%) | 89, 108-111, 144, 178-196, 219, 352, 364, 416, 427 |
Generated in workflow #3440 for commit a1c76bb by the Vitest Coverage Report Action

Fix issues identified in code review:
- Add bounds validation for consecutiveErrorThreshold (must be >= 1)
- Cap error history to prevent unbounded memory growth

The error history is now limited to the threshold size since we only
need the last N errors for pattern detection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
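
The two review fixes in this commit (threshold validation and a capped history) could look roughly like the sketch below. The class name `ErrorHistory`, the use of `RangeError`, and the integer check are assumptions for illustration; the commit message only states that `consecutiveErrorThreshold` must be >= 1 and that the history is limited to the threshold size.

```typescript
class ErrorHistory {
  private readonly threshold: number;

  private codes: string[] = [];

  constructor(consecutiveErrorThreshold = 5) {
    // Assumed validation: reject anything that is not an integer >= 1.
    if (
      !Number.isInteger(consecutiveErrorThreshold) ||
      consecutiveErrorThreshold < 1
    ) {
      throw new RangeError("consecutiveErrorThreshold must be an integer >= 1");
    }
    this.threshold = consecutiveErrorThreshold;
  }

  // Append a code and keep only the most recent `threshold` entries,
  // since older entries cannot affect the consecutive-error check.
  record(code: string): void {
    this.codes.push(code);
    if (this.codes.length > this.threshold) {
      this.codes = this.codes.slice(-this.threshold);
    }
  }

  // Return a copy so callers cannot mutate the internal history.
  snapshot(): readonly string[] {
    return [...this.codes];
  }
}
```

Capping at the threshold keeps memory per peer constant no matter how long a peer keeps failing.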
sirtimid marked this pull request as ready for review January 29, 2026 15:18
sirtimid requested a review from a team as a code owner January 29, 2026 15:18
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

When a user explicitly calls reconnectPeer, clear the permanent failure
status so the reconnection can proceed. Previously, permanently failed
peers could not be manually reconnected because startReconnection would
return false without attempting any connection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
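
The ordering this fix relies on can be shown in a few lines: clear the permanent-failure state first, then ask the manager whether reconnection may start. This is a simplified sketch; `reconnectPeerSketch`, the `ManualReconnectDeps` type, and the `runLoop` callback are hypothetical names, and the real `reconnectPeer` likely does more than this.

```typescript
type ManualReconnectDeps = {
  clearPermanentFailure(peerId: string): void;
  startReconnection(peerId: string): boolean;
};

// Hypothetical manual reconnect: an explicit user request overrides a
// previous "permanently failed" verdict before the normal gate runs.
function reconnectPeerSketch(
  peerId: string,
  manager: ManualReconnectDeps,
  runLoop: (peerId: string) => void,
): boolean {
  manager.clearPermanentFailure(peerId);
  const started = manager.startReconnection(peerId);
  if (started) {
    runLoop(peerId);
  }
  return started;
}
```

Without the first call, `startReconnection` would return false for a permanently failed peer and the manual request would silently do nothing.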

Development

Successfully merging this pull request may close these issues.

Remote comms: Error pattern analysis and permanent failure detection
