feat(ocap-kernel): add permanent failure detection for reconnection #789

sirtimid · 2026-01-29T14:21:42Z

Closes #688

Summary

Add error pattern tracking to ReconnectionManager to detect permanently unreachable peers
When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed
Permanent failures stop reconnection attempts to avoid wasted retries for unreachable peers

Changes

kernel-errors:

Add getNetworkErrorCode() helper to extract error codes from errors

ocap-kernel:

Extend ReconnectionState with errorHistory and permanentlyFailed fields
Add recordError(), isPermanentlyFailed(), clearPermanentFailure() methods to ReconnectionManager
Update startReconnection() to return false for permanently failed peers and reset error history
Integrate error recording into reconnection lifecycle
Check permanent failure status before attempting reconnection
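
The manager changes listed above could look roughly like the sketch below. It is illustrative only: the method and field names come from this description, but the method bodies, the string peer ID, the `isReconnecting` field, and the constructor parameter are assumptions rather than the code in this PR.

```typescript
// Sketch only. Names follow the PR description; bodies are assumed.
type ReconnectionState = {
  isReconnecting: boolean; // assumed pre-existing field
  errorHistory: string[];
  permanentlyFailed: boolean;
};

// The four "unreachable" codes named in the PR description.
const PERMANENT_CODES: ReadonlySet<string> = new Set([
  'ECONNREFUSED',
  'EHOSTUNREACH',
  'ENOTFOUND',
  'ENETUNREACH',
]);

class ReconnectionManager {
  private readonly states = new Map<string, ReconnectionState>();

  private readonly threshold: number;

  constructor(consecutiveErrorThreshold = 5) {
    this.threshold = consecutiveErrorThreshold;
  }

  private getState(peerId: string): ReconnectionState {
    let state = this.states.get(peerId);
    if (state === undefined) {
      state = { isReconnecting: false, errorHistory: [], permanentlyFailed: false };
      this.states.set(peerId, state);
    }
    return state;
  }

  /** Returns false when reconnection should be skipped. */
  startReconnection(peerId: string): boolean {
    const state = this.getState(peerId);
    if (state.permanentlyFailed) {
      return false;
    }
    state.isReconnecting = true;
    state.errorHistory = []; // assumption: each run starts a fresh streak
    return true;
  }

  /** Records a failure and flags the peer once the streak is long enough. */
  recordError(peerId: string, errorCode: string): void {
    const state = this.getState(peerId);
    state.errorHistory.push(errorCode);
    const recent = state.errorHistory.slice(-this.threshold);
    if (
      recent.length === this.threshold &&
      PERMANENT_CODES.has(errorCode) &&
      recent.every((code) => code === errorCode)
    ) {
      state.permanentlyFailed = true;
    }
  }

  isPermanentlyFailed(peerId: string): boolean {
    return this.states.get(peerId)?.permanentlyFailed ?? false;
  }

  clearPermanentFailure(peerId: string): void {
    const state = this.states.get(peerId);
    if (state !== undefined) {
      state.permanentlyFailed = false;
      state.errorHistory = [];
    }
  }
}
```

With a threshold of 3, three consecutive `ECONNREFUSED` records would flag the peer, after which `startReconnection()` returns `false` until `clearPermanentFailure()` is called.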

Test plan

Unit tests for getNetworkErrorCode helper
Unit tests for error tracking in ReconnectionManager
Unit tests for permanent failure detection (consecutive identical errors)
Unit tests for clearing permanent failure state
Integration tests for reconnection lifecycle with permanent failure
All existing tests pass

🤖 Generated with Claude Code

Note

Medium Risk
Changes reconnection control flow to short-circuit retries and alter startReconnection semantics, which could affect connectivity and recovery behavior for peers. Risk is moderated by extensive new unit/integration tests around error classification and give-up paths.

Overview
Adds getNetworkErrorCode to @metamask/kernel-errors and exports it to consistently derive a stable network error identifier from error.code, error.name, or relay NO_RESERVATION messages.
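
As a rough illustration of that derivation, the sketch below reads `code`, then a `NO_RESERVATION` marker in the message, then `name`. The lookup order, the `'UNKNOWN'` fallback, and the substring match are assumptions; the overview only names the three sources. The message is checked before `name` here because every `Error` instance has a `name`, so a name-first order would never reach the relay check.

```typescript
// Assumed fallback; the PR text does not say what is returned when nothing matches.
const UNKNOWN_CODE = 'UNKNOWN';

/**
 * Derives a stable identifier for a network error.
 * Sketch only: the real helper in @metamask/kernel-errors may differ.
 */
function getNetworkErrorCode(error: unknown): string {
  if (typeof error !== 'object' || error === null) {
    return UNKNOWN_CODE;
  }
  const { code, name, message } = error as {
    code?: unknown;
    name?: unknown;
    message?: unknown;
  };
  // 1. Node-style system errors carry a string `code` such as 'ECONNREFUSED'.
  if (typeof code === 'string' && code.length > 0) {
    return code;
  }
  // 2. Relay failures are recognised by a NO_RESERVATION marker in the message.
  if (typeof message === 'string' && message.includes('NO_RESERVATION')) {
    return 'NO_RESERVATION';
  }
  // 3. Otherwise fall back to the error's name.
  if (typeof name === 'string' && name.length > 0) {
    return name;
  }
  return UNKNOWN_CODE;
}
```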

Updates ocap-kernel reconnection to track recent dial failures per peer and mark peers as permanently failed after N consecutive identical “unreachable” errors (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, ENETUNREACH). The reconnection lifecycle now records error codes, bails out immediately for permanently failed peers, and startReconnection returns false when reconnection should be skipped; manual reconnectPeer clears permanent-failure state before retrying.
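
The control flow described above might be sketched as follows. This is a simplified illustration, not the contents of `reconnection-lifecycle.ts`: the `ReconnectionTracker` type, the `Dial` callback, the `Outcome` values, and the `maxAttempts` parameter are all hypothetical, and backoff between attempts is omitted.

```typescript
// Hypothetical slice of the manager API used by the loop. Method names come
// from the PR description; the signatures are assumed.
type ReconnectionTracker = {
  startReconnection(peerId: string): boolean;
  recordError(peerId: string, errorCode: string): void;
  isPermanentlyFailed(peerId: string): boolean;
  clearPermanentFailure(peerId: string): void;
};

type Dial = (peerId: string) => Promise<void>;
type Outcome = 'connected' | 'skipped' | 'gave-up' | 'exhausted';

/** Automatic reconnection: skips or gives up on permanently failed peers. */
async function runReconnection(
  tracker: ReconnectionTracker,
  peerId: string,
  dial: Dial,
  toErrorCode: (error: unknown) => string,
  maxAttempts: number,
): Promise<Outcome> {
  if (!tracker.startReconnection(peerId)) {
    return 'skipped'; // startReconnection returned false
  }
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    // Checked at the top of the loop in case the flag was set elsewhere.
    if (tracker.isPermanentlyFailed(peerId)) {
      return 'gave-up';
    }
    try {
      await dial(peerId);
      return 'connected';
    } catch (error) {
      tracker.recordError(peerId, toErrorCode(error));
      if (tracker.isPermanentlyFailed(peerId)) {
        return 'gave-up'; // this error completed the streak
      }
    }
  }
  return 'exhausted';
}

/** Manual reconnection: clears the permanent-failure state, then retries. */
async function reconnectPeer(
  tracker: ReconnectionTracker,
  peerId: string,
  dial: Dial,
  toErrorCode: (error: unknown) => string,
  maxAttempts: number,
): Promise<Outcome> {
  tracker.clearPermanentFailure(peerId);
  return runReconnection(tracker, peerId, dial, toErrorCode, maxAttempts);
}
```

Clearing the state in `reconnectPeer` before calling `startReconnection` is what lets an explicit user request bypass the short-circuit that automatic reconnection hits.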

Written by Cursor Bugbot for commit a1c76bb. This will update automatically on new commits.

Add error pattern tracking to ReconnectionManager to detect when a peer is permanently unreachable. When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed and reconnection attempts stop.

Changes:
- Add error history tracking per peer in ReconnectionManager
- Add isPermanentlyFailed() and clearPermanentFailure() methods
- Add getNetworkErrorCode() helper to extract error codes
- Integrate error recording into reconnection lifecycle
- Check permanent failure status before attempting reconnection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive unit tests for:
- getNetworkErrorCode helper function
- Error tracking in ReconnectionManager (recordError, getErrorHistory)
- Permanent failure detection (isPermanentlyFailed)
- Clearing permanent failure state (clearPermanentFailure)
- Custom consecutive error threshold
- Integration with startReconnection, clearPeer, and clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add integration tests for permanent failure detection in the reconnection lifecycle:
- Gives up when peer is permanently failed at start of loop
- Records errors after failed dial attempts
- Gives up when error triggers permanent failure
- Continues retrying when error does not trigger failure
- handleConnectionLoss skips reconnection for permanently failed peers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Update existing tests to work with permanent failure detection changes:
- Add getNetworkErrorCode export to index test
- Add getNetworkErrorCode mock to transport tests
- Update startReconnection mocks to return true
- Add isPermanentlyFailed and recordError mocks to ReconnectionManager

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-01-29T14:28:59Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	88.67% ⬆️ +0.08%	5833 / 6578
🔵	Statements	88.55% ⬆️ +0.08%	5926 / 6692
🔵	Functions	87.72% ⬆️ +0.06%	1515 / 1727
🔵	Branches	84.99% ⬆️ +0.16%	2119 / 2493

File Coverage

File	Stmts	Branches	Functions	Lines	Uncovered Lines
Changed Files
packages/kernel-errors/src/index.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-errors/src/utils/getNetworkErrorCode.ts	100%	100%	100%	100%
packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts	89.18% ⬆️ +2.98%	88.57% ⬆️ +2.37%	80% 🟰 ±0%	89.18% ⬆️ +2.98%	113-117, 137-138, 231-232
packages/ocap-kernel/src/remotes/platform/reconnection.ts	98.36% ⬇️ -1.64%	95.83% ⬇️ -4.17%	100% 🟰 ±0%	98.33% ⬇️ -1.67%	287
packages/ocap-kernel/src/remotes/platform/transport.ts	88.27% ⬆️ +0.08%	80% 🟰 ±0%	80% 🟰 ±0%	88.27% ⬆️ +0.08%	89, 108-111, 144, 178-196, 219, 352, 364, 416, 427

Generated in workflow #3440 for commit a1c76bb by the Vitest Coverage Report Action

Fix issues identified in code review:
- Add bounds validation for consecutiveErrorThreshold (must be >= 1)
- Cap error history to prevent unbounded memory growth

The error history is now limited to the threshold size since we only need the last N errors for pattern detection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
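
A minimal sketch of those two review fixes, assuming hypothetical helper names `assertValidThreshold` and `pushCapped` (the PR may implement both inline). The integer check is an added assumption; the commit message only requires the threshold to be at least 1.

```typescript
/** Rejects thresholds below 1. The integer check is an assumption. */
function assertValidThreshold(threshold: number): void {
  if (!Number.isInteger(threshold) || threshold < 1) {
    throw new RangeError(
      `consecutiveErrorThreshold must be an integer >= 1, got ${threshold}`,
    );
  }
}

/**
 * Appends an error code and trims the history in place so that it never
 * holds more than `threshold` entries. Only the last N codes matter for
 * detecting N consecutive identical errors.
 */
function pushCapped(history: string[], errorCode: string, threshold: number): void {
  history.push(errorCode);
  if (history.length > threshold) {
    history.splice(0, history.length - threshold);
  }
}
```

Capping at the threshold keeps memory per peer constant no matter how long a peer keeps failing with mixed errors.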

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts

When a user explicitly calls reconnectPeer, clear the permanent failure status so the reconnection can proceed. Previously, permanently failed peers could not be manually reconnected because startReconnection would return false without attempting any connection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sirtimid and others added 4 commits January 29, 2026 15:07

sirtimid marked this pull request as ready for review January 29, 2026 15:18

sirtimid requested a review from a team as a code owner January 29, 2026 15:18

cursor bot reviewed Jan 29, 2026
