feat(transport): add rate limiting for messages and connections #776

sirtimid · 2026-01-26T19:51:24Z

Closes #661

Summary

Add sliding window rate limiting to protect against flooding attacks
Message rate limiting: 100 messages/second per peer (configurable via maxMessagesPerSecond)
Connection attempt rate limiting: 10 attempts/minute per peer (configurable via maxConnectionAttemptsPerMinute)

Implementation Details

New SlidingWindowRateLimiter class with automatic pruning of old events
Rate checks integrated into sendRemoteMessage before sending
Connection rate checks before dialing new connections
Throws ResourceLimitError when limits are exceeded
Rate limiter state cleaned up when peers become stale or transport stops

Test plan

All 1665 tests pass
Lint passes
New test file with 28 tests for rate limiter

🤖 Generated with Claude Code

Note

Medium Risk
Introduces new rate-limiting behavior in the remote transport and reconnection loop, which can change message delivery/retry patterns and cause new ResourceLimitError failures under load or misconfiguration.

Overview
Adds per-peer sliding-window rate limiting to remote comms: outbound sends now enforce maxMessagesPerSecond, and new connection dials/reconnects enforce maxConnectionAttemptsPerMinute, both throwing ResourceLimitError when exceeded.

Extends ResourceLimitError to include new limitType values (messageRate, connectionRate), exports related types, and adds isResourceLimitError for limit-type-aware handling.

Updates reconnection logic to treat rate-limit and connection-limit ResourceLimitErrors as retryable (with attempt-count adjustment for rate-limit pre-dial failures) and tightens channel cleanup by explicitly closing rejected inbound channels and channels opened before a post-dial connection-limit failure. Tests and E2E helpers are adjusted accordingly.

^{Written by Cursor Bugbot for commit 6be75fa. This will update automatically on new commits. Configure here.}

packages/ocap-kernel/src/remotes/platform/transport.ts

github-actions · 2026-01-26T21:25:22Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	88.5% ⬆️ +0.05%	5880 / 6644
🔵	Statements	88.39% ⬆️ +0.06%	5977 / 6762
🔵	Functions	87.47% ⬆️ +0.03%	1529 / 1748
🔵	Branches	84.76% ⬆️ +0.18%	2131 / 2514

File Coverage

File	Stmts	Branches	Functions	Lines	Uncovered Lines
Changed Files
packages/kernel-errors/src/index.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-errors/src/errors/ResourceLimitError.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-errors/src/utils/isResourceLimitError.ts	100%	100%	100%	100%
packages/ocap-kernel/src/remotes/types.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/remotes/platform/constants.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/remotes/platform/rate-limiter.ts	100%	100%	100%	100%
packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts	88.57% ⬆️ +2.37%	87.87% ⬆️ +1.67%	80% 🟰 ±0%	88.57% ⬆️ +2.37%	107-111, 131-132, 231-232
packages/ocap-kernel/src/remotes/platform/reconnection.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/remotes/platform/transport.ts	85.87% ⬇️ -2.32%	80.59% ⬆️ +0.59%	75% ⬇️ -5.00%	85.87% ⬇️ -2.32%	103, 122-131, 163, 197-215, 238, 322, 384, 396, 434, 460, 471

Generated in workflow #3454 for commit 6be75fa by the Vitest Coverage Report Action

packages/ocap-kernel/src/remotes/platform/transport.ts

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts

packages/ocap-kernel/src/remotes/platform/transport.ts

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts

packages/ocap-kernel/src/remotes/platform/rate-limiter.ts

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts

packages/ocap-kernel/src/remotes/platform/rate-limiter.ts

packages/kernel-errors/src/index.ts

packages/ocap-kernel/src/remotes/platform/transport.ts

grypez

A nice PR overall. I have one request from you: a question about reconnection-lifecycle.test.ts.

Comments:

I expect the test suite to flake where it uses real timers instead of fake ones. I won't gate approval on it because we already have flaky tests and having more of them will motivate us to treat them properly.

I left some comments on testing, pertinent not to this PR but to repo management.

grypez · 2026-01-29T16:45:21Z

packages/kernel-errors/src/utils/isResourceLimitError.test.ts

+describe('isResourceLimitError', () => {
+  describe('without limitType parameter', () => {
+    it('returns true for ResourceLimitError', () => {
+      const error = new ResourceLimitError('limit exceeded');
+      expect(isResourceLimitError(error)).toBe(true);
+    });
+
+    it('returns true for ResourceLimitError with any limitType', () => {
+      const connectionError = new ResourceLimitError('connection limit', {
+        data: { limitType: 'connection' },
+      });
+      const rateError = new ResourceLimitError('rate limit', {
+        data: { limitType: 'connectionRate' },
+      });
+
+      expect(isResourceLimitError(connectionError)).toBe(true);
+      expect(isResourceLimitError(rateError)).toBe(true);
+    });
+
+    it('returns false for regular Error', () => {
+      const error = new Error('some error');
+      expect(isResourceLimitError(error)).toBe(false);
+    });
+
+    it('returns false for null', () => {
+      expect(isResourceLimitError(null)).toBe(false);
+    });
+
+    it('returns false for undefined', () => {
+      expect(isResourceLimitError(undefined)).toBe(false);
+    });
+
+    it('returns false for non-error objects', () => {
+      expect(isResourceLimitError({ message: 'fake error' })).toBe(false);
+    });
+  });
+
+  describe('with limitType parameter', () => {
+    it('returns true when limitType matches', () => {
+      const error = new ResourceLimitError('connection limit', {
+        data: { limitType: 'connection' },
+      });
+      expect(isResourceLimitError(error, 'connection')).toBe(true);
+    });
+
+    it('returns false when limitType does not match', () => {
+      const error = new ResourceLimitError('connection limit', {
+        data: { limitType: 'connection' },
+      });
+      expect(isResourceLimitError(error, 'connectionRate')).toBe(false);
+    });
+
+    it('returns false when error has no limitType', () => {
+      const error = new ResourceLimitError('limit exceeded');
+      expect(isResourceLimitError(error, 'connection')).toBe(false);
+    });
+
+    it('returns false for non-ResourceLimitError even with matching-like data', () => {
+      const error = new Error('some error');
+      expect(isResourceLimitError(error, 'connection')).toBe(false);
+    });
+
+    it.each([
+      'connection',
+      'connectionRate',
+      'messageSize',
+      'messageRate',
+    ] as const)('correctly identifies %s limitType', (limitType) => {
+      const error = new ResourceLimitError('limit exceeded', {
+        data: { limitType },
+      });
+      expect(isResourceLimitError(error, limitType)).toBe(true);
+    });
+  });
+});


LG in this PR.

This whole suite tests $x_i \mapsto y_i$. There must be a shorter way to write this moving forward, that reads faster.

packages/ocap-kernel/src/remotes/platform/rate-limiter.test.ts

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.test.ts

packages/ocap-kernel/src/remotes/platform/rate-limiter.ts

Add sliding window rate limiting to protect against flooding attacks: - Message rate limiting: 100 messages/second per peer (configurable) - Connection attempt rate limiting: 10 attempts/minute per peer (configurable) Implementation: - Add SlidingWindowRateLimiter class with automatic pruning - Add maxMessagesPerSecond and maxConnectionAttemptsPerMinute options - Integrate rate checks in sendRemoteMessage before sending - Integrate connection rate checks before dialing new connections - Clean up rate limiter state when peers become stale or transport stops Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive test coverage for SlidingWindowRateLimiter: - Basic limit checking (wouldExceedLimit) - Event recording and pruning - checkAndRecord with error handling - getCurrentCount with window expiration - clearKey and clear methods - pruneStale for cleanup - Sliding window behavior with real timing Also test factory functions and constants. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add 'messageRate' and 'connectionRate' to ResourceLimitError limitType - Update rate limiter to use correct limit type enum values - Update tests to match new limit types Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Move message rate recording to after successful write instead of before send attempt. This prevents failed sends from consuming rate quota. - Add connection rate limiting to automatic reconnection attempts via checkConnectionRateLimit dependency in reconnection lifecycle. - Handle ResourceLimitError gracefully during reconnection by continuing the loop after backoff instead of giving up on the peer. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add remoteCommsOptions parameter to setupAliceAndBob helper - Configure higher maxMessagesPerSecond for queue limit test to ensure rate limiting doesn't interfere with queue limit testing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rror - Call getCurrentCount once and reuse the value for both message and data - Use DEFAULT_MESSAGE_RATE_WINDOW_MS constant instead of hardcoded 1000 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Move DEFAULT_MESSAGE_RATE_LIMIT, DEFAULT_MESSAGE_RATE_WINDOW_MS, DEFAULT_CONNECTION_RATE_LIMIT, and DEFAULT_CONNECTION_RATE_WINDOW_MS to constants.ts for consistency with other default constants. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Revert to using checkAndRecord() for message rate limiting instead of separate check and record calls. The separated approach had a TOCTOU race where concurrent sends could all pass the check before any recorded, bypassing the rate limit. Yes, failed sends now consume quota, but this is necessary for security - recording after send would allow attackers to make unlimited concurrent attempts that bypass the rate limit. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…annels Two bug fixes: 1. Rate-limited reconnection attempts no longer consume retry quota. Previously, incrementAttempt was called before the rate limit check, so rate-limited attempts counted against maxRetryAttempts even though no dial was performed. Now, decrementAttempt is called when rate limited to undo the premature increment. 2. Rejected inbound connections are now properly closed. When an inbound connection is rejected (due to intentional close or connection limit), the channel is now closed via closeChannel() to prevent resource leaks from dangling libp2p streams. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add validation in SlidingWindowRateLimiter constructor to reject non-positive values for maxEvents and windowMs, preventing misconfiguration that would cause unexpected behavior - Optimize checkAndRecord to use getCurrentCount instead of duplicating the Date.now() call and timestamp filtering logic - Refactor constructor validation tests to use it.each for conciseness Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The ResourceLimitError catch block incorrectly treated all such errors as rate limit errors where "no actual dial was performed." However, checkConnectionLimit() throws with limitType: 'connection' AFTER dial succeeds. This caused: 1. decrementAttempt incorrectly called (dial was performed, should count) 2. Log message incorrectly said "rate limited" 3. Dialed channel leaked since never closed or registered Fix: - Check error.data.limitType to distinguish 'connectionRate' (before dial) from 'connection' (after dial) - Only decrement attempt count for rate limit errors (connectionRate) - Add closeChannel dependency to close leaked channels - Wrap checkConnectionLimit in try/catch to close channel on error Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…Data types Extract reusable types from ResourceLimitError to improve type safety and reduce inline type casts when checking error data. - Add ResourceLimitType union type for limit types - Add ResourceLimitErrorData type for error data structure - Export both types from kernel-errors package - Use ResourceLimitErrorData in reconnection-lifecycle for cleaner code Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The validation `maxEvents <= 0` passes for NaN and Infinity since both comparisons return false. This silently disables rate limiting entirely as `currentCount >= NaN` and `currentCount >= Infinity` are always false. Fix by using Number.isFinite() which rejects NaN, Infinity, and -Infinity, ensuring rate limiting cannot be bypassed via faulty configuration parsing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…connection Connection limit errors (ResourceLimitError with limitType: 'connection') were falling through to isRetryableNetworkError(), which doesn't recognize them, causing permanent reconnection failure via onRemoteGiveUp. Added explicit handling to continue the retry loop for connection limit errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Extracts the ResourceLimitError check pattern into a reusable type guard function that optionally checks for a specific limitType. This simplifies the error handling code in reconnection-lifecycle.ts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When checkConnectionLimit() throws during reconnection, the code closes the channel before re-throwing. If closeChannel itself throws, that error would propagate instead of the original ResourceLimitError, causing reconnection to give up prematurely via onRemoteGiveUp. Wrap closeChannel in try-catch to ensure the original error is always re-thrown regardless of cleanup success. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…ntCount Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The transport tests were failing because the mock for @metamask/kernel-errors did not include isResourceLimitError, which was recently added. When the reconnection lifecycle code tried to call this function, it failed with undefined is not a function, causing the reconnection loop to crash. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Extract duplicated channel closing logic in the inbound connection handler into a dedicated helper function to reduce duplication and ensure consistent error handling. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Make test assertions more precise by verifying exact call counts for checkConnectionLimit and checkConnectionRateLimit. In a single successful reconnection attempt, each should be called exactly once. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…ness Add a flushPromises helper and use it in handleConnectionLoss tests that trigger fire-and-forget async reconnection work. This ensures all pending microtasks complete before assertions run, preventing potential flakiness from async operations bleeding into subsequent tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-29T19:15:03Z

packages/ocap-kernel/src/remotes/platform/transport.ts

      logger.log(
        `${channel.peerId}:: rejecting inbound connection from intentionally closed peer`,
      );
-      // Don't add to channels map and don't start reading - connection will naturally close


Channel leak when connection limit exceeded after dial

Medium Severity

In sendRemoteMessage, when checkConnectionLimit() at line 390 throws a ResourceLimitError after a successful dial, the dialed channel is not closed before the error is rethrown. The catch block at lines 393-401 checks for ResourceLimitError and rethrows it without closing the channel, causing a resource leak. The correct pattern is implemented in tryReconnect in reconnection-lifecycle.ts, which wraps the checkConnectionLimit() call in its own try-catch to close the channel before rethrowing.

sirtimid requested a review from a team as a code owner January 26, 2026 19:51

cursor bot reviewed Jan 26, 2026

View reviewed changes

packages/ocap-kernel/src/remotes/platform/transport.ts Outdated Show resolved Hide resolved

cursor bot reviewed Jan 26, 2026

View reviewed changes

packages/ocap-kernel/src/remotes/platform/transport.ts Show resolved Hide resolved

cursor bot reviewed Jan 26, 2026

View reviewed changes

packages/ocap-kernel/src/remotes/platform/transport.ts Outdated Show resolved Hide resolved

cursor bot reviewed Jan 26, 2026

View reviewed changes

packages/ocap-kernel/src/remotes/platform/transport.ts Outdated Show resolved Hide resolved

sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from cf0fab5 to 44c20c4 Compare January 27, 2026 23:42

cursor bot reviewed Jan 27, 2026

View reviewed changes

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts Show resolved Hide resolved

packages/ocap-kernel/src/remotes/platform/transport.ts Show resolved Hide resolved

sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from 44c20c4 to 840c5eb Compare January 28, 2026 15:20