@sirtimid sirtimid commented Jan 26, 2026

Closes #661

Summary

  • Add sliding window rate limiting to protect against flooding attacks
  • Message rate limiting: 100 messages/second per peer (configurable via maxMessagesPerSecond)
  • Connection attempt rate limiting: 10 attempts/minute per peer (configurable via maxConnectionAttemptsPerMinute)

Implementation Details

  • New SlidingWindowRateLimiter class with automatic pruning of old events
  • Rate checks integrated into sendRemoteMessage before sending
  • Connection rate checks before dialing new connections
  • Throws ResourceLimitError when limits are exceeded
  • Rate limiter state cleaned up when peers become stale or transport stops
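
A minimal sketch of a sliding-window limiter along the lines described above. The method names `wouldExceedLimit`, `checkAndRecord`, `getCurrentCount`, and `clearKey` appear in this PR's commit messages; their signatures, the injected clock, and the plain `Error` thrown here (the PR throws `ResourceLimitError`) are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical sketch: signatures are assumed, not copied from the PR.
class SlidingWindowRateLimiter {
  private readonly events = new Map<string, number[]>();

  constructor(
    private readonly maxEvents: number,
    private readonly windowMs: number,
    // Injected clock so the example is deterministic (an assumption).
    private readonly now: () => number = () => Date.now(),
  ) {
    if (!Number.isFinite(maxEvents) || maxEvents <= 0) {
      throw new RangeError('maxEvents must be a positive finite number');
    }
    if (!Number.isFinite(windowMs) || windowMs <= 0) {
      throw new RangeError('windowMs must be a positive finite number');
    }
  }

  // Count events for `key` inside the current window, pruning older ones.
  getCurrentCount(key: string): number {
    const cutoff = this.now() - this.windowMs;
    const recent = (this.events.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length === 0) {
      this.events.delete(key);
    } else {
      this.events.set(key, recent);
    }
    return recent.length;
  }

  wouldExceedLimit(key: string): boolean {
    return this.getCurrentCount(key) >= this.maxEvents;
  }

  // Check and record in one synchronous step.
  checkAndRecord(key: string): void {
    if (this.wouldExceedLimit(key)) {
      throw new Error(`rate limit exceeded for ${key}`);
    }
    const timestamps = this.events.get(key) ?? [];
    timestamps.push(this.now());
    this.events.set(key, timestamps);
  }

  clearKey(key: string): void {
    this.events.delete(key);
  }
}

// Usage with a fake clock: at most 2 events per 1000 ms window.
let fakeTime = 0;
const limiter = new SlidingWindowRateLimiter(2, 1000, () => fakeTime);
limiter.checkAndRecord('peer-a'); // recorded at t = 0
fakeTime = 500;
limiter.checkAndRecord('peer-a'); // recorded at t = 500
fakeTime = 600;
const blockedAt600 = limiter.wouldExceedLimit('peer-a'); // true: two events in window
fakeTime = 1001;
const blockedAt1001 = limiter.wouldExceedLimit('peer-a'); // false: the t = 0 event expired
```

Keeping timestamps per key makes the window slide continuously, at the cost of storing up to `maxEvents` timestamps per peer.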

Test plan

  • All 1665 tests pass
  • Lint passes
  • New test file with 28 tests for rate limiter

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new rate-limiting behavior in the remote transport and reconnection loop, which can change message delivery/retry patterns and cause new ResourceLimitError failures under load or misconfiguration.

Overview
Adds per-peer sliding-window rate limiting to remote comms: outbound sends now enforce maxMessagesPerSecond, and new connection dials/reconnects enforce maxConnectionAttemptsPerMinute, both throwing ResourceLimitError when exceeded.

Extends ResourceLimitError to include new limitType values (messageRate, connectionRate), exports related types, and adds isResourceLimitError for limit-type-aware handling.
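
For illustration, a limit-type-aware guard could look roughly like the sketch below. The four `limitType` values and the `{ data: { limitType } }` constructor shape are taken from the tests quoted later in this conversation; the class body, the guard body, and the `describeLimit` helper are hypothetical stand-ins, not the code in `@metamask/kernel-errors`.

```typescript
// Hypothetical stand-ins; the real definitions live in @metamask/kernel-errors.
type ResourceLimitType =
  | 'connection'
  | 'connectionRate'
  | 'messageSize'
  | 'messageRate';
type ResourceLimitErrorData = { limitType?: ResourceLimitType };

class ResourceLimitError extends Error {
  readonly data: ResourceLimitErrorData | undefined;

  constructor(message: string, options?: { data?: ResourceLimitErrorData }) {
    super(message);
    // Keeps instanceof working even when compiled for ES5 targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.data = options?.data;
  }
}

// With no limitType, any ResourceLimitError matches; otherwise the type must match.
function isResourceLimitError(
  error: unknown,
  limitType?: ResourceLimitType,
): error is ResourceLimitError {
  if (!(error instanceof ResourceLimitError)) {
    return false;
  }
  return limitType === undefined || error.data?.limitType === limitType;
}

// Hypothetical caller that branches on the limit type.
function describeLimit(error: unknown): string {
  if (isResourceLimitError(error, 'connectionRate')) {
    return 'rate limited before dialing';
  }
  if (isResourceLimitError(error, 'connection')) {
    return 'connection limit hit';
  }
  if (isResourceLimitError(error)) {
    return 'other resource limit';
  }
  return 'not a resource limit';
}
```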

Updates reconnection logic to treat rate-limit and connection-limit ResourceLimitErrors as retryable (with attempt-count adjustment for rate-limit pre-dial failures) and tightens channel cleanup by explicitly closing rejected inbound channels and channels opened before a post-dial connection-limit failure. Tests and E2E helpers are adjusted accordingly.

Written by Cursor Bugbot for commit 6be75fa.

@sirtimid sirtimid requested a review from a team as a code owner January 26, 2026 19:51
github-actions bot commented Jan 26, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 88.5% (⬆️ +0.05%) | 5880 / 6644 |
| 🔵 | Statements | 88.39% (⬆️ +0.06%) | 5977 / 6762 |
| 🔵 | Functions | 87.47% (⬆️ +0.03%) | 1529 / 1748 |
| 🔵 | Branches | 84.76% (⬆️ +0.18%) | 2131 / 2514 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/kernel-errors/src/index.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-errors/src/errors/ResourceLimitError.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-errors/src/utils/isResourceLimitError.ts | 100% | 100% | 100% | 100% | |
| packages/ocap-kernel/src/remotes/types.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/remotes/platform/constants.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/remotes/platform/rate-limiter.ts | 100% | 100% | 100% | 100% | |
| packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts | 88.57% (⬆️ +2.37%) | 87.87% (⬆️ +1.67%) | 80% (🟰 ±0%) | 88.57% (⬆️ +2.37%) | 107-111, 131-132, 231-232 |
| packages/ocap-kernel/src/remotes/platform/reconnection.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/remotes/platform/transport.ts | 85.87% (⬇️ -2.32%) | 80.59% (⬆️ +0.59%) | 75% (⬇️ -5.00%) | 85.87% (⬇️ -2.32%) | 103, 122-131, 163, 197-215, 238, 322, 384, 396, 434, 460, 471 |

Generated in workflow #3454 for commit 6be75fa by the Vitest Coverage Report Action

@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from cf0fab5 to 44c20c4 Compare January 27, 2026 23:42
@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from 44c20c4 to 840c5eb Compare January 28, 2026 15:20
@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from 859577d to 871dbd2 Compare January 28, 2026 15:39
@rekmarks rekmarks requested a review from grypez January 28, 2026 17:42
@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from 8b53542 to 020fc36 Compare January 28, 2026 17:45
@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from 527dc83 to a6d921b Compare January 29, 2026 11:40
@grypez grypez left a comment

A nice PR overall. I have one request for you: a question about reconnection-lifecycle.test.ts.

Comments:

I expect the test suite to flake where it uses real timers instead of fake ones. I won't gate approval on it because we already have flaky tests and having more of them will motivate us to treat them properly.

I left some comments on testing, pertinent not to this PR but to repo management.

Comment on lines +6 to +80
describe('isResourceLimitError', () => {
  describe('without limitType parameter', () => {
    it('returns true for ResourceLimitError', () => {
      const error = new ResourceLimitError('limit exceeded');
      expect(isResourceLimitError(error)).toBe(true);
    });

    it('returns true for ResourceLimitError with any limitType', () => {
      const connectionError = new ResourceLimitError('connection limit', {
        data: { limitType: 'connection' },
      });
      const rateError = new ResourceLimitError('rate limit', {
        data: { limitType: 'connectionRate' },
      });

      expect(isResourceLimitError(connectionError)).toBe(true);
      expect(isResourceLimitError(rateError)).toBe(true);
    });

    it('returns false for regular Error', () => {
      const error = new Error('some error');
      expect(isResourceLimitError(error)).toBe(false);
    });

    it('returns false for null', () => {
      expect(isResourceLimitError(null)).toBe(false);
    });

    it('returns false for undefined', () => {
      expect(isResourceLimitError(undefined)).toBe(false);
    });

    it('returns false for non-error objects', () => {
      expect(isResourceLimitError({ message: 'fake error' })).toBe(false);
    });
  });

  describe('with limitType parameter', () => {
    it('returns true when limitType matches', () => {
      const error = new ResourceLimitError('connection limit', {
        data: { limitType: 'connection' },
      });
      expect(isResourceLimitError(error, 'connection')).toBe(true);
    });

    it('returns false when limitType does not match', () => {
      const error = new ResourceLimitError('connection limit', {
        data: { limitType: 'connection' },
      });
      expect(isResourceLimitError(error, 'connectionRate')).toBe(false);
    });

    it('returns false when error has no limitType', () => {
      const error = new ResourceLimitError('limit exceeded');
      expect(isResourceLimitError(error, 'connection')).toBe(false);
    });

    it('returns false for non-ResourceLimitError even with matching-like data', () => {
      const error = new Error('some error');
      expect(isResourceLimitError(error, 'connection')).toBe(false);
    });

    it.each([
      'connection',
      'connectionRate',
      'messageSize',
      'messageRate',
    ] as const)('correctly identifies %s limitType', (limitType) => {
      const error = new ResourceLimitError('limit exceeded', {
        data: { limitType },
      });
      expect(isResourceLimitError(error, limitType)).toBe(true);
    });
  });
});

LG in this PR.

This whole suite tests $x_i \mapsto y_i$. There must be a shorter way to write this moving forward, that reads faster.
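
One possible shorter form, sketched under assumptions: list each input and its expected output as a row, then check every row in a loop. In the Vitest suite the same rows could presumably be passed to `it.each`, which the suite already uses. The `ResourceLimitError` class and guard below are compact hypothetical stand-ins so the table runs on its own, and `makeLimitError` is a helper invented for this example.

```typescript
// Minimal stand-ins (assumptions, not the real @metamask/kernel-errors code).
type LimitType = 'connection' | 'connectionRate' | 'messageSize' | 'messageRate';

class ResourceLimitError extends Error {
  readonly data: { limitType?: LimitType } | undefined;

  constructor(message: string, options?: { data?: { limitType?: LimitType } }) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.data = options?.data;
  }
}

const isResourceLimitError = (error: unknown, limitType?: LimitType): boolean =>
  error instanceof ResourceLimitError &&
  (limitType === undefined || error.data?.limitType === limitType);

// Hypothetical helper to keep the rows short.
const makeLimitError = (limitType?: LimitType): ResourceLimitError =>
  new ResourceLimitError('limit', limitType ? { data: { limitType } } : undefined);

// Each row: [label, input, limitType filter, expected result].
const cases: [string, unknown, LimitType | undefined, boolean][] = [
  ['ResourceLimitError, no filter', makeLimitError(), undefined, true],
  ['typed error, no filter', makeLimitError('connectionRate'), undefined, true],
  ['plain Error', new Error('x'), undefined, false],
  ['null', null, undefined, false],
  ['undefined', undefined, undefined, false],
  ['error-like object', { message: 'fake' }, undefined, false],
  ['matching limitType', makeLimitError('connection'), 'connection', true],
  ['mismatched limitType', makeLimitError('connection'), 'connectionRate', false],
  ['missing limitType', makeLimitError(), 'connection', false],
  ['plain Error with filter', new Error('x'), 'connection', false],
];

// Labels of rows whose actual result differs from the expected one.
const failures = cases
  .filter(
    ([, input, limitType, expected]) =>
      isResourceLimitError(input, limitType) !== expected,
  )
  .map(([label]) => label);
```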

sirtimid and others added 10 commits January 29, 2026 19:56
Add sliding window rate limiting to protect against flooding attacks:
- Message rate limiting: 100 messages/second per peer (configurable)
- Connection attempt rate limiting: 10 attempts/minute per peer (configurable)

Implementation:
- Add SlidingWindowRateLimiter class with automatic pruning
- Add maxMessagesPerSecond and maxConnectionAttemptsPerMinute options
- Integrate rate checks in sendRemoteMessage before sending
- Integrate connection rate checks before dialing new connections
- Clean up rate limiter state when peers become stale or transport stops

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive test coverage for SlidingWindowRateLimiter:
- Basic limit checking (wouldExceedLimit)
- Event recording and pruning
- checkAndRecord with error handling
- getCurrentCount with window expiration
- clearKey and clear methods
- pruneStale for cleanup
- Sliding window behavior with real timing

Also test factory functions and constants.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 'messageRate' and 'connectionRate' to ResourceLimitError limitType
- Update rate limiter to use correct limit type enum values
- Update tests to match new limit types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move message rate recording to after successful write instead of before
  send attempt. This prevents failed sends from consuming rate quota.
- Add connection rate limiting to automatic reconnection attempts via
  checkConnectionRateLimit dependency in reconnection lifecycle.
- Handle ResourceLimitError gracefully during reconnection by continuing
  the loop after backoff instead of giving up on the peer.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add remoteCommsOptions parameter to setupAliceAndBob helper
- Configure higher maxMessagesPerSecond for queue limit test to
  ensure rate limiting doesn't interfere with queue limit testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…rror

- Call getCurrentCount once and reuse the value for both message and data
- Use DEFAULT_MESSAGE_RATE_WINDOW_MS constant instead of hardcoded 1000

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move DEFAULT_MESSAGE_RATE_LIMIT, DEFAULT_MESSAGE_RATE_WINDOW_MS,
DEFAULT_CONNECTION_RATE_LIMIT, and DEFAULT_CONNECTION_RATE_WINDOW_MS
to constants.ts for consistency with other default constants.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Revert to using checkAndRecord() for message rate limiting instead of
separate check and record calls. The separated approach had a TOCTOU
race where concurrent sends could all pass the check before any recorded,
bypassing the rate limit.

Yes, failed sends now consume quota, but this is necessary for security -
recording after send would allow attackers to make unlimited concurrent
attempts that bypass the rate limit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
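
The race described in the commit message above can be reproduced in isolation. The sketch below uses a hypothetical `CountLimiter` with no time window and a fake send. It only illustrates why a check separated from its record by an `await` lets concurrent callers through; it is not the transport code.

```typescript
// Hypothetical counter-only limiter, used to isolate the race.
class CountLimiter {
  private count = 0;

  constructor(private readonly max: number) {}

  wouldExceed(): boolean {
    return this.count >= this.max;
  }

  record(): void {
    this.count += 1;
  }

  checkAndRecord(): void {
    if (this.wouldExceed()) {
      throw new Error('rate limit exceeded');
    }
    this.record();
  }
}

// Stand-in for an asynchronous network write.
const fakeSend = async (): Promise<void> => {
  await Promise.resolve();
};

// Racy: the check and the record are separated by an await.
async function sendRacy(limiter: CountLimiter): Promise<boolean> {
  if (limiter.wouldExceed()) {
    return false;
  }
  await fakeSend();
  limiter.record();
  return true;
}

// Atomic: check and record happen synchronously, before the await.
async function sendAtomic(limiter: CountLimiter): Promise<boolean> {
  try {
    limiter.checkAndRecord();
  } catch {
    return false;
  }
  await fakeSend();
  return true;
}

// Fire five concurrent sends against a limit of two and count the accepted ones.
async function countAccepted(
  send: (limiter: CountLimiter) => Promise<boolean>,
): Promise<number> {
  const limiter = new CountLimiter(2);
  const results = await Promise.all([1, 2, 3, 4, 5].map(() => send(limiter)));
  return results.filter(Boolean).length;
}
```

With these definitions, `countAccepted(sendRacy)` resolves to 5 (every caller passes the check before any records), while `countAccepted(sendAtomic)` resolves to 2.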
…annels

Two bug fixes:

1. Rate-limited reconnection attempts no longer consume retry quota.
   Previously, incrementAttempt was called before the rate limit check,
   so rate-limited attempts counted against maxRetryAttempts even though
   no dial was performed. Now, decrementAttempt is called when rate
   limited to undo the premature increment.

2. Rejected inbound connections are now properly closed. When an inbound
   connection is rejected (due to intentional close or connection limit),
   the channel is now closed via closeChannel() to prevent resource
   leaks from dangling libp2p streams.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add validation in SlidingWindowRateLimiter constructor to reject
  non-positive values for maxEvents and windowMs, preventing
  misconfiguration that would cause unexpected behavior
- Optimize checkAndRecord to use getCurrentCount instead of duplicating
  the Date.now() call and timestamp filtering logic
- Refactor constructor validation tests to use it.each for conciseness

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
sirtimid and others added 12 commits January 29, 2026 19:56
The ResourceLimitError catch block incorrectly treated all such errors
as rate limit errors where "no actual dial was performed." However,
checkConnectionLimit() throws with limitType: 'connection' AFTER dial
succeeds. This caused:

1. decrementAttempt incorrectly called (dial was performed, should count)
2. Log message incorrectly said "rate limited"
3. Dialed channel leaked since never closed or registered

Fix:
- Check error.data.limitType to distinguish 'connectionRate' (before dial)
  from 'connection' (after dial)
- Only decrement attempt count for rate limit errors (connectionRate)
- Add closeChannel dependency to close leaked channels
- Wrap checkConnectionLimit in try/catch to close channel on error

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
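
A rough sketch of the distinction drawn in the fix above: a `connectionRate` error is thrown before dialing, so the earlier attempt increment is undone, while a `connection` error is thrown after a successful dial, so the attempt stands. Both are treated as retryable, as the PR overview describes. `handleAttemptFailure`, its return shape, and the stand-in error class are hypothetical; the real logic lives in reconnection-lifecycle.ts.

```typescript
// Hypothetical stand-in for the error class exported by @metamask/kernel-errors.
type LimitType = 'connection' | 'connectionRate' | 'messageSize' | 'messageRate';

class ResourceLimitError extends Error {
  readonly data: { limitType?: LimitType } | undefined;

  constructor(message: string, options?: { data?: { limitType?: LimitType } }) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.data = options?.data;
  }
}

type FailureHandling = {
  attempts: number; // retry counter after handling the failure
  next: 'retry' | 'check-if-network-error';
};

// `attempts` is assumed to have been incremented before the rate-limit check ran.
function handleAttemptFailure(error: unknown, attempts: number): FailureHandling {
  if (error instanceof ResourceLimitError) {
    switch (error.data?.limitType) {
      case 'connectionRate':
        // Thrown before dialing: undo the premature increment.
        return { attempts: attempts - 1, next: 'retry' };
      case 'connection':
        // Thrown after a successful dial: the attempt counts.
        return { attempts, next: 'retry' };
      default:
        break;
    }
  }
  // Anything else is left to the generic network-error classification.
  return { attempts, next: 'check-if-network-error' };
}

const rateLimited = handleAttemptFailure(
  new ResourceLimitError('too many dials', { data: { limitType: 'connectionRate' } }),
  3,
); // { attempts: 2, next: 'retry' }
const overCapacity = handleAttemptFailure(
  new ResourceLimitError('too many connections', { data: { limitType: 'connection' } }),
  3,
); // { attempts: 3, next: 'retry' }
const otherFailure = handleAttemptFailure(new Error('socket reset'), 3);
// { attempts: 3, next: 'check-if-network-error' }
```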
…Data types

Extract reusable types from ResourceLimitError to improve type safety
and reduce inline type casts when checking error data.

- Add ResourceLimitType union type for limit types
- Add ResourceLimitErrorData type for error data structure
- Export both types from kernel-errors package
- Use ResourceLimitErrorData in reconnection-lifecycle for cleaner code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The validation `maxEvents <= 0` passes for NaN and Infinity since both
comparisons return false. This silently disables rate limiting entirely
as `currentCount >= NaN` and `currentCount >= Infinity` are always false.

Fix by using Number.isFinite() which rejects NaN, Infinity, and -Infinity,
ensuring rate limiting cannot be bypassed via faulty configuration parsing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
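
The difference between the two checks can be seen directly. This is a small illustration of the comparison semantics described above, with hypothetical helper names, not the constructor code itself.

```typescript
// Naive check from the commit message: NaN and Infinity slip through.
const naiveIsInvalid = (maxEvents: number): boolean => maxEvents <= 0;

// Stricter check: rejects NaN, Infinity, -Infinity, zero, and negatives.
const strictIsInvalid = (maxEvents: number): boolean =>
  !Number.isFinite(maxEvents) || maxEvents <= 0;

const samples = [100, 0, -1, NaN, Infinity, -Infinity];

const naiveResults = samples.map(naiveIsInvalid);
// [false, true, true, false, false, true]

const strictResults = samples.map(strictIsInvalid);
// [false, true, true, true, true, true]

// With maxEvents = NaN, the limit comparison can never trip:
const everLimited = [0, 1, 1000000].some((count) => count >= NaN); // false
```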
…connection

Connection limit errors (ResourceLimitError with limitType: 'connection') were
falling through to isRetryableNetworkError(), which doesn't recognize them,
causing permanent reconnection failure via onRemoteGiveUp. Added explicit
handling to continue the retry loop for connection limit errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extracts the ResourceLimitError check pattern into a reusable type guard
function that optionally checks for a specific limitType. This simplifies
the error handling code in reconnection-lifecycle.ts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When checkConnectionLimit() throws during reconnection, the code closes
the channel before re-throwing. If closeChannel itself throws, that error
would propagate instead of the original ResourceLimitError, causing
reconnection to give up prematurely via onRemoteGiveUp.

Wrap closeChannel in try-catch to ensure the original error is always
re-thrown regardless of cleanup success.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
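
The cleanup pattern described above, sketched with assumed signatures. `checkConnectionLimit` and `closeChannel` are names from this PR, but their parameter types, the `Channel` shape, and the `enforceLimitAfterDial` wrapper are hypothetical.

```typescript
type Channel = { id: string }; // hypothetical shape

async function enforceLimitAfterDial(
  channel: Channel,
  checkConnectionLimit: () => void,
  closeChannel: (channel: Channel) => Promise<void>,
): Promise<void> {
  try {
    checkConnectionLimit();
  } catch (limitError) {
    try {
      // Best-effort cleanup of the already-dialed channel.
      await closeChannel(channel);
    } catch {
      // Ignore: a cleanup failure must not mask the original error.
    }
    throw limitError;
  }
}

// Both the limit check and the cleanup fail; the caller still sees the limit error.
const originalError = new Error('connection limit reached');
const outcome: Promise<string> = enforceLimitAfterDial(
  { id: 'peer-a' },
  () => {
    throw originalError;
  },
  async () => {
    throw new Error('close failed');
  },
).then(
  () => 'resolved',
  (error: unknown) => (error === originalError ? 'original error' : 'wrong error'),
);
```

Here `outcome` resolves to `'original error'`, so the retry loop can still classify the failure by its limit type.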
…ntCount

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The transport tests were failing because the mock for @metamask/kernel-errors
did not include isResourceLimitError, which was recently added. When the
reconnection lifecycle code tried to call this function, it failed with
undefined is not a function, causing the reconnection loop to crash.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract duplicated channel closing logic in the inbound connection handler
into a dedicated helper function to reduce duplication and ensure consistent
error handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make test assertions more precise by verifying exact call counts for
checkConnectionLimit and checkConnectionRateLimit. In a single successful
reconnection attempt, each should be called exactly once.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sirtimid sirtimid force-pushed the sirtimid/add-rate-limiting-v2 branch from ab1ab0f to f215020 Compare January 29, 2026 18:56
…ness

Add a flushPromises helper and use it in handleConnectionLoss tests that
trigger fire-and-forget async reconnection work. This ensures all pending
microtasks complete before assertions run, preventing potential flakiness
from async operations bleeding into subsequent tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
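
The PR's helper is not shown in this conversation. A common way to write one, given here as an assumption, is to wait for a zero-delay timer, which fires only after the pending microtask queue has drained. The fire-and-forget function below is a hypothetical stand-in for the reconnection work started by `handleConnectionLoss`.

```typescript
// Assumed implementation: resolve on the next timer tick, after pending microtasks.
const flushPromises = (): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(() => resolve(), 0);
  });

const eventLog: string[] = [];

// Hypothetical fire-and-forget work: the caller does not await it.
function startBackgroundReconnect(): void {
  void (async () => {
    await Promise.resolve();
    await Promise.resolve();
    eventLog.push('reconnect finished');
  })();
}

async function demo(): Promise<{ before: number; after: number }> {
  startBackgroundReconnect();
  const before = eventLog.length; // background work has not finished yet
  await flushPromises();
  const after = eventLog.length; // background work has completed
  return { before, after };
}
```

`demo()` resolves to `{ before: 0, after: 1 }`. Note that a timer-based helper like this depends on real timers; under mocked timers the `setTimeout` call would need to be advanced manually.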

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

logger.log(
  `${channel.peerId}:: rejecting inbound connection from intentionally closed peer`,
);
// Don't add to channels map and don't start reading - connection will naturally close

Channel leak when connection limit exceeded after dial

Medium Severity

In sendRemoteMessage, when checkConnectionLimit() at line 390 throws a ResourceLimitError after a successful dial, the dialed channel is not closed before the error is rethrown. The catch block at lines 393-401 checks for ResourceLimitError and rethrows it without closing the channel, causing a resource leak. The correct pattern is implemented in tryReconnect in reconnection-lifecycle.ts, which wraps the checkConnectionLimit() call in its own try-catch to close the channel before rethrowing.

Development

Successfully merging this pull request may close these issues.

Remote comms: Basic Rate Limiting

3 participants