-
Notifications
You must be signed in to change notification settings - Fork 6
feat(transport): add rate limiting for messages and connections #776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Coverage Report
File Coverage
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cf0fab5 to
44c20c4
Compare
44c20c4 to
840c5eb
Compare
859577d to
871dbd2
Compare
8b53542 to
020fc36
Compare
527dc83 to
a6d921b
Compare
grypez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice PR overall. I have one request from you: a question about reconnection-lifecycle.test.ts.
Comments:
I expect the test suite to flake where it uses real timers instead of fake ones. I won't gate approval on it because we already have flaky tests and having more of them will motivate us to treat them properly.
I left some comments on testing, pertinent not to this PR but to repo management.
| describe('isResourceLimitError', () => { | ||
| describe('without limitType parameter', () => { | ||
| it('returns true for ResourceLimitError', () => { | ||
| const error = new ResourceLimitError('limit exceeded'); | ||
| expect(isResourceLimitError(error)).toBe(true); | ||
| }); | ||
|
|
||
| it('returns true for ResourceLimitError with any limitType', () => { | ||
| const connectionError = new ResourceLimitError('connection limit', { | ||
| data: { limitType: 'connection' }, | ||
| }); | ||
| const rateError = new ResourceLimitError('rate limit', { | ||
| data: { limitType: 'connectionRate' }, | ||
| }); | ||
|
|
||
| expect(isResourceLimitError(connectionError)).toBe(true); | ||
| expect(isResourceLimitError(rateError)).toBe(true); | ||
| }); | ||
|
|
||
| it('returns false for regular Error', () => { | ||
| const error = new Error('some error'); | ||
| expect(isResourceLimitError(error)).toBe(false); | ||
| }); | ||
|
|
||
| it('returns false for null', () => { | ||
| expect(isResourceLimitError(null)).toBe(false); | ||
| }); | ||
|
|
||
| it('returns false for undefined', () => { | ||
| expect(isResourceLimitError(undefined)).toBe(false); | ||
| }); | ||
|
|
||
| it('returns false for non-error objects', () => { | ||
| expect(isResourceLimitError({ message: 'fake error' })).toBe(false); | ||
| }); | ||
| }); | ||
|
|
||
| describe('with limitType parameter', () => { | ||
| it('returns true when limitType matches', () => { | ||
| const error = new ResourceLimitError('connection limit', { | ||
| data: { limitType: 'connection' }, | ||
| }); | ||
| expect(isResourceLimitError(error, 'connection')).toBe(true); | ||
| }); | ||
|
|
||
| it('returns false when limitType does not match', () => { | ||
| const error = new ResourceLimitError('connection limit', { | ||
| data: { limitType: 'connection' }, | ||
| }); | ||
| expect(isResourceLimitError(error, 'connectionRate')).toBe(false); | ||
| }); | ||
|
|
||
| it('returns false when error has no limitType', () => { | ||
| const error = new ResourceLimitError('limit exceeded'); | ||
| expect(isResourceLimitError(error, 'connection')).toBe(false); | ||
| }); | ||
|
|
||
| it('returns false for non-ResourceLimitError even with matching-like data', () => { | ||
| const error = new Error('some error'); | ||
| expect(isResourceLimitError(error, 'connection')).toBe(false); | ||
| }); | ||
|
|
||
| it.each([ | ||
| 'connection', | ||
| 'connectionRate', | ||
| 'messageSize', | ||
| 'messageRate', | ||
| ] as const)('correctly identifies %s limitType', (limitType) => { | ||
| const error = new ResourceLimitError('limit exceeded', { | ||
| data: { limitType }, | ||
| }); | ||
| expect(isResourceLimitError(error, limitType)).toBe(true); | ||
| }); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG in this PR.
This whole suite tests
packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.test.ts
Show resolved
Hide resolved
Add sliding window rate limiting to protect against flooding attacks: - Message rate limiting: 100 messages/second per peer (configurable) - Connection attempt rate limiting: 10 attempts/minute per peer (configurable) Implementation: - Add SlidingWindowRateLimiter class with automatic pruning - Add maxMessagesPerSecond and maxConnectionAttemptsPerMinute options - Integrate rate checks in sendRemoteMessage before sending - Integrate connection rate checks before dialing new connections - Clean up rate limiter state when peers become stale or transport stops Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive test coverage for SlidingWindowRateLimiter: - Basic limit checking (wouldExceedLimit) - Event recording and pruning - checkAndRecord with error handling - getCurrentCount with window expiration - clearKey and clear methods - pruneStale for cleanup - Sliding window behavior with real timing Also test factory functions and constants. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 'messageRate' and 'connectionRate' to ResourceLimitError limitType - Update rate limiter to use correct limit type enum values - Update tests to match new limit types Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move message rate recording to after successful write instead of before send attempt. This prevents failed sends from consuming rate quota. - Add connection rate limiting to automatic reconnection attempts via checkConnectionRateLimit dependency in reconnection lifecycle. - Handle ResourceLimitError gracefully during reconnection by continuing the loop after backoff instead of giving up on the peer. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add remoteCommsOptions parameter to setupAliceAndBob helper - Configure higher maxMessagesPerSecond for queue limit test to ensure rate limiting doesn't interfere with queue limit testing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…rror - Call getCurrentCount once and reuse the value for both message and data - Use DEFAULT_MESSAGE_RATE_WINDOW_MS constant instead of hardcoded 1000 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move DEFAULT_MESSAGE_RATE_LIMIT, DEFAULT_MESSAGE_RATE_WINDOW_MS, DEFAULT_CONNECTION_RATE_LIMIT, and DEFAULT_CONNECTION_RATE_WINDOW_MS to constants.ts for consistency with other default constants. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Revert to using checkAndRecord() for message rate limiting instead of separate check and record calls. The separated approach had a TOCTOU race where concurrent sends could all pass the check before any recorded, bypassing the rate limit. Yes, failed sends now consume quota, but this is necessary for security - recording after send would allow attackers to make unlimited concurrent attempts that bypass the rate limit. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…annels Two bug fixes: 1. Rate-limited reconnection attempts no longer consume retry quota. Previously, incrementAttempt was called before the rate limit check, so rate-limited attempts counted against maxRetryAttempts even though no dial was performed. Now, decrementAttempt is called when rate limited to undo the premature increment. 2. Rejected inbound connections are now properly closed. When an inbound connection is rejected (due to intentional close or connection limit), the channel is now closed via closeChannel() to prevent resource leaks from dangling libp2p streams. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add validation in SlidingWindowRateLimiter constructor to reject non-positive values for maxEvents and windowMs, preventing misconfiguration that would cause unexpected behavior - Optimize checkAndRecord to use getCurrentCount instead of duplicating the Date.now() call and timestamp filtering logic - Refactor constructor validation tests to use it.each for conciseness Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The ResourceLimitError catch block incorrectly treated all such errors as rate limit errors where "no actual dial was performed." However, checkConnectionLimit() throws with limitType: 'connection' AFTER dial succeeds. This caused: 1. decrementAttempt incorrectly called (dial was performed, should count) 2. Log message incorrectly said "rate limited" 3. Dialed channel leaked since never closed or registered Fix: - Check error.data.limitType to distinguish 'connectionRate' (before dial) from 'connection' (after dial) - Only decrement attempt count for rate limit errors (connectionRate) - Add closeChannel dependency to close leaked channels - Wrap checkConnectionLimit in try/catch to close channel on error Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Data types Extract reusable types from ResourceLimitError to improve type safety and reduce inline type casts when checking error data. - Add ResourceLimitType union type for limit types - Add ResourceLimitErrorData type for error data structure - Export both types from kernel-errors package - Use ResourceLimitErrorData in reconnection-lifecycle for cleaner code Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The validation `maxEvents <= 0` passes for NaN and Infinity since both comparisons return false. This silently disables rate limiting entirely as `currentCount >= NaN` and `currentCount >= Infinity` are always false. Fix by using Number.isFinite() which rejects NaN, Infinity, and -Infinity, ensuring rate limiting cannot be bypassed via faulty configuration parsing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…connection Connection limit errors (ResourceLimitError with limitType: 'connection') were falling through to isRetryableNetworkError(), which doesn't recognize them, causing permanent reconnection failure via onRemoteGiveUp. Added explicit handling to continue the retry loop for connection limit errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extracts the ResourceLimitError check pattern into a reusable type guard function that optionally checks for a specific limitType. This simplifies the error handling code in reconnection-lifecycle.ts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When checkConnectionLimit() throws during reconnection, the code closes the channel before re-throwing. If closeChannel itself throws, that error would propagate instead of the original ResourceLimitError, causing reconnection to give up prematurely via onRemoteGiveUp. Wrap closeChannel in try-catch to ensure the original error is always re-thrown regardless of cleanup success. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ntCount Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The transport tests were failing because the mock for @metamask/kernel-errors did not include isResourceLimitError, which was recently added. When the reconnection lifecycle code tried to call this function, it failed with undefined is not a function, causing the reconnection loop to crash. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract duplicated channel closing logic in the inbound connection handler into a dedicated helper function to reduce duplication and ensure consistent error handling. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make test assertions more precise by verifying exact call counts for checkConnectionLimit and checkConnectionRateLimit. In a single successful reconnection attempt, each should be called exactly once. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ab1ab0f to
f215020
Compare
…ness Add a flushPromises helper and use it in handleConnectionLoss tests that trigger fire-and-forget async reconnection work. This ensures all pending microtasks complete before assertions run, preventing potential flakiness from async operations bleeding into subsequent tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| logger.log( | ||
| `${channel.peerId}:: rejecting inbound connection from intentionally closed peer`, | ||
| ); | ||
| // Don't add to channels map and don't start reading - connection will naturally close |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Channel leak when connection limit exceeded after dial
Medium Severity
In sendRemoteMessage, when checkConnectionLimit() at line 390 throws a ResourceLimitError after a successful dial, the dialed channel is not closed before the error is rethrown. The catch block at lines 393-401 checks for ResourceLimitError and rethrows it without closing the channel, causing a resource leak. The correct pattern is implemented in tryReconnect in reconnection-lifecycle.ts, which wraps the checkConnectionLimit() call in its own try-catch to close the channel before rethrowing.
Closes #661
Summary
maxMessagesPerSecond)maxConnectionAttemptsPerMinute)Implementation Details
SlidingWindowRateLimiterclass with automatic pruning of old eventssendRemoteMessagebefore sendingResourceLimitErrorwhen limits are exceededTest plan
🤖 Generated with Claude Code
Note
Medium Risk
Introduces new rate-limiting behavior in the remote transport and reconnection loop, which can change message delivery/retry patterns and cause new
ResourceLimitErrorfailures under load or misconfiguration.Overview
Adds per-peer sliding-window rate limiting to remote comms: outbound sends now enforce
maxMessagesPerSecond, and new connection dials/reconnects enforcemaxConnectionAttemptsPerMinute, both throwingResourceLimitErrorwhen exceeded.Extends
ResourceLimitErrorto include newlimitTypevalues (messageRate,connectionRate), exports related types, and addsisResourceLimitErrorfor limit-type-aware handling.Updates reconnection logic to treat rate-limit and connection-limit
ResourceLimitErrors as retryable (with attempt-count adjustment for rate-limit pre-dial failures) and tightens channel cleanup by explicitly closing rejected inbound channels and channels opened before a post-dial connection-limit failure. Tests and E2E helpers are adjusted accordingly.Written by Cursor Bugbot for commit 6be75fa. This will update automatically on new commits. Configure here.