fix: clear stale routingInfo on restart to prevent slow/unstable connections #220
Conversation
…sage delivery
Root cause: inbound messages were held in two back-to-back buffer phases before
being emitted to listeners, producing 35-60 s end-to-end delays:
Phase 1 – socket.ts offline buffer
ev.buffer() is called on every (re)connection and flushed only when the server
sends CB:ib,,offline (all offline notifications delivered). On busy accounts
the server can take 10-30+ s to drain the offline queue, holding every
buffered event — including fresh live messages — hostage for that duration.
The existing event-buffer auto-flush was the only safety net (default 15 s).
Phase 2 – chats.ts AwaitingInitialSync (first connection only)
A second ev.buffer() is started after receivedPendingNotifications fires and
held for up to 20 s while waiting for a history-sync notification + doAppStateSync.
Stacking these two phases produced the observed 35-60 s total delay.
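The stacking effect can be modeled with a tiny sketch. This is a hypothetical minimal buffer (the names `TwoPhaseBuffer`, `depth`, and `pending` are illustrative, not the actual Baileys event-buffer code), showing that a message captured in phase 1 is only released when the last active phase also flushes:

```typescript
// Hypothetical model: each buffer phase opens a nesting level; events are only
// released to the listener when the outermost level finally flushes.
type Listener = (ev: string) => void

class TwoPhaseBuffer {
    private depth = 0
    private pending: string[] = []

    constructor(private readonly listener: Listener) {}

    buffer(): void {
        this.depth++ // each phase opens another nesting level
    }

    emit(ev: string): void {
        if (this.depth > 0) {
            this.pending.push(ev) // held while any phase is active
        } else {
            this.listener(ev)
        }
    }

    flush(): void {
        if (this.depth > 0 && --this.depth === 0) {
            for (const ev of this.pending) this.listener(ev)
            this.pending = []
        }
    }
}

const delivered: string[] = []
const ev = new TwoPhaseBuffer(e => delivered.push(e))
ev.buffer()             // phase 1: offline buffer armed on (re)connect
ev.emit('live-message') // fresh live message arrives and is held
ev.buffer()             // phase 2: AwaitingInitialSync buffer
ev.flush()              // phase 1 ends (CB:ib,,offline); message is STILL held
ev.flush()              // phase 2 ends; message finally reaches the listener
```

With 10-30 s spent in phase 1 and up to 20 s in phase 2, the held message only reaches listeners after both phases end, which matches the observed 35-60 s worst case.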
Fixes (surgical, no behaviour change on the send path):
1. src/Utils/event-buffer.ts
• Default bufferTimeoutMs 15 000 → 5 000 ms (BAILEYS_BUFFER_TIMEOUT_MS)
• Default minBufferTimeoutMs 3 000 → 1 000 ms (BAILEYS_BUFFER_MIN_TIMEOUT_MS)
• Default maxBufferTimeoutMs 20 000 → 8 000 ms (BAILEYS_BUFFER_MAX_TIMEOUT_MS)
All three remain fully overridable via environment variables.
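A sketch of how the lowered defaults stay overridable (the helper name `envMs` is assumed for illustration): the environment value wins whenever it parses to a positive number, otherwise the new, shorter default applies.

```typescript
// Assumed helper: read a millisecond timeout from the environment, falling back
// to the (newly lowered) default when the variable is unset or invalid.
const envMs = (name: string, fallback: number): number => {
    const raw = process.env[name]
    const parsed = raw === undefined ? NaN : Number(raw)
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback
}

const bufferTimeoutMs = envMs('BAILEYS_BUFFER_TIMEOUT_MS', 5_000)        // was 15 000
const minBufferTimeoutMs = envMs('BAILEYS_BUFFER_MIN_TIMEOUT_MS', 1_000) // was 3 000
const maxBufferTimeoutMs = envMs('BAILEYS_BUFFER_MAX_TIMEOUT_MS', 8_000) // was 20 000
```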
2. src/Socket/socket.ts
• Added OFFLINE_BUFFER_TIMEOUT_MS safety timer (default 5 s, env-configurable).
If CB:ib,,offline does not arrive within 5 s the buffer is force-flushed so
live messages are never delayed beyond that cap.
• CB:ib,,offline handler clears the safety timer on the happy path and marks
didStartBuffer = false to avoid a double-flush.
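A sketch of the safety-timer pattern (function names `startBuffer`/`onOffline` are assumed; this is not the literal socket.ts code): the timer force-flushes if CB:ib,,offline never arrives, and the `didStartBuffer` flag guarantees the buffer is flushed at most once.

```typescript
// Safety timer: cap the offline-buffer phase at 5 s, flush exactly once.
const OFFLINE_BUFFER_TIMEOUT_MS = 5_000

let didStartBuffer = false
let offlineBufferTimeout: ReturnType<typeof setTimeout> | undefined

function startBuffer(flush: () => void): void {
    didStartBuffer = true
    offlineBufferTimeout = setTimeout(() => {
        offlineBufferTimeout = undefined
        if (didStartBuffer) {
            didStartBuffer = false // the happy path must not flush again later
            flush()
        }
    }, OFFLINE_BUFFER_TIMEOUT_MS)
}

// CB:ib,,offline happy path: cancel the timer, flush exactly once.
function onOffline(flush: () => void): void {
    if (offlineBufferTimeout) {
        clearTimeout(offlineBufferTimeout)
        offlineBufferTimeout = undefined
    }
    if (didStartBuffer) {
        didStartBuffer = false // a racing timer callback now does nothing
        flush()
    }
}
```

Clearing the flag before calling `flush()` is what makes the two paths mutually exclusive regardless of which one wins the race.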
3. src/Socket/chats.ts
• AwaitingInitialSync fallback timeout 20 000 → 8 000 ms.
History that arrives late is still processed via processMessage regardless
of the state-machine phase (existing behaviour, unchanged).
Worst-case delivery latency after this change:
Reconnection (accountSyncCounter > 0): ≤ 5 s (was ≤ 15 s)
First connection with history sync : ≤ 5 s + 8 s = 13 s (was 35-60 s)
No changes to: send path, button/list/carousel, tcTokenFetchingJids,
forceSnapshotCollections, LID/PN mapping, or app-state resilience.
https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
…ndependent of env

The offline-buffer safety timer in socket.ts (which caps how long the CB:ib,,offline phase can block live message delivery) must remain short regardless of what operators set for BAILEYS_BUFFER_TIMEOUT_MS. Operators often set BAILEYS_BUFFER_TIMEOUT_MS=30000 (30 s) for better Prometheus/history batching. Reading that env var for the offline timer would have kept the safety net at 30 s, defeating the fix entirely.

The offline-phase timer is now a hardcoded 5_000 ms constant with an explicit comment explaining why it must not inherit the general buffer timeout. All other behaviour is unchanged.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
…post-close flush

If the socket closes (auth failure, network drop) before CB:ib,,offline arrives, the 5 s safety timer was still running. After 5 s its callback would find didStartBuffer=true and call ev.flush() on an already-closed session, risking stale/partial events being emitted and reprocessed on the next reconnect.

Fix: clear offlineBufferTimeout and reset didStartBuffer=false inside end(), immediately after the existing clearInterval/clearTimeout block, mirroring how awaitingSyncTimeout is cleaned up in chats.ts on connection close.

Addresses review comments from Codex (P2) and Copilot on PR #217.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
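A sketch of the end() cleanup described above (the shape is assumed, not the exact socket.ts code): cancel the pending safety timer and reset the flag so nothing can flush an already-closed session.

```typescript
// State as it would look when the socket closes mid-phase: the buffer was
// started and the 5 s safety timer is still pending.
let didStartBuffer = true
let offlineBufferTimeout: ReturnType<typeof setTimeout> | undefined =
    setTimeout(() => { /* would have flushed a closed session */ }, 5_000)

function end(): void {
    // ...the existing clearInterval/clearTimeout cleanup block runs here first...
    if (offlineBufferTimeout) {
        clearTimeout(offlineBufferTimeout) // timer can no longer fire post-close
        offlineBufferTimeout = undefined
    }
    didStartBuffer = false // a callback that already raced in now flushes nothing
}

end() // socket closed before CB:ib,,offline: no post-close flush is possible
```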
Covers the three interaction points of the 5 s safety timer introduced in socket.ts to cap the offline-buffer phase:
- startBuffer() — arms the timer on reconnection
- onOffline() — CB:ib,,offline happy path: cancels timer, flushes once
- onClose() — end() path: cancels timer, resets flag, no post-close flush

Test cases (15 total):
- Timer fires after exactly 5 s and calls flush + warn
- Timer sets offlineBufferTimeout=undefined and didStartBuffer=false
- No flush if didStartBuffer was already false when timer fires
- CB:ib,,offline cancels timer → only one flush regardless of timing
- CB:ib,,offline is idempotent (spurious second call = no extra flush)
- end() cancels timer → advancing past 5 s triggers no flush
- end() is a no-op when called before startBuffer or after onOffline
- Boundary checks: no flush at 4 999 ms, flushes at exactly 5 000 ms

Follows the same standalone-function pattern used in bad-ack-handling.test.ts to test socket closures without instantiating makeSocket.

Addresses Copilot review comment on PR #217.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
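The boundary check can be sketched standalone. This version uses a hand-rolled fake clock instead of Jest's `jest.useFakeTimers()` so it runs without a test framework; `FakeClock` is illustrative, not part of the actual suite.

```typescript
// Minimal fake clock: registered callbacks fire when advance() moves the
// simulated time at or past their deadline.
class FakeClock {
    private now = 0
    private timers: { at: number; fn: () => void; fired: boolean }[] = []

    setTimeout(fn: () => void, ms: number): void {
        this.timers.push({ at: this.now + ms, fn, fired: false })
    }

    advance(ms: number): void {
        this.now += ms
        for (const t of this.timers) {
            if (!t.fired && t.at <= this.now) {
                t.fired = true
                t.fn()
            }
        }
    }
}

let flushes = 0
const clock = new FakeClock()
clock.setTimeout(() => flushes++, 5_000) // the safety timer under test
clock.advance(4_999) // one millisecond early: flushes is still 0
clock.advance(1)     // exactly 5 000 ms: the timer fires, flushes becomes 1
```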
CI runners do not have SSH keys configured, so yarn was failing with "Permission denied (publickey)" when resolving:
git+ssh://git@github.com/whiskeysockets/libsignal-node.git

Changed to HTTPS, which works without any SSH key setup:
git+https://github.com/whiskeysockets/libsignal-node.git

The commit hash and package.json entry are unchanged.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
Commit 0bccee8 accidentally replaced the Yarn 4 Berry lock file (format v8, with __metadata, resolution:, checksum:) with a Yarn 1 Classic lock file (# yarn lockfile v1, resolved:, integrity:).

The CI runs `yarn install --immutable` with Yarn 4 (corepack yarn@4.x). When Yarn 4 encounters a Yarn 1-format lock file it needs to migrate/regenerate it, which --immutable forbids → build failure. Restoring the original Yarn 4 format from before the bad commit.

Note: the original lock file already used HTTPS for libsignal-node:
resolution: "libsignal@https://github.com/whiskeysockets/libsignal-node.git#commit=..."
So no further SSH→HTTPS fix is needed.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
…lowness

After a code update deployed via pm2 restart, active WhatsApp connections often remain slow even with the new code. The root cause is routingInfo stored in creds.json: it directs the socket to reconnect to the same WhatsApp edge server, which may retain stale server-side state (throttling, bad session state) from the previous version. A QR re-scan fixes it because it creates a new session on a fresh edge server.

This option discards routingInfo before the WebSocket URL is constructed, forcing WhatsApp to assign a fresh edge server — equivalent to the clean state after a QR re-scan, but without invalidating Signal keys or auth credentials (no re-scan needed). The cleared state is immediately persisted via creds.update so that subsequent restarts before the server assigns new routingInfo also benefit.

Usage in zpro-backend: set clearRoutingInfoOnStart: true on the first startSock() call after a deployment, then false on reconnections.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
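The clear-and-persist step can be sketched as follows (the creds shape and the helper name `clearRoutingInfo` are assumptions for illustration, not the actual socket.ts code):

```typescript
// Assumed minimal creds shape: only the field this sketch touches.
type Creds = { routingInfo?: Uint8Array }

// Clear any stored routingInfo before the WebSocket URL is built, and persist
// the cleared state so a quick restart cannot resurrect the stale value.
function clearRoutingInfo(
    creds: Creds,
    emitCredsUpdate: (update: Partial<Creds>) => void
): boolean {
    const hadStaleRoutingInfo = creds.routingInfo !== undefined
    if (hadStaleRoutingInfo) {
        delete creds.routingInfo // the server will assign a fresh edge server
        // persist via creds.update so the consumer's saveCreds() writes the
        // clean state to disk before the next restart
        emitCredsUpdate({ routingInfo: undefined })
    }
    return hadStaleRoutingInfo
}
```

Returning whether anything was actually cleared keeps the creds.update emit off the path where routingInfo was already absent, so consumers do not get spurious saveCreds() writes.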
Make clearRoutingInfoOnStart: true the default so every restart (pm2, server reboot, deploy) automatically gets a fresh edge server assignment without any configuration change in the consumer.

The old routingInfo becomes stale after any restart anyway — the WA server always issues a new one during the handshake. Keeping the stale value forces reconnection to a potentially overloaded or broken edge server, causing slow or unstable sessions. With this default, consumers that explicitly pass clearRoutingInfoOnStart: false can still opt out.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
…red stale routingInfo

When a channel is reconnected after disconnect (same zpro channel, new QR scan), the auth state still carries creds.me?.id from the previous session. This caused the offline buffer to activate and hold all incoming messages for up to 5 seconds while waiting for CB:ib,,offline (which may arrive late on accounts with large backlogs).

Fix: track whether clearRoutingInfoOnStart actually cleared a stale routingInfo. If it did, this is clearly a reconnect-after-disconnect scenario, not a cold start that needs event batching. In this case, skip the offline buffer entirely so live messages are delivered immediately instead of being held for up to 5 s. Normal cold restarts (routingInfo already absent) are unaffected — they still use the 5 s safety cap as before.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
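The decision can be sketched as a small predicate (the function and option names are assumed for illustration): skip the offline buffer when a stale routingInfo was just cleared, keep the 5 s safety cap for cold restarts, and never buffer for a brand-new session.

```typescript
// Assumed predicate: decide whether to arm the offline buffer on connect.
function shouldStartOfflineBuffer(opts: {
    hasPriorSession: boolean        // e.g. creds.me?.id survives from a previous session
    clearedStaleRoutingInfo: boolean
}): boolean {
    if (!opts.hasPriorSession) return false        // fresh QR login: no backlog expected
    if (opts.clearedStaleRoutingInfo) return false // reconnect: deliver live messages now
    return true                                    // cold restart: keep the 5 s safety cap
}
```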
Pull request overview
Adds connection-start mitigations aimed at improving reconnect reliability/latency by forcing fresh edge routing and capping how long “offline backlog” buffering can block live events.
Changes:
- Introduces a `clearRoutingInfoOnStart` SocketConfig option and clears persisted `creds.routingInfo` before connecting (plus persists the cleared state).
- Adds an offline-buffer safety timer in `socket.ts` and ensures it's cleared on socket end; adds focused Jest coverage for this timer logic.
- Reduces several default buffering/initial-sync wait timeouts to flush earlier under stall conditions.
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/tests/Socket/offline-buffer-timeout.test.ts | Adds unit tests mirroring the new offline-buffer safety timer behavior. |
| src/Utils/event-buffer.ts | Lowers default BAILEYS_BUFFER_* timeouts to flush buffered events sooner. |
| src/Types/Socket.ts | Adds clearRoutingInfoOnStart option with JSDoc guidance. |
| src/Socket/socket.ts | Clears stored routingInfo on start, persists creds update, adds offline-buffer safety timer + end() cleanup. |
| src/Socket/chats.ts | Reduces AwaitingInitialSync timeout from 20s to 8s and updates logs/comments. |
| src/Defaults/index.ts | Enables clearRoutingInfoOnStart by default in DEFAULT_CONNECTION_CONFIG. |
Comments suppressed due to low confidence (1)
src/Socket/socket.ts:521
- This `creds.update` emit will fire on every socket creation whenever `clearRoutingInfoOnStart` is true and `routingInfo` is already undefined (i.e., even when nothing was cleared). That can trigger unnecessary consumer `saveCreds()` writes/side effects. Emit only when you actually modified `routingInfo` (e.g., gate on `hadStaleRoutingInfo`), and prefer emitting a minimal update payload instead of the entire creds object.
const ev = makeEventBuffer(logger)
// Persist the routingInfo clearing so the consumer's saveCreds() writes the clean state to disk.
// This ensures that if the process restarts again before the server assigns new routingInfo,
// the stale value is not reused.
- Add missing space after 'if'/'else if' keywords in CB:stream:error handler
- Reformat long logger.warn/info lines to stay within line length limit

Fixes CI linting failures introduced by recent commits.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7
…tion

The merge of master into this branch (a2bd33b) left two merge artifacts:
1. A duplicated `if (config.clearRoutingInfoOnStart...)` block nested inside the first, missing its closing brace — this caused TS1005 '}' expected at the end of the file.
2. A duplicate `const OFFLINE_BUFFER_TIMEOUT_MS` declaration (one from each branch) which would cause a duplicate identifier error.

Both are removed, leaving the correct single implementation.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7