
fix: clear stale routingInfo on restart to prevent slow/unstable connections#220

Merged
rsalcara merged 13 commits into master from claude/analyze-whatsapp-log-gcJdX
Feb 25, 2026

Conversation

@rsalcara
Owner

fix: clear stale routingInfo on restart to prevent slow/unstable connections

…sage delivery

Root cause: inbound messages were held in two back-to-back buffer phases before
being emitted to listeners, producing 35-60 s end-to-end delays:

  Phase 1 – socket.ts offline buffer
    ev.buffer() is called on every (re)connection and flushed only when the server
    sends CB:ib,,offline (all offline notifications delivered). On busy accounts
    the server can take 10-30+ s to drain the offline queue, holding every
    buffered event — including fresh live messages — hostage for that duration.
    The existing event-buffer auto-flush was the only safety net (default 15 s).

  Phase 2 – chats.ts AwaitingInitialSync (first connection only)
    A second ev.buffer() is started after receivedPendingNotifications fires and
    held for up to 20 s while waiting for a history-sync notification + doAppStateSync.
    Stacking these two phases produced the observed 35-60 s total delay.

Fixes (surgical, no behaviour change on the send path):

1. src/Utils/event-buffer.ts
   • Default bufferTimeoutMs  15 000 → 5 000 ms  (BAILEYS_BUFFER_TIMEOUT_MS)
   • Default minBufferTimeoutMs 3 000 → 1 000 ms  (BAILEYS_BUFFER_MIN_TIMEOUT_MS)
   • Default maxBufferTimeoutMs 20 000 → 8 000 ms  (BAILEYS_BUFFER_MAX_TIMEOUT_MS)
   All three remain fully overridable via environment variables.

2. src/Socket/socket.ts
   • Added OFFLINE_BUFFER_TIMEOUT_MS safety timer (default 5 s, env-configurable).
     If CB:ib,,offline does not arrive within 5 s the buffer is force-flushed so
     live messages are never delayed beyond that cap.
   • CB:ib,,offline handler clears the safety timer on the happy path and marks
     didStartBuffer = false to avoid a double-flush.

3. src/Socket/chats.ts
   • AwaitingInitialSync fallback timeout 20 000 → 8 000 ms.
     History that arrives late is still processed via processMessage regardless
     of the state-machine phase (existing behaviour, unchanged).
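
The env-overridable defaults in item 1 could be resolved along these lines. This is a minimal sketch, not the code in event-buffer.ts: `readTimeoutMs`, `resolveBufferTimeouts`, and the `Env` type are hypothetical names, and the rule that unset, empty, or non-positive values fall back to the default is an assumption. Only the three variable names and the new default values come from the list above.

```typescript
// Hypothetical helper types and names; only the env variable names and the
// default values are taken from the PR description.
type Env = Record<string, string | undefined>

// Parse a millisecond timeout, falling back to the default when the variable
// is unset, empty, non-numeric, or not positive (assumed validation rule).
function readTimeoutMs(env: Env, name: string, fallbackMs: number): number {
	const raw = env[name]
	if (raw === undefined || raw.trim() === '') return fallbackMs
	const parsed = Number(raw)
	return isFinite(parsed) && parsed > 0 ? parsed : fallbackMs
}

interface BufferTimeouts {
	bufferTimeoutMs: number
	minBufferTimeoutMs: number
	maxBufferTimeoutMs: number
}

function resolveBufferTimeouts(env: Env): BufferTimeouts {
	return {
		bufferTimeoutMs: readTimeoutMs(env, 'BAILEYS_BUFFER_TIMEOUT_MS', 5_000),
		minBufferTimeoutMs: readTimeoutMs(env, 'BAILEYS_BUFFER_MIN_TIMEOUT_MS', 1_000),
		maxBufferTimeoutMs: readTimeoutMs(env, 'BAILEYS_BUFFER_MAX_TIMEOUT_MS', 8_000)
	}
}
```

In a Node.js process the caller would pass `process.env`; taking the map as a parameter keeps the sketch testable without touching the real environment.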

Worst-case delivery latency after this change:
  Reconnection  (accountSyncCounter > 0): ≤ 5 s  (was ≤ 15 s)
  First connection with history sync    : ≤ 5 s + 8 s = 13 s  (was 35-60 s)

No changes to: send path, button/list/carousel, tcTokenFetchingJids,
forceSnapshotCollections, LID/PN mapping, or app-state resilience.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

…ndependent of env

The offline-buffer safety timer in socket.ts (which caps how long the
CB:ib,,offline phase can block live message delivery) must remain short
regardless of what operators set for BAILEYS_BUFFER_TIMEOUT_MS.

Operators often set BAILEYS_BUFFER_TIMEOUT_MS=30000 (30 s) for better
Prometheus/history batching. Reading that env var for the offline timer
would have kept the safety net at 30 s, defeating the fix entirely.

The offline-phase timer is now a hardcoded 5_000 ms constant with an
explicit comment explaining why it must not inherit the general buffer
timeout. All other behaviour is unchanged.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

…post-close flush

If the socket closed (auth failure, network drop) before CB:ib,,offline arrived,
the 5 s safety timer was still running. After 5 s its callback would find
didStartBuffer=true and call ev.flush() on an already-closed session, risking
stale/partial events being emitted and reprocessed on the next reconnect.

Fix: clear offlineBufferTimeout and reset didStartBuffer=false inside end(),
immediately after the existing clearInterval/clearTimeout block, mirroring how
awaitingSyncTimeout is cleaned up in chats.ts on connection close.

Addresses review comments from Codex (P2) and Copilot on PR #217.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

Covers the three interaction points of the 5 s safety timer introduced
in socket.ts to cap the offline-buffer phase:

  startBuffer() — arms the timer on reconnection
  onOffline()   — CB:ib,,offline happy path: cancels timer, flushes once
  onClose()     — end() path: cancels timer, resets flag, no post-close flush

Test cases (15 total):
  - Timer fires after exactly 5 s and calls flush + warn
  - Timer sets offlineBufferTimeout=undefined and didStartBuffer=false
  - No flush if didStartBuffer was already false when timer fires
  - CB:ib,,offline cancels timer → only one flush regardless of timing
  - CB:ib,,offline is idempotent (spurious second call = no extra flush)
  - end() cancels timer → advancing past 5 s triggers no flush
  - end() is a no-op when called before startBuffer or after onOffline
  - Boundary checks: no flush at 4 999 ms, flushes at exactly 5 000 ms

Follows the same standalone-function pattern used in bad-ack-handling.test.ts
to test socket closures without instantiating makeSocket.
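
The three interaction points and the boundary checks listed above can be sketched as a standalone state holder. This is an illustration under assumptions, not the code in socket.ts or the test file: `makeOfflineBufferGuard`, the `Scheduler` interface, and `FakeClock` are hypothetical, and the hand-rolled clock stands in for Jest fake timers so the block runs on its own. The names `didStartBuffer`, `offlineBufferTimeout`, and the 5 000 ms cap are taken from the commit messages.

```typescript
// Hypothetical scheduler abstraction so a fake clock can drive the timer.
interface Scheduler {
	setTimeout(fn: () => void, ms: number): number
	clearTimeout(id: number): void
}

const OFFLINE_BUFFER_TIMEOUT_MS = 5_000 // fixed cap, intentionally not read from env

function makeOfflineBufferGuard(sched: Scheduler, flush: () => void, warn: (msg: string) => void) {
	let didStartBuffer = false
	let offlineBufferTimeout: number | undefined

	const cancelTimer = () => {
		if (offlineBufferTimeout !== undefined) {
			sched.clearTimeout(offlineBufferTimeout)
			offlineBufferTimeout = undefined
		}
	}

	return {
		// Arms the safety timer when buffering starts on (re)connection.
		startBuffer() {
			cancelTimer()
			didStartBuffer = true
			offlineBufferTimeout = sched.setTimeout(() => {
				offlineBufferTimeout = undefined
				if (!didStartBuffer) return // nothing left to flush
				didStartBuffer = false
				warn('offline buffer timed out, force-flushing')
				flush()
			}, OFFLINE_BUFFER_TIMEOUT_MS)
		},
		// CB:ib,,offline happy path: cancel the timer and flush at most once.
		onOffline() {
			cancelTimer()
			if (!didStartBuffer) return
			didStartBuffer = false
			flush()
		},
		// end() path: cancel the timer and reset the flag without flushing.
		onClose() {
			cancelTimer()
			didStartBuffer = false
		}
	}
}

// Hand-rolled fake clock, a stand-in for Jest fake timers.
class FakeClock implements Scheduler {
	private now = 0
	private nextId = 1
	private timers: { id: number; at: number; fn: () => void }[] = []

	setTimeout(fn: () => void, ms: number): number {
		const id = this.nextId++
		this.timers.push({ id, at: this.now + ms, fn })
		return id
	}

	clearTimeout(id: number): void {
		this.timers = this.timers.filter(t => t.id !== id)
	}

	// Move time forward and run every timer that has become due.
	advance(ms: number): void {
		this.now += ms
		const due = this.timers.filter(t => t.at <= this.now)
		this.timers = this.timers.filter(t => t.at > this.now)
		due.forEach(t => t.fn())
	}
}
```

With this shape, a test arms the guard with `startBuffer()`, advances the clock to 4 999 ms and then 5 000 ms, and counts calls to `flush`; calling `onClose()` first should leave the count unchanged no matter how far the clock advances.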

Addresses Copilot review comment on PR #217.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

CI runners do not have SSH keys configured, so yarn was failing with
"Permission denied (publickey)" when resolving:
  git+ssh://git@github.com/whiskeysockets/libsignal-node.git

Changed to HTTPS which works without any SSH key setup:
  git+https://github.com/whiskeysockets/libsignal-node.git

The commit hash and package.json entry are unchanged.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

Commit 0bccee8 accidentally replaced the Yarn 4 Berry lock file (format v8,
with __metadata, resolution:, checksum:) with a Yarn 1 Classic lock file
(# yarn lockfile v1, resolved:, integrity:).

The CI runs `yarn install --immutable` with Yarn 4 (corepack yarn@4.x). When
Yarn 4 encounters a Yarn 1-format lock file it needs to migrate/regenerate it,
which --immutable forbids → build failure.

Restoring the original Yarn 4 format from before the bad commit.
Note: the original lock file already used HTTPS for libsignal-node:
  resolution: "libsignal@https://github.com/whiskeysockets/libsignal-node.git#commit=..."
So no further SSH→HTTPS fix is needed.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

…lowness

After a code update deployed via pm2 restart, active WhatsApp connections
often remain slow even with the new code. The root cause is routingInfo
stored in creds.json: it directs the socket to reconnect to the same
WhatsApp edge server, which may retain stale server-side state (throttling,
bad session state) from the previous version. A QR re-scan fixes it because
it creates a new session on a fresh edge server.

This option discards routingInfo before the WebSocket URL is constructed,
forcing WhatsApp to assign a fresh edge server — equivalent to the clean
state after a QR re-scan, but without invalidating Signal keys or auth
credentials (no re-scan needed).

The cleared state is immediately persisted via creds.update so that
subsequent restarts before the server assigns new routingInfo also benefit.
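
The clearing step described above might look roughly like the following. This is a sketch under assumptions: `CredsLike` is a simplified stand-in for the real credentials type, `clearStaleRoutingInfo` is a hypothetical name, and persistence via creds.update and the WebSocket URL construction are deliberately left out.

```typescript
// Simplified stand-in for the auth credentials; only the fields used here.
interface CredsLike {
	routingInfo?: Uint8Array
	me?: { id: string }
}

// Discards a stored routingInfo when the option is enabled.
// Returns true only if a value was actually present and removed, so the
// caller can tell a real clear apart from a no-op.
function clearStaleRoutingInfo(creds: CredsLike, clearRoutingInfoOnStart: boolean): boolean {
	if (!clearRoutingInfoOnStart || creds.routingInfo === undefined) {
		return false
	}

	creds.routingInfo = undefined
	return true
}
```

The function must run before the connection URL is built, since the point of the option is that the next connection carries no routing hint. Signal keys and the other credential fields are not touched, which is why no re-scan is needed.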

Usage in zpro-backend: set clearRoutingInfoOnStart: true on the first
startSock() call after a deployment, then false on reconnections.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

Make clearRoutingInfoOnStart: true the default so every restart
(pm2, server reboot, deploy) automatically gets a fresh edge server
assignment without any configuration change in the consumer.

The old routingInfo becomes stale after any restart anyway — the WA
server always issues a new one during the handshake. Keeping the stale
value forces reconnection to a potentially overloaded or broken edge
server, causing slow or unstable sessions.

With this default, consumers that explicitly pass
clearRoutingInfoOnStart: false can still opt out.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

…red stale routingInfo

When a channel is reconnected after disconnect (same zpro channel, new QR scan),
the auth state still carries creds.me?.id from the previous session. This caused
the offline buffer to activate and hold all incoming messages for up to 5 seconds
while waiting for CB:ib,,offline (which may arrive late on accounts with large backlogs).

Fix: track whether clearRoutingInfoOnStart actually cleared a stale routingInfo.
If it did, this is clearly a reconnect-after-disconnect scenario, not a cold start
that needs event batching. In this case, skip the offline buffer entirely so live
messages are delivered immediately instead of being held for up to 5 s.

Normal cold restarts (routingInfo already absent) are unaffected — they still use
the 5 s safety cap as before.
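
The decision described in this commit can be expressed as a small predicate. This is a sketch, not the socket.ts code: `shouldStartOfflineBuffer` is a hypothetical name, and the assumption that the buffer is only considered when creds.me?.id is present is inferred from the first paragraph of the commit message.

```typescript
// Hypothetical predicate. hadStaleRoutingInfo is assumed to be true only when
// clearRoutingInfoOnStart actually removed a stored routingInfo.
function shouldStartOfflineBuffer(creds: { me?: { id?: string } }, hadStaleRoutingInfo: boolean): boolean {
	// Without a previous identity the offline buffer is not considered here.
	const hasPreviousIdentity = Boolean(creds.me?.id)
	if (!hasPreviousIdentity) return false

	// A cleared routingInfo marks a reconnect-after-disconnect: skip buffering
	// so live messages are emitted immediately.
	return !hadStaleRoutingInfo
}
```

A cold restart with routingInfo already absent yields `hadStaleRoutingInfo === false`, so the buffer and its 5 s cap still apply in that case.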

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

Copilot AI review requested due to automatic review settings February 25, 2026 04:32
@github-actions

github-actions bot commented Feb 25, 2026

Thanks for opening this pull request and contributing to the project!

The next step is for the maintainers to review your changes. If everything looks good, it will be approved and merged into the main branch.

In the meantime, anyone in the community is encouraged to test this pull request and provide feedback.

✅ How to confirm it works

If you’ve tested this PR, please comment below with:

Tested and working ✅

This helps us speed up the review and merge process.

📦 To test this PR locally:

# NPM
npm install @whiskeysockets/baileys@rsalcara/InfiniteAPI#claude/analyze-whatsapp-log-gcJdX

# Yarn (v2+)
yarn add @whiskeysockets/baileys@rsalcara/InfiniteAPI#claude/analyze-whatsapp-log-gcJdX

# PNPM
pnpm add @whiskeysockets/baileys@rsalcara/InfiniteAPI#claude/analyze-whatsapp-log-gcJdX

If you encounter any issues or have feedback, feel free to comment as well.


Copilot AI left a comment

Pull request overview

Adds connection-start mitigations aimed at improving reconnect reliability/latency by forcing fresh edge routing and capping how long “offline backlog” buffering can block live events.

Changes:

  • Introduces clearRoutingInfoOnStart SocketConfig option and clears persisted creds.routingInfo before connecting (plus persists the cleared state).
  • Adds an offline-buffer safety timer in socket.ts and ensures it’s cleared on socket end; adds focused Jest coverage for this timer logic.
  • Reduces several default buffering/initial-sync wait timeouts to flush earlier under stall conditions.

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

File                                              Description
src/tests/Socket/offline-buffer-timeout.test.ts   Adds unit tests mirroring the new offline-buffer safety timer behavior.
src/Utils/event-buffer.ts                         Lowers default BAILEYS_BUFFER_* timeouts to flush buffered events sooner.
src/Types/Socket.ts                               Adds clearRoutingInfoOnStart option with JSDoc guidance.
src/Socket/socket.ts                              Clears stored routingInfo on start, persists creds update, adds offline-buffer safety timer + end() cleanup.
src/Socket/chats.ts                               Reduces AwaitingInitialSync timeout from 20s to 8s and updates logs/comments.
src/Defaults/index.ts                             Enables clearRoutingInfoOnStart by default in DEFAULT_CONNECTION_CONFIG.

Comments suppressed due to low confidence (1)

src/Socket/socket.ts:521

  • This creds.update emit will fire on every socket creation whenever clearRoutingInfoOnStart is true and routingInfo is already undefined (i.e., even when nothing was cleared). That can trigger unnecessary consumer saveCreds() writes/side effects. Emit only when you actually modified routingInfo (e.g., gate on hadStaleRoutingInfo), and prefer emitting a minimal update payload instead of the entire creds object.

	const ev = makeEventBuffer(logger)

	// Persist the routingInfo clearing so the consumer's saveCreds() writes the clean state to disk.
	// This ensures that if the process restarts again before the server assigns new routingInfo,
	// the stale value is not reused.
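
The gating suggested in the review comment above could be sketched as follows. This illustrates the reviewer's suggestion and is not the merged code: `persistRoutingInfoClear`, `CredsUpdate`, and `CredsEmitter` are hypothetical, and the emitter is reduced to the single event discussed here.

```typescript
// Hypothetical minimal types; the real event emitter carries many more events.
type CredsUpdate = { routingInfo?: Uint8Array | undefined }

interface CredsEmitter {
	emit(event: 'creds.update', update: CredsUpdate): void
}

// Emits creds.update only when routingInfo was actually cleared, and sends
// just the changed field instead of the whole creds object.
function persistRoutingInfoClear(ev: CredsEmitter, hadStaleRoutingInfo: boolean): boolean {
	if (!hadStaleRoutingInfo) {
		return false // nothing changed, so no consumer write is triggered
	}

	ev.emit('creds.update', { routingInfo: undefined })
	return true
}
```

Gating on the flag avoids a consumer-side saveCreds() write on every socket creation where routingInfo was already absent.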

- Add missing space after 'if'/'else if' keywords in CB:stream:error handler
- Reformat long logger.warn/info lines to stay within line length limit

Fixes CI linting failures introduced by recent commits.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

…tion

The merge of master into this branch (a2bd33b) left two merge artifacts:

1. A duplicated `if (config.clearRoutingInfoOnStart...)` block nested inside
   the first, missing its closing brace — this caused TS1005 '}' expected
   at the end of the file.

2. A duplicate `const OFFLINE_BUFFER_TIMEOUT_MS` declaration (one from each
   branch) which would cause a duplicate identifier error.

Both are removed, leaving the correct single implementation.

https://claude.ai/code/session_015McJNWJwABDTEwx4bfG4C7

@rsalcara rsalcara merged commit f4daa91 into master Feb 25, 2026
4 checks passed