fix: resolve Bad MAC, No Session, Invalid PreKey errors (#1769) #2372

Open
kobie3717 wants to merge 1 commit into WhiskeySockets:master from kobie3717:fix/bad-mac-session-stability

@kobie3717

Problem

Users experience persistent decryption failures manifesting as:

  • Bad MAC / Failed to decrypt
  • No matching session
  • Invalid PreKey ID

These errors are the most reported issue in the repository (#1769, #2340, #2362) and have been occurring since the WhatsApp LID (Linked Identity) migration.

Root Causes

1. LID/PN Transaction Race Condition

When WhatsApp sends messages via both PN (phone number) and LID JIDs for the same contact, decryptMessage and encryptMessage use the raw JID as the transaction mutex key. Two concurrent operations for the same logical session (one via PN, one via LID) therefore acquire different locks, and their concurrent session state mutations corrupt the ratchet → Bad MAC.

2. Aggressive PN Session Deletion During Migration

migrateSession() copies the session from PN→LID address then deletes the PN session (sessionUpdates[pnAddr] = null). Any in-flight messages still addressed to the PN JID immediately fail with No matching session.

3. Immediate PreKey Deletion

removePreKey() deletes the pre-key immediately after first use. When WhatsApp retransmits the same message (common during connectivity issues), the pre-key is already gone → Invalid PreKey ID.

Changes

Fix 1: Canonical JID Resolution for Transaction Locks (src/Signal/libsignal.ts)

Before entering a transaction in decryptMessage / encryptMessage, resolve the JID to its canonical (LID-preferred) form via the existing LIDMappingStore. This ensures PN and LID operations for the same contact serialize on the same mutex key.
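
A minimal sketch of the idea, not the code in this PR: the lock key is derived from the canonical JID rather than the raw one, so PN and LID operations for the same contact queue behind each other. `KeyedMutex`, `canonicalLockKey`, the `LidLookup` type, and the sample JIDs are hypothetical names for illustration. The `@s.whatsapp.net` suffix check is an assumption about how PN JIDs are recognised, and the lookup only stands in for `getLIDForPN()`, whose real signature may differ.

```typescript
// Hypothetical stand-in for the LID mapping lookup; the real store API may differ.
type LidLookup = (pnJid: string) => Promise<string | null>

// Resolve a JID to the key used for locking: prefer the LID, fall back to the input.
async function canonicalLockKey(jid: string, getLIDForPN: LidLookup): Promise<string> {
  // Assumption: PN JIDs end in '@s.whatsapp.net'; anything else is used as-is.
  if (!jid.endsWith('@s.whatsapp.net')) return jid
  const lid = await getLIDForPN(jid)
  return lid ?? jid
}

// Minimal keyed mutex: tasks that share a key run strictly one after another.
class KeyedMutex {
  private tails = new Map<string, Promise<void>>()

  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve()
    let release!: () => void
    const current = new Promise<void>(resolve => (release = resolve))
    // The new tail settles only after every earlier task and this one have finished.
    const tail = previous.then(() => current)
    this.tails.set(key, tail)
    await previous
    try {
      return await task()
    } finally {
      release()
      // Drop the entry once nobody else is queued behind us.
      if (this.tails.get(key) === tail) this.tails.delete(key)
    }
  }
}

// Two operations for the same contact, one addressed by PN and one by LID.
async function demo(): Promise<string[]> {
  const mapping: Record<string, string> = { '15550001111@s.whatsapp.net': '999@lid' }
  const lookup: LidLookup = async pn => mapping[pn] ?? null
  const mutex = new KeyedMutex()
  const log: string[] = []

  const op = async (jid: string, label: string): Promise<void> => {
    const key = await canonicalLockKey(jid, lookup)
    await mutex.run(key, async () => {
      log.push(`${label}:start`)
      await new Promise<void>(resolve => setTimeout(resolve, 10))
      log.push(`${label}:end`)
    })
  }

  await Promise.all([op('15550001111@s.whatsapp.net', 'pn'), op('999@lid', 'lid')])
  return log
}
```

With raw JIDs as keys the two operations would hold different locks and their `start`/`end` entries could interleave. With the canonical key, whichever operation acquires the lock first finishes before the other starts.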

Fix 2: Retain PN Session During LID Migration (src/Signal/libsignal.ts)

In migrateSession(), copy the session to the LID address but do not delete the PN session. The PN session will naturally fall out of use as new messages arrive under the LID address. This is safe because signal storage already resolves PN→LID internally via resolveLIDSignalAddress.
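
The difference between the two migration behaviours can be sketched as follows. This is an illustration under assumed types, not the actual `migrateSession()` body: `SessionRecord`, `SessionUpdates`, `buildMigrationUpdates`, `applyUpdates`, and the address strings are hypothetical. Only the convention that a `null` entry means "delete" is taken from the description above.

```typescript
// Hypothetical shapes for illustration; the real store types may differ.
type SessionRecord = Uint8Array
type SessionUpdates = { [address: string]: SessionRecord | null }

// Build the store update for a PN -> LID migration.
// deletePnSession = true mirrors the old behaviour; false retains the PN copy.
function buildMigrationUpdates(
  sessions: Map<string, SessionRecord>,
  pnAddr: string,
  lidAddr: string,
  deletePnSession: boolean
): SessionUpdates {
  const record = sessions.get(pnAddr)
  if (record === undefined) return {} // nothing to migrate

  const updates: SessionUpdates = {}
  updates[lidAddr] = record // copy the session to the LID address
  if (deletePnSession) updates[pnAddr] = null // null marks the entry for deletion
  return updates
}

// Apply updates to an in-memory store: null deletes, anything else overwrites.
function applyUpdates(sessions: Map<string, SessionRecord>, updates: SessionUpdates): void {
  Object.keys(updates).forEach(address => {
    const value = updates[address]
    if (value === null || value === undefined) sessions.delete(address)
    else sessions.set(address, value)
  })
}

// Run one migration against a fresh store and return the resulting state.
function migrate(deletePnSession: boolean): Map<string, SessionRecord> {
  const sessions = new Map<string, SessionRecord>()
  sessions.set('pn-address', new Uint8Array([1, 2, 3]))
  const updates = buildMigrationUpdates(sessions, 'pn-address', 'lid-address', deletePnSession)
  applyUpdates(sessions, updates)
  return sessions
}
```

After `migrate(false)` both addresses resolve to a session, so an in-flight message still addressed to the PN JID can be decrypted. After `migrate(true)` the PN entry is gone.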

Fix 3: Delayed PreKey Deletion (src/Signal/libsignal.ts)

Replace immediate pre-key deletion with a 5-minute grace period. Used pre-keys are scheduled for deletion via a lightweight timer. Retransmissions within the grace window succeed. The timer uses unref() to avoid blocking process exit.
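
The grace-period bookkeeping might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: `PreKeyGraceTracker`, `markUsed`, and `sweep` are hypothetical names, timestamps are passed in explicitly so the logic stays deterministic, and the timer that would periodically call `sweep()` (with `unref()` applied) is deliberately left out. Whether a retransmission extends the window is not stated above; this sketch assumes it does not.

```typescript
// 5-minute grace period, as described in the PR text.
const PREKEY_GRACE_MS = 5 * 60 * 1000

// Tracks used pre-keys and reports which ones are due for deletion.
class PreKeyGraceTracker {
  private usedAt = new Map<number, number>() // pre-key id -> first-use timestamp (ms)
  private graceMs: number

  constructor(graceMs: number = PREKEY_GRACE_MS) {
    this.graceMs = graceMs
  }

  // Record the first use of a pre-key. Later uses (retransmissions) do not
  // extend the window; that is a design assumption of this sketch.
  markUsed(keyId: number, now: number): void {
    if (!this.usedAt.has(keyId)) this.usedAt.set(keyId, now)
  }

  // Return the ids whose grace window has elapsed and stop tracking them.
  // The caller would then delete those pre-keys from the key store.
  sweep(now: number): number[] {
    const expired: number[] = []
    this.usedAt.forEach((firstUse, keyId) => {
      if (now - firstUse >= this.graceMs) {
        expired.push(keyId)
        this.usedAt.delete(keyId)
      }
    })
    return expired
  }

  // Number of pre-keys still inside their grace window.
  get pending(): number {
    return this.usedAt.size
  }
}
```

A periodic timer could call `sweep(Date.now())` and stop itself once `pending` reaches zero, which matches the self-cleaning behaviour described in the risk assessment.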

Risk Assessment

  • Fix 1 (canonical JID lock): Low risk. Uses the existing getLIDForPN() lookup. Falls back to the original JID if no mapping exists. Worst case: one extra async lookup per encrypt/decrypt.
  • Fix 2 (PN session retention): Low risk. Retains data that was previously deleted. The signal storage layer already resolves PN→LID for lookups, so the retained PN session is a safety net, not a conflict source.
  • Fix 3 (delayed prekey deletion): Low risk. Pre-keys are still deleted, just after a 5-minute delay. The grace period Map is bounded by the number of pre-keys used in that window (typically single digits). Timer self-cleans when empty.

Testing Notes

  • All pre-existing TypeScript compilation errors remain unchanged; no new errors introduced
  • These fixes address the protocol-level causes; thorough testing requires multi-device scenarios with LID migration in progress
  • Recommended to test with high message volume during PN→LID transition

Fixes #1769. Related: #2340, #2362.

Successfully merging this pull request may close these issues.

[BUG] "Bad Mac", "Failed to decrypt...", "Closed session...", "No session...", "Invalid PreKey..."