fix: retry tmux pane operations for WSL2 race condition#78
arosstale wants to merge 1 commit into jayminwest:main
On WSL2, tmux occasionally reports 'can't find pane' immediately after session creation. The pane exists but hasn't been registered yet.

Changes:
- createSession: add 100ms delay + 3 retries with backoff for list-panes
- sendKeys: add retry loop (default 3 attempts) for transient pane errors
- Distinguish 'can't find pane' (transient, retryable) from 'session not found' (permanent, throw immediately)

All 71 tmux tests pass. No new dependencies.
Pull request overview
This PR fixes a WSL2-specific race condition (issue #73) where tmux reports "can't find pane" immediately after session creation because the pane hasn't been fully registered yet. The fix adds retry logic with backoff to two functions in the tmux module.
Changes:
- Added a 100ms initial delay and retry loop (up to 3 attempts with incremental backoff) for `list-panes` in `createSession` after `new-session` completes
- Added retry logic in `sendKeys` for transient "can't find pane" errors, with a configurable `maxRetries` parameter (default 3) and incremental backoff, while keeping non-retryable errors (session not found, no server) as immediate throws
    await Bun.sleep(100);

    // Retrieve the actual PID of the process running inside the tmux pane.
    // Retry up to 3 times with backoff for WSL2 race conditions where the
The comment says "Retry up to 3 times" but the loop for (let attempt = 0; attempt < 3; attempt++) makes 3 total attempts, which is 1 initial attempt + 2 retries. The comment should say either "Retry up to 2 times" or "Try up to 3 times" to accurately reflect the behavior. This contrasts with sendKeys below which uses maxRetries consistently to mean the number of retries (separate from the initial attempt).
Suggested change:

    - // Retry up to 3 times with backoff for WSL2 race conditions where the
    + // Try up to 3 times with backoff for WSL2 race conditions where the
    // Brief delay before querying pane PID — on WSL2, tmux needs time to
    // register the pane after session creation (see #73).
    await Bun.sleep(100);
The unconditional 100ms Bun.sleep(100) delay is applied on all platforms, not just WSL2. While 100ms is small, this adds up when multiple sessions are created (e.g., in sling which may launch several agents). Consider either making this configurable, detecting WSL2 (e.g. checking /proc/version for "microsoft"), or removing this initial sleep and relying solely on the retry loop which already handles the case where list-panes fails on the first attempt.
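One way to act on the WSL2-detection suggestion is sketched below. It assumes, as the comment does, that the kernel version string in `/proc/version` contains "microsoft" under WSL. The names `looksLikeWsl` and `isWsl` are hypothetical and not part of this PR, and the check on its own does not distinguish WSL1 from WSL2.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: decide from the text of /proc/version whether the
// kernel looks like a WSL kernel. Assumption: WSL kernels include the word
// "microsoft" (in some casing) in their version string.
function looksLikeWsl(procVersion: string): boolean {
  return procVersion.toLowerCase().includes("microsoft");
}

// Hypothetical wrapper: read the file and treat any read failure (for
// example on macOS, where /proc/version does not exist) as "not WSL".
function isWsl(path = "/proc/version"): boolean {
  try {
    return looksLikeWsl(readFileSync(path, "utf8"));
  } catch {
    return false;
  }
}
```

A caller could then gate the initial delay, for example `if (isWsl()) await Bun.sleep(100);`, so that other platforms do not pay the extra 100ms.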
    // "can't find pane" is a transient race condition on WSL2 — the session
    // exists but the pane hasn't been fully registered yet. Retry with backoff.
    if (trimmedStderr.includes("can't find pane") || trimmedStderr.includes("cant find pane")) {
      if (attempt < maxRetries) {
        const delayMs = 250 * (attempt + 1);
        await Bun.sleep(delayMs);
        continue;
      }
      // Exhausted retries — report as pane-specific error
      throw new AgentError(
        `Tmux pane for session "${name}" not found after ${maxRetries + 1} attempts. On WSL2, this can indicate a tmux startup race condition. Try increasing the retry count or adding a delay after session creation.`,
        { agentName: name },
      );
    }
The new retry behavior for "can't find pane" errors in sendKeys is not covered by tests. The existing test suite has comprehensive test coverage for all other error paths (session not found, no server running, generic error). Consider adding tests that verify:
- Retry succeeds on a transient "can't find pane" error (mock fails once, then succeeds)
- All retries exhausted for persistent "can't find pane" errors
- Non-retryable errors (e.g., "session not found") throw immediately without retrying
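The three cases above can be exercised without a real tmux server by injecting the command runner and the sleep function. The following is only a sketch under stated assumptions: `sendKeysSketch`, `scriptedRunner`, `Runner`, and `Sleeper` are hypothetical names, the function is a simplified stand-in for the real `sendKeys` (it throws a plain `Error` rather than `AgentError`), and the `send-keys` argument list is a guess. Real tests would more likely reuse the project's existing spy on `runCommand`.

```typescript
type CommandResult = { stdout: string; stderr: string; exitCode: number };
type Runner = (cmd: string[]) => Promise<CommandResult>;
type Sleeper = (ms: number) => Promise<void>;

// Simplified stand-in for sendKeys (hypothetical; not the PR's code).
async function sendKeysSketch(
  name: string,
  keys: string,
  run: Runner,
  sleep: Sleeper,
  maxRetries = 3,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    const result = await run(["tmux", "send-keys", "-t", name, keys]);
    if (result.exitCode === 0) return;

    const stderr = result.stderr.trim();
    const transient = stderr.includes("can't find pane");
    if (transient && attempt < maxRetries) {
      await sleep(250 * (attempt + 1)); // 250 ms, 500 ms, 750 ms
      continue;
    }
    // Either a permanent error, or a transient one with retries exhausted.
    throw new Error(
      transient
        ? `pane for "${name}" not found after ${attempt + 1} attempts`
        : `tmux send-keys failed: ${stderr}`,
    );
  }
}

// Fake runner that replays a scripted list of results (repeating the last
// one once the script runs out) and counts how often it was called.
function scriptedRunner(results: CommandResult[]): { run: Runner; calls: () => number } {
  let calls = 0;
  const run: Runner = async () => {
    const result = results[Math.min(calls, results.length - 1)];
    calls++;
    return result;
  };
  return { run, calls: () => calls };
}

const ok: CommandResult = { stdout: "", stderr: "", exitCode: 0 };
const noPane: CommandResult = { stdout: "", stderr: "can't find pane: demo", exitCode: 1 };
const noSession: CommandResult = { stdout: "", stderr: "session not found: demo", exitCode: 1 };
const noSleep: Sleeper = async () => {}; // skip real delays in tests
```

With `scriptedRunner([noPane, ok])` the call resolves after two runner calls. With `scriptedRunner([noPane])` and the default `maxRetries` it rejects after four calls (one initial attempt plus three retries). With `scriptedRunner([noSession])` it rejects after a single call.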
    let pidResult: { stdout: string; stderr: string; exitCode: number } | undefined;
    for (let attempt = 0; attempt < 3; attempt++) {
      pidResult = await runCommand(["tmux", "list-panes", "-t", name, "-F", "#{pane_pid}"]);
      if (pidResult.exitCode === 0) break;
      await Bun.sleep(250 * (attempt + 1));
    }
The retry logic for list-panes in createSession has no test coverage. The existing test suite covers the happy path and various error conditions for createSession. Consider adding a test where list-panes fails on the first attempt but succeeds on a subsequent attempt to verify the retry behavior works correctly.
lucabarak left a comment
The fix makes sense and the error categorization (transient "can't find pane" vs permanent "session not found") is well thought out. A couple things to address:
- Tests: the retry logic isn't covered by any new tests. The existing tmux tests already use `Bun.spyOn` on `runCommand`, so the same pattern would work here to simulate transient failures followed by success. Per CONTRIBUTING.md, tests are required.
- Unconditional `Bun.sleep(100)` in `createSession`: this adds 100ms to every session creation on all platforms. Since the retry loop already handles the case where `list-panes` fails, the initial sleep shouldn't be needed for non-WSL2 users. Consider removing it and relying on the retry backoff alone.
- Minor: retry count is hardcoded to `3` in `createSession` but configurable via `maxRetries` in `sendKeys`. Making them consistent would be cleaner.
Good direction though — this should fix #73 once the tests are added.
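The second and third review points could be addressed together along the following lines. This is a sketch, not the PR's code: `queryPanePid`, `maxAttempts`, and `baseDelayMs` are hypothetical names, and the runner is injected only to keep the example self-contained. The `list-panes` arguments mirror the call shown in the diff.

```typescript
type CommandResult = { stdout: string; stderr: string; exitCode: number };
type Runner = (cmd: string[]) => Promise<CommandResult>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical replacement for the hardcoded loop in createSession.
// There is no up-front delay: the first attempt runs immediately, and the
// backoff is only paid when an attempt actually fails.
async function queryPanePid(
  name: string,
  run: Runner,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<CommandResult> {
  const cmd = ["tmux", "list-panes", "-t", name, "-F", "#{pane_pid}"];
  let result = await run(cmd);
  for (let attempt = 1; attempt < maxAttempts && result.exitCode !== 0; attempt++) {
    await sleep(baseDelayMs * attempt); // 250 ms, then 500 ms, ...
    result = await run(cmd);
  }
  return result; // the caller still inspects exitCode, as it does today
}
```

Unlike the loop in the diff, this version does not sleep after the final failed attempt, so three failing attempts wait 250ms + 500ms rather than 250ms + 500ms + 750ms. Platforms where `list-panes` succeeds on the first call pay no delay at all.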
Problem
Fixes #73. On WSL2, tmux reports `can't find pane` immediately after session creation. The session exists but the pane hasn't been registered yet: a timing race between `new-session` and `list-panes`/`send-keys`.

The reporter confirmed tmux itself works fine; the issue is Overstory querying the pane too fast after creation.
Fix
Two changes in `src/worktree/tmux.ts`:

1. `createSession`: retry `list-panes` after session creation
- After `new-session`, query `list-panes` up to 3 times with 250/500/750ms backoff

2. `sendKeys`: retry on transient `can't find pane` errors
- Distinguish `can't find pane` (transient, retryable) from `session not found` (permanent, throw immediately)
- `maxRetries` parameter (default 3) for callers that need control

Testing
- `biome check` clean
- `tsc --noEmit` clean

Future direction: native Windows support
This fix helps WSL2, but Windows users still need WSL for tmux. I'd like to propose considering alternative session backends for native Windows (and beyond):
A `SessionBackend` interface (`createSession`, `sendKeys`, `killSession`, `capturePaneContent`) would let Overstory auto-detect the best available backend:
- `tmux` if available (current behavior, Linux/macOS/WSL)
- `mprocs` if installed (cross-platform, good TUI)
- `psmux` on Windows without WSL (last resort)

This would make `ov sling` work on native Windows without any WSL dependency. Happy to work on a follow-up PR if there's interest.