fix: retry tmux pane operations for WSL2 race condition #78

Open
arosstale wants to merge 1 commit into jayminwest:main from arosstale:fix/wsl2-tmux-pane-race

Conversation

@arosstale
Contributor

Problem

Fixes #73. On WSL2, tmux reports "can't find pane" immediately after session creation. The session exists but the pane hasn't been registered yet: a timing race between new-session and list-panes/send-keys.

The reporter confirmed tmux itself works fine; the issue is Overstory querying the pane too fast after creation.

Fix

Two changes in src/worktree/tmux.ts:

1. createSession: retry list-panes after session creation

  • Added 100ms initial delay after new-session
  • Retry list-panes up to 3 times with 250/500/750ms backoff

2. sendKeys: retry on transient "can't find pane" errors

  • Added retry loop (default 3 attempts, 250ms incremental backoff)
  • Distinguishes "can't find pane" (transient, retryable) from "session not found" (permanent, throw immediately)
  • New maxRetries parameter (default 3) for callers that need control
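
A minimal sketch of the retry shape described above, not the code from src/worktree/tmux.ts. The names `Runner`, `Sleep`, `isTransientPaneError`, and `sendKeysWithRetry` are hypothetical, and the command runner and sleep function are injected so the sketch does not depend on Bun. It assumes any stderr other than "can't find pane" should be treated as permanent.

```typescript
// Hypothetical names throughout; a sketch, not the PR's implementation.
type CommandResult = { stdout: string; stderr: string; exitCode: number };
type Runner = (args: string[]) => Promise<CommandResult>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Transient: the session exists but its pane is not registered yet.
function isTransientPaneError(stderr: string): boolean {
  return stderr.includes("can't find pane");
}

async function sendKeysWithRetry(
  run: Runner,
  session: string,
  keys: string,
  maxRetries = 3,
  sleep: Sleep = defaultSleep,
): Promise<void> {
  // One initial attempt plus up to maxRetries retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await run(["tmux", "send-keys", "-t", session, keys, "Enter"]);
    if (result.exitCode === 0) return;

    const stderr = result.stderr.trim();
    if (!isTransientPaneError(stderr)) {
      // Permanent failure (for example a missing session): do not retry.
      throw new Error(`tmux send-keys failed for "${session}": ${stderr}`);
    }
    if (attempt < maxRetries) {
      // Incremental backoff: 250ms, 500ms, 750ms, ...
      await sleep(250 * (attempt + 1));
    }
  }
  throw new Error(`Pane for "${session}" not found after ${maxRetries + 1} attempts`);
}
```

Injecting `run` and `sleep` is a design choice for this sketch only; it lets a test simulate a pane that appears on the third attempt without waiting on real timers.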

Testing

  • All 71 existing tmux tests pass
  • biome check clean
  • tsc --noEmit clean

Future direction: native Windows support

This fix helps WSL2, but Windows users still need WSL for tmux. I'd like to propose considering alternative session backends for native Windows (and beyond):

| Tool | Platform | What it does |
| --- | --- | --- |
| psmux | Windows | Native PowerShell-based terminal multiplexer, no WSL needed |
| mprocs | Windows + Linux + macOS | Rust-based process runner with TUI, manages multiple concurrent processes |
| tmux | Linux/macOS/WSL | Current backend, proven and mature |

A SessionBackend interface (createSession, sendKeys, killSession, capturePaneContent) would let Overstory auto-detect the best available backend:

  1. tmux if available (current behavior, Linux/macOS/WSL)
  2. mprocs if installed (cross-platform, good TUI)
  3. psmux on Windows without WSL (last resort)
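
One way the proposal could look in code, as a sketch under assumptions. Only the four method names come from the description above. The parameter lists, return types, `BackendName`, `PREFERENCE`, and `pickBackend` are hypothetical, and how each backend's availability is detected is left to the caller.

```typescript
// Method names are from the proposal; every signature here is an assumption.
interface SessionBackend {
  readonly name: BackendName;
  createSession(name: string, cwd: string, command: string): Promise<number>;
  sendKeys(name: string, keys: string): Promise<void>;
  killSession(name: string): Promise<void>;
  capturePaneContent(name: string): Promise<string>;
}

type BackendName = "tmux" | "mprocs" | "psmux";

// Preference order from the list above: tmux first, psmux as the last resort.
const PREFERENCE: readonly BackendName[] = ["tmux", "mprocs", "psmux"];

// Returns the most preferred backend that the caller reports as available,
// or undefined when none is installed.
function pickBackend(available: ReadonlySet<BackendName>): BackendName | undefined {
  return PREFERENCE.find((name) => available.has(name));
}
```

For example, `pickBackend(new Set<BackendName>(["psmux", "mprocs"]))` selects `"mprocs"`, since it ranks above psmux in the list.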

This would make ov sling work on native Windows without any WSL dependency. Happy to work on a follow-up PR if there's interest.

On WSL2, tmux occasionally reports 'can't find pane' immediately after
session creation. The pane exists but hasn't been registered yet.

Changes:
- createSession: add 100ms delay + 3 retries with backoff for list-panes
- sendKeys: add retry loop (default 3 attempts) for transient pane errors
- Distinguish 'can't find pane' (transient, retryable) from 'session not
  found' (permanent, throw immediately)

All 71 tmux tests pass. No new dependencies.

Copilot AI left a comment

Pull request overview

This PR fixes a WSL2-specific race condition (issue #73) where tmux reports "can't find pane" immediately after session creation because the pane hasn't been fully registered yet. The fix adds retry logic with backoff to two functions in the tmux module.

Changes:

  • Added a 100ms initial delay and retry loop (up to 3 attempts with incremental backoff) for list-panes in createSession after new-session completes
  • Added retry logic in sendKeys for transient "can't find pane" errors, with a configurable maxRetries parameter (default 3) and incremental backoff, while keeping non-retryable errors (session not found, no server) as immediate throws

await Bun.sleep(100);

// Retrieve the actual PID of the process running inside the tmux pane.
// Retry up to 3 times with backoff for WSL2 race conditions where the

Copilot AI Mar 5, 2026

The comment says "Retry up to 3 times" but the loop for (let attempt = 0; attempt < 3; attempt++) makes 3 total attempts, which is 1 initial attempt + 2 retries. The comment should say either "Retry up to 2 times" or "Try up to 3 times" to accurately reflect the behavior. This contrasts with sendKeys below which uses maxRetries consistently to mean the number of retries (separate from the initial attempt).

Suggested change
- // Retry up to 3 times with backoff for WSL2 race conditions where the
+ // Try up to 3 times with backoff for WSL2 race conditions where the

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +128
// Brief delay before querying pane PID — on WSL2, tmux needs time to
// register the pane after session creation (see #73).
await Bun.sleep(100);

Copilot AI Mar 5, 2026

The unconditional 100ms Bun.sleep(100) delay is applied on all platforms, not just WSL2. While 100ms is small, this adds up when multiple sessions are created (e.g., in sling which may launch several agents). Consider either making this configurable, detecting WSL2 (e.g. checking /proc/version for "microsoft"), or removing this initial sleep and relying solely on the retry loop which already handles the case where list-panes fails on the first attempt.
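
A sketch of the detection option mentioned above, assuming a match on "microsoft" in /proc/version is good enough. `looksLikeWsl` and `initialDelayMs` are hypothetical names. The file reader is injected so the sketch stays runtime-agnostic. This check likely matches WSL1 as well as WSL2.

```typescript
// Hypothetical helpers; the "microsoft" substring check follows the suggestion above.
function looksLikeWsl(procVersion: string): boolean {
  return procVersion.toLowerCase().includes("microsoft");
}

// Returns the pre-query delay: 100ms under WSL, 0 everywhere else.
// readProcVersion is injected; it may throw where /proc/version does not exist.
function initialDelayMs(readProcVersion: () => string): number {
  try {
    return looksLikeWsl(readProcVersion()) ? 100 : 0;
  } catch {
    // No /proc/version (for example macOS): not WSL, so no delay.
    return 0;
  }
}
```

In a Node-compatible runtime the reader could be `() => readFileSync("/proc/version", "utf8")` with `readFileSync` imported from `node:fs`.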

Comment on lines +588 to +601
// "can't find pane" is a transient race condition on WSL2 — the session
// exists but the pane hasn't been fully registered yet. Retry with backoff.
if (trimmedStderr.includes("can't find pane") || trimmedStderr.includes("cant find pane")) {
  if (attempt < maxRetries) {
    const delayMs = 250 * (attempt + 1);
    await Bun.sleep(delayMs);
    continue;
  }
  // Exhausted retries — report as pane-specific error
  throw new AgentError(
    `Tmux pane for session "${name}" not found after ${maxRetries + 1} attempts. On WSL2, this can indicate a tmux startup race condition. Try increasing the retry count or adding a delay after session creation.`,
    { agentName: name },
  );
}

Copilot AI Mar 5, 2026

The new retry behavior for "can't find pane" errors in sendKeys is not covered by tests. The existing test suite has comprehensive test coverage for all other error paths (session not found, no server running, generic error). Consider adding tests that verify:

  1. Retry succeeds on a transient "can't find pane" error (mock fails once, then succeeds)
  2. All retries exhausted for persistent "can't find pane" errors
  3. Non-retryable errors (e.g., "session not found") throw immediately without retrying

Comment on lines +133 to +138
let pidResult: { stdout: string; stderr: string; exitCode: number } | undefined;
for (let attempt = 0; attempt < 3; attempt++) {
  pidResult = await runCommand(["tmux", "list-panes", "-t", name, "-F", "#{pane_pid}"]);
  if (pidResult.exitCode === 0) break;
  await Bun.sleep(250 * (attempt + 1));
}

Copilot AI Mar 5, 2026

The retry logic for list-panes in createSession has no test coverage. The existing test suite covers the happy path and various error conditions for createSession. Consider adding a test where list-panes fails on the first attempt but succeeds on a subsequent attempt to verify the retry behavior works correctly.
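
A sketch of how such a test could be set up without touching tmux, under stated assumptions. `scriptedRunner` and `queryPanePid` are hypothetical stand-ins, not functions from the repository, and the real suite would more likely spy on `runCommand`. The stand-in loop takes a configurable attempt count and skips the sleep after the final attempt.

```typescript
// Hypothetical stand-ins for illustration; not the repository's helpers.
type CommandResult = { stdout: string; stderr: string; exitCode: number };
type Runner = (args: string[]) => Promise<CommandResult>;
type Sleep = (ms: number) => Promise<void>;

// Fake runner: replays canned results in order (repeating the last one)
// and records every argument list it was called with.
function scriptedRunner(results: CommandResult[]): { run: Runner; calls: string[][] } {
  const calls: string[][] = [];
  let next = 0;
  const run: Runner = async (args) => {
    calls.push(args);
    const result = results[Math.min(next, results.length - 1)];
    next++;
    if (!result) throw new Error("scriptedRunner needs at least one result");
    return result;
  };
  return { run, calls };
}

// Stand-in for the list-panes loop: up to maxAttempts tries with
// 250ms incremental backoff between tries.
async function queryPanePid(
  run: Runner,
  name: string,
  sleep: Sleep,
  maxAttempts = 3,
): Promise<number> {
  let last: CommandResult | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await run(["tmux", "list-panes", "-t", name, "-F", "#{pane_pid}"]);
    if (last.exitCode === 0) {
      const firstLine = last.stdout.trim().split("\n")[0] ?? "";
      const pid = Number.parseInt(firstLine, 10);
      if (Number.isNaN(pid)) throw new Error(`Unexpected list-panes output: "${firstLine}"`);
      return pid;
    }
    // Only wait when another attempt will follow.
    if (attempt < maxAttempts - 1) await sleep(250 * (attempt + 1));
  }
  throw new Error(`list-panes failed for "${name}": ${last?.stderr.trim() ?? "no attempts made"}`);
}
```

A fail-then-succeed test passes two canned results (a failure followed by a PID line), then checks the returned PID, the number of recorded calls, and the delays requested from the fake `sleep`.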

Collaborator

@lucabarak left a comment

The fix makes sense and the error categorization (transient "can't find pane" vs permanent "session not found") is well thought out. A few things to address:

  1. Tests — the retry logic isn't covered by any new tests. The existing tmux tests already use Bun.spyOn on runCommand, so the same pattern would work here to simulate transient failures followed by success. Per CONTRIBUTING.md, tests are required.

  2. Unconditional Bun.sleep(100) in createSession — this adds 100ms to every session creation on all platforms. Since the retry loop already handles the case where list-panes fails, the initial sleep shouldn't be needed for non-WSL2 users. Consider removing it and relying on the retry backoff alone.

  3. Minor: retry count is hardcoded to 3 in createSession but configurable via maxRetries in sendKeys — making them consistent would be cleaner.

Good direction though — this should fix #73 once the tests are added.

Development

Successfully merging this pull request may close these issues.

Failed to send keys to tmux session: can't find pane (WSL2)