Supervisor resilience: failure handling, conflict resolution, and observability #563

@acreeger

Summary

Add failure handling with configurable retries, a dedicated conflict resolution agent for merge failures, resume support for crashed supervisors, and structured progress reporting (log output + JSON file for VS Code extension).

Context

Part of the Autonomous Swarm Mode epic (#557). This issue extends the supervisor loop (#562) with production-grade resilience and observability.

Scope

1. Failure Handling & Retry

When an agent exits with non-zero status:

  1. Release the Beads claim (bd update <id> --status open)
  2. Clean up the failed worktree
  3. Increment failure counter for this task
  4. If failureCount < swarm.maxRetries:
    • Log "Retrying task #N (attempt X of Y)..."
    • Create fresh worktree, spawn new agent
  5. If failureCount >= swarm.maxRetries:
    • Mark as failed in Beads (bd close <id> --reason "Failed after N retries")
    • Log failure, continue with other tasks
    • Important: if the failed task was blocking others, those tasks remain blocked
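
A minimal sketch of the retry decision in steps 3 to 5, following the comparison against swarm.maxRetries exactly as written. The names FailureDecision, failureCounts, and onAgentFailure are hypothetical, and reading "attempt X of Y" as the upcoming attempt number over maxRetries is an assumption.

```typescript
// Sketch only: type and function names are hypothetical.
// Only the comparison against swarm.maxRetries comes from the steps above.
type FailureDecision =
  | { action: "retry"; logLine: string }
  | { action: "fail"; failureCount: number };

// Failure counter per task, keyed by issue number.
const failureCounts = new Map<number, number>();

function onAgentFailure(issue: number, maxRetries: number): FailureDecision {
  // Step 3: increment the failure counter for this task.
  const failureCount = (failureCounts.get(issue) ?? 0) + 1;
  failureCounts.set(issue, failureCount);

  // Step 4: retry while failureCount < maxRetries.
  if (failureCount < maxRetries) {
    // Assumption: X is the upcoming attempt number and Y is maxRetries.
    const logLine = `Retrying task #${issue} (attempt ${failureCount + 1} of ${maxRetries})...`;
    return { action: "retry", logLine };
  }

  // Step 5: the caller closes the task in Beads and continues with other tasks.
  return { action: "fail", failureCount };
}
```

With maxRetries set to 2, the first failure of a task yields a retry and the second marks it failed. Releasing the claim and cleaning up the worktree (steps 1 and 2) would happen before this function is called.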

At swarm completion, report aggregate failures:

Swarm complete: 6/7 tasks succeeded, 1 failed
Failed: #103 "Implement auth" (2 attempts, agent error)
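
The report above could be produced by a small formatter along these lines. This is a sketch: FailedTask and formatSwarmReport are hypothetical names, and the succeeded count is passed in explicitly because blocked tasks are neither succeeded nor failed.

```typescript
// Hypothetical shape for one failed task in the final report.
interface FailedTask {
  issue: number;
  title: string;
  attempts: number;
  cause: string; // short cause label, e.g. "agent error"
}

// Build the report lines shown above: one header, then one line per failure.
function formatSwarmReport(succeeded: number, total: number, failed: FailedTask[]): string[] {
  const lines = [
    `Swarm complete: ${succeeded}/${total} tasks succeeded, ${failed.length} failed`,
  ];
  for (const task of failed) {
    lines.push(`Failed: #${task.issue} "${task.title}" (${task.attempts} attempts, ${task.cause})`);
  }
  return lines;
}
```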

2. Conflict Resolution Agent

When the merge queue encounters a conflict:

  1. Detect merge failure (GitHub API returns conflict status)
  2. Log "Merge conflict detected for PR #N. Spawning resolver..."
  3. Spawn a lightweight Claude Code agent in the child's worktree:
    • Agent prompt: "Rebase branch <child-branch> onto <epic-branch>, resolve all merge conflicts preserving the intent of both changes, then force-push."
    • Use il spin -p with a conflict-resolution-specific prompt (new template or env var)
    • Agent has context: the PR diff, the epic branch state
  4. After resolver exits:
    • Retry merge
    • If conflicts remain and conflictRetryCount < swarm.maxConflictRetries: repeat
    • If retries are exhausted: mark the task as failed and skip it
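
The resolve-and-retry loop can be modelled as a pure decision function, which keeps it easy to unit test. This is a sketch under one assumption: conflictRetryCount counts resolver runs so far. The names afterMergeAttempt and simulate are hypothetical, and simulate replays scripted outcomes in place of real merge calls.

```typescript
type MergeResult = "merged" | "conflict";
type ConflictAction = "done" | "spawn-resolver" | "mark-failed";

// Decide what to do after one merge attempt.
// Assumption: conflictRetryCount is the number of resolver runs so far.
function afterMergeAttempt(
  result: MergeResult,
  conflictRetryCount: number,
  maxConflictRetries: number,
): ConflictAction {
  if (result === "merged") return "done";
  return conflictRetryCount < maxConflictRetries ? "spawn-resolver" : "mark-failed";
}

// Replay a scripted list of merge outcomes through the decision function.
function simulate(
  outcomes: MergeResult[],
  maxConflictRetries: number,
): { final: ConflictAction; resolverRuns: number } {
  let resolverRuns = 0;
  for (const outcome of outcomes) {
    const action = afterMergeAttempt(outcome, resolverRuns, maxConflictRetries);
    if (action !== "spawn-resolver") return { final: action, resolverRuns };
    resolverRuns++; // the resolver agent would run here, then the merge is retried
  }
  throw new Error("ran out of scripted merge outcomes");
}
```

Under this reading, a limit of 3 allows at most three resolver runs and four merge attempts before the task is marked failed.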

3. Resume Support

When supervisor starts and detects existing Beads state for this epic:

  1. Read Beads task statuses (bd list --json)
  2. Skip tasks marked as closed (already completed)
  3. For tasks marked in_progress:
    • Check PID file for running processes
    • If the process is still running: re-attach monitoring
    • If the process is dead: release the claim and treat it as a failure (retry applies)
  4. For tasks marked open/ready: proceed normally
  5. Log "Resuming swarm: X completed, Y in progress, Z remaining"
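
A sketch of how the resume pass might classify tasks. The BeadsTask shape is hypothetical (the real bd list --json output may differ), as are the action names and the map from task id to PID. The liveness check uses Node's process.kill with signal 0, which tests for a process without signalling it.

```typescript
// Hypothetical shapes: the real `bd list --json` output may differ.
type BeadsStatus = "closed" | "in_progress" | "open" | "ready";
interface BeadsTask { id: string; status: BeadsStatus }
type ResumeAction = "skip" | "reattach" | "release-and-retry" | "proceed";

// Signal 0 sends nothing; it only checks whether the PID exists.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it.
    return (err as { code?: string }).code === "EPERM";
  }
}

function planResume(
  tasks: BeadsTask[],
  pids: Map<string, number>, // task id -> PID read from the PID file
  isAlive: (pid: number) => boolean = isProcessAlive,
): { actions: Map<string, ResumeAction>; logLine: string } {
  const actions = new Map<string, ResumeAction>();
  let completed = 0;
  let inProgress = 0;
  let remaining = 0;
  for (const task of tasks) {
    if (task.status === "closed") {
      completed++;
      actions.set(task.id, "skip");
    } else if (task.status === "in_progress") {
      inProgress++;
      const pid = pids.get(task.id);
      // A missing PID entry is treated the same as a dead process.
      const alive = pid !== undefined && isAlive(pid);
      actions.set(task.id, alive ? "reattach" : "release-and-retry");
    } else {
      remaining++;
      actions.set(task.id, "proceed");
    }
  }
  const logLine = `Resuming swarm: ${completed} completed, ${inProgress} in progress, ${remaining} remaining`;
  return { actions, logLine };
}
```

Passing isAlive as a parameter lets the resume unit tests cover each Beads state without real processes.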

4. Progress Reporting

Terminal output:

  • On state change: log structured line with timestamp
  • Periodic summary (every 30s or on change): "Active: 3/3 | Completed: 4/7 | Failed: 0 | Blocked: 0"
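
A sketch of the two terminal outputs. The summary format is taken from the example above; the layout of the state-change line is an assumption, as is reading "Active: 3/3" as running agents over the concurrency limit. All names are hypothetical.

```typescript
interface SwarmCounts {
  active: number;
  maxActive: number; // assumed to be the concurrency limit
  completed: number;
  total: number;
  failed: number;
  blocked: number;
}

// Periodic summary line, matching the example format above.
function formatSummary(c: SwarmCounts): string {
  return `Active: ${c.active}/${c.maxActive} | Completed: ${c.completed}/${c.total} | Failed: ${c.failed} | Blocked: ${c.blocked}`;
}

// One structured line per state change; this layout is illustrative only.
function formatStateChange(at: Date, issue: number, from: string, to: string): string {
  return `${at.toISOString()} task=#${issue} ${from} -> ${to}`;
}

// Print the summary on any change, or once 30 seconds have passed.
function shouldPrintSummary(lastPrintedMs: number, nowMs: number, changed: boolean): boolean {
  return changed || nowMs - lastPrintedMs >= 30_000;
}
```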

JSON progress file:
Written to ~/.config/iloom-ai/looms/<epic-loom-id>/swarm-progress.json on every state change:

{
  "epicIssue": 42,
  "epicBranch": "issue-42-swarm-mode",
  "status": "running|completed|failed|paused",
  "startedAt": "2026-02-05T10:00:00Z",
  "updatedAt": "2026-02-05T10:15:30Z",
  "dag": {
    "nodes": [
      {
        "issue": 101,
        "title": "Add settings schema",
        "status": "completed|in_progress|blocked|ready|failed",
        "agentPid": null,
        "logFile": "/path/to/agent-logs/101.log",
        "attempts": 1,
        "prNumber": 145,
        "startedAt": "...",
        "completedAt": "..."
      }
    ],
    "edges": [
      { "from": 101, "to": 103 }
    ]
  },
  "stats": {
    "total": 7,
    "completed": 4,
    "inProgress": 2,
    "failed": 0,
    "blocked": 1,
    "ready": 0
  },
  "failures": [
    { "issue": 105, "reason": "Agent exited with code 1", "attempts": 2 }
  ]
}

This file is the contract between the supervisor and the VS Code extension. The extension watches it with fs.watch() and renders the swarm state.
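
Since the extension reads this file while the supervisor writes it, one option is to write to a temporary file and rename it into place. The sketch below types the node entries, derives the stats block from them, and does that write. The names are hypothetical, the nullable fields are assumptions based on the example's agentPid: null, and the rename is only atomic when both paths are on the same filesystem.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type NodeStatus = "completed" | "in_progress" | "blocked" | "ready" | "failed";

interface SwarmNode {
  issue: number;
  title: string;
  status: NodeStatus;
  agentPid: number | null;
  logFile: string;
  attempts: number;
  prNumber: number | null; // assumed null until a PR exists
  startedAt: string | null; // assumed null until the task starts
  completedAt: string | null;
}

interface SwarmStats {
  total: number;
  completed: number;
  inProgress: number;
  failed: number;
  blocked: number;
  ready: number;
}

// Derive the "stats" block from the node list so the two cannot drift apart.
function computeStats(nodes: SwarmNode[]): SwarmStats {
  const count = (status: NodeStatus) => nodes.filter((n) => n.status === status).length;
  return {
    total: nodes.length,
    completed: count("completed"),
    inProgress: count("in_progress"),
    failed: count("failed"),
    blocked: count("blocked"),
    ready: count("ready"),
  };
}

// Location given above: ~/.config/iloom-ai/looms/<epic-loom-id>/swarm-progress.json
function progressFilePath(epicLoomId: string): string {
  return path.join(os.homedir(), ".config", "iloom-ai", "looms", epicLoomId, "swarm-progress.json");
}

// Write to a sibling temp file, then rename over the target, so a reader
// sees either the old file or the new one rather than a partial write.
function writeProgressAtomically(filePath: string, progress: unknown): void {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(progress, null, 2));
  fs.renameSync(tmpPath, filePath);
}
```

On the extension side, fs.watch() can fire more than once per write, so the reader may want to debounce and tolerate a parse error by keeping the last good state.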

Acceptance Criteria

  • Agent failures trigger claim release, worktree cleanup, and retry
  • Configurable retry count from settings (default 1)
  • Failed blocking tasks correctly leave downstream tasks blocked
  • Merge conflicts spawn resolver agent
  • Resolver agent rebases and retries merge
  • Conflict retries respect maxConflictRetries setting (default 3)
  • Supervisor can resume from crashed state (reads Beads + PID file)
  • Progress JSON file written on every state change
  • Terminal output shows clear, structured progress
  • Aggregate failure report at swarm completion
  • Unit tests for failure/retry state machine
  • Unit tests for resume logic with various Beads states
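
For the criterion that failed blocking tasks leave downstream tasks blocked, a sketch of the propagation over the DAG edges from the progress file. It assumes an edge { from, to } means that "to" depends on "from"; blockedByFailures is a hypothetical name.

```typescript
interface Edge { from: number; to: number }

// Return every task reachable from a failed task by following edges.
// Assumption: an edge { from, to } means "to" depends on "from".
function blockedByFailures(failed: number[], edges: Edge[]): Set<number> {
  // Index direct dependents by upstream issue number.
  const dependents = new Map<number, number[]>();
  for (const edge of edges) {
    const list = dependents.get(edge.from) ?? [];
    list.push(edge.to);
    dependents.set(edge.from, list);
  }

  // Breadth-first walk from each failed task.
  const blocked = new Set<number>();
  const queue = [...failed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const next of dependents.get(current) ?? []) {
      if (!blocked.has(next)) {
        blocked.add(next);
        queue.push(next);
      }
    }
  }
  return blocked;
}
```

Tasks outside the returned set are unaffected by the failure and can still be scheduled.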

Scope Boundaries

  • Does NOT modify the core supervisor loop structure (extends it)
  • Conflict resolution agent uses a simple prompt, not a full custom agent definition (can be enhanced later)

Dependencies
