
feat: add backup agent failover for review jobs #283

Merged

wesm merged 4 commits into roborev-dev:main from nstrayer:backup-review-agent on Feb 18, 2026

Conversation

nstrayer (Contributor) commented Feb 17, 2026

Human summary

I found myself hitting codex limits really quickly and then dropping back to copilot by hand. I wanted a way to automate this so I didn't have to keep switching config.toml. This PR adds a `*_backup_agent` setting as a fallback in case the review fails.

Coincidentally while making this PR I ran out of codex and the fallback worked perfectly!

Summary

  • Add per-workflow and default backup agent configuration (default_backup_agent, review_backup_agent, etc.) with the same repo-overrides-global priority as primary agents (see the config sketch after this list)
  • Implement failover in the worker pool: when an agent error exhausts retries, the worker resolves the backup agent from config and requeues the job with the new agent
  • Distinguish agent errors (eligible for failover) from non-agent errors (prompt build failures) in the worker's retry logic
  • No schema migration — backup agent is resolved from config at failover time, not stored per-job
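
For illustration, a minimal config.toml sketch using the keys named above. The agent names are placeholders, and the exact file layout is an assumption rather than the project's documented schema:

```toml
# Global config.toml; a repo-level .roborev.toml with the same keys overrides it.
default_backup_agent = "gemini"   # fallback for workflows without their own setting
review_backup_agent = "copilot"   # backup used specifically for review jobs
```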

Changes

  • internal/config/config.go — backup agent fields on Config/RepoConfig, ResolveBackupAgentForWorkflow resolution (sketched after this list)
  • internal/storage/jobs.go — FailoverJob(jobID, workerID, backupAgent) atomically swaps the agent and requeues
  • internal/daemon/worker.go — resolveBackupAgent reads config at failover time and canonicalizes it (verifies installed, skips if same as primary); failOrRetryAgent path with failover split from failOrRetry
  • Tests for config resolution, storage failover, worker-level backup agent resolution, and edge cases
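
A rough sketch of the resolution priority, assuming repo settings beat global ones and workflow-specific settings beat the default; the struct and field names below are illustrative, not the actual internal/config types:

```go
// backupSettings stands in for the backup-agent fields on Config/RepoConfig;
// the real field names, and the exact precedence between a repo-level default
// and a global workflow-specific setting, are assumptions.
type backupSettings struct {
	DefaultBackupAgent string            // default_backup_agent
	WorkflowBackups    map[string]string // review_backup_agent, security_backup_agent, ...
}

// resolveBackupAgentForWorkflow: repo settings override global ones, and a
// workflow-specific backup overrides the default. Returns "" if nothing is
// configured, in which case failover is simply skipped.
func resolveBackupAgentForWorkflow(global, repo backupSettings, workflow string) string {
	for _, s := range []backupSettings{repo, global} {
		if a := s.WorkflowBackups[workflow]; a != "" {
			return a
		}
		if s.DefaultBackupAgent != "" {
			return s.DefaultBackupAgent
		}
	}
	return ""
}
```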

Test plan

  • go test ./... passes
  • Verify failover works end-to-end: configure backup_agent, trigger a review with a failing primary agent, confirm job retries with the backup
  • Verify no failover when backup_agent is unset (existing behavior unchanged)
  • Verify second failover with same backup is a no-op (agent already swapped)

roborev-ci bot commented Feb 17, 2026

roborev: Combined Review

Verdict: One medium-severity concurrency/integrity issue remains; no High/Critical findings were reported.

Medium

  1. Missing worker ownership check in failover update
  • References: internal/storage/jobs.go:1129, internal/storage/db.go:1129, related call path internal/daemon/worker.go:467
  • Deduplicated from: Review 3 (codex/default), Review 4 (gemini/default), Review 2 (gemini/security; reported as Low there)
  • Issue: FailoverJob updates a running job using id + status='running' but does not verify worker_id. A stale/zombie worker can potentially fail over a job now owned by another worker, causing incorrect requeue/agent-swap behavior and duplicate/conflicting execution.
  • Recommended fix: Pass workerID into FailoverJob and require AND worker_id = ? in the UPDATE ... WHERE clause (and consider aligning similar state-transition paths like retry).
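
A sketch of the ownership-scoped UPDATE the finding recommends. The table name comes from the PR (review_jobs) and the status/worker_id columns from the finding; the return shape and remaining schema details are assumptions, not the actual implementation:

```go
package storage // sketch only; the real file is internal/storage/jobs.go

import "database/sql"

type Store struct{ db *sql.DB }

// FailoverJob swaps the agent and requeues the job, but only if the calling
// worker still owns it. Zero rows affected means another worker has reclaimed
// the job, so the caller should skip the failover.
func (s *Store) FailoverJob(jobID int64, workerID, backupAgent string) (bool, error) {
	res, err := s.db.Exec(`
		UPDATE review_jobs
		SET agent = ?, status = 'queued', worker_id = NULL
		WHERE id = ? AND status = 'running' AND worker_id = ?`,
		backupAgent, jobID, workerID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n > 0, nil
}
```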

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

@wesm
Copy link
Collaborator

wesm commented Feb 17, 2026

ah, thanks for looking at this! I was planning to do this eventually since I was having to manually switch over to gemini after limiting out my ChatGPT Pro plan, but failing over gracefully is much better.

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (6cee3c96)

Verdict: Mostly clean change set, but one Medium correctness/concurrency issue should be fixed before merge.

Medium

  1. Missing retry ownership check can allow stale-worker requeue
    • Files: internal/storage/jobs.go:1110, internal/daemon/worker.go:450
    • Finding: RetryJob updates using id + status='running' without validating worker_id ownership. A stale/zombie worker may requeue a job currently owned by another worker, risking duplicate execution or lost completion writes.
    • Suggested fix: Make retry ownership-aware (e.g., RetryJob(jobID, workerID, maxRetries) with AND worker_id = ? in the UPDATE) and update call sites accordingly.

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (93792070)

Verdict: 2 Medium-severity issues need fixes before merge; no High/Critical findings were reported.

Medium

  1. Backup agent canonicalization may silently choose unintended fallback

    • Refs: internal/daemon/server.go:551, internal/daemon/ci_poller.go:434
    • Backup canonicalization uses agent.GetAvailable(...), which can fall back to a different installed agent when the configured backup is unavailable. This conflicts with the intended behavior to clear unavailable backups.
    • Recommended fix: use strict resolution (agent.Get + agent.IsAvailable) and clear backup_agent when the configured backup is not explicitly available (a sketch follows after this list).
  2. Retry path lacks worker-ownership scoping (stale worker interference)

    • Refs: internal/daemon/worker.go:449, internal/storage/jobs.go:1117
    • failOrRetryInner calls RetryJob(job.ID, maxRetries) without worker_id ownership checks. A stale worker can requeue a job owned by another worker if status is still running.
    • Recommended fix: add optional worker_id scoping to RetryJob (matching FailJob/FailoverJob patterns) and pass workerID from worker flow.
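
For the first finding, a minimal sketch of strict resolution; the two lookups the reviewers name (agent.Get, agent.IsAvailable) are abstracted behind an interface here because their real signatures are not shown in this thread:

```go
// agentRegistry abstracts the two strict lookups named in the finding; the
// real agent package API may differ in shape.
type agentRegistry interface {
	Get(name string) (any, error) // strict lookup: errors instead of falling back
	IsAvailable(name string) bool
}

// canonicalizeBackup returns the backup agent to use, or "" to clear it.
// Unlike a GetAvailable-style lookup, it never substitutes a different
// installed agent for the one that was configured.
func canonicalizeBackup(reg agentRegistry, configured, primary string) string {
	if configured == "" || configured == primary {
		return ""
	}
	if _, err := reg.Get(configured); err != nil {
		return "" // unknown agent: clear rather than silently picking another
	}
	if !reg.IsAvailable(configured) {
		return "" // configured backup not installed/usable: clear it
	}
	return configured
}
```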

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (83b35426)

Verdict: No Medium, High, or Critical issues identified across the combined reviews.

Findings (Medium+)

None.

Notes

  • One reviewer flagged a Low-severity code-quality duplication in:
    • internal/daemon/ci_poller.go (435-447)
    • internal/daemon/server.go (556-568)
    This is below the requested reporting threshold and is omitted from findings.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (8ee8b96b)

Verdict: Mostly clean overall, with one Medium correctness risk to address before merge.

Medium

  • internal/storage/jobs.go:1104 and internal/storage/jobs.go:1136
    RetryJob and FailoverJob update job state without updating updated_at. Since sync export uses updated_at > synced_at, these transitions may be skipped, causing sync/state drift (for example, if the daemon exits before a later terminal-state write).
    Recommended fix: include updated_at = ? in both UPDATE statements (using current RFC3339 time), and add/extend sync tests to ensure retried/failovered jobs are returned by GetJobsToSync after prior sync.
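
Concretely, the ownership-scoped failover UPDATE sketched earlier would also bump updated_at, and RetryJob would get the same treatment; a fragment, with column names beyond those already mentioned remaining assumptions:

```go
// Sketch: the failover UPDATE extended with updated_at so sync export
// (updated_at > synced_at) sees retry/failover transitions. The bound value
// would be time.Now().UTC().Format(time.RFC3339), per the finding.
const failoverSQL = `
	UPDATE review_jobs
	SET agent = ?, status = 'queued', worker_id = NULL, updated_at = ?
	WHERE id = ? AND status = 'running' AND worker_id = ?`
```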

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (69580b78)

Verdict: No Medium/High/Critical issues found; code appears clean for merge.

Findings (Medium+)

None.

All reported concerns were Low severity (or none were raised) and are omitted per the reporting threshold.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

wesm (Collaborator) commented Feb 18, 2026

Thanks @nstrayer — I'm going to rebase this and take a review pass on it

wesm and others added 2 commits February 18, 2026 11:42
Add backup_agent configuration (per-workflow and default fallback) that
allows review jobs to automatically retry with a different agent when
the primary agent fails. Backup agent is resolved and canonicalized at
enqueue time, and FailoverJob atomically swaps the agent and requeues.

Worker pool distinguishes agent errors (eligible for failover) from
prompt/infra errors. FailJob, RetryJob, and FailoverJob are all scoped
by worker_id to prevent stale workers from interfering with reclaimed
jobs.

Co-Authored-By: Nick Strayer <nick.strayer@posit.co>

Remove backup_agent from the database schema and review_jobs model.
Instead of storing the backup agent per-job at enqueue time, the worker
now resolves it from config.toml when failover is actually needed.

This avoids a schema migration for an infrequent operation and keeps
the backup agent decision current with config changes.

Changes:
- Remove BackupAgent from ReviewJob struct and EnqueueOpts
- Remove backup_agent column migration from db.go
- Remove backup_agent from all SQL (INSERT/SELECT/Scan)
- Refactor FailoverJob to accept backupAgent as a parameter
- Add resolveBackupAgent to WorkerPool (reads config at failover time)
- Remove backup agent resolution from server.go and ci_poller.go enqueue paths
- Remove enqueue-time canonicalization tests (now handled at failover)
- Update FailoverJob tests for new signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

wesm (Collaborator) commented Feb 18, 2026

The main feedback I have on this is that I want to pull the backup agent out of the database table and resolve it from the config at failover time. My assumption is that failover is going to be exceptional for most users, so storing it per-job isn't worth a schema migration; the tradeoff is that something like this becomes possible:

  1. user enqueues a review
  2. user modifies the config, changing the failover agent
  3. a different failover agent is used for the review from step 1 than was intended at enqueue time

this strikes me as a bit contrived and won't really cause issues in practice. I'll push changes here shortly after I refine them a bit

wesm force-pushed the backup-review-agent branch from 69580b7 to 1a7144f on February 18, 2026 at 18:08
Cover resolveBackupAgent behavior: no config, unknown agent, same as
primary, workflow mapping (default/security), workflow mismatch, and
default_backup_agent fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (4d2ff290)

Verdict: One Medium correctness issue should be fixed before merge; no security issues were identified.

Medium

  1. Failover can keep an incompatible model when switching agents
    • Locations: internal/storage/jobs.go:1149, internal/daemon/worker.go:332
    • Issue: Failover updates job.agent but leaves job.model unchanged. If the original model is unsupported by the backup agent, retries can fail and negate failover behavior.
    • Suggested fix: On failover, also reset/re-resolve model for the backup agent (at minimum clear it so the backup agent default is used), and add a test covering primary→backup with incompatible model values.
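
Per the fix described in the commit message below, the same failover statement also clears the model; roughly, and assuming the earlier updated_at recommendation was kept:

```go
// Sketch: model = NULL makes the backup agent fall back to its own default
// model instead of inheriting one that only the primary agent understands.
const failoverSQL = `
	UPDATE review_jobs
	SET agent = ?, model = NULL, status = 'queued', worker_id = NULL, updated_at = ?
	WHERE id = ? AND status = 'running' AND worker_id = ?`
```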

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

- FailoverJob now sets model = NULL so the backup agent uses its own
  default rather than inheriting a potentially incompatible model from
  the primary agent
- Set RepoPath to t.TempDir() in TestResolveBackupAgent to avoid
  reading .roborev.toml from the working directory
- Add design workflow test case for completeness
- Add "clears model on failover" test in TestFailoverJob

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (1c51cf34)

Verdict: No Medium, High, or Critical issues found in this commit range.

Findings (Medium+)

None.


Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

wesm merged commit 9699138 into roborev-dev:main on Feb 18, 2026
8 checks passed

wesm (Collaborator) commented Feb 18, 2026

thanks @nstrayer!

wesm added a commit that referenced this pull request Feb 19, 2026
## Summary

- Detect hard quota-exhaustion errors from agents (e.g., Gemini free
tier), skip retries, cool down the agent for the duration in the error
message (or 30min default), and attempt failover to backup agent
- Jobs that fail due to quota get a `quota: ` error prefix; CI poller
uses this to distinguish quota skips from real failures
- Quota-only batches get `success` commit status instead of `error`; PR
comments show `skipped (quota)` instead of `failed`
- No schema changes — quota skips reuse the existing `failed` status
with a convention-based error prefix

## Interaction with backup agent failover (#283)

This PR reuses the `default_backup_agent` failover from #283. On quota
errors, the worker skips retries and immediately attempts failover to
the backup agent. Both the quota path and the normal retry-exhaustion
path check `isAgentCoolingDown(backupAgent)` before failing over, which
prevents a bounce loop between two agents that are both exhausted.
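
A condensed sketch of that flow; `isAgentCoolingDown` and the `quota: ` prefix come from the description above, while the other helper names are inferred from the test list below and their signatures are assumptions:

```go
// handleAgentFailure sketches the quota-aware path: hard quota errors skip
// retries, put the failing agent on cooldown, and fail over only when the
// backup agent is not itself cooling down.
func (p *WorkerPool) handleAgentFailure(job ReviewJob, agentErr error) {
	if !isQuotaError(agentErr) {
		p.failOrRetry(job, agentErr) // normal retry / retry-exhaustion path
		return
	}
	cooldown := parseQuotaCooldown(agentErr.Error(), 30*time.Minute) // 30min default
	p.setAgentCooldown(job.Agent, cooldown)

	if backup := p.resolveBackupAgent(job); backup != "" && !p.isAgentCoolingDown(backup) {
		p.failoverTo(job, backup) // requeue immediately with the backup agent
		return
	}
	// No usable backup: fail with the quota prefix so the CI poller reports
	// "skipped (quota)" rather than a hard failure.
	p.failJob(job, "quota: "+agentErr.Error())
}
```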

## Test plan

- [x] `TestIsQuotaError` — matches hard quota patterns, rejects
transient rate limits
- [x] `TestParseQuotaCooldown` — extracts Go durations, falls back
correctly
- [x] `TestAgentCooldown` — set/check/expiry lifecycle
- [x] `TestFailOrRetryInner_QuotaSkipsRetries` — quota error skips
retries, sets cooldown
- [x] `TestFailOrRetryInner_RetryExhaustedBackupInCooldown` — no bounce
loop when backup in cooldown
- [x] `TestFailoverOrFail_FailsOverToBackup` — failover to backup on
quota
- [x] `TestFailoverOrFail_NoBackupFailsWithQuotaPrefix` — quota prefix
when no backup
- [x] `TestCIPollerPostBatchResults_*` — quota-only = success status,
mixed = error
- [x] `TestFormatAllFailedComment_AllQuotaSkipped` — "Review Skipped"
header
- [x] `TestBuildSynthesisPrompt_QuotaSkippedLabel` — `[SKIPPED]` label
in synthesis
- [x] `go test ./...` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>