
feat: add backup agent failover for review jobs #283

Merged

wesm merged 4 commits into roborev-dev:main from nstrayer:backup-review-agent on Feb 18, 2026

Conversation

nstrayer (Contributor) commented Feb 17, 2026

Human summary

I found myself hitting codex limits really quickly and then dropping back to copilot by hand. I wanted a way to automate this so I didn't have to keep switching config.toml. This PR adds a `*_backup_agent` setting as a fallback in case the review fails.

Coincidentally while making this PR I ran out of codex and the fallback worked perfectly!

Summary

  • Add per-workflow and default backup agent configuration (default_backup_agent, review_backup_agent, etc.) with the same repo-overrides-global priority as primary agents (see the config sketch after this list)
  • Implement failover in the worker pool: when an agent error exhausts retries, the worker resolves the backup agent from config and requeues the job with the new agent
  • Distinguish agent errors (eligible for failover) from non-agent errors (prompt build failures) in the worker's retry logic
  • No schema migration — backup agent is resolved from config at failover time, not stored per-job
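
For illustration, a minimal config.toml sketch using the keys named above. The agent names are placeholders, and the exact file layout is an assumption rather than the project's documented schema:

```toml
# Global config.toml; a repo-level .roborev.toml with the same keys overrides it.
default_backup_agent = "gemini"   # fallback for workflows without their own setting
review_backup_agent = "copilot"   # backup used specifically for review jobs
```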

Changes

  • internal/config/config.go — backup agent fields on Config/RepoConfig, ResolveBackupAgentForWorkflow resolution (sketched after this list)
  • internal/storage/jobs.go — FailoverJob(jobID, workerID, backupAgent) atomically swaps the agent and requeues
  • internal/daemon/worker.go — resolveBackupAgent reads config at failover time and canonicalizes it (verifies installed, skips if same as primary); failOrRetryAgent path with failover split from failOrRetry
  • Tests for config resolution, storage failover, worker-level backup agent resolution, and edge cases
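
A rough sketch of the resolution priority, assuming repo settings beat global ones and workflow-specific settings beat the default; the struct and field names below are illustrative, not the actual internal/config types:

```go
// backupSettings stands in for the backup-agent fields on Config/RepoConfig;
// the real field names, and the exact precedence between a repo-level default
// and a global workflow-specific setting, are assumptions.
type backupSettings struct {
	DefaultBackupAgent string            // default_backup_agent
	WorkflowBackups    map[string]string // review_backup_agent, security_backup_agent, ...
}

// resolveBackupAgentForWorkflow: repo settings override global ones, and a
// workflow-specific backup overrides the default. Returns "" if nothing is
// configured, in which case failover is simply skipped.
func resolveBackupAgentForWorkflow(global, repo backupSettings, workflow string) string {
	for _, s := range []backupSettings{repo, global} {
		if a := s.WorkflowBackups[workflow]; a != "" {
			return a
		}
		if s.DefaultBackupAgent != "" {
			return s.DefaultBackupAgent
		}
	}
	return ""
}
```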

Test plan

  • go test ./... passes
  • Verify failover works end-to-end: configure backup_agent, trigger a review with a failing primary agent, confirm job retries with the backup
  • Verify no failover when backup_agent is unset (existing behavior unchanged)
  • Verify second failover with same backup is a no-op (agent already swapped)

roborev-ci bot commented Feb 17, 2026

roborev: Combined Review

Verdict: One medium-severity concurrency/integrity issue remains; no High/Critical findings were reported.

Medium

  1. Missing worker ownership check in failover update
  • References: internal/storage/jobs.go:1129, internal/storage/db.go:1129, related call path internal/daemon/worker.go:467
  • Deduplicated from: Review 3 (codex/default), Review 4 (gemini/default), Review 2 (gemini/security; reported as Low there)
  • Issue: FailoverJob updates a running job using id + status='running' but does not verify worker_id. A stale/zombie worker can potentially fail over a job now owned by another worker, causing incorrect requeue/agent-swap behavior and duplicate/conflicting execution.
  • Recommended fix: Pass workerID into FailoverJob and require AND worker_id = ? in the UPDATE ... WHERE clause (and consider aligning similar state-transition paths like retry).
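
A sketch of the ownership-scoped UPDATE the finding recommends. The table name comes from the PR (review_jobs) and the status/worker_id columns from the finding; the return shape and remaining schema details are assumptions, not the actual implementation:

```go
package storage // sketch only; the real file is internal/storage/jobs.go

import "database/sql"

type Store struct{ db *sql.DB }

// FailoverJob swaps the agent and requeues the job, but only if the calling
// worker still owns it. Zero rows affected means another worker has reclaimed
// the job, so the caller should skip the failover.
func (s *Store) FailoverJob(jobID int64, workerID, backupAgent string) (bool, error) {
	res, err := s.db.Exec(`
		UPDATE review_jobs
		SET agent = ?, status = 'queued', worker_id = NULL
		WHERE id = ? AND status = 'running' AND worker_id = ?`,
		backupAgent, jobID, workerID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n > 0, nil
}
```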

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

@wesm
Copy link
Collaborator

wesm commented Feb 17, 2026

ah, thanks for looking at this! I was planning to do this eventually since I was having to manually switch over to gemini after limiting out my ChatGPT Pro plan, but failing over gracefully is much better.

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (6cee3c96)

Verdict: Mostly clean change set, but one Medium correctness/concurrency issue should be fixed before merge.

Medium

  1. Missing retry ownership check can allow stale-worker requeue
    • Files: internal/storage/jobs.go:1110, internal/daemon/worker.go:450
    • Finding: RetryJob updates using id + status='running' without validating worker_id ownership. A stale/zombie worker may requeue a job currently owned by another worker, risking duplicate execution or lost completion writes.
    • Suggested fix: Make retry ownership-aware (e.g., RetryJob(jobID, workerID, maxRetries) with AND worker_id = ? in the UPDATE) and update call sites accordingly.

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (93792070)

Verdict: 2 Medium-severity issues need fixes before merge; no High/Critical findings were reported.

Medium

  1. Backup agent canonicalization may silently choose unintended fallback

    • Refs: internal/daemon/server.go:551, internal/daemon/ci_poller.go:434
    • Backup canonicalization uses agent.GetAvailable(...), which can fall back to a different installed agent when the configured backup is unavailable. This conflicts with the intended behavior to clear unavailable backups.
    • Recommended fix: use strict resolution (agent.Get + agent.IsAvailable) and clear backup_agent when the configured backup is not explicitly available (a sketch follows after this list).
  2. Retry path lacks worker-ownership scoping (stale worker interference)

    • Refs: internal/daemon/worker.go:449, internal/storage/jobs.go:1117
    • failOrRetryInner calls RetryJob(job.ID, maxRetries) without worker_id ownership checks. A stale worker can requeue a job owned by another worker if status is still running.
    • Recommended fix: add optional worker_id scoping to RetryJob (matching FailJob/FailoverJob patterns) and pass workerID from worker flow.
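
For the first finding, a minimal sketch of strict resolution; the two lookups the reviewers name (agent.Get, agent.IsAvailable) are abstracted behind an interface here because their real signatures are not shown in this thread:

```go
// agentRegistry abstracts the two strict lookups named in the finding; the
// real agent package API may differ in shape.
type agentRegistry interface {
	Get(name string) (any, error) // strict lookup: errors instead of falling back
	IsAvailable(name string) bool
}

// canonicalizeBackup returns the backup agent to use, or "" to clear it.
// Unlike a GetAvailable-style lookup, it never substitutes a different
// installed agent for the one that was configured.
func canonicalizeBackup(reg agentRegistry, configured, primary string) string {
	if configured == "" || configured == primary {
		return ""
	}
	if _, err := reg.Get(configured); err != nil {
		return "" // unknown agent: clear rather than silently picking another
	}
	if !reg.IsAvailable(configured) {
		return "" // configured backup not installed/usable: clear it
	}
	return configured
}
```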

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (83b35426)

Verdict: No Medium, High, or Critical issues identified across the combined reviews.

Findings (Medium+)

None.

Notes

  • One reviewer flagged a Low-severity code-quality duplication in:
    • internal/daemon/ci_poller.go (435-447)
    • internal/daemon/server.go (556-568)
    This is below the requested reporting threshold and is omitted from findings.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (8ee8b96b)

Verdict: Mostly clean overall, with one Medium correctness risk to address before merge.

Medium

  • internal/storage/jobs.go:1104 and internal/storage/jobs.go:1136
    RetryJob and FailoverJob update job state without updating updated_at. Since sync export uses updated_at > synced_at, these transitions may be skipped, causing sync/state drift (for example, if the daemon exits before a later terminal-state write).
    Recommended fix: include updated_at = ? in both UPDATE statements (using current RFC3339 time), and add/extend sync tests to ensure retried/failovered jobs are returned by GetJobsToSync after prior sync.
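
Concretely, the ownership-scoped failover UPDATE sketched earlier would also bump updated_at, and RetryJob would get the same treatment; a fragment, with column names beyond those already mentioned remaining assumptions:

```go
// Sketch: the failover UPDATE extended with updated_at so sync export
// (updated_at > synced_at) sees retry/failover transitions. The bound value
// would be time.Now().UTC().Format(time.RFC3339), per the finding.
const failoverSQL = `
	UPDATE review_jobs
	SET agent = ?, status = 'queued', worker_id = NULL, updated_at = ?
	WHERE id = ? AND status = 'running' AND worker_id = ?`
```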

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (69580b78)

Verdict: No Medium/High/Critical issues found; code appears clean for merge.

Findings (Medium+)

None.

All reported concerns were Low severity (or none were raised) and are omitted per the reporting threshold.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

wesm (Collaborator) commented Feb 18, 2026

Thanks @nstrayer — I'm going to rebase this and take a review pass on it

wesm and others added 2 commits February 18, 2026 11:42
Add backup_agent configuration (per-workflow and default fallback) that
allows review jobs to automatically retry with a different agent when
the primary agent fails. Backup agent is resolved and canonicalized at
enqueue time, and FailoverJob atomically swaps the agent and requeues.

Worker pool distinguishes agent errors (eligible for failover) from
prompt/infra errors. FailJob, RetryJob, and FailoverJob are all scoped
by worker_id to prevent stale workers from interfering with reclaimed
jobs.

Co-Authored-By: Nick Strayer <nick.strayer@posit.co>

Remove backup_agent from the database schema and review_jobs model.
Instead of storing the backup agent per-job at enqueue time, the worker
now resolves it from config.toml when failover is actually needed.

This avoids a schema migration for an infrequent operation and keeps
the backup agent decision current with config changes.

Changes:
- Remove BackupAgent from ReviewJob struct and EnqueueOpts
- Remove backup_agent column migration from db.go
- Remove backup_agent from all SQL (INSERT/SELECT/Scan)
- Refactor FailoverJob to accept backupAgent as a parameter
- Add resolveBackupAgent to WorkerPool (reads config at failover time)
- Remove backup agent resolution from server.go and ci_poller.go enqueue paths
- Remove enqueue-time canonicalization tests (now handled at failover)
- Update FailoverJob tests for new signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

wesm (Collaborator) commented Feb 18, 2026

The main feedback I have on this is that I want to pull the backup agent out of the database table and resolve it from the config at failover time. My assumption is that failover is going to be exceptional for most users, so storing it per-job isn't worth a schema migration; the tradeoff is that something like this becomes possible:

  1. user enqueues a review
  2. user modifies the config, changing the failover agent
  3. a different failover agent is used for the review from step 1 than was intended at enqueue time

this strikes me as a bit contrived and won't really cause issues in practice. I'll push changes here shortly after I refine them a bit

wesm force-pushed the backup-review-agent branch from 69580b7 to 1a7144f on February 18, 2026 at 18:08
Cover resolveBackupAgent behavior: no config, unknown agent, same as
primary, workflow mapping (default/security), workflow mismatch, and
default_backup_agent fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (4d2ff290)

Verdict: One Medium correctness issue should be fixed before merge; no security issues were identified.

Medium

  1. Failover can keep an incompatible model when switching agents
    • Locations: internal/storage/jobs.go:1149, internal/daemon/worker.go:332
    • Issue: Failover updates job.agent but leaves job.model unchanged. If the original model is unsupported by the backup agent, retries can fail and negate failover behavior.
    • Suggested fix: On failover, also reset/re-resolve model for the backup agent (at minimum clear it so the backup agent default is used), and add a test covering primary→backup with incompatible model values.
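
Per the fix described in the commit message below, the same failover statement also clears the model; roughly, and assuming the earlier updated_at recommendation was kept:

```go
// Sketch: model = NULL makes the backup agent fall back to its own default
// model instead of inheriting one that only the primary agent understands.
const failoverSQL = `
	UPDATE review_jobs
	SET agent = ?, model = NULL, status = 'queued', worker_id = NULL, updated_at = ?
	WHERE id = ? AND status = 'running' AND worker_id = ?`
```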

Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

- FailoverJob now sets model = NULL so the backup agent uses its own
  default rather than inheriting a potentially incompatible model from
  the primary agent
- Set RepoPath to t.TempDir() in TestResolveBackupAgent to avoid
  reading .roborev.toml from the working directory
- Add design workflow test case for completeness
- Add "clears model on failover" test in TestFailoverJob

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

roborev-ci bot commented Feb 18, 2026

roborev: Combined Review (1c51cf34)

Verdict: No Medium, High, or Critical issues found in this commit range.

Findings (Medium+)

None.


Synthesized from 4 reviews (agents: codex, gemini | types: security, default)

wesm merged commit 9699138 into roborev-dev:main on Feb 18, 2026
8 checks passed

wesm (Collaborator) commented Feb 18, 2026

thanks @nstrayer!

wesm added a commit that referenced this pull request Feb 19, 2026
## Summary

- Detect hard quota-exhaustion errors from agents (e.g., Gemini free
tier), skip retries, cool down the agent for the duration in the error
message (or 30min default), and attempt failover to backup agent
- Jobs that fail due to quota get a `quota: ` error prefix; CI poller
uses this to distinguish quota skips from real failures
- Quota-only batches get `success` commit status instead of `error`; PR
comments show `skipped (quota)` instead of `failed`
- No schema changes — quota skips reuse the existing `failed` status
with a convention-based error prefix

## Interaction with backup agent failover (#283)

This PR reuses the `default_backup_agent` failover from #283. On quota
errors, the worker skips retries and immediately attempts failover to
the backup agent. Both the quota path and the normal retry-exhaustion
path check `isAgentCoolingDown(backupAgent)` before failing over, which
prevents a bounce loop between two agents that are both exhausted.
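
A condensed sketch of that flow; `isAgentCoolingDown` and the `quota: ` prefix come from the description above, while the other helper names are inferred from the test list below and their signatures are assumptions:

```go
// handleAgentFailure sketches the quota-aware path: hard quota errors skip
// retries, put the failing agent on cooldown, and fail over only when the
// backup agent is not itself cooling down.
func (p *WorkerPool) handleAgentFailure(job ReviewJob, agentErr error) {
	if !isQuotaError(agentErr) {
		p.failOrRetry(job, agentErr) // normal retry / retry-exhaustion path
		return
	}
	cooldown := parseQuotaCooldown(agentErr.Error(), 30*time.Minute) // 30min default
	p.setAgentCooldown(job.Agent, cooldown)

	if backup := p.resolveBackupAgent(job); backup != "" && !p.isAgentCoolingDown(backup) {
		p.failoverTo(job, backup) // requeue immediately with the backup agent
		return
	}
	// No usable backup: fail with the quota prefix so the CI poller reports
	// "skipped (quota)" rather than a hard failure.
	p.failJob(job, "quota: "+agentErr.Error())
}
```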

## Test plan

- [x] `TestIsQuotaError` — matches hard quota patterns, rejects
transient rate limits
- [x] `TestParseQuotaCooldown` — extracts Go durations, falls back
correctly
- [x] `TestAgentCooldown` — set/check/expiry lifecycle
- [x] `TestFailOrRetryInner_QuotaSkipsRetries` — quota error skips
retries, sets cooldown
- [x] `TestFailOrRetryInner_RetryExhaustedBackupInCooldown` — no bounce
loop when backup in cooldown
- [x] `TestFailoverOrFail_FailsOverToBackup` — failover to backup on
quota
- [x] `TestFailoverOrFail_NoBackupFailsWithQuotaPrefix` — quota prefix
when no backup
- [x] `TestCIPollerPostBatchResults_*` — quota-only = success status,
mixed = error
- [x] `TestFormatAllFailedComment_AllQuotaSkipped` — "Review Skipped"
header
- [x] `TestBuildSynthesisPrompt_QuotaSkippedLabel` — `[SKIPPED]` label
in synthesis
- [x] `go test ./...` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>