feat: add backup agent failover for review jobs#283
Conversation
roborev: Combined Review
Verdict: One medium-severity concurrency/integrity issue remains; no High/Critical findings were reported.
Medium
Synthesized from 4 reviews (agents: codex, gemini | types: security, default)
Ah, thanks for looking at this! I was planning to do this eventually, since I was having to switch over to gemini manually after hitting the limit on my ChatGPT Pro plan, but failing over gracefully is much better.
roborev: Combined Review (
Thanks @nstrayer, I'm going to rebase this and take a review pass on it.
Add backup_agent configuration (per-workflow and default fallback) that allows review jobs to automatically retry with a different agent when the primary agent fails.

Backup agent is resolved and canonicalized at enqueue time, and FailoverJob atomically swaps the agent and requeues. Worker pool distinguishes agent errors (eligible for failover) from prompt/infra errors. FailJob, RetryJob, and FailoverJob are all scoped by worker_id to prevent stale workers from interfering with reclaimed jobs.

Co-Authored-By: Nick Strayer <nick.strayer@posit.co>
Remove backup_agent from the database schema and review_jobs model.

Instead of storing the backup agent per-job at enqueue time, the worker now resolves it from config.toml when failover is actually needed. This avoids a schema migration for an infrequent operation and keeps the backup agent decision current with config changes.

Changes:
- Remove BackupAgent from ReviewJob struct and EnqueueOpts
- Remove backup_agent column migration from db.go
- Remove backup_agent from all SQL (INSERT/SELECT/Scan)
- Refactor FailoverJob to accept backupAgent as a parameter
- Add resolveBackupAgent to WorkerPool (reads config at failover time)
- Remove backup agent resolution from server.go and ci_poller.go enqueue paths
- Remove enqueue-time canonicalization tests (now handled at failover)
- Update FailoverJob tests for new signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
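For readers following along, the worker-scoped swap-and-requeue that these commits describe for FailoverJob can be sketched roughly as below. This is an in-memory illustration only: the `Job` and `Store` types, the status strings, and the mutex standing in for a single atomic SQL UPDATE are all assumptions, not roborev's actual storage code.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a pared-down, hypothetical stand-in for a review job row.
type Job struct {
	ID       int64
	Agent    string
	Status   string // assumed values: "queued", "running", "failed"
	WorkerID string // worker that currently owns the job, "" if none
}

// Store is an in-memory stand-in for the jobs table. The mutex plays the
// role that a single atomic UPDATE statement would play in a database.
type Store struct {
	mu   sync.Mutex
	jobs map[int64]*Job
}

// FailoverJob swaps the agent and requeues the job in one step, but only
// if the job is still running under workerID. It reports whether the swap
// happened.
func (s *Store) FailoverJob(jobID int64, workerID, backupAgent string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	j, ok := s.jobs[jobID]
	if !ok || j.Status != "running" || j.WorkerID != workerID {
		return false // stale worker, or job no longer running: leave it alone
	}
	if backupAgent == "" || backupAgent == j.Agent {
		return false // nothing to fail over to
	}
	j.Agent = backupAgent
	j.Status = "queued"
	j.WorkerID = ""
	return true
}

func main() {
	s := &Store{jobs: map[int64]*Job{
		1: {ID: 1, Agent: "codex", Status: "running", WorkerID: "worker-b"},
	}}
	fmt.Println(s.FailoverJob(1, "worker-a", "gemini")) // stale worker
	fmt.Println(s.FailoverJob(1, "worker-b", "gemini")) // current owner
	fmt.Println(s.jobs[1].Agent, s.jobs[1].Status)
}
```

In this sketch a worker that lost the job (here `worker-a`) cannot touch it, while the current owner swaps the agent and puts the job back in the queue.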
The main feedback I have on this is that I want to pull the backup agent out of the database table and resolve it from the config at failover time. My assumption is that failover is going to be more exceptional for most users, and so there is some possibility of things like:
This strikes me as a bit contrived and won't really cause issues in practice. I'll push changes here shortly after I refine them a bit.
69580b7 to 1a7144f
Cover resolveBackupAgent behavior: no config, unknown agent, same as primary, workflow mapping (default/security), workflow mismatch, and default_backup_agent fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
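The cases listed in this commit suggest a resolution order along the following lines. This is a minimal sketch under stated assumptions: the `Config` shape, the `installed` lookup, and the workflow keys are hypothetical and do not mirror roborev's real types.

```go
package main

import "fmt"

// Config is a hypothetical, flattened view of the backup-agent settings.
type Config struct {
	DefaultBackupAgent string
	WorkflowBackup     map[string]string // assumed keys: "review", "security", "design"
}

// resolveBackupAgent returns the backup agent to use for a workflow, or ""
// when failover should not happen: no config, no backup configured, backup
// not installed (unknown), or backup identical to the primary.
func resolveBackupAgent(cfg *Config, installed map[string]bool, workflow, primary string) string {
	if cfg == nil {
		return ""
	}
	name := cfg.WorkflowBackup[workflow] // reading a nil map is safe in Go
	if name == "" {
		name = cfg.DefaultBackupAgent // fall back to the default backup
	}
	if name == "" || !installed[name] || name == primary {
		return ""
	}
	return name
}

func main() {
	cfg := &Config{
		DefaultBackupAgent: "gemini",
		WorkflowBackup:     map[string]string{"security": "copilot"},
	}
	installed := map[string]bool{"codex": true, "gemini": true, "copilot": true}

	fmt.Println(resolveBackupAgent(cfg, installed, "security", "codex")) // workflow mapping
	fmt.Println(resolveBackupAgent(cfg, installed, "review", "codex"))   // default fallback
	fmt.Println(resolveBackupAgent(cfg, installed, "review", "gemini"))  // same as primary
}
```

Returning an empty string for every "do not fail over" case keeps the caller's check to a single comparison.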
- FailoverJob now sets model = NULL so the backup agent uses its own default rather than inheriting a potentially incompatible model from the primary agent
- Set RepoPath to t.TempDir() in TestResolveBackupAgent to avoid reading .roborev.toml from the working directory
- Add design workflow test case for completeness
- Add "clears model on failover" test in TestFailoverJob

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thanks @nstrayer!
## Summary

- Detect hard quota-exhaustion errors from agents (e.g., Gemini free tier), skip retries, cool down the agent for the duration in the error message (or 30min default), and attempt failover to backup agent
- Jobs that fail due to quota get a `quota: ` error prefix; CI poller uses this to distinguish quota skips from real failures
- Quota-only batches get `success` commit status instead of `error`; PR comments show `skipped (quota)` instead of `failed`
- No schema changes: quota skips reuse the existing `failed` status with a convention-based error prefix

## Interaction with backup agent failover (#283)

This PR reuses the `default_backup_agent` failover from #283. On quota errors, the worker skips retries and immediately attempts failover to the backup agent. Both the quota path and the normal retry-exhaustion path check `isAgentCoolingDown(backupAgent)` before failing over, which prevents a bounce loop between two agents that are both exhausted.

## Test plan

- [x] `TestIsQuotaError`: matches hard quota patterns, rejects transient rate limits
- [x] `TestParseQuotaCooldown`: extracts Go durations, falls back correctly
- [x] `TestAgentCooldown`: set/check/expiry lifecycle
- [x] `TestFailOrRetryInner_QuotaSkipsRetries`: quota error skips retries, sets cooldown
- [x] `TestFailOrRetryInner_RetryExhaustedBackupInCooldown`: no bounce loop when backup in cooldown
- [x] `TestFailoverOrFail_FailsOverToBackup`: failover to backup on quota
- [x] `TestFailoverOrFail_NoBackupFailsWithQuotaPrefix`: quota prefix when no backup
- [x] `TestCIPollerPostBatchResults_*`: quota-only = success status, mixed = error
- [x] `TestFormatAllFailedComment_AllQuotaSkipped`: "Review Skipped" header
- [x] `TestBuildSynthesisPrompt_QuotaSkippedLabel`: `[SKIPPED]` label in synthesis
- [x] `go test ./...` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Human summary
I found myself hitting codex limits really quickly and then falling back to copilot when that happened. I wanted a way to automate this so I didn't have to keep switching the config.toml. This PR adds a `*backup_agent` setting as a fallback in case the review fails.
Coincidentally, while making this PR I ran out of codex and the fallback worked perfectly!
Summary
- `default_backup_agent`, `review_backup_agent`, etc.) with the same repo-overrides-global priority as primary agents

Changes
- `internal/config/config.go`: backup agent fields on `Config`/`RepoConfig`, `ResolveBackupAgentForWorkflow` resolution
- `internal/storage/jobs.go`: `FailoverJob(jobID, workerID, backupAgent)` atomically swaps agent and requeues
- `internal/daemon/worker.go`: `resolveBackupAgent` reads config at failover time, canonicalizes (verify installed, skip if same as primary), `failOrRetryAgent` path with failover split from `failOrRetry`

Test plan
- `go test ./...` passes
- `backup_agent`, trigger a review with a failing primary agent, confirm job retries with the backup
- `backup_agent` is unset (existing behavior unchanged)