
Move review and prompt content to its own database table #229

@wesm

Description


My review database is becoming large (over 5000 reviews, 177 megabytes), so I am thinking about restructuring things so that the metadata tables do not hold the prompt and review content inline, which should improve SQLite scan performance.

Schema Refactoring: Separate Large Content from Metadata Tables

Problem

The review_jobs and reviews tables store large text content (prompts,
diffs, review output) inline alongside small metadata columns. As the database
grows (currently 177MB), this has several consequences:

  • ListJobs pulls rv.output for every row just to compute verdicts at
    query time via ParseVerdict(). Review output can be 10-100KB+.
  • ListJobs pulls j.prompt for every row. The TUI queue view only uses
    it to check prompt != "" for running jobs (to decide if the prompt view is
    available). Prompts can be 5-20KB+.
  • SQLite B-tree leaf pages carry per-value overhead even when large values
    overflow to separate pages. Scanning the main table touches more pages than
    necessary for metadata-only queries.
  • PostgreSQL TOAST detoasting has real CPU overhead, and wide rows hurt HOT
    updates and index-only scans.
  • VACUUM effectiveness is reduced when large and small data share the same
    table structure.

Proposed Schema Changes

New tables

-- Large content from review_jobs
CREATE TABLE job_content (
  job_id INTEGER PRIMARY KEY REFERENCES review_jobs(id),
  prompt TEXT,
  diff_content TEXT,
  output_prefix TEXT
);

-- Large content from reviews
CREATE TABLE review_content (
  review_id INTEGER PRIMARY KEY REFERENCES reviews(id),
  prompt TEXT NOT NULL,
  output TEXT NOT NULL
);

Columns to remove

  • review_jobs: drop prompt, diff_content, output_prefix
  • reviews: drop prompt, output

New column on reviews

ALTER TABLE reviews ADD COLUMN verdict TEXT;  -- 'P' or 'F', computed at write time

Storing the verdict at review creation time (in CompleteJob) eliminates the
need to fetch rv.output in ListJobs entirely.

Query Impact Analysis

review_jobs content columns

| Function | File | Read/Write | Columns | Migration impact |
| --- | --- | --- | --- | --- |
| EnqueueJob | jobs.go:794 | Write | prompt, diff_content, output_prefix | Write to job_content instead |
| SaveJobPrompt | jobs.go:927 | Write | prompt | Update job_content.prompt |
| ClaimJob | jobs.go:880 | Read | prompt, diff_content | JOIN job_content (worker needs both) |
| CompleteJob | jobs.go:962 | Read | output_prefix | JOIN job_content (needs prefix) |
| ListJobs | jobs.go:1154 | Read | prompt | Stop selecting it. Add has_prompt bool column or use EXISTS subquery. TUI only checks prompt != "" for running jobs. |

reviews content columns

| Function | File | Read/Write | Columns | Migration impact |
| --- | --- | --- | --- | --- |
| CompleteJob | jobs.go:990 | Write | prompt, output | Write to review_content instead. Also compute and store reviews.verdict. |
| GetReviewByJobID | reviews.go:20 | Read | prompt, output | JOIN review_content (detail view needs both) |
| GetReviewByCommitSHA | reviews.go:98 | Read | prompt, output | JOIN review_content |
| GetAllReviewsForGitRef | reviews.go:169 | Read | prompt, output | JOIN review_content (synthesis context) |
| GetRecentReviewsForRepo | reviews.go:199 | Read | prompt, output | JOIN review_content |
| GetReviewByID | reviews.go:280 | Read | prompt, output | JOIN review_content |
| GetBatchReviews | ci.go:227 | Read | output | JOIN review_content |
| ListJobs | jobs.go:1154 | Read | output (via rv JOIN) | Stop selecting it. Read reviews.verdict directly instead. |

Key optimization: ListJobs after migration

The ListJobs query becomes a pure metadata scan:

SELECT j.id, j.repo_id, j.commit_id, j.git_ref, j.branch, j.agent,
       j.reasoning, j.status, j.enqueued_at, j.started_at, j.finished_at,
       j.worker_id, j.error, j.has_prompt, j.retry_count,
       COALESCE(j.agentic, 0), r.root_path, r.name, c.subject,
       rv.addressed, rv.verdict,
       j.source_machine_id, j.uuid, j.model, j.job_type, j.review_type
FROM review_jobs j
JOIN repos r ON r.id = j.repo_id
LEFT JOIN commits c ON c.id = j.commit_id
LEFT JOIN reviews rv ON rv.job_id = j.id

No large text columns touched. Verdict is pre-computed. has_prompt replaces
the full prompt text for the "is prompt viewable?" check.

Migration Strategy

SQLite

roborev statically links a modern SQLite (3.35+), so ALTER TABLE DROP COLUMN
is available.

  1. Begin transaction
  2. Create job_content and review_content tables
  3. INSERT INTO job_content SELECT id, prompt, diff_content, output_prefix FROM review_jobs
  4. INSERT INTO review_content SELECT rv.id, rv.prompt, rv.output FROM reviews rv
  5. Add has_prompt column: ALTER TABLE review_jobs ADD COLUMN has_prompt INTEGER NOT NULL DEFAULT 0
  6. UPDATE review_jobs SET has_prompt = 1 WHERE prompt IS NOT NULL AND prompt != ''
  7. Add verdict column: ALTER TABLE reviews ADD COLUMN verdict TEXT
  8. Compute and store verdicts (requires application-level logic since
    ParseVerdict is in Go -- iterate rows in Go and batch update)
  9. ALTER TABLE review_jobs DROP COLUMN prompt
  10. ALTER TABLE review_jobs DROP COLUMN diff_content
  11. ALTER TABLE review_jobs DROP COLUMN output_prefix
  12. ALTER TABLE reviews DROP COLUMN prompt
  13. ALTER TABLE reviews DROP COLUMN output
  14. Commit
  15. VACUUM to reclaim space (outside transaction)

The VACUUM after dropping the large columns is the big win -- it rewrites the
entire database, compacting the main tables to just metadata.

PostgreSQL

Postgres can ALTER TABLE DROP COLUMN natively, but dropped columns leave
dead space until VACUUM FULL or a table rewrite.

  1. Begin transaction
  2. Create job_content and review_content tables
  3. INSERT INTO ... SELECT for both
  4. Add has_prompt and verdict columns
  5. Populate has_prompt and verdict
  6. ALTER TABLE review_jobs DROP COLUMN prompt, DROP COLUMN diff_content, DROP COLUMN output_prefix
  7. ALTER TABLE reviews DROP COLUMN prompt, DROP COLUMN output
  8. Commit
  9. VACUUM FULL review_jobs; VACUUM FULL reviews; (outside transaction, requires exclusive lock)

Schema version

Bump the schema version for both backends. The migration should be
backward-compatible in the sense that old code will fail gracefully (missing
columns) rather than silently corrupt data.

Sync Considerations

The sync system (internal/storage/sync.go) serializes jobs and reviews for
cross-machine sync. The sync format will need to include content from the new
tables. This should be handled by JOINing the content tables in the sync export
queries and writing to both tables on sync import.

Schema version gating

EnsureSchema in postgres.go hard-fails when the database schema version is
newer than the client's pgSchemaVersion:

return fmt.Errorf("database schema version %d is newer than supported version %d", ...)

This means all sync clients must be upgraded before the Postgres migration runs, or they will stop syncing entirely. Options:

  1. Upgrade-first: Require all machines to upgrade roborev before any one of
    them runs the Postgres migration. The new code can handle both old and new
    schemas (check for content tables, fall back to inline columns if missing).
  2. Two-phase migration: First release adds the new tables and starts
    writing to both locations (old columns + new tables) without bumping the
    schema version or dropping columns. Second release bumps the version, drops
    the old columns, and stops dual-writing.

Rolling upgrade strategy (recommended)

The existing schema version gate provides clean rolling upgrade behavior:

  1. First client upgrades: Runs Postgres migration (creates content tables,
    drops old columns, bumps schema version). Also runs local SQLite migration.
  2. Other clients (old version): EnsureSchema fails with version mismatch.
    The sync worker closes the connection and retries periodically. The daemon
    continues working locally -- reviews still run, the TUI still works, only
    sync is paused.
  3. Other clients upgrade at their own pace: Each runs its local SQLite
    migration on startup. On next sync attempt, EnsureSchema succeeds
    (version now matches), sync resumes.

This requires no two-phase migration or dual-write complexity. The schema
version gate already provides the right semantics.

Pre-work: Improve the EnsureSchema error path to log a user-friendly
message: "Sync: upgrade roborev to resume syncing (database schema v%d, client supports v%d)" instead of the current developer-facing error format.
This could ship in any release before the migration.

TUI Changes

  • ReviewJob.Prompt field becomes empty in ListJobs results. Add
    HasPrompt bool field to the struct. Update TUI code that checks
    job.Prompt != "" to check job.HasPrompt instead.
  • For running jobs where the TUI displays the prompt directly (before a review
    exists), fetch from job_content on demand (same as fetching a review).
  • ReviewJob.Verdict is already a *string field populated from ListJobs --
    just read it from reviews.verdict instead of computing from rv.output.

Risks and Rollback

  • Data integrity: The migration copies data before dropping columns. If
    the migration fails partway, the transaction rolls back and nothing changes.
  • Rollback path: Keep the old schema version code available for one release
    cycle. A downgrade migration could re-merge the content tables back (slow but
    safe).
  • Performance during migration: VACUUM can be slow on large databases.
    The 177MB database should VACUUM in seconds, but larger deployments may take
    longer. Consider warning the user or running VACUUM in the background.

Scope

This is a standalone change that touches:

  • internal/storage/db.go (schema, migrations)
  • internal/storage/jobs.go (EnqueueJob, SaveJobPrompt, ClaimJob, CompleteJob, ListJobs)
  • internal/storage/reviews.go (all Get* functions)
  • internal/storage/ci.go (GetBatchReviews)
  • internal/storage/sync.go (export/import)
  • internal/storage/postgres.go (Postgres equivalents of all the above)
  • internal/storage/models.go (HasPrompt field)
  • cmd/roborev/tui.go (HasPrompt checks, verdict from DB)
  • internal/daemon/server.go (API responses, if prompt/output are returned)

Estimated at 15-25 files touched depending on test coverage.
