Skip to content

Conversation

@graphite-app
Copy link
Contributor

@graphite-app graphite-app bot commented Jan 6, 2026

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

@vercel
Copy link

vercel bot commented Jan 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivetkit-serverless Error Error Jan 6, 2026 2:32am
3 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 6, 2026 2:32am
rivet-inspector Ignored Ignored Preview Jan 6, 2026 2:32am
rivet-site Ignored Ignored Preview Jan 6, 2026 2:32am

@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the gtmq_spec_13a265_1767666690523-8592c0fd-c57a-4ead-9ac3-d7437ffb1708 branch January 6, 2026 02:33
@claude
Copy link

claude bot commented Jan 6, 2026

Pull Request Review - Graphite Merge Queue

This PR combines 8 individual PRs into a merge queue. Reviewed the complete changeset (146 files, +5170/-906 lines).

Overview

Well-structured merge introducing significant improvements to error handling, actor lifecycle management, and developer experience in RivetKit.

Key Features

1. Actor Error Reporting System

Sophisticated two-layer error tracking:

  • Direct actor errors (workflow state)
  • Pool errors (runner_pool_error_tracker workflow)
  • Error enrichment via API queries

Quality: Excellent

  • Well-documented in docs/engine/ACTOR_ERRORS.md
  • Hysteresis mechanism (3 consecutive successes)
  • Comprehensive test coverage (1019 lines)

2. Guard Fail-Fast Mechanism

Guard detects serverless pool errors within ~3s instead of full timeout.

Concern: Hardcoded timing values at guard/src/routing/pegboard_gateway.rs:56
Recommendation: Extract as config values

3. Drain on Version Upgrade

New drain_on_version_upgrade flag auto-drains old runners.
Quality: Strong implementation with 464 lines of tests

4. RivetKit TypeScript Improvements

  • Context type specialization
  • RIVET_EXPOSE_ERRORS env var
  • Fixed c.client origin handling

Code Quality

Strengths

  1. Excellent documentation with diagrams
  2. Comprehensive testing (1,483 new test lines)
  3. Consistent error handling patterns
  4. Proper database transactions and pagination

Areas for Improvement

  1. Hardcoded Values Should Be Config

    • Guard fail-fast delays (1s initial, 2s polling)
    • Backfill batch size (1024)
    • Transaction timeout (2500ms)
  2. Cache TTL Too Short

    • Pool error cache: 500ms at runner_config/get_error.rs:32
    • Recommendation: Increase to 2-5s
  3. Missing Metric Documentation

    • RUNNER_VERSION_UPGRADE_DRAIN needs doc comment
  4. Test Fragility

    • Fixed sleeps may be flaky in slow CI
    • Use exponential backoff polling

Security: No Critical Issues

  • Error sanitization: Proper
  • Input validation: DNS subdomain checks
  • Authorization: Token validation maintained

Minor: Test panic with full response body could leak data in CI logs

Performance

Positive:

  • Batch processing, pagination, early timeouts, caching

Concerns:

  • Workflow queries per request
  • Individual exists checks in backfill

Test Coverage: Excellent

actors_scheduling_errors.rs (1,019 lines):

  • HTTP errors, connection refused, invalid payloads
  • Runner disconnect scenarios
  • Guard fail-fast timing
  • Crash policies

runner_drain_on_version.rs (464 lines):

  • Various drain scenarios

Missing:

  • Concurrent backfill runs
  • Error tracker state persistence
  • Large error message truncation

Breaking Changes

API Schema (backward compatible):

  • Added error to Actor
  • Added drain_on_version_upgrade to RunnerConfig
  • Added runner_pool_error to RunnerConfigResponse

Package Removal:

  • rivet-serverless-backfill (internal only)

Recommendations

Must Fix: None - production-ready

Should Fix Soon:

  1. Extract hardcoded timings to config
  2. Increase cache TTL to 2-5s
  3. Add metric documentation

Nice to Have:

  • Improve test robustness
  • Add missing test cases
  • Consider batch exists checks

Final Verdict: Approve with Minor Suggestions

High-quality work significantly improving actor error handling. Well-designed architecture, comprehensively tested, properly documented.

Strengths: Architecture, tests, error handling, performance
Minor Weaknesses: Hardcoded values, short cache TTL, test fragility

None blocking. Suggestions can be addressed post-merge.

Merge risk: Low
Production impact: Positive (better error visibility, faster failure detection)


Review by Claude Code - 146 files across 8 PRs analyzed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants