feat(engine): return error for pending actors #3730

NathanFlurry · 2026-01-01T23:25:06Z

No description provided.

vercel · 2026-01-01T23:25:11Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
rivetkit-serverless	Error		Jan 6, 2026 2:31am

3 Skipped Deployments

Project	Deployment	Review	Updated (UTC)
rivet-cloud	Ignored		Jan 6, 2026 2:31am
rivet-inspector	Ignored	Preview	Jan 6, 2026 2:31am
rivet-site	Ignored	Preview	Jan 6, 2026 2:31am

NathanFlurry · 2026-01-01T23:25:14Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

feat(engine): return error for pending actors #3730: 2 dependent PRs (#3675, #3740) 👈
chore(gas): add support for find_workflows #3739
chore(docs): document abort signal #3673
chore(rivetkit): clean up logging configuration #3672
chore(rivetkit): RIVET_EXPOSE_ERRORS #3671
chore(rivetkit): specialize context types for better compiler errors #3669
fix(rivetkit): fix c.client not respecting correct origin #3668
fix(rivetkit): allow providing no args to createRivetKit #3667
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

engine/packages/pegboard/src/ops/runner_config/get_error.rs

engine/packages/pegboard/src/workflows/actor/mod.rs

engine/packages/pegboard/src/workflows/actor/setup.rs

engine/packages/pegboard/src/workflows/serverless/conn.rs

engine/packages/pegboard/src/workflows/runner_pool_error_tracker.rs

engine/packages/pegboard/src/workflows/serverless/conn.rs

engine/packages/pegboard/src/workflows/actor/mod.rs

engine/packages/bootstrap/src/lib.rs

engine/packages/pegboard/src/workflows/runner2.rs

engine/packages/pegboard/src/workflows/actor_runner_name_selector_backfill.rs

graphite-app · 2026-01-06T02:19:58Z

Merge activity

Jan 6, 2:19 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Jan 6, 2:20 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3667.
Jan 6, 2:20 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3667.
Jan 6, 2:21 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Jan 6, 2:22 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3667.
Jan 6, 2:22 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3667.
Jan 6, 2:31 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Jan 6, 2:31 AM UTC: CI is running for this pull request on a draft pull request (#3754) due to your merge queue CI optimization settings.
Jan 6, 2:32 AM UTC: Merged by the Graphite merge queue via draft PR: #3754.

claude · 2026-01-06T02:25:03Z

PR Review: Actor Error Handling Enhancement

This is a substantial and well-architected feature that adds comprehensive error tracking and reporting for Rivet Actors. The implementation demonstrates strong systems design with clear separation of concerns.

Architecture Overview ✅

The two-layer error architecture is well-designed:

Direct actor errors tracked in workflow state
Runner pool errors tracked separately via dedicated workflow
Clean separation enables independent querying and enrichment

The documentation in ACTOR_ERRORS.md is excellent and should help future maintainers understand the system.

Code Quality Observations

Strengths 💪

Excellent Documentation: The ACTOR_ERRORS.md file with mermaid diagram clearly explains the architecture
Proper Error Propagation: The RunnerPoolError enum provides detailed, actionable error information
Hysteresis Implementation: The consecutive success counter (config: runner_pool_consecutive_successes_to_clear_error) prevents error flapping
Guard Fail-Fast: The guard polling mechanism (pegboard_gateway.rs:230-259) prevents unnecessary timeouts
Comprehensive Testing: actors_scheduling_errors.rs includes substantial test coverage with mock serverless endpoints
Cache Strategy: Short TTL (500ms) in get_error.rs:32 is appropriate for error state
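The hysteresis behavior noted above can be illustrated with a minimal sketch. This is not the code from this PR: PoolErrorState, record_failure, and record_success are hypothetical names, and the successes_to_clear parameter stands in for the runner_pool_consecutive_successes_to_clear_error config value.

```rust
/// Hypothetical error state for one runner pool (illustrative only).
#[derive(Debug, Default)]
struct PoolErrorState {
    /// The currently reported error, if any.
    active_error: Option<String>,
    /// Successes observed since the last failure.
    consecutive_successes: u32,
}

impl PoolErrorState {
    /// A failure sets (or replaces) the active error and resets the counter.
    fn record_failure(&mut self, message: &str) {
        self.active_error = Some(message.to_string());
        self.consecutive_successes = 0;
    }

    /// A success only clears the error once enough successes have
    /// arrived in a row. Returns true if this call cleared the error.
    fn record_success(&mut self, successes_to_clear: u32) -> bool {
        if self.active_error.is_none() {
            return false;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= successes_to_clear {
            self.active_error = None;
            self.consecutive_successes = 0;
            true
        } else {
            false
        }
    }
}
```

Because any failure resets the counter, a pool that alternates between success and failure keeps reporting its error instead of flapping between states.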

Issues & Concerns 🚨

1. Critical: Drain Signal Not Checked in HTTP Error Path ⚠️

In serverless/conn.rs:365-379, when an HTTP error occurs, the error is reported but the function returns early with bail! without checking for drain signal. This could leave connections in an inconsistent state.

Location: engine/packages/pegboard/src/workflows/serverless/conn.rs:365-379

Impact: If a drain signal arrives during error handling, it won't be processed, potentially preventing graceful shutdown.

Recommendation: Follow the pattern used in other error paths - return an error result that's handled by the outer loop which checks for drain signals.
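A rough sketch of the recommended shape, assuming a channel-based drain signal. It does not reproduce conn.rs: Outcome, run_with_drain_check, and the attempt closure are hypothetical, and std::sync::mpsc is used only as a stand-in for the workflow signal mechanism.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical result of the outer connection loop.
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    Drained,
    Failed(String),
}

/// Runs `attempt` up to `max_attempts` times. A failed attempt returns its
/// error to this loop instead of bailing, so the drain signal is always
/// checked before the next attempt.
fn run_with_drain_check<F>(drain: &Receiver<()>, max_attempts: u32, mut attempt: F) -> Outcome
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for n in 0..max_attempts {
        match attempt(n) {
            Ok(()) => return Outcome::Completed,
            Err(e) => last_err = e,
        }
        // The error path falls through to here rather than returning early.
        match drain.try_recv() {
            Ok(()) => return Outcome::Drained,
            // No drain requested (or the sender is gone): keep going.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
        }
    }
    Outcome::Failed(last_err)
}
```

The point of the design is that there is a single place where drain is observed, so no error branch can skip it.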

2. TypeScript Error Type Mismatch

The TypeScript types in rivetkit-typescript/packages/rivetkit/src/client/errors.ts don't fully align with the Rust API types.

Location: rivetkit-typescript/packages/rivetkit/src/client/errors.ts:50-58

Issue: The TypeScript type uses snake_case variant names that differ from those in the OpenAPI schema:

http_error vs serverless_http_error
stream_ended_early vs serverless_stream_ended_early

Impact: Runtime deserialization errors when TypeScript clients receive error responses.

Recommendation: Update TypeScript types to match the exact snake_case serialization from Rust. Consider generating TypeScript types from OpenAPI schema to prevent drift.

3. Guard Polling Timing Constants Hardcoded

In guard/src/routing/pegboard_gateway.rs:14-17, timing constants are hardcoded and not configurable.

Recommendation: Consider adding these to the pegboard config similar to runner_pool_consecutive_successes_to_clear_error. Mark as experimental initially.

4. Potential Transaction Overhead in Error Tracker

The runner_pool_error_tracker workflow uses commit_interval(1) and listens to up to 256 signals at once.

Location: pegboard/src/workflows/runner_pool_error_tracker.rs:42-45

Issue: Processing 256 signals with commit_interval(1) could create significant transaction overhead for high-traffic runner pools.

Recommendation:

Consider batching commits (e.g., commit_interval(10))
Add metrics to monitor transaction sizes and processing time
Document the trade-off between freshness and transaction overhead

5. Missing Validation: Empty Runner List

In runner_config/get_error.rs:24-26, the early return for an empty runner list is not logged.

Recommendation: Add trace-level logging when returning early for empty input.

6. Inconsistent Error Handling for Missing Workflows

In actor/util.rs:87-100, when the error tracker workflow isn't found, it's silently skipped.

Impact: Could hide issues with error tracker workflow creation.

Recommendation: Add debug-level logging when error tracker workflow doesn't exist for a runner that needs it.

Best Practices & Adherence to CLAUDE.md

✅ Followed Correctly:

Hard tabs used consistently
Proper error types with RivetError derive
Workspace dependencies used correctly
Structured logging with field syntax (e.g., tracing::debug!(?actor_id, "message"))
Database keys properly namespaced
Timestamps as i64 milliseconds with _ts suffix
Custom errors in guard/src/errors.rs follow the pattern

⚠️ Attention Needed:

Consider whether ACTOR_ERRORS.md should reference "Rivet Actor" vs "actor" per CLAUDE.md guidelines

Security Considerations

✅ Good:

Error messages properly sanitized (body truncated to 512 chars in conn.rs:366)
No credential leakage in error reporting
Token validation noted as "implemented in EE" with clear comment

⚠️ Minor Concerns:

HTTP response bodies included in errors could potentially expose sensitive serverless endpoint information. Consider adding a config flag to redact response bodies in production.
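A minimal sketch of what truncation plus an optional redaction flag could look like. This is an assumption-laden illustration, not the code in conn.rs: sanitize_body and the redact flag are hypothetical, and it assumes the 512 limit counts characters rather than bytes.

```rust
/// Limit taken from the review text above; counted in characters here (assumption).
const MAX_BODY_CHARS: usize = 512;

/// Hypothetical helper: returns a version of an HTTP response body that is
/// safe to attach to an error report.
fn sanitize_body(body: &str, redact: bool) -> String {
    if redact {
        // Keep only the size so operators can still tell a body was present.
        return format!("<redacted {} bytes>", body.len());
    }
    // Iterating over chars avoids slicing in the middle of a UTF-8 sequence.
    body.chars().take(MAX_BODY_CHARS).collect()
}
```

Truncating by characters rather than by byte index matters for non-ASCII bodies, where slicing at a fixed byte offset can panic.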

Performance Considerations

Potential Issues:

Guard Polling: Every pending request polls the error tracker every 2 seconds. With many concurrent requests, this could create load. Mitigation: the 500ms cache TTL in get_error.rs helps, but consider adding a circuit breaker if polling becomes excessive.
N+1 Query Pattern: In runner_configs.rs:87-96, fetching pool errors is properly batched. ✅
Workflow Query Cost: Finding workflows by input in get_error.rs:53-65 requires scanning workflow metadata. Monitor performance as the number of runner configs grows.
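To show why a short TTL bounds the polling load described above, here is a small sketch of a TTL cache. It is illustrative only: TtlCache and get_or_fetch are hypothetical names, not the cache used in get_error.rs, and the caller passes in the current time so the behavior is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache that serves a stored value until it is older than `ttl`.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        TtlCache { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value if it is still fresh at `now`;
    /// otherwise calls `fetch`, stores the result, and returns it.
    fn get_or_fetch(&mut self, key: &str, now: Instant, fetch: impl FnOnce() -> V) -> V {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entries.insert(key.to_string(), (now, value.clone()));
        value
    }
}
```

With a 500ms TTL, any number of pollers asking about the same key inside that window triggers at most one underlying lookup, which is the mitigation the review refers to.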

Testing Coverage

✅ Excellent:

actors_scheduling_errors.rs has 1019 new lines of comprehensive tests
Mock serverless endpoints simulate various failure scenarios
Tests cover HTTP errors, stream ending early, and error clearing after successes

Missing Test Cases:

Guard fail-fast behavior when error tracker returns active error
Hysteresis behavior (consecutive success counter)
Concurrent error reporting from multiple serverless connections
Error tracker workflow lifecycle (creation, shutdown)

Recommendation: Add integration tests for guard fail-fast and hysteresis.

API Changes

Breaking Changes:

✅ None - New error and runner_pool_error fields are optional, maintaining backward compatibility.

Deprecations:

None identified.

Migration & Deployment

Backfills:

The PR includes two backfills properly orchestrated in bootstrap/src/backfill.rs:

actor_runner_name_selector_backfill - Migrates existing actor data
serverless_backfill - Sets up error trackers for serverless configs

✅ Both check for completion before running, preventing duplicate execution.

Deployment Order:

The backfills in bootstrap ensure safe rolling deployment. No special deployment order required.

Recommendations Summary

Critical (Must Fix):

Fix HTTP error path in serverless/conn.rs:365-379 to properly handle drain signals
Fix TypeScript type mismatches in errors.ts to match Rust serialization

High Priority:

Add logging for empty runner lists and missing error tracker workflows
Add integration tests for guard fail-fast and hysteresis behavior

Medium Priority:

Consider making guard polling intervals configurable
Monitor error tracker transaction overhead; consider adjusting commit_interval if needed
Add circuit breaker for guard polling under high load

Low Priority:

Consider redacting HTTP response bodies in production environments
Generate TypeScript types from OpenAPI to prevent future drift

Conclusion

This is a high-quality PR that significantly improves actor error observability. The architecture is well-designed with proper separation of concerns. The two critical issues (HTTP error handling and TypeScript type mismatch) should be addressed before merge, but overall this is excellent work.

Recommendation: Approve with requested changes for the two critical issues.

Great work on this feature! The documentation and test coverage are particularly impressive. 🎉

This was referenced Jan 1, 2026

chore(rivetkit): clean up logging configuration #3672

Closed

fix(rivetkit): fix c.client not respecting correct origin #3668

Closed

chore(docs): document abort signal #3673

Closed

This was referenced Jan 1, 2026

fix(rivetkit): allow providing no args to createRivetKit #3667

Merged

chore(rivetkit): specialize context types for better compiler errors #3669

Closed

chore(rivetkit): RIVET_EXPOSE_ERRORS #3671

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless January 1, 2026 23:25 Failure

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 27bd42b to fa230af Compare January 2, 2026 09:20

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 09:21 Failure

MasterPtato requested changes Jan 2, 2026

Copilot AI mentioned this pull request Jan 2, 2026

Add error indicator column to runner pool table #3735

Closed

NathanFlurry changed the base branch from 12-23-chore_docs_document_abort_signal to graphite-base/3730 January 2, 2026 22:30

NathanFlurry force-pushed the graphite-base/3730 branch from 4072138 to d3bc69d Compare January 2, 2026 22:30

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from fa230af to 7039e59 Compare January 2, 2026 22:30

NathanFlurry changed the base branch from graphite-base/3730 to 01-02-chore_gas_add_support_for_find_workflows_ January 2, 2026 22:31

NathanFlurry mentioned this pull request Jan 2, 2026

chore(gas): add support for find_workflows #3739

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 22:32 Failure

jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 2, 2026 23:33

jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 7039e59 to 653f032 Compare January 2, 2026 23:33

This was referenced Jan 2, 2026

chore: update engine-full sdk #3740

Open

feat(dash): show runner pool errors #3741

Open

fix(dash): use ineligible icon for runners #3742

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 23:34 Failure

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 79307ef to d3bc69d Compare January 2, 2026 23:59

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 653f032 to 48e0be0 Compare January 2, 2026 23:59

jog1t mentioned this pull request Jan 2, 2026

feat(dash): surface actor errors in ui #3743

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless January 3, 2026 00:00 Failure

NathanFlurry requested a review from MasterPtato January 3, 2026 00:02

vercel bot had a problem deploying to Preview – rivetkit-serverless January 5, 2026 03:14 Failure

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from aef019b to 9b59e8a Compare January 5, 2026 18:22

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to c16bf7b Compare January 5, 2026 18:22

vercel bot had a problem deploying to Preview – rivetkit-serverless January 5, 2026 18:23 Failure

MasterPtato approved these changes Jan 5, 2026

engine/packages/pegboard/src/workflows/actor/mod.rs (outdated, resolved)

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 9b59e8a to 5c32da2 Compare January 5, 2026 19:16

vercel bot had a problem deploying to Preview – rivetkit-serverless January 5, 2026 19:17 Failure

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 5c32da2 to 295f622 Compare January 5, 2026 20:58

vercel bot had a problem deploying to Preview – rivetkit-serverless January 5, 2026 20:58 Failure

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 295f622 to eaa0341 Compare January 5, 2026 21:50

vercel bot had a problem deploying to Preview – rivetkit-serverless January 5, 2026 21:50 Failure

MasterPtato requested changes Jan 6, 2026

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from c16bf7b to 3040686 Compare January 6, 2026 02:04

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from eaa0341 to 050a3d9 Compare January 6, 2026 02:04

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 02:05 Failure

NathanFlurry requested a review from MasterPtato January 6, 2026 02:07

MasterPtato approved these changes Jan 6, 2026

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 050a3d9 to 99737e0 Compare January 6, 2026 02:19

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 3040686 to 3e3a986 Compare January 6, 2026 02:19

NathanFlurry mentioned this pull request Jan 6, 2026

feat(engine): optionally automatically drain old runners on new runner version connected #3675

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 02:20 Failure

feat(engine): return error for pending actors

47a8179

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 3e3a986 to daf1f12 Compare January 6, 2026 02:30

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 99737e0 to 47a8179 Compare January 6, 2026 02:30

graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026

feat(engine): return error for pending actors (#3730)

cf8d918

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 02:31 Failure

graphite-app bot closed this Jan 6, 2026

graphite-app bot deleted the 12-23-feat_engine_return_error_for_pending_actors branch January 6, 2026 02:32
