@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Jan 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Review   Updated (UTC)
rivetkit-serverless  Error                Jan 6, 2026 2:31am

3 Skipped Deployments

Project              Deployment  Review   Updated (UTC)
rivet-cloud          Ignored              Jan 6, 2026 2:31am
rivet-inspector      Ignored     Preview  Jan 6, 2026 2:31am
rivet-site           Ignored     Preview  Jan 6, 2026 2:31am

Member Author

NathanFlurry commented Jan 1, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry changed the base branch from 12-23-chore_docs_document_abort_signal to graphite-base/3730 January 2, 2026 22:30
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from fa230af to 7039e59 Compare January 2, 2026 22:30
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3730 to 01-02-chore_gas_add_support_for_find_workflows_ January 2, 2026 22:31
@jog1t jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 2, 2026 23:33
@jog1t jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 7039e59 to 653f032 Compare January 2, 2026 23:33
@NathanFlurry NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 79307ef to d3bc69d Compare January 2, 2026 23:59
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 653f032 to 48e0be0 Compare January 2, 2026 23:59
@NathanFlurry NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from c16bf7b to 3040686 Compare January 6, 2026 02:04
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from eaa0341 to 050a3d9 Compare January 6, 2026 02:04
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 050a3d9 to 99737e0 Compare January 6, 2026 02:19
@NathanFlurry NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 3040686 to 3e3a986 Compare January 6, 2026 02:19
@graphite-app
Contributor

graphite-app bot commented Jan 6, 2026

Merge activity

@claude

claude bot commented Jan 6, 2026

PR Review: Actor Error Handling Enhancement

This is a substantial and well-architected feature that adds comprehensive error tracking and reporting for Rivet Actors. The implementation demonstrates strong systems design with clear separation of concerns.

Architecture Overview ✅

The two-layer error architecture is well-designed:

  • Direct actor errors tracked in workflow state
  • Runner pool errors tracked separately via dedicated workflow
  • Clean separation enables independent querying and enrichment

The documentation in ACTOR_ERRORS.md is excellent and should help future maintainers understand the system.


Code Quality Observations

Strengths 💪

  1. Excellent Documentation: The ACTOR_ERRORS.md file with mermaid diagram clearly explains the architecture
  2. Proper Error Propagation: The RunnerPoolError enum provides detailed, actionable error information
  3. Hysteresis Implementation: The consecutive success counter (config: runner_pool_consecutive_successes_to_clear_error) prevents error flapping
  4. Guard Fail-Fast: The guard polling mechanism (pegboard_gateway.rs:230-259) prevents unnecessary timeouts
  5. Comprehensive Testing: actors_scheduling_errors.rs includes substantial test coverage with mock serverless endpoints
  6. Cache Strategy: Short TTL (500ms) in get_error.rs:32 is appropriate for error state
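The hysteresis noted in item 3 can be illustrated with a minimal sketch. This is not the PR's implementation: `ErrorHysteresis` and its methods are hypothetical names, and the only detail taken from the review is that an error is cleared after a configured number of consecutive successes.

```rust
/// Hypothetical tracker: an error stays active until `threshold`
/// consecutive successes have been observed.
#[derive(Debug)]
pub struct ErrorHysteresis {
    threshold: u32,
    consecutive_successes: u32,
    active_error: Option<String>,
}

impl ErrorHysteresis {
    pub fn new(threshold: u32) -> Self {
        Self {
            threshold,
            consecutive_successes: 0,
            active_error: None,
        }
    }

    /// A failure replaces the active error and resets the success streak.
    pub fn record_error(&mut self, message: &str) {
        self.active_error = Some(message.to_string());
        self.consecutive_successes = 0;
    }

    /// A success clears the error only once the streak reaches the threshold.
    pub fn record_success(&mut self) {
        if self.active_error.is_none() {
            return;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= self.threshold {
            self.active_error = None;
            self.consecutive_successes = 0;
        }
    }

    pub fn active_error(&self) -> Option<&str> {
        self.active_error.as_deref()
    }
}
```

With a threshold of 3, a single success sandwiched between failures never clears the error, which is the flapping the counter is meant to prevent.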

Issues & Concerns 🚨

1. Critical: Drain Signal Not Checked in HTTP Error Path ⚠️

In serverless/conn.rs:365-379, when an HTTP error occurs, the error is reported, but the function then returns early via bail! without checking for a drain signal. This could leave connections in an inconsistent state.

Location: engine/packages/pegboard/src/workflows/serverless/conn.rs:365-379

Impact: If a drain signal arrives during error handling, it won't be processed, potentially preventing graceful shutdown.

Recommendation: Follow the pattern used in other error paths - return an error result that's handled by the outer loop which checks for drain signals.
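The recommended shape can be sketched roughly as follows. This is a synchronous, std-only illustration, not the code in conn.rs: `run_connection`, `Outcome`, and the use of an `mpsc` channel as the drain signal are all assumptions made for the example. The point is only that the attempt returns an `Err` to the outer loop, which then checks for a drain before deciding what to do next.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Drained,
    Failed(String),
}

/// Hypothetical outer loop. `attempt` returns `Err` instead of bailing out
/// of the whole function, so the drain check below runs after every failure.
pub fn run_connection<F>(drain: &Receiver<()>, max_attempts: u32, mut attempt: F) -> Outcome
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut last_error = String::from("no attempts made");
    for n in 0..max_attempts {
        match attempt(n) {
            Ok(()) => return Outcome::Completed,
            Err(e) => {
                last_error = e;
                // The error path still observes a pending drain signal.
                if drain.try_recv().is_ok() {
                    return Outcome::Drained;
                }
            }
        }
    }
    Outcome::Failed(last_error)
}
```

In the real workflow the drain signal and retry policy would come from the engine; the sketch only shows why an early return from inside the attempt skips the check.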

2. TypeScript Error Type Mismatch

The TypeScript types in rivetkit-typescript/packages/rivetkit/src/client/errors.ts don't fully align with the Rust API types.

Location: rivetkit-typescript/packages/rivetkit/src/client/errors.ts:50-58

Issue: The TypeScript type uses variant names that don't match the OpenAPI schema; the two sides differ by a serverless_ prefix:

  • http_error vs serverless_http_error
  • stream_ended_early vs serverless_stream_ended_early

Impact: Runtime deserialization errors when TypeScript clients receive error responses.

Recommendation: Update TypeScript types to match the exact snake_case serialization from Rust. Consider generating TypeScript types from OpenAPI schema to prevent drift.

3. Guard Polling Timing Constants Hardcoded

In guard/src/routing/pegboard_gateway.rs:14-17, timing constants are hardcoded and not configurable.

Recommendation: Consider adding these to the pegboard config similar to runner_pool_consecutive_successes_to_clear_error. Mark as experimental initially.

4. Potential Transaction Overhead in Error Tracker

The runner_pool_error_tracker workflow uses commit_interval(1) and listens to up to 256 signals at once.

Location: pegboard/src/workflows/runner_pool_error_tracker.rs:42-45

Issue: Processing 256 signals with commit_interval(1) could create significant transaction overhead for high-traffic runner pools.

Recommendation:

  • Consider batching commits (e.g., commit_interval(10))
  • Add metrics to monitor transaction sizes and processing time
  • Document the trade-off between freshness and transaction overhead
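To make the trade-off concrete, here is a small sketch that counts commits for a batch of signals. It assumes that commit_interval(n) means "commit after every n processed signals", which the review implies but does not state; `process_signals` is a hypothetical stand-in, not the workflow engine's API.

```rust
/// Hypothetical stand-in for the tracker loop: folds each signal into the
/// state and counts how many commits a given interval would produce.
/// Returns (final_state, commit_count).
pub fn process_signals(signals: &[u32], commit_interval: usize) -> (u64, usize) {
    assert!(commit_interval > 0, "commit_interval must be at least 1");
    let mut state: u64 = 0;
    let mut commits = 0;
    let mut uncommitted = 0;
    for signal in signals {
        state += u64::from(*signal);
        uncommitted += 1;
        if uncommitted == commit_interval {
            commits += 1;
            uncommitted = 0;
        }
    }
    // Flush whatever is left so no processed signal stays uncommitted.
    if uncommitted > 0 {
        commits += 1;
    }
    (state, commits)
}
```

Under this assumption, a full batch of 256 signals costs 256 commits at an interval of 1 and 26 at an interval of 10, at the price of up to 9 signals of uncommitted work if the workflow is interrupted.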

5. Missing Logging: Empty Runner List

In runner_config/get_error.rs:24-26, the early return for an empty runner list does not log anything.

Recommendation: Add trace-level logging when returning early for empty input.

6. Inconsistent Error Handling for Missing Workflows

In actor/util.rs:87-100, when the error tracker workflow isn't found, it's silently skipped.

Impact: Could hide issues with error tracker workflow creation.

Recommendation: Add debug-level logging when error tracker workflow doesn't exist for a runner that needs it.


Best Practices & Adherence to CLAUDE.md

✅ Followed Correctly:

  • Hard tabs used consistently
  • Proper error types with RivetError derive
  • Workspace dependencies used correctly
  • Structured logging with field syntax (e.g., tracing::debug!(?actor_id, "message"))
  • Database keys properly namespaced
  • Timestamps as i64 milliseconds with _ts suffix
  • Custom errors in guard/src/errors.rs follow the pattern

⚠️ Attention Needed:

  • Consider whether ACTOR_ERRORS.md should say "Rivet Actor" rather than "actor", per the CLAUDE.md guidelines

Security Considerations

✅ Good:

  • Error messages properly sanitized (body truncated to 512 chars in conn.rs:366)
  • No credential leakage in error reporting
  • Token validation noted as "implemented in EE" with clear comment
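For reference, truncating a response body safely in Rust needs care around UTF-8 boundaries. The sketch below is an assumption-laden illustration, not the code at conn.rs:366: `truncate_body` and `MAX_BODY_CHARS` are hypothetical names, and it treats the 512 limit as a count of characters rather than bytes.

```rust
/// Hypothetical limit mirroring the 512-character truncation noted above.
pub const MAX_BODY_CHARS: usize = 512;

/// Returns at most `max_chars` characters of `body`, never splitting a
/// multi-byte UTF-8 code point.
pub fn truncate_body(body: &str, max_chars: usize) -> &str {
    // `nth(max_chars)` finds the byte offset where the first excluded
    // character starts; everything before it is kept.
    match body.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &body[..byte_idx],
        None => body,
    }
}
```

Slicing by a raw byte index (for example `&body[..512]`) would panic if the cut landed inside a multi-byte character, which is why the sketch walks `char_indices` instead.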

⚠️ Minor Concerns:

  • HTTP response bodies included in errors could potentially expose sensitive serverless endpoint information. Consider adding a config flag to redact response bodies in production.

Performance Considerations

Potential Issues:

  1. Guard Polling: Every pending request polls error tracker every 2 seconds. For many concurrent requests, this could create load.

    • Mitigation: The 500ms cache TTL in get_error.rs helps, but consider adding a circuit breaker if polling becomes excessive.
  2. N+1 Query Pattern (no issue): In runner_configs.rs:87-96, fetching pool errors is properly batched, so an N+1 pattern is avoided. ✅

  3. Workflow Query Cost: Finding workflows by input in get_error.rs:53-65 requires scanning workflow metadata. Monitor performance as the number of runner configs grows.
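The interaction between guard polling and the short cache TTL can be sketched as follows. `TtlCache` is a hypothetical, single-threaded illustration and not the caching layer used in get_error.rs; the only value taken from the review is the 500ms TTL. The current time is passed in explicitly so the behavior is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache keyed by runner pool name.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Returns the cached value while it is younger than the TTL;
    /// otherwise calls `load`, stores the result, and returns it.
    pub fn get_or_load<F>(&mut self, key: &str, now: Instant, load: F) -> V
    where
        F: FnOnce() -> V,
    {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if now.saturating_duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = load();
        self.entries.insert(key.to_string(), (now, value.clone()));
        value
    }
}
```

With a 500ms TTL, any number of pending requests polling the same pool within that window trigger one underlying lookup, so the backing query rate is bounded per pool rather than per request.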


Testing Coverage

✅ Excellent:

  • actors_scheduling_errors.rs has 1019 new lines of comprehensive tests
  • Mock serverless endpoints simulate various failure scenarios
  • Tests cover HTTP errors, stream ending early, and error clearing after successes

Missing Test Cases:

  1. Guard fail-fast behavior when error tracker returns active error
  2. Hysteresis behavior (consecutive success counter)
  3. Concurrent error reporting from multiple serverless connections
  4. Error tracker workflow lifecycle (creation, shutdown)

Recommendation: Add integration tests for guard fail-fast and hysteresis.


API Changes

Breaking Changes:

None - New error and runner_pool_error fields are optional, maintaining backward compatibility.

Deprecations:

None identified.


Migration & Deployment

Backfills:

The PR includes two backfills properly orchestrated in bootstrap/src/backfill.rs:

  1. actor_runner_name_selector_backfill - Migrates existing actor data
  2. serverless_backfill - Sets up error trackers for serverless configs

✅ Both check for completion before running, preventing duplicate execution.
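The completion check can be pictured with a minimal sketch. `BackfillRunner` is a hypothetical in-memory stand-in; the real code in bootstrap/src/backfill.rs presumably persists its completion markers, which this example does not attempt to model.

```rust
use std::collections::HashSet;

/// Hypothetical runner that records which backfills have completed.
#[derive(Default)]
pub struct BackfillRunner {
    completed: HashSet<String>,
}

impl BackfillRunner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Runs `backfill` only if `name` has not completed before.
    /// Returns true if the backfill ran on this call.
    pub fn run_once<F: FnOnce()>(&mut self, name: &str, backfill: F) -> bool {
        if self.completed.contains(name) {
            return false;
        }
        backfill();
        // Marked complete only after the backfill body has finished.
        self.completed.insert(name.to_string());
        true
    }
}
```

Marking completion after the body runs means an interrupted backfill is retried on the next start, so the body itself should be safe to repeat.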

Deployment Order:

The backfills in bootstrap ensure safe rolling deployment. No special deployment order required.


Recommendations Summary

Critical (Must Fix):

  1. Fix HTTP error path in serverless/conn.rs:365-379 to properly handle drain signals
  2. Fix TypeScript type mismatches in errors.ts to match Rust serialization

High Priority:

  1. Add logging for empty runner lists and missing error tracker workflows
  2. Add integration tests for guard fail-fast and hysteresis behavior

Medium Priority:

  1. Consider making guard polling intervals configurable
  2. Monitor error tracker transaction overhead; consider adjusting commit_interval if needed
  3. Add circuit breaker for guard polling under high load

Low Priority:

  1. Consider redacting HTTP response bodies in production environments
  2. Generate TypeScript types from OpenAPI to prevent future drift

Conclusion

This is a high-quality PR that significantly improves actor error observability. The architecture is well-designed with proper separation of concerns. The two critical issues (HTTP error handling and TypeScript type mismatch) should be addressed before merge, but overall this is excellent work.

Recommendation: Approve with requested changes for the two critical issues.


Great work on this feature! The documentation and test coverage are particularly impressive. 🎉

@NathanFlurry NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 3e3a986 to daf1f12 Compare January 6, 2026 02:30
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 99737e0 to 47a8179 Compare January 6, 2026 02:30
@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the 12-23-feat_engine_return_error_for_pending_actors branch January 6, 2026 02:32