Closed
18 changes: 4 additions & 14 deletions Cargo.lock


5 changes: 1 addition & 4 deletions Cargo.toml
@@ -1,7 +1,7 @@

[workspace]
resolver = "2"
-members = ["engine/packages/actor-kv","engine/packages/api-builder","engine/packages/api-peer","engine/packages/api-public","engine/packages/api-types","engine/packages/api-util","engine/packages/bootstrap","engine/packages/cache","engine/packages/cache-purge","engine/packages/cache-result","engine/packages/clickhouse-inserter","engine/packages/clickhouse-user-query","engine/packages/config","engine/packages/dump-openapi","engine/packages/engine","engine/packages/env","engine/packages/epoxy","engine/packages/error","engine/packages/error-macros","engine/packages/gasoline","engine/packages/gasoline-macros","engine/packages/guard","engine/packages/guard-core","engine/packages/logs","engine/packages/metrics","engine/packages/namespace","engine/packages/pegboard","engine/packages/pegboard-gateway","engine/packages/pegboard-runner","engine/packages/pools","engine/packages/postgres-util","engine/packages/runtime","engine/packages/serverless-backfill","engine/packages/service-manager","engine/packages/telemetry","engine/packages/test-deps","engine/packages/test-deps-docker","engine/packages/tracing-reconfigure","engine/packages/tracing-utils","engine/packages/types","engine/packages/universaldb","engine/packages/universalpubsub","engine/packages/util","engine/packages/util-id","engine/packages/workflow-worker","engine/sdks/rust/api-full","engine/sdks/rust/data","engine/sdks/rust/epoxy-protocol","engine/sdks/rust/runner-protocol","engine/sdks/rust/ups-protocol"]
+members = ["engine/packages/actor-kv","engine/packages/api-builder","engine/packages/api-peer","engine/packages/api-public","engine/packages/api-types","engine/packages/api-util","engine/packages/bootstrap","engine/packages/cache","engine/packages/cache-purge","engine/packages/cache-result","engine/packages/clickhouse-inserter","engine/packages/clickhouse-user-query","engine/packages/config","engine/packages/dump-openapi","engine/packages/engine","engine/packages/env","engine/packages/epoxy","engine/packages/error","engine/packages/error-macros","engine/packages/gasoline","engine/packages/gasoline-macros","engine/packages/guard","engine/packages/guard-core","engine/packages/logs","engine/packages/metrics","engine/packages/namespace","engine/packages/pegboard","engine/packages/pegboard-gateway","engine/packages/pegboard-runner","engine/packages/pools","engine/packages/postgres-util","engine/packages/runtime","engine/packages/service-manager","engine/packages/telemetry","engine/packages/test-deps","engine/packages/test-deps-docker","engine/packages/tracing-reconfigure","engine/packages/tracing-utils","engine/packages/types","engine/packages/universaldb","engine/packages/universalpubsub","engine/packages/util","engine/packages/util-id","engine/packages/workflow-worker","engine/sdks/rust/api-full","engine/sdks/rust/data","engine/sdks/rust/epoxy-protocol","engine/sdks/rust/runner-protocol","engine/sdks/rust/ups-protocol"]

[workspace.package]
version = "2.0.33"
@@ -370,9 +370,6 @@ path = "engine/packages/postgres-util"
[workspace.dependencies.rivet-runtime]
path = "engine/packages/runtime"

-[workspace.dependencies.rivet-serverless-backfill]
-path = "engine/packages/serverless-backfill"

[workspace.dependencies.rivet-service-manager]
path = "engine/packages/service-manager"

116 changes: 116 additions & 0 deletions docs/engine/ACTOR_ERRORS.md
@@ -0,0 +1,116 @@
# Actor Errors

## Overview

Actor errors come from two sources:

1. **Direct actor errors** - From the actor workflow itself (stored in workflow state as `ActorError`)
2. **Runner pool errors** - From serverless endpoints, tracked by the runner pool error tracker workflow

Errors use a two-layer representation. Both layers are enums named `ActorError`:

- Internal `ActorError` - Stored in workflow state (`NoCapacity`, `RunnerNoResponse`)
- API-facing `ActorError` - Adds enriched context (`PoolError`, `NoCapacity`, `RunnerNoResponse`)

The **runner pool error tracker** monitors serverless endpoint health. When an error occurs, it stores the error with a timestamp. The error is cleared only after 3 consecutive successful runner startups (hysteresis), so a single success does not mark a failing pool as healthy.

## Error Sources

### Direct Actor Errors

Stored in actor workflow state as `failure_reason`:

- **NoCapacity** - No runner available to allocate the actor
- **RunnerNoResponse** - Runner was allocated but became unresponsive (includes `runner_id`)

The failure reason is set when:
- Runner allocation fails or times out
- The runner disconnects while the actor is allocated to it

### Runner Pool Errors

Tracked separately by `runner_pool_error_tracker` workflow for serverless configs:

- **RunnerPoolError** - Contains error details from failed endpoint requests. Variants include:
  - `ServerlessHttpError` - HTTP error (status code, body)
  - `ServerlessStreamEndedEarly` - SSE stream ended before runner initialized
  - `ServerlessConnectionError` - Network/connection error
  - `ServerlessInvalidBase64` - Invalid base64 in SSE message
  - `ServerlessInvalidPayload` - Invalid protocol payload
  - `InternalError` - Internal errors (namespace not found, config not found, etc.)
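
The variant list above maps naturally onto a Rust enum. The following is only a sketch: the variant names come from the list, but every field name and type, and the `summary` helper, are assumptions made for illustration and may not match the actual definition in the engine.

```rust
/// Sketch of the pool error variants listed above.
/// Variant names come from the list; field names and types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum RunnerPoolError {
    ServerlessHttpError { status_code: u16, body: String },
    ServerlessStreamEndedEarly,
    ServerlessConnectionError { message: String },
    ServerlessInvalidBase64,
    ServerlessInvalidPayload { message: String },
    InternalError { message: String },
}

impl RunnerPoolError {
    /// Hypothetical helper: a one-line summary suitable for an API response.
    pub fn summary(&self) -> String {
        match self {
            Self::ServerlessHttpError { status_code, .. } => {
                format!("serverless endpoint returned {}", status_code)
            }
            Self::ServerlessStreamEndedEarly => {
                "SSE stream ended before runner initialized".to_string()
            }
            Self::ServerlessConnectionError { message } => {
                format!("connection error: {}", message)
            }
            Self::ServerlessInvalidBase64 => "invalid base64 in SSE message".to_string(),
            Self::ServerlessInvalidPayload { message } => {
                format!("invalid protocol payload: {}", message)
            }
            Self::InternalError { message } => format!("internal error: {}", message),
        }
    }
}
```

Keeping the raw details (status code, body) in the variant and deriving the human-readable text separately lets each consumer decide how much detail to expose.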

The error tracker:
- Receives errors from the serverless pool manager
- Stores `active_error` with timestamp
- Clears error after 3 consecutive successful runner startups
- Makes state queryable for API enrichment and guard fail-fast
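
The clearing rule can be sketched as a small state machine. This is a simplified, in-memory illustration and not the workflow implementation: `PoolErrorTracker`, `ActiveError`, the method names, and the use of a plain string and millisecond timestamp are all assumptions. It also assumes that a new error resets the success counter, which is what "consecutive" implies.

```rust
/// Number of consecutive successful runner startups needed to clear an error.
pub const SUCCESSES_TO_CLEAR: u32 = 3;

/// Hypothetical stored error: a description plus the time it was recorded.
#[derive(Debug, Clone, PartialEq)]
pub struct ActiveError {
    pub error: String,
    pub timestamp_ms: i64,
}

/// Hypothetical in-memory stand-in for the error tracker's state.
#[derive(Debug, Default)]
pub struct PoolErrorTracker {
    active_error: Option<ActiveError>,
    consecutive_successes: u32,
}

impl PoolErrorTracker {
    /// Store the latest error and restart the success count.
    pub fn record_error(&mut self, error: &str, timestamp_ms: i64) {
        self.active_error = Some(ActiveError {
            error: error.to_string(),
            timestamp_ms,
        });
        self.consecutive_successes = 0;
    }

    /// Count a successful runner startup; clear the error after enough in a row.
    pub fn record_success(&mut self) {
        if self.active_error.is_none() {
            return;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= SUCCESSES_TO_CLEAR {
            self.active_error = None;
            self.consecutive_successes = 0;
        }
    }

    /// Queryable state, as used for API enrichment and guard fail-fast.
    pub fn active_error(&self) -> Option<&ActiveError> {
        self.active_error.as_ref()
    }
}
```

With this shape, two successes followed by a new error leave the error active, and only three uninterrupted successes clear it.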

## Error Consumption

### API Response Enrichment

When fetching actors via the API (`pegboard::ops::actor::get`):

1. Check if the actor has `failure_reason: NoCapacity`
2. Query the error tracker workflow for the actor's runner config
3. If the pool has an active error, return `PoolError` with details
4. Otherwise, return plain `NoCapacity`

This gives users actionable error information (e.g., "serverless endpoint returned 500") instead of a generic "no capacity" message.
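
The enrichment steps can be sketched as a pure function. This is an illustration under assumptions, not the code in `pegboard::ops::actor::get`: the `internal` and `api` module names, the `details: String` field, and the `enrich` function with its closure parameter are all hypothetical. The closure stands in for the error tracker query so that it only runs for `NoCapacity`.

```rust
/// Hypothetical module for the internal enum stored in workflow state.
pub mod internal {
    #[derive(Debug, Clone, PartialEq)]
    pub enum ActorError {
        NoCapacity,
        RunnerNoResponse { runner_id: String },
    }
}

/// Hypothetical module for the API-facing enum with enriched context.
pub mod api {
    #[derive(Debug, Clone, PartialEq)]
    pub enum ActorError {
        PoolError { details: String },
        NoCapacity,
        RunnerNoResponse { runner_id: String },
    }
}

/// Map the stored failure reason to the API error.
/// `query_pool_error` stands in for querying the error tracker workflow and
/// is only called when the stored reason is `NoCapacity`.
pub fn enrich(
    failure_reason: Option<internal::ActorError>,
    query_pool_error: impl FnOnce() -> Option<String>,
) -> Option<api::ActorError> {
    let enriched = match failure_reason? {
        internal::ActorError::NoCapacity => match query_pool_error() {
            // The pool has an active error: surface it instead of "no capacity".
            Some(details) => api::ActorError::PoolError { details },
            None => api::ActorError::NoCapacity,
        },
        internal::ActorError::RunnerNoResponse { runner_id } => {
            api::ActorError::RunnerNoResponse { runner_id }
        }
    };
    Some(enriched)
}
```

Passing the query as a closure keeps the tracker lookup off the path for actors that have no failure or a `RunnerNoResponse` failure.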

### Guard Fail-Fast

When guard is waiting for an actor to become ready:

1. Wait 1 second before starting checks (allow normal startup)
2. Every 2 seconds, poll the error tracker workflow
3. If an active error exists, return `actor_runner_failed` immediately

This prevents clients from waiting the full ready timeout when the runner pool is known to be failing.
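
The timing of these checks can be illustrated with a simulation over virtual time. This is a sketch only: the real guard is presumably asynchronous, and `wait_for_ready`, `WaitOutcome`, and the parameter names are hypothetical. The 1 second initial delay and 2 second poll interval come from the steps above; the assumption that readiness is noticed no later than the next check is a simplification.

```rust
use std::time::Duration;

pub const INITIAL_DELAY: Duration = Duration::from_secs(1);
pub const POLL_INTERVAL: Duration = Duration::from_secs(2);

#[derive(Debug, PartialEq)]
pub enum WaitOutcome {
    Ready,
    /// Corresponds to returning `actor_runner_failed`.
    RunnerFailed,
    TimedOut,
}

/// Simulate the guard wait loop without sleeping.
/// `ready_at` is when the actor becomes ready (if ever); `pool_has_error`
/// reports whether the error tracker has an active error at a given offset.
/// Returns the outcome and the offset at which it was decided.
pub fn wait_for_ready(
    ready_timeout: Duration,
    ready_at: Option<Duration>,
    mut pool_has_error: impl FnMut(Duration) -> bool,
) -> (WaitOutcome, Duration) {
    let mut next_check = INITIAL_DELAY;
    while next_check < ready_timeout {
        // The actor became ready before this check was due.
        if let Some(t) = ready_at {
            if t <= next_check {
                return (WaitOutcome::Ready, t);
            }
        }
        // Fail fast when the pool is known to be failing.
        if pool_has_error(next_check) {
            return (WaitOutcome::RunnerFailed, next_check);
        }
        next_check += POLL_INTERVAL;
    }
    match ready_at {
        Some(t) if t <= ready_timeout => (WaitOutcome::Ready, t),
        _ => (WaitOutcome::TimedOut, ready_timeout),
    }
}
```

In this model, a pool that is already failing is reported at the 1 second mark, and an error that appears later is reported at the next 2 second poll, rather than after the full ready timeout.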

### Runner Configs API

The runner configs list endpoint enriches serverless configs with current pool errors:

1. For each serverless config, find its error tracker workflow
2. If an active error exists, populate the `pool_error` field

This allows dashboards to show pool health status.
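
A minimal sketch of this enrichment, assuming the active errors have already been collected into a map keyed by config name. The `RunnerConfig` struct, the `RunnerConfigKind` enum (including the `Other` variant), and `enrich_configs` are hypothetical names used only for illustration; `pool_error` is the field named above.

```rust
use std::collections::HashMap;

/// Hypothetical config kind; only serverless configs have an error tracker.
#[derive(Debug, Clone, PartialEq)]
pub enum RunnerConfigKind {
    Serverless,
    Other,
}

/// Hypothetical list-endpoint item with the `pool_error` field.
#[derive(Debug, Clone, PartialEq)]
pub struct RunnerConfig {
    pub name: String,
    pub kind: RunnerConfigKind,
    pub pool_error: Option<String>,
}

/// Populate `pool_error` for serverless configs that have an active error.
/// `active_errors` stands in for querying each config's error tracker workflow.
pub fn enrich_configs(configs: &mut [RunnerConfig], active_errors: &HashMap<String, String>) {
    for config in configs.iter_mut() {
        if config.kind == RunnerConfigKind::Serverless {
            config.pool_error = active_errors.get(&config.name).cloned();
        }
    }
}
```

Configs without an active error keep `pool_error` as `None`, which a dashboard can treat as healthy.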

## Error Flow Diagram

```mermaid
---
config:
  theme: mc
  look: classic
---
flowchart TD
    subgraph Sources
        AWF[Actor Workflow]
        SE[Serverless Endpoint]
    end

    subgraph Storage
        WFS[Workflow State<br/>failure_reason: ActorError]
        ET[Error Tracker Workflow<br/>active_error: RunnerPoolError]
    end

    subgraph Consumers
        API[Actor Get API]
        Guard[Guard Fail-Fast]
        RC[Runner Configs API]
    end

    AWF -->|NoCapacity<br/>RunnerNoResponse| WFS
    SE -->|Connection errors| ET

    WFS --> API
    ET --> API
    API -->|ActorError| Client1[Client]

    ET --> Guard
    Guard -->|actor_runner_failed| Client2[Client]

    ET --> RC
    RC -->|pool_error| Dashboard[Dashboard]
```
5 changes: 5 additions & 0 deletions engine/artifacts/errors/guard.actor_runner_failed.json
