Commit 47a8179

feat(engine): return error for pending actors
1 parent daf1f12 commit 47a8179

71 files changed (+3286, -403 lines)

Cargo.lock

Lines changed: 4 additions & 14 deletions

Cargo.toml

Lines changed: 1 addition & 4 deletions
@@ -1,7 +1,7 @@
 
 [workspace]
 resolver = "2"
-members = ["engine/packages/actor-kv","engine/packages/api-builder","engine/packages/api-peer","engine/packages/api-public","engine/packages/api-types","engine/packages/api-util","engine/packages/bootstrap","engine/packages/cache","engine/packages/cache-purge","engine/packages/cache-result","engine/packages/clickhouse-inserter","engine/packages/clickhouse-user-query","engine/packages/config","engine/packages/dump-openapi","engine/packages/engine","engine/packages/env","engine/packages/epoxy","engine/packages/error","engine/packages/error-macros","engine/packages/gasoline","engine/packages/gasoline-macros","engine/packages/guard","engine/packages/guard-core","engine/packages/logs","engine/packages/metrics","engine/packages/namespace","engine/packages/pegboard","engine/packages/pegboard-gateway","engine/packages/pegboard-runner","engine/packages/pools","engine/packages/postgres-util","engine/packages/runtime","engine/packages/serverless-backfill","engine/packages/service-manager","engine/packages/telemetry","engine/packages/test-deps","engine/packages/test-deps-docker","engine/packages/tracing-reconfigure","engine/packages/tracing-utils","engine/packages/types","engine/packages/universaldb","engine/packages/universalpubsub","engine/packages/util","engine/packages/util-id","engine/packages/workflow-worker","engine/sdks/rust/api-full","engine/sdks/rust/data","engine/sdks/rust/epoxy-protocol","engine/sdks/rust/runner-protocol","engine/sdks/rust/ups-protocol"]
+members = ["engine/packages/actor-kv","engine/packages/api-builder","engine/packages/api-peer","engine/packages/api-public","engine/packages/api-types","engine/packages/api-util","engine/packages/bootstrap","engine/packages/cache","engine/packages/cache-purge","engine/packages/cache-result","engine/packages/clickhouse-inserter","engine/packages/clickhouse-user-query","engine/packages/config","engine/packages/dump-openapi","engine/packages/engine","engine/packages/env","engine/packages/epoxy","engine/packages/error","engine/packages/error-macros","engine/packages/gasoline","engine/packages/gasoline-macros","engine/packages/guard","engine/packages/guard-core","engine/packages/logs","engine/packages/metrics","engine/packages/namespace","engine/packages/pegboard","engine/packages/pegboard-gateway","engine/packages/pegboard-runner","engine/packages/pools","engine/packages/postgres-util","engine/packages/runtime","engine/packages/service-manager","engine/packages/telemetry","engine/packages/test-deps","engine/packages/test-deps-docker","engine/packages/tracing-reconfigure","engine/packages/tracing-utils","engine/packages/types","engine/packages/universaldb","engine/packages/universalpubsub","engine/packages/util","engine/packages/util-id","engine/packages/workflow-worker","engine/sdks/rust/api-full","engine/sdks/rust/data","engine/sdks/rust/epoxy-protocol","engine/sdks/rust/runner-protocol","engine/sdks/rust/ups-protocol"]
 
 [workspace.package]
 version = "2.0.33"
@@ -370,9 +370,6 @@ path = "engine/packages/postgres-util"
 [workspace.dependencies.rivet-runtime]
 path = "engine/packages/runtime"
 
-[workspace.dependencies.rivet-serverless-backfill]
-path = "engine/packages/serverless-backfill"
-
 [workspace.dependencies.rivet-service-manager]
 path = "engine/packages/service-manager"
 

docs/engine/ACTOR_ERRORS.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Actor Errors

## Overview

Actor errors come from two sources:

1. **Direct actor errors** - From the actor workflow itself (stored in workflow state as `ActorError`)
2. **Runner pool errors** - From serverless endpoints, tracked by the runner pool error tracker workflow

Errors use a two-layer representation:

- `ActorError` - Internal enum stored in workflow state (`NoCapacity`, `RunnerNoResponse`)
- `ActorError` - API-facing enum with enriched context (`PoolError`, `NoCapacity`, `RunnerNoResponse`)

The **runner pool error tracker** monitors serverless endpoint health. When errors occur, it stores them with timestamps. Errors are cleared after 3 consecutive successes (hysteresis).

## Error Sources

### Direct Actor Errors

Stored in actor workflow state as `failure_reason`:

- **NoCapacity** - No runner available to allocate the actor
- **RunnerNoResponse** - Runner was allocated but became unresponsive (includes `runner_id`)

Set when:
- Runner allocation fails or times out
- Runner disconnects while actor was allocated

### Runner Pool Errors

Tracked separately by `runner_pool_error_tracker` workflow for serverless configs:

- **RunnerPoolError** - Contains error details from failed endpoint requests. Variants include:
  - `ServerlessHttpError` - HTTP error (status code, body)
  - `ServerlessStreamEndedEarly` - SSE stream ended before runner initialized
  - `ServerlessConnectionError` - Network/connection error
  - `ServerlessInvalidBase64` - Invalid base64 in SSE message
  - `ServerlessInvalidPayload` - Invalid protocol payload
  - `InternalError` - Internal errors (namespace not found, config not found, etc.)

The error tracker:
- Receives errors from the serverless pool manager
- Stores `active_error` with timestamp
- Clears error after 3 consecutive successful runner startups
- Makes state queryable for API enrichment and guard fail-fast
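
The tracker behavior listed above can be sketched as a small state machine. This is an illustrative sketch, not the code from this commit: the `ErrorTracker` struct, its method names, and the variant payloads are hypothetical, and it assumes that a new error resets the success counter (which is what "consecutive" suggests).

```rust
use std::time::SystemTime;

/// Hypothetical subset of the `RunnerPoolError` variants named in the doc.
/// The payload fields are assumptions made for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum RunnerPoolError {
    ServerlessHttpError { status_code: u16, body: String },
    ServerlessStreamEndedEarly,
    ServerlessConnectionError { message: String },
}

/// Consecutive successful runner startups needed to clear an error.
const SUCCESSES_TO_CLEAR: u32 = 3;

/// Hypothetical in-memory stand-in for the error tracker workflow state.
#[derive(Debug, Default)]
struct ErrorTracker {
    active_error: Option<(RunnerPoolError, SystemTime)>,
    consecutive_successes: u32,
}

impl ErrorTracker {
    /// Store the latest error with its timestamp and restart the success count.
    fn report_error(&mut self, error: RunnerPoolError, now: SystemTime) {
        self.active_error = Some((error, now));
        self.consecutive_successes = 0;
    }

    /// Count a successful runner startup; clear the error after three in a row.
    fn report_success(&mut self) {
        if self.active_error.is_none() {
            return;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= SUCCESSES_TO_CLEAR {
            self.active_error = None;
            self.consecutive_successes = 0;
        }
    }

    /// Queryable view used by consumers such as the API and guard.
    fn active_error(&self) -> Option<&RunnerPoolError> {
        self.active_error.as_ref().map(|(error, _)| error)
    }
}
```

Requiring several successes before clearing keeps a flapping endpoint from toggling the reported state on every request.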

## Error Consumption

### API Response Enrichment

When fetching actors via the API (`pegboard::ops::actor::get`):

1. Check if actor has `failure_reason: NoCapacity`
2. Query the error tracker workflow for the actor's runner config
3. If pool has active error, return `PoolError` with details
4. Otherwise return plain `NoCapacity`

This provides users with actionable error info (e.g., "serverless endpoint returned 500") instead of generic "no capacity".
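
The four enrichment steps can be sketched as a pure function. The type names `FailureReason` and `ApiActorError` are hypothetical (the doc calls both layers `ActorError`; distinct names are used here only so the sketch compiles in one file), the `String` payloads are assumptions, and the tracker query is modeled as a closure rather than a real workflow lookup.

```rust
/// Internal failure reason stored in workflow state (hypothetical name).
#[derive(Debug, Clone, PartialEq)]
enum FailureReason {
    NoCapacity,
    RunnerNoResponse { runner_id: String },
}

/// API-facing error with enriched context (hypothetical name and payloads).
#[derive(Debug, Clone, PartialEq)]
enum ApiActorError {
    PoolError { details: String },
    NoCapacity,
    RunnerNoResponse { runner_id: String },
}

/// Map the stored failure reason to the API error.
/// `lookup_pool_error` stands in for querying the error tracker workflow
/// and is only called when the failure reason is `NoCapacity`.
fn enrich(
    failure_reason: Option<&FailureReason>,
    lookup_pool_error: impl FnOnce() -> Option<String>,
) -> Option<ApiActorError> {
    match failure_reason? {
        FailureReason::NoCapacity => Some(match lookup_pool_error() {
            // Pool has an active error: surface its details.
            Some(details) => ApiActorError::PoolError { details },
            // No active pool error: plain "no capacity".
            None => ApiActorError::NoCapacity,
        }),
        FailureReason::RunnerNoResponse { runner_id } => {
            Some(ApiActorError::RunnerNoResponse {
                runner_id: runner_id.clone(),
            })
        }
    }
}
```

Passing the lookup as a closure keeps the tracker query lazy, so actors without a `NoCapacity` failure never trigger it.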

### Guard Fail-Fast

When guard is waiting for an actor to become ready:

1. Wait 1 second before starting checks (allow normal startup)
2. Every 2 seconds, poll the error tracker workflow
3. If active error exists, return `actor_runner_failed` immediately

This prevents clients from waiting the full ready timeout when the runner pool is known to be failing.
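
The polling schedule can be sketched with simulated time, which keeps the example runnable without an async runtime. Everything here is hypothetical: `wait_for_ready`, `WaitOutcome`, and the closure-based tracker query are illustrative names, the real guard presumably waits asynchronously, and the sketch assumes the first check happens right after the 1 second delay.

```rust
use std::time::Duration;

/// Possible results of waiting for an actor (hypothetical type).
#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Ready,
    /// Corresponds to returning `actor_runner_failed`.
    RunnerFailed(String),
    TimedOut,
}

const INITIAL_DELAY: Duration = Duration::from_secs(1);
const POLL_INTERVAL: Duration = Duration::from_secs(2);

/// Simulate the wait: `ready_at` is when the actor becomes ready (if ever),
/// and `poll_active_error` stands in for querying the error tracker at a
/// given elapsed time.
fn wait_for_ready(
    ready_timeout: Duration,
    ready_at: Option<Duration>,
    mut poll_active_error: impl FnMut(Duration) -> Option<String>,
) -> WaitOutcome {
    let mut next_check = INITIAL_DELAY;
    loop {
        // Readiness wins if it happens before the next check or the timeout.
        let horizon = next_check.min(ready_timeout);
        if matches!(ready_at, Some(t) if t <= horizon) {
            return WaitOutcome::Ready;
        }
        if next_check > ready_timeout {
            return WaitOutcome::TimedOut;
        }
        // Fail fast as soon as the tracker reports an active error.
        if let Some(error) = poll_active_error(next_check) {
            return WaitOutcome::RunnerFailed(error);
        }
        next_check += POLL_INTERVAL;
    }
}
```

With a 30 second timeout and a pool error that appears at 3 seconds, this sketch polls at 1 s and 3 s and then returns `RunnerFailed` instead of waiting out the remaining 27 seconds.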

### Runner Configs API

The runner configs list endpoint enriches serverless configs with current pool errors:

1. For each serverless config, find its error tracker workflow
2. If active error exists, populate `pool_error` field
3. Allows dashboards to show pool health status
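
A minimal sketch of that enrichment pass, assuming active errors have already been collected into a map keyed by config name. The `RunnerConfig` shape, the `RunnerConfigKind` enum, and the `String` error payload are hypothetical; only the `pool_error` field name comes from the doc.

```rust
use std::collections::HashMap;

/// Hypothetical config kinds; only serverless configs have an error tracker.
#[derive(Debug, Clone, PartialEq)]
enum RunnerConfigKind {
    Serverless,
    Other,
}

/// Hypothetical API shape for a runner config list entry.
#[derive(Debug, Clone, PartialEq)]
struct RunnerConfig {
    name: String,
    kind: RunnerConfigKind,
    pool_error: Option<String>,
}

/// Populate `pool_error` on serverless configs that have an active error.
/// `active_errors` stands in for the per-config error tracker lookups.
fn enrich_configs(configs: &mut [RunnerConfig], active_errors: &HashMap<String, String>) {
    for config in configs.iter_mut() {
        if config.kind == RunnerConfigKind::Serverless {
            config.pool_error = active_errors.get(&config.name).cloned();
        }
    }
}
```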
78+
79+
## Error Flow Diagram
80+
81+
```mermaid
82+
---
83+
config:
84+
theme: mc
85+
look: classic
86+
---
87+
flowchart TD
88+
subgraph Sources
89+
AWF[Actor Workflow]
90+
SE[Serverless Endpoint]
91+
end
92+
93+
subgraph Storage
94+
WFS[Workflow State<br/>failure_reason: ActorError]
95+
ET[Error Tracker Workflow<br/>active_error: RunnerPoolError]
96+
end
97+
98+
subgraph Consumers
99+
API[Actor Get API]
100+
Guard[Guard Fail-Fast]
101+
RC[Runner Configs API]
102+
end
103+
104+
AWF -->|NoCapacity<br/>RunnerNoResponse| WFS
105+
SE -->|Connection errors| ET
106+
107+
WFS --> API
108+
ET --> API
109+
API -->|ActorError| Client1[Client]
110+
111+
ET --> Guard
112+
Guard -->|actor_runner_failed| Client2[Client]
113+
114+
ET --> RC
115+
RC -->|pool_error| Dashboard[Dashboard]
116+
```

engine/artifacts/errors/guard.actor_runner_failed.json

Lines changed: 5 additions & 0 deletions
