Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Dec 25, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivetkit-serverless Error Error Jan 6, 2026 2:31am
3 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 6, 2026 2:31am
rivet-inspector Ignored Ignored Preview Jan 6, 2026 2:31am
rivet-site Ignored Ignored Preview Jan 6, 2026 2:31am

Copy link
Member Author

NathanFlurry commented Dec 25, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Dec 25, 2025

PR Review: Auto-drain old runners on version upgrade

Summary

This PR adds an optional drain_on_version_upgrade flag to the Normal runner configuration that automatically stops older runner versions when a new version connects. The implementation spans both runner workflow variants (runner.rs and runner2.rs) with identical logic.


Code Quality & Best Practices

✅ Strengths

  1. Consistent implementation across both runner workflows (runner.rs and runner2.rs)
  2. Backward compatible - uses Option<bool> with #[serde(default)]
  3. Proper schema evolution - correctly updates the vbare schema and all conversion layers
  4. Good separation of concerns - draining logic isolated in dedicated activity
  5. Follows Rivet patterns - uses workflow activities, signals, and UDB correctly
  6. Comment quality - good documentation on the drain_older_versions function

⚠️ Issues Found

1. Code Duplication (High Priority)

The drain_older_versions activity is duplicated identically in both runner.rs and runner2.rs (87 lines each). This violates DRY principles and creates maintenance burden.

Location:

  • engine/packages/pegboard/src/workflows/runner.rs:1189-1247
  • engine/packages/pegboard/src/workflows/runner2.rs:841-899

Recommendation: Extract this activity into a shared module (e.g., workflows/common.rs or activities/runner.rs) that both workflows can import.

2. Missing Structured Logging

The activity uses a tracing span but does not log the operation or its results.

Current:

.custom_instrument(tracing::info_span!("drain_older_versions_tx"))

Recommended: Add structured logging per CLAUDE.md guidelines:

tracing::info!(
    ?namespace_id,
    ?name,
    version,
    older_runner_count = older_runners.len(),
    "draining older runner versions"
);

Add this before returning from the activity to provide observability.

3. Potential Race Condition (Medium Priority)

The drain operation happens after the new runner is registered. There is a window where:

  1. New runner registers (line 135 in runner.rs)
  2. Drain activity fetches config & scans for old runners (lines 138-144)
  3. Stop signals are sent (lines 145-152)

If actors are allocated to old runners between steps 1-2, they may be unnecessarily terminated.

Consider: Should the drain happen before allocating pending actors (line 156), or should we verify old runners are idle before stopping them?

4. Error Handling - Silent Failures

The drain_older_versions activity returns early silently in multiple cases:

  • Config not found (line 1203-1205)
  • Non-Normal runner kind (line 1209-1211)
  • Feature disabled (line 1212-1214)

While returning empty Vec is reasonable, these cases are indistinguishable from "no old runners exist". Consider logging at debug level:

let Some(config) = config.into_iter().next() else {
    tracing::debug!("runner config not found, skipping drain");
    return Ok(vec![]);
};

5. Missing Input Validation

No validation that input.version is reasonable (e.g., not 0, not MAX). While this may be validated upstream, defensive programming suggests checking here too.


Performance Considerations

Database Scan Efficiency

The activity uses StreamingMode::WantAll to scan all runners for a given namespace+name combination (line 1228).

Questions:

  • What is the expected number of concurrent runner versions in practice? (2-3? 10+?)
  • Could this scan be expensive for namespaces with many historical runner versions?

Consideration: If this becomes a performance bottleneck, you could:

  1. Add pagination/limits to prevent unbounded scans
  2. Use a more specific range scan if version numbers are sequential
  3. Cache the runner config check result

Security Concerns

Stop Signal Authorization

The code sends Stop signals to older runners without additional authorization checks. This relies on:

  1. The workflow context having permission to signal other workflows
  2. Runner configs being properly scoped to the namespace

Verify: Are there any scenarios where a malicious runner could trigger draining of legitimate runners? The namespace_id scoping should prevent this, but worth confirming.


Architectural Questions

Why Two Runner Workflows?

The existence of both runner.rs and runner2.rs suggests ongoing migration.

Questions:

  1. Is runner2.rs the new implementation replacing runner.rs?
  2. Should new features only go in one of them?
  3. Is there a plan to deprecate one?

This affects whether code duplication is temporary or needs proper abstraction.

Placement of Drain Logic

In runner.rs, the drain happens inside the Reconcile loop after registration (line 137-152).
In runner2.rs, it happens at the top level after initial connection (line 60-75).

Question: Is this placement difference intentional? Could it cause behavioral differences between the two implementations?


Test Coverage

No tests found for the pegboard package.

Recommended test cases:

  1. ✅ Drain enabled - verify old runners receive Stop signal
  2. ✅ Drain disabled - verify no Stop signals sent
  3. ✅ Non-Normal runner kind - verify graceful handling
  4. ✅ No older runners exist - verify empty result
  5. ✅ Multiple old versions - verify all stopped
  6. ⚠️ Edge case: version wraparound or version 0

Suggestions

Minor Improvements

  1. Add timing metrics to understand drain operation performance:
let _timer = metrics::RUNNER_DRAIN_DURATION
    .with_label_values(&[&namespace_id.to_string()])
    .start_timer();
  1. Document the feature - Add comments explaining when users should enable drain_on_version_upgrade vs manual draining

  2. Consider graceful vs immediate stop - The Stop signal uses reset_actor_rescheduling: false. Should there be a grace period for actors to finish?


Verdict

Overall: Good implementation with room for improvement

The feature works correctly and follows Rivet patterns well. Main concerns are:

  • Code duplication (should be addressed before merge)
  • Missing observability (logging)
  • No test coverage (consider adding)
  • Timing of drain operation (verify correctness)

Action Items

  • High Priority: Extract duplicated drain_older_versions to shared module
  • Medium Priority: Add structured logging
  • Medium Priority: Verify drain timing does not cause actor allocation issues
  • Nice to have: Add test coverage
  • Nice to have: Document when to use this feature

Review generated by Claude Code

@pkg-pr-new
Copy link

pkg-pr-new bot commented Dec 25, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3675

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3675

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3675

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3675

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3675

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3675

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3675

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3675

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3675

commit: f160185

@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from 4f348fe to d9e7133 Compare December 25, 2025 21:34
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless December 25, 2025 21:34 Inactive
@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from d9e7133 to f160185 Compare December 26, 2025 22:24
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless December 26, 2025 22:24 Inactive
@claude
Copy link

claude bot commented Dec 26, 2025

Pull Request Review: feat(engine): optionally automatically drain old runners on new runner version connected

Overview

This PR implements automatic draining of older runner versions when a new version connects. The feature is controlled by a new drain_on_version_upgrade boolean field in the runner configuration. The implementation looks solid with good test coverage.


✅ Strengths

  1. Well-tested: Comprehensive test suite covering:

    • Normal and serverless runners
    • Feature enabled and disabled scenarios
    • Multiple older versions being drained simultaneously
    • 464 lines of new test code with good coverage
  2. Metrics integration: Added RUNNER_VERSION_UPGRADE_DRAIN counter for observability (metrics.rs:19-24)

  3. Backwards compatible: Defaults to false for the new field, maintaining existing behavior

  4. Schema versioning: Properly implemented v3 schema with migration from v2 (namespace_runner_config.rs:129-163)

  5. Workflow integration: Cleanly integrated into the runner2 workflow lifecycle (runner2.rs:66-81)


🔍 Issues Found

1. Duplicate drain logic in workflow (runner2.rs:66-81)

Severity: Medium

The drain operation is executed in both the workflow activity AND within the operation itself:

// In workflow (runner2.rs:66-81)
let drain_result = ctx
    .activity(DrainOlderVersionsInput { ... })
    .await?;
for workflow_id in drain_result.older_runner_workflow_ids {
    ctx.signal(Stop { ... })
        .to_workflow_id(workflow_id)
        .send()
        .await?;
}

// In operation (drain.rs:100-108)
if input.send_runner_stop_signals {
    for workflow_id in &older_runners {
        ctx.signal(crate::workflows::runner2::Stop { ... })
            .to_workflow_id(*workflow_id)
            .send()
            .await?;
    }
}

This creates redundant signal sending. The workflow receives the list of workflow IDs and sends Stop signals, but the operation also sends signals if send_runner_stop_signals is true. Looking at the activity call, send_runner_stop_signals isn't set, so it defaults based on the Input struct definition (drain.rs:11-18).

Recommendation: Either:

  • Remove signal sending from the workflow and always set send_runner_stop_signals: true in the activity
  • Remove signal sending from the operation and keep it in the workflow
  • Document clearly why both exist (if intentional for different use cases)

The current code at drain.rs:100 sends signals when send_runner_stop_signals is true, but I don't see where this parameter is set in the workflow call.


2. Missing workflow input parameter (runner2.rs:67-72)

Severity: Low

The DrainOlderVersionsInput activity doesn't pass send_runner_stop_signals:

ctx.activity(DrainOlderVersionsInput {
    namespace_id: input.namespace_id,
    name: input.name.clone(),
    version: input.version,
})

But Input struct in drain.rs:11-18 has this field. This means it will use the default value, but there's no explicit control.

Recommendation: Add explicit send_runner_stop_signals: false to make the intent clear, since the workflow handles signaling.


3. Hard-coded test values (api_runner_configs_*.rs)

Severity: Low

All test files now have drain_on_version_upgrade: true hard-coded, even when the test isn't about this feature:

// api_runner_configs_list.rs:44, 99, 152, 158, etc.
drain_on_version_upgrade: true,

These tests are about listing/upserting configs, not about drain behavior.

Recommendation: Use false or the default for tests unrelated to the drain feature. This makes tests clearer and reduces noise in diffs.


4. Inconsistent protocol version handling (test_runner/protocol.rs:6-18)

Severity: Low

The test runner protocol helper now uses MK2, but the comment and function naming could be clearer:

pub const PROTOCOL_VERSION: u16 = rp::PROTOCOL_MK2_VERSION;

/// Helper to decode messages from server (MK2)
pub fn decode_to_client(buf: &[u8], protocol_version: u16) -> Result<rp2::ToClient> {

Good that it's marked as MK2, but the protocol_version parameter is still accepted but not used in validation.

Recommendation: Consider adding a protocol version check or document why it's safe to accept any version.


5. Runner wait time increase (test_runner/runner.rs:263-269)

Severity: Info

Added a 2-second sleep after runner becomes ready:

// In MK2, we need to wait for the workflow to process the Init signal
// and mark the runner as eligible for actor allocation.
// This can take some time due to workflow processing:
// 1. Workflow receives Init signal
// 2. Workflow executes MarkEligible activity
// 3. Database is updated with runner allocation index
tokio::time::sleep(Duration::from_millis(2000)).await;

This is well-commented but seems like a test-only workaround. In production, is there a better way to wait for eligibility status rather than a fixed sleep?

Recommendation: Consider if this reveals a gap in the API (e.g., missing ready/eligible status that could be polled).


🎯 Performance Considerations

  1. Database scan for older runners (drain.rs:51-81)

    • Uses StreamingMode::WantAll to scan RunnerAllocIdxKey
    • This is fine for moderate numbers of runners
    • Concern: Could become slow if a single namespace+name has many runners
    • Mitigation: The scan is scoped to namespace+name, which should limit results
  2. Sequential signal sending (drain.rs:101-108, runner2.rs:74-80)

    • Signals are sent sequentially in a loop
    • Impact: O(n) latency where n = number of old runners
    • Recommendation: Consider sending signals concurrently using join_all or similar

🔒 Security Considerations

✅ No security issues identified. The feature:

  • Respects runner config permissions (only affects runners in the same namespace)
  • Uses existing authentication/authorization
  • Doesn't expose sensitive data

📚 Documentation & Code Quality

  1. Good:

    • Clear comments in test files explaining what's being tested
    • Tracing logs with appropriate context (drain.rs:84-90)
    • Helper functions for test wait logic are well-documented
  2. Could improve:

    • No user-facing documentation about when to enable drain_on_version_upgrade
    • Missing API documentation about the trade-offs (e.g., temporary reduced capacity)

🧪 Test Coverage

Excellent coverage with 4 comprehensive test scenarios:

  • ✅ Normal runner with drain enabled
  • ✅ Normal runner with drain disabled
  • ✅ Serverless runner with drain enabled
  • ✅ Multiple older versions being drained

Missing edge cases:

  1. What happens if config is changed while runners are running?
  2. Race condition: What if v2 and v3 connect simultaneously?
  3. What if a runner fails to drain (never stops)?

🐛 TypeScript Changes

The TypeScript changes look minimal and correct:

  • actor-driver.ts:22-23: Simplified protocol import ✅
  • runner.ts:24-25: Added drainOnVersionUpgrade?: boolean field ✅
  • env-vars.ts: Added utility for env var handling ✅

No issues found in TS code.


📋 Recommendations Summary

Must Fix:

  1. ❗ Clarify duplicate signal sending in workflow vs operation (Issue [SVC-2555] Set up issue templates #1)
  2. Add explicit send_runner_stop_signals: false parameter (Issue [SVC-2479] Send cluster events to PostHog #2)

Should Fix:
3. Use drain_on_version_upgrade: false in unrelated tests (Issue #3)
4. Consider concurrent signal sending for performance (Performance #2)

Nice to Have:
5. Add documentation about when to enable this feature
6. Add edge case tests (race conditions, config changes)
7. Consider making runner eligibility check more deterministic (Issue #5)


✅ Verdict

This is a well-implemented feature with strong test coverage. The main concern is the potential duplicate signal sending between the workflow and operation. Once that's clarified/fixed, this should be good to merge.

Estimated risk: Low (mostly opt-in feature with good defaults)
Code quality: High
Test coverage: Excellent

@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from f160185 to 5d41ee6 Compare January 5, 2026 18:44
@NathanFlurry NathanFlurry marked this pull request as ready for review January 5, 2026 19:16
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 6, 2026

Merge activity

@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from e4476fd to 69ed9ac Compare January 6, 2026 02:30
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 99737e0 to 47a8179 Compare January 6, 2026 02:30
graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026
@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch January 6, 2026 02:32
@claude
Copy link

claude bot commented Jan 6, 2026

Pull Request Review: Automatic Runner Draining on Version Upgrade

Overview

This PR adds an optional drain_on_version_upgrade flag to runner configurations that automatically drains older runner versions when a new version connects.

✅ Strengths

  1. Well-Tested: Comprehensive test coverage with 6 test scenarios
  2. Backwards Compatible: Defaults to false
  3. Good Observability: Added metrics tracking
  4. Clean Integration: Properly integrated into workflow lifecycle
  5. Schema Evolution: Correctly added new field to v3 schema

🔍 Code Quality Issues

1. Missing Import Statements (Critical)

In runner_drain_on_version.rs: The test file uses items that aren't explicitly imported (relying on glob imports from common module). This violates CLAUDE.md guidelines about adding imports at the top of files.

2. Inconsistent Error Handling

In drain.rs:37-42: Silent fallback when config doesn't exist. Consider logging when config is missing.

3. Potential Race Condition

In runner2.rs:68-82: The workflow doesn't wait for drained runners to actually stop or for their actors to be rescheduled before allocating new actors. This could lead to temporary orphaned actors.

4. Test Reliability Concern

In runner.rs:264-269: Hardcoded 2-second sleep is anti-pattern. Use proper polling instead.

🔒 Security

No critical security issues identified.

⚡ Performance

  1. Database Scan Overhead: Scans all runners on every new connection. Low impact for typical use, could be noticeable with thousands of runners.
  2. Sequential Signal Sending: Consider making concurrent for better performance.

📊 Test Coverage

Excellent coverage with 6 scenarios. Missing edge cases: simultaneous connections, active drain conflicts, config updates during lifecycle.

🐛 Potential Bugs

  1. Test Cleanup: Multiple tests don't shutdown all runners, causing resource leaks
  2. Missing Error Context: Signal sending errors lack runner ID context

🎯 Recommendations

High Priority:

  1. Add explicit imports per CLAUDE.md
  2. Fix test cleanup
  3. Add error context to signal loop
  4. Replace hardcoded sleep with polling

Medium Priority:
5. Log when config is missing
6. Document or fix race condition
7. Make signal sending concurrent
8. Add OpenAPI description

Low Priority:
9. Add edge case tests
10. Monitor scan performance

Summary

Well-implemented feature with good test coverage. Main issues are code style violations and test reliability concerns. Functionally sound but would benefit from improvements before merging.


Review generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants