
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Dec 25, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Dec 26, 2025 10:24pm
rivet-inspector Ignored Ignored Preview Dec 26, 2025 10:24pm
rivet-site Ignored Ignored Preview Dec 26, 2025 10:24pm
rivetkit-serverless Skipped Skipped Dec 26, 2025 10:24pm

Member Author


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Dec 25, 2025

PR Review: Auto-drain old runners on version upgrade

Summary

This PR adds an optional drain_on_version_upgrade flag to the Normal runner configuration that automatically stops older runner versions when a new version connects. The implementation spans both runner workflow variants (runner.rs and runner2.rs) with identical logic.


Code Quality & Best Practices

✅ Strengths

  1. Consistent implementation across both runner workflows (runner.rs and runner2.rs)
  2. Backward compatible - uses Option<bool> with #[serde(default)]
  3. Proper schema evolution - correctly updates the vbare schema and all conversion layers
  4. Good separation of concerns - draining logic isolated in dedicated activity
  5. Follows Rivet patterns - uses workflow activities, signals, and UDB correctly
  6. Comment quality - good documentation on the drain_older_versions function

⚠️ Issues Found

1. Code Duplication (High Priority)

The drain_older_versions activity is duplicated identically in both runner.rs and runner2.rs (87 lines each). This violates DRY principles and creates maintenance burden.

Location:

  • engine/packages/pegboard/src/workflows/runner.rs:1189-1247
  • engine/packages/pegboard/src/workflows/runner2.rs:841-899

Recommendation: Extract this activity into a shared module (e.g., workflows/common.rs or activities/runner.rs) that both workflows can import.
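
A minimal sketch of the extraction, assuming a hypothetical `common` module and a simplified `RunnerEntry` type (neither is taken from the PR). The real activity is async and reads from UDB; only the shared-helper shape is shown here:

```rust
// Hypothetical shared module, e.g. workflows/common.rs. All names and types
// here are illustrative and not taken from the PR.
pub mod common {
    /// Simplified stand-in for a runner entry found during the scan.
    #[derive(Debug, Clone, PartialEq)]
    pub struct RunnerEntry {
        pub workflow_id: u64,
        pub version: u32,
    }

    /// Returns the workflow IDs of runners strictly older than `current_version`.
    pub fn older_runner_workflow_ids(runners: &[RunnerEntry], current_version: u32) -> Vec<u64> {
        runners
            .iter()
            .filter(|r| r.version < current_version)
            .map(|r| r.workflow_id)
            .collect()
    }
}

// Stand-in for the call site in runner.rs.
pub fn runner_workflow_drain(runners: &[common::RunnerEntry], version: u32) -> Vec<u64> {
    common::older_runner_workflow_ids(runners, version)
}

// Stand-in for the call site in runner2.rs; it reuses the same helper.
pub fn runner2_workflow_drain(runners: &[common::RunnerEntry], version: u32) -> Vec<u64> {
    common::older_runner_workflow_ids(runners, version)
}
```

With a single helper, a fix to the filtering logic lands in both workflows at once.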

2. Missing Structured Logging

The activity uses a tracing span but does not log the operation or its results.

Current:

.custom_instrument(tracing::info_span!("drain_older_versions_tx"))

Recommended: Add structured logging per CLAUDE.md guidelines:

tracing::info!(
    ?namespace_id,
    ?name,
    version,
    older_runner_count = older_runners.len(),
    "draining older runner versions"
);

Add this before returning from the activity to provide observability.

3. Potential Race Condition (Medium Priority)

The drain operation happens after the new runner is registered. There is a window where:

  1. New runner registers (line 135 in runner.rs)
  2. Drain activity fetches config & scans for old runners (lines 138-144)
  3. Stop signals are sent (lines 145-152)

If actors are allocated to old runners between steps 1-2, they may be unnecessarily terminated.

Consider: Should the drain happen before allocating pending actors (line 156), or should we verify old runners are idle before stopping them?

4. Error Handling - Silent Failures

The drain_older_versions activity returns early silently in multiple cases:

  • Config not found (lines 1203-1205)
  • Non-Normal runner kind (lines 1209-1211)
  • Feature disabled (lines 1212-1214)

While returning an empty Vec is reasonable, these cases are indistinguishable from "no old runners exist". Consider logging at debug level:

let Some(config) = config.into_iter().next() else {
    tracing::debug!("runner config not found, skipping drain");
    return Ok(vec![]);
};
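
Beyond logging, the return type itself could make the skip cases distinguishable. A rough sketch, assuming hypothetical `SkipReason`, `DrainOutcome`, and a simplified `RunnerKind` (the PR's actual config types are richer than this):

```rust
/// Why the drain was skipped. Variant names are hypothetical.
#[derive(Debug, PartialEq)]
pub enum SkipReason {
    ConfigNotFound,
    NotNormalRunner,
    FeatureDisabled,
}

/// Result of the drain decision: either skipped with a reason, or a list of
/// workflow IDs to stop (which may legitimately be empty).
#[derive(Debug, PartialEq)]
pub enum DrainOutcome {
    Skipped(SkipReason),
    Drained(Vec<u64>),
}

/// Simplified stand-in for the runner config kinds.
pub enum RunnerKind {
    Normal { drain_on_version_upgrade: Option<bool> },
    Serverless,
}

pub fn decide_drain(config: Option<&RunnerKind>, older: Vec<u64>) -> DrainOutcome {
    match config {
        None => DrainOutcome::Skipped(SkipReason::ConfigNotFound),
        Some(RunnerKind::Serverless) => DrainOutcome::Skipped(SkipReason::NotNormalRunner),
        Some(RunnerKind::Normal { drain_on_version_upgrade }) => {
            // A missing flag is treated as disabled, matching unwrap_or(false).
            if drain_on_version_upgrade.unwrap_or(false) {
                DrainOutcome::Drained(older)
            } else {
                DrainOutcome::Skipped(SkipReason::FeatureDisabled)
            }
        }
    }
}
```

`Drained(vec![])` then means "enabled, but nothing to drain", which is no longer confused with the three skip paths.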

5. Missing Input Validation

No validation that input.version is reasonable (e.g., not 0, not MAX). While this may be validated upstream, defensive programming suggests checking here too.
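
A defensive check could be as small as the sketch below. It assumes the version is a `u32` and that 0 and `u32::MAX` are never valid runner versions; both are assumptions, not facts from the PR, and `validate_version` and `VersionError` are hypothetical names:

```rust
#[derive(Debug, PartialEq)]
pub enum VersionError {
    Zero,
    Max,
}

/// Rejects sentinel-like values. Whether 0 and u32::MAX are actually invalid
/// runner versions is an assumption; the real type and bounds may differ.
pub fn validate_version(version: u32) -> Result<u32, VersionError> {
    match version {
        0 => Err(VersionError::Zero),
        u32::MAX => Err(VersionError::Max),
        v => Ok(v),
    }
}
```

If version 0 is a legitimate first version in practice, the `Zero` arm should be dropped.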


Performance Considerations

Database Scan Efficiency

The activity uses StreamingMode::WantAll to scan all runners for a given namespace+name combination (line 1228).

Questions:

  • What is the expected number of concurrent runner versions in practice? (2-3? 10+?)
  • Could this scan be expensive for namespaces with many historical runner versions?

Consideration: If this becomes a performance bottleneck, you could:

  1. Add pagination/limits to prevent unbounded scans
  2. Use a more specific range scan if version numbers are sequential
  3. Cache the runner config check result
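
To illustrate option 1, here is a sketch of a bounded scan over an in-memory iterator. The `(workflow_id, version)` tuple shape, the `limit` parameter, and the function name are hypothetical; the real scan streams from UDB:

```rust
/// Collects at most `limit` older runners from a scan and reports whether
/// any older runners remained beyond the limit.
pub fn bounded_older_runners<I>(scan: I, current_version: u32, limit: usize) -> (Vec<u64>, bool)
where
    I: IntoIterator<Item = (u64, u32)>, // (workflow_id, version), hypothetical shape
{
    let mut older = scan
        .into_iter()
        .filter(|(_, version)| *version < current_version);
    // Take one batch, then peek for one more entry to detect truncation.
    let batch: Vec<u64> = older.by_ref().take(limit).map(|(id, _)| id).collect();
    let has_more = older.next().is_some();
    (batch, has_more)
}
```

The `has_more` flag lets the caller decide whether to schedule another pass instead of silently leaving old runners behind.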

Security Concerns

Stop Signal Authorization

The code sends Stop signals to older runners without additional authorization checks. This relies on:

  1. The workflow context having permission to signal other workflows
  2. Runner configs being properly scoped to the namespace

Verify: Are there any scenarios where a malicious runner could trigger draining of legitimate runners? The namespace_id scoping should prevent this, but worth confirming.


Architectural Questions

Why Two Runner Workflows?

The existence of both runner.rs and runner2.rs suggests ongoing migration.

Questions:

  1. Is runner2.rs the new implementation replacing runner.rs?
  2. Should new features only go in one of them?
  3. Is there a plan to deprecate one?

This affects whether code duplication is temporary or needs proper abstraction.

Placement of Drain Logic

In runner.rs, the drain happens inside the Reconcile loop after registration (lines 137-152).
In runner2.rs, it happens at the top level after initial connection (lines 60-75).

Question: Is this placement difference intentional? Could it cause behavioral differences between the two implementations?


Test Coverage

No tests found for the pegboard package.

Recommended test cases:

  1. ✅ Drain enabled - verify old runners receive Stop signal
  2. ✅ Drain disabled - verify no Stop signals sent
  3. ✅ Non-Normal runner kind - verify graceful handling
  4. ✅ No older runners exist - verify empty result
  5. ✅ Multiple old versions - verify all stopped
  6. ⚠️ Edge case: version wraparound or version 0

Suggestions

Minor Improvements

  1. Add timing metrics to understand drain operation performance:
let _timer = metrics::RUNNER_DRAIN_DURATION
    .with_label_values(&[&namespace_id.to_string()])
    .start_timer();
  2. Document the feature - Add comments explaining when users should enable drain_on_version_upgrade vs manual draining

  3. Consider graceful vs immediate stop - The Stop signal uses reset_actor_rescheduling: false. Should there be a grace period for actors to finish?


Verdict

Overall: Good implementation with room for improvement

The feature works correctly and follows Rivet patterns well. Main concerns are:

  • Code duplication (should be addressed before merge)
  • Missing observability (logging)
  • No test coverage (consider adding)
  • Timing of drain operation (verify correctness)

Action Items

  • High Priority: Extract duplicated drain_older_versions to shared module
  • Medium Priority: Add structured logging
  • Medium Priority: Verify drain timing does not cause actor allocation issues
  • Nice to have: Add test coverage
  • Nice to have: Document when to use this feature

Review generated by Claude Code

@pkg-pr-new

pkg-pr-new bot commented Dec 25, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3675

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3675

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3675

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3675

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3675

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3675

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3675

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3675

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3675

commit: f160185

@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from 4f348fe to d9e7133 Compare December 25, 2025 21:34
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless December 25, 2025 21:34 Inactive
@NathanFlurry NathanFlurry force-pushed the 12-24-feat_engine_optionally_automatically_drain_old_runners_on_new_runner_version_connected branch from d9e7133 to f160185 Compare December 26, 2025 22:24
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless December 26, 2025 22:24 Inactive
@claude

claude bot commented Dec 26, 2025

PR Review: Auto-drain runners on version upgrade

Overview

This PR adds an optional drain_on_version_upgrade flag to the Normal runner config that automatically stops older runner versions when a new version connects. The implementation is duplicated across both runner.rs and runner2.rs workflows.


Code Quality & Best Practices

Good Patterns

  • Clean separation with dedicated DrainOlderVersionsInput struct and activity
  • Proper error handling with early returns
  • Consistent code style matching the codebase conventions
  • Good use of tracing instrumentation with tracing::info_span! (note that a span alone does not emit log events; see Logging Gaps below)
  • Appropriate use of Option<bool> with unwrap_or(false) for backward compatibility

Concerns

1. Code Duplication
The drain_older_versions activity is duplicated identically in both runner.rs (lines 1179-1247) and runner2.rs (lines 831-899). This violates DRY principles and creates maintenance burden.

Recommendation: Extract this into a shared module or utility function that both workflows can use.

2. Missing Error Handling Context
In runner.rs:145-152 and runner2.rs:67-74, if sending the Stop signal fails for any workflow_id, the error propagates up but we lose context about which runners successfully received the stop signal and which failed.

Recommendation: Consider whether this should log errors and continue (best effort), collect failures and return them, or use a fire-and-forget pattern for non-critical signals.
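
As an illustration of the "collect failures and return them" option, here is a sketch with a hypothetical `StopReport` type. The `send_stop` closure stands in for the real signal call, which is async inside the workflow:

```rust
/// Outcome of a best-effort drain: which runners were signalled and which failed.
#[derive(Debug, Default, PartialEq)]
pub struct StopReport {
    pub stopped: Vec<u64>,
    pub failed: Vec<(u64, String)>,
}

/// `send_stop` is a stand-in for the real (async) signal call.
pub fn stop_all<F>(workflow_ids: &[u64], mut send_stop: F) -> StopReport
where
    F: FnMut(u64) -> Result<(), String>,
{
    let mut report = StopReport::default();
    for &id in workflow_ids {
        match send_stop(id) {
            Ok(()) => report.stopped.push(id),
            // Record the failure and keep going instead of aborting the loop.
            Err(err) => report.failed.push((id, err)),
        }
    }
    report
}
```

The caller can then log `report.failed` or retry only those IDs, rather than losing track of which runners were already signalled.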

3. Database Scan Performance
The implementation scans all runners in the subspace using StreamingMode::WantAll. For namespaces with many runners across versions, this could be expensive.

Questions:

  • Is there an index or better query pattern available?
  • Should there be a limit on the number of runners to drain?
  • What happens if there are hundreds of old runners?

Potential Bugs & Issues

Critical: Race Condition Potential

Issue: The drain operation happens AFTER InsertDbInput completes (runner.rs:137-152). There is a window where:

  1. New runner inserts itself into DB
  2. Drain activity starts scanning
  3. Old runners might allocate new actors before receiving Stop signal
  4. Stop signal arrives but actors are already scheduled

Recommendation: Consider if the drain should happen before insertion (to prevent race) or accept the race as tolerable (actors will just reschedule when old runner drains).

Signal Delivery Not Guaranteed

The code sends Stop signals but does not verify they were received or acted upon.

Questions:

  • What if a workflow_id no longer exists?
  • Should this use .graceful_not_found() like in runner.rs:277?
  • Is there confirmation that runners actually stopped?
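
The intent behind tolerating a missing workflow can be sketched as below. This does not reproduce the real `.graceful_not_found()` helper or its signature; `SignalError` and `tolerate_not_found` are hypothetical names used only to show the idea:

```rust
#[derive(Debug, PartialEq)]
pub enum SignalError {
    /// The target workflow no longer exists.
    NotFound,
    Other(String),
}

/// Treats a missing workflow as already stopped. Returns Ok(true) if the
/// signal was delivered and Ok(false) if the target was gone.
pub fn tolerate_not_found(result: Result<(), SignalError>) -> Result<bool, SignalError> {
    match result {
        Ok(()) => Ok(true),
        Err(SignalError::NotFound) => Ok(false),
        Err(other) => Err(other),
    }
}
```

A runner that has already exited then no longer fails the whole drain, while genuine signalling errors still propagate.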

Version Comparison Edge Case

The version comparison if key.version < input.version (runner.rs:1237) assumes:

  • Versions always increment
  • No version rollbacks

Equal versions are skipped, which is correct but worth documenting.
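
The consequences of the strict comparison quoted above are easiest to see in isolation. A small sketch, assuming `u32` versions and a hypothetical `should_drain` helper:

```rust
/// Mirrors the strict `<` comparison from the review. The u32 type and the
/// function name are assumptions for illustration.
pub fn should_drain(existing_version: u32, connecting_version: u32) -> bool {
    existing_version < connecting_version
}
```

Under this rule, `should_drain(1, 2)` is true, `should_drain(2, 2)` is false (runners on the same version are left alone), and `should_drain(3, 2)` is false. The last case means that after a rollback from version 3 to version 2, the version 3 runners are not drained automatically.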

Performance Considerations

1. Database Transaction Scope
The drain operation runs a full database transaction scan on every new runner connection. For high-churn environments, this could impact performance.

2. Serial Signal Sending
The Stop signals are sent sequentially in a loop. They could be sent in parallel if signal sending is I/O bound, though the sequential approach is simpler and keeps the signals ordered.
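
For a sense of what parallel sending might look like, here is a sketch using scoped threads from the standard library. The real workflow is async, so the actual analogue would be joining futures instead; `stop_all_parallel` and the `send_stop` closure are hypothetical stand-ins:

```rust
/// Sends stop signals concurrently, one scoped thread per runner. Results are
/// returned in the same order as `workflow_ids`.
pub fn stop_all_parallel<F>(workflow_ids: &[u64], send_stop: F) -> Vec<Result<(), String>>
where
    F: Fn(u64) -> Result<(), String> + Sync,
{
    std::thread::scope(|scope| {
        // Share the closure by reference so every thread can call it.
        let send_stop = &send_stop;
        let handles: Vec<_> = workflow_ids
            .iter()
            .map(|&id| scope.spawn(move || send_stop(id)))
            .collect();
        // Join in spawn order so results line up with the input slice.
        handles
            .into_iter()
            .map(|h| h.join().expect("stop signal thread panicked"))
            .collect()
    })
}
```

One thread per runner is only reasonable for small counts; with hundreds of old runners a bounded pool or batched sends would be safer.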

3. Cache Bypass Consideration
The code uses bypass_cache: false, so the runner config may be read from cache. Consider whether a stale cached config could cause incorrect behavior, for example skipping the drain shortly after drain_on_version_upgrade is enabled.


Security Concerns

No obvious security issues identified. The feature:

  • Operates within namespace boundaries
  • Uses existing authentication/authorization
  • Does not expose sensitive data
  • Follows existing signal patterns

Test Coverage

Major Gap: No tests included

This feature needs tests for:

  1. Happy path: New runner connects, old runners are drained
  2. Disabled feature: drain_on_version_upgrade: false or None - no draining occurs
  3. No older runners: New runner is first version - no-op
  4. Multiple old versions: Drains all runners with version < current
  5. Serverless config: Verify serverless runners are not affected
  6. Race conditions: Verify behavior when actors are being allocated during drain
  7. Signal failure: What happens when Stop signal fails?
  8. Empty config: Behavior when runner config does not exist

Additional Issues

1. Missing Documentation

  • No comments explaining when/why to use drain_on_version_upgrade
  • No migration guide for existing deployments
  • Schema change (namespace.runner_config.v2.bare) lacks inline documentation

2. Logging Gaps
No logging when draining occurs. Should add structured logging to track namespace_id, name, current_version, and count of older_runners being drained.

3. Metrics
Consider adding metrics for:

  • Number of runners drained per version upgrade
  • Time taken to drain
  • Failures to send stop signals

Recommendations Summary

High Priority

  1. Add comprehensive tests - Critical for feature reliability
  2. Investigate race condition - Determine if acceptable or needs fixing
  3. Add graceful error handling - Use .graceful_not_found() for stop signals
  4. Eliminate code duplication - Extract shared drain logic

Medium Priority

  1. Add logging - Track when draining occurs and results
  2. Document the feature - Explain use cases and behavior
  3. Consider metrics - Track drain operations for monitoring

Low Priority

  1. Performance review - Evaluate scan performance at scale
  2. Consider parallelization - For stop signal sending if beneficial

Verdict

The implementation is functionally sound but needs tests, better error handling, and should address the code duplication. The race condition should be evaluated to determine if it is acceptable or requires mitigation.

Status: Needs work before merge

  • Add tests (blocking)
  • Fix code duplication (blocking)
  • Add graceful error handling (recommended)
  • Consider race condition implications (recommended)

2 participants