Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Dec 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivetkit-serverless Error Error Jan 6, 2026 1:34am
3 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 6, 2026 1:34am
rivet-inspector Ignored Ignored Preview Jan 6, 2026 1:34am
rivet-site Ignored Ignored Preview Jan 6, 2026 1:34am

Copy link
Contributor Author

MasterPtato commented Dec 17, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Dec 17, 2025

PR Review: Add Workflow and Guard Metrics

Overview

This PR improves observability by adding new metrics for workflow loops and guard proxy requests, while also cleaning up unused response handling code in the guard proxy service.


Code Quality & Best Practices ✅

Good practices observed:

  • Following existing metric naming conventions with _total suffix (api-builder/src/metrics.rs:27)
  • Using structured logging patterns correctly
  • Proper use of workspace dependencies
  • Maintaining hard tab formatting

Detailed Findings

1. Workflow Loop Metrics (gasoline/src/builder/workflow/lupe.rs)

Strengths:

  • Smart optimization: Moved timing logic to only measure non-replay iterations (lines 189-192)
  • Good parallelization with deferred DB operations
  • Clear separation of concerns between loop commit and iteration duration

Potential Issues:

🔴 Critical: Incorrect metric recording on first iteration

// Line 189-192
if !loop_branch.cursor().compare_loop_branch(iteration)? {
    // ...
    metrics::LOOP_ITERATION_DURATION
        .with_label_values(&[&ctx.name().to_string()])
        .observe(iteration_dt.as_secs_f64());
}

Problem: On the first iteration (iteration=0), iteration_dt is initialized to Duration::ZERO (line 120), so the first iteration's duration won't be recorded accurately. The duration from the callback (cb_dt) is only saved to iteration_dt at line 211, but by then we've already recorded the metric with the stale value.

Fix: Record the metric using cb_dt directly instead of iteration_dt:

metrics::LOOP_ITERATION_DURATION
    .with_label_values(&[&ctx.name().to_string()])
    .observe(cb_dt.as_secs_f64());

🟡 Minor: Metric description clarity

  • Line in gasoline/src/metrics.rs:183: Description says "Total duration of a single loop iteration" but it's actually measuring the callback execution time, not the full iteration overhead
  • Consider: "Duration of loop iteration callback execution" for accuracy

2. Guard In-Flight Request Tracking (guard-core/src/proxy_service.rs)

Strengths:

  • Good use of atomic operations with or_insert_with
  • Proper cleanup on request completion

Issues:

🟡 Performance: Redundant metric updates

// Lines 641-642 and 622
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_requests.entry_count() as i64);

Problem: Calling entry_count() on every request ID generation/release can be expensive for high-throughput services since moka::Cache needs to count entries.

Recommendation:

  • Use an AtomicI64 counter that increments/decrements instead of calling entry_count()
  • This gives O(1) operations instead of potentially O(n)
pub struct ProxyState {
    // ... existing fields ...
    in_flight_request_count: AtomicI64,
}

// On insert:
self.in_flight_request_count.fetch_add(1, Ordering::Relaxed);
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_request_count.load(Ordering::Relaxed));

// On release:
self.in_flight_request_count.fetch_sub(1, Ordering::Relaxed);
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_request_count.load(Ordering::Relaxed));

🟢 Good comment addition (line 317): The clarification about using moka::Cache for TTL/capacity is helpful

3. Dead Code Removal (guard-core/src/proxy_service.rs)

Excellent cleanup:

  • Removed 115 lines of unused StructuredResponse code
  • Eliminated unreachable branches in WebSocket handling
  • Simplified RoutingOutput enum

No issues found - this is a net positive for maintainability

4. API Error Metric Naming (api-builder/src/metrics.rs)

Good fix: Changed api_request_errorsapi_request_errors_total (line 27)

  • Follows Prometheus naming conventions for counters
  • More consistent with existing metrics

Security Considerations ✅

No security concerns identified. The changes:

  • Don't introduce new attack surfaces
  • Maintain existing rate limiting and in-flight controls
  • Don't expose sensitive data in metrics

Performance Considerations

⚠️ Moderate impact:

  1. The entry_count() calls could impact high-throughput scenarios (see recommendation above)
  2. Loop metrics add minimal overhead (just Instant::now() calls)
  3. Overall performance impact is acceptable for the observability gains

Test Coverage

⚠️ Missing tests:

  • No tests added for new metrics
  • Consider adding integration tests that verify:
    • LOOP_COMMIT_DURATION is recorded correctly
    • LOOP_ITERATION_DURATION accurately measures callback time
    • IN_FLIGHT_REQUEST_COUNT increments/decrements properly

Recommendation: Add a test like:

#[tokio::test]
async fn test_loop_metrics_recording() {
    // Setup workflow with instrumented loop
    // Assert metrics are recorded with expected labels
}

Summary

Must Fix (Critical):

  • ❌ First iteration duration recording bug in lupe.rs:189-192

Should Fix (Recommended):

  • ⚠️ Use atomic counter instead of entry_count() for performance
  • ⚠️ Add test coverage for new metrics
  • ⚠️ Clarify metric description for LOOP_ITERATION_DURATION

Overall Assessment: Good observability improvements with one critical bug that needs fixing before merge. The dead code removal is excellent housekeeping.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Dec 17, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3657

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3657

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3657

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3657

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3657

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3657

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3657

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3657

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3657

commit: e258ce3

@MasterPtato MasterPtato force-pushed the 12-17-fix_add_wf_and_guard_metrics branch from e258ce3 to a4c59b6 Compare January 6, 2026 01:33
@MasterPtato MasterPtato force-pushed the 12-16-fix_revert_otel_metrics_to_prometheus branch from 5a3e37b to 16a7519 Compare January 6, 2026 01:33
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 6, 2026

Merge activity

  • Jan 6, 1:36 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 6, 1:37 AM UTC: CI is running for this pull request on a draft pull request (#3753) due to your merge queue CI optimization settings.
  • Jan 6, 1:37 AM UTC: Merged by the Graphite merge queue via draft PR: #3753.

graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026
@claude
Copy link

claude bot commented Jan 6, 2026

PR Review: Add wf and guard metrics

Overview

This PR adds improved metrics instrumentation for workflow loops and guard proxy requests. It also removes unused StructuredResponse functionality from the guard-core proxy service.


🚨 Critical Issues

1. Breaking Change - Test Failure

File: engine/packages/guard-core/tests/streaming_response.rs

The PR removes StructuredResponse and RoutingOutput::Response from proxy_service.rs, but the test file at engine/packages/guard-core/tests/streaming_response.rs:255-260 still references these removed types. This will cause compilation failures.

Recommendation: Update the test to use an alternative approach (e.g., return a different route or use CustomServe) or remove the test if it's no longer relevant.


⚠️ Issues to Address

2. Potential Timing Accuracy Issue

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:189-192

The iteration duration metric is now only recorded for non-replay iterations. Is this intentional? During workflow replay (recovery), iteration metrics won't be recorded - only fresh execution times are captured.

Recommendation: Add a comment explaining why replay iterations are excluded from metrics, or consider adding a separate metric for replay vs. fresh iterations if that distinction is valuable.


3. Loop Commit Metric Placement

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:168-174

The LOOP_COMMIT_DURATION metric is only recorded when there's an upsert (every N iterations based on commit interval), not on every iteration. The metric name suggests it measures "a single loop commit" but it's actually measuring periodic batch commits.

Recommendation: Either rename to LOOP_BATCH_COMMIT_DURATION to clarify it's measuring batched commits, or add a comment explaining this metric is only recorded every N iterations.


4. Loop Iteration Timing Logic

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:197-204

The timing logic measures callback execution time and stores it in iteration_dt for use in the NEXT iteration. This means the metric records the duration from the previous iteration, not the current one. The first iteration will record Duration::ZERO.

Question: Is this off-by-one iteration timing intentional (e.g., to avoid recording replay iterations)?

Recommendation: If intentional, add a comment explaining the logic. Otherwise, consider recording the metric immediately after the callback completes.


5. Metric Description Clarity

File: engine/packages/gasoline/src/metrics.rs:183

The description changed from "Total duration of a single loop iteration (excluding its body)" to "Total duration of a single loop iteration."

Looking at the code, iteration_dt captures only the callback execution time, so the metric seems to measure the callback duration, not the full iteration overhead including branch commits.

Recommendation: Clarify whether this measures just the callback/body duration, or the full iteration including branch commits and other overhead.


📝 Minor Suggestions

6. In-Flight Request Race Condition

File: engine/packages/guard-core/src/proxy_service.rs:621-622

The metric update happens after invalidate, which could create a brief window where the metric is stale. Consider ensuring atomic updates.


✅ Positive Observations

  1. Prometheus Best Practices: The _total suffix addition follows Prometheus conventions ✅
  2. Code Cleanup: Removing 115 lines of unused code improves maintainability ✅
  3. Parallel Optimization: The refactored tokio::join! maintains the parallel execution optimization ✅
  4. New Metrics: The LOOP_COMMIT_DURATION and IN_FLIGHT_REQUEST_COUNT metrics provide valuable observability ✅

Summary

Must Fix:

  1. Fix compilation error in streaming_response.rs test

Should Consider:

  1. Clarify iteration timing logic (off-by-one iteration for metrics)
  2. Review metric naming/descriptions for accuracy
  3. Add comments explaining replay vs. fresh iteration metric recording

Overall Assessment: The core changes are solid improvements to observability. The main blocker is the breaking test. Once that's addressed, this will be a good addition to the codebase.


Review generated by Claude Code

@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the 12-17-fix_add_wf_and_guard_metrics branch January 6, 2026 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants