fix: add wf and guard metrics #3657

MasterPtato · 2025-12-17T21:41:06Z

No description provided.

vercel · 2025-12-17T21:41:10Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
rivetkit-serverless	Error		Jan 6, 2026 1:34am

3 Skipped Deployments

Project	Deployment	Review	Updated (UTC)
rivet-cloud	Ignored		Jan 6, 2026 1:34am
rivet-inspector	Ignored	Preview	Jan 6, 2026 1:34am
rivet-site	Ignored	Preview	Jan 6, 2026 1:34am

MasterPtato · 2025-12-17T21:41:24Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-12-17T21:42:23Z

PR Review: Add Workflow and Guard Metrics

Overview

This PR improves observability by adding new metrics for workflow loops and guard proxy requests, while also cleaning up unused response handling code in the guard proxy service.

Code Quality & Best Practices ✅

Good practices observed:

Following existing metric naming conventions with _total suffix (api-builder/src/metrics.rs:27)
Using structured logging patterns correctly
Proper use of workspace dependencies
Maintaining hard tab formatting

Detailed Findings

1. Workflow Loop Metrics (gasoline/src/builder/workflow/lupe.rs)

Strengths:

Smart optimization: Moved timing logic to only measure non-replay iterations (lines 189-192)
Good parallelization with deferred DB operations
Clear separation of concerns between loop commit and iteration duration

Potential Issues:

🔴 Critical: Incorrect metric recording on first iteration

// Line 189-192
if !loop_branch.cursor().compare_loop_branch(iteration)? {
    // ...
    metrics::LOOP_ITERATION_DURATION
        .with_label_values(&[&ctx.name().to_string()])
        .observe(iteration_dt.as_secs_f64());
}

Problem: On the first iteration (iteration=0), iteration_dt is initialized to Duration::ZERO (line 120), so the first iteration's duration won't be recorded accurately. The duration from the callback (cb_dt) is only saved to iteration_dt at line 211, but by then we've already recorded the metric with the stale value.

Fix: Record the metric using cb_dt directly instead of iteration_dt:

metrics::LOOP_ITERATION_DURATION
    .with_label_values(&[&ctx.name().to_string()])
    .observe(cb_dt.as_secs_f64());

🟡 Minor: Metric description clarity

Line in gasoline/src/metrics.rs:183: Description says "Total duration of a single loop iteration" but it's actually measuring the callback execution time, not the full iteration overhead
Consider: "Duration of loop iteration callback execution" for accuracy

2. Guard In-Flight Request Tracking (guard-core/src/proxy_service.rs)

Strengths:

Good use of atomic operations with or_insert_with
Proper cleanup on request completion

Issues:

🟡 Performance: Redundant metric updates

// Lines 641-642 and 622
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_requests.entry_count() as i64);

Problem: Calling entry_count() on every request ID generation/release can be expensive for high-throughput services since moka::Cache needs to count entries.

Recommendation:

Use an AtomicI64 counter that increments/decrements instead of calling entry_count()
This gives O(1) operations instead of potentially O(n)

pub struct ProxyState {
    // ... existing fields ...
    in_flight_request_count: AtomicI64,
}

// On insert:
self.in_flight_request_count.fetch_add(1, Ordering::Relaxed);
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_request_count.load(Ordering::Relaxed));

// On release:
self.in_flight_request_count.fetch_sub(1, Ordering::Relaxed);
metrics::IN_FLIGHT_REQUEST_COUNT.set(self.in_flight_request_count.load(Ordering::Relaxed));

🟢 Good comment addition (line 317): The clarification about using moka::Cache for TTL/capacity is helpful

3. Dead Code Removal (guard-core/src/proxy_service.rs)

Excellent cleanup:

Removed 115 lines of unused StructuredResponse code
Eliminated unreachable branches in WebSocket handling
Simplified RoutingOutput enum

No issues found - this is a net positive for maintainability

4. API Error Metric Naming (api-builder/src/metrics.rs)

✅ Good fix: Changed api_request_errors → api_request_errors_total (line 27)

Follows Prometheus naming conventions for counters
More consistent with existing metrics

Security Considerations ✅

No security concerns identified. The changes:

Don't introduce new attack surfaces
Maintain existing rate limiting and in-flight controls
Don't expose sensitive data in metrics

Performance Considerations

⚠️ Moderate impact:

The entry_count() calls could impact high-throughput scenarios (see recommendation above)
Loop metrics add minimal overhead (just Instant::now() calls)
Overall performance impact is acceptable for the observability gains

Test Coverage

⚠️ Missing tests:

No tests added for new metrics
Consider adding integration tests that verify:
- LOOP_COMMIT_DURATION is recorded correctly
- LOOP_ITERATION_DURATION accurately measures callback time
- IN_FLIGHT_REQUEST_COUNT increments/decrements properly

Recommendation: Add a test like:

#[tokio::test]
async fn test_loop_metrics_recording() {
    // Setup workflow with instrumented loop
    // Assert metrics are recorded with expected labels
}

Summary

Must Fix (Critical):

❌ First iteration duration recording bug in lupe.rs:189-192

Should Fix (Recommended):

⚠️ Use atomic counter instead of entry_count() for performance
⚠️ Add test coverage for new metrics
⚠️ Clarify metric description for LOOP_ITERATION_DURATION

Overall Assessment: Good observability improvements with one critical bug that needs fixing before merge. The dead code removal is excellent housekeeping.

pkg-pr-new · 2025-12-17T21:44:28Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3657

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3657

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3657

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3657

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3657

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3657

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3657

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3657

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3657

commit: e258ce3

graphite-app · 2026-01-06T01:36:28Z

Merge activity

Jan 6, 1:36 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 6, 1:37 AM UTC: CI is running for this pull request on a draft pull request (#3753) due to your merge queue CI optimization settings.
Jan 6, 1:37 AM UTC: Merged by the Graphite merge queue via draft PR: #3753.

claude · 2026-01-06T01:37:12Z

PR Review: Add wf and guard metrics

Overview

This PR adds improved metrics instrumentation for workflow loops and guard proxy requests. It also removes unused StructuredResponse functionality from the guard-core proxy service.

🚨 Critical Issues

1. Breaking Change - Test Failure

File: engine/packages/guard-core/tests/streaming_response.rs

The PR removes StructuredResponse and RoutingOutput::Response from proxy_service.rs, but the test file at engine/packages/guard-core/tests/streaming_response.rs:255-260 still references these removed types. This will cause compilation failures.

Recommendation: Update the test to use an alternative approach (e.g., return a different route or use CustomServe) or remove the test if it's no longer relevant.

⚠️ Issues to Address

2. Potential Timing Accuracy Issue

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:189-192

The iteration duration metric is now only recorded for non-replay iterations. Is this intentional? During workflow replay (recovery), iteration metrics won't be recorded - only fresh execution times are captured.

Recommendation: Add a comment explaining why replay iterations are excluded from metrics, or consider adding a separate metric for replay vs. fresh iterations if that distinction is valuable.

3. Loop Commit Metric Placement

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:168-174

The LOOP_COMMIT_DURATION metric is only recorded when there's an upsert (every N iterations based on commit interval), not on every iteration. The metric name suggests it measures "a single loop commit" but it's actually measuring periodic batch commits.

Recommendation: Either rename to LOOP_BATCH_COMMIT_DURATION to clarify it's measuring batched commits, or add a comment explaining this metric is only recorded every N iterations.

4. Loop Iteration Timing Logic

File: engine/packages/gasoline/src/builder/workflow/lupe.rs:197-204

The timing logic measures callback execution time and stores it in iteration_dt for use in the NEXT iteration. This means the metric records the duration from the previous iteration, not the current one. The first iteration will record Duration::ZERO.

Question: Is this off-by-one iteration timing intentional (e.g., to avoid recording replay iterations)?

Recommendation: If intentional, add a comment explaining the logic. Otherwise, consider recording the metric immediately after the callback completes.

5. Metric Description Clarity

File: engine/packages/gasoline/src/metrics.rs:183

The description changed from "Total duration of a single loop iteration (excluding its body)" to "Total duration of a single loop iteration."

Looking at the code, iteration_dt captures only the callback execution time, so the metric seems to measure the callback duration, not the full iteration overhead including branch commits.

Recommendation: Clarify whether this measures just the callback/body duration, or the full iteration including branch commits and other overhead.

📝 Minor Suggestions

6. In-Flight Request Race Condition

File: engine/packages/guard-core/src/proxy_service.rs:621-622

The metric update happens after invalidate, which could create a brief window where the metric is stale. Consider ensuring atomic updates.

✅ Positive Observations

Prometheus Best Practices: The _total suffix addition follows Prometheus conventions ✅
Code Cleanup: Removing 115 lines of unused code improves maintainability ✅
Parallel Optimization: The refactored tokio::join! maintains the parallel execution optimization ✅
New Metrics: The LOOP_COMMIT_DURATION and IN_FLIGHT_REQUEST_COUNT metrics provide valuable observability ✅

Summary

Must Fix:

Fix compilation error in streaming_response.rs test

Should Consider:

Clarify iteration timing logic (off-by-one iteration for metrics)
Review metric naming/descriptions for accuracy
Add comments explaining replay vs. fresh iteration metric recording

Overall Assessment: The core changes are solid improvements to observability. The main blocker is the breaking test. Once that's addressed, this will be a good addition to the codebase.

Review generated by Claude Code

MasterPtato requested a review from NathanFlurry December 17, 2025 21:41

MasterPtato mentioned this pull request Dec 17, 2025

fix: revert otel metrics to prometheus #3656

Closed

MasterPtato mentioned this pull request Dec 17, 2025

chore: update dashboards #3655

Closed

MasterPtato mentioned this pull request Dec 18, 2025

chore: clean up metrics #3660

Closed

This was referenced Jan 1, 2026

fix: fix gateway token reading, rivetkit path token, vergen git sha #3721

Closed

chore: fmt, fix config properties #3749

Closed

fix(ups): fix memory leak with local subscribers #3752

Closed

NathanFlurry approved these changes Jan 6, 2026

View reviewed changes

fix: add wf and guard metrics

a4c59b6

MasterPtato force-pushed the 12-17-fix_add_wf_and_guard_metrics branch from e258ce3 to a4c59b6 Compare January 6, 2026 01:33

MasterPtato force-pushed the 12-16-fix_revert_otel_metrics_to_prometheus branch from 5a3e37b to 16a7519 Compare January 6, 2026 01:33

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 01:34 Failure

graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026

fix: add wf and guard metrics (#3657)

6c03bd5

graphite-app bot closed this Jan 6, 2026

graphite-app bot deleted the 12-17-fix_add_wf_and_guard_metrics branch January 6, 2026 01:37

claude bot mentioned this pull request Jan 6, 2026

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add wf and guard metrics #3657

fix: add wf and guard metrics #3657

Uh oh!

MasterPtato commented Dec 17, 2025

Uh oh!

vercel bot commented Dec 17, 2025 •

edited

Loading

Uh oh!

MasterPtato commented Dec 17, 2025 •

edited

Loading

Uh oh!

claude bot commented Dec 17, 2025 •

edited

Loading

Uh oh!

pkg-pr-new bot commented Dec 17, 2025

Uh oh!

graphite-app bot commented Jan 6, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: add wf and guard metrics #3657

fix: add wf and guard metrics #3657

Uh oh!

Conversation

MasterPtato commented Dec 17, 2025

Uh oh!

vercel bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MasterPtato commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add Workflow and Guard Metrics

Overview

Code Quality & Best Practices ✅

Detailed Findings

1. Workflow Loop Metrics (gasoline/src/builder/workflow/lupe.rs)

2. Guard In-Flight Request Tracking (guard-core/src/proxy_service.rs)

3. Dead Code Removal (guard-core/src/proxy_service.rs)

4. API Error Metric Naming (api-builder/src/metrics.rs)

Security Considerations ✅

Performance Considerations

Test Coverage

Summary

Uh oh!

pkg-pr-new bot commented Dec 17, 2025

Uh oh!

graphite-app bot commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

claude bot commented Jan 6, 2026

PR Review: Add wf and guard metrics

Overview

🚨 Critical Issues

1. Breaking Change - Test Failure

⚠️ Issues to Address

2. Potential Timing Accuracy Issue

3. Loop Commit Metric Placement

4. Loop Iteration Timing Logic

5. Metric Description Clarity

📝 Minor Suggestions

6. In-Flight Request Race Condition

✅ Positive Observations

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vercel bot commented Dec 17, 2025 •

edited

Loading

MasterPtato commented Dec 17, 2025 •

edited

Loading

claude bot commented Dec 17, 2025 •

edited

Loading

graphite-app bot commented Jan 6, 2026 •

edited

Loading