Skip to content

Conversation

@graphite-app
Copy link
Contributor

@graphite-app graphite-app bot commented Jan 6, 2026

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

@vercel
Copy link

vercel bot commented Jan 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 6, 2026 3:17am
rivet-inspector Ignored Ignored Preview Jan 6, 2026 3:17am
rivet-site Ignored Ignored Preview Jan 6, 2026 3:17am
rivetkit-serverless Skipped Skipped Jan 6, 2026 3:17am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless January 6, 2026 03:17 Inactive
@graphite-app graphite-app bot closed this Jan 6, 2026
@vercel vercel bot temporarily deployed to Production – rivetkit-serverless January 6, 2026 03:18 Inactive
@graphite-app graphite-app bot deleted the gtmq_spec_483d1f_1767669428058-b07447e1-a38a-4660-b610-d3b4ea09a981 branch January 6, 2026 03:18
@claude
Copy link

claude bot commented Jan 6, 2026

Pull Request Review

This PR bundles three changes together via Graphite's merge queue:

  1. Remove FDB-specific metrics (chore(engine): remove fdb-specific metrics #3756)
  2. Update metrics to Prometheus (fix(engine): update metrics to prometheus #3757)
  3. Fix deadlock in UPS (fix: fix deadlock in ups #3755)

Overall Assessment

APPROVED - The changes are well-implemented with good attention to detail. The migration to Prometheus-native metrics is clean, and the deadlock fix is correct.


1. Metrics Migration to Prometheus

Strengths

Gasoline metrics (engine/packages/gasoline/src/db/kv/mod.rs:1022 and metrics.rs:58-62):

  • Clean migration from OpenTelemetry-style record() to Prometheus-native observe()
  • Properly removed the hardcoded label KeyValue::new("workflow_name", "batch") which didn't make sense
  • New FIND_WORKFLOWS_BATCH_DURATION histogram is correctly registered

Pegboard metrics (engine/packages/pegboard/src/ops/runner/drain.rs:91-93):

  • Correctly migrated from Counter::add() with KeyValue labels to IntCounterVec::with_label_values().inc_by()
  • The metric definition in metrics.rs:19-24 matches the usage pattern
  • Labels namespace_id and runner_name are properly preserved

UniversalDB cleanup (engine/packages/universaldb/src/metrics.rs):

  • Removed unused FDB-specific metrics (FDB_PING_DURATION, FDB_MISSED_PING)
  • Clean removal with no dangling references

2. Deadlock Fix in UniversalPubSub

Critical Fix

Root cause (engine/packages/universalpubsub/src/pubsub.rs:384):
The previous code called pubsub.local_subscribers.remove_async(&subject).await which could deadlock because:

  • Drop::drop() is synchronous but spawned an async task
  • remove_async() requires acquiring a lock on the HashMap
  • If the HashMap was already locked elsewhere, this could cause a deadlock

Solution:

let _ = sender.remove();  // Use the Entry API's remove() instead

This is correct because:

  • The code is already inside a closure that has access to the Entry via get_async()
  • Using sender.remove() (where sender is the entry) removes the entry directly without needing another lock
  • This is the idiomatic way to remove entries in scc::HashMap when you already have an entry handle

⚠️ Minor Observation

The let _ = sender.remove(); silently ignores the result. Consider if this should be logged on error, though in a Drop handler it's often acceptable to ignore cleanup failures.


3. Instrumentation Improvements

Good Additions

Multiple functions now have #[tracing::instrument] attributes (pubsub.rs:84, 119, 154, 199, 218, 228, 344, 412, 437, 459):

  • Improves observability for pub/sub operations
  • Properly uses skip_all to avoid logging large payloads
  • Includes relevant context via fields(%subject, %reply_subject)

Excellent addition at line 459:

res = self.driver.next().in_current_span() => {
    return res;
}

The .in_current_span() ensures trace spans are properly propagated across the select! branch, which is critical for distributed tracing correctness.


4. Import Hygiene

Follows Best Practices

Changes to imports show good discipline:

  • pubsub.rs:4-5: Changed from glob import use anyhow::*; to specific use anyhow::Result;
  • Added use tracing::Instrument; at line 8
  • drain.rs:4: Removed unused use rivet_metrics::KeyValue;

This follows the CLAUDE.md guidance: "Do not glob import (::*) from anyhow. Instead, import individual types and traits"


Security & Performance

No security concerns identified
Performance improvements: The deadlock fix should improve reliability under load
Metrics overhead: Prometheus metrics are slightly more efficient than OpenTelemetry wrapping


Test Coverage

⚠️ Recommendation: Consider adding a regression test for the deadlock scenario:

  • Create multiple subscribers to the same subject
  • Drop them concurrently while the HashMap is under contention
  • Verify cleanup completes without hanging

However, this is difficult to test reliably and may not be worth the effort given the fix is straightforward.


Conclusion

This is a solid set of improvements. The metrics migration is clean and follows Prometheus best practices. The deadlock fix addresses a real concurrency issue with the correct solution. The instrumentation additions will help with production debugging.

Recommendation: Merge after CI passes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants