fix: revert otel metrics to prometheus #3656

MasterPtato · 2025-12-17T02:28:16Z

No description provided.

vercel · 2025-12-17T02:28:17Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
rivetkit-serverless	Error		Jan 6, 2026 1:34am

3 Skipped Deployments

Project	Deployment	Review	Updated (UTC)
rivet-cloud	Ignored		Jan 6, 2026 1:34am
rivet-inspector	Ignored	Preview	Jan 6, 2026 1:34am
rivet-site	Ignored	Preview	Jan 6, 2026 1:34am

MasterPtato · 2025-12-17T02:28:32Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-12-17T02:30:49Z

PR Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics infrastructure from OpenTelemetry (OTEL) to Prometheus, introducing a pull-based metrics collection system instead of the previous push-based OTLP approach. The change simplifies the metrics pipeline and removes the need for a separate otel-collector-client service.

Code Quality & Architecture

✅ Strengths

Simplified Architecture: Removing the otel-collector-client service reduces complexity and the number of moving parts in the development environment.
Clean Template-Based Configuration: Changes follow the repository's pattern of modifying templates in engine/docker/template/ and regenerating configs, which is good practice per CLAUDE.md.
Prometheus Registry Implementation: The new registry.rs (engine/packages/metrics/src/registry.rs:1-11) properly initializes a Prometheus registry with standard labels including service name, pod ID, and source hash.
HTTP Metrics Server: The standalone metrics server (engine/packages/metrics/src/server.rs:14-37) properly exposes Prometheus metrics on a configurable endpoint.

Issues & Concerns

🔴 Critical Issues

Hardcoded Port in OTEL Config (engine/docker/template/src/services/edge/otel-collector.ts:29)
Problem: The metrics port 6430 is hardcoded in the OTEL collector scrape config. This should reference the config value used in the metrics server to avoid drift between server and scraper configurations.
Missing Error Handling in Metrics Encoding (engine/packages/metrics/src/server.rs:46-48)
Problem: Using .expect() will crash the entire metrics server if encoding fails. Should return a proper error response instead.
Response Building Could Panic (engine/packages/metrics/src/server.rs:50-54)
Problem: Response building uses .expect() which is not ideal for a server. Should handle errors properly.
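As a rough illustration of the two `.expect()` concerns above, the sketch below maps a failed encoding step to a 500 response instead of panicking. It uses only the standard library: `EncodeError`, `MetricsResponse`, and `metrics_response` are hypothetical stand-ins, not the types used in server.rs, and the actual handler would use whatever HTTP and Prometheus types the crate already depends on.

```rust
use std::fmt;

/// Hypothetical error type standing in for an encoder failure.
#[derive(Debug)]
pub struct EncodeError(pub String);

impl fmt::Display for EncodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to encode metrics: {}", self.0)
    }
}

impl std::error::Error for EncodeError {}

/// Minimal stand-in for an HTTP response: status code, content type, body.
#[derive(Debug, PartialEq)]
pub struct MetricsResponse {
    pub status: u16,
    pub content_type: &'static str,
    pub body: String,
}

/// Turn the result of the encoding step into a response without panicking.
pub fn metrics_response(encoded: Result<String, EncodeError>) -> MetricsResponse {
    match encoded {
        Ok(body) => MetricsResponse {
            status: 200,
            // Content type of the Prometheus text exposition format.
            content_type: "text/plain; version=0.0.4",
            body,
        },
        // A failed scrape becomes a 500 for this request only;
        // the server keeps running and the next scrape can succeed.
        Err(err) => MetricsResponse {
            status: 500,
            content_type: "text/plain",
            body: err.to_string(),
        },
    }
}
```

With this shape, a scrape that hits an encoding failure sees a 500 and can be retried on the next interval, rather than taking the metrics server down.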

⚠️ Medium Priority Issues

OTEL Collector Still Receives OTLP (engine/docker/template/src/services/edge/otel-collector.ts:38-46)
The OTEL collector still has OTLP receivers configured for both gRPC and HTTP. Question: Is this intentional for traces/logs?
Empty Prometheus Scrape Configs (engine/docker/template/src/services/core/prometheus.ts:12)
The Prometheus service has no scrape configs defined. All scraping is done by OTEL collector via Prometheus receiver and sent via remote write. This is architecturally valid but unconventional. Document this design choice if intentional.
Potential Label Cardinality Issue (engine/packages/metrics/src/registry.rs:8)
Pod IDs as labels can cause high cardinality in Kubernetes environments with frequent pod churn. Consider if this label is necessary globally.

💡 Minor Issues & Style

TODO Comment Left In Code (engine/packages/metrics/src/server.rs:11)
Either implement this or create a tracking issue and reference it.
Resource to Telemetry Conversion (engine/docker/template/src/services/edge/otel-collector.ts:99-101)
This converts resource attributes to metric labels, which could increase cardinality. Ensure this is intentional.

Performance Considerations

✅ Positive Changes

Pull-Based Metrics: With Prometheus's pull model, each service only exposes an endpoint and the scraper controls collection timing, which avoids every service pushing metrics to the OTEL collector for aggregation.
Direct Prometheus Registry: Using the Prometheus client library directly avoids the overhead of converting metrics through OTLP.
Batch Processing: OTEL collector batch processor with 5s timeout and 10k batch size is well-configured.

⚠️ Concerns

Scrape Interval Hardcoded: 15s scrape interval is hardcoded in multiple places. Consider making this configurable.
No Metrics Endpoint Rate Limiting: The metrics server has no rate limiting. While scraping is typically controlled, consider if protection is needed.

Security Concerns

⚠️ Issues

Insecure TLS Configuration (engine/docker/template/src/services/edge/otel-collector.ts:96-98)
TLS is set to insecure for dev environment. Severity: Low (dev only). Add a comment indicating this is acceptable for dev but should be changed for production.
No Authentication on Metrics Endpoint: The metrics endpoint has no authentication. This is typical for Prometheus but worth documenting.

Test Coverage

❌ Missing Tests

No Tests for Metrics Server: The new server.rs file has no associated tests. Add tests for successful metrics scraping, encoding, and error handling.
No Integration Tests: No tests verify the OTEL collector can successfully scrape the Prometheus endpoint.

Documentation

⚠️ Missing Documentation

Architecture Decision: This significant change lacks explanation about why the revert was needed.
Migration Guide: No guidance for existing deployments on how this affects their metrics.
Old Dashboards Preserved: The PR moves old dashboards to grafana-dashboards-old/ but doesn't document what changed or why both are kept.

Recommendations

High Priority

Fix panic-inducing .expect() calls in metrics server
Extract hardcoded port to shared config
Add error handling for metrics encoding
Add tests for metrics server

Medium Priority

Document why OTLP receivers remain in OTEL collector
Document the Prometheus remote-write architecture
Review label cardinality (especially kubernetes_pod_id)
Consider making scrape interval configurable

Low Priority

Add TODO tracking issue reference
Add security comments for insecure TLS in dev
Clean up old dashboard directory or document retention reason

Verdict

Recommend: Merge with minor fixes

The architectural change is sound and simplifies the metrics pipeline. The main concerns are:

Error handling that could crash the metrics server
Hardcoded values that could drift
Missing tests and documentation

These should be addressed before or shortly after merge, but they don't block the core functionality.

Note: This review follows the conventions in CLAUDE.md and focuses on the Rust/Docker configuration aspects of the change. The extensive Grafana dashboard JSON changes were not reviewed in detail as they appear to be auto-generated.

pkg-pr-new · 2025-12-17T02:31:26Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3656

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3656

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3656

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3656

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3656

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3656

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3656

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3656

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3656

commit: 5a3e37b

claude · 2025-12-17T21:46:54Z

Pull Request Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics collection infrastructure from OpenTelemetry (OTEL) back to native Prometheus metrics, affecting ~133 files. The change simplifies the metrics pipeline by using Prometheus's native pull-based model.

Code Quality & Best Practices

✅ Strengths

Consistent Migration Pattern: Proper use of IntCounterVec, IntGaugeVec, HistogramVec
Clean Registry: Global Prometheus registry with service-level labels
Template-Based Config: Properly regenerated across all environments
Simplified Architecture: Removed otel-collector-client service

⚠️ Critical Issues

1. Error Handling in Metrics Server (engine/packages/metrics/src/server.rs:46-54)

Using .expect() will crash the metrics server on failures. Must return HTTP error responses instead.

2. High Cardinality Risk (engine/packages/api-builder/src/metrics.rs:22)

Raw path as label can create unbounded cardinality. Use path templates instead.

3. Pod ID Cardinality (engine/packages/metrics/src/registry.rs:8)

Pod IDs as global labels create a new set of time series for every pod, so stale series may accumulate as pods churn.

4. Hardcoded Port (engine/docker/template/src/services/edge/otel-collector.ts:29)

Port 6430 hardcoded - should reference config to avoid drift.
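For the path-label concern in issue 2, one possible fallback is sketched below, assuming the router's matched route template is not readily available at the point where the label is recorded. The function name `normalize_path_label` and the `:id` placeholder are illustrative choices, not names from the PR, and the heuristic only recognizes all-digit and canonical UUID segments.

```rust
/// Collapse ID-like path segments so the set of label values stays bounded.
pub fn normalize_path_label(path: &str) -> String {
    // Drop any query string before splitting into segments.
    let path = path.split('?').next().unwrap_or("");
    let segments: Vec<&str> = path
        .split('/')
        .filter(|s| !s.is_empty())
        .map(|s| if looks_like_id(s) { ":id" } else { s })
        .collect();
    format!("/{}", segments.join("/"))
}

/// Heuristic: a segment is treated as an ID if it is all digits
/// or a UUID in canonical 8-4-4-4-12 hex form.
fn looks_like_id(segment: &str) -> bool {
    let all_digits = segment.chars().all(|c| c.is_ascii_digit());
    let uuid_like = segment.len() == 36
        && segment.chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        });
    all_digits || uuid_like
}
```

For example, `/actors/123/logs` and `/actors/456/logs` would both be recorded as `/actors/:id/logs`. Using the route template reported by the router, where one exists, is generally more robust than a heuristic like this, since it also covers opaque string IDs.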

Test Coverage ❌

No unit or integration tests for the new metrics server.

Recommendations

High Priority (Before Merge)

❗ Fix .expect() calls with proper error handling
❗ Add basic tests for metrics server
📝 Document OTEL collector role in pipeline
📝 Add migration notes

Medium Priority

🔍 Extract hardcoded port to config
🔍 Normalize path labels
🔍 Monitor cardinality in production

Verdict

✅ Approve with required fixes

Sound architectural change that simplifies infrastructure. Critical error handling must be fixed before merge.

Must fix: Error handling + tests
Follow-up: Cardinality monitoring + documentation

Review follows CLAUDE.md conventions. Dashboard JSON spot-checked only.

graphite-app · 2026-01-06T01:36:28Z

Merge activity

Jan 6, 1:36 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 6, 1:37 AM UTC: CI is running for this pull request on a draft pull request (#3753) due to your merge queue CI optimization settings.
Jan 6, 1:37 AM UTC: Merged by the Graphite merge queue via draft PR: #3753.

MasterPtato requested a review from NathanFlurry December 17, 2025 02:28

MasterPtato mentioned this pull request Dec 17, 2025

chore: update dashboards #3655

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless December 17, 2025 02:30 Failure

MasterPtato force-pushed the 12-16-fix_revert_otel_metrics_to_prometheus branch from cfe978f to 475f32b Compare December 17, 2025 02:35

vercel bot had a problem deploying to Preview – rivetkit-serverless December 17, 2025 02:36 Failure

MasterPtato force-pushed the 12-16-fix_revert_otel_metrics_to_prometheus branch from 475f32b to 5a3e37b Compare December 17, 2025 21:41

MasterPtato mentioned this pull request Dec 17, 2025

fix: add wf and guard metrics #3657

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless December 17, 2025 21:42 Failure

MasterPtato mentioned this pull request Dec 18, 2025

chore: clean up metrics #3660

Closed

This was referenced Jan 1, 2026

fix: fix gateway token reading, rivetkit path token, vergen git sha #3721

Closed

chore: fmt, fix config properties #3749

Closed

fix(ups): fix memory leak with local subscribers #3752

Closed

NathanFlurry approved these changes Jan 6, 2026

fix: revert otel metrics to prometheus

16a7519

MasterPtato force-pushed the 12-16-fix_revert_otel_metrics_to_prometheus branch from 5a3e37b to 16a7519 Compare January 6, 2026 01:33

MasterPtato force-pushed the 11-12-chore_update_dashboards branch from 36bd554 to 0956f01 Compare January 6, 2026 01:33

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 01:34 Failure

graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026

fix: revert otel metrics to prometheus (#3656)

8966908

graphite-app bot closed this Jan 6, 2026

graphite-app bot deleted the 12-16-fix_revert_otel_metrics_to_prometheus branch January 6, 2026 01:37

claude bot mentioned this pull request Jan 6, 2026

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

Closed
