@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Dec 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Review | Updated (UTC)
rivetkit-serverless | Error | | Jan 6, 2026 1:34am

3 Skipped Deployments

Project | Deployment | Review | Updated (UTC)
rivet-cloud | Ignored | | Jan 6, 2026 1:34am
rivet-inspector | Ignored | Preview | Jan 6, 2026 1:34am
rivet-site | Ignored | Preview | Jan 6, 2026 1:34am

Contributor Author

MasterPtato commented Dec 17, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Dec 17, 2025

PR Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics infrastructure from OpenTelemetry (OTEL) to Prometheus, introducing a pull-based metrics collection system instead of the previous push-based OTLP approach. The change simplifies the metrics pipeline and removes the need for a separate otel-collector-client service.


Code Quality & Architecture

Strengths

  1. Simplified Architecture: Removing the otel-collector-client service reduces complexity and the number of moving parts in the development environment.

  2. Clean Template-Based Configuration: Changes follow the repository's pattern of modifying templates in engine/docker/template/ and regenerating configs, which is good practice per CLAUDE.md.

  3. Prometheus Registry Implementation: The new registry.rs (engine/packages/metrics/src/registry.rs:1-11) properly initializes a Prometheus registry with standard labels including service name, pod ID, and source hash.

  4. HTTP Metrics Server: The standalone metrics server (engine/packages/metrics/src/server.rs:14-37) properly exposes Prometheus metrics on a configurable endpoint.


Issues & Concerns

🔴 Critical Issues

  1. Hardcoded Port in OTEL Config (engine/docker/template/src/services/edge/otel-collector.ts:29)
    Problem: The metrics port 6430 is hardcoded in the OTEL collector scrape config. This should reference the config value used in the metrics server to avoid drift between server and scraper configurations.

  2. Missing Error Handling in Metrics Encoding (engine/packages/metrics/src/server.rs:46-48)
    Problem: Using .expect() will crash the entire metrics server if encoding fails. Should return a proper error response instead.

  3. Response Building Could Panic (engine/packages/metrics/src/server.rs:50-54)
    Problem: Response building also uses .expect(), so a builder failure panics inside the request handler. Map the error to a 500 response instead.
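
The two .expect() findings above could be addressed along the following lines. This is a minimal sketch using only the standard library, not the code in server.rs: `MetricsResponse`, `EncodeError`, and `render_metrics` are hypothetical names, and the `encode` closure stands in for whatever encoder call the real handler makes.

```rust
use std::fmt;

/// Hypothetical stand-in for an HTTP response; a real handler would use
/// its framework's response type instead.
#[derive(Debug, PartialEq)]
pub struct MetricsResponse {
    pub status: u16,
    pub content_type: &'static str,
    pub body: Vec<u8>,
}

/// Hypothetical error type representing a failed encode step.
#[derive(Debug)]
pub struct EncodeError(pub String);

impl fmt::Display for EncodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Builds the scrape response without panicking: an encode failure
/// becomes a 500 response instead of an `.expect()` panic.
pub fn render_metrics<F>(encode: F) -> MetricsResponse
where
    F: FnOnce(&mut Vec<u8>) -> Result<(), EncodeError>,
{
    let mut buffer = Vec::new();
    match encode(&mut buffer) {
        Ok(()) => MetricsResponse {
            status: 200,
            // Content type of the Prometheus text exposition format.
            content_type: "text/plain; version=0.0.4",
            body: buffer,
        },
        Err(err) => MetricsResponse {
            status: 500,
            content_type: "text/plain",
            body: format!("failed to encode metrics: {err}").into_bytes(),
        },
    }
}
```

Keeping the encode step behind a closure also makes the failure path easy to unit test, which would cover part of the missing test coverage noted later in this review.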

⚠️ Medium Priority Issues

  1. OTEL Collector Still Receives OTLP (engine/docker/template/src/services/edge/otel-collector.ts:38-46)
    The OTEL collector still has OTLP receivers configured for both gRPC and HTTP. Question: Is this intentional for traces/logs?

  2. Empty Prometheus Scrape Configs (engine/docker/template/src/services/core/prometheus.ts:12)
    The Prometheus service has no scrape configs defined. All scraping is done by the OTEL collector's Prometheus receiver, and the scraped metrics are forwarded to Prometheus via remote write. This is architecturally valid but unconventional. Document this design choice if intentional.

  3. Potential Label Cardinality Issue (engine/packages/metrics/src/registry.rs:8)
    Pod IDs as labels can cause high cardinality in Kubernetes environments with frequent pod churn. Consider whether this label needs to be applied globally.

💡 Minor Issues & Style

  1. TODO Comment Left In Code (engine/packages/metrics/src/server.rs:11)
    Either implement this or create a tracking issue and reference it.

  2. Resource to Telemetry Conversion (engine/docker/template/src/services/edge/otel-collector.ts:99-101)
    This converts resource attributes to metric labels, which could increase cardinality. Ensure this is intentional.


Performance Considerations

Positive Changes

  1. Pull-Based Metrics: With Prometheus's pull model, services only expose an endpoint and the scraper controls collection timing, so services no longer run a push exporter.

  2. Direct Prometheus Registry: Using the Prometheus client library directly avoids the overhead of converting metrics through OTLP.

  3. Batch Processing: OTEL collector batch processor with 5s timeout and 10k batch size is well-configured.

⚠️ Concerns

  1. Scrape Interval Hardcoded: 15s scrape interval is hardcoded in multiple places. Consider making this configurable.

  2. No Metrics Endpoint Rate Limiting: The metrics server has no rate limiting. While scraping is typically controlled, consider if protection is needed.
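
The hardcoded scrape interval above and the hardcoded port flagged under Critical Issues share one possible remedy: derive both the server bind address and the scrape target from a single settings value. The sketch below only illustrates that idea in Rust. The actual scrape config lives in a TypeScript template, and `MetricsSettings`, its fields, and its methods are hypothetical names rather than the repository's config schema.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical settings struct holding the values the review found
/// hardcoded in more than one place.
#[derive(Debug, Clone, Copy)]
pub struct MetricsSettings {
    pub port: u16,
    pub scrape_interval_secs: u64,
}

impl Default for MetricsSettings {
    fn default() -> Self {
        // 6430 and 15s are the values the review reports as hardcoded.
        Self {
            port: 6430,
            scrape_interval_secs: 15,
        }
    }
}

impl MetricsSettings {
    /// Address the metrics server would bind to (all IPv4 interfaces).
    pub fn bind_addr(&self) -> SocketAddr {
        SocketAddr::new(IpAddr::V4(Ipv4Addr::UNSPECIFIED), self.port)
    }

    /// `host:port` target string for the scraper, built from the same port.
    pub fn scrape_target(&self, host: &str) -> String {
        format!("{host}:{}", self.port)
    }

    /// Scrape interval in the duration format Prometheus configs use, e.g. "15s".
    pub fn scrape_interval(&self) -> String {
        format!("{}s", self.scrape_interval_secs)
    }
}
```

With this shape, changing the port in one place updates both the listener and the scraper, so the two cannot drift apart.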


Security Concerns

⚠️ Issues

  1. Insecure TLS Configuration (engine/docker/template/src/services/edge/otel-collector.ts:96-98)
    TLS is set to insecure for dev environment. Severity: Low (dev only). Add a comment indicating this is acceptable for dev but should be changed for production.

  2. No Authentication on Metrics Endpoint: The metrics endpoint has no authentication. This is typical for Prometheus but worth documenting.


Test Coverage

Missing Tests

  1. No Tests for Metrics Server: The new server.rs file has no associated tests. Add tests for successful metrics scraping, encoding, and error handling.

  2. No Integration Tests: No tests verify the OTEL collector can successfully scrape the Prometheus endpoint.


Documentation

⚠️ Missing Documentation

  1. Architecture Decision: This significant change lacks explanation about why the revert was needed.

  2. Migration Guide: No guidance for existing deployments on how this affects their metrics.

  3. Old Dashboards Preserved: The PR moves old dashboards to grafana-dashboards-old/ but doesn't document what changed or why both are kept.


Recommendations

High Priority

  1. Fix panic-inducing .expect() calls in metrics server
  2. Extract hardcoded port to shared config
  3. Add error handling for metrics encoding
  4. Add tests for metrics server

Medium Priority

  1. Document why OTLP receivers remain in OTEL collector
  2. Document the Prometheus remote-write architecture
  3. Review label cardinality (especially kubernetes_pod_id)
  4. Consider making scrape interval configurable

Low Priority

  1. Add TODO tracking issue reference
  2. Add security comments for insecure TLS in dev
  3. Clean up old dashboard directory or document retention reason

Verdict

Recommend: Merge with minor fixes

The architectural change is sound and simplifies the metrics pipeline. The main concerns are:

  1. Error handling that could crash the metrics server
  2. Hardcoded values that could drift
  3. Missing tests and documentation

These should be addressed before or shortly after merge, but they don't block the core functionality.


Note: This review follows the conventions in CLAUDE.md and focuses on the Rust/Docker configuration aspects of the change. The extensive Grafana dashboard JSON changes were not reviewed in detail as they appear to be auto-generated.

@pkg-pr-new

pkg-pr-new bot commented Dec 17, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3656

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3656

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3656

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3656

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3656

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3656

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3656

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3656

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3656

commit: 5a3e37b

@claude

claude bot commented Dec 17, 2025

Pull Request Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics collection infrastructure from OpenTelemetry (OTEL) back to native Prometheus metrics, affecting ~133 files. The change simplifies the metrics pipeline by using Prometheus's native pull-based model.


Code Quality & Best Practices

✅ Strengths

  1. Consistent Migration Pattern: Proper use of IntCounterVec, IntGaugeVec, HistogramVec
  2. Clean Registry: Global Prometheus registry with service-level labels
  3. Template-Based Config: Properly regenerated across all environments
  4. Simplified Architecture: Removed otel-collector-client service

⚠️ Critical Issues

1. Error Handling in Metrics Server (engine/packages/metrics/src/server.rs:46-54)

Using .expect() will crash the metrics server on failures. The handler should return an HTTP error response instead.

2. High Cardinality Risk (engine/packages/api-builder/src/metrics.rs:22)

Using the raw request path as a label can create unbounded cardinality, because every distinct ID embedded in a URL produces a new time series. Use path templates instead.

3. Pod ID Cardinality (engine/packages/metrics/src/registry.rs:8)

Pod IDs as global labels create a new set of time series for every pod, so stale series may accumulate as pods churn.

4. Hardcoded Port (engine/docker/template/src/services/edge/otel-collector.ts:29)

Port 6430 is hardcoded. It should reference the metrics server's config value to avoid drift.
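
For issue 2 above, one way to bound path-label cardinality is to collapse dynamic segments before recording the label. The sketch below is a heuristic under stated assumptions, not the code in api-builder: `normalize_path_label` and `is_dynamic_segment` are hypothetical helpers, the input is assumed to be a path without a query string, and only all-digit and UUID-shaped segments are treated as dynamic. If the router exposes the matched route template, using that directly is likely the more robust choice.

```rust
/// Returns true if a path segment looks like a dynamic identifier:
/// either all ASCII digits or shaped like a hyphenated UUID.
fn is_dynamic_segment(segment: &str) -> bool {
    let all_digits = !segment.is_empty() && segment.chars().all(|c| c.is_ascii_digit());
    let uuid_like = segment.len() == 36
        && segment.chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        });
    all_digits || uuid_like
}

/// Replaces dynamic segments with a fixed placeholder so the number of
/// distinct label values is bounded by the number of routes, not by
/// the number of IDs seen in requests.
pub fn normalize_path_label(path: &str) -> String {
    path.split('/')
        .map(|segment| {
            if is_dynamic_segment(segment) {
                ":id"
            } else {
                segment
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

A path such as `/actors/123/logs` would be recorded as `/actors/:id/logs`, while static paths such as `/health` pass through unchanged.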


Test Coverage ❌

No unit or integration tests for the new metrics server.


Recommendations

High Priority (Before Merge)

  1. ❗ Fix .expect() calls with proper error handling
  2. ❗ Add basic tests for metrics server
  3. 📝 Document OTEL collector role in pipeline
  4. 📝 Add migration notes

Medium Priority

  1. 🔍 Extract hardcoded port to config
  2. 🔍 Normalize path labels
  3. 🔍 Monitor cardinality in production

Verdict

✅ Approve with required fixes

Sound architectural change that simplifies infrastructure. Critical error handling must be fixed before merge.

Must fix: Error handling + tests
Follow-up: Cardinality monitoring + documentation


Review follows CLAUDE.md conventions. Dashboard JSON spot-checked only.

@graphite-app
Contributor

graphite-app bot commented Jan 6, 2026

Merge activity

  • Jan 6, 1:36 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 6, 1:37 AM UTC: CI is running for this pull request on a draft pull request (#3753) due to your merge queue CI optimization settings.
  • Jan 6, 1:37 AM UTC: Merged by the Graphite merge queue via draft PR: #3753.

graphite-app bot pushed a commit that referenced this pull request Jan 6, 2026
@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the 12-16-fix_revert_otel_metrics_to_prometheus branch January 6, 2026 01:37