
Commit c507750

Migrate observability to docusaurus (#511)
* Migrate observability to docusaurus

  Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* Updated isvc yaml

  Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

---------

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>
1 parent 65b8ca3 commit c507750

File tree

3 files changed: +127 -0 lines changed
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: Grafana Dashboards
description: Explore pre-built Grafana dashboards for monitoring KServe inference services, including latency metrics and performance debugging.
---

# Grafana Dashboards

Some example Grafana dashboards are available on Grafana Labs.

## Knative HTTP Dashboard (if using serverless mode)

The [Knative HTTP Grafana dashboard](https://grafana.com/grafana/dashboards/18032-knative-serving-revision-http-requests/) was built from [Knative's sandbox monitoring example](https://github.com/knative-sandbox/monitoring).

## KServe ModelServer Latency Dashboard

A template dashboard for [KServe ModelServer Latency](https://grafana.com/grafana/dashboards/17969-kserve-modelserver-latency/) contains example queries that use the Prometheus metrics for pre/post-process, predict, and explain latency in milliseconds. Each query is a [histogram quantile](https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile). A fifth graph shows the total number of requests to the predict endpoint. The dashboard covers all of KServe's ModelServer runtimes: lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, and custom transformers/predictors.

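To illustrate the shape of such a query (a sketch only; the published dashboard may use different windows, quantiles, and label groupings), the following Prometheus recording rule computes a 95th-percentile predict latency from the `request_predict_seconds` histogram described on the Prometheus Metrics page. The rule name and the 1m rate window are illustrative choices, not part of the dashboard.

```yaml
# Illustrative recording rule showing the histogram_quantile query shape;
# validate with promtool before using it in a real rules file.
groups:
  - name: kserve-modelserver-latency
    rules:
      - record: kserve:request_predict_seconds:p95
        # 95th-percentile predict latency (seconds) over the last minute,
        # built from the request_predict_seconds histogram buckets.
        expr: histogram_quantile(0.95, sum by (le) (rate(request_predict_seconds_bucket[1m])))
```

Multiplying the result by 1000 gives milliseconds, matching the dashboard's units.
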
## KServe TorchServe Latency Dashboard

A template dashboard for [KServe TorchServe Latency](https://grafana.com/grafana/dashboards/18026-kserve-torchserve-latency/) contains an inference latency graph that plots the [rate](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) of the TorchServe metric `ts_inference_latency_microseconds`, converted to milliseconds. A second graph plots the rate of TorchServe's internal queue latency metric `ts_queue_latency_microseconds` in milliseconds, and a third graph plots the total requests to the TorchServe InferenceService. For more information, see the [TorchServe metrics doc](https://pytorch.org/serve/metrics_api.html).

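As a rough sketch of those panels (the dashboard's exact expressions may differ), the rules below take the per-second rate of the cumulative microsecond counters named above and divide by 1000 to express them in milliseconds:

```yaml
# Illustrative recording rules mirroring the dashboard description above;
# window size and rule names are arbitrary choices for this sketch.
groups:
  - name: kserve-torchserve-latency
    rules:
      - record: torchserve:inference_latency_ms:rate1m
        # Inference time accumulated per second, converted from microseconds to milliseconds.
        expr: rate(ts_inference_latency_microseconds[1m]) / 1000
      - record: torchserve:queue_latency_ms:rate1m
        # Queue wait time accumulated per second, in milliseconds.
        expr: rate(ts_queue_latency_microseconds[1m]) / 1000
```
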
## KServe Triton Latency Dashboard

A template dashboard for [KServe Triton Latency](https://grafana.com/grafana/dashboards/18027-kserve-triton-latency/) contains five latency graphs that plot the rate of Triton's input (preprocess), infer (predict), output (postprocess), internal queue, and total latency metrics in milliseconds. Triton also exposes GPU metrics, and the template includes a gauge of GPU memory usage. For more information, see the [Triton Inference Server docs](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md).

## Debugging Performance

With these Grafana dashboards set up, debug latency issues with the following steps.

First (if in serverless mode), start with the Knative HTTP Dashboard to check whether there is a queueing delay with queue-proxy:

- [x] Compare the gateway latency percentile metrics with your target SLO.
- [x] Check the observed concurrency metrics to see if your service is overloaded with a high number of in-flight requests, which indicates the service is over capacity and unable to keep up with the incoming requests.
- [x] Check the GPU/CPU memory metrics to see if the service is close to its limits. If your service has a high number of in-flight requests or high CPU/GPU usage, a possible solution is to add more resources or replicas.

Next, look at the appropriate serving runtime dashboard to see if there is a bottleneck in the code:

- [x] Check the latencies for pre/post-process, predict, and explain. Are latencies higher than expected at any one step? If so, you may need to make changes or adjustments for that step. (Note: TorchServe does not currently expose this level of observability, only an inference latency graph that encompasses these steps together.)
- [x] Check the queue latency metrics (TorchServe and Triton). If requests are stuck in the queue, the model is not able to keep up with the number of requests; consider adding more resources or replicas.
- [x] (Triton) Check the GPU utilization metrics to see if your service is at capacity and needs more GPU resources.

If the numbers from the dashboards meet your SLO, check client-side metrics to investigate whether the client or the network is adding the extra latency.

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
title: Prometheus Metrics
description: Learn how to expose and configure Prometheus metrics for KServe inference services to monitor performance and health.
---

# Prometheus Metrics

## Exposing a Prometheus metrics port

All supported serving runtimes can export Prometheus metrics on a specified port in the InferenceService's pod. The appropriate port for each model server is defined in the [kserve/config/runtimes](https://github.com/kserve/kserve/tree/master/config/runtimes) YAML files. For example, TorchServe defines its Prometheus port as `8082` in `kserve-torchserve.yaml`.

```yaml
metadata:
  name: kserve-torchserve
spec:
  annotations:
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
```

If needed, this value can be overridden in the InferenceService YAML.
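
For example, a minimal sketch of such an override, assuming the annotation is set on the InferenceService metadata and propagated to the pod (port `9090` here is purely illustrative):

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    # Hypothetical override of the runtime's default metrics port and path.
    prometheus.kserve.io/port: "9090"
    prometheus.kserve.io/path: "/metrics"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```
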
To enable Prometheus metrics, add the annotation `serving.kserve.io/enable-prometheus-scraping` to the InferenceService YAML.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

The default value for `serving.kserve.io/enable-prometheus-scraping` can be set in the `inferenceservice-config` configmap. See [the docs](https://github.com/kserve/kserve/blob/master/qpext/README.md#configs) for more info.

There is currently no unified set of metrics exported by the model servers; each model server may implement its own set of metrics.

:::note
This annotation defines the Prometheus port and path, but it does not cause Prometheus to scrape the pod. Users must configure Prometheus to scrape the InferenceService's pod according to their Prometheus setup.
:::

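As a hedged sketch of what that configuration might look like (not something KServe installs for you, and clusters running the Prometheus Operator would typically use a `PodMonitor` or `ServiceMonitor` instead), a plain Prometheus scrape job can discover pods by the annotations shown above:

```yaml
# Illustrative Prometheus scrape_config using Kubernetes pod discovery.
# It keeps only pods carrying the prometheus.kserve.io/port annotation and
# rewrites the scrape address and path from those annotations.
scrape_configs:
  - job_name: kserve-inference-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that declare a metrics port via the annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_kserve_io_port]
        action: keep
        regex: '\d+'
      # Use the annotated path; if the annotation is absent, the default /metrics is kept.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_kserve_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Point the scrape address at the annotated port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_kserve_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```
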
## Metrics for lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, custom transformer/predictor

Prometheus latency histograms are emitted for each of the steps (pre/postprocessing, explain, predict). Additionally, the latencies of each step are logged per request. See also the [modelserver prometheus label definitions](https://github.com/kserve/kserve/blob/master/python/kserve/kserve/metrics.py) and [metric implementation](https://github.com/kserve/kserve/blob/master/python/kserve/kserve/model.py#L94-L130).

| Metric Name                  | Description                     | Type      |
|------------------------------|---------------------------------|-----------|
| request_preprocess_seconds   | pre-processing request latency  | Histogram |
| request_explain_seconds      | explain request latency         | Histogram |
| request_predict_seconds      | prediction request latency      | Histogram |
| request_postprocess_seconds  | post-processing request latency | Histogram |

## Other serving runtime metrics

Some model servers define their own metrics:

* [mlserver](https://docs.seldon.io/projects/seldon-core/en/latest/analytics/analytics.html)
* [torchserve](https://pytorch.org/serve/metrics_api.html)
* [triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md)
* [tensorflow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/metrics.cc) (see [GitHub Issue #2462](https://github.com/kserve/kserve/issues/2462))

## Exporting metrics

Exporting metrics in serverless mode requires the queue-proxy extension image.

For more information on how to export metrics, see the [Queue Proxy Extension](https://github.com/kserve/kserve/blob/master/qpext/README.md) documentation.

## Knative/Queue-Proxy metrics

Queue-proxy emits metrics on port 9091 by default. If metrics aggregation is set up with the queue-proxy extension, the default port for the aggregated metrics is 9088. See the [Knative documentation](https://knative.dev/development/serving/services/service-metrics/) (and the [additional metrics defined in the code](https://github.com/vagababov/serving/blob/master/pkg/queue/prometheus_stats_reporter.go#L118)) for more information about the metrics queue-proxy exposes.

sidebars.ts

Lines changed: 8 additions & 0 deletions
@@ -245,6 +245,14 @@ const sidebars: SidebarsConfig = {
          "model-serving/predictive-inference/kafka/kafka",
        ]
      },
+     {
+       type: 'category',
+       label: 'Inference Observability',
+       items: [
+         "model-serving/predictive-inference/observability/prometheus-metrics",
+         "model-serving/predictive-inference/observability/grafana-dashboards",
+       ]
+     },
      {
        type: 'category',
        label: 'Rollout Strategies',
