
Commit c507750

Migrate observability to docusaurus (#511)
* Migrate observability to docusaurus

  Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* Updated isvc yaml

  Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

---------

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>
1 parent 65b8ca3 commit c507750

File tree

3 files changed: +127 -0 lines changed
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: Grafana Dashboards
description: Explore pre-built Grafana dashboards for monitoring KServe inference services, including latency metrics and performance debugging.
---

# Grafana Dashboards

Some example Grafana dashboards are available on Grafana Labs.

## Knative HTTP Dashboard (if using serverless mode)

The [Knative HTTP Grafana dashboard](https://grafana.com/grafana/dashboards/18032-knative-serving-revision-http-requests/) was built from [Knative's sandbox monitoring example](https://github.com/knative-sandbox/monitoring).

## KServe ModelServer Latency Dashboard

A template dashboard for [KServe ModelServer Latency](https://grafana.com/grafana/dashboards/17969-kserve-modelserver-latency/) contains example queries that use the Prometheus metrics for pre/post-process, predict, and explain latency in milliseconds. Each query is a [histogram quantile](https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile). A fifth graph shows the total number of requests to the predict endpoint. The dashboard covers all of KServe's ModelServer runtimes: lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, and custom transformers/predictors.

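To illustrate the shape of such a query (a sketch only; the published dashboard may use different windows, quantiles, and label groupings), the following Prometheus recording rule computes a 95th-percentile predict latency from the `request_predict_seconds` histogram described on the Prometheus Metrics page. The rule name and the 1m rate window are illustrative choices, not part of the dashboard.

```yaml
# Illustrative recording rule showing the histogram_quantile query shape;
# validate with promtool before using it in a real rules file.
groups:
  - name: kserve-modelserver-latency
    rules:
      - record: kserve:request_predict_seconds:p95
        # 95th-percentile predict latency (seconds) over the last minute,
        # built from the request_predict_seconds histogram buckets.
        expr: histogram_quantile(0.95, sum by (le) (rate(request_predict_seconds_bucket[1m])))
```

Multiplying the result by 1000 gives milliseconds, matching the dashboard's units.
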
## KServe TorchServe Latency Dashboard

A template dashboard for [KServe TorchServe Latency](https://grafana.com/grafana/dashboards/18026-kserve-torchserve-latency/) contains an inference latency graph that plots the [rate](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) of the TorchServe metric `ts_inference_latency_microseconds`, converted to milliseconds. A second graph plots the rate of TorchServe's internal queue latency metric `ts_queue_latency_microseconds` in milliseconds, and a third graph plots the total requests to the TorchServe InferenceService. For more information, see the [TorchServe metrics doc](https://pytorch.org/serve/metrics_api.html).

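As a rough sketch of those panels (the dashboard's exact expressions may differ), the rules below take the per-second rate of the cumulative microsecond counters named above and divide by 1000 to express them in milliseconds:

```yaml
# Illustrative recording rules mirroring the dashboard description above;
# window size and rule names are arbitrary choices for this sketch.
groups:
  - name: kserve-torchserve-latency
    rules:
      - record: torchserve:inference_latency_ms:rate1m
        # Inference time accumulated per second, converted from microseconds to milliseconds.
        expr: rate(ts_inference_latency_microseconds[1m]) / 1000
      - record: torchserve:queue_latency_ms:rate1m
        # Queue wait time accumulated per second, in milliseconds.
        expr: rate(ts_queue_latency_microseconds[1m]) / 1000
```
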
## KServe Triton Latency Dashboard

A template dashboard for [KServe Triton Latency](https://grafana.com/grafana/dashboards/18027-kserve-triton-latency/) contains five latency graphs that plot the rate of Triton's input (preprocess), infer (predict), output (postprocess), internal queue, and total latency metrics in milliseconds. Triton also exposes GPU metrics, and the template includes a gauge of GPU memory usage. For more information, see the [Triton Inference Server docs](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md).

## Debugging Performance

With these Grafana dashboards set up, debug latency issues with the following steps.

First (if in serverless mode), start with the Knative HTTP Dashboard to check whether there is a queueing delay with queue-proxy:

- [x] Compare the gateway latency percentile metrics with your target SLO.
- [x] Check the observed concurrency metrics to see if your service is overloaded with a high number of in-flight requests, which indicates the service is over capacity and unable to keep up with the incoming requests.
- [x] Check the GPU/CPU memory metrics to see if the service is close to its limits. If your service has a high number of in-flight requests or high CPU/GPU usage, a possible solution is to add more resources or replicas.

Next, look at the appropriate serving runtime dashboard to see if there is a bottleneck in the code:

- [x] Check the latencies for pre/post-process, predict, and explain. Are latencies higher than expected at any one step? If so, you may need to make changes or adjustments for that step. (Note: TorchServe does not currently expose this level of observability, only an inference latency graph that encompasses these steps together.)
- [x] Check the queue latency metrics (TorchServe and Triton). If requests are stuck in the queue, the model is not able to keep up with the number of requests; consider adding more resources or replicas.
- [x] (Triton) Check the GPU utilization metrics to see if your service is at capacity and needs more GPU resources.

If the numbers from the dashboards meet your SLO, check client-side metrics to investigate whether the client or the network is adding the extra latency.

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
title: Prometheus Metrics
description: Learn how to expose and configure Prometheus metrics for KServe inference services to monitor performance and health.
---

# Prometheus Metrics

## Exposing a Prometheus metrics port

All supported serving runtimes can export Prometheus metrics on a specified port in the InferenceService's pod. The appropriate port for each model server is defined in the [kserve/config/runtimes](https://github.com/kserve/kserve/tree/master/config/runtimes) YAML files. For example, TorchServe defines its Prometheus port as `8082` in `kserve-torchserve.yaml`.

```yaml
metadata:
  name: kserve-torchserve
spec:
  annotations:
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
```

If needed, this value can be overridden in the InferenceService YAML.
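
For example, a minimal sketch of such an override, assuming the annotation is set on the InferenceService metadata and propagated to the pod (port `9090` here is purely illustrative):

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    # Hypothetical override of the runtime's default metrics port and path.
    prometheus.kserve.io/port: "9090"
    prometheus.kserve.io/path: "/metrics"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```
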
To enable Prometheus metrics, add the annotation `serving.kserve.io/enable-prometheus-scraping` to the InferenceService YAML.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

The default value for `serving.kserve.io/enable-prometheus-scraping` can be set in the `inferenceservice-config` configmap. See [the docs](https://github.com/kserve/kserve/blob/master/qpext/README.md#configs) for more info.

There is currently no unified set of metrics exported by the model servers; each model server may implement its own set of metrics.

:::note
This annotation defines the Prometheus port and path, but it does not cause Prometheus to scrape the pod. Users must configure Prometheus to scrape the InferenceService's pod according to their Prometheus setup.
:::

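As a hedged sketch of what that configuration might look like (not something KServe installs for you, and clusters running the Prometheus Operator would typically use a `PodMonitor` or `ServiceMonitor` instead), a plain Prometheus scrape job can discover pods by the annotations shown above:

```yaml
# Illustrative Prometheus scrape_config using Kubernetes pod discovery.
# It keeps only pods carrying the prometheus.kserve.io/port annotation and
# rewrites the scrape address and path from those annotations.
scrape_configs:
  - job_name: kserve-inference-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that declare a metrics port via the annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_kserve_io_port]
        action: keep
        regex: '\d+'
      # Use the annotated path; if the annotation is absent, the default /metrics is kept.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_kserve_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Point the scrape address at the annotated port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_kserve_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```
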
## Metrics for lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, custom transformer/predictor

Prometheus latency histograms are emitted for each of the steps (pre/postprocessing, explain, predict). Additionally, the latencies of each step are logged per request. See also the [modelserver prometheus label definitions](https://github.com/kserve/kserve/blob/master/python/kserve/kserve/metrics.py) and [metric implementation](https://github.com/kserve/kserve/blob/master/python/kserve/kserve/model.py#L94-L130).

| Metric Name                  | Description                     | Type      |
|------------------------------|---------------------------------|-----------|
| request_preprocess_seconds   | pre-processing request latency  | Histogram |
| request_explain_seconds      | explain request latency         | Histogram |
| request_predict_seconds      | prediction request latency      | Histogram |
| request_postprocess_seconds  | post-processing request latency | Histogram |

## Other serving runtime metrics

Some model servers define their own metrics:

* [mlserver](https://docs.seldon.io/projects/seldon-core/en/latest/analytics/analytics.html)
* [torchserve](https://pytorch.org/serve/metrics_api.html)
* [triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md)
* [tensorflow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/metrics.cc) (see [GitHub Issue #2462](https://github.com/kserve/kserve/issues/2462))

## Exporting metrics

Exporting metrics in serverless mode requires the queue-proxy extension image.

For more information on how to export metrics, see the [Queue Proxy Extension](https://github.com/kserve/kserve/blob/master/qpext/README.md) documentation.

## Knative/Queue-Proxy metrics

Queue-proxy emits metrics on port 9091 by default. If metrics aggregation is set up with the queue-proxy extension, the default port for the aggregated metrics is 9088. See the [Knative documentation](https://knative.dev/development/serving/services/service-metrics/) (and the [additional metrics defined in the code](https://github.com/vagababov/serving/blob/master/pkg/queue/prometheus_stats_reporter.go#L118)) for more information about the metrics queue-proxy exposes.

sidebars.ts

Lines changed: 8 additions & 0 deletions
@@ -245,6 +245,14 @@ const sidebars: SidebarsConfig = {
          "model-serving/predictive-inference/kafka/kafka",
        ]
      },
+     {
+       type: 'category',
+       label: 'Inference Observability',
+       items: [
+         "model-serving/predictive-inference/observability/prometheus-metrics",
+         "model-serving/predictive-inference/observability/grafana-dashboards",
+       ]
+     },
      {
        type: 'category',
        label: 'Rollout Strategies',
