Commit 4efb275 (parent 4263b90)

add runbooks for new alerts

There are new alerts introduced for the ODF health score calculation; this commit adds a runbook for each of them.

Signed-off-by: yati1998 <ypadia@redhat.com>

File tree: 6 files changed (+234, -0 lines)

Lines changed: 26 additions & 0 deletions
# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.

## Impact

* Brief service interruption (e.g., a MON restart may cause quorum re-election).
* An OSD restart triggers PG peering and potential recovery.
* An operator restart delays configuration changes or health checks.
* May indicate underlying instability (resource pressure, bugs, or node issues).

## Diagnosis

1. Identify the affected pod from the alert labels (pod, namespace), as shown below.
2. [pod debug](helpers/podDebug.md)

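A minimal sketch for step 1, assuming the default `openshift-storage` namespace; the pod name comes from the alert's `pod` label:

```bash
# List pods sorted by restart count to spot the recently restarted pod
oc get pods -n openshift-storage \
  --sort-by='.status.containerStatuses[0].restartCount'

# Check why the last instance of each container terminated (e.g., OOMKilled, Error)
oc get pod <pod-name> -n openshift-storage \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'
```
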
## Mitigation

1. If OOMKilled: Increase memory limits for the container.
2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
3. If node-related: Cordon and drain the node; replace it if faulty.
4. Ensure HA: MONs should be ≥3; OSDs should be distributed.
5. Update: If due to a known bug, upgrade ODF to a fixed version.

Lines changed: 32 additions & 0 deletions
# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (as measured by an iostat-style %util
derived from node_disk_io_time_seconds_total), indicating heavy I/O load.

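A rough PromQL sketch of how this utilization can be derived from that metric; the exact alert expression may differ, and the 5m window and device filter here are assumptions:

```promql
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m]) * 100 > 90
```
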
## Impact

* Increased I/O latency for RBD/CephFS clients.
* Slower OSD response times, risking heartbeat timeouts.
* Reduced cluster throughput during peak workloads.
* Potential for “slow request” warnings in Ceph logs.

## Diagnosis

1. Identify the node and device from the alert labels.
2. Check the disk model and type:
   ```bash
   oc debug node/<node>
   lsblk -d -o NAME,ROTA,MODEL
   # Confirm it’s an expected OSD device (HDD/SSD/NVMe)
   ```
3. Monitor real-time I/O:
   ```bash
   iostat -x 2 5
   ```
4. Correlate with Ceph:
   ```bash
   ceph osd df tree   # check weight and reweight
   ceph osd perf      # check commit/apply latency
   ```

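The `ceph` commands above are typically run from the Rook/ODF toolbox pod; a sketch, assuming the toolbox is enabled and uses the default `rook-ceph-tools` deployment name:

```bash
# Open a shell in the Ceph toolbox pod, then run the ceph commands from there
oc rsh -n openshift-storage deploy/rook-ceph-tools
```
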
Lines changed: 47 additions & 0 deletions
# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP RTT latency to non-OSD ODF nodes (e.g., MON, MGR, MDS, or client nodes)
exceeds 100 milliseconds over the last 24 hours. These nodes participate in
Ceph control plane or client access but do not store data.

## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.
* Potential timeouts in ODF operator reconciliation.

## Diagnosis

1. From the alert, note the instance (node IP).
2. Confirm the node does not run OSDs:
   ```bash
   oc get pods -n openshift-storage -o wide | grep <node-name>
   ```
3. Test connectivity:
   ```bash
   ping <node-ip>
   mtr <node-ip>
   ```
4. Check system load and network interface stats on the node:
   ```bash
   oc debug node/<node-name>
   sar -n DEV 1 5
   ip -s link show <iface>
   ```
5. Review Ceph monitor logs if the node hosts MONs:
   ```bash
   oc logs -l app=rook-ceph-mon -n openshift-storage
   ```

## Mitigation

1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
3. If the node is a client (e.g., running applications), verify it is not on an
   overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed
   (see the sketch below).
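
A minimal sketch for step 4; the buffer values shown are illustrative assumptions, not recommendations, and should be validated for your environment:

```bash
oc debug node/<node-name>
chroot /host
# Inspect current socket buffer limits
sysctl net.core.rmem_max net.core.wmem_max

# Illustrative (assumed) increase; persist cluster-wide via a MachineConfig or Tuned profile
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```
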
Lines changed: 56 additions & 0 deletions
# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and
OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
triggers only on nodes that host Ceph OSD pods, indicating potential
network congestion or issues on the storage network.

## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.

## Diagnosis

1. Identify affected node(s):
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   # or check node labels used in OSD scheduling
   ```
2. Check the alert’s instance label to get the node IP.
3. From a monitoring or debug pod, test connectivity:
   ```bash
   ping <node-internal-ip>
   ```
4. Use mtr or traceroute to analyze path and hops.
5. Verify if the node is under high CPU or network load:
   ```bash
   oc debug node/<node>
   top -b -n 1 | head -20
   sar -u 1 5
   ```
6. Check Ceph health and OSD status:
   ```bash
   ceph osd status
   ceph -s
   ```

## Mitigation

1. Network tuning: Ensure jumbo frames (MTU ≥ 9000) are enabled end-to-end
   on the storage network.
2. Isolate traffic: Confirm storage traffic uses a dedicated VLAN or NIC, separate
   from management/tenant traffic.
3. Hardware check: Inspect switch logs, NIC error counters (ethtool -S <iface>;
   see the sketch below), and NIC firmware.
4. Topology: Ensure OSD nodes are in the same rack/zone or connected via a
   low-latency fabric.
5. If latency is transient, monitor; if persistent, engage the network or
   infrastructure team.
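
A quick sketch for step 3, checking NIC error and drop counters from a node debug shell; counter names vary by driver, so the grep pattern is an assumption:

```bash
oc debug node/<node>
chroot /host
# Look for incrementing error/drop counters on the storage interface
ethtool -S <iface> | grep -Ei 'err|drop|crc|fifo'
ip -s link show <iface>
```
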
Lines changed: 34 additions & 0 deletions
# ODFNodeMTULessThan9000

## Meaning

At least one physical (or otherwise relevant) network interface on an ODF node has an
MTU (Maximum Transmission Unit) of less than 9000 bytes, which violates ODF best
practices for storage networks.

## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential for packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.

## Diagnosis

1. List all nodes in the storage cluster:
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   ```
2. For each node, check interface MTUs:
   ```bash
   oc debug node/<node-name>
   ip link show
   # Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
   ```
3. Alternatively, use Prometheus:
   ```promql
   node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
   ```
4. Verify MTU consistency across all nodes and all switches in the storage fabric
   (see the sketch below for checking all nodes in one pass).
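
A minimal sketch for step 4, looping over the storage nodes and printing per-interface MTUs; the node label and the awk field positions are assumptions based on the default ODF label and `ip -o link` output:

```bash
for n in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' -o name); do
  echo "== ${n} =="
  # Print "<iface>: mtu <value>" for every interface on the node
  oc debug "${n}" -- chroot /host ip -o link show | awk '{print $2, $4, $5}'
done
```
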
Lines changed: 39 additions & 0 deletions
# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at more than 90% of its reported
link speed, indicating potential bandwidth saturation.

## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible Ceph OSD evictions due to heartbeat failures.

## Diagnosis

1. From the alert, note the instance and device.
2. Check current utilization:
   ```bash
   oc debug node/<node>
   sar -n DEV 1 5
   ```
3. Use Prometheus to graph throughput in bits per second:
   ```promql
   rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   rate(node_network_transmit_bytes_total{...}) * 8
   ```
4. Determine whether the traffic is Ceph-related (e.g., during a rebalance) or external.

## Mitigation

1. Short term: Throttle non-essential traffic on the node.
2. Long term:
   * Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
   * Use multiple bonded interfaces with LACP.
   * Separate storage and client traffic using VLANs or dedicated NICs.
3. Tune Ceph osd_max_backfills and osd_recovery_max_active to reduce
   recovery bandwidth (see the sketch below).
4. Enable NIC offload features (TSO, GRO) if they are disabled.
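
A minimal sketch for step 3, run from the Ceph toolbox pod; the values shown are illustrative assumptions to slow down recovery, not tuned recommendations:

```bash
# Reduce backfill/recovery parallelism so recovery consumes less bandwidth
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Verify the resulting values
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
```
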
