Commit 4efb275 (parent 4263b90)

add runbooks for new alerts

There are new alerts introduced for the ODF health score calculation; this commit adds a runbook for each of them.

Signed-off-by: yati1998 <ypadia@redhat.com>

File tree: 6 files changed (+234, -0 lines)

Lines changed: 26 additions & 0 deletions
# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.

## Impact

* Brief service interruption (e.g., a MON restart may cause quorum re-election).
* An OSD restart triggers PG peering and potential recovery.
* An operator restart delays configuration changes or health checks.
* May indicate underlying instability (resource pressure, bugs, or node issues).

## Diagnosis

1. Identify the affected pod from the alert labels (pod, namespace), as shown below.
2. [pod debug](helpers/podDebug.md)

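A minimal sketch for step 1, assuming the default `openshift-storage` namespace; the pod name comes from the alert's `pod` label:

```bash
# List pods sorted by restart count to spot the recently restarted pod
oc get pods -n openshift-storage \
  --sort-by='.status.containerStatuses[0].restartCount'

# Check why the last instance of each container terminated (e.g., OOMKilled, Error)
oc get pod <pod-name> -n openshift-storage \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'
```
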
## Mitigation

1. If OOMKilled: Increase memory limits for the container.
2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
3. If node-related: Cordon and drain the node; replace it if faulty.
4. Ensure HA: MONs should be ≥3; OSDs should be distributed.
5. Update: If due to a known bug, upgrade ODF to a fixed version.

Lines changed: 32 additions & 0 deletions
# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (as measured by an iostat-style %util
derived from node_disk_io_time_seconds_total), indicating heavy I/O load.

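A rough PromQL sketch of how this utilization can be derived from that metric; the exact alert expression may differ, and the 5m window and device filter here are assumptions:

```promql
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m]) * 100 > 90
```
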
## Impact

* Increased I/O latency for RBD/CephFS clients.
* Slower OSD response times, risking heartbeat timeouts.
* Reduced cluster throughput during peak workloads.
* Potential for “slow request” warnings in Ceph logs.

## Diagnosis

1. Identify the node and device from the alert labels.
2. Check the disk model and type:
   ```bash
   oc debug node/<node>
   lsblk -d -o NAME,ROTA,MODEL
   # Confirm it’s an expected OSD device (HDD/SSD/NVMe)
   ```
3. Monitor real-time I/O:
   ```bash
   iostat -x 2 5
   ```
4. Correlate with Ceph:
   ```bash
   ceph osd df tree   # check weight and reweight
   ceph osd perf      # check commit/apply latency
   ```

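The `ceph` commands above are typically run from the Rook/ODF toolbox pod; a sketch, assuming the toolbox is enabled and uses the default `rook-ceph-tools` deployment name:

```bash
# Open a shell in the Ceph toolbox pod, then run the ceph commands from there
oc rsh -n openshift-storage deploy/rook-ceph-tools
```
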
Lines changed: 47 additions & 0 deletions
# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP RTT latency to non-OSD ODF nodes (e.g., MON, MGR, MDS, or client nodes)
exceeds 100 milliseconds over the last 24 hours. These nodes participate in
Ceph control plane or client access but do not store data.

## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.
* Potential timeouts in ODF operator reconciliation.

## Diagnosis

1. From the alert, note the instance (node IP).
2. Confirm the node does not run OSDs:
   ```bash
   oc get pods -n openshift-storage -o wide | grep <node-name>
   ```
3. Test connectivity:
   ```bash
   ping <node-ip>
   mtr <node-ip>
   ```
4. Check system load and network interface stats on the node:
   ```bash
   oc debug node/<node-name>
   sar -n DEV 1 5
   ip -s link show <iface>
   ```
5. Review Ceph monitor logs if the node hosts MONs:
   ```bash
   oc logs -l app=rook-ceph-mon -n openshift-storage
   ```

## Mitigation

1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
3. If the node is a client (e.g., running applications), verify it is not on an
   overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed
   (see the sketch below).
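
A minimal sketch for step 4; the buffer values shown are illustrative assumptions, not recommendations, and should be validated for your environment:

```bash
oc debug node/<node-name>
chroot /host
# Inspect current socket buffer limits
sysctl net.core.rmem_max net.core.wmem_max

# Illustrative (assumed) increase; persist cluster-wide via a MachineConfig or Tuned profile
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```
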
Lines changed: 56 additions & 0 deletions
# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and
OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
triggers only on nodes that host Ceph OSD pods, indicating potential
network congestion or issues on the storage network.

## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.

## Diagnosis

1. Identify affected node(s):
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   # or check node labels used in OSD scheduling
   ```
2. Check the alert’s instance label to get the node IP.
3. From a monitoring or debug pod, test connectivity:
   ```bash
   ping <node-internal-ip>
   ```
4. Use mtr or traceroute to analyze path and hops.
5. Verify if the node is under high CPU or network load:
   ```bash
   oc debug node/<node>
   top -b -n 1 | head -20
   sar -u 1 5
   ```
6. Check Ceph health and OSD status:
   ```bash
   ceph osd status
   ceph -s
   ```

## Mitigation

1. Network tuning: Ensure jumbo frames (MTU ≥ 9000) are enabled end-to-end
   on the storage network.
2. Isolate traffic: Confirm storage traffic uses a dedicated VLAN or NIC, separate
   from management/tenant traffic.
3. Hardware check: Inspect switch logs, NIC error counters (ethtool -S <iface>;
   see the sketch below), and NIC firmware.
4. Topology: Ensure OSD nodes are in the same rack/zone or connected via a
   low-latency fabric.
5. If latency is transient, monitor; if persistent, engage the network or
   infrastructure team.
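
A quick sketch for step 3, checking NIC error and drop counters from a node debug shell; counter names vary by driver, so the grep pattern is an assumption:

```bash
oc debug node/<node>
chroot /host
# Look for incrementing error/drop counters on the storage interface
ethtool -S <iface> | grep -Ei 'err|drop|crc|fifo'
ip -s link show <iface>
```
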
Lines changed: 34 additions & 0 deletions
# ODFNodeMTULessThan9000

## Meaning

At least one physical (or otherwise relevant) network interface on an ODF node has an
MTU (Maximum Transmission Unit) of less than 9000 bytes, which violates ODF best
practices for storage networks.

## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential for packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.

## Diagnosis

1. List all nodes in the storage cluster:
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   ```
2. For each node, check interface MTUs:
   ```bash
   oc debug node/<node-name>
   ip link show
   # Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
   ```
3. Alternatively, use Prometheus:
   ```promql
   node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
   ```
4. Verify MTU consistency across all nodes and all switches in the storage fabric
   (see the sketch below for checking all nodes in one pass).
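
A minimal sketch for step 4, looping over the storage nodes and printing per-interface MTUs; the node label and the awk field positions are assumptions based on the default ODF label and `ip -o link` output:

```bash
for n in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' -o name); do
  echo "== ${n} =="
  # Print "<iface>: mtu <value>" for every interface on the node
  oc debug "${n}" -- chroot /host ip -o link show | awk '{print $2, $4, $5}'
done
```
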
Lines changed: 39 additions & 0 deletions
# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at more than 90% of its reported
link speed, indicating potential bandwidth saturation.

## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible Ceph OSD evictions due to heartbeat failures.

## Diagnosis

1. From the alert, note the instance and device.
2. Check current utilization:
   ```bash
   oc debug node/<node>
   sar -n DEV 1 5
   ```
3. Use Prometheus to graph throughput in bits per second:
   ```promql
   rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   rate(node_network_transmit_bytes_total{...}) * 8
   ```
4. Determine whether the traffic is Ceph-related (e.g., during a rebalance) or external.

## Mitigation

1. Short term: Throttle non-essential traffic on the node.
2. Long term:
   * Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
   * Use multiple bonded interfaces with LACP.
   * Separate storage and client traffic using VLANs or dedicated NICs.
3. Tune Ceph osd_max_backfills and osd_recovery_max_active to reduce
   recovery bandwidth (see the sketch below).
4. Enable NIC offload features (TSO, GRO) if they are disabled.
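
A minimal sketch for step 3, run from the Ceph toolbox pod; the values shown are illustrative assumptions to slow down recovery, not tuned recommendations:

```bash
# Reduce backfill/recovery parallelism so recovery consumes less bandwidth
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Verify the resulting values
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
```
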
