add runbooks for new alerts #363
@@ -0,0 +1,26 @@

# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.

## Impact

* Brief service interruption (e.g., MON restart may cause quorum re-election).
* OSD restart triggers PG peering and potential recovery.
* Operator restart delays configuration changes or health checks.
* May indicate underlying instability (resource pressure, bugs, or node issues).

## Diagnosis
1. Identify the affected pod and namespace from the alert labels.
2. Follow the [pod debug](helpers/podDebug.md) helper; a quick triage sketch is shown below.
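A minimal triage sketch, assuming the alert's `pod` and `namespace` labels have been copied into shell variables (the example pod name is hypothetical):

```bash
# Hypothetical values taken from the alert labels
NAMESPACE=openshift-storage
POD=rook-ceph-osd-0-6d5f9c7b8d-abcde

# Restart count and current state
oc get pod "$POD" -n "$NAMESPACE"

# Reason for the last termination of each container (e.g., OOMKilled, Error)
oc get pod "$POD" -n "$NAMESPACE" \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

# Logs from the previous (crashed) container instance
oc logs "$POD" -n "$NAMESPACE" --previous --tail=100

# Recent events (probe failures, OOM kills, evictions)
oc describe pod "$POD" -n "$NAMESPACE" | tail -n 40
```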
## Mitigation
1. If OOMKilled: Increase memory limits for the container.
2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
3. If node-related: Cordon and drain the node; replace if faulty (see the sketch below).
4. Ensure HA: MONs should be ≥3; OSDs should be distributed across failure domains.
5. Update: If due to a known bug, upgrade ODF to a fixed version.
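For the node-related case (step 3), a minimal sketch of cordoning and draining, assuming `<node-name>` is the suspect node:

```bash
# Stop new pods from scheduling onto the suspect node
oc adm cordon <node-name>

# Evict existing pods; DaemonSet pods are skipped and emptyDir data is removed
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data

# After repair or replacement, allow scheduling again
oc adm uncordon <node-name>
```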
**Contributor:** no mitigation section? please add mitigation steps

**Contributor (Author):** I am not sure of what mitigation steps should be added here, so I left it empty for now!
@@ -0,0 +1,32 @@

# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (the equivalent of iostat's %util,
derived from node_disk_io_time_seconds_total), indicating heavy I/O load.
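For reference, a PromQL sketch of this utilization measure; the device filter and the 90% threshold are illustrative, and the alert's actual recording rule may differ:

```promql
# Fraction of each 5-minute window the device spent doing I/O (~ iostat %util)
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m]) > 0.9
```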
## Impact

* Increased I/O latency for RBD/CephFS clients.
* Slower OSD response times, risking heartbeat timeouts.
* Reduced cluster throughput during peak workloads.
* Potential for “slow request” warnings in Ceph logs.

## Diagnosis
1. Identify the node and device from the alert labels.
2. Check the disk model and type:
   ```bash
   oc debug node/<node>
   lsblk -d -o NAME,ROTA,MODEL
   # Confirm it’s an expected OSD device (HDD/SSD/NVMe)
   ```
3. Monitor real-time I/O:
   ```bash
   iostat -x 2 5
   ```
4. Correlate with Ceph:
   ```bash
   ceph osd df tree   # check weight and reweight
   ceph osd perf      # check commit/apply latency
   ```
@@ -0,0 +1,47 @@

# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP round-trip time (RTT) to non-OSD ODF nodes (e.g., MON, MGR, MDS, or
client nodes) exceeds 100 milliseconds over the last 24 hours. These nodes
participate in the Ceph control plane or client access but do not store data.

## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.
* Potential timeouts in ODF operator reconciliation.

## Diagnosis
1. From the alert, note the instance label (node IP).
2. Confirm the node does not run OSDs:
   ```bash
   oc get pods -n openshift-storage -o wide | grep <node-name>
   ```
3. Test connectivity:
   ```bash
   ping <node-ip>
   mtr <node-ip>
   ```
4. Check system load and network interface stats on the node:
   ```bash
   oc debug node/<node-name>
   sar -n DEV 1 5
   ip -s link show <iface>
   ```
5. Review Ceph monitor logs if the node hosts MONs:
   ```bash
   oc logs -l app=rook-ceph-mon -n openshift-storage
   ```
## Mitigation
1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
3. If the node is a client (e.g., running applications), verify it is not on an
   overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed
   (see the sketch below).
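For step 4, a hedged sketch of checking for drops and inspecting common buffer-related sysctls from a node debug session; the values shown are examples, not tuned recommendations:

```bash
oc debug node/<node-name>
chroot /host

# RX/TX errors and drops on the interface
ip -s link show <iface>

# Current socket buffer and backlog limits
sysctl net.core.rmem_max net.core.wmem_max net.core.netdev_max_backlog

# Illustrative increase of buffer ceilings; validate before persisting
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```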
@@ -0,0 +1,56 @@

# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and
OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
triggers only on nodes that host Ceph OSD pods, indicating potential
network congestion or issues on the storage network.

## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.

## Diagnosis
1. Identify the affected node(s):
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   # or check node labels used in OSD scheduling
   ```
2. Check the alert’s instance label to get the node IP.
3. From a monitoring or debug pod, test connectivity:
   ```bash
   ping <node-internal-ip>
   ```
4. Use mtr or traceroute to analyze the path and hops.
5. Verify whether the node is under high CPU or network load:
   ```bash
   oc debug node/<node>
   top -b -n 1 | head -20
   sar -u 1 5
   ```
6. Check Ceph health and OSD status:
   ```bash
   ceph osd status
   ceph -s
   ```
## Mitigation
1. Network tuning: Ensure jumbo frames (MTU ≥ 9000) are enabled end-to-end
   on the storage network.
2. Isolate traffic: Confirm storage traffic uses a dedicated VLAN or NIC, separate
   from management/tenant traffic.
3. Hardware check: Inspect switch logs, NIC errors (`ethtool -S <iface>`),
   and NIC firmware (see the sketch below).
4. Topology: Ensure OSD nodes are in the same rack/zone or connected via
   low-latency fabric.
5. If latency is transient, monitor; if persistent, engage the network or
   infrastructure team.
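For step 3, a minimal sketch of inspecting NIC counters from a node debug session; exact counter names vary by driver, so the grep pattern is only illustrative:

```bash
oc debug node/<node>
chroot /host

# Link state, negotiated speed, and duplex
ethtool <iface>

# Driver and firmware versions
ethtool -i <iface>

# Error and drop counters; persistent CRC or missed errors point at cabling,
# switch, or buffer problems
ethtool -S <iface> | grep -Ei 'err|drop|miss|fifo'
```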
**Contributor:** same here, add mitigation steps

**Contributor:** The MTU runbook should mention how to verify jumbo frames work end-to-end

**Contributor (Author):** I am not sure about this, maybe we can work on it once you are back.
@@ -0,0 +1,34 @@

# ODFNodeMTULessThan9000

## Meaning

At least one physical or relevant network interface on an ODF node has an
MTU (Maximum Transmission Unit) of less than 9000 bytes, violating ODF best
practices for storage networks.

## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential for packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.

## Diagnosis
1. List all nodes in the storage cluster:
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   ```
2. For each node, check interface MTUs:
   ```bash
   oc debug node/<node-name>
   ip link show
   # Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
   ```
3. Alternatively, use Prometheus:
   ```promql
   node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
   ```
4. Verify MTU consistency across all nodes and all switches in the storage fabric;
   a sketch of an end-to-end check follows this list.
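To verify that jumbo frames actually work end-to-end (as requested in review), a minimal sketch using ping with fragmentation disabled; an 8972-byte ICMP payload plus 28 bytes of ICMP/IP headers exactly fills a 9000-byte MTU:

```bash
# From one storage node, ping a peer storage node on the storage network
# with the "do not fragment" flag and a payload sized for MTU 9000
ping -M do -s 8972 -c 4 <peer-node-ip>

# Normal replies: jumbo frames pass end-to-end.
# "Message too long" or 100% packet loss: some hop in the path has MTU < 9000.
```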
@@ -0,0 +1,39 @@

# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at more than 90% of its
reported link speed, indicating potential bandwidth saturation.

## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible OSDs being marked down due to heartbeat failures.

## Diagnosis
1. From the alert, note the instance and device labels.
2. Check current utilization:
   ```bash
   oc debug node/<node>
   sar -n DEV 1 5
   ```
3. Use Prometheus to graph throughput in bits per second:
   ```promql
   rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   rate(node_network_transmit_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   ```
4. Determine whether the traffic is Ceph-related (e.g., during a rebalance) or
   external; one way to attribute it is shown below.
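A rough way to attribute the traffic (step 4) is to compare the NIC rate with the client and recovery rates Ceph itself reports; a minimal sketch:

```bash
# Ceph reports client and recovery throughput in the "io:" section of its status
ceph -s | grep -A 4 'io:'

# If recovery/backfill dominates, the saturation is likely rebalance traffic;
# otherwise look for client workloads or non-Ceph traffic on the interface.
```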
## Mitigation
1. Short term: Throttle non-essential traffic on the node.
2. Long term:
   * Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
   * Use multiple bonded interfaces with LACP.
   * Separate storage and client traffic using VLANs or dedicated NICs.
3. Tune Ceph `osd_max_backfills` and `osd_recovery_max_active` to reduce
   recovery bandwidth (see the sketch below).
4. Enable NIC offload features (TSO, GRO) if disabled.
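For steps 3 and 4, a minimal sketch; the values are illustrative and the right settings depend on the Ceph version and workload:

```bash
# From the rook-ceph toolbox pod (or any host with admin access to the Ceph CLI):
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active

# Illustrative reduction of recovery/backfill pressure during saturation
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# On the node (step 4): check and, if needed, enable NIC offloads
ethtool -k <iface> | grep -E 'tcp-segmentation-offload|generic-receive-offload'
ethtool -K <iface> tso on gro on
```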
**Review comment:** existing runbooks reference shared helper documents, but the new runbooks embed all commands inline instead of referencing these. Consider using helper links for consistency and maintainability.