26 changes: 26 additions & 0 deletions alerts/openshift-container-storage-operator/ODFCorePodRestarted.md
Contributor:

Existing runbooks reference shared helper documents like:

  • helpers/podDebug.md
  • helpers/troubleshootCeph.md
  • helpers/gatherLogs.md
  • helpers/networkConnectivity.md

The new runbooks embed all commands inline instead of referencing these. Consider using helper links for consistency and maintainability.

@@ -0,0 +1,26 @@
# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.
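
A PromQL sketch that approximates this condition (the exact alert expression may differ):

```promql
# Containers in openshift-storage that restarted in the last 24 hours
increase(kube_pod_container_status_restarts_total{namespace="openshift-storage"}[24h]) > 0
```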

## Impact

* Brief service interruption (e.g., MON restart may cause quorum re-election).
* OSD restart triggers PG peering and potential recovery.
* Operator restart delays configuration changes or health checks.
* May indicate underlying instability (resource pressure, bugs, or node issues).

## Diagnosis

1. Identify the affected pod and namespace from the alert labels.
2. Follow the [pod debug](helpers/podDebug.md) helper, or use the quick triage sketch after this list.
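
A minimal triage sketch (pod and namespace come from the alert labels):

```bash
# Restart count and the reason for the last termination (OOMKilled, Error, etc.)
oc get pod <pod> -n <namespace> -o wide
oc describe pod <pod> -n <namespace> | grep -A5 "Last State"

# Logs from the previous (crashed) container instance
oc logs <pod> -n <namespace> --previous --tail=100
```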

## Mitigation

1. If OOMKilled: Increase memory limits for the container (see the sketch after this list).
2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
3. If node-related: Cordon and drain the node; replace if faulty.
4. Ensure HA: MONs should be ≥3; OSDs should be distributed.
5. Update: If due to a known bug, upgrade ODF to a fixed version.
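
A sketch for step 1, assuming resources are managed through the StorageCluster CR; the CR name (`ocs-storagecluster`), the component key (`mon`), and the memory value are examples to adapt:

```bash
# Confirm the restart was an OOM kill
oc get pod <pod> -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# Example: raise the memory limit for MON pods via the StorageCluster CR
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"resources":{"mon":{"limits":{"memory":"4Gi"},"requests":{"memory":"4Gi"}}}}}'
```
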
Contributor:

No mitigation section? Please add mitigation steps.

Contributor (author):

I am not sure what mitigation steps should be added here, so I left it empty for now.
@weirdwiz, if you have any suggestions, we can discuss offline.

@@ -0,0 +1,32 @@
# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (equivalent to iostat's %util, derived
from node_disk_io_time_seconds_total), indicating heavy I/O load.
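
A PromQL sketch approximating the alert condition (the actual rule and label filters may differ):

```promql
# Fraction of time the device was busy over 5 minutes; > 0.9 corresponds to > 90% util
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m]) > 0.9
```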

## Impact

* Increased I/O latency for RBD/CephFS clients.
* Slower OSD response times, risking heartbeat timeouts.
* Reduced cluster throughput during peak workloads.
* Potential for “slow request” warnings in Ceph logs.

## Diagnosis

1. Identify node and device from alert labels.
2. Check disk model and type:
```bash
oc debug node/<node>
lsblk -d -o NAME,ROTA,MODEL
# Confirm it’s an expected OSD device (HDD/SSD/NVMe)
```
3. Monitor real-time I/O:
```bash
iostat -x 2 5
```
4. Correlate with Ceph (run these from the Rook toolbox pod; see the sketch after this list):
```bash
ceph osd df tree # check weight and reweight
ceph osd perf # check commit/apply latency
```
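
The `ceph` commands above are run from the Rook toolbox; a sketch assuming the toolbox is enabled and its deployment is named `rook-ceph-tools`:

```bash
# Open a shell in the toolbox pod
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Then, inside the toolbox shell:
ceph osd df tree
ceph osd perf
```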
@@ -0,0 +1,47 @@
# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP RTT latency to non-OSD ODF nodes (e.g., MON, MGR, MDS, or client nodes)
exceeds 100 milliseconds over the last 24 hours. These nodes participate in
Ceph control plane or client access but do not store data.

## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.
* Potential timeouts in ODF operator reconciliation.


## Diagnosis

1. From the alert, note the instance (node IP).
2. Confirm the node does not run OSDs:
```bash
oc get pods -n openshift-storage -o wide | grep <node-name>
```
3. Test connectivity:
```bash
ping <node-ip>
mtr <node-ip>
```
4. Check system load and network interface stats on the node:
```bash
oc debug node/<node-name>
sar -n DEV 1 5
ip -s link show <iface>
```
5. Review Ceph monitor logs if the node hosts MONs:
```bash
oc logs -l app=rook-ceph-mon -n openshift-storage
```


## Mitigation

1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
3. If the node is a client (e.g., running applications), verify it is not on an
overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed
(see the illustrative sketch after this list).
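
An illustrative sketch for step 4; the sysctl values are placeholders to validate for your environment, not recommendations:

```bash
# Inside `oc debug node/<node-name>`, then `chroot /host`
# Inspect current socket buffer ceilings and backlog
sysctl net.core.rmem_max net.core.wmem_max net.core.netdev_max_backlog

# Example: temporarily raise buffer ceilings while investigating drops
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```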
@@ -0,0 +1,56 @@
# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and
OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
triggers only on nodes that host Ceph OSD pods, indicating potential
network congestion or issues on the storage network.

## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.


## Diagnosis


1. Identify affected node(s):
```bash
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
# or check node labels used in OSD scheduling
```
2. Check the alert’s instance label to get the node IP.
3. From a monitoring or debug pod, test connectivity:
```bash
ping <node-internal-ip>
```
4. Use mtr or traceroute to analyze the path and hops (see the example after this list).
5. Verify if the node is under high CPU or network load:
```bash
oc debug node/<node>
top -b -n 1 | head -20
sar -u 1 5
```
6. Check Ceph health and OSD status:
```bash
ceph osd status
ceph -s
```
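
For step 4, a path-analysis example (run from a host or debug pod that can reach the storage network):

```bash
# Report mode: 10 probes per hop, wide hostnames
mtr -rwc 10 <node-internal-ip>

# Fallback if mtr is not available
traceroute <node-internal-ip>
```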

## Mitigation

1. Network tuning: Ensure jumbo frames (MTU ≥ 9000) are enabled end-to-end
on the storage network.
2. Isolate traffic: Confirm storage traffic uses a dedicated VLAN or NIC, separate
from management/tenant traffic.
3. Hardware check: Inspect switch logs, NIC error counters (`ethtool -S <iface>`;
see the sketch after this list), and NIC firmware.
4. Topology: Ensure OSD nodes are in the same rack/zone or connected via
low-latency fabric.
5. If latency is transient, monitor; if persistent, engage network or
infrastructure team.
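
For the hardware check in step 3, a sketch of counters worth watching (interface name is a placeholder):

```bash
# Inside `oc debug node/<node>`, then `chroot /host`
ethtool -S <iface> | grep -Ei 'err|drop|fifo|crc'
ethtool -i <iface>   # driver and firmware version
```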

Contributor:

same here, add mitigation steps

Contributor:

The MTU runbook should mention how to verify jumbo frames work end-to-end

Contributor (author):

I am not sure about this, maybe we can work on it once you are back.

@@ -0,0 +1,34 @@
# ODFNodeMTULessThan9000

## Meaning

At least one physical or relevant network interface on an ODF node has an
MTU (Maximum Transmission Unit) less than 9000 bytes, violating ODF best
practices for storage networks.

## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential for packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.


## Diagnosis

1. List all nodes in the storage cluster:
```bash
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
```
2. For each node, check interface MTUs:
```bash
oc debug node/<node-name>
ip link show
# Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
```
3. Alternatively, use Prometheus:
```promql
node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
```
4. Verify MTU consistency across all nodes and all switches in the storage fabric
(see the end-to-end check after this list).
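
To verify that jumbo frames work end-to-end (not just that the local MTU is 9000), send a do-not-fragment ping sized for MTU 9000 (9000 bytes minus 28 bytes of IP/ICMP headers):

```bash
# From one storage node to another; fails if any hop in the path has a smaller MTU
ping -M do -s 8972 -c 4 <peer-node-ip>
```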

@@ -0,0 +1,39 @@
# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at >90% of its reported
link speed, indicating potential bandwidth saturation.
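
A PromQL sketch of the saturation check (the actual alert expression may differ; `node_network_speed_bytes` reports the negotiated link speed in bytes per second):

```promql
# Transmit throughput as a fraction of link speed, per interface
rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
  / node_network_speed_bytes{device!~"lo|veth.*"} > 0.9
```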

## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible Ceph OSDs being marked down due to heartbeat failures.

## Diagnosis

1. From alert, note instance and device.
2. Check current utilization:
```bash
oc debug node/<node>
sar -n DEV 1 5
```
3. Use Prometheus to graph:
```promql
rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
rate(node_network_transmit_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
```
4. Determine if traffic is Ceph-related (e.g., during rebalance) or external.

## Mitigation

1. Short term: Throttle non-essential traffic on the node.
2. Long term:
* Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
* Use multiple bonded interfaces with LACP.
* Separate storage and client traffic using VLANs or dedicated NICs.
3. Tune Ceph osd_max_backfills and osd_recovery_max_active to reduce
recovery bandwidth (see the sketch after this list).
4. Enable NIC offload features (TSO, GRO) if disabled.
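
A sketch for step 3, run from the Rook toolbox pod (e.g., `oc rsh -n openshift-storage deploy/rook-ceph-tools`); the values are conservative examples, not recommendations:

```bash
# Throttle backfill and recovery to reduce network pressure
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Verify the running values
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
```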