
[2025-07-28] Virtual Machines Outage #6

@insom

Timeline

  • 23:16 EST -- pollux.coloc.systems reported down.
  • 23:18 EST -- @pawandubey notices and correlates the outage with a new pool being added to libvirt.
  • 23:50 EST -- Status page updated. @pawandubey attempts to access iLO but does not have credentials.
  • 23:54 EST -- @insom is called but does not answer, as it is ~5 AM Irish Standard Time and his phone is on DND.
  • 01:34 EST -- @pawandubey arrives at the datacenter, notices a red light on pollux, attempts a reboot. VGA is non-responsive. (insom: This might be because the kernel was hanging so soon after the boot process?)
  • 02:32 EST -- The data for the affected VMs is present on castor.coloc.systems, but without a full understanding of pollux's state no further changes are attempted.
  • 02:37 EST -- @pawandubey notifies the customers with VMs on this pair of machines.
  • 03:05 EST -- @insom logs on, agrees that using the second copy of the DRBD data is reasonable.
  • 03:20 EST -- @insom accesses iLO, the console is non-responsive and the logs mention a non-maskable interrupt being raised due to a watchdog timeout (this would prove to be a symptom, not a cause).
    (Screenshot: iLO event log showing the NMI raised by the watchdog timeout.)
  • 03:28 EST -- @insom realizes that we have the VM data but not the configuration of the VMs (which is not replicated from pollux to castor).
  • 03:32 EST -- pollux is booted with kernel.nmi_watchdog=0 as a workaround (see the first sketch after this timeline). It pings briefly before hanging again. During the period that it pings, it is also responsive via the iLO console.
  • 03:48 EST -- Another reboot is tried, with verbose logging; libvirt-related messages are the last ones logged.
    (Screenshot: verbose boot output ending with libvirt-related messages.)
  • 04:14 EST -- pollux is brought up in single-user mode and @insom copies the configs off so we can fail over to castor if we want. Comparing the two configs shows that a new libvirt storage pool of /dev is the only difference, so /etc/libvirtd/autostart/pool.xml is removed (see the second sketch after this timeline). DRBD configs are checked and devices brought up, followed by two internal-use VMs. They boot fine, so they are shut down, DRBD is stopped, and the boot process is allowed to proceed to multi-user mode.
  • 04:15 EST -- Each DRBD device needed a drbdadm primary r0 before the virsh start vmname, as the devices were in an unknown state on boot / after unclean shutdowns (see the recovery sketch after this timeline).
  • 04:16 EST -- Client VMs restored.
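
For reference, a minimal sketch of the 03:32 workaround. On the kernel command line the parameter is spelled nmi_watchdog=0; kernel.nmi_watchdog is the equivalent sysctl once the machine is up. The sysctl.d file name is an assumption.

```sh
# One-off: at the GRUB menu, edit the linux line and append:
#   nmi_watchdog=0
# then boot with Ctrl-x.

# Persistent, once the machine is up (file name is an assumption):
echo 'kernel.nmi_watchdog = 0' > /etc/sysctl.d/90-nmi-watchdog.conf
sysctl -p /etc/sysctl.d/90-nmi-watchdog.conf

# Verify: should print 0
cat /proc/sys/kernel/nmi_watchdog
```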
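
A sketch of the config comparison and pool removal from the 04:14 entry. The copy location /root/pollux-libvirt is hypothetical, and the pool name `pool` is an assumption inferred from the pool.xml filename; the rm path is the one given in the timeline.

```sh
# Compare the configs copied off pollux against castor's
# (/root/pollux-libvirt is a hypothetical copy location):
diff -ru /root/pollux-libvirt /etc/libvirt

# Remove the offending pool's autostart definition so it is not
# re-activated on the next boot (path per the timeline entry):
rm /etc/libvirtd/autostart/pool.xml

# Equivalent via virsh once libvirtd is reachable ('pool' is an
# assumed pool name):
virsh pool-autostart pool --disable
virsh pool-destroy pool     # deactivate the running pool
virsh pool-undefine pool    # drop its definition entirely
```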
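
And a sketch of the per-VM recovery from the 04:15 entry. r0 and vmname are the placeholders used in the timeline; after an unclean shutdown DRBD resources typically come up Secondary with an uncertain disk state, hence the explicit promotion.

```sh
# Bring the resource up and inspect its state:
drbdadm up r0
drbdadm cstate r0   # connection state (e.g. Connected / StandAlone)
drbdadm dstate r0   # disk state (e.g. UpToDate/DUnknown)

# Promote this node to primary; if the peer's state is unknown this
# may need --force, which risks split-brain and should be deliberate:
drbdadm primary r0

# Start the VM once its backing device is primary:
virsh start vmname
```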

Action Items

  • Ensure we all have access to iLO and root console passwords.
  • Make sure /etc/libvirt and other dependencies are backed up either to the other server or to a separate backup location (see the sketch after this list).
  • Run a game-day for bringing up a VM on the other hypervisor from an unclean state.
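
As a starting point for the backup action item, a minimal sketch of a nightly copy of the libvirt configuration to the peer hypervisor. The hostnames are from this report; the schedule, destination paths, and the assumption of passwordless SSH between the hosts are all mine.

```sh
# /etc/cron.d/libvirt-config-backup (on pollux; mirror it on castor).
# Nightly rsync of libvirt config, including pool autostart XML;
# assumes passwordless SSH from pollux to castor:
0 4 * * * root rsync -a --delete /etc/libvirt/ castor.coloc.systems:/var/backups/pollux/etc-libvirt/

# Complementary step (run from a script cron invokes): snapshot live
# VM definitions, which also capture changes made outside /etc/libvirt:
mkdir -p /var/backups/libvirt-xml
for vm in $(virsh list --all --name); do
    virsh dumpxml "$vm" > "/var/backups/libvirt-xml/$vm.xml"
done
```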
