
[2025-07-28] Virtual Machines Outage #6

@insom

Timeline

  • 23:16 EST -- pollux.coloc.systems reported down.
  • 23:18 EST -- @pawandubey notices and correlates the outage with a new pool being added to libvirt.
  • 23:50 EST -- Status page updated. @pawandubey attempts to access iLO but does not have credentials.
  • 23:54 EST -- @insom is called but does not answer, as it is ~5 AM Irish Standard Time and his phone is on DND.
  • 01:34 EST -- @pawandubey arrives at the datacenter, notices a red light on pollux, attempts a reboot. VGA is non-responsive. (insom: This might be because the kernel was hanging so soon after the boot process?)
  • 02:32 EST -- The data for the affected VMs is present on castor.coloc.systems, but without a full understanding of pollux's state no further changes are attempted.
  • 02:37 EST -- @pawandubey notifies the customers with VMs on this pair of machines.
  • 03:05 EST -- @insom logs on, agrees that using the second copy of the DRBD data is reasonable.
  • 03:20 EST -- @insom accesses iLO, the console is non-responsive and the logs mention a non-maskable interrupt being raised due to a watchdog timeout (this would prove to be a symptom, not a cause).
    (Screenshot: iLO event log showing the NMI raised by the watchdog timeout.)
  • 03:28 EST -- @insom realizes that we have the VM data but not the configuration of the VMs (which is not replicated from pollux to castor).
  • 03:32 EST -- pollux is booted with kernel.nmi_watchdog=0 as a workaround (see the first sketch after this timeline). It pings briefly before hanging again. During the period that it pings, it is also responsive via the iLO console.
  • 03:48 EST -- Another reboot is tried, with verbose logging; libvirt-related messages are the last ones logged.
    (Screenshot: verbose boot output ending with libvirt-related messages.)
  • 04:14 EST -- pollux is brought up in single-user mode and @insom copies the configs off so we can fail over to castor if we want. Comparing the two configs shows that a new libvirt storage pool of /dev is the only difference, so /etc/libvirtd/autostart/pool.xml is removed (see the second sketch after this timeline). DRBD configs are checked and devices brought up, followed by two internal-use VMs. They boot fine, so they are shut down, DRBD is stopped, and the boot process is allowed to proceed to multi-user mode.
  • 04:15 EST -- Each DRBD device needed a drbdadm primary r0 before the virsh start vmname, as the devices were in an unknown state on boot / after unclean shutdowns (see the recovery sketch after this timeline).
  • 04:16 EST -- Client VMs restored.
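
For reference, a minimal sketch of the 03:32 workaround. On the kernel command line the parameter is spelled nmi_watchdog=0; kernel.nmi_watchdog is the equivalent sysctl once the machine is up. The sysctl.d file name is an assumption.

```sh
# One-off: at the GRUB menu, edit the linux line and append:
#   nmi_watchdog=0
# then boot with Ctrl-x.

# Persistent, once the machine is up (file name is an assumption):
echo 'kernel.nmi_watchdog = 0' > /etc/sysctl.d/90-nmi-watchdog.conf
sysctl -p /etc/sysctl.d/90-nmi-watchdog.conf

# Verify: should print 0
cat /proc/sys/kernel/nmi_watchdog
```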
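
A sketch of the config comparison and pool removal from the 04:14 entry. The copy location /root/pollux-libvirt is hypothetical, and the pool name `pool` is an assumption inferred from the pool.xml filename; the rm path is the one given in the timeline.

```sh
# Compare the configs copied off pollux against castor's
# (/root/pollux-libvirt is a hypothetical copy location):
diff -ru /root/pollux-libvirt /etc/libvirt

# Remove the offending pool's autostart definition so it is not
# re-activated on the next boot (path per the timeline entry):
rm /etc/libvirtd/autostart/pool.xml

# Equivalent via virsh once libvirtd is reachable ('pool' is an
# assumed pool name):
virsh pool-autostart pool --disable
virsh pool-destroy pool     # deactivate the running pool
virsh pool-undefine pool    # drop its definition entirely
```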
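
And a sketch of the per-VM recovery from the 04:15 entry. r0 and vmname are the placeholders used in the timeline; after an unclean shutdown DRBD resources typically come up Secondary with an uncertain disk state, hence the explicit promotion.

```sh
# Bring the resource up and inspect its state:
drbdadm up r0
drbdadm cstate r0   # connection state (e.g. Connected / StandAlone)
drbdadm dstate r0   # disk state (e.g. UpToDate/DUnknown)

# Promote this node to primary; if the peer's state is unknown this
# may need --force, which risks split-brain and should be deliberate:
drbdadm primary r0

# Start the VM once its backing device is primary:
virsh start vmname
```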

Action Items

  • Ensure we all have access to iLO and root console passwords.
  • Make sure /etc/libvirt and other dependencies are backed up either to the other server or to a separate backup location (see the sketch after this list).
  • Run a game-day for bringing up a VM on the other hypervisor from an unclean state.
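
As a starting point for the backup action item, a minimal sketch of a nightly copy of the libvirt configuration to the peer hypervisor. The hostnames are from this report; the schedule, destination paths, and the assumption of passwordless SSH between the hosts are all mine.

```sh
# /etc/cron.d/libvirt-config-backup (on pollux; mirror it on castor).
# Nightly rsync of libvirt config, including pool autostart XML;
# assumes passwordless SSH from pollux to castor:
0 4 * * * root rsync -a --delete /etc/libvirt/ castor.coloc.systems:/var/backups/pollux/etc-libvirt/

# Complementary step (run from a script cron invokes): snapshot live
# VM definitions, which also capture changes made outside /etc/libvirt:
mkdir -p /var/backups/libvirt-xml
for vm in $(virsh list --all --name); do
    virsh dumpxml "$vm" > "/var/backups/libvirt-xml/$vm.xml"
done
```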
