Timeline
- 23:16 EST -- `pollux.coloc.systems` reported down.
- 23:18 EST -- @pawandubey notices and correlates the outage with a new pool being added to `libvirt`.
- 23:50 EST -- Status page updated. @pawandubey attempts to access iLO but does not have credentials.
- 23:54 EST -- @insom is called, but is on DND as it's ~5AM in Irish Standard Time and does not answer.
- 01:34 EST -- @pawandubey arrives at the datacenter, notices a red light on `pollux`, attempts a reboot. VGA is non-responsive. (insom: This might be because the kernel was hanging so soon after the boot process?)
- 02:32 EST -- The data on the affected VMs is present on `castor.coloc.systems` but without fully understanding the state of `pollux` no further changes are attempted.
- 02:37 EST -- @pawandubey notifies the customers with VMs on this pair of machines.
- 03:05 EST -- @insom logs on, agrees that using the second copy of the DRBD data is reasonable.
- 03:20 EST -- @insom accesses iLO, the console is non-responsive and the logs mention a non-maskable interrupt being raised due to a watchdog timeout (this would prove to be a symptom, not a cause).

- 03:28 EST -- @insom realizes that we have the VM data but not the configuration of the VMs (which is not replicated from `pollux` to `castor`).
- 03:32 EST -- `pollux` is booted with `kernel.nmi_watchdog=0` as a workaround. It pings briefly before hanging again. During the period that it pings, it is also responsive via the iLO console.
- 03:48 EST -- Another reboot is tried, with verbose logging, and `libvirt`-related things are the last ones logged.
- 04:14 EST -- `pollux` is brought up in single-user mode and @insom copies the configs off so we can fail over to `castor` if we want. Comparing the two configs shows that a new `libvirt` storage pool of `/dev` is the only difference, so `/etc/libvirtd/autostart/pool.xml` is removed. DRBD configs are checked and devices brought up, followed by two internal-use VMs. They boot fine, so they are shut down and DRBD is stopped, and the boot process is allowed to proceed to multi-user mode.
- 04:15 EST -- Each DRBD device needed a `drbdadm primary r0` before the `virsh start vmname`, as the devices were in an unknown state on boot / after unclean shutdowns.
- 04:16 EST -- Client VMs restored.
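
The last timeline entries describe an ordering that is easy to get wrong under pressure: promote the DRBD resource first, then start the guest that lives on it. A minimal sketch of that sequence follows, assuming one DRBD resource per VM. The `resource:vm` pairs, the `bring_up` helper, and the `DRY_RUN` switch are illustrative and are not taken from the incident hosts.

```shell
#!/bin/sh
# Sketch: promote each DRBD resource, then start the VM that uses it.
# Arguments are "resource:vm" pairs; the names are placeholders.
# DRY_RUN=1 (the default) only prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

bring_up() {
    for pair in "$@"; do
        res=${pair%%:*}   # text before the first colon
        vm=${pair#*:}     # text after the first colon
        # Stop at the first failure so a VM is never started on a
        # resource that could not be promoted.
        run drbdadm primary "$res" || return 1
        run virsh start "$vm" || return 1
    done
}
```

Calling `bring_up r0:vmname` prints the two commands in order; setting `DRY_RUN=0` runs them for real. Printing first is a deliberate choice for a game-day rehearsal, where reviewing the plan matters more than speed.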
Action Items
- Ensure we all have access to iLO and root console passwords.
- Make sure `/etc/libvirt` and other dependencies are backed up to either the other server or a separate backup location.
- Run a game-day for bringing up a VM on the other hypervisor in an unclean state.
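
One possible shape for the config backup action item is a timestamped archive of the config directory. The sketch below assumes `tar` and `date` are available; the `backup_config` helper and the destination path in the usage note are hypothetical, and copying the archive to the other server is left out.

```shell
#!/bin/sh
# Sketch: archive a config directory into a backup directory.
# Usage: backup_config SRC_DIR DEST_DIR
# Prints the path of the archive on success.

backup_config() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    archive="$dest/$(basename "$src")-$(date +%Y%m%dT%H%M%S).tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

For example, `backup_config /etc/libvirt /var/backups/hypervisor` (the destination is a placeholder) would write an archive that can then be copied to the peer machine or a separate backup location.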