Description
I discovered this while testing #9226:
Losing logs after sled agent restart
If sled agent restarts, we also go through the above code path that wipes the datasets. But even before that, early during startup, sled agent halts and uninstalls all of the zones that look like they were created by the control plane. This wipes their root filesystems. This happens long before almost anything else in sled agent is set up, including the DumpSetup subsystem, config reconciler, disk adoption, etc., so there's no opportunity to trigger archival at this point.
The flow is:
- sled agent: main(), then do_run() starts the bootstrap agent
- bootstrap agent server start() calls BootstrapAgentStartup::run
- that calls cleanup_all_old_global_state()
- that lists zones and removes them all
- that uninstalls them
zoneadm(8) says about uninstall:
uninstall [-F] Uninstall the specified zone from the system. Use this subcommand with caution. It removes all of the files under the zonepath of the zone in question. You can use the -F flag to force the action.
So this will wind up removing all of the zones' log files.
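To make the effect concrete, here is a minimal in-memory sketch of the flow above. It is not the actual sled agent code: `FakeSled`, `simulate_cleanup`, and the `oxz_` prefix used to recognize control-plane zones are all assumptions made for illustration, and removing a map entry merely stands in for `zoneadm uninstall -F` deleting everything under the zonepath.

```rust
use std::collections::BTreeMap;

/// Assumed prefix for zones created by the control plane (hypothetical here).
const ZONE_PREFIX: &str = "oxz_";

/// In-memory stand-in for a sled: zone name -> files under that zone's zonepath.
#[derive(Debug, Default)]
pub struct FakeSled {
    pub zones: BTreeMap<String, Vec<String>>,
}

impl FakeSled {
    /// Models halt + `zoneadm uninstall -F`: the zone and every file under
    /// its zonepath disappear. Returns the files that were destroyed.
    fn halt_and_uninstall(&mut self, name: &str) -> Vec<String> {
        self.zones.remove(name).unwrap_or_default()
    }
}

/// Models the early-startup cleanup: every zone that looks like a
/// control-plane zone is halted and uninstalled, with no archival step.
/// Returns the paths of all files lost in the process.
pub fn simulate_cleanup(sled: &mut FakeSled) -> Vec<String> {
    // Collect names first so we do not mutate the map while iterating it.
    let doomed: Vec<String> = sled
        .zones
        .keys()
        .filter(|name| name.starts_with(ZONE_PREFIX))
        .cloned()
        .collect();

    let mut lost = Vec::new();
    for name in doomed {
        lost.extend(sled.halt_and_uninstall(&name));
    }
    lost
}
```

Populating a `FakeSled` with one `oxz_`-prefixed zone that holds a log file and then calling `simulate_cleanup` returns that log path as lost, while zones without the prefix are left alone. Nothing in this path gets a chance to copy the files elsewhere first.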
I'm not sure how best to fix this. We could halt the zones without uninstalling them, but we'd presumably still want to uninstall them later. The later code assumes (because of this cleanup) that no zones exist already, so I'm not sure what leaving them in place would confuse. (We don't want the later code to always uninstall the zones it finds -- they may have been ones started in this lifetime of sled agent.)
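One possible shape for "halt now, uninstall later" is to remember which zones were left over from the previous sled agent lifetime, so the later uninstall touches only those. The sketch below is purely illustrative and does not settle how the rest of sled agent would cope with pre-existing zones; `ZoneTable`, `ZoneState`, and all method names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-zone bookkeeping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneState {
    /// Running, either left over from before or started in this lifetime.
    Running,
    /// Halted at startup; root filesystem kept until logs are archived.
    HaltedPendingArchive,
}

#[derive(Debug, Default)]
pub struct ZoneTable {
    zones: BTreeMap<String, ZoneState>,
}

impl ZoneTable {
    /// Record a running zone.
    pub fn insert_running(&mut self, name: &str) {
        self.zones.insert(name.to_string(), ZoneState::Running);
    }

    /// Look up the state of a zone, if it is known.
    pub fn state(&self, name: &str) -> Option<ZoneState> {
        self.zones.get(name).copied()
    }

    /// Early startup: halt every zone found, but do not uninstall it.
    /// Returns the names of the zones that were halted.
    pub fn halt_leftovers(&mut self) -> Vec<String> {
        let mut halted = Vec::new();
        for (name, state) in self.zones.iter_mut() {
            if *state == ZoneState::Running {
                *state = ZoneState::HaltedPendingArchive;
                halted.push(name.clone());
            }
        }
        halted
    }

    /// Later, once archival has run: uninstall only the zones marked at
    /// startup. Zones started in this lifetime are left untouched.
    pub fn uninstall_archived(&mut self) -> Vec<String> {
        let mut removed = Vec::new();
        self.zones.retain(|name, state| {
            if *state == ZoneState::HaltedPendingArchive {
                removed.push(name.clone());
                false
            } else {
                true
            }
        });
        removed
    }
}
```

In this model, a zone inserted after `halt_leftovers()` has run stays `Running` through `uninstall_archived()`, which is the distinction the parenthetical above is asking for.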
It's tempting to do the archival before all this, but this happens so early in sled agent startup that we're not even close to setting up the debug collector. In general, we also haven't adopted the disks yet and wouldn't know that we have a debug dataset in which to put the log files.