non-global zone log files lost after sled agent restart

I discovered this while testing #9226:

> Losing logs after sled agent restart
> If sled agent restarts, we also go through the above code path that wipes the datasets. But even before that, early during startup, sled agent halts and uninstalls all of the zones that look like they were created by the control plane. This wipes their root filesystems. This happens long before almost anything else in sled agent is set up, including the DumpSetup subsystem, config reconciler, disk adoption, etc., so there's no opportunity to trigger archival at this point.

The flow is:

- sled agent: `main()`, then [`do_run()` starts the bootstrap agent](https://github.com/oxidecomputer/omicron/blob/51f5fb22190840d7efe6e1a7f1c24171304eca1a/sled-agent/src/bin/sled-agent.rs#L78)
- bootstrap agent server `start()` calls [`BootstrapAgentStartup::run`](https://github.com/oxidecomputer/omicron/blob/51f5fb22190840d7efe6e1a7f1c24171304eca1a/sled-agent/src/bootstrap/server.rs#L185)
- that calls [`cleanup_all_old_global_state()`](https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/bootstrap/pre_server.rs#L109)
- that [lists zones and removes them all](https://github.com/oxidecomputer/omicron/blob/51f5fb22190840d7efe6e1a7f1c24171304eca1a/sled-agent/src/bootstrap/pre_server.rs#L193-L213)
- removing a zone [uninstalls it](https://github.com/oxidecomputer/omicron/blob/main/illumos-utils/src/zone.rs#L431-L439)

`zoneadm(8)` says about `uninstall`:

>        uninstall [-F]
>
>           Uninstall the specified zone from the system. Use this subcommand
>           with caution.  It removes all of the files under the zonepath of
>           the zone in question.  You can use the -F flag to force the action.

So this winds up removing all of the zones' log files, since those live in the zones' root filesystems under their zonepaths.
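To make the effect concrete, here is a minimal sketch that emulates it on an ordinary temporary directory. It does not call `zoneadm` or any omicron code: `make_fake_zone` and `simulate_uninstall` are hypothetical helpers, and the log file name is made up. The only thing it assumes about real zones is that service logs sit under `<zonepath>/root/var/svc/log`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Build a fake zonepath containing one log file.
/// The layout (`<zonepath>/root/var/svc/log/...`) mirrors where a zone's
/// service logs normally live; the file name is invented for illustration.
fn make_fake_zone(zonepath: &Path) -> io::Result<PathBuf> {
    let log_dir = zonepath.join("root").join("var").join("svc").join("log");
    fs::create_dir_all(&log_dir)?;
    let log_file = log_dir.join("example-service.log");
    fs::write(&log_file, b"some log line\n")?;
    Ok(log_file)
}

/// Stand-in for the documented effect of `zoneadm uninstall`:
/// remove everything under the zonepath, leaving the directory itself.
fn simulate_uninstall(zonepath: &Path) -> io::Result<()> {
    for entry in fs::read_dir(zonepath)? {
        let entry = entry?;
        let path = entry.path();
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}
```

Calling `make_fake_zone` on a scratch directory and then `simulate_uninstall` on the same path leaves the log file gone, which is the same outcome the zones' logs see on every sled agent restart.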

---

I'm not sure how best to fix this.  We could halt the zones _without_ uninstalling them, but we'd presumably want to uninstall them later.  The later code assumes (because of this cleanup) that no zones already exist, so I'm not sure what leaving them in place would confuse.  (We wouldn't want the later code to _always_ uninstall the zones it finds, since they may have been started in this lifetime of sled agent.)
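One way to picture the "halt now, uninstall later" option is a small model of the bookkeeping it would need. This is only a sketch under assumptions, not a proposal for the actual implementation: `ZoneLifetimeTracker`, `ZoneDisposition`, and the `oxz_` prefix constant are all hypothetical names used here to stand in for "zones that look like they were created by the control plane".

```rust
use std::collections::BTreeSet;

/// Assumed naming convention for control plane zones (illustrative only).
const CONTROL_PLANE_ZONE_PREFIX: &str = "oxz_";

#[derive(Debug, PartialEq, Eq)]
enum ZoneDisposition {
    /// Not a control plane zone: leave it alone.
    Ignore,
    /// Started during this sled agent lifetime: leave it alone.
    Keep,
    /// Left over from a previous lifetime (halted but not uninstalled):
    /// archive its logs first, then uninstall it.
    ArchiveThenUninstall,
}

/// Tracks which zones this sled agent process has started itself.
#[derive(Default)]
struct ZoneLifetimeTracker {
    started_this_lifetime: BTreeSet<String>,
}

impl ZoneLifetimeTracker {
    /// Record that this sled agent lifetime started the named zone.
    fn record_started(&mut self, zone_name: &str) {
        self.started_this_lifetime.insert(zone_name.to_string());
    }

    /// Decide what later code should do with a zone it finds on the system.
    fn disposition(&self, zone_name: &str) -> ZoneDisposition {
        if !zone_name.starts_with(CONTROL_PLANE_ZONE_PREFIX) {
            ZoneDisposition::Ignore
        } else if self.started_this_lifetime.contains(zone_name) {
            ZoneDisposition::Keep
        } else {
            ZoneDisposition::ArchiveThenUninstall
        }
    }
}
```

A name-based set like this only works if leftover zones are dealt with before a zone of the same name is started again, which is part of why it's unclear how much of the later code would need to change.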

It's tempting to do the archival before all this, but this is so early in sled agent startup that we're not even close to setting up the debug collector.  In general, we also haven't adopted the disks yet, so we wouldn't know whether we have a debug dataset in which to put the log files.
