non-global zone log files lost after sled agent restart #9644

@davepacheco

I discovered this while testing #9226:

Losing logs after sled agent restart
If sled agent restarts, we also go through the above code path that wipes the datasets. But even before that, early during startup, sled agent halts and uninstalls all zones that look like they were created by the control plane. This wipes their root filesystems. This happens long before almost anything else in sled agent is set up, including the DumpSetup subsystem, config reconciler, disk adoption, etc., so there's no opportunity to trigger archival at this point.

The flow is:

zoneadm(8) says about uninstall:

   uninstall [-F]

      Uninstall the specified zone from the system. Use this subcommand
      with caution.  It removes all of the files under the zonepath of
      the zone in question.  You can use the -F flag to force the action.

So this will wind up removing all of the zones' log files.


I'm not sure how best to fix this. We could halt the zones without uninstalling them, but we'd presumably still want to do the uninstall later. The later code assumes (because of this code) that no zones exist already, so I'm not sure what leftover zones would confuse. (We don't want the later code to always uninstall the zones it finds -- they may have been started in this lifetime of sled agent.)
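To make the "halt now, uninstall later" idea concrete, here is a rough sketch of the bookkeeping it might need. Everything here is hypothetical: `LeftoverZones`, `ZoneAction`, and the method names are made up for illustration and are not existing sled agent types. The idea is that early startup records which zones it halted, so later code can tell zones left over from a previous lifetime (archive logs, then uninstall) apart from zones started in this lifetime (leave alone).

```rust
use std::collections::BTreeSet;

/// What later code should do with a zone it finds on the system.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum ZoneAction {
    /// The zone was started in this sled agent lifetime: leave it alone.
    Keep,
    /// The zone predates this lifetime: archive its logs, then uninstall it.
    ArchiveThenUninstall,
}

/// Hypothetical tracker for zones that early startup halted but did not
/// uninstall.
#[derive(Debug, Default)]
struct LeftoverZones {
    halted_at_startup: BTreeSet<String>,
}

impl LeftoverZones {
    /// Called during early startup in place of "halt and uninstall".
    fn record_halted(&mut self, zone_name: &str) {
        self.halted_at_startup.insert(zone_name.to_string());
    }

    /// Called later, once a debug dataset is known to exist.
    fn action_for(&self, zone_name: &str) -> ZoneAction {
        if self.halted_at_startup.contains(zone_name) {
            ZoneAction::ArchiveThenUninstall
        } else {
            ZoneAction::Keep
        }
    }

    /// Called after a leftover zone has been archived and uninstalled.
    /// Returns true if the zone was being tracked.
    fn mark_cleaned_up(&mut self, zone_name: &str) -> bool {
        self.halted_at_startup.remove(zone_name)
    }
}
```

This only covers the "which zones are safe to uninstall" question. It does not address the code that currently assumes no zones exist, which would still need to tolerate halted leftovers until they are cleaned up.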

It's tempting to do the archival before all this, but this is so early in sled agent we're not even close to setting up the debug collector. In general, we also haven't adopted the disks yet and wouldn't know that we have a debug dataset in which to put the log files.
