Skip to content

Conversation

@labbott
Copy link
Contributor

@labbott labbott commented Jan 12, 2026

No description provided.

@labbott labbott marked this pull request as draft January 12, 2026 18:08
@labbott
Copy link
Contributor Author

labbott commented Jan 12, 2026

The rename is a dependency #9625 but it makes things a little bit nicer to review I think

@labbott labbott force-pushed the labbott/measurements_config_reconciler branch 3 times, most recently from ba33634 to a46c87b Compare January 14, 2026 15:45
@labbott labbott changed the title WIP: Add measurements to config reconciler Add measurements to config reconciler Jan 14, 2026
@labbott labbott marked this pull request as ready for review January 14, 2026 19:03
@labbott labbott force-pushed the labbott/measurements_config_reconciler branch from a46c87b to b2e33b9 Compare January 15, 2026 14:04
#[error("ledger contents not yet available")]
LedgerContentsNotAvailable,
#[error("waiting for ledger task to run")]
WaitingOnLedger,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of this error sounds the same as "ledger contents not yet available" to me. Could we either (a) remove this new variant and use LedgerContentsNotAvailable for it, if it really is basically the same thing, or (b) make it more clear how it's distinct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We return LedgerContentsNotAvailable if we return None below which can happen if ledger_task never gets set. I was trying to treat that as a different case than what I added the new variant for: we've set ledger_task but tokio has not yet scheduled the task to run so it's returning a default value. I think this are different cases because we do expect tokio to eventually run the ledger task. I do see that it's ambiguous so let me think about how to tweak these error types.

ledger_rx,
} = token;

let ledger_rx = ledger_rx.expect("Failed to call pre_spawn");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this; the spawn token stuff is supposed to statically guide the caller in such a way that they can't use the API incorrectly, but now they can (if they get a token and skip "pre spawn"). What do you think of something like this:

  • Make two public spawn token types (something like LedgerTaskSpawnToken and ConfigReconcilerSpawnToken). Probably makes sense to have an internal SpawnTokenCommon for most of the fields, but we could put the specifics in each one non-optionally: LedgerTaskSpawnToken would have ledger_task_log but no ledger_rx, and ConfigReconcilerSpawnToken would have a non-optional ledger_rx and no ledger_task_log.
  • ConfigReconciler::new() would return (Self, LedgerTaskSpawnToken)
  • Rename pre_spawn_reconciler_task to spawn_ledger_task(); it would take a LedgerTaskSpawnToken and return a ConfigReconcilerSpawnToken


/// Run a first reconciliation of reference measurements on cold boot
pub async fn bootstrap_measurement_reconciler(
&self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this uses self at all - does it need to be associated with the config reconciler? There aren't any callers of this so I'm not sure exactly what to suggest instead, but maybe it makes sense for this to be a free function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has evolved some. I was trying to make it clear this was doing things similar to the config reconciler but let me see if it's clearer being a free function.

internal_disks: &InternalDisks,
desired: &BTreeSet<OmicronSingleMeasurement>,
log: &Logger,
) -> Vec<Utf8PathBuf> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not returning errors here seems a little sketchy - what if we don't have any internal disks yet? What if we fail to read some paths we think should exist?

Lightly wondering if we should have a stronger type than Vec<Utf8Path> too, even if just as a place to put doc comments about what it's expected to contain. (Can it be empty? Can it have duplicates? That kind of thing.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311

I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?

Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.

/// Watch for changes to measurements
pub async fn measurement_corpus_rx(
&self,
pre_boot: Vec<Utf8PathBuf>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is pre_boot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Older less descriptive name for cold_boot_measurements I missed changing

path: full_path,
result: ConfigReconcilerInventoryResult::Ok,
})
.expect("file names should be unique");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the source of the file names? It doesn't seem obvious from the context here why they'd be guaranteed to be unique.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.

internal_disks: &InternalDisks,
) -> PreparedOmicronMeasurements {
// For now we only support measurements from the install dataset
// regardless of mupdate override status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure I understand - this comment is saying we only look at the measurements from the latest mupdate, and we don't look at the artifact store at all, right?

(I was looking for some bits that updating how the ledger task and artifact store cross check each other to make sure the artifact store doesn't remove measurements we still need, but if we're not accessing it yet it makes sense why it isn't there!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A doc comment for the function about what it means to prepare a measurement and what is being returned would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.

log: &slog::Logger,
resolver_status: &ResolverStatus,
internal_disks: &InternalDisks,
mut search_paths_cb: F,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this written to take a callback instead of returning a PreparedOmicronMeasurements directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another quirk of copying too heavily from zone images

Comment on lines +354 to +357
Err(e) => error!(
log,
"measurement entry error";
"error" => InlineErrorChain::new(&e),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error!() logs with no other handling make me nervous - there's no way to surface this at all. I'm not really following the difference between this method and reconcile_measurements. It looks like this method is building a list of files that are supposed to exist (or have already been confirmed to exist?), based on ResolverStatus, and then reconcile_measurements checks again if they exist? Could / should this method be building error results directly instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out lout a little, what we're ultimately trying to do with all this code is figure out what path to use for a particular measurement artifact. I think a good way to think about this split is that the code here in mupdate_override.rs generates all possible paths to search for all measurement artifact and then reconcile_measurements chooses the specific path to use from all valid paths. This is something that makes a little more sense for a zone image which is runnable but is not completely out there for other measurements.

I think you are correct that we aren't returning the errors properly. I think the zone images eventually fail when running at https://github.com/oxidecomputer/omicron/blob/main/sled-agent/types/src/resolvable_files.rs#L1285-L1307 so we probably need something similar here. This all needs more comments and at least a few tests (I think those got dropped during one of my refactors sigh)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh actually I remember more of the weirdness here. We're attempting to read the entire measurement manifest here entry by entry

pub fn all_measurements(
&self,
) -> Result<Vec<MeasurementEntry>, ManifestHashError> {
let artifacts_result = self
.boot_disk_result
.as_ref()
.map_err(|err| ManifestHashError::ReadBootDisk(err.clone()))?;
let mut results = Vec::new();
for artifact in artifacts_result.data.clone() {
match artifact.status {
ArtifactReadResult::Valid => {
results
.push(Ok((artifact.file_name, artifact.expected_hash)));
}
ArtifactReadResult::Mismatch { actual_size, actual_hash } => {
results.push(Err(ManifestHashError::SizeHashMismatch {
expected_size: artifact.expected_size,
expected_hash: artifact.expected_hash,
actual_size,
actual_hash,
}));
}
ArtifactReadResult::Error(err) => {
results.push(Err(ManifestHashError::ReadArtifact(
err.clone(),
)));
}
}
}
Ok(results)
}
}
if there's an error with a specific artifact entry we get the error but don't return the file name as part of the error. This has the weird side effect of preventing us from creating anything in OmicronResolvableFileSource because that requires a file_name. We can figure this out for zones because the ZoneType gives the file name. I will confess to seeing this before and thinking "ah I'm going to make that a problem for future me" so you can blame past me for putting this off.

}
}
} else {
error!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the same concern about error!() logs with no other handling, IIUC this branch means we'd end up returning a PreparedOmicronMeasurements with an empty set of sources, which would then lead to reconcile_measurements returning an empty set of ReconciledSingleMeasurements. Is that okay?

Copy link
Contributor Author

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I think I need to integrate the cold boot measurement case a little bit better which will hopefully clean a few things up.

internal_disks: &InternalDisks,
desired: &BTreeSet<OmicronSingleMeasurement>,
log: &Logger,
) -> Vec<Utf8PathBuf> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311

I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?

Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.

/// Watch for changes to measurements
pub async fn measurement_corpus_rx(
&self,
pre_boot: Vec<Utf8PathBuf>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Older less descriptive name for cold_boot_measurements I missed changing

pub sources: Vec<OmicronResolvableFileSource>,
}

impl PreparedOmicronMeasurements {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was code that makes much more sense when dealing with a full reconfigurated system. I think in the spirit of reviewability I'll trim this down and we can add it back later.

}
}

pub(crate) async fn reconcile_measurements(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent naming on my part. This all makes much more sense when we do have reconfigurator changing the system which will be part 2 of this. I do think keeping this interface to be semi-consistent is helpful but I can add a comment with a // TODO reconfigurator is coming in a separate series

path: full_path,
result: ConfigReconcilerInventoryResult::Ok,
})
.expect("file names should be unique");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.

internal_disks: &InternalDisks,
) -> PreparedOmicronMeasurements {
// For now we only support measurements from the install dataset
// regardless of mupdate override status
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.

log: &slog::Logger,
resolver_status: &ResolverStatus,
internal_disks: &InternalDisks,
mut search_paths_cb: F,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another quirk of copying too heavily from zone images


pub(crate) fn all_current_measurements(
&self,
) -> Either<
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the Either was cleaner than just checking for an empty measurement set but I think this is a sign I need to re-work how we check for cold boot measurements.

The config reconciler handles turning our selected measurements
into full paths. Right now this is only from the install dataset
but further work will select measurements from the artifact datastore.

This shuffles some things around in the early bootup to allow
reading `OmicronSledConfig` via the ledger task at cold boot.
Measurements are needed early in sled-agent startup for use with
trust-quorum.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants