Add measurements to config reconciler #9626

labbott · 2026-01-12T18:08:14Z

No description provided.

labbott · 2026-01-12T18:08:50Z

The rename is a dependency #9625 but it makes things a little bit nicer to review I think

jgallagher · 2026-01-15T16:22:22Z

sled-agent/config-reconciler/src/handle.rs

    #[error("ledger contents not yet available")]
    LedgerContentsNotAvailable,
+    #[error("waiting for ledger task to run")]
+    WaitingOnLedger,


The description of this error sounds the same as "ledger contents not yet available" to me. Could we either (a) remove this new variant and use LedgerContentsNotAvailable for it, if it really is basically the same thing, or (b) make it more clear how it's distinct?

We return LedgerContentsNotAvailable if we return None below which can happen if ledger_task never gets set. I was trying to treat that as a different case than what I added the new variant for: we've set ledger_task but tokio has not yet scheduled the task to run so it's returning a default value. I think this are different cases because we do expect tokio to eventually run the ledger task. I do see that it's ambiguous so let me think about how to tweak these error types.

jgallagher · 2026-01-15T16:25:59Z

sled-agent/config-reconciler/src/handle.rs

+            ledger_rx,
+        } = token;
+
+        let ledger_rx = ledger_rx.expect("Failed to call pre_spawn");


I don't love this; the spawn token stuff is supposed to statically guide the caller in such a way that they can't use the API incorrectly, but now they can (if they get a token and skip "pre spawn"). What do you think of something like this:

Make two public spawn token types (something like LedgerTaskSpawnToken and ConfigReconcilerSpawnToken). Probably makes sense to have an internal SpawnTokenCommon for most of the fields, but we could put the specifics in each one non-optionally: LedgerTaskSpawnToken would have ledger_task_log but no ledger_rx, and ConfigReconcilerSpawnToken would have a non-optional ledger_rx and no ledger_task_log.

ConfigReconciler::new() would return (Self, LedgerTaskSpawnToken)

Rename pre_spawn_reconciler_task to spawn_ledger_task(); it would take a LedgerTaskSpawnToken and return a ConfigReconcilerSpawnToken

jgallagher · 2026-01-15T16:29:55Z

sled-agent/config-reconciler/src/handle.rs


+    /// Run a first reconciliation of reference measurements on cold boot
+    pub async fn bootstrap_measurement_reconciler(
+        &self,


It doesn't look like this uses self at all - does it need to be associated with the config reconciler? There aren't any callers of this so I'm not sure exactly what to suggest instead, but maybe it makes sense for this to be a free function?

This code has evolved some. I was trying to make it clear this was doing things similar to the config reconciler but let me see if it's clearer being a free function.

jgallagher · 2026-01-15T16:30:22Z

sled-agent/config-reconciler/src/handle.rs

+        internal_disks: &InternalDisks,
+        desired: &BTreeSet<OmicronSingleMeasurement>,
+        log: &Logger,
+    ) -> Vec<Utf8PathBuf> {


Not returning errors here seems a little sketchy - what if we don't have any internal disks yet? What if we fail to read some paths we think should exist?

Lightly wondering if we should have a stronger type than Vec<Utf8Path> too, even if just as a place to put doc comments about what it's expected to contain. (Can it be empty? Can it have duplicates? That kind of thing.)

Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311

I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?

Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.

jgallagher · 2026-01-15T16:32:27Z

sled-agent/config-reconciler/src/handle.rs

+    /// Watch for changes to measurements
+    pub async fn measurement_corpus_rx(
+        &self,
+        pre_boot: Vec<Utf8PathBuf>,


What is pre_boot?

Older less descriptive name for cold_boot_measurements I missed changing

jgallagher · 2026-01-15T16:52:16Z

sled-agent/config-reconciler/src/measurements.rs

+                            path: full_path,
+                            result: ConfigReconcilerInventoryResult::Ok,
+                        })
+                        .expect("file names should be unique");


What's the source of the file names? It doesn't seem obvious from the context here why they'd be guaranteed to be unique.

This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.

jgallagher · 2026-01-15T16:55:26Z

sled-agent/config-reconciler/src/mupdate_override.rs

+        internal_disks: &InternalDisks,
+    ) -> PreparedOmicronMeasurements {
+        // For now we only support measurements from the install dataset
+        // regardless of mupdate override status


Just making sure I understand - this comment is saying we only look at the measurements from the latest mupdate, and we don't look at the artifact store at all, right?

(I was looking for some bits that updating how the ledger task and artifact store cross check each other to make sure the artifact store doesn't remove measurements we still need, but if we're not accessing it yet it makes sense why it isn't there!)

A doc comment for the function about what it means to prepare a measurement and what is being returned would be helpful.

You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.

jgallagher · 2026-01-15T16:56:20Z

sled-agent/config-reconciler/src/mupdate_override.rs

+    log: &slog::Logger,
+    resolver_status: &ResolverStatus,
+    internal_disks: &InternalDisks,
+    mut search_paths_cb: F,


Why is this written to take a callback instead of returning a PreparedOmicronMeasurements directly?

Another quirk of copying too heavily from zone images

jgallagher · 2026-01-15T16:58:57Z

sled-agent/config-reconciler/src/mupdate_override.rs

+                        Err(e) => error!(
+                            log,
+                            "measurement entry error";
+                            "error" => InlineErrorChain::new(&e),


error!() logs with no other handling make me nervous - there's no way to surface this at all. I'm not really following the difference between this method and reconcile_measurements. It looks like this method is building a list of files that are supposed to exist (or have already been confirmed to exist?), based on ResolverStatus, and then reconcile_measurements checks again if they exist? Could / should this method be building error results directly instead?

Thinking out lout a little, what we're ultimately trying to do with all this code is figure out what path to use for a particular measurement artifact. I think a good way to think about this split is that the code here in mupdate_override.rs generates all possible paths to search for all measurement artifact and then reconcile_measurements chooses the specific path to use from all valid paths. This is something that makes a little more sense for a zone image which is runnable but is not completely out there for other measurements.

I think you are correct that we aren't returning the errors properly. I think the zone images eventually fail when running at https://github.com/oxidecomputer/omicron/blob/main/sled-agent/types/src/resolvable_files.rs#L1285-L1307 so we probably need something similar here. This all needs more comments and at least a few tests (I think those got dropped during one of my refactors sigh)

Ahhh actually I remember more of the weirdness here. We're attempting to read the entire measurement manifest here entry by entry

omicron/sled-agent/types/src/resolvable_files.rs

Lines 127 to 159 in fdf3585

pub fn all_measurements(

&self,

) -> Result<Vec<MeasurementEntry>, ManifestHashError> {

let artifacts_result = self

.boot_disk_result

.as_ref()

.map_err(|err| ManifestHashError::ReadBootDisk(err.clone()))?;

let mut results = Vec::new();

for artifact in artifacts_result.data.clone() {

match artifact.status {

ArtifactReadResult::Valid => {

results

.push(Ok((artifact.file_name, artifact.expected_hash)));

}

ArtifactReadResult::Mismatch { actual_size, actual_hash } => {

results.push(Err(ManifestHashError::SizeHashMismatch {

expected_size: artifact.expected_size,

expected_hash: artifact.expected_hash,

actual_size,

actual_hash,

}));

}

ArtifactReadResult::Error(err) => {

results.push(Err(ManifestHashError::ReadArtifact(

err.clone(),

)));

}

}

}

Ok(results)

}

}

if there's an error with a specific artifact entry we get the error but don't return the file name as part of the error. This has the weird side effect of preventing us from creating anything in OmicronResolvableFileSource because that requires a file_name. We can figure this out for zones because the ZoneType gives the file name. I will confess to seeing this before and thinking "ah I'm going to make that a problem for future me" so you can blame past me for putting this off.

jgallagher · 2026-01-15T17:00:06Z

sled-agent/config-reconciler/src/mupdate_override.rs

+            }
+        }
+    } else {
+        error!(


Aside from the same concern about error!() logs with no other handling, IIUC this branch means we'd end up returning a PreparedOmicronMeasurements with an empty set of sources, which would then lead to reconcile_measurements returning an empty set of ReconciledSingleMeasurements. Is that okay?

labbott

Thanks for the feedback! I think I need to integrate the cold boot measurement case a little bit better which will hopefully clean a few things up.

labbott · 2026-01-15T17:56:38Z

sled-agent/config-reconciler/src/handle.rs

+        internal_disks: &InternalDisks,
+        desired: &BTreeSet<OmicronSingleMeasurement>,
+        log: &Logger,
+    ) -> Vec<Utf8PathBuf> {


Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311

I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?

Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.

labbott · 2026-01-15T17:57:17Z

sled-agent/config-reconciler/src/handle.rs

+    /// Watch for changes to measurements
+    pub async fn measurement_corpus_rx(
+        &self,
+        pre_boot: Vec<Utf8PathBuf>,


Older less descriptive name for cold_boot_measurements I missed changing

labbott · 2026-01-15T17:58:23Z

sled-agent/config-reconciler/src/measurements.rs

+    pub sources: Vec<OmicronResolvableFileSource>,
+}
+
+impl PreparedOmicronMeasurements {


This was code that makes much more sense when dealing with a full reconfigurated system. I think in the spirit of reviewability I'll trim this down and we can add it back later.

labbott · 2026-01-15T17:59:53Z

sled-agent/config-reconciler/src/measurements.rs

+    }
+}
+
+pub(crate) async fn reconcile_measurements(


Inconsistent naming on my part. This all makes much more sense when we do have reconfigurator changing the system which will be part 2 of this. I do think keeping this interface to be semi-consistent is helpful but I can add a comment with a // TODO reconfigurator is coming in a separate series

labbott · 2026-01-15T18:35:57Z

sled-agent/config-reconciler/src/measurements.rs

+                            path: full_path,
+                            result: ConfigReconcilerInventoryResult::Ok,
+                        })
+                        .expect("file names should be unique");


This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.

labbott · 2026-01-15T18:36:52Z

sled-agent/config-reconciler/src/mupdate_override.rs

+        internal_disks: &InternalDisks,
+    ) -> PreparedOmicronMeasurements {
+        // For now we only support measurements from the install dataset
+        // regardless of mupdate override status


You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.

labbott · 2026-01-15T18:39:34Z

sled-agent/config-reconciler/src/mupdate_override.rs

+    log: &slog::Logger,
+    resolver_status: &ResolverStatus,
+    internal_disks: &InternalDisks,
+    mut search_paths_cb: F,


Another quirk of copying too heavily from zone images

labbott · 2026-01-15T18:43:20Z

sled-agent/config-reconciler/src/reconciler_task.rs


+    pub(crate) fn all_current_measurements(
+        &self,
+    ) -> Either<


I thought the Either was cleaner than just checking for an empty measurement set but I think this is a sign I need to re-work how we check for cold boot measurements.

The config reconciler handles turning our selected measurements into full paths. Right now this is only from the install dataset but further work will select measurements from the artifact datastore. This shuffles some things around in the early bootup to allow reading `OmicronSledConfig` via the ledger task at cold boot. Measurements are needed early in sled-agent startup for use with trust-quorum.

labbott marked this pull request as draft January 12, 2026 18:08

labbott force-pushed the labbott/measurements_config_reconciler branch 3 times, most recently from ba33634 to a46c87b Compare January 14, 2026 15:45

labbott changed the title ~~WIP: Add measurements to config reconciler~~ Add measurements to config reconciler Jan 14, 2026

labbott marked this pull request as ready for review January 14, 2026 19:03

labbott force-pushed the labbott/measurements_config_reconciler branch from a46c87b to b2e33b9 Compare January 15, 2026 14:04

labbott requested review from andrewjstone, jgallagher and sunshowers January 15, 2026 14:04

jgallagher reviewed Jan 15, 2026

View reviewed changes

labbott commented Jan 15, 2026

View reviewed changes

	pub fn all_measurements(
	&self,
	) -> Result<Vec<MeasurementEntry>, ManifestHashError> {
	let artifacts_result = self
	.boot_disk_result
	.as_ref()
	.map_err(\|err\| ManifestHashError::ReadBootDisk(err.clone()))?;

	let mut results = Vec::new();
	for artifact in artifacts_result.data.clone() {
	match artifact.status {
	ArtifactReadResult::Valid => {
	results
	.push(Ok((artifact.file_name, artifact.expected_hash)));
	}
	ArtifactReadResult::Mismatch { actual_size, actual_hash } => {
	results.push(Err(ManifestHashError::SizeHashMismatch {
	expected_size: artifact.expected_size,
	expected_hash: artifact.expected_hash,
	actual_size,
	actual_hash,
	}));
	}
	ArtifactReadResult::Error(err) => {
	results.push(Err(ManifestHashError::ReadArtifact(
	err.clone(),
	)));
	}
	}
	}
	Ok(results)
	}
	}

Add measurements to config reconciler #9626

Are you sure you want to change the base?

Add measurements to config reconciler #9626

Uh oh!

Conversation

labbott commented Jan 12, 2026

Uh oh!

labbott commented Jan 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

labbott left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants