-
Notifications
You must be signed in to change notification settings - Fork 67
Add measurements to config reconciler #9626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The rename is a dependency #9625 but it makes things a little bit nicer to review I think |
ba33634 to
a46c87b
Compare
a46c87b to
b2e33b9
Compare
| #[error("ledger contents not yet available")] | ||
| LedgerContentsNotAvailable, | ||
| #[error("waiting for ledger task to run")] | ||
| WaitingOnLedger, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description of this error sounds the same as "ledger contents not yet available" to me. Could we either (a) remove this new variant and use LedgerContentsNotAvailable for it, if it really is basically the same thing, or (b) make it more clear how it's distinct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We return LedgerContentsNotAvailable if we return None below which can happen if ledger_task never gets set. I was trying to treat that as a different case than what I added the new variant for: we've set ledger_task but tokio has not yet scheduled the task to run so it's returning a default value. I think this are different cases because we do expect tokio to eventually run the ledger task. I do see that it's ambiguous so let me think about how to tweak these error types.
| ledger_rx, | ||
| } = token; | ||
|
|
||
| let ledger_rx = ledger_rx.expect("Failed to call pre_spawn"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this; the spawn token stuff is supposed to statically guide the caller in such a way that they can't use the API incorrectly, but now they can (if they get a token and skip "pre spawn"). What do you think of something like this:
- Make two public spawn token types (something like
LedgerTaskSpawnTokenandConfigReconcilerSpawnToken). Probably makes sense to have an internalSpawnTokenCommonfor most of the fields, but we could put the specifics in each one non-optionally:LedgerTaskSpawnTokenwould haveledger_task_logbut noledger_rx, andConfigReconcilerSpawnTokenwould have a non-optionalledger_rxand noledger_task_log. ConfigReconciler::new()would return(Self, LedgerTaskSpawnToken)- Rename
pre_spawn_reconciler_tasktospawn_ledger_task(); it would take aLedgerTaskSpawnTokenand return aConfigReconcilerSpawnToken
|
|
||
| /// Run a first reconciliation of reference measurements on cold boot | ||
| pub async fn bootstrap_measurement_reconciler( | ||
| &self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this uses self at all - does it need to be associated with the config reconciler? There aren't any callers of this so I'm not sure exactly what to suggest instead, but maybe it makes sense for this to be a free function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has evolved some. I was trying to make it clear this was doing things similar to the config reconciler but let me see if it's clearer being a free function.
| internal_disks: &InternalDisks, | ||
| desired: &BTreeSet<OmicronSingleMeasurement>, | ||
| log: &Logger, | ||
| ) -> Vec<Utf8PathBuf> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not returning errors here seems a little sketchy - what if we don't have any internal disks yet? What if we fail to read some paths we think should exist?
Lightly wondering if we should have a stronger type than Vec<Utf8Path> too, even if just as a place to put doc comments about what it's expected to contain. (Can it be empty? Can it have duplicates? That kind of thing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311
I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?
Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.
| /// Watch for changes to measurements | ||
| pub async fn measurement_corpus_rx( | ||
| &self, | ||
| pre_boot: Vec<Utf8PathBuf>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is pre_boot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Older less descriptive name for cold_boot_measurements I missed changing
| path: full_path, | ||
| result: ConfigReconcilerInventoryResult::Ok, | ||
| }) | ||
| .expect("file names should be unique"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the source of the file names? It doesn't seem obvious from the context here why they'd be guaranteed to be unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.
| internal_disks: &InternalDisks, | ||
| ) -> PreparedOmicronMeasurements { | ||
| // For now we only support measurements from the install dataset | ||
| // regardless of mupdate override status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making sure I understand - this comment is saying we only look at the measurements from the latest mupdate, and we don't look at the artifact store at all, right?
(I was looking for some bits that updating how the ledger task and artifact store cross check each other to make sure the artifact store doesn't remove measurements we still need, but if we're not accessing it yet it makes sense why it isn't there!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A doc comment for the function about what it means to prepare a measurement and what is being returned would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.
| log: &slog::Logger, | ||
| resolver_status: &ResolverStatus, | ||
| internal_disks: &InternalDisks, | ||
| mut search_paths_cb: F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this written to take a callback instead of returning a PreparedOmicronMeasurements directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another quirk of copying too heavily from zone images
| Err(e) => error!( | ||
| log, | ||
| "measurement entry error"; | ||
| "error" => InlineErrorChain::new(&e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error!() logs with no other handling make me nervous - there's no way to surface this at all. I'm not really following the difference between this method and reconcile_measurements. It looks like this method is building a list of files that are supposed to exist (or have already been confirmed to exist?), based on ResolverStatus, and then reconcile_measurements checks again if they exist? Could / should this method be building error results directly instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out lout a little, what we're ultimately trying to do with all this code is figure out what path to use for a particular measurement artifact. I think a good way to think about this split is that the code here in mupdate_override.rs generates all possible paths to search for all measurement artifact and then reconcile_measurements chooses the specific path to use from all valid paths. This is something that makes a little more sense for a zone image which is runnable but is not completely out there for other measurements.
I think you are correct that we aren't returning the errors properly. I think the zone images eventually fail when running at https://github.com/oxidecomputer/omicron/blob/main/sled-agent/types/src/resolvable_files.rs#L1285-L1307 so we probably need something similar here. This all needs more comments and at least a few tests (I think those got dropped during one of my refactors sigh)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh actually I remember more of the weirdness here. We're attempting to read the entire measurement manifest here entry by entry
omicron/sled-agent/types/src/resolvable_files.rs
Lines 127 to 159 in fdf3585
| pub fn all_measurements( | |
| &self, | |
| ) -> Result<Vec<MeasurementEntry>, ManifestHashError> { | |
| let artifacts_result = self | |
| .boot_disk_result | |
| .as_ref() | |
| .map_err(|err| ManifestHashError::ReadBootDisk(err.clone()))?; | |
| let mut results = Vec::new(); | |
| for artifact in artifacts_result.data.clone() { | |
| match artifact.status { | |
| ArtifactReadResult::Valid => { | |
| results | |
| .push(Ok((artifact.file_name, artifact.expected_hash))); | |
| } | |
| ArtifactReadResult::Mismatch { actual_size, actual_hash } => { | |
| results.push(Err(ManifestHashError::SizeHashMismatch { | |
| expected_size: artifact.expected_size, | |
| expected_hash: artifact.expected_hash, | |
| actual_size, | |
| actual_hash, | |
| })); | |
| } | |
| ArtifactReadResult::Error(err) => { | |
| results.push(Err(ManifestHashError::ReadArtifact( | |
| err.clone(), | |
| ))); | |
| } | |
| } | |
| } | |
| Ok(results) | |
| } | |
| } |
OmicronResolvableFileSource because that requires a file_name. We can figure this out for zones because the ZoneType gives the file name. I will confess to seeing this before and thinking "ah I'm going to make that a problem for future me" so you can blame past me for putting this off.
| } | ||
| } | ||
| } else { | ||
| error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the same concern about error!() logs with no other handling, IIUC this branch means we'd end up returning a PreparedOmicronMeasurements with an empty set of sources, which would then lead to reconcile_measurements returning an empty set of ReconciledSingleMeasurements. Is that okay?
labbott
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I think I need to integrate the cold boot measurement case a little bit better which will hopefully clean a few things up.
| internal_disks: &InternalDisks, | ||
| desired: &BTreeSet<OmicronSingleMeasurement>, | ||
| log: &Logger, | ||
| ) -> Vec<Utf8PathBuf> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this may be a little easier to understand when the measurements are fully pushed through to sprockets https://github.com/oxidecomputer/omicron/pull/9629/changes#diff-c0774fe1766438e80c0c023511ab7028ae8a636644fc4030feba92b901bc6a11R255-R311
I kept it out to avoid growing the PR too much but maybe the ~200 lines of context would be helpful?
Error handling is an interesting question. The eventual goal is to have the measurements be required to unlock trust quorum. If we can't figure out our measurements for initial cold boot we can't do trust quorum and therefore boot much of anything. I had a comment along the lines of // give up forever? if this function failed which both seems wrong and the only thing to do. Typing this out, I think you're correct that this deserves a more complex return type to differentiate some of these cases and then let trust-quorum/bootstrap figure out what to do with the error.
| /// Watch for changes to measurements | ||
| pub async fn measurement_corpus_rx( | ||
| &self, | ||
| pre_boot: Vec<Utf8PathBuf>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Older less descriptive name for cold_boot_measurements I missed changing
| pub sources: Vec<OmicronResolvableFileSource>, | ||
| } | ||
|
|
||
| impl PreparedOmicronMeasurements { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was code that makes much more sense when dealing with a full reconfigurated system. I think in the spirit of reviewability I'll trim this down and we can add it back later.
| } | ||
| } | ||
|
|
||
| pub(crate) async fn reconcile_measurements( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent naming on my part. This all makes much more sense when we do have reconfigurator changing the system which will be part 2 of this. I do think keeping this interface to be semi-consistent is helpful but I can add a comment with a // TODO reconfigurator is coming in a separate series
| path: full_path, | ||
| result: ConfigReconcilerInventoryResult::Ok, | ||
| }) | ||
| .expect("file names should be unique"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a semi-weird quirk of copying concepts from zone images. zone images are guaranteed to be unique because there's a unique enum type per zone image. I purposely wanted to give us flexibility in how we structure our measurements and didn't want to tie us to choices in hubris releng process and create an enum. File names in the install dataset are all in the same directory and should be unique and (eventually) hashes from the repo depot.
| internal_disks: &InternalDisks, | ||
| ) -> PreparedOmicronMeasurements { | ||
| // For now we only support measurements from the install dataset | ||
| // regardless of mupdate override status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct. Follow on work will access measurements from the artifact store with full reconfigurator changes.
| log: &slog::Logger, | ||
| resolver_status: &ResolverStatus, | ||
| internal_disks: &InternalDisks, | ||
| mut search_paths_cb: F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another quirk of copying too heavily from zone images
|
|
||
| pub(crate) fn all_current_measurements( | ||
| &self, | ||
| ) -> Either< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the Either was cleaner than just checking for an empty measurement set but I think this is a sign I need to re-work how we check for cold boot measurements.
The config reconciler handles turning our selected measurements into full paths. Right now this is only from the install dataset but further work will select measurements from the artifact datastore. This shuffles some things around in the early bootup to allow reading `OmicronSledConfig` via the ledger task at cold boot. Measurements are needed early in sled-agent startup for use with trust-quorum.
No description provided.