-
Notifications
You must be signed in to change notification settings - Fork 15
feature: Runtime update auto-rollback #407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements an auto-rollback feature for runtime updates. When a runtime update finalization fails, the system will automatically attempt to roll back to the previous known good configuration.
Key Changes:
- Added auto-rollback logic that triggers when runtime update finalization fails
- Introduced a
rollbackboolean parameter tofinalize_updatefunction to control whether the operation is performing a rollback - Added
internal_state_matches_engine_ctxmethod to detect when subsystem state needs repopulation (handles both separate finalize calls and auto-rollback scenarios) - Enabled health checks to run for RuntimeUpdate servicing type
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
crates/trident_api/src/config/host/health.rs |
Added RuntimeUpdate to the servicing types that should execute health checks |
crates/trident/src/subsystems/extensions/utils.rs |
Added state matching function and helper to verify subsystem state consistency with engine context; includes comprehensive test coverage |
crates/trident/src/subsystems/extensions/mod.rs |
Updated condition to use new state matching function for determining when to repopulate subsystem state |
crates/trident/src/lib.rs |
Implemented auto-rollback error handling when runtime update finalization fails |
crates/trident/src/engine/update.rs |
Implemented auto-rollback error handling when runtime update finalization fails (similar pattern to lib.rs) |
crates/trident/src/engine/runtime_update.rs |
Added rollback parameter to finalize_update function, implemented spec swapping logic for rollback, and added health check execution |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
crates/trident/src/engine/update.rs
Outdated
| return runtime_update::finalize_update( | ||
| &mut subsystems, | ||
| state, | ||
| true, | ||
| Some(update_start_time), | ||
| #[cfg(feature = "grpc-dangerous")] | ||
| sender, | ||
| ); |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the auto-rollback logic, if the rollback itself fails, the error from the rollback operation will be returned. However, this loses the original error context from the initial finalize failure. Consider wrapping or chaining the errors so that both the original failure and the rollback failure are captured for better debugging and diagnostics.
| // Run health checks if we are performing a runtime update (skip if we are | ||
| // rolling back) | ||
| if !rollback { | ||
| health::execute_health_checks(&ctx)?; | ||
| } | ||
|
|
||
| debug!( | ||
| "Updating host's servicing state to '{:?}'", | ||
| ServicingState::Provisioned |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When performing an auto-rollback, the host status state update logic at lines 151-154 may not be correct. After a rollback, setting spec_old to Default::default() and spec to ctx.spec (which contains the old spec during rollback) may leave the system in an inconsistent state. During rollback, we're reverting to the old configuration, so the state transitions and spec/spec_old updates should reflect that we're returning to a previously known good state rather than completing a new update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| .map(|e| (&e.sha384, (&e.path, &e.ext_type))) | ||
| .collect(); | ||
| // Iterate through all extensions in the Host Configuration and ensure that | ||
| // each extension images matches a corresponding extension image on SHA384 |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammatical error in comment: "each extension images matches" should be "each extension image matches".
| // each extension images matches a corresponding extension image on SHA384 | |
| // each extension image matches a corresponding extension image on SHA384 |
| false, | ||
| true, |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The auto-rollback call has incorrect parameters. When performing a rollback, reverse_specs should be true (not false) to reverse the spec and spec_old in the EngineContext. Additionally, run_health_checks should be false during rollback to avoid triggering another rollback if health checks fail. Compare with the correct implementation in update.rs lines 159-165 which uses reverse_specs: true and run_health_checks: false.
| false, | |
| true, | |
| true, // reverse_specs: true for rollback | |
| false, // run_health_checks: false for rollback |
| .message("Failed to finalize runtime update"); | ||
| if let Err(e) = finalize_result { | ||
| error!("Runtime update finalize failed with message: {e:?}"); | ||
| // Attempt an auto-rollback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expand this comment a little to describe how calling finalize_update a second time is attempting an auto-rollback
| if allowed_operations.has_finalize() { | ||
| let mut subsystems = SUBSYSTEMS.lock().unwrap(); | ||
| runtime_update::finalize_update( | ||
| let finalize_result = runtime_update::finalize_update( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seeing this chunk of double-call-runtime_update::finalize_update (here and in update.rs) makes me think it should be refactored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe finalize_update can accept ServicingType instead of the 2 bools ...
what is currently finalize_udpate can become inner_finalize or do_finalize, and finalize_udpate can call inner_finalize passing spec/old-spec as-is ... if inner_finalize fails health checks, then if ServicingType!=ManualRollback, inner_finalize can call itself with spec/old-spec reversed (to invoke rollback).
🔍 Description
Implement auto-rollback for runtime updates. Auto-rollback is triggered when a step in Finalize fails or a health check fails.
📌 Follow-ups
TODO: