Skip to content

Conversation

@ayaegashi
Copy link
Contributor

@ayaegashi ayaegashi commented Dec 12, 2025

🔍 Description

Implement auto-rollback for runtime updates. Auto-rollback is triggered when a step in Finalize fails or a health check fails.

📌 Follow-ups

TODO:

  • Optimize caching extension images: Trident should search for extension images on the OS when rolling back before attempting to re-download. This will be in a follow-up PR.

@ayaegashi ayaegashi requested a review from a team as a code owner December 12, 2025 00:53
Copilot AI review requested due to automatic review settings December 12, 2025 00:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements an auto-rollback feature for runtime updates. When a runtime update finalization fails, the system will automatically attempt to roll back to the previous known good configuration.

Key Changes:

  • Added auto-rollback logic that triggers when runtime update finalization fails
  • Introduced a rollback boolean parameter to finalize_update function to control whether the operation is performing a rollback
  • Added internal_state_matches_engine_ctx method to detect when subsystem state needs repopulation (handles both separate finalize calls and auto-rollback scenarios)
  • Enabled health checks to run for RuntimeUpdate servicing type

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
crates/trident_api/src/config/host/health.rs Added RuntimeUpdate to the servicing types that should execute health checks
crates/trident/src/subsystems/extensions/utils.rs Added state matching function and helper to verify subsystem state consistency with engine context; includes comprehensive test coverage
crates/trident/src/subsystems/extensions/mod.rs Updated condition to use new state matching function for determining when to repopulate subsystem state
crates/trident/src/lib.rs Implemented auto-rollback error handling when runtime update finalization fails
crates/trident/src/engine/update.rs Implemented auto-rollback error handling when runtime update finalization fails (similar pattern to lib.rs)
crates/trident/src/engine/runtime_update.rs Added rollback parameter to finalize_update function, implemented spec swapping logic for rollback, and added health check execution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 180 to 187
return runtime_update::finalize_update(
&mut subsystems,
state,
true,
Some(update_start_time),
#[cfg(feature = "grpc-dangerous")]
sender,
);
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the auto-rollback logic, if the rollback itself fails, the error from the rollback operation will be returned. However, this loses the original error context from the initial finalize failure. Consider wrapping or chaining the errors so that both the original failure and the rollback failure are captured for better debugging and diagnostics.

Copilot uses AI. Check for mistakes.
Comment on lines 141 to 149
// Run health checks if we are performing a runtime update (skip if we are
// rolling back)
if !rollback {
health::execute_health_checks(&ctx)?;
}

debug!(
"Updating host's servicing state to '{:?}'",
ServicingState::Provisioned
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When performing an auto-rollback, the host status state update logic at lines 151-154 may not be correct. After a rollback, setting spec_old to Default::default() and spec to ctx.spec (which contains the old spec during rollback) may leave the system in an inconsistent state. During rollback, we're reverting to the old configuration, so the state transitions and spec/spec_old updates should reflect that we're returning to a previously known good state rather than completing a new update.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings December 12, 2025 01:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 12, 2025 02:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 13, 2025 03:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 15, 2025 20:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ayaegashi ayaegashi changed the title feature: Runtime update auto-rollback feature: Runtime update rollback Dec 16, 2025
Copilot AI review requested due to automatic review settings December 16, 2025 00:16
@ayaegashi
Copy link
Contributor Author

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ayaegashi ayaegashi changed the title feature: Runtime update rollback feature: Runtime update auto-rollback Dec 16, 2025
@ayaegashi
Copy link
Contributor Author

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings December 16, 2025 07:29
@ayaegashi
Copy link
Contributor Author

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.map(|e| (&e.sha384, (&e.path, &e.ext_type)))
.collect();
// Iterate through all extensions in the Host Configuration and ensure that
// each extension images matches a corresponding extension image on SHA384
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error in comment: "each extension images matches" should be "each extension image matches".

Suggested change
// each extension images matches a corresponding extension image on SHA384
// each extension image matches a corresponding extension image on SHA384

Copilot uses AI. Check for mistakes.
Comment on lines +580 to +581
false,
true,
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-rollback call has incorrect parameters. When performing a rollback, reverse_specs should be true (not false) to reverse the spec and spec_old in the EngineContext. Additionally, run_health_checks should be false during rollback to avoid triggering another rollback if health checks fail. Compare with the correct implementation in update.rs lines 159-165 which uses reverse_specs: true and run_health_checks: false.

Suggested change
false,
true,
true, // reverse_specs: true for rollback
false, // run_health_checks: false for rollback

Copilot uses AI. Check for mistakes.
.message("Failed to finalize runtime update");
if let Err(e) = finalize_result {
error!("Runtime update finalize failed with message: {e:?}");
// Attempt an auto-rollback
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expand this comment a little to describe how calling finalize_update a second time is attempting an auto-rollback

if allowed_operations.has_finalize() {
let mut subsystems = SUBSYSTEMS.lock().unwrap();
runtime_update::finalize_update(
let finalize_result = runtime_update::finalize_update(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing this chunk of double-call-runtime_update::finalize_update (here and in update.rs) makes me think it should be refactored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe finalize_update can accept ServicingType instead of the 2 bools ...

what is currently finalize_udpate can become inner_finalize or do_finalize, and finalize_udpate can call inner_finalize passing spec/old-spec as-is ... if inner_finalize fails health checks, then if ServicingType!=ManualRollback, inner_finalize can call itself with spec/old-spec reversed (to invoke rollback).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants