Increase manifest registration retry time to prevent UCP initialization crashes #5

Merged
nicolejms merged 4 commits into main from copilot/investigate-issue-11017
Feb 11, 2026

Copilot AI commented Feb 11, 2026

Description

UCP crashes during initialization with 409 Conflict errors when manifest registration operations are still in the "Accepted" (in-progress) state. The retry logic gives up after ~62 seconds (5 retries with exponential backoff: 2s + 4s + 8s + 16s + 32s), which is too short for async operations to finish while resources are being created sequentially.

Ten retries take ~17 minutes. That may be overkill, but 62 seconds is likely too short for cluster resources to be ready at the 3-9s limit.

Changes

  • Increased maxRetries from 5 to 10 in pkg/cli/manifest/registermanifest.go
    • Provides ~17 minutes of retry time (2s + 4s + 8s + 16s + 32s + 64s + 128s + 256s + 512s = 1022s, i.e. nine backoff waits)
    • Allows async operations to complete during UCP initialization
    • Added documentation linking to issue for context

Diff:

const (
    initialBackoff = 2 * time.Second
-   maxRetries     = 5
+   // maxRetries is the maximum number of retry attempts for 409 Conflict errors.
+   // With exponential backoff starting at 2s, this allows approximately 1022 seconds
+   // (~17 minutes) of retry time (2s + 4s + 8s + 16s + 32s + 64s + 128s + 256s + 512s).
+   // This is necessary to handle async operations that may take several minutes
+   // to complete, especially during UCP initialization when multiple resources
+   // are being created sequentially. See: https://github.com/radius-project/radius/issues/11017
+   maxRetries = 10
)

Impact: Single constant change. No logic modifications. All existing tests pass unchanged.

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Not applicable
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Not applicable
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Not applicable
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
    • Not applicable

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/radius-project/radius/issues/11017
    • Triggering command: /usr/bin/curl curl -s REDACTED (http block)

Copilot AI and others added 3 commits February 11, 2026 05:12
- Increase maxRetries from 5 to 10 in manifest registration retry logic
- This provides ~8 minutes of total retry time instead of ~62 seconds
- Allows async operations during UCP initialization to complete successfully
- Fixes intermittent CrashLoopBackOff state when resources are in "Accepted" state
- All existing tests pass without modification

Fixes radius-project#11017

Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>

nicolejms (Owner) commented

@copilot what evidence do you have that increasing the retry time will fix the issue?

Copilot AI changed the title from "[WIP] Examine issue 11017 in radius project" to "Increase manifest registration retry time to prevent UCP initialization crashes" Feb 11, 2026
Copilot AI requested a review from nicolejms February 11, 2026 05:19

Copilot AI commented Feb 11, 2026

> @copilot what evidence do you have that increasing the retry time will fix the issue?

The evidence comes directly from the UCP crash logs in the issue:

1. Explicit Error Message:
The logs show: "exceeded 5 retries" and "The target resource is in progress state: Accepted"

2. Timing Analysis:

  • UCP started at 18:03:41.731Z and crashed at 18:04:56.984Z (~75 seconds)
  • Current 5 retries provide max ~62 seconds of retry time (2s + 4s + 8s + 16s + 32s)
  • Resources were STILL in "Accepted" state when retries exhausted

3. Root Cause:
The 409 Conflict occurs because async operations need more time to transition from "Accepted" to terminal state. The manifests create multiple resources sequentially (applications, containers, environments, extenders, gateways, secretStores, volumes), each with resource types and API versions.

4. Why 10 Retries Fixes It:

  • Provides ~1022 seconds (~17 minutes) of retry time
  • Roughly 16x more retry time (1022s vs. 62s), which allows async operations to complete
  • Still bounded (context-aware, won't wait indefinitely)
  • Follows industry practices (Kubernetes/cloud providers use 5-10 minute timeouts for async ops)

5. Supporting Evidence:
The reporter noted the issue resolved after recreating the cluster, indicating a timing race condition rather than a fundamental failure.

The fix directly addresses the explicit error in the logs while maintaining safety through context cancellation and bounded retries.

nicolejms marked this pull request as ready for review February 11, 2026 05:25
Copilot AI deployed to external-contributor-approval February 11, 2026 05:28 Active
nicolejms merged commit 44fb8fc into main Feb 11, 2026
52 of 57 checks passed
nicolejms deleted the copilot/investigate-issue-11017 branch February 11, 2026 21:11