Description
Summary
The oras binary is missing from the AzureLinux V3 VHD image 202601.27.0, causing the Custom Script Extension (CSE) bootstrap to fail with exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) on Karpenter-provisioned AKS nodes. Affected nodes never join the cluster.
The previous production image 202601.13.0 works correctly.
Environment
| Field | Value |
|---|---|
| OS | AzureLinux V3 |
| Failing image version | 202601.27.0 |
| Last known-good image | 202601.13.0 |
| Node provisioner | Karpenter (karpenter-provider-azure) |
| Failure symptom | Node never joins cluster; CSE exits 211 |
Root Cause
oras is called early in the CSE bootstrap flow via oras_login_with_kubelet_identity and related functions in cse_helpers.sh. If the oras binary is absent from the image (not installed during VHD build, or accidentally omitted), every call to oras fails without reporting the missing binary, and the failure surfaces as a misleading exit code: in this case 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT), even though the real cause is a missing binary, not a network timeout.
The true failure mode (oras not found in $PATH) is never logged, which makes the problem very hard to diagnose from CSE logs alone.
Impact
- Blast radius: All AzureLinux V3 nodes provisioned with image 202601.27.0 via Karpenter fail to join the cluster.
- User impact: NodeClaims remain in a pending/disrupted state indefinitely; workloads cannot schedule.
- Diagnosability: Exit code 211 (`ERR_ORAS_PULL_NETWORK_TIMEOUT`) points to a network issue, causing engineers to investigate firewalls and IMDS reachability before discovering the binary is simply missing.
Steps to Reproduce
- Provision an AKS cluster with Karpenter enabled.
- Create a `NodePool` that targets AzureLinux V3 nodes.
- Schedule a workload that triggers node provisioning using image 202601.27.0.
- Observe that the node never reaches `Ready` state and CSE exits with code 211.
Expected vs Actual Behavior
| Condition | Expected | Actual |
|---|---|---|
| oras present | CSE completes, node joins cluster | n/a |
| oras missing | CSE exits with ERR_ORAS_BINARY_NOT_FOUND and logs clear diagnostic info | CSE exits with ERR_ORAS_PULL_NETWORK_TIMEOUT (211), no indication binary is missing |
Fix
This PR adds a pre-flight check at the top of oras_login_with_kubelet_identity (in cse_helpers.sh) that:
- Calls `command -v oras` to verify the binary is in `$PATH`.
- If missing: logs `$PATH`, probes known install locations (`/usr/local/bin/oras`, `/usr/bin/oras`, `/opt/bin/oras`), dumps `/etc/os-release`, and queries `rpm` or `dpkg` for any installed oras packages.
- Returns a new, unambiguous error code `ERR_ORAS_BINARY_NOT_FOUND=232` so operators immediately understand what happened.
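For readers who want a concrete picture, the check described in the list above might look roughly like the following. This is a minimal sketch, not the code from the PR: the function name `check_oras_binary` and the exact log wording are assumptions made for illustration, and only the error code value, the probed paths, and the use of `command -v` come from the description.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the pre-flight check; not the code from the PR.

ERR_ORAS_BINARY_NOT_FOUND=232  # value taken from the description above

# check_oras_binary is an illustrative name, not a function in cse_helpers.sh.
check_oras_binary() {
    # Fast path: oras resolves through PATH, nothing to report.
    if command -v oras >/dev/null 2>&1; then
        return 0
    fi

    echo "oras not found in PATH: ${PATH}" >&2

    # Probe the known install locations listed in the description.
    local candidate
    for candidate in /usr/local/bin/oras /usr/bin/oras /opt/bin/oras; do
        if [ -e "${candidate}" ]; then
            echo "found ${candidate} (outside PATH or not executable):" >&2
            ls -l "${candidate}" >&2
        else
            echo "not present: ${candidate}" >&2
        fi
    done

    # Record which OS image this node was built from.
    if [ -r /etc/os-release ]; then
        cat /etc/os-release >&2
    fi

    # Ask whichever package manager exists about installed oras packages.
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa 2>/dev/null | grep -i oras >&2 || echo "rpm: no oras package installed" >&2
    elif command -v dpkg >/dev/null 2>&1; then
        dpkg -l 2>/dev/null | grep -i oras >&2 || echo "dpkg: no oras package listed" >&2
    fi

    return "${ERR_ORAS_BINARY_NOT_FOUND}"
}
```

A caller would presumably run this before any oras invocation and propagate the status unchanged, for example `check_oras_binary || return $?`, so that 232 reaches the CSE exit code instead of being replaced further up the call chain.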
A separate investigation is needed to determine why oras was not included in the 202601.27.0 VHD build. That is a VHD pipeline issue tracked separately.
Related
- Fix PR: (this PR)
- karpenter-provider-azure tracking issue: (filed separately — surface CSE exit code in NodeClaim conditions)
Additional Context
Exit code 211 is defined as ERR_ORAS_PULL_NETWORK_TIMEOUT and is returned from retrycmd_get_refresh_token_for_oras when the ACR token exchange fails. Without oras in $PATH, however, the function that calls it (retrycmd_can_oras_ls_acr_anonymously) fails immediately, and the error propagates upward as a generic network timeout rather than a missing-binary error.
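The masking effect can be illustrated with a small, self-contained sketch. The internals of the real retry helpers are not shown in this issue, so `retry_masking` and `retry_aware` below are hypothetical stand-ins. The only facts relied on are the two error codes from this report and the standard shell behavior of returning status 127 when a command cannot be found.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; retry_masking and retry_aware are illustrative names,
# not functions from cse_helpers.sh.

ERR_ORAS_PULL_NETWORK_TIMEOUT=211  # from the description above
ERR_ORAS_BINARY_NOT_FOUND=232      # new code proposed by the fix

# Retries a command and reports every failure as a network timeout.
# A missing binary (status 127) is indistinguishable from a real timeout.
retry_masking() {
    local retries=$1; shift
    local i
    for ((i = 0; i < retries; i++)); do
        "$@" && return 0
    done
    return "${ERR_ORAS_PULL_NETWORK_TIMEOUT}"
}

# Same loop, but stops early on 127, the status the shell returns
# when the command cannot be found, and maps it to a dedicated code.
retry_aware() {
    local retries=$1; shift
    local i rc
    for ((i = 0; i < retries; i++)); do
        rc=0
        "$@" || rc=$?
        if [ "${rc}" -eq 0 ]; then
            return 0
        fi
        if [ "${rc}" -eq 127 ]; then
            return "${ERR_ORAS_BINARY_NOT_FOUND}"
        fi
    done
    return "${ERR_ORAS_PULL_NETWORK_TIMEOUT}"
}
```

Calling `retry_masking 2 some_missing_command` returns 211, while `retry_aware 2 some_missing_command` returns 232. Retrying a command that does not exist cannot succeed, so returning early also avoids spending the retry budget on a failure that is not transient.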