bug: oras binary missing from AzureLinux V3 image 202601.27.0 — CSE fails with exit code 211 #7907

@Lyqed

Summary

The oras binary is missing from the AzureLinux V3 VHD image 202601.27.0, causing the Custom Script Extension (CSE) bootstrap to fail with exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) on Karpenter-provisioned AKS nodes. Affected nodes never join the cluster.

The previous production image 202601.13.0 works correctly.

Environment

| Field | Value |
| --- | --- |
| OS | AzureLinux V3 |
| Failing image version | 202601.27.0 |
| Last known-good image | 202601.13.0 |
| Node provisioner | Karpenter (karpenter-provider-azure) |
| Failure symptom | Node never joins cluster; CSE exits 211 |

Root Cause

oras is called early in the CSE bootstrap flow via oras_login_with_kubelet_identity and related functions in cse_helpers.sh. If the oras binary is absent from the image (not installed during VHD build, or accidentally omitted), every call to oras fails, and the failure surfaces as exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT). That code is misleading: the real cause is a missing binary, not a network timeout.

The true failure mode (oras not found in $PATH) is never logged, so it is very hard to diagnose from CSE logs alone.

Impact

  • Blast radius: All AzureLinux V3 nodes provisioned with image 202601.27.0 via Karpenter fail to join the cluster.
  • User impact: NodeClaims remain in a pending/disrupted state indefinitely; workloads cannot schedule.
  • Diagnosability: Exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) points to a network issue, causing engineers to investigate firewalls and IMDS reachability before discovering the binary is simply missing.

Steps to Reproduce

  1. Provision an AKS cluster with Karpenter enabled.
  2. Create a NodePool that targets AzureLinux V3 nodes.
  3. Schedule a workload that triggers node provisioning using image 202601.27.0.
  4. Observe that the node never reaches Ready state and CSE exits with code 211.

Expected vs Actual Behavior

| Scenario | Expected | Actual |
| --- | --- | --- |
| oras present | CSE completes, node joins cluster | |
| oras missing | CSE exits with ERR_ORAS_BINARY_NOT_FOUND and logs clear diagnostic info | CSE exits with ERR_ORAS_PULL_NETWORK_TIMEOUT (211), no indication binary is missing |

Fix

This PR adds a pre-flight check at the top of oras_login_with_kubelet_identity (in cse_helpers.sh) that:

  1. Calls command -v oras to verify the binary is in $PATH.
  2. If missing: logs $PATH, probes known install locations (/usr/local/bin/oras, /usr/bin/oras, /opt/bin/oras), dumps /etc/os-release, and queries rpm or dpkg for any installed oras packages.
  3. Returns a new, unambiguous error code ERR_ORAS_BINARY_NOT_FOUND=232 so operators immediately understand what happened.
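
The three steps above could look roughly like the following shell sketch. It is not the code from the PR: the function name check_oras_binary and the log wording are hypothetical, and only the probed paths, the rpm/dpkg queries, and the value 232 come from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pre-flight check described above.
# check_oras_binary is an illustrative name, not an existing helper.

ERR_ORAS_BINARY_NOT_FOUND=232  # value taken from this issue

check_oras_binary() {
    # Step 1: is oras resolvable through $PATH?
    if command -v oras >/dev/null 2>&1; then
        return 0
    fi

    # Step 2: collect diagnostics on stderr.
    echo "oras not found in PATH: ${PATH}" >&2

    # Probe the install locations listed in this issue.
    local candidate
    for candidate in /usr/local/bin/oras /usr/bin/oras /opt/bin/oras; do
        if [ -x "${candidate}" ]; then
            echo "found executable outside PATH: ${candidate}" >&2
        else
            echo "not present: ${candidate}" >&2
        fi
    done

    # Record which OS image the node is running.
    if [ -r /etc/os-release ]; then
        cat /etc/os-release >&2
    fi

    # Ask whichever package manager exists about installed oras packages.
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa 2>/dev/null | grep -i oras >&2 || echo "rpm: no oras package installed" >&2
    elif command -v dpkg >/dev/null 2>&1; then
        dpkg -l 2>/dev/null | grep -i oras >&2 || echo "dpkg: no oras package installed" >&2
    fi

    # Step 3: return the dedicated error code.
    return "${ERR_ORAS_BINARY_NOT_FOUND}"
}
```

A caller such as oras_login_with_kubelet_identity could run `check_oras_binary || return $?` as its first statement, so that 232 propagates unchanged instead of being folded into another code.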

A separate investigation is needed to determine why oras was not included in the 202601.27.0 VHD build. That is a VHD pipeline issue tracked separately.

Related

  • Fix PR: (this PR)
  • karpenter-provider-azure tracking issue: (filed separately — surface CSE exit code in NodeClaim conditions)

Additional Context

Exit code 211 is defined as ERR_ORAS_PULL_NETWORK_TIMEOUT. It is reached in retrycmd_get_refresh_token_for_oras when the ACR token exchange fails. Without oras in $PATH, however, retrycmd_can_oras_ls_acr_anonymously fails immediately, and the error propagates upward as a generic network timeout rather than a missing-binary error.
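
To illustrate how this kind of masking can happen in general, here is a minimal sketch. retry_demo and login_demo are hypothetical stand-ins, not the real helpers in cse_helpers.sh, and no_such_oras_binary is a deliberately nonexistent command that plays the role of the missing oras. The only values taken from this issue are the name and number of the 211 error code.

```shell
#!/usr/bin/env bash
# Illustrative only: these functions are assumptions made for the demo,
# not the actual implementation in cse_helpers.sh.

ERR_ORAS_PULL_NETWORK_TIMEOUT=211  # value taken from this issue

# Retry a command a fixed number of times and report only success or failure.
# The command's own exit status is discarded.
retry_demo() {
    local attempts="$1" i=0
    shift
    while [ "${i}" -lt "${attempts}" ]; do
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
    done
    return 1
}

# Every failure, including "command not found" (127), is reported as 211.
login_demo() {
    retry_demo 3 "$@" || return "${ERR_ORAS_PULL_NETWORK_TIMEOUT}"
}

# Calling the missing command directly: the shell reports 127.
direct_status=0
no_such_oras_binary version 2>/dev/null || direct_status=$?
echo "direct call exit status: ${direct_status}"    # 127

# Calling it through the wrapper: the 127 is replaced by 211.
wrapped_status=0
login_demo no_such_oras_binary version || wrapped_status=$?
echo "wrapped call exit status: ${wrapped_status}"  # 211
```

Because the wrapper keeps only a pass/fail result, the caller has no way to tell a missing binary from a genuine timeout, which is why a check that runs before any retry logic is the more reliable place to detect it.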
