
[Bug]: Azure Container Instance pipeline agents lose AZP_TOKEN secureValue after Azure rehosting, causing CrashLoopBackOff #506

@cpinotossi

Description


Repository: Azure/alz-terraform-accelerator

Describe the bug

Azure Container Instance (ACI) self-hosted Azure DevOps pipeline agents deployed via the ALZ Terraform Accelerator bootstrap can unexpectedly lose their AZP_TOKEN secure environment variable, causing the containers to enter CrashLoopBackOff state and pipeline jobs to remain stuck in queue.

Versions

  • ALZ Terraform Accelerator version: v4.8.0
  • AzureRM Provider version: (from bootstrap)
  • Terraform version: (from bootstrap)
  • Azure region: Germany West Central

Steps to reproduce

  1. Deploy ALZ Terraform Accelerator bootstrap with Azure DevOps and ACI agents
  2. Verify agents are working (containers running, "Listening for Jobs")
  3. Wait for an unknown period (weeks/months)
  4. Observe containers entering CrashLoopBackOff with error: 1. AZP_TOKEN must be set

Investigation Findings

Symptoms

  • Container logs show: 1. AZP_TOKEN must be set
  • Azure DevOps shows: Job pending. Waiting at position 1 in queue.
  • Container is in CrashLoopBackOff with high restart count

Root Cause Analysis

When querying the container environment variables via Azure CLI:

{
  "name": "AZP_TOKEN",
  "secureValue": null,
  "value": null
}

Both secureValue AND value are null. This is NOT the normal case where secureValue is merely hidden from API responses; combined with the agent's startup error, it shows the token is genuinely missing.
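The check can be reproduced locally against the JSON above (no Azure access needed). The container and resource group names are placeholders, and the JMESPath query is a plausible way to have produced that JSON, not necessarily the exact command the author ran:

```shell
# The JSON above is the shape returned by a query such as:
#   az container show --name <container-name> --resource-group <rg> \
#     --query "containers[0].environmentVariables[?name=='AZP_TOKEN'] | [0]"
response='{"name": "AZP_TOKEN", "secureValue": null, "value": null}'

# Both fields are null; count the null fields in the captured response.
printf '%s\n' "$response" | grep -o 'null' | wc -l
```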

Why Terraform Doesn't Detect This

The Azure API never returns secure values in responses, so the Terraform provider cannot detect that a secure_environment_variable has been cleared on the Azure side:

  1. Terraform stores the value in state (encrypted)
  2. Azure returns null for secure values (always, by design)
  3. Terraform sees null and assumes it matches (can't compare)
  4. No drift is detected, terraform plan shows no changes
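The no-drift behavior follows directly from the resource shape. Below is a minimal sketch of the relevant part of a container group resource as the azurerm provider defines it; the names and values are hypothetical, not the accelerator's actual module code:

```hcl
resource "azurerm_container_group" "agent" {
  # ...name, resource_group_name, location, os_type, etc.

  container {
    name   = "azdo-agent"
    image  = "<image>"
    cpu    = 1
    memory = 2

    # Stored (encrypted) in Terraform state, but the Azure API never echoes
    # secure values back, so `terraform plan` effectively compares null
    # against null and reports no drift even when the token is gone.
    secure_environment_variables = {
      AZP_TOKEN = var.azdo_pat
    }
  }
}
```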

Suspected Root Cause: Azure Host Rehosting

We strongly suspect this issue is caused by Azure rehosting the container to a different underlying host. Microsoft documentation confirms this can happen:

"customers may experience restarts initiated by the ACI infrastructure due to maintenance events"
"Although rare, there are some Azure-internal events that can cause redeployment to a different host."

When Azure moves a container group to a new host (due to maintenance, hardware failure, or capacity balancing), the secure environment variables may not be properly preserved during the migration.

Why This Is Difficult to Replicate

This issue is extremely difficult to reproduce because:

  1. Rehosting is an Azure-internal operation - Users cannot trigger it manually
  2. It happens rarely and unpredictably - Could take weeks or months
  3. No visibility - There's no Azure API to check which host a container is running on
  4. Activity logs expire - 90-day retention means evidence is lost before discovery
  5. Normal restarts work fine - Only rehosting to a different host causes the issue

We explicitly tested normal restarts (az container restart) and confirmed the PAT token was preserved. The issue only manifests when Azure moves the container to a different host.

Verified: Normal Restarts Preserve PAT

We tested this explicitly:

az container restart --name <container> --resource-group <rg>

After restart, the container successfully reconnected with the PAT intact. Normal restarts do NOT cause this issue.

Suggested Mitigations

Option 1: Document the Limitation

Add documentation warning users that:

  • Terraform cannot detect secure environment variable drift
  • Users should monitor for CrashLoopBackOff
  • Re-running terraform apply with the PAT variable will NOT fix the issue (needs explicit recreation)

Option 2: Use Azure Key Vault

Modify the bootstrap to:

  1. Store the PAT in Azure Key Vault
  2. Have the container retrieve the PAT at startup via managed identity
  3. This way, even if the container is recreated, it can always fetch the current secret
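A sketch of what that could look like in the bootstrap, with hypothetical resource and secret names; the entrypoint commands in the comment are an assumed pattern, not the accelerator's current startup script:

```hcl
# Keep the PAT in Key Vault instead of baking it into the container group.
resource "azurerm_key_vault_secret" "azdo_pat" {
  name         = "azdo-agent-pat"
  value        = var.azdo_pat
  key_vault_id = azurerm_key_vault.agents.id
}

resource "azurerm_container_group" "agent" {
  # ...

  identity {
    type = "SystemAssigned"
  }

  # The container entrypoint would then run something like:
  #   az login --identity
  #   export AZP_TOKEN=$(az keyvault secret show \
  #     --vault-name <vault-name> --name azdo-agent-pat --query value -o tsv)
  # so a rehosted or recreated container always fetches the current secret.
}
```

The managed identity also needs a Key Vault access policy or RBAC role (e.g. "Key Vault Secrets User") granting it read access to the secret.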

Option 3: Add Monitoring/Alerting

Include Azure Monitor alerts for:

  • Container restart count > threshold
  • Container state = "Waiting" or "CrashLoopBackOff"
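The restart-count check is straightforward to script. In production the count would come from az container show (the query path containers[0].instanceView.restartCount); the function below isolates the threshold logic so it can run anywhere, and the names and threshold are illustrative:

```shell
# check_restarts: emit an alert line when the restart count exceeds a threshold.
# In production, the first argument would come from:
#   az container show --name <container> --resource-group <rg> \
#     --query "containers[0].instanceView.restartCount" -o tsv
check_restarts() {
  restarts=$1
  threshold=$2
  if [ "$restarts" -gt "$threshold" ]; then
    echo "ALERT: container restarted $restarts times (threshold $threshold)"
  else
    echo "OK: $restarts restarts"
  fi
}

check_restarts 12 5
```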

Option 4: Lifecycle Ignore + External Management

Use lifecycle { ignore_changes = [containers[0].secure_environment_variables] } and manage the secret externally.
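In context, that lifecycle block would sit on the container group resource roughly as follows. This is a sketch; the exact attribute path for ignore_changes depends on the provider schema version (the azurerm provider names the nested block container, singular):

```hcl
resource "azurerm_container_group" "agent" {
  # ...

  lifecycle {
    # Stop Terraform from ever reconciling the secret; ownership moves to an
    # external process (e.g. a rotation pipeline) that can also restore the
    # token after a rehosting event wipes it.
    ignore_changes = [container[0].secure_environment_variables]
  }
}
```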

Workaround

To fix affected containers, you must delete and recreate them with the PAT:

# Delete existing containers
az container delete --name <container-name> --resource-group <rg> --yes

# Recreate with PAT
az container create --name <container-name> \
  --resource-group <rg> \
  --image <image> \
  --secure-environment-variables AZP_TOKEN=<pat> \
  --environment-variables AZP_URL=<url> AZP_POOL=<pool> AZP_AGENT_NAME=<name> \
  # ... other parameters

Or use a Bicep/ARM template that explicitly sets the secureValue.

Verification Command

To check if the PAT is working (from inside the container):

az container exec --name <container> --resource-group <rg> --exec-command "printenv AZP_TOKEN"

If this returns empty or fails, the PAT is missing.

Additional Context

  • The containers were originally deployed in August 2025
  • The issue was discovered in December 2025 (4+ months later)
  • Activity logs only retain 90 days, so we cannot see what Azure operations occurred
  • Both containers in different availability zones were affected simultaneously


Labels

  • Status: Long Term ⌛ (We will do it, but will take a longer amount of time due to complexity/priorities)
  • Type: Bug 🪲 (Something isn't working)