
[Bug]: Azure Container Instance pipeline agents lose AZP_TOKEN secureValue after Azure rehosting, causing CrashLoopBackOff #506

@cpinotossi

Description


Repository: Azure/alz-terraform-accelerator

Describe the bug

Azure Container Instance (ACI) self-hosted Azure DevOps pipeline agents deployed via the ALZ Terraform Accelerator bootstrap can unexpectedly lose their AZP_TOKEN secure environment variable, causing the containers to enter CrashLoopBackOff state and pipeline jobs to remain stuck in queue.

Versions

  • ALZ Terraform Accelerator version: v4.8.0
  • AzureRM Provider version: (from bootstrap)
  • Terraform version: (from bootstrap)
  • Azure region: Germany West Central

Steps to reproduce

  1. Deploy ALZ Terraform Accelerator bootstrap with Azure DevOps and ACI agents
  2. Verify agents are working (containers running, "Listening for Jobs")
  3. Wait for an unknown period (weeks/months)
  4. Observe containers entering CrashLoopBackOff with error: 1. AZP_TOKEN must be set

Investigation Findings

Symptoms

  • Container logs show: 1. AZP_TOKEN must be set
  • Azure DevOps shows: Job pending. Waiting at position 1 in queue.
  • Container is in CrashLoopBackOff with high restart count

Root Cause Analysis

When querying the container environment variables via Azure CLI:

{
  "name": "AZP_TOKEN",
  "secureValue": null,
  "value": null
}

Both secureValue AND value are null. This is NOT the normal case where secureValue is merely hidden from API responses; combined with the agent's startup error, it shows the token is genuinely missing.
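The check can be reproduced locally against the JSON above (no Azure access needed). The container and resource group names are placeholders, and the JMESPath query is a plausible way to have produced that JSON, not necessarily the exact command the author ran:

```shell
# The JSON above is the shape returned by a query such as:
#   az container show --name <container-name> --resource-group <rg> \
#     --query "containers[0].environmentVariables[?name=='AZP_TOKEN'] | [0]"
response='{"name": "AZP_TOKEN", "secureValue": null, "value": null}'

# Both fields are null; count the null fields in the captured response.
printf '%s\n' "$response" | grep -o 'null' | wc -l
```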

Why Terraform Doesn't Detect This

The Azure API never returns secure values in responses, so the Terraform provider cannot detect that a secure_environment_variable has been cleared on the Azure side:

  1. Terraform stores the value in state (encrypted)
  2. Azure returns null for secure values (always, by design)
  3. Terraform sees null and assumes it matches (can't compare)
  4. No drift is detected, terraform plan shows no changes
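The no-drift behavior follows directly from the resource shape. Below is a minimal sketch of the relevant part of a container group resource as the azurerm provider defines it; the names and values are hypothetical, not the accelerator's actual module code:

```hcl
resource "azurerm_container_group" "agent" {
  # ...name, resource_group_name, location, os_type, etc.

  container {
    name   = "azdo-agent"
    image  = "<image>"
    cpu    = 1
    memory = 2

    # Stored (encrypted) in Terraform state, but the Azure API never echoes
    # secure values back, so `terraform plan` effectively compares null
    # against null and reports no drift even when the token is gone.
    secure_environment_variables = {
      AZP_TOKEN = var.azdo_pat
    }
  }
}
```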

Suspected Root Cause: Azure Host Rehosting

We strongly suspect this issue is caused by Azure rehosting the container to a different underlying host. Microsoft documentation confirms this can happen:

"customers may experience restarts initiated by the ACI infrastructure due to maintenance events"
"Although rare, there are some Azure-internal events that can cause redeployment to a different host."

When Azure moves a container group to a new host (due to maintenance, hardware failure, or capacity balancing), the secure environment variables may not be properly preserved during the migration.

Why This Is Difficult to Replicate

This issue is extremely difficult to reproduce because:

  1. Rehosting is an Azure-internal operation - Users cannot trigger it manually
  2. It happens rarely and unpredictably - Could take weeks or months
  3. No visibility - There's no Azure API to check which host a container is running on
  4. Activity logs expire - 90-day retention means evidence is lost before discovery
  5. Normal restarts work fine - Only rehosting to a different host causes the issue

We explicitly tested normal restarts (az container restart) and confirmed the PAT token was preserved. The issue only manifests when Azure moves the container to a different host.

Verified: Normal Restarts Preserve PAT

We tested this explicitly:

az container restart --name <container> --resource-group <rg>

After restart, the container successfully reconnected with the PAT intact. Normal restarts do NOT cause this issue.

Suggested Mitigations

Option 1: Document the Limitation

Add documentation warning users that:

  • Terraform cannot detect secure environment variable drift
  • Users should monitor for CrashLoopBackOff
  • Re-running terraform apply with the PAT variable will NOT fix the issue (needs explicit recreation)

Option 2: Use Azure Key Vault

Modify the bootstrap to:

  1. Store the PAT in Azure Key Vault
  2. Have the container retrieve the PAT at startup via managed identity
  3. This way, even if the container is recreated, it can always fetch the current secret
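A sketch of what that could look like in the bootstrap, with hypothetical resource and secret names; the entrypoint commands in the comment are an assumed pattern, not the accelerator's current startup script:

```hcl
# Keep the PAT in Key Vault instead of baking it into the container group.
resource "azurerm_key_vault_secret" "azdo_pat" {
  name         = "azdo-agent-pat"
  value        = var.azdo_pat
  key_vault_id = azurerm_key_vault.agents.id
}

resource "azurerm_container_group" "agent" {
  # ...

  identity {
    type = "SystemAssigned"
  }

  # The container entrypoint would then run something like:
  #   az login --identity
  #   export AZP_TOKEN=$(az keyvault secret show \
  #     --vault-name <vault-name> --name azdo-agent-pat --query value -o tsv)
  # so a rehosted or recreated container always fetches the current secret.
}
```

The managed identity also needs a Key Vault access policy or RBAC role (e.g. "Key Vault Secrets User") granting it read access to the secret.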

Option 3: Add Monitoring/Alerting

Include Azure Monitor alerts for:

  • Container restart count > threshold
  • Container state = "Waiting" or "CrashLoopBackOff"
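The restart-count check is straightforward to script. In production the count would come from az container show (the query path containers[0].instanceView.restartCount); the function below isolates the threshold logic so it can run anywhere, and the names and threshold are illustrative:

```shell
# check_restarts: emit an alert line when the restart count exceeds a threshold.
# In production, the first argument would come from:
#   az container show --name <container> --resource-group <rg> \
#     --query "containers[0].instanceView.restartCount" -o tsv
check_restarts() {
  restarts=$1
  threshold=$2
  if [ "$restarts" -gt "$threshold" ]; then
    echo "ALERT: container restarted $restarts times (threshold $threshold)"
  else
    echo "OK: $restarts restarts"
  fi
}

check_restarts 12 5
```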

Option 4: Lifecycle Ignore + External Management

Use lifecycle { ignore_changes = [containers[0].secure_environment_variables] } and manage the secret externally.
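In context, that lifecycle block would sit on the container group resource roughly as follows. This is a sketch; the exact attribute path for ignore_changes depends on the provider schema version (the azurerm provider names the nested block container, singular):

```hcl
resource "azurerm_container_group" "agent" {
  # ...

  lifecycle {
    # Stop Terraform from ever reconciling the secret; ownership moves to an
    # external process (e.g. a rotation pipeline) that can also restore the
    # token after a rehosting event wipes it.
    ignore_changes = [container[0].secure_environment_variables]
  }
}
```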

Workaround

To fix affected containers, you must delete and recreate them with the PAT:

# Delete existing containers
az container delete --name <container-name> --resource-group <rg> --yes

# Recreate with PAT
az container create --name <container-name> \
  --resource-group <rg> \
  --image <image> \
  --secure-environment-variables AZP_TOKEN=<pat> \
  --environment-variables AZP_URL=<url> AZP_POOL=<pool> AZP_AGENT_NAME=<name> \
  # ... other parameters

Or use a Bicep/ARM template that explicitly sets the secureValue.

Verification Command

To check if the PAT is working (from inside the container):

az container exec --name <container> --resource-group <rg> --exec-command "printenv AZP_TOKEN"

If this returns empty or fails, the PAT is missing.

Additional Context

  • The containers were originally deployed in August 2025
  • The issue was discovered in December 2025 (4+ months later)
  • Activity logs only retain 90 days, so we cannot see what Azure operations occurred
  • Both containers in different availability zones were affected simultaneously


Labels

  • Status: Long Term ⌛ (We will do it, but will take a longer amount of time due to complexity/priorities)
  • Type: Bug 🪲 (Something isn't working)