
Add ACNS observability test with Cilium dataplane #1053

Open
carlotaarvela wants to merge 7 commits into main from acns-scale-test


@carlotaarvela commented Feb 11, 2026

Summary

This PR adds a Container Network Logs (CNL) performance evaluation scenario for AKS + Azure CNI Overlay + Cilium dataplane. It includes the ClusterLoader2 (CL2) workload, Prometheus monitors, and measurement queries needed to evaluate CNL observability overhead at scale.

The scenario is designed to validate Container Network Logs performance by:

  • Provisioning a 1000-node AKS cluster with Kubernetes 1.33+ (required for CNL)
  • Deploying Fortio server/client pairs to generate sustained in-cluster traffic
  • Creating ContainerNetworkLog CRDs to enable flow log collection
  • Measuring the overhead of CNL components (AMA-logs, Cilium/Hubble) on CPU, memory, and disk I/O

What's measured

| Component | Metrics |
| --- | --- |
| AMA-Logs agent | CPU/memory usage, FS write throughput/latency, container restarts, network flow input records/sec |
| Cilium agent/operator | CPU/memory usage, FS write throughput/latency, container restarts |
| Retina (optional) | CPU/memory usage, FS write behavior |
| Control plane | API server latency, Pod startup latency |
| Node disk | Write throughput/latency via node-exporter |
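As a rough illustration of the per-component measurements listed above, the sketch below builds the kind of PromQL strings such queries typically use. It is not taken from the measurement modules in this PR: the function name, the pod-name patterns, the namespace, and the rate window are assumptions, while the metric names are the standard cAdvisor and kube-state-metrics ones.

```python
def build_component_queries(pod_regex, namespace="kube-system", window="1m"):
    """Return illustrative PromQL queries for one component.

    pod_regex, namespace, and window are assumed values for illustration,
    not settings read from the CL2 configuration in this PR.
    """
    selector = f'namespace="{namespace}", pod=~"{pod_regex}", container!=""'
    return {
        # CPU cores consumed per pod (cAdvisor counter, rated over the window)
        "cpu_cores": f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[{window}])) by (pod)",
        # Working-set memory per pod (cAdvisor gauge)
        "memory_bytes": f"sum(container_memory_working_set_bytes{{{selector}}}) by (pod)",
        # Filesystem write throughput per pod (cAdvisor counter)
        "fs_write_bytes_per_sec": f"sum(rate(container_fs_writes_bytes_total{{{selector}}}[{window}])) by (pod)",
        # Container restarts over the window (kube-state-metrics counter)
        "restarts": f"sum(increase(kube_pod_container_status_restarts_total{{{selector}}}[{window}])) by (pod)",
    }


if __name__ == "__main__":
    # "ama-logs.*" is an assumed pod-name pattern for the AMA-Logs agent.
    for name, query in build_component_queries("ama-logs.*").items():
        print(f"{name}: {query}")
```

The same helper could be called with a Cilium pod pattern to produce the corresponding Cilium agent queries.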

What's included

ClusterLoader2 workload

  • modules/python/clusterloader2/scale/ - CL2 configuration and Python harness
  • Fortio traffic generation (configurable deployments, replicas, QPS)
  • ContainerNetworkLog CRD deployment with filters for traffic namespaces
  • NetworkPolicy object scale (for API/controller load testing)

Pipeline & scenario

  • pipelines/perf-eval/CNI Benchmark/cnl-observability.yml - Pipeline with parameters for traffic scale and CNL options
  • scenarios/perf-eval/cnl-azurecni-overlay-cilium/ - Terraform inputs for AKS with:
    • Kubernetes 1.33 (required for CNL)
    • Azure CNI overlay + Cilium dataplane
    • ACNS enabled (--enable-acns)
    • 1000-node traffic pool + dedicated Prometheus pool

Monitoring

  • Node-exporter DaemonSet with ServiceMonitor
  • AMA-logs PodMonitor
  • Hubble metrics PodMonitor (scrapes cilium-agent port 9965)
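For readers unfamiliar with PodMonitor objects, here is a minimal sketch of what a Hubble metrics PodMonitor might look like, expressed as a Python dict and printed as JSON (which is also valid YAML). It is not the manifest shipped in this PR: the object name, the `k8s-app: cilium` label, the `hubble-metrics` port name (assumed to map to port 9965), and the scrape interval are all assumptions.

```python
import json


def hubble_pod_monitor(name="hubble-metrics", namespace="kube-system",
                       port_name="hubble-metrics", interval="30s"):
    """Build an illustrative PodMonitor manifest for Hubble metrics.

    All default values are assumptions; port_name must match the named
    container port that exposes 9965 on the cilium-agent pods.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select the cilium-agent pods (label is an assumption)
            "selector": {"matchLabels": {"k8s-app": "cilium"}},
            "namespaceSelector": {"matchNames": [namespace]},
            # PodMonitor endpoints reference the container port by name
            "podMetricsEndpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(hubble_pod_monitor(), indent=2))
```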

Prerequisites

Per CNL documentation:

  • Azure CLI 2.75.0+ with aks-preview extension 19.0.07+
  • AdvancedNetworkingFlowLogsPreview feature flag registered
  • Kubernetes 1.33+
  • Cilium dataplane (for stored logs mode)

Notes

  • The NetworkPolicy template uses dummy selectors for API/etcd object scale testing only—it does not affect dataplane traffic
  • Node-exporter requires hostNetwork/hostPID for host-level metric collection

@carlotaarvela force-pushed the acns-scale-test branch 7 times, most recently from 6eccab9 to ac56435, on February 11, 2026 18:15
@carlotaarvela marked this pull request as ready for review on February 11, 2026 18:16
Copilot AI review requested due to automatic review settings February 11, 2026 18:16
Contributor

Copilot AI left a comment

Pull request overview

Adds a new observability perf-eval topology and a ClusterLoader2 scale workload aimed at measuring Container Network Logs (CNL/ACNS) overhead on AKS + Azure CNI Overlay + Cilium, including Prometheus scraping objects and result collection into JSONL.

Changes:

  • Introduces a new observability topology (validate/execute/collect) that runs a new CL2 “scale” engine workflow.
  • Adds a new CL2 “scale” harness (scale.py) plus CL2 configs/modules for Fortio traffic, CNL CRD creation, and Prometheus queries (AMA Logs / Cilium / node disk / control plane / Retina optional).
  • Adds a new Terraform scenario input set for a 1000-node AKS cluster with Azure CNI overlay + Cilium + ACNS enabled.

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| steps/topology/observability/validate-resources.yml | New topology validation step (currently only kubeconfig). |
| steps/topology/observability/execute-clusterloader2.yml | Wires observability topology to the CL2 scale engine execute template. |
| steps/topology/observability/collect-clusterloader2.yml | Wires observability topology to the CL2 scale engine collect template. |
| steps/engine/clusterloader2/scale/execute.yml | New CL2 scale execution template (includes optional Retina install). |
| steps/engine/clusterloader2/scale/collect.yml | New CL2 scale collection template producing JSONL results. |
| scenarios/perf-eval/cnl-azurecni-overlay-cilium/terraform-test-inputs/azure.json | Scenario test inputs for Azure terraform validation. |
| scenarios/perf-eval/cnl-azurecni-overlay-cilium/terraform-inputs/azure.tfvars | AKS 1.33 + Azure CNI overlay + Cilium + ACNS + 1000-node pool scenario definition. |
| modules/python/clusterloader2/scale/scale.py | New Python harness to configure/execute/collect CL2 scale runs and emit JSONL. |
| modules/python/clusterloader2/scale/config/modules/test-steps.yaml | CL2 workload steps (Fortio deploys, optional label churn, netpol scale, sleep window, cleanup). |
| modules/python/clusterloader2/scale/config/modules/scale-test.yaml | Orchestrates measurement modules + test steps (start/gather). |
| modules/python/clusterloader2/scale/config/modules/pfl/retinanetworkflowlog.yaml | Template for creating ContainerNetworkLog CRs (for flow logging). |
| modules/python/clusterloader2/scale/config/modules/node-exporter/servicemonitor.yaml | Prometheus ServiceMonitor for node-exporter. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/serviceaccount.yaml | ServiceAccount for node-exporter components. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/service.yaml | Service for node-exporter/kube-rbac-proxy endpoint. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/networkpolicy.yaml | NetworkPolicy allowing Prometheus to scrape node-exporter. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/daemonset.yaml | node-exporter + kube-rbac-proxy DaemonSet (hostNetwork/hostPID). |
| modules/python/clusterloader2/scale/config/modules/node-exporter/clusterrolebinding.yaml | RBAC binding for kube-rbac-proxy authn/authz checks. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/clusterrole.yaml | ClusterRole for token/access reviews (kube-rbac-proxy). |
| modules/python/clusterloader2/scale/config/modules/node-exporter.yaml | CL2 module wrapper to create/delete node-exporter stack. |
| modules/python/clusterloader2/scale/config/modules/networkpolicy-template.yaml | Dummy NetworkPolicy template for API/etcd object scale testing. |
| modules/python/clusterloader2/scale/config/modules/measurements/retina.yaml | Prometheus queries for Retina resource + FS metrics. |
| modules/python/clusterloader2/scale/config/modules/measurements/node-disk.yaml | Prometheus queries for node disk write throughput/latency. |
| modules/python/clusterloader2/scale/config/modules/measurements/control-plane.yaml | Control plane queries (API responsiveness, pod startup latency, apiserver CPU/mem). |
| modules/python/clusterloader2/scale/config/modules/measurements/cilium.yaml | Prometheus queries for Cilium agent/operator CPU/mem/FS/restarts. |
| modules/python/clusterloader2/scale/config/modules/measurements/ama-logs.yaml | Prometheus queries for AMA Logs CPU/mem/FS/restarts + flow pipeline rates. |
| modules/python/clusterloader2/scale/config/modules/hubble/podmonitor.yaml | PodMonitor for scraping Hubble metrics via cilium-agent port 9965. |
| modules/python/clusterloader2/scale/config/modules/hubble.yaml | CL2 module wrapper to create/delete the Hubble PodMonitor. |
| modules/python/clusterloader2/scale/config/modules/fortio/service.yaml | Fortio service template used by the workload. |
| modules/python/clusterloader2/scale/config/modules/fortio/server-deployment.yaml | Fortio server deployment template (nodeSelector scale-test=true). |
| modules/python/clusterloader2/scale/config/modules/fortio/client-deployment.yaml | Fortio client deployment template (sustained load). |
| modules/python/clusterloader2/scale/config/modules/ama-logs/podmonitor.yaml | PodMonitor for scraping AMA Logs metrics in kube-system. |
| modules/python/clusterloader2/scale/config/modules/ama-logs.yaml | CL2 module wrapper to create/delete the AMA Logs PodMonitor. |
| modules/python/clusterloader2/scale/config/config.yaml | Top-level CL2 config wiring node-exporter/AMA/hubble + scale-test module. |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 9 comments.


@carlotaarvela force-pushed the acns-scale-test branch 2 times, most recently from 119a9d7 to 85c2f34, on February 12, 2026 15:20
    engine_input: ${{ parameters.engine_input }}
    region: ${{ parameters.regions[0] }}
- script: |
    if [ -n "$RUN_ID" ]; then
Contributor

Use the set-run-id template to avoid repetition: https://github.com/Azure/telescope/blob/main/steps/set-run-id.yml


Author

Done

if file_name.startswith(PROM_QUERY_PREFIX):
    group_name = file_name.split("_")[1]
    measurement_name = file_name.split("_")[0][len(PROM_QUERY_PREFIX)+1:]
    parts = file_name.split("_")
Contributor

This change might break other pipelines, since not all measurement file names have the same number of segments as the ones in this test.

Why doesn't the original approach work for your current test?

Author

The original approach assumed a fixed format like PromQuery__.json, but our new test files use a different naming convention with additional segments.

I've added defensive parsing that handles both formats: we check the number of segments before parsing and fall back to the original logic for files that match the existing pattern. This way we won't break other pipelines.
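A minimal sketch of the defensive parsing described in this reply, not the actual change in scale.py. The value of PROM_QUERY_PREFIX, the function name, the segment-count threshold, and the rule for joining extra segments into the group name are all assumptions made for illustration.

```python
# Assumed prefix value; the real constant lives in the existing harness.
PROM_QUERY_PREFIX = "GenericPrometheusQuery"


def parse_measurement_file_name(file_name):
    """Return (measurement_name, group_name), or None if the name does not match.

    Hypothetical helper: legacy names are assumed to have at most three
    "_"-separated segments, new names are assumed to have more.
    """
    if not file_name.startswith(PROM_QUERY_PREFIX):
        return None
    parts = file_name.split("_")
    if len(parts) < 2:
        # Too few segments to contain a group name; skip instead of raising.
        return None
    # Original logic: measurement name follows the prefix and one separator.
    measurement_name = parts[0][len(PROM_QUERY_PREFIX) + 1:]
    if len(parts) <= 3:
        # Existing pattern: fall back to the original indexing.
        group_name = parts[1]
    else:
        # Assumed new pattern: extra middle segments belong to the group name,
        # and the last segment is a trailing suffix such as a timestamp.
        group_name = "_".join(parts[1:-1])
    return measurement_name, group_name


if __name__ == "__main__":
    # File names below are invented examples, not real CL2 output.
    print(parse_measurement_file_name("GenericPrometheusQuery CiliumCPU_scale_2026-02-11.json"))
    print(parse_measurement_file_name("GenericPrometheusQuery AmaLogsCPU_scale_cnl_2026-02-11.json"))
```

Branching on the segment count keeps the legacy path byte-for-byte identical to the old indexing, which is what protects the other pipelines.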

},
{
  name = "traffic"
  node_count = 1000
Contributor

Is this in the telescope subscription? Do we have enough quota for this test?

set -eo pipefail
echo "Checking AdvancedNetworkingFlowLogsPreview feature flag status..."

feature_state=$(az feature show --namespace "Microsoft.ContainerService" --name "AdvancedNetworkingFlowLogsPreview" --query "properties.state" -o tsv 2>/dev/null || echo "NotRegistered")
Contributor

This is not required. We just need to register this feature on the subscription you want to test and set that subscription in the pipeline.

Not all tests need this feature registered, and this change will affect other pipelines' tests. I suggest just enabling it in your subscription.
