Add ACNS observability test with Cilium dataplane #1053
carlotaarvela wants to merge 7 commits into main
Pull request overview
Adds a new observability perf-eval topology and a ClusterLoader2 scale workload aimed at measuring Container Network Logs (CNL/ACNS) overhead on AKS + Azure CNI Overlay + Cilium, including Prometheus scraping objects and result collection into JSONL.
Changes:
- Introduces a new `observability` topology (validate/execute/collect) that runs a new CL2 “scale” engine workflow.
- Adds a new CL2 “scale” harness (`scale.py`) plus CL2 configs/modules for Fortio traffic, CNL CRD creation, and Prometheus queries (AMA Logs / Cilium / node disk / control plane / Retina optional).
- Adds a new Terraform scenario input set for a 1000-node AKS cluster with Azure CNI overlay + Cilium + ACNS enabled.
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| steps/topology/observability/validate-resources.yml | New topology validation step (currently only kubeconfig). |
| steps/topology/observability/execute-clusterloader2.yml | Wires observability topology to the CL2 scale engine execute template. |
| steps/topology/observability/collect-clusterloader2.yml | Wires observability topology to the CL2 scale engine collect template. |
| steps/engine/clusterloader2/scale/execute.yml | New CL2 scale execution template (includes optional Retina install). |
| steps/engine/clusterloader2/scale/collect.yml | New CL2 scale collection template producing JSONL results. |
| scenarios/perf-eval/cnl-azurecni-overlay-cilium/terraform-test-inputs/azure.json | Scenario test inputs for Azure terraform validation. |
| scenarios/perf-eval/cnl-azurecni-overlay-cilium/terraform-inputs/azure.tfvars | AKS 1.33 + Azure CNI overlay + Cilium + ACNS + 1000-node pool scenario definition. |
| modules/python/clusterloader2/scale/scale.py | New Python harness to configure/execute/collect CL2 scale runs and emit JSONL. |
| modules/python/clusterloader2/scale/config/modules/test-steps.yaml | CL2 workload steps (Fortio deploys, optional label churn, netpol scale, sleep window, cleanup). |
| modules/python/clusterloader2/scale/config/modules/scale-test.yaml | Orchestrates measurement modules + test steps (start/gather). |
| modules/python/clusterloader2/scale/config/modules/pfl/retinanetworkflowlog.yaml | Template for creating ContainerNetworkLog CRs (for flow logging). |
| modules/python/clusterloader2/scale/config/modules/node-exporter/servicemonitor.yaml | Prometheus ServiceMonitor for node-exporter. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/serviceaccount.yaml | ServiceAccount for node-exporter components. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/service.yaml | Service for node-exporter/kube-rbac-proxy endpoint. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/networkpolicy.yaml | NetworkPolicy allowing Prometheus to scrape node-exporter. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/daemonset.yaml | node-exporter + kube-rbac-proxy DaemonSet (hostNetwork/hostPID). |
| modules/python/clusterloader2/scale/config/modules/node-exporter/clusterrolebinding.yaml | RBAC binding for kube-rbac-proxy authn/authz checks. |
| modules/python/clusterloader2/scale/config/modules/node-exporter/clusterrole.yaml | ClusterRole for token/access reviews (kube-rbac-proxy). |
| modules/python/clusterloader2/scale/config/modules/node-exporter.yaml | CL2 module wrapper to create/delete node-exporter stack. |
| modules/python/clusterloader2/scale/config/modules/networkpolicy-template.yaml | Dummy NetworkPolicy template for API/etcd object scale testing. |
| modules/python/clusterloader2/scale/config/modules/measurements/retina.yaml | Prometheus queries for Retina resource + FS metrics. |
| modules/python/clusterloader2/scale/config/modules/measurements/node-disk.yaml | Prometheus queries for node disk write throughput/latency. |
| modules/python/clusterloader2/scale/config/modules/measurements/control-plane.yaml | Control plane queries (API responsiveness, pod startup latency, apiserver CPU/mem). |
| modules/python/clusterloader2/scale/config/modules/measurements/cilium.yaml | Prometheus queries for Cilium agent/operator CPU/mem/FS/restarts. |
| modules/python/clusterloader2/scale/config/modules/measurements/ama-logs.yaml | Prometheus queries for AMA Logs CPU/mem/FS/restarts + flow pipeline rates. |
| modules/python/clusterloader2/scale/config/modules/hubble/podmonitor.yaml | PodMonitor for scraping Hubble metrics via cilium-agent port 9965. |
| modules/python/clusterloader2/scale/config/modules/hubble.yaml | CL2 module wrapper to create/delete the Hubble PodMonitor. |
| modules/python/clusterloader2/scale/config/modules/fortio/service.yaml | Fortio service template used by the workload. |
| modules/python/clusterloader2/scale/config/modules/fortio/server-deployment.yaml | Fortio server deployment template (nodeSelector scale-test=true). |
| modules/python/clusterloader2/scale/config/modules/fortio/client-deployment.yaml | Fortio client deployment template (sustained load). |
| modules/python/clusterloader2/scale/config/modules/ama-logs/podmonitor.yaml | PodMonitor for scraping AMA Logs metrics in kube-system. |
| modules/python/clusterloader2/scale/config/modules/ama-logs.yaml | CL2 module wrapper to create/delete the AMA Logs PodMonitor. |
| modules/python/clusterloader2/scale/config/config.yaml | Top-level CL2 config wiring node-exporter/AMA/hubble + scale-test module. |
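
The collect template and `scale.py` are described above as emitting JSONL results. As a rough illustration of that output pattern only, here is a minimal sketch that writes one JSON object per line. The function name `to_jsonl`, the envelope fields (`timestamp`, `run_id`), and the sample records are all hypothetical placeholders, not the schema or data used by the PR.

```python
import io
import json
from datetime import datetime, timezone


def to_jsonl(records, run_id, stream):
    """Write one JSON object per line, tagging each record with run metadata."""
    timestamp = datetime.now(timezone.utc).isoformat()
    count = 0
    for record in records:
        # Hypothetical envelope fields; the real schema in scale.py may differ.
        line = {"timestamp": timestamp, "run_id": run_id, **record}
        stream.write(json.dumps(line, sort_keys=True) + "\n")
        count += 1
    return count


# Placeholder records for illustration only; these are not measured values.
buffer = io.StringIO()
written = to_jsonl(
    [
        {"group": "cilium", "measurement": "agent_cpu", "result": {"Perc99": 0.0}},
        {"group": "ama-logs", "measurement": "memory", "result": {"Perc99": 0.0}},
    ],
    run_id="example-run",
    stream=buffer,
)
```

Keeping each record on its own line means a partially written file is still readable line by line, which is one common reason to prefer JSONL for pipeline results.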
    engine_input: ${{ parameters.engine_input }}
    region: ${{ parameters.regions[0] }}
    - script: |
        if [ -n "$RUN_ID" ]; then

Use the set-run-id template to avoid repetition: https://github.com/Azure/telescope/blob/main/steps/set-run-id.yml
    if file_name.startswith(PROM_QUERY_PREFIX):
        group_name = file_name.split("_")[1]
        measurement_name = file_name.split("_")[0][len(PROM_QUERY_PREFIX)+1:]
        parts = file_name.split("_")

This change might break other pipelines, since not all measurements have the same name length as the ones in this test.
Why doesn't the original approach work in your current test?

The original approach assumed a fixed format like PromQuery__.json, but our new test files use a different naming convention with additional segments.
I've added defensive parsing that handles both formats: we check the number of segments before parsing and fall back to the original logic for files that match the existing pattern. This way we won't break other pipelines.
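
A minimal sketch of the segment-count check described in the reply, not the code in this PR. The value of `PROM_QUERY_PREFIX`, the three-segment legacy layout, and the rule used for longer names (last two segments are group and timestamp) are all assumptions made for illustration; `parse_prom_query_file_name` is a hypothetical helper name.

```python
# Assumed value; the real constant lives in the existing harness.
PROM_QUERY_PREFIX = "GenericPrometheusQuery"
# Assumed legacy layout: "<prefix> <measurement>_<group>_<timestamp>.json".
LEGACY_SEGMENT_COUNT = 3


def parse_prom_query_file_name(file_name):
    """Return (measurement_name, group_name), or None for non-PromQuery files."""
    if not file_name.startswith(PROM_QUERY_PREFIX):
        return None
    parts = file_name.split("_")
    if len(parts) < 2:
        return None  # no group segment at all, nothing to parse
    if len(parts) <= LEGACY_SEGMENT_COUNT:
        # Original logic, unchanged, so existing pipelines keep their behavior.
        group_name = parts[1]
        measurement_name = parts[0][len(PROM_QUERY_PREFIX) + 1:]
    else:
        # Hypothetical extended rule: the last two segments are group and
        # timestamp; everything before them is the measurement name.
        group_name = parts[-2]
        measurement_name = "_".join(parts[:-2])[len(PROM_QUERY_PREFIX) + 1:]
    return measurement_name, group_name
```

Branching on the segment count keeps the legacy path byte-for-byte identical to the old behavior, so only file names with extra segments take the new path.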
    },
    {
      name = "traffic"
      node_count = 1000

Is this in the telescope subscription? Do we have enough quota for this test?
    set -eo pipefail
    echo "Checking AdvancedNetworkingFlowLogsPreview feature flag status..."

    feature_state=$(az feature show --namespace "Microsoft.ContainerService" --name "AdvancedNetworkingFlowLogsPreview" --query "properties.state" -o tsv 2>/dev/null || echo "NotRegistered")

This is not required. We just need to register this feature in the subscription you want to test and set that subscription in the pipeline.
Not all tests need this feature registered, and this change will affect other pipelines' tests. I suggest just enabling it in your subscription.
Summary
This PR adds a Container Network Logs (CNL) performance evaluation scenario for AKS + Azure CNI Overlay + Cilium dataplane. It includes the ClusterLoader2 (CL2) workload, Prometheus monitors, and measurement queries needed to evaluate CNL observability overhead at scale.
The scenario is designed to validate Container Network Logs performance by:
What's measured
What's included
ClusterLoader2 workload
- `modules/python/clusterloader2/scale/` - CL2 configuration and Python harness
Pipeline & scenario
- `pipelines/perf-eval/CNI Benchmark/cnl-observability.yml` - Pipeline with parameters for traffic scale and CNL options
- `scenarios/perf-eval/cnl-azurecni-overlay-cilium/` - Terraform inputs for AKS with ACNS enabled (`--enable-acns`)
Monitoring
Prerequisites
Per CNL documentation:
Notes