Vlin/ACR 1000 nodes image pull test by vlin-ms · Pull Request #1059 · Azure/telescope

vlin-ms · 2026-02-12T23:49:04Z

Summary
Adds a 1000-node ACR image pull benchmark that measures concurrent image-pull throughput at scale against the ACR dogfood environment using anonymous pull. Also adds support for custom pod memory requests in the CRI module to prevent pod scheduling failures on nodes with low max_pods settings.

Changes

New scenario: image-pull-n1000
- Pipeline: New pipeline targeting australiaeast with 1000 user nodes, anonymous pull from acrperftestaue.azurecr-test.io
- Terraform: 1004-node cluster (3 default + 1 Standard_D64_v3 Prometheus + 1000 Standard_D4ds_v5 user nodes)

Custom memory request override (memory_request_override)
- Problem: When max_pods is low, the auto-calculated memory request per pod becomes too large (allocatable memory ÷ few pods), causing Insufficient memory scheduling failures:
  FailedScheduling: 0/1004 nodes are available: 1000 Insufficient memory
- Solution: Added memory_request_override parameter to cri.py and execute.yml, allowing explicit control over pod memory requests instead of relying on auto-calculation
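
The arithmetic behind the failure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cri.py: the helper name `pod_memory_request_mi`, the allocatable-memory figure, and the max_pods values are all hypothetical.

```python
from typing import Optional


def pod_memory_request_mi(
    allocatable_mi: int,
    max_pods: int,
    memory_request_override_mi: Optional[int] = None,
) -> int:
    """Return a per-pod memory request in MiB.

    Hypothetical helper that mirrors the behavior described above;
    it is not the actual implementation in cri.py.
    """
    if max_pods <= 0:
        raise ValueError("max_pods must be positive")
    if memory_request_override_mi is not None:
        # An explicit override takes precedence over auto-calculation.
        return memory_request_override_mi
    # Auto-calculation: split allocatable memory evenly across pod slots.
    return allocatable_mi // max_pods


# Assumed figure for illustration only: 12000 MiB allocatable per node.
ALLOCATABLE_MI = 12_000

print(pod_memory_request_mi(ALLOCATABLE_MI, max_pods=110))
print(pod_memory_request_mi(ALLOCATABLE_MI, max_pods=3))
print(pod_memory_request_mi(ALLOCATABLE_MI, max_pods=3, memory_request_override_mi=256))
```

With few pod slots, each auto-calculated request approaches a large share of the node's allocatable memory, which is the condition the override is meant to avoid.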

Shared topology: parameterized desired_nodes
- Changed desired_nodes from type: number to type: string in validate.yml to support runtime matrix variables
- Updated image-pull topology to use $(desired_nodes) from pipeline matrix, enabling n10 (14 nodes) and n1000 (1004 nodes) to share the same topology
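
Both node totals quoted above are consistent with a fixed overhead of four non-user nodes (3 default + 1 Prometheus, per the Terraform description), assuming the n10 cluster uses the same system pools. The sketch below checks that arithmetic and shows the explicit conversion a string-typed value needs before it can be used as a number. The helper `expected_total_nodes` is hypothetical, and whether the pipeline passes the total or the user-node count in `desired_nodes` is not stated here.

```python
DEFAULT_NODES = 3     # default pool size, from the Terraform description
PROMETHEUS_NODES = 1  # dedicated Prometheus node


def expected_total_nodes(user_nodes: str) -> int:
    """Return the total cluster size for a user-node count given as a string.

    Hypothetical helper: matrix variables arrive as strings, so the value
    is converted explicitly before any arithmetic.
    """
    count = int(user_nodes.strip())
    if count < 0:
        raise ValueError("user node count must be non-negative")
    return DEFAULT_NODES + PROMETHEUS_NODES + count


print(expected_total_nodes("10"))    # 14
print(expected_total_nodes("1000"))  # 1004
```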

Validation
Validated on a 1000-node cluster pulling ~5 GB and ~10 GB images.
Pipelines - Run 20260212.7

Copilot

Pull request overview

Adds a new large-scale ACR image pull benchmark (1000-node) and updates the ClusterLoader2 CRI module/topology plumbing to better support large clusters by allowing explicit per-pod memory request overrides and runtime-parameterized node validation.

Changes:

- Parameterize image-pull topology resource validation to use a runtime desired_nodes value (and extend validation timeout).
- Add memory_request_override support end-to-end (pipeline env → execute step → CRI override logic) to avoid scheduling failures when max_pods is low.
- Introduce a new image-pull-n1000 perf-eval scenario (Terraform inputs + test inputs + README).
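
The env-to-CLI plumbing described above could look roughly like the sketch below. Only the flag name `--memory_request_override` and the variable name `MEMORY_REQUEST_OVERRIDE` come from this PR; the helper names, the empty-string default, and the `100Mi` value format are assumptions for illustration, not the actual cri.py or execute.yml logic.

```python
import argparse
from typing import List, Mapping


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the override flag (the real CLI has more options)."""
    parser = argparse.ArgumentParser(description="CRI override sketch")
    parser.add_argument(
        "--memory_request_override",
        type=str,
        default="",
        help="Explicit pod memory request; empty means auto-calculate.",
    )
    return parser


def cli_args_from_env(env: Mapping[str, str]) -> List[str]:
    """Translate the pipeline env var into CLI arguments.

    The flag is omitted when the variable is unset or blank, so the
    auto-calculated request remains the default behavior.
    """
    value = env.get("MEMORY_REQUEST_OVERRIDE", "").strip()
    return ["--memory_request_override", value] if value else []


# Assumed value format ("100Mi") for illustration only.
args = build_parser().parse_args(
    cli_args_from_env({"MEMORY_REQUEST_OVERRIDE": "100Mi"})
)
print(args.memory_request_override)
```

Omitting the flag when the variable is blank keeps existing scenarios unchanged unless they opt in.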

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| steps/topology/image-pull/validate-resources.yml | Switch validation to use runtime `$(desired_nodes)` and increase validation timeout. |
| steps/engine/clusterloader2/large-cluster/validate.yml | Change `desired_nodes` parameter type to string to support runtime substitution. |
| steps/engine/clusterloader2/cri/execute.yml | Plumb `MEMORY_REQUEST_OVERRIDE` env var into CRI override CLI invocation. |
| modules/python/clusterloader2/cri/cri.py | Add `--memory_request_override` flag and implement override parsing/behavior in override generation. |
| scenarios/perf-eval/image-pull-n1000/terraform-test-inputs/azure.json | Add Terraform test input for the new scenario. |
| scenarios/perf-eval/image-pull-n1000/terraform-inputs/azure.tfvars | Add Terraform configuration for the 1000-node image pull cluster. |
| scenarios/perf-eval/image-pull-n1000/README.md | Document the new image-pull-n1000 scenario. |
| pipelines/system/new-pipeline-test.yml | Minor formatting/line adjustment in pipeline template example. |

vlin-ms · 2026-02-18T03:53:22Z

@microsoft-github-policy-service agree company="Microsoft"

vlin-ms requested review from jasminetMSFT and jikuma February 12, 2026 23:49

vlin-ms requested a review from alyssa1303 as a code owner February 12, 2026 23:49

Copilot AI review requested due to automatic review settings February 12, 2026 23:49

vlin-ms requested review from anson627, sumanthreddy29, vittoriasalim, wonderyl and xinWeiWei24 as code owners February 12, 2026 23:49

Copilot started reviewing on behalf of vlin-ms February 12, 2026 23:49

Copilot AI reviewed Feb 12, 2026

modules/python/clusterloader2/cri/cri.py (resolved)

modules/python/clusterloader2/cri/cri.py (resolved)

scenarios/perf-eval/image-pull-n1000/README.md (resolved)

1000n acr image pull

5854625

- 1000n acr image pull
- 1000n acr image pull
- 1000n acr image pull
- 1000n acr image pull
- 1000n test
- 1000n test
- fix desired node
- clean up test changes
- format and update test
- fix format
- fix format
- Revert new-pipeline-test.yml to match main

vlin-ms force-pushed the vlin/acr-1000n-perftest branch from 55ffd48 to 5854625 February 18, 2026 03:10

vlin-ms requested a review from liyu-ma as a code owner February 18, 2026 03:10

resolve merge conflict

a4041c0

vlin-ms closed this Feb 18, 2026

vlin-ms reopened this Feb 18, 2026

wonderyl reviewed Feb 18, 2026

steps/engine/clusterloader2/large-cluster/validate.yml (resolved)

wonderyl approved these changes Feb 18, 2026

vlin-ms merged commit 010e58b into main Feb 18, 2026
11 checks passed

vlin-ms deleted the vlin/acr-1000n-perftest branch February 18, 2026 06:01

vlin-ms merged 2 commits into main from vlin/acr-1000n-perftest

3 participants