Merged
Pull request overview
Adds a new large-scale ACR image pull benchmark (1000-node) and updates the ClusterLoader2 CRI module/topology plumbing to better support large clusters by allowing explicit per-pod memory request overrides and runtime-parameterized node validation.
Changes:
- Parameterize image-pull topology resource validation to use a runtime `desired_nodes` value (and extend validation timeout).
- Add `memory_request_override` support end-to-end (pipeline env → execute step → CRI override logic) to avoid scheduling failures when `max_pods` is low.
- Introduce a new `image-pull-n1000` perf-eval scenario (Terraform inputs + test inputs + README).
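The override path described above could look roughly like the sketch below. Only the `--memory_request_override` flag name comes from this PR; the helper names (`parse_memory_quantity`, `resolve_memory_request`), the accepted value format (`Ki`/`Mi`/`Gi` suffixes), and the "empty string means no override" convention are assumptions for illustration, not the actual `cri.py` code.

```python
import argparse
import re

# Assumed set of suffixes; the real module may accept more (or fewer) units.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory_quantity(value):
    """Return bytes for strings like '512Mi'; None when no override is given."""
    if value is None or value.strip() == "":
        return None
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", value.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def resolve_memory_request(computed_bytes, override):
    """Prefer the explicit override; otherwise keep the computed request."""
    override_bytes = parse_memory_quantity(override)
    return computed_bytes if override_bytes is None else override_bytes


def build_parser():
    """Hypothetical CLI wiring; an empty default means 'no override'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--memory_request_override", default="")
    return parser
```

With this shape, a pipeline that leaves `MEMORY_REQUEST_OVERRIDE` unset passes an empty string and the computed request is used unchanged.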
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| steps/topology/image-pull/validate-resources.yml | Switch validation to use runtime $(desired_nodes) and increase validation timeout. |
| steps/engine/clusterloader2/large-cluster/validate.yml | Change desired_nodes parameter type to string to support runtime substitution. |
| steps/engine/clusterloader2/cri/execute.yml | Plumb MEMORY_REQUEST_OVERRIDE env var into CRI override CLI invocation. |
| modules/python/clusterloader2/cri/cri.py | Add --memory_request_override flag and implement override parsing/behavior in override generation. |
| scenarios/perf-eval/image-pull-n1000/terraform-test-inputs/azure.json | Add terraform test input for the new scenario. |
| scenarios/perf-eval/image-pull-n1000/terraform-inputs/azure.tfvars | Add Terraform configuration for the 1000-node image pull cluster. |
| scenarios/perf-eval/image-pull-n1000/README.md | Document the new image-pull-n1000 scenario. |
| pipelines/system/new-pipeline-test.yml | Minor formatting/line adjustment in pipeline template example. |
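Because `desired_nodes` now arrives as a string that is substituted at runtime, a consumer has to convert it before comparing node counts. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the "at least as many ready nodes as desired" rule is a guess at the validation logic, and the unexpanded-variable check relies on Azure Pipelines leaving `$(name)` macros as literal text when the variable is undefined.

```python
import re


def parse_desired_nodes(raw):
    """Convert a pipeline-supplied string such as '1000' to a positive int."""
    text = raw.strip()
    # An undefined macro variable would reach us as the literal '$(desired_nodes)'.
    if re.fullmatch(r"\$\(.+\)", text):
        raise ValueError(f"pipeline variable was not expanded: {text}")
    count = int(text)  # raises ValueError for non-numeric input
    if count <= 0:
        raise ValueError(f"desired_nodes must be positive, got {count}")
    return count


def nodes_ready(ready_count, desired_raw):
    """Assumed rule: validation passes once enough nodes report Ready."""
    return ready_count >= parse_desired_nodes(desired_raw)
```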
- 1000n acr image pull
- 1000n acr image pull
- 1000n acr image pull
- 1000n acr image pull
- 1000n test
- 1000n test
- fix desired node
- clean up test changes
- format and update test
- fix format
- fix format
- Revert new-pipeline-test.yml to match main
55ffd48 to 5854625
@microsoft-github-policy-service agree company="Microsoft"
wonderyl reviewed Feb 18, 2026
wonderyl approved these changes Feb 18, 2026
Summary
Adds a 1000-node ACR image pull benchmark that measures concurrent image-pull throughput at scale against the ACR dogfood environment using anonymous pull. Also adds support for custom pod memory requests in the CRI module to prevent pod scheduling failures on nodes with low max_pods settings.
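As background on why the memory request matters: the Kubernetes scheduler places a pod on a node only when the sum of memory requests already on that node plus the new pod's request stays within the node's allocatable memory. The sketch below illustrates that check with made-up numbers; the function name and all values are hypothetical and are not taken from the benchmark.

```python
GIB = 2**30


def fits(allocatable_bytes, requested_bytes, new_request_bytes):
    """True when the new pod's memory request still fits on the node."""
    return requested_bytes + new_request_bytes <= allocatable_bytes


# Hypothetical node: 12 GiB allocatable, 2 GiB already requested by system pods.
allocatable = 12 * GIB
already_requested = 2 * GIB

# An oversized per-pod request does not fit, so the pod stays Pending.
oversized_fits = fits(allocatable, already_requested, 11 * GIB)

# A smaller explicit request fits on the same node.
override_fits = fits(allocatable, already_requested, 1 * GIB)
```

If every node in the pool is in the first situation, no node is available for the pod, which is the kind of condition an explicit memory request is meant to avoid.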
Changes
FailedScheduling: 0/1004 nodes are available: 1000 Insufficient memory
Validation
Validated on a 1000-node cluster pulling ~5 GB and ~10 GB images.
Pipelines - Run 20260212.7