Releases · NVIDIA/cloudai

10 Feb 10:27

amaslenn

v1.5.0

88d6f5c

v1.5.0 Latest

New Changes

Added support for the following workloads:
- DDLB - Distributed Deep Learning Benchmark
- DeepEP - Deep learning expert parallelism benchmark
- OSU - Ohio State University MPI benchmarks
- MegatronBridge - Megatron-Bridge integration
- AIConfigurator - AI model configuration predictor support
Renamed NIXL perftest to NIXL CTPerf
Added support for HFModel installable type for automatic HuggingFace model downloads

AI Dynamo Improvements

The AI Dynamo workload now supports both Kubernetes and Slurm systems and has been upgraded to Dynamo v0.7. Key additions include disaggregated prefill/decode mode, multinode deployments, and pass/fail criteria for automated result validation. Deployment configuration has been simplified: it is driven by the TOML files instead of an extra config file, and genai-perf is used directly from the Dynamo container.

Kubernetes Enhancements

Host network is enabled by default for all Kubernetes deployments. Job name sanitization has been improved across all workloads to prevent invalid characters. For NCCL workloads specifically, logs are fetched continuously during execution and sshd is automatically installed on workers when not available. AI Dynamo deployments properly clean up port forwarding processes on deletion.
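As an illustration of what job name sanitization involves, the sketch below coerces an arbitrary string into a Kubernetes-compatible DNS-1123 label (lowercase alphanumerics and hyphens, alphanumeric at both ends, at most 63 characters). The function name `sanitize_k8s_name` is hypothetical and is not taken from the CloudAI codebase; the actual implementation may differ.

```python
import re

MAX_LABEL_LEN = 63  # DNS-1123 label length limit applied by Kubernetes


def sanitize_k8s_name(raw: str, max_len: int = MAX_LABEL_LEN) -> str:
    """Hypothetical helper: turn an arbitrary string into a DNS-1123 label."""
    name = raw.lower()
    # Replace every run of disallowed characters with a single hyphen.
    name = re.sub(r"[^a-z0-9-]+", "-", name)
    # Truncate first, then strip hyphens so both ends stay alphanumeric.
    name = name[:max_len].strip("-")
    if not name:
        raise ValueError(f"cannot derive a valid name from {raw!r}")
    return name


print(sanitize_k8s_name("NCCL_Test.all_reduce #1"))
```

Truncating before stripping matters here: cutting a long name at 63 characters can leave a trailing hyphen, which would make the label invalid again.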

Reporting Improvements

Comparison report automatically calculates the difference (value + percentage) when comparing two results
Status report includes scenario results for easier monitoring
Improved status table formatting
Report results directory is printed to users early in the process
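The two-result difference in the comparison report can be sketched roughly as follows. This is a minimal illustration, not the CloudAI implementation: the function names are hypothetical, and it assumes the difference is taken as second result minus first, with the percentage relative to the first result.

```python
from typing import Optional, Tuple


def diff_with_percentage(baseline: float, candidate: float) -> Tuple[float, Optional[float]]:
    """Return (absolute difference, percentage difference relative to baseline)."""
    delta = candidate - baseline
    if baseline == 0:
        # A relative change is undefined for a zero baseline.
        return delta, None
    return delta, delta * 100.0 / abs(baseline)


def format_diff(baseline: float, candidate: float) -> str:
    """Render the difference as a table cell, e.g. '-50.00 (-25.00%)'."""
    delta, pct = diff_with_percentage(baseline, candidate)
    if pct is None:
        return f"{delta:+.2f} (n/a)"
    return f"{delta:+.2f} ({pct:+.2f}%)"


print(format_diff(200.0, 150.0))
```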

Documentation

The documentation has been reorganized, and the AI Dynamo page now covers both Kubernetes and Slurm examples in one place. New sections have been added for parameter sweeps and test-in-scenario configuration. The workloads support matrix has been updated to reflect current platform availability.

Architectural Changes

Removed Test concept - Simplifies the codebase by eliminating the intermediate Test object
Removed TestTemplate concept - Direct workload usage instead of TestTemplate objects
Converted TestScenario to dataclass
Converted BaseSystem to Pydantic model
Aligned Grader and JsonGenStrategy with CmdGenStrategy patterns
MegatronRun workload defaults to not enabling recompute-activations
--distribution=arbitrary is not hardcoded for Slurm deployments anymore
srun commands always set the number of nodes (unless a nodelist is specified)
ETCD/NIXL processes are killed and waited for properly
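The srun-related changes above (node count always set unless a nodelist is given, and no hardcoded `--distribution=arbitrary`) can be pictured with the sketch below. The function `build_srun_prefix` is hypothetical and only mirrors the described behavior; `--nodes`, `--nodelist`, and `--distribution` are standard srun options, and `extra_srun_args` borrows its name from the TestRun-level option added in #734.

```python
from typing import List, Optional


def build_srun_prefix(
    num_nodes: int,
    nodelist: Optional[List[str]] = None,
    extra_srun_args: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical sketch of how an srun command prefix might be assembled."""
    cmd = ["srun"]
    if nodelist:
        # An explicit nodelist already fixes the allocation, so -N/--nodes is omitted.
        cmd.append(f"--nodelist={','.join(nodelist)}")
    else:
        cmd.append(f"--nodes={num_nodes}")
    # Nothing such as --distribution=arbitrary is added implicitly;
    # callers pass it explicitly if they need it.
    cmd.extend(extra_srun_args or [])
    return cmd


print(" ".join(build_srun_prefix(2, extra_srun_args=["--distribution=arbitrary"])))
```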

All Changed

Remove DeepEP callback for llama4 by @aahouzi in #712
Run tests for several py versions by @amaslenn in #713
Bump fallback version to v1.5 and upgrade dependencies by @amaslenn in #714
Small enhancements by @amaslenn in #715
Simplify internal hierarchy of classes by @amaslenn in #716
Update documentation by @amaslenn in #718
Fix NameError for K8s batch run by @amaslenn in #721
Add DDLB workload by @nsarka in #711
Updates for Dynamo over K8s by @amaslenn in #724
Fixed an issue where using dependencies could result in an infinite loop by @amaslenn in #725
Report results dir to users as early as possible by @amaslenn in #726
Configure AI code review tools by @amaslenn in #728
Kill and wait for ETCD process to be gone by @amaslenn in #727
DeepEP benchmark by @ybenvidia in #723
Print scenario status table at the end of a run by @amaslenn in #730
Always set number of nodes for srun cmd by @amaslenn in #729
Convert base System into pydantic model by @amaslenn in #732
Add HF home dir property inside System model by @amaslenn in #733
Add new installable type: HF model by @amaslenn in #735
Add extra_srun_args on TestRun level by @amaslenn in #734
Dynamo pass/fail and slurm example by @amaslenn in #736
Add support for HF model in K8s by @amaslenn in #737
Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
Fine tune CodeRabbit reviews by @amaslenn in #740
Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
Generate reports in dry-run by @amaslenn in #741
Update documentation by @amaslenn in #743
Simplify Dynamo slurm configuration by @amaslenn in #745
UCC add file generator by @yaeliyac in #747
Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
Fix empty table if not all results are available by @amaslenn in #753
Ensure reports order by @amaslenn in #754
Update documentation on Dynamo k8s multi node by @amaslenn in #749
Fix bokeh charts generation by @amaslenn in #755
Enhancements for Dynamo with k8s by @amaslenn in #752
Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
Describe global options for cloudai CLI by @amaslenn in #758
Update codeowners by @srivatsankrishnan in #717
Aiconfig by @srivatsankrishnan in #760
Rula review by @RulaHallak in #761
Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
Add workload for OSU Micro Benchmark by @allkoow in #742
Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
Megatron Bridge in CloudAI by @srivatsankrishnan in #764
M bridge Documentation by @srivatsankrishnan in #765
Remove hardcoded --distribution=arbitrary by @juntaowww in #766
M bridge updates by @srivatsankrishnan in #767
Provide CMS-friendly documentation build by @amaslenn in #769
Model Name/Model size for verify configs by @srivatsankrishnan in #772
Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
Upgrade container versions for common examples by @amaslenn in #776
Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
Do not use internal URLs in documentation by @amaslenn in #775
Do not enable recompute-activations by default by @amaslenn in #768
B200 M-bridge misconfig by @srivatsankrishnan in #777
Fix M-bridge report generation by @srivatsankrishnan in #778
Fix installation logic for File on k8s by @amaslenn in #781
Address issues with Sleep test over K8s by @amaslenn in #779
M-bridge Job ID extraction by @srivatsankrishnan in #783
Make tests more stable on systems with slurm binaries by @amaslenn in #784
Update doc on using hf token for the first time by @amaslenn in #785
Improvements for NCCL over k8s by @amaslenn in #786
Remove configs for OSU benchmarks by @allkoow in #789
Continuously fetch logs for NCCL over k8s by @amaslenn in #788
Fix handling of a local path for docker container by @amaslenn in #790
More robust bench execution for Dynamo over k8s by @ama...

Contributors

amaslenn, nsarka, and 7 other contributors

02 Feb 09:33

amaslenn

v1.5.rc4

88d6f5c

v1.5.rc4 Pre-release

What's Changed

More robust bench execution for Dynamo over k8s by @amaslenn in #793

Full Changelog: v1.5.rc3...v1.5.rc4

Contributors

amaslenn

28 Jan 16:07

amaslenn

v1.5.rc3

8660590

v1.5.rc3 Pre-release

What's Changed

Remove configs for OSU benchmarks by @allkoow in #789
Continuously fetch logs for NCCL over k8s by @amaslenn in #788
Fix handling of a local path for docker container by @amaslenn in #790

Full Changelog: v1.5.rc2...v1.5.rc3

Contributors

amaslenn and allkoow

21 Jan 16:28

amaslenn

v1.5.rc2

f5410ee

v1.5.rc2 Pre-release

What's Changed

B200 M-bridge misconfig by @srivatsankrishnan in #777
Fix M-bridge report generation by @srivatsankrishnan in #778
Fix installation logic for File on k8s by @amaslenn in #781
Address issues with Sleep test over K8s by @amaslenn in #779
M-bridge Job ID extraction by @srivatsankrishnan in #783
Make tests more stable on systems with slurm binaries by @amaslenn in #784
Update doc on using hf token for the first time by @amaslenn in #785
Improvements for NCCL over k8s by @amaslenn in #786

Full Changelog: v1.5.rc1...v1.5.rc2

Contributors

amaslenn and srivatsankrishnan

13 Jan 07:49

amaslenn

v1.5.rc1

394c622

v1.5.rc1 Pre-release

What's Changed

Provide CMS-friendly documentation build by @amaslenn in #769
Model Name/Model size for verify configs by @srivatsankrishnan in #772
Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
Upgrade container versions for common examples by @amaslenn in #776
Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
Do not use internal URLs in documentation by @amaslenn in #775
Do not enable recompute-activations by default by @amaslenn in #768

Full Changelog: v1.5.beta7...v1.5.rc1

Contributors

amaslenn and srivatsankrishnan

07 Jan 15:38

amaslenn

v1.5.beta7

7b63c79

v1.5.beta7 Pre-release

What's Changed

M bridge Documentation by @srivatsankrishnan in #765
Remove hardcoded --distribution=arbitrary by @juntaowww in #766
M bridge updates by @srivatsankrishnan in #767

New Contributors

@juntaowww made their first contribution in #766

Full Changelog: v1.5.beta6...v1.5.beta7

Contributors

srivatsankrishnan and juntaowww

23 Dec 16:35

srivatsankrishnan

v1.5.beta6

99f9158

v1.5.beta6 Pre-release

What's Changed

Update codeowners by @srivatsankrishnan in #717
Aiconfig by @srivatsankrishnan in #760
Rula review by @RulaHallak in #761
Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
Add workload for OSU Micro Benchmark by @allkoow in #742
Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
Megatron Bridge in CloudAI by @srivatsankrishnan in #764

New Contributors

@allkoow made their first contribution in #742

Full Changelog: v1.5.beta5...v1.5.beta6

Contributors

amaslenn, srivatsankrishnan, and 2 other contributors

17 Dec 18:10

amaslenn

v1.5.beta5

b9ff078

v1.5.beta5 Pre-release

What's Changed

UCC add file generator by @yaeliyac in #747
Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
Fix empty table if not all results are available by @amaslenn in #753
Ensure reports order by @amaslenn in #754
Update documentation on Dynamo k8s multi node by @amaslenn in #749
Fix bokeh charts generation by @amaslenn in #755
Enhancements for Dynamo with k8s by @amaslenn in #752
Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
Describe global options for cloudai CLI by @amaslenn in #758

Full Changelog: v1.5.beta4...v1.5.beta5

Contributors

amaslenn and yaeliyac

10 Dec 16:08

amaslenn

v1.5.beta4

8e26c01

v1.5.beta4 Pre-release

What's Changed

Add new installable type: HF model by @amaslenn in #735
Add extra_srun_args on TestRun level by @amaslenn in #734
Dynamo pass/fail and slurm example by @amaslenn in #736
Add support for HF model in K8s by @amaslenn in #737
Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
Fine tune CodeRabbit reviews by @amaslenn in #740
Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
Generate reports in dry-run by @amaslenn in #741
Update documentation by @amaslenn in #743
Simplify Dynamo slurm configuration by @amaslenn in #745

Full Changelog: v1.5.beta3...v1.5.beta4

Contributors

amaslenn

03 Dec 16:01

amaslenn

v1.5.beta3

4e9c340

v1.5.beta3 Pre-release

What's Changed

Print scenario status table at the end of a run by @amaslenn in #730
Always set number of nodes for srun cmd by @amaslenn in #729
Convert base System into pydantic model by @amaslenn in #732
Add HF home dir property inside System model by @amaslenn in #733

Full Changelog: v1.5.beta2...v1.5.beta3

Contributors

amaslenn
