Releases: NVIDIA/cloudai
v1.5.0
New Changes
- Added support for the following workloads:
  - DDLB - Distributed Deep Learning Benchmark
  - DeepEP - Deep learning expert parallelism benchmark
  - OSU - Ohio State University MPI benchmarks
  - MegatronBridge - Megatron-Bridge integration
  - AIConfigurator - AI model configuration predictor support
- Renamed NIXL perftest to NIXL CTPerf
- Added support for HFModel installable type for automatic HuggingFace model downloads
AI Dynamo Improvements
The AI Dynamo workload now supports both Kubernetes and Slurm systems and has been upgraded to Dynamo v0.7. Key additions include disaggregated prefill/decode mode, multinode deployments, and pass-fail criteria for automated result validation. Deployment configuration has been simplified: it now lives in the TOML files instead of an extra config file, and genai-perf is used directly from the Dynamo container.
Kubernetes Enhancements
Host network is enabled by default for all Kubernetes deployments. Job name sanitization has been improved across all workloads to prevent invalid characters. For NCCL workloads specifically, logs are fetched continuously during execution and sshd is automatically installed on workers when not available. AI Dynamo deployments properly clean up port forwarding processes on deletion.
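As an illustration of the job name sanitization mentioned above, here is a minimal sketch in Python. It assumes the target is an RFC 1123 label (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters). The function name `sanitize_k8s_name` is hypothetical, and CloudAI's actual rules may differ.

```python
import re


def sanitize_k8s_name(name: str, max_len: int = 63) -> str:
    """Hypothetical helper: coerce an arbitrary string into an RFC 1123 label."""
    # Kubernetes names must be lowercase.
    lowered = name.lower()
    # Replace every run of disallowed characters with a single hyphen.
    replaced = re.sub(r"[^a-z0-9-]+", "-", lowered)
    # Enforce the length limit, then drop leading/trailing hyphens.
    trimmed = replaced[:max_len].strip("-")
    if not trimmed:
        raise ValueError(f"cannot derive a valid name from {name!r}")
    return trimmed


print(sanitize_k8s_name("NCCL_Test.all_reduce"))  # nccl-test-all-reduce
```

Truncating before stripping hyphens ensures the result never ends with `-`, even when the cut falls on a separator.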
Reporting Improvements
- Comparison report automatically calculates the difference (value + percentage) when exactly two results are compared
- Status report includes scenario results for easier monitoring
- Improved status table formatting
- Report results directory is printed to users early in the process
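The "value + percentage" difference in the comparison report can be illustrated with a short sketch. This is not CloudAI's implementation. The function name `diff_with_percentage` and the choice of the first result as the baseline are assumptions made for the example.

```python
from typing import Optional, Tuple


def diff_with_percentage(baseline: float, candidate: float) -> Tuple[float, Optional[float]]:
    """Hypothetical helper: absolute and relative difference between two results."""
    diff = candidate - baseline
    if baseline == 0:
        # A percentage relative to zero is undefined; report only the value.
        return diff, None
    # Percentage is expressed relative to the magnitude of the baseline.
    return diff, diff * 100.0 / abs(baseline)


value, pct = diff_with_percentage(100.0, 110.0)
print(f"{value:+.2f} ({pct:+.1f}%)")  # +10.00 (+10.0%)
```

Returning `None` for a zero baseline lets the caller decide how to render an undefined percentage instead of raising or printing `inf`.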
Documentation
The documentation has been reorganized: the AI Dynamo page now covers both Kubernetes and Slurm examples in one place. New sections have been added for parameter sweeps and test-in-scenario configuration. The workloads support matrix has been updated to reflect current platform availability.
Architectural Changes
- Removed Test concept - Simplifies the codebase by eliminating the intermediate Test object
- Removed TestTemplate concept - Direct workload usage instead of TestTemplate objects
- Converted TestScenario to dataclass
- Converted BaseSystem to Pydantic model
- Aligned Grader and JsonGenStrategy with CmdGenStrategy patterns
- MegatronRun workload defaults to not enabling recompute-activations
- --distribution=arbitrary is not hardcoded for Slurm deployments anymore
- srun commands always set the number of nodes (unless a nodelist is specified)
- ETCD/NIXL processes are killed and waited for properly
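The srun behavior described above (always pass a node count unless an explicit nodelist is given) can be sketched as follows. The function name `srun_node_args` and its signature are hypothetical and do not reflect CloudAI's internal API. Only the `--nodes` and `--nodelist` options are real srun flags.

```python
from typing import List, Optional


def srun_node_args(num_nodes: int, nodelist: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: choose between --nodelist and --nodes for srun."""
    if nodelist:
        # An explicit nodelist already determines which nodes are used,
        # so the node count is not passed.
        return [f"--nodelist={','.join(nodelist)}"]
    # Otherwise the node count is always set explicitly.
    return [f"--nodes={num_nodes}"]


print(["srun", *srun_node_args(4)])
print(["srun", *srun_node_args(2, ["node01", "node02"])])
```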
All Changes
- Remove DeepEP callback for llama4 by @aahouzi in #712
- Run tests for several py versions by @amaslenn in #713
- Bump fallback version to v1.5 and upgrade dependencies by @amaslenn in #714
- Small enhancements by @amaslenn in #715
- Simplify internal hierarchy of classes by @amaslenn in #716
- Update documentation by @amaslenn in #718
- Fix NameError for K8s batch run by @amaslenn in #721
- Add DDLB workload by @nsarka in #711
- Updates for Dynamo over K8s by @amaslenn in #724
- Fixed an issue where using dependencies could result in an infinite loop by @amaslenn in #725
- Report results dir to users as early as possible by @amaslenn in #726
- Configure AI code review tools by @amaslenn in #728
- Kill and wait for ETCD process to be gone by @amaslenn in #727
- DeepEP benchmark by @ybenvidia in #723
- Print scenario status table at the end of a run by @amaslenn in #730
- Always set number of nodes for srun cmd by @amaslenn in #729
- Convert base System into pydantic model by @amaslenn in #732
- Add HF home dir property inside System model by @amaslenn in #733
- Add new installable type: HF model by @amaslenn in #735
- Add extra_srun_args on TestRun level by @amaslenn in #734
- Dynamo pass/fail and slurm example by @amaslenn in #736
- Add support for HF model in K8s by @amaslenn in #737
- Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
- Fine tune CodeRabbit reviews by @amaslenn in #740
- Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
- Generate reports in dry-run by @amaslenn in #741
- Update documentation by @amaslenn in #743
- Simplify Dynamo slurm configuration by @amaslenn in #745
- UCC add file generator by @yaeliyac in #747
- Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
- Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
- Fix empty table if not all results are available by @amaslenn in #753
- Ensure reports order by @amaslenn in #754
- Update documentation on Dynamo k8s multi node by @amaslenn in #749
- Fix bokeh charts generation by @amaslenn in #755
- Enhancements for Dynamo with k8s by @amaslenn in #752
- Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
- Describe global options for cloudai CLI by @amaslenn in #758
- Update codeowners by @srivatsankrishnan in #717
- Aiconfig by @srivatsankrishnan in #760
- Rula review by @RulaHallak in #761
- Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
- Add workload for OSU Micro Benchmark by @allkoow in #742
- Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
- Megatron Bridge in CloudAI by @srivatsankrishnan in #764
- M bridge Documentation by @srivatsankrishnan in #765
- Remove hardcoded --distribution=arbitrary by @juntaowww in #766
- M bridge updates by @srivatsankrishnan in #767
- Provide CMS-friendly documentation build by @amaslenn in #769
- Model Name/Model size for verify configs by @srivatsankrishnan in #772
- Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
- Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
- Upgrade container versions for common examples by @amaslenn in #776
- Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
- Do not use internal URLs in documentation by @amaslenn in #775
- Do not enable recompute-activations by default by @amaslenn in #768
- B200 M-bridge misconfig by @srivatsankrishnan in #777
- Fix M-bridge report generation by @srivatsankrishnan in #778
- Fix installation logic for File on k8s by @amaslenn in #781
- Address issues with Sleep test over K8s by @amaslenn in #779
- M-bridge Job ID extraction by @srivatsankrishnan in #783
- Make tests more stable on systems with slurm binaries by @amaslenn in #784
- Update doc on using hf token for the first time by @amaslenn in #785
- Improvements for NCCL over k8s by @amaslenn in #786
- Remove configs for OSU benchmarks by @allkoow in #789
- Continuously fetch logs for NCCL over k8s by @amaslenn in #788
- Fix handling of a local path for docker container by @amaslenn in #790
- More robust bench execution for Dynamo over k8s by @ama...
v1.5.rc4
v1.5.rc3
v1.5.rc2
What's Changed
- B200 M-bridge misconfig by @srivatsankrishnan in #777
- Fix M-bridge report generation by @srivatsankrishnan in #778
- Fix installation logic for File on k8s by @amaslenn in #781
- Address issues with Sleep test over K8s by @amaslenn in #779
- M-bridge Job ID extraction by @srivatsankrishnan in #783
- Make tests more stable on systems with slurm binaries by @amaslenn in #784
- Update doc on using hf token for the first time by @amaslenn in #785
- Improvements for NCCL over k8s by @amaslenn in #786
Full Changelog: v1.5.rc1...v1.5.rc2
v1.5.rc1
What's Changed
- Provide CMS-friendly documentation build by @amaslenn in #769
- Model Name/Model size for verify configs by @srivatsankrishnan in #772
- Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
- Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
- Upgrade container versions for common examples by @amaslenn in #776
- Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
- Do not use internal URLs in documentation by @amaslenn in #775
- Do not enable recompute-activations by default by @amaslenn in #768
Full Changelog: v1.5.beta7...v1.5.rc1
v1.5.beta7
What's Changed
- M bridge Documentation by @srivatsankrishnan in #765
- Remove hardcoded --distribution=arbitrary by @juntaowww in #766
- M bridge updates by @srivatsankrishnan in #767
New Contributors
- @juntaowww made their first contribution in #766
Full Changelog: v1.5.beta6...v1.5.beta7
v1.5.beta6
What's Changed
- Update codeowners by @srivatsankrishnan in #717
- Aiconfig by @srivatsankrishnan in #760
- Rula review by @RulaHallak in #761
- Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
- Add workload for OSU Micro Benchmark by @allkoow in #742
- Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
- Megatron Bridge in CloudAI by @srivatsankrishnan in #764
New Contributors
Full Changelog: v1.5.beta5...v1.5.beta6
v1.5.beta5
What's Changed
- UCC add file generator by @yaeliyac in #747
- Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
- Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
- Fix empty table if not all results are available by @amaslenn in #753
- Ensure reports order by @amaslenn in #754
- Update documentation on Dynamo k8s multi node by @amaslenn in #749
- Fix bokeh charts generation by @amaslenn in #755
- Enhancements for Dynamo with k8s by @amaslenn in #752
- Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
- Describe global options for cloudai CLI by @amaslenn in #758
Full Changelog: v1.5.beta4...v1.5.beta5
v1.5.beta4
What's Changed
- Add new installable type: HF model by @amaslenn in #735
- Add extra_srun_args on TestRun level by @amaslenn in #734
- Dynamo pass/fail and slurm example by @amaslenn in #736
- Add support for HF model in K8s by @amaslenn in #737
- Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
- Fine tune CodeRabbit reviews by @amaslenn in #740
- Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
- Generate reports in dry-run by @amaslenn in #741
- Update documentation by @amaslenn in #743
- Simplify Dynamo slurm configuration by @amaslenn in #745
Full Changelog: v1.5.beta3...v1.5.beta4
v1.5.beta3
What's Changed
- Print scenario status table at the end of a run by @amaslenn in #730
- Always set number of nodes for srun cmd by @amaslenn in #729
- Convert base System into pydantic model by @amaslenn in #732
- Add HF home dir property inside System model by @amaslenn in #733
Full Changelog: v1.5.beta2...v1.5.beta3