Releases: NVIDIA/cloudai

v1.5.0

10 Feb 10:27
88d6f5c

New Changes

AI Dynamo Improvements

The AI Dynamo workload supports both Kubernetes and Slurm systems and has been upgraded to Dynamo v0.7. Key additions include disaggregated prefill/decode mode, multinode deployments, and pass-fail criteria for automated result validation. Deployment configuration has been simplified: it is now done in TOML files rather than in extra config files, and genai-perf is used directly from the Dynamo container.

Kubernetes Enhancements

Host network is enabled by default for all Kubernetes deployments. Job name sanitization has been improved across all workloads to prevent invalid characters. For NCCL workloads specifically, logs are fetched continuously during execution and sshd is automatically installed on workers when not available. AI Dynamo deployments properly clean up port forwarding processes on deletion.
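As an illustration of what job name sanitization involves, here is a minimal sketch. It assumes the target is a Kubernetes DNS label (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters). The function name `sanitize_k8s_name` and the `"job"` fallback are hypothetical and are not taken from the CloudAI codebase.

```python
import re


def sanitize_k8s_name(raw: str, max_len: int = 63) -> str:
    """Hypothetical helper: coerce an arbitrary string into a DNS-label-style name."""
    name = raw.lower()
    # Replace each run of characters outside [a-z0-9-] with a single dash.
    name = re.sub(r"[^a-z0-9-]+", "-", name)
    # Truncate first, then strip dashes so the result starts and ends alphanumeric.
    name = name[:max_len].strip("-")
    # Fall back to a placeholder if nothing usable is left (assumed behavior).
    return name or "job"


print(sanitize_k8s_name("NCCL_Test.All Reduce"))
```

Truncating before stripping matters here: cutting a name at 63 characters could otherwise leave a trailing dash, which is not valid at the end of a label.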

Reporting Improvements

  • Comparison report automatically calculates the difference (value + percentage) when comparing two results
  • Status report includes scenario results for easier monitoring
  • Improved status table formatting
  • Report results directory is printed to users early in the process
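The "value + percentage" difference in the comparison report can be pictured with the small sketch below. It is only an assumption about the arithmetic, not the report's actual implementation; the function name `diff_with_percentage` and the choice to return `None` for a zero baseline are hypothetical.

```python
from typing import Optional, Tuple


def diff_with_percentage(baseline: float, candidate: float) -> Tuple[float, Optional[float]]:
    """Return (absolute difference, percentage difference relative to baseline)."""
    diff = candidate - baseline
    if baseline == 0:
        # A percentage relative to zero is undefined, so report only the value.
        return diff, None
    return diff, diff * 100.0 / abs(baseline)


value, percent = diff_with_percentage(100.0, 110.0)
print(f"{value:+.2f} ({percent:+.1f}%)")
```

Dividing by the absolute value of the baseline keeps the sign of the percentage aligned with the sign of the difference, even when the baseline metric is negative.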

Documentation

The documentation has been reorganized, and the AI Dynamo page now covers both Kubernetes and Slurm examples in one place. New sections have been added for parameter sweeps and test-in-scenario configuration. The workloads support matrix has been updated to reflect current platform availability.

Architectural Changes

  • Removed Test concept - Simplifies the codebase by eliminating the intermediate Test object
  • Removed TestTemplate concept - Direct workload usage instead of TestTemplate objects
  • Converted TestScenario to dataclass
  • Converted BaseSystem to Pydantic model
  • Aligned Grader and JsonGenStrategy with CmdGenStrategy patterns
  • MegatronRun workload defaults to not enabling recompute-activations
  • --distribution=arbitrary is not hardcoded for Slurm deployments anymore
  • srun commands always set the number of nodes (unless a nodelist is specified)
  • ETCD/NIXL processes are killed and waited for properly
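The srun rule in the list above (always set the node count unless a nodelist is given) can be sketched as follows. This is a hypothetical illustration rather than CloudAI's command-generation code: the function name `build_srun_args` and its parameters are assumptions, while `--nodes` and `--nodelist` are standard srun options.

```python
from typing import List, Optional


def build_srun_args(num_nodes: int, nodelist: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: build the node-selection part of an srun command."""
    args = ["srun"]
    if nodelist:
        # An explicit node list already determines which nodes are used.
        args.append(f"--nodelist={','.join(nodelist)}")
    else:
        # Otherwise, always state the number of nodes explicitly.
        args.append(f"--nodes={num_nodes}")
    return args


print(" ".join(build_srun_args(4)))
print(" ".join(build_srun_args(4, ["node01", "node02"])))
```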

All Changed

v1.5.rc4

02 Feb 09:33
88d6f5c

v1.5.rc4 Pre-release

What's Changed

  • More robust bench execution for Dynamo over k8s by @amaslenn in #793

Full Changelog: v1.5.rc3...v1.5.rc4

v1.5.rc3

28 Jan 16:07
8660590

v1.5.rc3 Pre-release

What's Changed

Full Changelog: v1.5.rc2...v1.5.rc3

v1.5.rc2

21 Jan 16:28
f5410ee

v1.5.rc2 Pre-release

What's Changed

Full Changelog: v1.5.rc1...v1.5.rc2

v1.5.rc1

13 Jan 07:49
394c622

v1.5.rc1 Pre-release

What's Changed

  • Provide CMS-friendly documentation build by @amaslenn in #769
  • Model Name/Model size for verify configs by @srivatsankrishnan in #772
  • Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
  • Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
  • Upgrade container versions for common examples by @amaslenn in #776
  • Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
  • Do not use internal URLs in documentation by @amaslenn in #775
  • Do not enable recompute-activations by default by @amaslenn in #768

Full Changelog: v1.5.beta7...v1.5.rc1

v1.5.beta7

07 Jan 15:38
7b63c79

v1.5.beta7 Pre-release

What's Changed

New Contributors

Full Changelog: v1.5.beta6...v1.5.beta7

v1.5.beta6

23 Dec 16:35
99f9158

v1.5.beta6 Pre-release

What's Changed

New Contributors

Full Changelog: v1.5.beta5...v1.5.beta6

v1.5.beta5

17 Dec 18:10
b9ff078

v1.5.beta5 Pre-release

What's Changed

Full Changelog: v1.5.beta4...v1.5.beta5

v1.5.beta4

10 Dec 16:08
8e26c01

v1.5.beta4 Pre-release

What's Changed

Full Changelog: v1.5.beta3...v1.5.beta4

v1.5.beta3

03 Dec 16:01
4e9c340

v1.5.beta3 Pre-release

What's Changed

Full Changelog: v1.5.beta2...v1.5.beta3