Releases: GoogleCloudPlatform/cluster-toolkit
v1.82.0
What's Changed
Key New Features 🎉
- A4X JBVM by @LAVEEN in #4950
- Introduced a binary ZIP archive to the release assets by @kvenkatachala333 in #5208
Improvements 🛠
- Fix the babysit files limitation with pagination logic by @SwarnaBharathiMantena in #5191
- Adding A4X Base Support to JBVM by @LAVEEN in #4834
Version Updates ⏫
- Update SLURM blueprints to point to the latest slurm-gcp release by @Neelabh94 in #5215
New Contributors
- @spaturi13 made their first contribution in #5184
Full Changelog: v1.81.0...v1.82.0
v1.81.0
What's Changed
Key New Features 🎉
- Switch to using gcsfuse profile feature in aiml gcs-bucket mounts in slurm cluster blueprints by @gargnitingoogle in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- DWS Flex start support in TPU 7x and v6e by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5111
Improvements 🛠
- Improved validations enabling early enforcement of numeric boundaries and length constraints within metadata.yaml files across several core and community modules by @AdarshK15 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5115
- Update Dockerfile and README.md instructions for a3mega nemo framework by @mufaqam-gcl in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5164
- TPU v6e DWS flex integration tests by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5135
- chore/allow hyphens in partition_name and slurm_cluster_name, increase max length to 20 for slurm_cluster_name by @rbekhtaoui in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/4316
New Contributors
- @gargnitingoogle made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- @gokamesh made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5169
Full Changelog: https://github.com/GoogleCloudPlatform/cluster-toolkit/compare/v1.80.0...v1.81.0
v1.80.0
What's Changed
Module Improvements 🔨
- Compress the H4D blueprint with multivpc and vpc module update by @SwarnaBharathiMantena in #5133
Improvements 🛠
- Adding IPV6 & IDPF support by @LAVEEN in #5066
- R&R Slurm integration by @sarthakag in #5003
Full Changelog: v1.79.0...v1.80.0
v1.79.0
v1.78.0
What's Changed
Breaking Changes 🚨
- Fix private address space for gke-a3-megagpu.yaml by @omartin2010 in #4478
Improvements 🛠
- Add precondition checks to disallow setting conflicting consumption options by @kadupoornima in #5062
Deprecations 💤
- Add deprecation notice for parallelstore module by @parulbajaj01 in #5083
- Deprecate a3u-gcs blueprint as it's no longer maintained by @bytetwin in #4871
Version Updates ⏫
- Add gIB versions v1.1.1 and v1.1.0 for arm64 by @duncanspani in #5090
New Contributors
- @AdarshK15 made their first contribution in #5095
- @duncanspani made their first contribution in #5090
- @siddhartha-quad made their first contribution in #4792
Full Changelog: v1.77.0...v1.78.0
v1.77.0
What's Changed
Key New Features 🎉
- Integrate Kueue support for GKE TPU v6 and v7x blueprints by @agrawalkhushi18 in #5007
- feat: Enable Block topology for A4X by @Neelabh94 in #5021
- Support shared reservations in gke-node-pool module by @SwarnaBharathiMantena in #5040
- Add automated GCP resource cleanup script and Cloud Build pipeline by @simrankaurb in #5039
- Add integration test for A3 high-GPU with spot VMs by @simrankaurb in #4984
- feat: Add community module for executing gcloud commands by @cboneti in #4923
Breaking Changes 🚨
- Graduate network/private-service-access to core modules by @SwarnaBharathiMantena in #5029
Improvements 🛠
- Refactor fio job template with best practices by @parulbajaj01 in #4977
- Enable h4d-vm test to run on Spot VMs by @simrankaurb in #5022
- Adding Robust destroy in cluster toolkit by @shubpal07 in #4866
Bug fixes 🐞
- Adding G4 configuration by @LAVEEN in #5024
- Use ternary operator for anywhere_cache precondition in main.tf by @Neelabh94 in #5033
Full Changelog: v1.76.0...v1.77.0
v1.76.0
What's Changed
Key New Features 🎉
- feat: Add support for Anywhere Cache in cloud-storage-bucket by @Neelabh94 in #4889
- Adding test for A3 UltraGPU JBVMs with Spot VMs by @simrankaurb in #4968
- On Spot A4 by @LAVEEN in #4953
- Enable Spot VM testing for GKE with A3 mega GPUs by @simrankaurb in #4951
- Enable Spot VM testing for a3-megagpu instances by @simrankaurb in #4901
- Add a post-deploy test specific to TPUs by @agrawalkhushi18 in #4969
Breaking Changes 🚨
- Move community/modules/project/service-account module to core modules directory by @SwarnaBharathiMantena in #4958
Module Improvements 🔨
- Make waiting for kueue installation configurable, and wait for kueue in the G4 GKE blueprint by @kadupoornima in #4973
Improvements 🛠
- Update GKE A4X Readme by @parulbajaj01 in #4955
- Add example nccl test script for slurm on gke by @ACW101 in #4960
Deprecations 💤
- Remove all references to ubuntu20.04 by @sarthakag in #4963
Full Changelog: v1.75.1...v1.76.0
v1.75.1
What's Changed
Module Improvements 🔨
- Add exclusion_end_time_behavior and update release channel maintenance window by @SwarnaBharathiMantena in #4990
Full Changelog: v1.75.0...v1.75.1
v1.75.0
What's Changed
Key New Features 🎉
- Add integration test files for TPU v6e by @agrawalkhushi18 in #4906
- Enable Spot VM testing for a3-ultragpu instances by @simrankaurb in #4862
- Add integration test for TPU 7x by @agrawalkhushi18 in #4916
- Adding ML dependencies for G4 & guidance to use dual NIC by @LAVEEN in #4922
- Enable spot VM Testing for GKE: a3ultra by @simrankaurb in #4946
Breaking Changes 🚨
- Graduate cloud-storage-bucket module to core modules and update references by @SwarnaBharathiMantena in #4927
Module Improvements 🔨
- Updating Kueue default version to 0.14.4 in A4X by @shubpal07 in #4850
Improvements 🛠
- Add NCCL test validation to G4 Integration tests by @kadupoornima in #4933
- Register job_completion output in test-gke-job.yml by @agrawalkhushi18 in #4957
Bug fixes 🐞
- Minor fix: Delegating gcloud command to localhost by @simrankaurb in #4937
Full Changelog: v1.74.0...v1.75.0
v1.74.0
What's Changed
Key New Features 🎉
- Add Google Cloud NetApp Volumes support by @okrause in #4583
- Add NCCL tests for G4 NPI by @kadupoornima in #4898
- Add TPU 7x blueprint files and changes in tpu-definition module by @agrawalkhushi18 in #4887
Module Improvements 🔨
- Add force_conflicts flag when applying manifests using kubectl by @SwarnaBharathiMantena in #4874
Improvements 🛠
- Modify the wait-for-startup-script to fix test failures by @agrawalkhushi18 in #4845
- Update recommended FI_UNIVERSE_SIZE setting for startup script by @linsword13 in #4782
- Add GCS updates to GKE A4X by @parulbajaj01 in #4864
- Graduating tpu v6e from community to core by @shubpal07 in #4909
Bug fixes 🐞
- Update the nccl-tcpxo-installer, nri-device-injector, and nccl-test for a3-megagpu-8g machines by @SwarnaBharathiMantena in #4902
- Pin mypy version in pre-commit dependencies to the last stable version, i.e. 1.18.2, by @shubpal07 in #4913
Other changes
- Hotfix v1.73.1 (#4884) by @aslam-quad in #4910
New Contributors
- @kvenkatachala333 made their first contribution in #4912
Full Changelog: v1.73.1...v1.74.0