A4X-Max Bare Metal GKE toolkit blueprint #5211

Open

vikramvs-gg wants to merge 1 commit into GoogleCloudPlatform:develop from vikramvs-gg:a4x-max-gke

Conversation


@vikramvs-gg vikramvs-gg commented Feb 8, 2026

Feature: Add GKE A4X-Max Bare Metal Blueprint

This pull request introduces a new blueprint for deploying a GKE cluster optimized for AI/ML workloads on A4X-Max Bare Metal instances. This blueprint provides a comprehensive setup for users to quickly provision a powerful and scalable environment for demanding AI training and inference tasks.

Key Features:

  • A4X-Max Bare Metal Nodes: Provisions a GKE cluster with a4x-maxgpu-4g-metal node pools.
  • High-Performance Networking:
    • Configures a dedicated GPU RDMA VPC network for high-throughput, low-latency communication between nodes.
    • Sets up two additional VPC networks for general cluster traffic and management.
  • Workload Identity: Creates and configures service accounts with appropriate permissions for both the node pool and the workloads running on the cluster.
  • Optimized Storage:
    • Provisions two Google Cloud Storage buckets: one for training data and another for model checkpoints.
    • Configures gcsfuse with optimized mount options for high-performance reading of training data and efficient writing of checkpoints.
    • Includes Persistent Volumes (PVs) for both training and checkpointing buckets.
  • Pre-configured Job Templates & Validation:
    • Includes an nvidia-smi job to verify GPU availability and driver installation.
    • Includes an NCCL all-gather performance test using a JobSet, allowing users to validate the high-speed interconnect and GPU-to-GPU communication performance that is critical for distributed training workloads.
    • Provides a comprehensive FIO benchmark job to test the performance of scratch, training, and checkpointing storage volumes.
  • Automated Driver Installation: Installs the NVIDIA DRA driver and asapd-lite to ensure the GPUs are correctly configured and ready for use.
  • Resource Management: Configures a workload policy for a high-throughput accelerator topology.
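
The gcsfuse-backed storage described above can be sketched as a static PersistentVolume using GKE's Cloud Storage FUSE CSI driver. This is a minimal illustration, not the blueprint's actual manifest: the bucket name, capacity, and mount options are assumptions, and the checkpoint volume would use a write-oriented variant.

```yaml
# Illustrative PV for the training-data bucket (read-heavy access pattern).
# Bucket name and mount options are hypothetical, not taken from the blueprint.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 1Ti                      # informational for gcsfuse-backed volumes
  accessModes:
    - ReadOnlyMany                    # training data is read-mostly
  storageClassName: ""                # static provisioning
  mountOptions:
    - implicit-dirs
    - metadata-cache:ttl-secs:-1      # cache object metadata indefinitely
    - file-cache:max-size-mb:-1       # unbounded local file cache for reads
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-training-bucket  # hypothetical bucket name
```

A PersistentVolumeClaim bound to this PV can then be mounted by training pods; a checkpoint volume would typically drop the read-oriented caches and use a read-write access mode.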

@gemini-code-assist

Summary of Changes

Hello @vikramvs-gg, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a robust and specialized blueprint for Google Kubernetes Engine (GKE) on A4X-Max Bare Metal, tailored for demanding AI/ML applications. It simplifies the deployment of a high-performance environment by integrating advanced networking, optimized storage solutions, and essential GPU drivers, alongside tools for validating the infrastructure's readiness and performance. This enables users to efficiently provision and manage GKE clusters capable of accelerating large-scale AI training and inference tasks.

Highlights

  • New GKE A4X-Max Bare Metal Blueprint: Introduces a comprehensive blueprint for deploying GKE clusters on A4X-Max Bare Metal instances, specifically optimized for AI/ML workloads.
  • High-Performance Infrastructure: Configures dedicated GPU RDMA VPC networking, optimized Google Cloud Storage buckets with gcsfuse for training and checkpoint data, and Persistent Volumes.
  • Automated Driver and Workload Management: Includes automated installation of NVIDIA DRA driver and 'asapd-lite' for GPU configuration, and integrates JobSet for managing distributed workloads.
  • Validation and Benchmarking Tools: Provides pre-configured job templates for 'nvidia-smi' to verify GPU setup, NCCL all-gather for interconnect performance, and FIO benchmarks for storage performance.
  • Enhanced Node Pool Configuration: Adds flexibility to GKE node pool configurations, allowing custom Linux node settings like hugepages and dynamic sysctls.
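
As a rough sketch of how a JobSet drives a distributed NCCL test like the one highlighted above, the skeleton below shows the general shape; the name, image, replica counts, and command are illustrative assumptions rather than the manifest in this PR.

```yaml
# Hypothetical skeleton of an NCCL all-gather JobSet (not the PR's manifest).
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-allgather
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          completions: 2              # one pod per node under test
          parallelism: 2
          completionMode: Indexed     # gives each pod a stable rank index
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: nccl-test
                  image: nccl-tests:latest  # placeholder image
                  command: ["all_gather_perf", "-b", "8", "-e", "8G", "-f", "2"]
```

A production manifest would additionally attach DRA resource claims for the NVIDIA GPUs and MRDMA interfaces so the test exercises the RDMA fabric.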
Changelog
  • examples/gke-a4x-max-bm/README.md
    • Added a new README file detailing the requirements, cluster creation process, variable configuration, NCCL test execution, and cleanup instructions for the A4X-Max Bare Metal GKE blueprint.
  • examples/gke-a4x-max-bm/asapd-lite-installer.yaml
    • Added a new Kubernetes DaemonSet manifest for 'asapd-lite', configured for privileged execution, host networking, and host PID, with logic to address network interface issues and specific node affinity.
  • examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml
    • Added a new deployment configuration file defining input variables for the A4X-Max GKE blueprint, including project details, node counts, and reservation settings.
  • examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml
    • Added the core blueprint definition, orchestrating the creation of GKE clusters, multiple VPCs (including GPU RDMA), service accounts, A4X-Max node pools with NVIDIA GB300 accelerators, and integration of 'asapd-lite', 'JobSet', and NVIDIA DRA driver. It also includes GCS bucket provisioning with gcsfuse and job templates for 'nvidia-smi' and FIO benchmarks.
  • examples/gke-a4x-max-bm/nccl-jobset-example.yaml
    • Added a Kubernetes JobSet manifest for running NCCL all-gather performance tests, configured with resource claims for NVIDIA GPUs and MRDMA, and a script for distributed execution.
  • examples/gke-a4x-max-bm/nvidia-dra-driver.yaml
    • Added Kubernetes Namespace and ResourceQuota definitions for the NVIDIA DRA driver, setting resource limits and priority class scope selectors.
  • modules/compute/gke-node-pool/README.md
    • Updated documentation to reflect the addition of the 'linux_node_config' input variable.
  • modules/compute/gke-node-pool/main.tf
    • Modified the node pool resource to dynamically configure Linux node settings, including 'sysctls' and 'hugepages_config', via the new 'linux_node_config' variable.
  • modules/compute/gke-node-pool/variables.tf
    • Introduced the 'linux_node_config' input variable, allowing users to specify custom sysctls and hugepages configurations for GKE nodes.
  • modules/management/kubectl-apply/README.md
    • Updated documentation to include the 'asapd_lite' module and expanded 'nvidia_dra_driver' input options.
  • modules/management/kubectl-apply/main.tf
    • Refactored manifest processing logic for improved dependency handling and added conditional application of manifests based on cluster existence.
    • Integrated the 'asapd_lite' installation module and updated NVIDIA DRA driver configuration.
  • modules/management/kubectl-apply/variables.tf
    • Added 'accelerator_type' to the 'nvidia_dra_driver' variable and introduced a new 'asapd_lite' variable for its installation control.
  • modules/network/gpu-rdma-vpc/README.md
    • Updated provider requirements to 'google >= 6.40' and adjusted resource references in the documentation.
  • modules/network/gpu-rdma-vpc/main.tf
    • Implemented conditional logic for MRDMA subnetwork interface creation and integrated a 'google_compute_subnetworks' data source for RDMA subnet filtering, removing a previous 'terraform_data' resource.
  • modules/network/gpu-rdma-vpc/versions.tf
    • Added a specific version constraint for the 'google' Terraform provider ('>= 6.40').
  • modules/scheduler/gke-cluster/README.md
    • Updated documentation to reflect the new 'enable_shielded_nodes' input variable.
  • modules/scheduler/gke-cluster/main.tf
    • Modified the GKE cluster resource to use the 'enable_shielded_nodes' input variable for configuring shielded nodes.
  • modules/scheduler/gke-cluster/variables.tf
    • Added the 'enable_shielded_nodes' input variable, allowing explicit control over shielded nodes feature.
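
The new gke-node-pool and gke-cluster inputs listed above surface as blueprint settings. Below is a hedged sketch of how they might appear in a blueprint module list; all values are illustrative, and the exact schema should be checked against each module's variables.tf.

```yaml
# Hypothetical blueprint excerpt exercising the new inputs; values are examples.
- id: a4x_cluster
  source: modules/scheduler/gke-cluster
  settings:
    enable_shielded_nodes: false       # bare-metal nodes may need this disabled
- id: a4x_pool
  source: modules/compute/gke-node-pool
  use: [a4x_cluster]
  settings:
    linux_node_config:
      sysctls:
        net.core.rmem_max: "134217728" # larger receive buffers for RDMA traffic
      hugepages_config:
        hugepage_size_1g: 2            # reserve two 1 GiB hugepages per node
```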
Activity
  • vikramvs-gg created this pull request to introduce the GKE A4X-Max Bare Metal blueprint.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable new blueprint for deploying GKE on A4X-Max Bare Metal instances. The changes are well-structured, including the addition of new example files and necessary modifications to existing Terraform modules like gke-node-pool, gke-cluster, and kubectl-apply. The refactoring in kubectl-apply to better handle dependencies is a notable improvement.

However, there is a critical issue that needs to be addressed. The new example gke-a4x-max-bm has not been added to the main examples index file, examples/README.md. This is a violation of the repository's style guide (rule 33), which states that new examples must be indexed. Please update this file to include the new example.

I have also left a few comments on the documentation and a potential issue with a removed validation in one of the Terraform modules, which are included in the specific review comments.

@vikramvs-gg vikramvs-gg added enhancement New feature or request release-key-new-features Added to release notes under the "Key New Features" heading. release-improvements Added to release notes under the "Improvements" heading. labels Feb 8, 2026
@vikramvs-gg vikramvs-gg force-pushed the a4x-max-gke branch 5 times, most recently from 2d6d730 to cca5d39 on February 10, 2026 17:37
@vikramvs-gg vikramvs-gg removed the release-improvements Added to release notes under the "Improvements" heading. label Feb 10, 2026
@vikramvs-gg vikramvs-gg marked this pull request as ready for review February 10, 2026 17:42
@vikramvs-gg vikramvs-gg requested review from a team and samskillman as code owners February 10, 2026 17:42
@vikramvs-gg vikramvs-gg requested review from SwarnaBharathiMantena and removed request for SwarnaBharathiMantena February 12, 2026 05:53