
Modernize GKE A3 High blueprint and align integration tests #5246

Draft

shubpal07 wants to merge 1 commit into GoogleCloudPlatform:develop from shubpal07:shubham/a3-high-upgrade

Conversation

@shubpal07
Contributor

This PR upgrades the GKE A3 High (a3-highgpu-8g) blueprint to align with the
standards of the GKE A* family (A3 Mega, A3 Ultra, and A4). It restructures the
blueprint into a dedicated directory, adds comprehensive documentation, and
introduces advanced features such as Kueue Topology Aware Scheduling (TAS)
support and Cluster Health Services (CHS).

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow the Cluster Toolkit Contribution guidelines

Change-Id: Ie064dcac4ec7c7e23909024c6c4f537275f045f2
@gemini-code-assist
Contributor

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly modernizes the GKE A3 High blueprint by reorganizing its structure, enhancing its documentation, and integrating advanced features. The changes aim to align the blueprint with the standards of other GKE A* family offerings, providing users with a more robust and feature-rich solution for high-performance ML training. Key additions include support for Kueue's Topology Aware Scheduling for efficient GPU workload management and the implementation of Cluster Health Services for automated GPU diagnostics, ensuring optimal cluster performance and reliability.

Highlights

  • Blueprint Restructuring: The GKE A3 High blueprint has been moved into its own dedicated directory (examples/gke-a3-highgpu/) to improve organization and modularity, aligning it with other GKE A* family blueprints.
  • Enhanced Documentation: Comprehensive documentation has been added, including a new README.md within the blueprint's directory, providing detailed instructions for deployment, configuration, and performance verification.
  • Kueue Topology Aware Scheduling (TAS) Integration: The blueprint now includes support for Kueue, enabling Topology Aware Scheduling for optimized workload placement on A3 High GPU nodes, configured via new Kubernetes resources (a sketch of such a configuration follows this list).
  • Cluster Health Services (CHS) Implementation: Automated GPU health checks and diagnostic reporting are introduced through Cluster Health Services (CHS), configured as a Kubernetes CronJob with associated permissions and persistent storage.
  • Updated Blueprint Configuration: The main blueprint file (gke-a3-highgpu.yaml) has been updated with new variables and configurations to support Kueue, CHS, GCS FUSE CSI, and more flexible node pool settings like reservation affinity.
  • Integration Test Alignment: Existing integration tests (gke-a3-highgpu-onspot.yaml, gke-a3-highgpu.yaml, gke-a3-highgpu-onspot.yml, gke-a3-highgpu.yml) have been updated to reflect the new blueprint path and incorporate new test variables and validation for Kueue configuration.
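
To make the TAS highlight concrete, here is a minimal sketch of the kind of Kueue resources involved. The resource names, topology levels, and the nvidia-h100-80gb accelerator label are illustrative assumptions, not necessarily what this PR's kueue-configuration.yaml.tftpl contains:

apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: gke-topology
spec:
  levels:
  # Ordered from the broadest placement domain down to the individual node.
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gce-topology-host
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a3-high-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: gke-topology

A ClusterQueue then references the flavor, and workloads opt into topology-aware placement through pod template annotations such as kueue.x-k8s.io/podset-required-topology.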
Changelog
  • examples/README.md
    • Updated the description for the gke-a3-highgpu.yaml blueprint to reflect its new capabilities and refer to the dedicated deployment guide.
    • Removed the whatismyip.com reference from the authorized CIDR note.
    • Adjusted the link to the gke-a3-highgpu.yaml blueprint to its new directory path.
  • examples/gke-a3-highgpu/README.md
    • Added a new, comprehensive deployment guide for the A3 High GKE cluster, detailing prerequisites, configuration, deployment steps, NCCL performance verification, and cleanup procedures.
  • examples/gke-a3-highgpu/chs-cronjob.yaml.tftpl
    • Added a new Kubernetes CronJob template for Cluster Health Services (CHS), which periodically runs diagnostic checks on GPU nodes and saves results.
  • examples/gke-a3-highgpu/chs-permissions.yaml.tftpl
    • Added new Kubernetes ServiceAccount, ClusterRole, and ClusterRoleBinding templates to grant necessary permissions for the Cluster Health Services (CHS) CronJob.
  • examples/gke-a3-highgpu/chs-pvc.yaml.tftpl
    • Added a new Kubernetes PersistentVolumeClaim template for storing output from the Cluster Health Services (CHS) CronJob.
  • examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml
    • Added a new deployment configuration file with default variables for project ID, deployment name, region, zone, authorized CIDR, static node count, reservation, and optional periodic health checks.
  • examples/gke-a3-highgpu/gke-a3-highgpu.yaml
    • Renamed and moved the blueprint file from examples/gke-a3-highgpu.yaml to examples/gke-a3-highgpu/gke-a3-highgpu.yaml.
    • Introduced new variables for Kueue configuration path, static node count, reservation, accelerator type, GKE version prefix, and various Cluster Health Services (CHS) settings.
    • Added the container.admin IAM role to the node_pool_service_account.
    • Enabled GCS FUSE CSI for the GKE cluster.
    • Configured the GKE node pool to use specified accelerator types and reservation affinity.
    • Integrated kubectl-apply manifests for CHS permissions, PVC, and CronJob, enabling conditional deployment based on enable_periodic_health_checks.
    • Updated Kueue installation to include a configurable path for its configuration and dynamic template variables for GPU count and accelerator type (see the sketch after this changelog).
  • examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl
    • Added new Kubernetes resource templates for Kueue, including Topology, ResourceFlavor, ClusterQueue, and LocalQueue, to enable Topology Aware Scheduling for A3 High GPUs.
  • examples/gke-a3-highgpu/read-chs-logs-job.yaml
    • Added a new Kubernetes Job to facilitate reading the output logs generated by the Cluster Health Services (CHS) CronJob from the PVC.
  • tools/cloud-build/daily-tests/builds/gke-a3-highgpu-onspot.yaml
    • Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
  • tools/cloud-build/daily-tests/builds/gke-a3-highgpu.yaml
    • Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
  • tools/cloud-build/daily-tests/tests/gke-a3-highgpu-onspot.yml
    • Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
    • Removed the network variable and added new variables for static_node_count, instance_type, accelerator_type, and num_gpus.
    • Modified cli_deployment_vars to use the new static_node_count variable and removed network_name.
    • Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
  • tools/cloud-build/daily-tests/tests/gke-a3-highgpu.yml
    • Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
    • Removed reservation_affinity and network variables, and added new variables for extended_reservation, static_node_count, instance_type, accelerator_type, and num_gpus.
    • Modified cli_deployment_vars to use the new static_node_count and reservation variables, and removed network_name.
    • Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
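
As a concrete illustration of the Kueue wiring mentioned in the changelog, the Cluster Toolkit's kubectl-apply module supports installing Kueue from a templated configuration. The module ID and variable names below are assumptions for illustration; only the idea of a configurable config path and GPU-count/accelerator template variables comes from this PR:

  - id: workload_component_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      kueue:
        install: true
        # Render the Kueue TAS resources from a template staged with the blueprint.
        config_path: $(ghpc_stage(vars.kueue_configuration_path))
        config_template_vars:
          num_gpus: $(vars.num_gpus)
          accelerator_type: $(vars.accelerator_type)

The CHS manifests described above would go through the same module's apply_manifests mechanism, gated on enable_periodic_health_checks.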
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.


@gemini-code-assist bot left a comment


Code Review

This pull request is a significant and valuable modernization of the GKE A3 High blueprint. The restructuring into a dedicated directory, addition of comprehensive documentation, and integration of advanced features like Kueue for Topology Aware Scheduling and Cluster Health Services (CHS) are excellent improvements that align it with the standards of other A* family blueprints.

My review focuses on a few key areas to further enhance the quality of these changes:

  • Security: I've identified a couple of instances where permissions (both for a GCP IAM role and a Kubernetes ClusterRole) are overly broad. My suggestions aim to tighten these permissions by following the principle of least privilege.
  • Efficiency and Reliability: The new CronJob for health checks can be made much more efficient and reliable by using a pre-built container image instead of installing dependencies on every run, aligning with guidelines for complex inline scripts.
  • Maintainability: I've pointed out a minor issue with an outdated API version in the Kueue configuration to ensure future compatibility, and highlighted the need for consistent placeholder formatting as per repository rules.

Overall, this is a strong contribution. Addressing these points will improve the security, performance, and long-term maintainability of this blueprint.

- stackdriver.resourceMetadata.writer
- storage.objectAdmin
- artifactregistry.reader
- container.admin


high

The workload_service_account is granted the container.admin (roles/container.admin) IAM role. This role provides full control over GKE clusters, including creation and deletion, which is overly permissive for a workload service account, even one used for health checks. The cron job script appears to only need gcloud container clusters get-credentials, which requires the container.clusters.get permission.

To follow the principle of least privilege, please replace container.admin with a more restrictive role. roles/container.clusterViewer should be sufficient for getting cluster credentials. If other permissions are needed, they should be added explicitly rather than using a broad admin role.

      - container.clusterViewer
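
Applied in context, the quoted roles list would then read as follows (a sketch; only the surrounding role names are taken from the diff above):

- stackdriver.resourceMetadata.writer
- storage.objectAdmin
- artifactregistry.reader
- container.clusterViewer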

Comment on lines +37 to +65
- |
  set -ex
  set -x
  apt-get update && apt-get install -y git curl gnupg -y
  git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner
  cd cluster-health-scanner
  apt-get install -y apt-transport-https ca-certificates
  curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
  echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
  apt-get update
  apt-get install -y google-cloud-cli kubectl
  apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin
  curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  pip3 install -r cli/requirements.txt
  gcloud container clusters get-credentials ${deployment_name} --region ${region} --project ${project_id}
  OUTPUT_DIR="/mnt/output"
  mkdir -p $OUTPUT_DIR
  TIMESTAMP="`date "+%Y-%m-%d %H:%M:%S"`"
  OUTPUT_FILENAME="${deployment_name}_healthscan_result_$TIMESTAMP.txt"
  FULL_OUTPUT_PATH="$OUTPUT_DIR/$OUTPUT_FILENAME"
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c gpu --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c nccl --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c straggler --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c neper --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c tinymax --run_only_on_available_nodes
  #python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c status --run_only_on_available_nodes > "$FULL_OUTPUT_PATH" 2>&1
  kubectl get nodes -o custom-columns="NODE:.metadata.name,NCCL_MARK:.metadata.labels.aiinfra/nccl-healthcheck-test,NCCL_BANDWIDTH:.metadata.labels.aiinfra/nccl-healthcheck-bandwidth,NCCL_RESULT:.metadata.labels.aiinfra/nccl-healthcheck-result,NCCL_RUNTIME:.metadata.labels.aiinfra/nccl-healthcheck-runtime-sec,TINYMAX_MARK:.metadata.labels.aiinfra/tinymax-healthcheck-test,TINYMAX_RESULT:.metadata.labels.aiinfra/tinymax-healthcheck-result,TINYMAX_RUNTIME:.metadata.labels.aiinfra/tinymax-healthcheck-runtime-sec,GPU_MARK:.metadata.labels.aiinfra/gpu-healthcheck-test,GPU_RESULT:.metadata.labels.aiinfra/gpu-healthcheck-result,GPU_RUNTIME:.metadata.labels.aiinfra/gpu-healthcheck-runtime-sec" > "$FULL_OUTPUT_PATH" 2>&1
  echo "Health scan outputs saved to $OUTPUT_DIR"
  echo "Final output file: $OUTPUT_FILENAME"


high

The shell script within this CronJob has some areas for improvement regarding efficiency, reliability, and clarity:

  • Inefficiency: The script installs multiple packages (git, gcloud, kubectl, helm) and clones a git repository on every execution. This is slow, generates unnecessary network traffic, and makes the job's success dependent on the availability of external package repositories. A better approach is to build a custom container image with all dependencies pre-installed. The CronJob would then use this pre-built image, making it much faster and more reliable.

  • Reproducibility: The script clones the main branch of the cluster-health-scanner repository. It is a best practice to pin to a specific git tag or commit hash to ensure reproducible builds and avoid unexpected failures from upstream changes.

  • Minor script issues:

    • set -x on line 39 is redundant as set -ex on line 38 already enables it.
    • The -y flag is specified twice in the apt-get install command on line 40.

I recommend creating a dedicated Dockerfile for this health scanner job to address these points.

References
  1. Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability. Inline shell scripts within YAML configurations are acceptable for minor, declarative additions that align with existing patterns. For more complex logic, use separate script files.
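
For reference, a hedged sketch of how the CronJob spec could look once dependencies are baked into an image. The image path, schedule, service account, PVC name, and entrypoint script are all assumptions, not part of this PR:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: chs-healthscan
spec:
  schedule: "0 */6 * * *"  # assumption: run every six hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-health-scanner-job-sa
          restartPolicy: Never
          containers:
          - name: healthscan
            # Assumption: a custom image with gcloud, kubectl, helm, the
            # scanner's Python dependencies, and cluster-health-scanner
            # pinned to a release tag, built from a dedicated Dockerfile.
            image: REGION-docker.pkg.dev/PROJECT_ID/chs/healthscan:v1.0.0
            command: ["/bin/bash", "/opt/chs/run-healthscan.sh"]
            volumeMounts:
            - name: output
              mountPath: /mnt/output
          volumes:
          - name: output
            persistentVolumeClaim:
              claimName: chs-output-pvc

Pinning the scanner at image-build time also resolves the reproducibility concern: upstream changes to the main branch can no longer break a scheduled run.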

Comment on lines +23 to +60
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-health-scanner-job-role
rules:
- apiGroups: [""]
  resources:
  - "pods"
  - "pods/log"
  - "pods/exec"
  - "nodes"
  - "events"
  - "services"
  - "secrets"
  - "configmaps"
  - "serviceaccounts"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]

- apiGroups: ["apps"]
  resources:
  - "daemonsets"
  - "deployments"
  - "replicasets"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]

- apiGroups: ["batch"]
  resources:
  - "jobs"
  - "jobs/status"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]

- apiGroups: ["rbac.authorization.k8s.io"]
  resources:
  - "clusterrolebindings"
  - "clusterroles"
  - "roles"
  - "rolebindings"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]


high

The cluster-health-scanner-job-role ClusterRole grants excessive permissions. Specifically, allowing ["create", "delete", "patch", "update"] on clusterrolebindings, clusterroles, roles, and rolebindings (the rbac.authorization.k8s.io API group) is equivalent to granting cluster-admin privileges. This violates the principle of least privilege and poses a significant security risk.

The role should be scoped down to the minimum permissions required for the health scanner to function. It's unlikely that it needs to modify RBAC resources. Please review the permissions required by the cluster-health-scanner tool and remove unnecessary privileges, especially for the rbac.authorization.k8s.io group.
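
One possible shape for a scoped-down role, shown as a sketch (the exact resources and verbs must be verified against what cluster-health-scanner actually creates; the rbac.authorization.k8s.io block is dropped entirely):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-health-scanner-job-role
rules:
# Read access to the cluster state the scanner inspects.
- apiGroups: [""]
  resources: ["pods", "pods/log", "nodes", "events", "configmaps"]
  verbs: ["get", "list", "watch"]
# Exec into pods only if a diagnostic check requires it.
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
# Create and clean up the diagnostic workloads themselves.
- apiGroups: ["batch"]
  resources: ["jobs", "jobs/status"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: ["apps"]
  resources: ["daemonsets", "deployments"]
  verbs: ["get", "list", "watch", "create", "delete"]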

type: gcs
configuration:
  # The GCS bucket used for storing terraform state
  bucket: BUCKET_NAME


medium

The placeholder values in this file (e.g., BUCKET_NAME, DEPLOYMENT_NAME) are inconsistent with the established style in other blueprint files, which use comments as placeholders (e.g., ## Set GCP Project ID Here ##). One of the repository's general rules also specifies using a key: # comment format for placeholders.

To maintain consistency, please update the placeholder values in this file to follow the comment-based style. For example, bucket: BUCKET_NAME could be changed to bucket: # YOUR_TERRAFORM_STATE_BUCKET.

References
  1. In YAML blueprint files, use the 'key: # comment' format for placeholder variables to maintain consistency with other blueprints in the repository, rather than using empty strings ('""').
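
Assuming the quoted snippet sits under a terraform_backend_defaults block, as in other Toolkit blueprints, the comment-based style would look like this sketch:

terraform_backend_defaults:
  type: gcs
  configuration:
    # The GCS bucket used for storing terraform state
    bucket: # YOUR_TERRAFORM_STATE_BUCKET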

# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: kueue.x-k8s.io/v1alpha1


medium

The Topology resource is using apiVersion: kueue.x-k8s.io/v1alpha1. This API version has been deprecated and is no longer supported in recent versions of Kueue.

To ensure future compatibility and maintain consistency with the other Kueue resources in this file (which correctly use v1beta1), please update the apiVersion.

apiVersion: kueue.x-k8s.io/v1beta1
