Modernize GKE A3 High blueprint and align integration tests#5246
shubpal07 wants to merge 1 commit into GoogleCloudPlatform:develop from
Conversation
Change-Id: Ie064dcac4ec7c7e23909024c6c4f537275f045f2
Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly modernizes the GKE A3 High blueprint by reorganizing its structure, enhancing its documentation, and integrating advanced features. The changes aim to align the blueprint with the standards of other GKE A* family offerings, providing users with a more robust and feature-rich solution for high-performance ML training. Key additions include support for Kueue's Topology Aware Scheduling for efficient GPU workload management and the implementation of Cluster Health Services for automated GPU diagnostics, ensuring optimal cluster performance and reliability.
Code Review
This pull request is a significant and valuable modernization of the GKE A3 High blueprint. The restructuring into a dedicated directory, addition of comprehensive documentation, and integration of advanced features like Kueue for Topology Aware Scheduling and Cluster Health Services (CHS) are excellent improvements that align it with the standards of other A* family blueprints.
My review focuses on a few key areas to further enhance the quality of these changes:
- Security: I've identified a couple of instances where permissions (both for a GCP IAM role and a Kubernetes ClusterRole) are overly broad. My suggestions aim to tighten these permissions by following the principle of least privilege.
- Efficiency and Reliability: The new CronJob for health checks can be made much more efficient and reliable by using a pre-built container image instead of installing dependencies on every run, aligning with guidelines for complex inline scripts.
- Maintainability: I've pointed out a minor issue with an outdated API version in the Kueue configuration to ensure future compatibility, and highlighted the need for consistent placeholder formatting as per repository rules.
Overall, this is a strong contribution. Addressing these points will improve the security, performance, and long-term maintainability of this blueprint.
```yaml
- stackdriver.resourceMetadata.writer
- storage.objectAdmin
- artifactregistry.reader
- container.admin
```
The `workload_service_account` is granted the `container.admin` (`roles/container.admin`) IAM role. This role provides full control over GKE clusters, including creation and deletion, which is overly permissive for a workload service account, even one used for health checks. The cron job script appears to only need `gcloud container clusters get-credentials`, which requires the `container.clusters.get` permission.

To follow the principle of least privilege, please replace `container.admin` with a more restrictive role. `roles/container.clusterViewer` should be sufficient for getting cluster credentials. If other permissions are needed, they should be added explicitly rather than using a broad admin role.
```yaml
- container.clusterViewer
```

```yaml
- |
  set -ex
  set -x
  apt-get update && apt-get install -y git curl gnupg -y
  git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner
  cd cluster-health-scanner
  apt-get install -y apt-transport-https ca-certificates
  curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
  echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
  apt-get update
  apt-get install -y google-cloud-cli kubectl
  apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin
  curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  pip3 install -r cli/requirements.txt
  gcloud container clusters get-credentials ${deployment_name} --region ${region} --project ${project_id}
  OUTPUT_DIR="/mnt/output"
  mkdir -p $OUTPUT_DIR
  TIMESTAMP="`date "+%Y-%m-%d %H:%M:%S"`"
  OUTPUT_FILENAME="${deployment_name}_healthscan_result_$TIMESTAMP.txt"
  FULL_OUTPUT_PATH="$OUTPUT_DIR/$OUTPUT_FILENAME"
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c gpu --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c nccl --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c straggler --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c neper --run_only_on_available_nodes
  python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c tinymax --run_only_on_available_nodes
  #python3 cli/cluster_diag.py -o gke healthscan ${machine_type} -c status --run_only_on_available_nodes > "$FULL_OUTPUT_PATH" 2>&1
  kubectl get nodes -o custom-columns="NODE:.metadata.name,NCCL_MARK:.metadata.labels.aiinfra/nccl-healthcheck-test,NCCL_BANDWIDTH:.metadata.labels.aiinfra/nccl-healthcheck-bandwidth,NCCL_RESULT:.metadata.labels.aiinfra/nccl-healthcheck-result,NCCL_RUNTIME:.metadata.labels.aiinfra/nccl-healthcheck-runtime-sec,TINYMAX_MARK:.metadata.labels.aiinfra/tinymax-healthcheck-test,TINYMAX_RESULT:.metadata.labels.aiinfra/tinymax-healthcheck-result,TINYMAX_RUNTIME:.metadata.labels.aiinfra/tinymax-healthcheck-runtime-sec,GPU_MARK:.metadata.labels.aiinfra/gpu-healthcheck-test,GPU_RESULT:.metadata.labels.aiinfra/gpu-healthcheck-result,GPU_RUNTIME:.metadata.labels.aiinfra/gpu-healthcheck-runtime-sec" > "$FULL_OUTPUT_PATH" 2>&1
  echo "Health scan outputs saved to $OUTPUT_DIR"
  echo "Final output file: $OUTPUT_FILENAME"
```
The shell script within this CronJob has some areas for improvement regarding efficiency, reliability, and clarity:

- Inefficiency: The script installs multiple packages (`git`, `gcloud`, `kubectl`, `helm`) and clones a git repository on every execution. This is slow, generates unnecessary network traffic, and makes the job's success dependent on the availability of external package repositories. A better approach is to build a custom container image with all dependencies pre-installed. The CronJob would then use this pre-built image, making it much faster and more reliable.
- Reproducibility: The script clones the `main` branch of the `cluster-health-scanner` repository. It is a best practice to pin to a specific git tag or commit hash to ensure reproducible builds and avoid unexpected failures from upstream changes.
- Minor script issues: `set -x` on line 39 is redundant, as `set -ex` on line 38 already enables tracing; and the `-y` flag is specified twice in the `apt-get install` command on line 40.
I recommend creating a dedicated Dockerfile for this health scanner job to address these points.
References
- Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability. Inline shell scripts within YAML configurations are acceptable for minor, declarative additions that align with existing patterns. For more complex logic, use separate script files.
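As one illustration of that recommendation, the CronJob's runtime installs could be baked into an image at build time. This is only a sketch: the base image, the `--break-system-packages` flag, and the `SCANNER_REF` build argument are assumptions, not part of this PR, and the install steps are simply lifted from the inline script above.

```dockerfile
# Illustrative sketch: pre-install the health-scanner dependencies at build time
# so the CronJob only runs the scan itself.
FROM debian:bookworm-slim

# Same tooling the inline script installs on every run, done once here.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      git curl gnupg ca-certificates apt-transport-https python3 python3-pip && \
    curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
      | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && \
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
      > /etc/apt/sources.list.d/google-cloud-sdk.list && \
    apt-get update && \
    apt-get install -y google-cloud-cli kubectl google-cloud-cli-gke-gcloud-auth-plugin && \
    rm -rf /var/lib/apt/lists/*

# Pin the scanner to a tag or commit instead of cloning `main` at runtime.
# SCANNER_REF is a hypothetical build arg; replace with a real tag/commit.
ARG SCANNER_REF=main
RUN git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner /opt/cluster-health-scanner && \
    git -C /opt/cluster-health-scanner checkout "${SCANNER_REF}"

WORKDIR /opt/cluster-health-scanner
RUN pip3 install --break-system-packages -r cli/requirements.txt
```

With an image like this, the CronJob command would reduce to fetching cluster credentials and invoking `cli/cluster_diag.py`, and a broken upstream package mirror could no longer fail a scheduled scan.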
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-health-scanner-job-role
rules:
- apiGroups: [""]
  resources:
  - "pods"
  - "pods/log"
  - "pods/exec"
  - "nodes"
  - "events"
  - "services"
  - "secrets"
  - "configmaps"
  - "serviceaccounts"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]
- apiGroups: ["apps"]
  resources:
  - "daemonsets"
  - "deployments"
  - "replicasets"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]
- apiGroups: ["batch"]
  resources:
  - "jobs"
  - "jobs/status"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources:
  - "clusterrolebindings"
  - "clusterroles"
  - "roles"
  - "rolebindings"
  verbs: ["list", "get", "create", "delete", "watch", "patch", "update"]
```
The cluster-health-scanner-job-role ClusterRole grants excessive permissions. Specifically, allowing ["create", "delete", "patch", "update"] on clusterrolebindings, clusterroles, roles, and rolebindings (the rbac.authorization.k8s.io API group) is equivalent to granting cluster-admin privileges. This violates the principle of least privilege and poses a significant security risk.
The role should be scoped down to the minimum permissions required for the health scanner to function. It's unlikely that it needs to modify RBAC resources. Please review the permissions required by the cluster-health-scanner tool and remove unnecessary privileges, especially for the rbac.authorization.k8s.io group.
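For reference, a tightened role might look something like the following sketch. The exact resource and verb lists here are assumptions and would need to be verified against what the cluster-health-scanner tool actually does; the key point is that the `rbac.authorization.k8s.io` rule is gone entirely.

```yaml
# Sketch of a least-privilege role for the scanner job (resource/verb lists
# are illustrative and must be validated against the tool's real needs).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-health-scanner-job-role
rules:
# Read-only access to inspect cluster state and results.
- apiGroups: [""]
  resources: ["nodes", "events", "pods", "pods/log"]
  verbs: ["get", "list", "watch"]
# The scanner launches test workloads, so it needs to manage Jobs.
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "delete"]
# Node labels are written by the health checks (see the kubectl custom-columns
# output above), so patch on nodes may be required.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["patch"]
# Deliberately no rules for rbac.authorization.k8s.io: the job should not
# manage RBAC objects.
```

Dropping the RBAC-group rule alone removes the privilege-escalation path, since a subject that can write ClusterRoleBindings can grant itself anything.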
```yaml
type: gcs
configuration:
  # The GCS bucket used for storing terraform state
  bucket: BUCKET_NAME
```
The placeholder values in this file (e.g., BUCKET_NAME, DEPLOYMENT_NAME) are inconsistent with the established style in other blueprint files, which use comments as placeholders (e.g., ## Set GCP Project ID Here ##). One of the repository's general rules also specifies using a key: # comment format for placeholders.
To maintain consistency, please update the placeholder values in this file to follow the comment-based style. For example, bucket: BUCKET_NAME could be changed to bucket: # YOUR_TERRAFORM_STATE_BUCKET.
References
- In YAML blueprint files, use the 'key: # comment' format for placeholder variables to maintain consistency with other blueprints in the repository, rather than using empty strings ('""').
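Applied to the fragment under review, that rule would give something like the following; the placeholder comment text itself is illustrative, not mandated.

```yaml
# Comment-style placeholders per the repository's blueprint convention.
type: gcs
configuration:
  # The GCS bucket used for storing terraform state
  bucket: # YOUR_TERRAFORM_STATE_BUCKET
```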
```yaml
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: kueue.x-k8s.io/v1alpha1
```
The Topology resource is using apiVersion: kueue.x-k8s.io/v1alpha1. This API version has been deprecated and is no longer supported in recent versions of Kueue.
To ensure future compatibility and maintain consistency with the other Kueue resources in this file (which correctly use v1beta1), please update the apiVersion.
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
```
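For context, a complete Topology resource on the `v1beta1` API might look like this sketch. The resource name and the node-label levels shown are assumptions for illustration; the levels must match labels that GKE actually sets on the cluster's nodes, ordered from broadest to narrowest domain.

```yaml
# Sketch of a Kueue Topology for TAS (name and labels are illustrative).
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: gke-default
spec:
  levels:
  # Broadest physical domain first, individual host last.
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gce-topology-subblock
  - nodeLabel: cloud.google.com/gce-topology-host
  - nodeLabel: kubernetes.io/hostname
```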
This PR upgrades the GKE A3 High (a3-highgpu-8g) blueprint to align with the
standards of the GKE A* family (A3 Mega, Ultra, and A4). It restructures the
blueprint into a dedicated directory, adds comprehensive documentation, and
introduces advanced features like Kueue TAS support and Cluster Health
Services (CHS).
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.