feat: full_quota_impl #5140

Draft
kvenkatachala333 wants to merge 7 commits into GoogleCloudPlatform:develop from kvenkatachala333:quota-clean

@kvenkatachala333 (Member) commented Jan 28, 2026

This pull request introduces a robust Quota Availability Validator to the Cluster Toolkit, designed to "fail fast" by verifying resource availability before infrastructure provisioning begins. This proactively prevents deployment failures caused by insufficient quotas in the target Google Cloud project and region.

Key Implementation Highlights

  1. Core Validator Logic (pkg/validators/quota.go): Implements a new validator that walks through blueprint modules to calculate total resource needs and compares them against real-time data from the Compute Engine Quota API.
  2. Early Deployment Integration: The validator is integrated into the doDeploy function within cmd/deploy.go, ensuring checks occur at the very start of the deployment process.
  3. Broad Resource Coverage:
    a. Compute: Supports family-specific metrics (e.g., C3_CPUS, C4_CPUS, H100_CPUS) and correctly handles preemptible/spot resource prefixes.
    b. GPUs: Maps various accelerator types (A100, H100, L4, etc.) to their specific regional and global metrics.
    c. Storage: Validates standard PD, SSD, Balanced, Extreme, and the newer Hyperdisk Balanced (including IOPS and Throughput metrics).
    d. Specialty Services: Includes validation for Filestore capacity and TPU (v2, v3) core requirements.
  4. Resiliency & Error Handling: Includes a retryCall mechanism with exponential backoff for GCP API interactions and enhanced handling for common issues like rate limits (429) or permission errors (403).
  5. Unit Testing: Added a comprehensive test suite (pkg/validators/quota_test.go) utilizing a mock GCP client to verify resource mapping and collection logic across diverse blueprint configurations.
  6. Tooling Updates: Updated .pre-commit-config.yaml to utilize golangci-lint (v1.63.4) for improved code quality enforcement.
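
To make the core check in item 1 concrete, here is a minimal sketch of the aggregate-then-compare step. It is not the code in pkg/validators/quota.go: the types `Requirement` and `QuotaInfo`, the helpers `cpuMetric`, `aggregate`, and `checkQuota`, and the small family table are all hypothetical names introduced for illustration, and the quota map stands in for whatever the Compute Engine Quota API returns. The metric names are taken from the examples above or are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Requirement is a hypothetical record of how much of one quota metric
// a blueprint module needs in one region.
type Requirement struct {
	Metric string
	Region string
	Amount float64
}

// QuotaInfo is a hypothetical view of one quota entry: its limit and current usage.
type QuotaInfo struct {
	Limit float64
	Usage float64
}

// key identifies a quota by metric and region.
type key struct{ Metric, Region string }

// familyCPUMetric is an illustrative table, not the PR's full mapping.
var familyCPUMetric = map[string]string{"c3": "C3_CPUS", "c4": "C4_CPUS"}

// cpuMetric picks a CPU quota metric for a machine type (assumed mapping):
// spot/preemptible VMs count against PREEMPTIBLE_CPUS, known families use
// their family-specific metric, and everything else falls back to CPUS.
func cpuMetric(machineType string, spot bool) string {
	if spot {
		return "PREEMPTIBLE_CPUS"
	}
	family, _, _ := strings.Cut(machineType, "-")
	if m, ok := familyCPUMetric[strings.ToLower(family)]; ok {
		return m
	}
	return "CPUS"
}

// aggregate sums all requirements that share a metric and region.
func aggregate(reqs []Requirement) map[key]float64 {
	totals := make(map[key]float64)
	for _, r := range reqs {
		totals[key{r.Metric, r.Region}] += r.Amount
	}
	return totals
}

// checkQuota returns one sorted message per metric whose total need exceeds
// the remaining quota (limit minus usage), or whose quota is unknown.
func checkQuota(totals map[key]float64, quotas map[key]QuotaInfo) []string {
	var problems []string
	for k, need := range totals {
		q, ok := quotas[k]
		if !ok {
			problems = append(problems, fmt.Sprintf("%s (%s): no quota information found", k.Metric, k.Region))
			continue
		}
		if avail := q.Limit - q.Usage; need > avail {
			problems = append(problems, fmt.Sprintf("%s (%s): need %.0f, available %.0f", k.Metric, k.Region, need, avail))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	reqs := []Requirement{
		{Metric: cpuMetric("c3-standard-8", false), Region: "us-central1", Amount: 8},
		{Metric: cpuMetric("c3-standard-176", false), Region: "us-central1", Amount: 176},
		{Metric: "SSD_TOTAL_GB", Region: "us-central1", Amount: 500},
	}
	quotas := map[key]QuotaInfo{
		{"C3_CPUS", "us-central1"}:      {Limit: 200, Usage: 100},
		{"SSD_TOTAL_GB", "us-central1"}: {Limit: 4096, Usage: 0},
	}
	for _, p := range checkQuota(aggregate(reqs), quotas) {
		fmt.Println(p)
	}
}
```

Summing per metric before comparing matters because several modules can draw on the same regional quota; checking each module in isolation could pass even when the blueprint as a whole does not fit.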

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist (Contributor)

Summary of Changes

Hello @kvenkatachala333, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical pre-deployment validation step to the Cluster Toolkit. It implements a robust system for checking Google Cloud Platform resource quotas, ensuring that the resources requested by a blueprint are available in the target project and region before any actual infrastructure provisioning begins. This significantly improves the reliability of deployments by catching quota-related issues early.

Highlights

  • New Quota Validator Implementation: Introduced a comprehensive GCP quota validator (pkg/validators/quota.go) that proactively checks resource requirements against available project and regional quotas before deployment. This helps prevent failures due to insufficient resources.
  • Integration into Deployment Workflow: The new quota validator is integrated into the doDeploy function in cmd/deploy.go, ensuring that quota checks are performed early in the deployment process to 'fail fast' if issues are detected.
  • Resource Requirement Collection: The validator intelligently collects resource requirements by walking through blueprint modules, accounting for various settings like machine types (CPUs, GPUs, including family-specific and preemptible types), disk sizes and types (e.g., Hyperdisk, SSD, standard), network resources, and TPU configurations.
  • Robust GCP API Interaction: Includes a retryCall mechanism for GCP API interactions to handle transient errors and rate limits, along with enhanced error handling in handleClientError to provide more specific feedback for common GCP API issues like permission denied (403) and rate limit exceeded (429).
  • Comprehensive Unit Testing: A new test file (pkg/validators/quota_test.go) has been added with a mock GCP client to thoroughly test the collectRequirements function across a wide range of module configurations and resource types.
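
As a rough illustration of the retry and error-handling behaviour summarised above, the sketch below retries a call with exponential backoff and turns 403 and 429 responses into more specific messages. It is written under assumptions: the PR names `retryCall` and `handleClientError`, but their signatures are not shown here, so the signature of `retryCall` below, the `apiError` type (a stand-in for whatever error type the GCP client returns), and the helpers `isRetryable` and `describe` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// apiError is a hypothetical stand-in for an HTTP-style API error.
type apiError struct {
	Code int
	Msg  string
}

func (e *apiError) Error() string { return fmt.Sprintf("HTTP %d: %s", e.Code, e.Msg) }

// isRetryable treats rate limits (429) and server errors (5xx) as transient.
// Permission errors (403) are not retried, since waiting will not fix them.
func isRetryable(err error) bool {
	var ae *apiError
	if !errors.As(err, &ae) {
		return false
	}
	return ae.Code == 429 || ae.Code >= 500
}

// retryCall runs call up to attempts times, doubling the delay after each
// transient failure. Non-retryable errors are returned immediately.
func retryCall(attempts int, base time.Duration, call func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil || !isRetryable(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// describe maps common API failures to more specific feedback.
func describe(err error) string {
	if err == nil {
		return ""
	}
	var ae *apiError
	if errors.As(err, &ae) {
		switch ae.Code {
		case 403:
			return "permission denied: check that the caller may read quotas for this project"
		case 429:
			return "rate limit exceeded: retry the quota check later"
		}
	}
	return err.Error()
}

func main() {
	calls := 0
	err := retryCall(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &apiError{Code: 429, Msg: "too many requests"}
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", describe(err))
}
```

Wrapping the final error with `%w` keeps the original status code reachable through `errors.As`, so the caller can still report a rate-limit message after the retries are exhausted.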

@kvenkatachala333 added the release-chore label (To not include into release notes) on Jan 28, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a comprehensive quota validator, which is a great feature for preventing deployment failures due to insufficient quota. The implementation is thorough, covering a wide range of Google Cloud resources. I appreciate the robust design, including the use of API call retries with exponential backoff and caching to improve performance.

I've identified a few areas for improvement, mainly related to code robustness and maintainability. My comments focus on handling potential errors more consistently, improving efficiency by avoiding repeated computations, and increasing code clarity by removing magic numbers and redundant logic. Overall, this is a solid contribution.

@kvenkatachala333 (Member, Author) commented

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a comprehensive quota validator, which is a great feature for preventing deployment failures due to insufficient quotas. The implementation is thorough, covering a wide range of Google Cloud resources. The code is well-structured, and the inclusion of unit tests is excellent. I've identified a few areas for improvement, mainly around code clarity, robustness, and removing unused code. These are detailed in the specific comments. Overall, this is a solid contribution.
