feat: Add explicit opt-in Quota Availability Validator #5422

Draft
kvenkatachala333 wants to merge 2 commits into GoogleCloudPlatform:develop from kvenkatachala333:quota_imp

@kvenkatachala333
Member

This PR introduces a Quota Availability Validator to the Cluster Toolkit as an explicit, opt-in feature. It enables a "fail fast" mechanism by verifying resource capacity before deployment, without causing regressions or unexpected latency for existing blueprints.

Key Features

  • Explicit Opt-in Design: Disabled by default. Only executes when explicitly listed in your blueprint YAML.
  • Zero Regression for Existing Blueprints: Users who don't need real-time quota checks see no added latency (1–3s) and no new permission requirements (e.g., compute.projects.get).
  • Real-Time API Capability: Integrates with the Compute Engine Quotas API to aggregate and check module requirements.
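
The opt-in behavior described above could be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the `ValidatorConfig` type, the `registry` map, and `runListed` are hypothetical names, and the quota check itself is stubbed out.

```go
package main

import "fmt"

// ValidatorConfig mirrors one entry under `validators:` in a blueprint.
// Hypothetical type for illustration; not the toolkit's real struct.
type ValidatorConfig struct {
	Validator string
	Inputs    map[string]string
}

// validatorFunc is an assumed signature for a validator implementation.
type validatorFunc func(inputs map[string]string) error

// registry holds validators that run only when a blueprint names them.
var registry = map[string]validatorFunc{
	"test_quota_availability": func(inputs map[string]string) error {
		if inputs["project_id"] == "" {
			return fmt.Errorf("project_id is required")
		}
		return nil // a real implementation would query quotas here
	},
}

// runListed runs only the validators listed in the blueprint and returns
// the names of those that ran. An empty list runs nothing, so blueprints
// that never mention the validator pay no extra cost.
func runListed(listed []ValidatorConfig) ([]string, error) {
	var ran []string
	for _, v := range listed {
		fn, ok := registry[v.Validator]
		if !ok {
			return ran, fmt.Errorf("unknown validator %q", v.Validator)
		}
		if err := fn(v.Inputs); err != nil {
			return ran, fmt.Errorf("%s: %w", v.Validator, err)
		}
		ran = append(ran, v.Validator)
	}
	return ran, nil
}

func main() {
	// Existing blueprint: nothing listed, nothing runs.
	ran, err := runListed(nil)
	fmt.Println(len(ran), err)

	// Blueprint that opts in explicitly.
	ran, err = runListed([]ValidatorConfig{{
		Validator: "test_quota_availability",
		Inputs:    map[string]string{"project_id": "my-project", "region": "us-central1"},
	}})
	fmt.Println(ran, err)
}
```

The point of the sketch is only that the validator is reachable by name and never invoked implicitly.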

How to Enable in Blueprint

validators:
- validator: test_quota_availability
  inputs:
    project_id: $(vars.project_id)
    region: $(vars.region) 

Resource Coverage

  • Compute (CPUs): Family-specific metrics (e.g., C3_CPUS, H100_CPUS).
  • GPUs: Maps accelerator types (A100, H100, L4, etc.) to regional/global metrics (including GPUS_ALL_REGIONS).
  • Storage: PD-Standard, SSD, Balanced, Extreme, and Hyperdisk Balanced (including IOPS/Throughput).
  • Specialty services: Filestore capacity and TPU core requirements.
  • Networks: NETWORKS and SUBNETWORKS global quotas.
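
One way the aggregation across modules might work is sketched below: requirements are summed per quota metric and compared against remaining headroom. The `Requirement` and `Quota` types, both function names, and the accelerator-to-metric entries are assumptions made for illustration; the PR's real mapping may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// acceleratorMetric maps accelerator types to regional quota metric names.
// The entries are illustrative assumptions, not the PR's actual table.
var acceleratorMetric = map[string]string{
	"nvidia-tesla-a100": "NVIDIA_A100_GPUS",
	"nvidia-h100-80gb":  "NVIDIA_H100_GPUS",
	"nvidia-l4":         "NVIDIA_L4_GPUS",
}

// Requirement is what one module needs of a given quota metric.
type Requirement struct {
	Metric string
	Amount float64
}

// Quota is the limit and current usage reported for a metric.
type Quota struct {
	Limit float64
	Usage float64
}

// aggregate sums requirements per metric across all modules.
func aggregate(reqs []Requirement) map[string]float64 {
	total := make(map[string]float64)
	for _, r := range reqs {
		total[r.Metric] += r.Amount
	}
	return total
}

// shortfalls reports every metric whose aggregated need exceeds the
// remaining headroom (limit minus usage). Metrics absent from quotas are
// skipped here; a real validator might treat them as an error instead.
func shortfalls(needed map[string]float64, quotas map[string]Quota) []string {
	var msgs []string
	for metric, need := range needed {
		q, ok := quotas[metric]
		if !ok {
			continue
		}
		if avail := q.Limit - q.Usage; need > avail {
			msgs = append(msgs, fmt.Sprintf("%s: need %.0f, available %.0f", metric, need, avail))
		}
	}
	sort.Strings(msgs) // deterministic output despite map iteration order
	return msgs
}

func main() {
	reqs := []Requirement{
		{Metric: "C3_CPUS", Amount: 176},
		{Metric: acceleratorMetric["nvidia-tesla-a100"], Amount: 4},
		{Metric: acceleratorMetric["nvidia-tesla-a100"], Amount: 8},
	}
	quotas := map[string]Quota{
		"C3_CPUS":          {Limit: 500, Usage: 100},
		"NVIDIA_A100_GPUS": {Limit: 16, Usage: 8},
	}
	for _, m := range shortfalls(aggregate(reqs), quotas) {
		fmt.Println(m) // prints "NVIDIA_A100_GPUS: need 12, available 8"
	}
}
```

Summing before comparing matters: two modules that each fit within the quota individually can still exceed it together.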

Resiliency & Performance

  • Exponential Backoff: Up to 5 retries for rate limits (429) or transient errors.
  • Regional Caching: In-memory caching for projects and regions to minimize duplicate API calls.
  • Unit Tests: Dedicated tests in pkg/validators/quota_test.go with mock GCP clients.
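
A retry loop of the kind described (one initial attempt plus up to 5 retries, doubling the delay each time) might look like the sketch below. The function names, the sentinel error, and the injectable `sleep` parameter are assumptions for illustration. Only rate-limit errors are retried here, whereas the PR also mentions other transient errors.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited stands in for an HTTP 429 response from the API.
var errRateLimited = errors.New("rate limited (429)")

// maxRetries matches the "up to 5 retries" described in the PR.
const maxRetries = 5

// noSleep is a stand-in for time.Sleep that tests can pass to avoid waiting.
func noSleep(time.Duration) {}

// withBackoff calls fn once and then retries up to maxRetries times while
// it keeps returning errRateLimited, doubling the delay before each retry.
// Any other error is returned immediately without retrying.
func withBackoff(base time.Duration, sleep func(time.Duration), fn func() error) error {
	delay := base
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = fn()
		if err == nil || !errors.Is(err, errRateLimited) {
			return err
		}
		if attempt == maxRetries {
			break // retries exhausted; do not sleep again
		}
		sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	attempts := 0
	err := withBackoff(10*time.Millisecond, time.Sleep, func() error {
		attempts++
		if attempts < 3 {
			return errRateLimited // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Passing the sleep function in as a parameter is a design choice that lets unit tests exercise the retry path without real delays.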

kvenkatachala333 added the release-chore (To not include into release notes) label on Mar 29, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a new, optional validator to the Cluster Toolkit designed to proactively check resource availability against GCP quotas. By integrating directly with the Compute Engine Quotas API, it provides a 'fail-fast' mechanism that helps users identify potential deployment blockers early. The implementation is designed to be non-intrusive, ensuring that existing workflows remain unaffected unless the validator is explicitly enabled in the blueprint.

Highlights

  • New Quota Availability Validator: Introduced an opt-in 'test_quota_availability' validator that checks resource capacity against GCP quotas before deployment.
  • Resilient API Integration: Implemented a GCP client with exponential backoff for handling rate limits (429) and transient errors, along with in-memory caching for performance.
  • Comprehensive Resource Coverage: Added support for validating quotas across Compute (CPUs/GPUs), Storage (PD/Hyperdisk), Filestore, TPUs, and Network resources.
  • Testing and Tooling: Added unit tests with mock GCP clients and updated validation scripts to exclude this new validator from existing golden copy tests.
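
The in-memory caching and mock-client testing mentioned in the highlights could be combined along these lines. This is a sketch under assumptions: `quotaCache`, `fetchFunc`, and `regionKey` are hypothetical names, and the fetch function stands in for the real Compute Engine call.

```go
package main

import (
	"fmt"
	"sync"
)

// regionKey identifies one cached lookup.
type regionKey struct {
	project string
	region  string
}

// fetchFunc stands in for the API call that returns metric -> limit for a
// project and region. Tests can supply a fake that counts invocations.
type fetchFunc func(project, region string) (map[string]float64, error)

// quotaCache memoizes successful lookups for the life of the process.
type quotaCache struct {
	mu      sync.Mutex
	fetch   fetchFunc
	entries map[regionKey]map[string]float64
}

func newQuotaCache(fetch fetchFunc) *quotaCache {
	return &quotaCache{fetch: fetch, entries: make(map[regionKey]map[string]float64)}
}

// get returns the cached quotas for (project, region), calling fetch only
// on the first request. Errors are not cached, so a later call can retry.
// The lock is held during fetch, which serializes lookups for simplicity.
func (c *quotaCache) get(project, region string) (map[string]float64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := regionKey{project, region}
	if q, ok := c.entries[key]; ok {
		return q, nil
	}
	q, err := c.fetch(project, region)
	if err != nil {
		return nil, err
	}
	c.entries[key] = q
	return q, nil
}

func main() {
	calls := 0
	cache := newQuotaCache(func(project, region string) (map[string]float64, error) {
		calls++ // fake client: count how often the "API" is hit
		return map[string]float64{"CPUS": 24}, nil
	})
	cache.get("my-project", "us-central1")
	cache.get("my-project", "us-central1") // served from the cache
	fmt.Println("API calls:", calls)
}
```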

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new quota availability validator for GCP resources, covering CPUs, GPUs, disks, Filestore, and TPUs. The implementation adds a new GCPQuotaClient with retry logic and caching, along with unit tests. The review makes two suggestions: move the accelerator metric map to a package-level variable as an optimization, and improve error handling when evaluating TPU preemption settings so that unknown values no longer fail silently.
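
The second suggestion, surfacing unknown preemption values instead of ignoring them, might look like the sketch below. The PR's actual code is not shown in this conversation, so the function name `parsePreemptible` and the set of accepted input types are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePreemptible interprets a blueprint setting that may arrive as a
// bool, a string, or be absent. Unknown values return an error rather
// than silently defaulting to false. Hypothetical helper for illustration.
func parsePreemptible(v any) (bool, error) {
	switch t := v.(type) {
	case nil:
		return false, nil // setting absent: treat as not preemptible
	case bool:
		return t, nil
	case string:
		b, err := strconv.ParseBool(t)
		if err != nil {
			return false, fmt.Errorf("invalid preemptible value %q: %w", t, err)
		}
		return b, nil
	default:
		return false, fmt.Errorf("unsupported preemptible value of type %T", v)
	}
}

func main() {
	for _, v := range []any{true, "false", "maybe", 3, nil} {
		b, err := parsePreemptible(v)
		fmt.Printf("%v -> %v, err=%v\n", v, b, err)
	}
}
```

Returning the error lets the validator report a misconfigured blueprint up front rather than checking the wrong quota.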
