fix: resource limits for starting instances #81
Cursor Bugbot has reviewed your changes and found 1 potential issue.
hiroTamada
left a comment
Looks good overall - nice refactor to centralize resource validation in lib/resources instead of the ad-hoc aggregate checks.
I agree with Bugbot's findings; both should be addressed (follow-up PRs are fine):
- High: Missing network/disk I/O limits on start - `startInstance` passes `0, 0, 0` for network download/upload and disk I/O instead of the stored values. Restarted instances won't have their bandwidth validated against capacity.
- Medium: RestoreInstance missing validation - `restoreInstance` doesn't call `validateResourceAllocation`, allowing standby→running transitions to bypass capacity checks.
These are edge cases (start from stopped, restore from standby) but could lead to oversubscription if resource availability changed while the instance was down.
Note
Medium Risk
Touches core instance lifecycle (create/start/stop/restore/standby) and resource admission logic, which can prevent workloads from starting if misconfigured; changes are largely additive and surfaced via explicit 409 errors.
Overview
Instance admission now uses `lib/resources` for aggregate capacity checks. `instances.Manager` gains a pluggable `ResourceValidator` (wired in `cmd/api/main.go`) and `CreateInstance`, `StartInstance`, and `RestoreInstance` reject requests when CPU/memory/network/disk I/O/GPU capacity is insufficient, surfacing `ErrInsufficientResources`.

API behavior changes: `POST /instances` now returns `409` for insufficient resources, `StartInstance` returns `409` with `insufficient_resources`, and the OpenAPI client/spec are updated accordingly. The resources endpoint now includes disk I/O capacity/status and per-instance `disk_io_bps` allocations.

vGPU lifecycle and state rules tightened: standby is blocked for vGPU instances, vGPU mdevs are recreated on start and destroyed/cleared on stop, and allocations tracking now includes `DiskIOBps`. Legacy aggregate CPU/memory env limits and related instance-side aggregate limit code/tests are removed in favor of oversubscription ratios.

Written by Cursor Bugbot for commit ca5e720. This will update automatically on new commits.