Overview
The current Workload API in KEP-4671 supports intra-PodGroup gang scheduling effectively - all pods within a PodGroup (or PodGroup replica) can be required to schedule together. However, the design treats each PodGroup or PodGroup replica as fully independent scheduling units, which creates several gaps for workloads that require coordination across groups or replicas.
From the KEP:
The individual PodGroups and PodGroup replicas are treated as independent gangs.
As an example, if one of the groups can be scheduled and the other can't be - this is exactly what will happen.
The intent of this issue is to highlight these gaps and start a discussion, not to propose specific API changes or fixes. Additionally, the gaps are explained in the context of disaggregated inference to ground them, but they are general gaps that would come up in a variety of AI workloads.
Motivating Use Case: Disaggregated LLM Inference
In disaggregated inference architectures (prefill-decode separation), a functional serving instance requires both:
- Prefill workers: Handle prompt processing
- Decode workers: Handle token generation
A deployment is only functional if it has at least one of each worker type. These worker types may be single-node (representable by a Deployment or StatefulSet) or multi-node (representable by LWS or Grove PodCliqueScalingGroup). The gaps described below prevent expressing the scheduling requirements for these deployments.
Gap 1: Inconsistent Semantics for minCount Across Single-Node vs Multi-Node Workers
The current API has inconsistent semantics for minCount depending on whether workers are single-node or multi-node.
- Single-node case (e.g., Deployment or StatefulSet): If I have a Deployment as a PodGroup where each replica is a single pod, I can assign all pods to the same PodGroup with the same (or no) podGroupReplicaKey. In this case, minCount effectively represents the minimum number of replicas I need scheduled.
- Multi-node case (e.g., LeaderWorkerSet or Grove PodCliqueScalingGroup): If I have an LWS as a PodGroup where each replica consists of multiple pods, the controller is expected to assign each pod a podGroupReplicaKey to distinguish replicas. In this case, minCount represents how many pods within a single LWS replica need to be scheduled - not how many LWS replicas need to be scheduled.
Impact: The semantics of minCount differ based on whether the PodGroup represents a single-node or multi-node component, and there is no way to express "I need at least N complete multi-node replicas" for the LWS case. To me, this inconsistency is important to address immediately, since the fix is unlikely to be purely additive.
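To make the two interpretations concrete, here is a minimal Go sketch. The pod struct and the scheduledReplicas helper are invented purely for illustration; they are not the KEP-4671 types.

```go
package main

import "fmt"

// Hypothetical illustration of the two minCount interpretations above.
// These are not the KEP-4671 Go types; pod and scheduledReplicas are
// invented purely to show what can and cannot be counted today.
type pod struct {
	podGroup           string
	podGroupReplicaKey string // empty/shared for single-node PodGroups
}

// scheduledReplicas counts how many distinct replica keys have at least
// podsPerReplica pods placed, i.e. how many "complete" multi-node replicas exist.
func scheduledReplicas(pods []pod, podsPerReplica int) int {
	perReplica := map[string]int{}
	for _, p := range pods {
		perReplica[p.podGroupReplicaKey]++
	}
	complete := 0
	for _, n := range perReplica {
		if n >= podsPerReplica {
			complete++
		}
	}
	return complete
}

func main() {
	// Single-node case: every pod shares the same replica key, so
	// minCount effectively means "at least N pods", i.e. "at least N replicas".
	single := []pod{{"decode", ""}, {"decode", ""}, {"decode", ""}}
	fmt.Println("single-node replicas scheduled:", len(single))

	// Multi-node case: minCount means "N pods within one LWS replica".
	// Nothing in the API says "I need at least 2 complete replicas".
	multi := []pod{
		{"prefill", "replica-0"}, {"prefill", "replica-0"},
		{"prefill", "replica-0"}, {"prefill", "replica-0"},
		{"prefill", "replica-1"}, // incomplete second replica
	}
	fmt.Println("complete multi-node replicas:", scheduledReplicas(multi, 4))
}
```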
Gap 2: No Inter-PodGroup Gang Constraints
Even if Gap 1 is addressed, there is no way to tie together minimum requirements across PodGroups into an all-or-nothing scheduling unit.
Single-node example
Even when replicas of prefill and decode workers are single pods, I cannot express "schedule at least 1 prefill pod AND at least 1 decode pod, or none at all." The current API allows me to guarantee that I get a given number of prefill pods or none, and a given number of decode pods or none, but these are independent guarantees. There is no way to express that I need at least 1 of each or none at all.
- Risk: The scheduler might schedule 10 prefill pods and 0 decode pods, resulting in a non-functional deployment.
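The missing guarantee can be phrased as a single all-or-nothing predicate. A minimal sketch, using invented names (interGroupGangSatisfied and plain per-group pod counts) rather than anything from the current API:

```go
package main

import "fmt"

// Hypothetical predicate for the missing inter-PodGroup constraint: either
// every listed group meets its minimum, or nothing is scheduled at all.
// Names are invented for illustration; this is not a proposed API.
func interGroupGangSatisfied(scheduled, minimums map[string]int) bool {
	nothingScheduled := true
	for group := range minimums {
		if scheduled[group] > 0 {
			nothingScheduled = false
			break
		}
	}
	if nothingScheduled {
		return true // "none at all" is an acceptable outcome
	}
	for group, minimum := range minimums {
		if scheduled[group] < minimum {
			return false
		}
	}
	return true
}

func main() {
	minimums := map[string]int{"prefill": 1, "decode": 1}

	// Each group's independent guarantee can be met or vacuously true, yet
	// the deployment is non-functional: 10 prefill pods and 0 decode pods.
	fmt.Println(interGroupGangSatisfied(map[string]int{"prefill": 10, "decode": 0}, minimums)) // false
	fmt.Println(interGroupGangSatisfied(map[string]int{"prefill": 1, "decode": 1}, minimums))  // true
	fmt.Println(interGroupGangSatisfied(map[string]int{}, minimums))                           // true: none at all
}
```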
Multi-node example
When workers themselves require multiple pods (e.g., a prefill worker spans 4 nodes for tensor parallelism), two levels of gang scheduling are needed:
- Intra-group: All pods within a single worker replica must schedule together
- Inter-group: At least 1 complete prefill replica AND at least 1 complete decode replica must schedule together, or none at all
The current API allows me to express the intra-group constraints, but not the inter-group constraint that ties them together. Hierarchical gang scheduling is not supported, though, unlike Gap 1, support could likely be added in a purely additive way.
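One way to picture the two levels is as a nested constraint. The sketch below uses invented types (innerGang, outerGang) only to make the hierarchy explicit; it is not a proposed API shape.

```go
package main

import "fmt"

// Hypothetical shape for a two-level gang constraint, to make the
// "intra-group + inter-group" requirement concrete. These types are not a
// proposed API; the field names are invented for illustration.
type innerGang struct {
	podGroup       string
	podsPerReplica int // intra-group: all-or-nothing within one worker replica
}

type outerGang struct {
	members             []innerGang
	minCompleteReplicas map[string]int // inter-group: required per PodGroup, jointly or not at all
}

func main() {
	serving := outerGang{
		members: []innerGang{
			{podGroup: "prefill", podsPerReplica: 4}, // e.g. 4-way tensor parallelism
			{podGroup: "decode", podsPerReplica: 2},
		},
		minCompleteReplicas: map[string]int{"prefill": 1, "decode": 1},
	}
	fmt.Printf("%+v\n", serving)
}
```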
Implied Requirement: The scheduler should work to satisfy the minimum requirements of all PodGroups before allocating resources beyond them.
For instance, if my application needs at least 1 prefill + 1 decode to function but ideally wants 4 prefill + 2 decode, the scheduler should not place 2 prefills before any decode and then fail scheduling for the entire workload when 1 prefill + 1 decode would have succeeded.
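A rough sketch of that ordering, with a single abstract capacity number standing in for real placement constraints and an invented allocate helper:

```go
package main

import "fmt"

// Sketch of the implied ordering: reserve capacity for every group's minimum
// before spending any of it on extras. "capacity" is an abstract stand-in for
// real placement constraints; allocate is invented for illustration.
func allocate(capacity int, minimums, desired map[string]int) (map[string]int, bool) {
	alloc := map[string]int{}
	// Pass 1: satisfy every group's minimum, or fail the whole workload.
	for group, minimum := range minimums {
		if capacity < minimum {
			return nil, false
		}
		alloc[group] = minimum
		capacity -= minimum
	}
	// Pass 2: spend whatever is left toward the desired counts.
	for group, want := range desired {
		extra := want - alloc[group]
		if extra > capacity {
			extra = capacity
		}
		if extra > 0 {
			alloc[group] += extra
			capacity -= extra
		}
	}
	return alloc, true
}

func main() {
	// Needs 1 prefill + 1 decode to function, ideally wants 4 prefill + 2 decode.
	minimums := map[string]int{"prefill": 1, "decode": 1}
	desired := map[string]int{"prefill": 4, "decode": 2}

	// With room for only 2 replicas, a greedy "prefill first" pass would place
	// 2 prefills and fail; satisfying minimums first succeeds with 1 + 1.
	fmt.Println(allocate(2, minimums, desired))
}
```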
Future Consideration: Hierarchical Topology Scheduling
This is less of a gap and more of a forward-looking consideration. The KEP mentions synergy between gang scheduling and topology-aware scheduling. As these features evolve, we should keep in mind that hierarchical topology requirements will emerge:
- Topology requirements at the functional unit level: The entire deployment (prefill + decode together) may have topology constraints - for instance, the whole deployment fitting within a single block.
- Topology requirements at the PodGroup level: Individual PodGroups like prefill or decode may have their own topology constraints - for instance, each prefill worker being placed within a single NVL72 rack.
- Cross-group affinity: There may be preferences for scheduling prefill workers topologically closer to decode workers rather than to other prefill workers.
These requirements mirror the hierarchical nature of gang scheduling and should be considered as the API evolves.
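For completeness, a small sketch of how those three kinds of requirements might nest, using invented types and field names purely to mirror the hierarchy described above; nothing here exists in the Workload API.

```go
package main

import "fmt"

// Hypothetical illustration of the hierarchical topology requirements listed
// above; the type and fields are invented only to make the nesting explicit.
type workloadTopology struct {
	workloadLevel  string            // whole functional unit, e.g. prefill + decode together within one block
	perPodGroup    map[string]string // per-PodGroup constraint, e.g. each prefill replica within one rack
	crossGroupPref string            // soft preference, e.g. place prefill replicas near decode replicas
}

func main() {
	serving := workloadTopology{
		workloadLevel:  "block",
		perPodGroup:    map[string]string{"prefill": "rack", "decode": "rack"},
		crossGroupPref: "pack prefill close to decode",
	}
	fmt.Printf("%+v\n", serving)
}
```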