
Workload API gaps for disaggregated/multi-component inference workloads #5738

@nvrohanv

Description


Overview

The current Workload API in KEP-4671 supports intra-PodGroup gang scheduling effectively - all pods within a PodGroup (or PodGroup replica) can be required to schedule together. However, the design treats each PodGroup or PodGroup replica as fully independent scheduling units, which creates several gaps for workloads that require coordination across groups or replicas.

From the KEP:

The individual PodGroups and PodGroup replicas are treated as independent gangs.
As an example, if one of the groups can be scheduled and the other can't be - this is exactly what will happen.

The intent of this issue is to highlight these gaps and start a discussion, not to propose specific API changes or fixes. Additionally, the gaps are explained in the context of disaggregated inference to ground them, but they are general gaps that would come up in a variety of AI workloads.

Motivating Use Case: Disaggregated LLM Inference

In disaggregated inference architectures (prefill-decode separation), a functional serving instance requires both:

  • Prefill workers: Handle prompt processing
  • Decode workers: Handle token generation

A deployment is only functional if it has at least one of each worker type. These worker types may be single-node (representable by a Deployment or StatefulSet) or multi-node (representable by LWS or Grove PodCliqueScalingGroup). The gaps described below prevent expressing the scheduling requirements for these deployments.

Gap 1: Inconsistent Semantics for minCount Across Single-Node vs Multi-Node Workers

The current API has inconsistent semantics for minCount depending on whether workers are single-node or multi-node.

  • Single-node case (e.g., Deployment or StatefulSet): If I have a Deployment as a PodGroup where each replica is a single pod, I can assign all pods to the same PodGroup with the same (or no) podGroupReplicaKey. In this case, minCount effectively represents the minimum number of replicas I need scheduled.
  • Multi-node case (e.g., LeaderWorkerSet or Grove PodCliqueScalingGroup): If I have an LWS as a PodGroup where each replica consists of multiple pods, the controller is expected to assign each pod a podGroupReplicaKey to distinguish replicas. In this case, minCount represents how many pods within a single LWS replica need to be scheduled - not how many LWS replicas need to be scheduled.

Impact: The semantics of minCount differ depending on whether the PodGroup represents a single-node or a multi-node component, and there is no way to express "I need at least N complete multi-node replicas" for the LWS case, as sketched below. To me, this inconsistency is important to address immediately, because the fix is unlikely to be purely additive (unlike Gap 2 below).
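
To make the inconsistency concrete, here is a minimal sketch of a per-PodGroup gang policy. This is not the actual KEP-4671 Go types; the struct shape is invented for illustration, and only the minCount and podGroupReplicaKey names come from the discussion above.

```go
// Illustrative only: a simplified, hypothetical rendering of a per-PodGroup
// gang policy to make the ambiguity concrete; not the actual KEP-4671 types.
package sketch

// PodGroupPolicy captures the gang requirement for one PodGroup.
type PodGroupPolicy struct {
	// MinCount is the minimum number of pods that must schedule together.
	//
	// Single-node case (Deployment/StatefulSet, shared or no
	// podGroupReplicaKey): all pods form one gang, so MinCount behaves
	// like "minimum number of replicas".
	//
	// Multi-node case (LWS / Grove PodCliqueScalingGroup, one
	// podGroupReplicaKey per replica): each replica is its own gang, so
	// MinCount means "minimum pods within a single replica"; nothing
	// expresses "minimum number of complete replicas".
	MinCount int32 `json:"minCount"`
}
```

The field itself is the same in both cases; only the controller's use of podGroupReplicaKey differs, which is what makes the meaning of MinCount shift.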

Gap 2: No Inter-PodGroup Gang Constraints

Even if Gap 1 is addressed, there is no way to tie together minimum requirements across PodGroups into an all-or-nothing scheduling unit.

Single-node example

Even when replicas of the prefill and decode workers are single pods, I cannot express "schedule at least 1 prefill pod AND at least 1 decode pod, or none at all." The current API lets me guarantee that I get a given number of prefill pods or none, and a given number of decode pods or none, but these are independent guarantees; there is no way to tie them into "at least 1 of each, or none at all" (see the sketch after the risk below).

  • Risk: The scheduler might schedule 10 prefill pods and 0 decode pods, resulting in a non-functional deployment.
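
For illustration, here is a minimal sketch contrasting what the current API can express (independent per-group gangs) with the cross-group tie that is missing. The type and field names are hypothetical and are not a proposed API.

```go
// Illustrative only: hypothetical types showing what can be expressed today
// (independent per-group gangs) versus the missing cross-group tie; the
// names are invented for the example, not proposed API.
package sketch

// Expressible today: each group gets its own all-or-nothing minimum,
// evaluated independently of every other group.
type IndependentGang struct {
	PodGroup string `json:"podGroup"` // e.g. "prefill" or "decode"
	MinCount int32  `json:"minCount"` // N pods of this group, or none
}

// Not expressible today: a single all-or-nothing unit spanning groups,
// e.g. at least 1 prefill pod AND at least 1 decode pod, or nothing.
type CrossGroupGang struct {
	Members []IndependentGang `json:"members"`
}
```

The second type is the piece with no counterpart in the current API.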

Multi-node example

When workers themselves require multiple pods (e.g., a prefill worker spans 4 nodes for tensor parallelism), two levels of gang scheduling are needed:

  1. Intra-group: All pods within a single worker replica must schedule together
  2. Inter-group: At least 1 complete prefill replica AND at least 1 complete decode replica must schedule together, or none at all

The current API lets me express the intra-group constraints, but not the inter-group constraint that ties them together. Hierarchical gang scheduling is not supported, though adding support could likely be done in a purely additive way, unlike Gap 1.
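
A rough sketch of the two levels follows, with invented names purely to show the hierarchy; this is not a proposed KEP-4671 API.

```go
// Illustrative only: a hypothetical two-level gang structure for the
// multi-node case; the shape is invented to show the hierarchy, not a
// proposed KEP-4671 API.
package sketch

// ReplicaGang is the inner gang: all pods of one worker replica schedule
// together (e.g. a prefill worker spanning 4 pods for tensor parallelism).
type ReplicaGang struct {
	PodGroup       string `json:"podGroup"`       // e.g. "prefill"
	PodsPerReplica int32  `json:"podsPerReplica"` // e.g. 4
	MinReplicas    int32  `json:"minReplicas"`    // e.g. 1 complete replica
}

// WorkloadGang is the outer gang: either every member reaches its
// MinReplicas of complete replicas, or nothing in the workload schedules.
type WorkloadGang struct {
	Members []ReplicaGang `json:"members"` // e.g. prefill + decode
}
```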

Implied Requirement: The scheduler should work to satisfy the minimum requirements for all PodGroups before allocating resources beyond them.
For instance, if my application needs at least 1 prefill + 1 decode to function but ideally wants 4 prefill + 2 decode, the scheduler should not place 2 prefills before the first decode and then falsely fail scheduling for the entire workload when 1 prefill + 1 decode would have succeeded.
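
One way to picture this requirement is the following hypothetical ordering, where tryPlace stands in for the real placement logic; this is a sketch of the intent only, not scheduler code.

```go
// Illustrative only: a hypothetical ordering in which every group's minimum
// is satisfied before any group grows past its minimum.
package sketch

type groupDemand struct {
	name         string
	min, desired int32
}

// tryPlace is a placeholder for attempting to place `count` pods of a group.
func tryPlace(name string, count int32) bool { return count >= 0 }

func scheduleWorkload(groups []groupDemand) bool {
	// Phase 1: place the minimum for every group (e.g. 1 prefill +
	// 1 decode) before spending capacity on anything beyond minimums.
	for _, g := range groups {
		if !tryPlace(g.name, g.min) {
			return false // minimums unsatisfiable: schedule nothing
		}
	}
	// Phase 2: only then grow toward the desired counts
	// (e.g. 4 prefill + 2 decode), best effort.
	for _, g := range groups {
		tryPlace(g.name, g.desired-g.min)
	}
	return true
}
```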

Future Consideration: Hierarchical Topology Scheduling

This is less of a gap and more of a forward-looking consideration. The KEP mentions synergy between gang scheduling and topology-aware scheduling. As these features evolve, we should keep in mind that hierarchical topology requirements will emerge:

  • Topology requirements at the functional unit level: The entire deployment (prefill + decode together) may have topology constraints - for instance, the whole deployment fitting within a single block.
  • Topology requirements at the PodGroup level: Individual PodGroups like prefill or decode may have their own topology constraints - for instance, each prefill worker landing on the same NVL72 rack.
  • Cross-group affinity: There may be a preference for scheduling prefill workers topologically closer to decode workers than to other prefill workers.

These requirements mirror the hierarchical nature of gang scheduling and should be considered as the API evolves.
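
For illustration only, here is a hypothetical shape for such hierarchical topology constraints, mirroring the two-level gang structure above; all names and keys are invented, not proposed API.

```go
// Illustrative only: hypothetical hierarchical topology constraints that
// mirror the two-level gang structure; keys and field names are invented.
package sketch

// TopologyConstraint requires a set of pods to land within one domain of
// the given topology key (a node label such as a block or rack label).
type TopologyConstraint struct {
	TopologyKey string `json:"topologyKey"` // e.g. "example.com/block"
}

// HierarchicalTopology layers constraints at two levels.
type HierarchicalTopology struct {
	// Workload applies to the whole functional unit, e.g. prefill + decode
	// together within one block.
	Workload *TopologyConstraint `json:"workload,omitempty"`
	// PerPodGroup applies per group replica, e.g. each prefill worker
	// within one NVL72 rack, keyed by PodGroup name.
	PerPodGroup map[string]*TopologyConstraint `json:"perPodGroup,omitempty"`
}
```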
