
Workload API gaps for disaggregated/multi-component inference workloads #5738

@nvrohanv

Description


Overview

The current Workload API in KEP-4671 supports intra-PodGroup gang scheduling effectively - all pods within a PodGroup (or PodGroup replica) can be required to schedule together. However, the design treats each PodGroup or PodGroup replica as fully independent scheduling units, which creates several gaps for workloads that require coordination across groups or replicas.

From the KEP:

The individual PodGroups and PodGroup replicas are treated as independent gangs.
As an example, if one of the groups can be scheduled and the other can't be - this is exactly what will happen.

The intent of this issue is to highlight these gaps and start a discussion, not to propose specific API changes or fixes. Additionally, the gaps are explained in the context of disaggregated inference to ground them, but they are general gaps that would come up in a variety of AI workloads.

Motivating Use Case: Disaggregated LLM Inference

In disaggregated inference architectures (prefill-decode separation), a functional serving instance requires both:

  • Prefill workers: Handle prompt processing
  • Decode workers: Handle token generation

A deployment is only functional if it has at least one of each worker type. These worker types may be single-node (representable by a Deployment or StatefulSet) or multi-node (representable by LWS or Grove PodCliqueScalingGroup). The gaps described below prevent expressing the scheduling requirements for these deployments.

Gap 1: Inconsistent Semantics for minCount Across Single-Node vs Multi-Node Workers

The current API has inconsistent semantics for minCount depending on whether workers are single-node or multi-node.

  • Single-node case (e.g., Deployment or StatefulSet): If I have a Deployment as a PodGroup where each replica is a single pod, I can assign all pods to the same PodGroup with the same (or no) podGroupReplicaKey. In this case, minCount effectively represents the minimum number of replicas I need scheduled.
  • Multi-node case (e.g., LeaderWorkerSet or Grove PodCliqueScalingGroup): If I have an LWS as a PodGroup where each replica consists of multiple pods, the controller is expected to assign each pod a podGroupReplicaKey to distinguish replicas. In this case, minCount represents how many pods within a single LWS replica need to be scheduled - not how many LWS replicas need to be scheduled.

Impact: The semantics of minCount differ depending on whether the PodGroup represents a single-node or a multi-node component, and there is no way to express "I need at least N complete multi-node replicas" for the LWS case, as sketched below. To me, this inconsistency is important to address immediately, because the fix is unlikely to be purely additive (unlike Gap 2 below).
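
To make the inconsistency concrete, here is a minimal sketch of a per-PodGroup gang policy. This is not the actual KEP-4671 Go types; the struct shape is invented for illustration, and only the minCount and podGroupReplicaKey names come from the discussion above.

```go
// Illustrative only: a simplified, hypothetical rendering of a per-PodGroup
// gang policy to make the ambiguity concrete; not the actual KEP-4671 types.
package sketch

// PodGroupPolicy captures the gang requirement for one PodGroup.
type PodGroupPolicy struct {
	// MinCount is the minimum number of pods that must schedule together.
	//
	// Single-node case (Deployment/StatefulSet, shared or no
	// podGroupReplicaKey): all pods form one gang, so MinCount behaves
	// like "minimum number of replicas".
	//
	// Multi-node case (LWS / Grove PodCliqueScalingGroup, one
	// podGroupReplicaKey per replica): each replica is its own gang, so
	// MinCount means "minimum pods within a single replica"; nothing
	// expresses "minimum number of complete replicas".
	MinCount int32 `json:"minCount"`
}
```

The field itself is the same in both cases; only the controller's use of podGroupReplicaKey differs, which is what makes the meaning of MinCount shift.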

Gap 2: No Inter-PodGroup Gang Constraints

Even if Gap 1 is addressed, there is no way to tie together minimum requirements across PodGroups into an all-or-nothing scheduling unit.

Single-node example

Even when replicas of the prefill and decode workers are single pods, I cannot express "schedule at least 1 prefill pod AND at least 1 decode pod, or none at all." The current API lets me guarantee that I get a given number of prefill pods or none, and a given number of decode pods or none, but these are independent guarantees; there is no way to tie them into "at least 1 of each, or none at all" (see the sketch after the risk below).

  • Risk: The scheduler might schedule 10 prefill pods and 0 decode pods, resulting in a non-functional deployment.
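
For illustration, here is a minimal sketch contrasting what the current API can express (independent per-group gangs) with the cross-group tie that is missing. The type and field names are hypothetical and are not a proposed API.

```go
// Illustrative only: hypothetical types showing what can be expressed today
// (independent per-group gangs) versus the missing cross-group tie; the
// names are invented for the example, not proposed API.
package sketch

// Expressible today: each group gets its own all-or-nothing minimum,
// evaluated independently of every other group.
type IndependentGang struct {
	PodGroup string `json:"podGroup"` // e.g. "prefill" or "decode"
	MinCount int32  `json:"minCount"` // N pods of this group, or none
}

// Not expressible today: a single all-or-nothing unit spanning groups,
// e.g. at least 1 prefill pod AND at least 1 decode pod, or nothing.
type CrossGroupGang struct {
	Members []IndependentGang `json:"members"`
}
```

The second type is the piece with no counterpart in the current API.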

Multi-node example

When workers themselves require multiple pods (e.g., a prefill worker spans 4 nodes for tensor parallelism), two levels of gang scheduling are needed:

  1. Intra-group: All pods within a single worker replica must schedule together
  2. Inter-group: At least 1 complete prefill replica AND at least 1 complete decode replica must schedule together, or none at all

The current API lets me express the intra-group constraints, but not the inter-group constraint that ties them together. Hierarchical gang scheduling is not supported, though adding support could likely be done in a purely additive way, unlike Gap 1.
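
A rough sketch of the two levels follows, with invented names purely to show the hierarchy; this is not a proposed KEP-4671 API.

```go
// Illustrative only: a hypothetical two-level gang structure for the
// multi-node case; the shape is invented to show the hierarchy, not a
// proposed KEP-4671 API.
package sketch

// ReplicaGang is the inner gang: all pods of one worker replica schedule
// together (e.g. a prefill worker spanning 4 pods for tensor parallelism).
type ReplicaGang struct {
	PodGroup       string `json:"podGroup"`       // e.g. "prefill"
	PodsPerReplica int32  `json:"podsPerReplica"` // e.g. 4
	MinReplicas    int32  `json:"minReplicas"`    // e.g. 1 complete replica
}

// WorkloadGang is the outer gang: either every member reaches its
// MinReplicas of complete replicas, or nothing in the workload schedules.
type WorkloadGang struct {
	Members []ReplicaGang `json:"members"` // e.g. prefill + decode
}
```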

Implied Requirement: The scheduler should work to satisfy the minimum requirements for all PodGroups before allocating resources beyond them.
For instance, if my application needs at least 1 prefill + 1 decode to function but ideally wants 4 prefill + 2 decode, the scheduler should not place 2 prefills before the first decode and then falsely fail scheduling for the entire workload when 1 prefill + 1 decode would have succeeded.
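
One way to picture this requirement is the following hypothetical ordering, where tryPlace stands in for the real placement logic; this is a sketch of the intent only, not scheduler code.

```go
// Illustrative only: a hypothetical ordering in which every group's minimum
// is satisfied before any group grows past its minimum.
package sketch

type groupDemand struct {
	name         string
	min, desired int32
}

// tryPlace is a placeholder for attempting to place `count` pods of a group.
func tryPlace(name string, count int32) bool { return count >= 0 }

func scheduleWorkload(groups []groupDemand) bool {
	// Phase 1: place the minimum for every group (e.g. 1 prefill +
	// 1 decode) before spending capacity on anything beyond minimums.
	for _, g := range groups {
		if !tryPlace(g.name, g.min) {
			return false // minimums unsatisfiable: schedule nothing
		}
	}
	// Phase 2: only then grow toward the desired counts
	// (e.g. 4 prefill + 2 decode), best effort.
	for _, g := range groups {
		tryPlace(g.name, g.desired-g.min)
	}
	return true
}
```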

Future Consideration: Hierarchical Topology Scheduling

This is less of a gap and more of a forward-looking consideration. The KEP mentions synergy between gang scheduling and topology-aware scheduling. As these features evolve, we should keep in mind that hierarchical topology requirements will emerge:

  • Topology requirements at the functional unit level: The entire deployment (prefill + decode together) may have topology constraints - for instance, the whole deployment fitting within a single block.
  • Topology requirements at the PodGroup level: Individual PodGroups like prefill or decode may have their own topology constraints - for instance, each prefill worker landing on the same NVL72 rack.
  • Cross-group affinity: There may be a preference for scheduling prefill workers topologically closer to decode workers than to other prefill workers.

These requirements mirror the hierarchical nature of gang scheduling and should be considered as the API evolves.
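
For illustration only, here is a hypothetical shape for such hierarchical topology constraints, mirroring the two-level gang structure above; all names and keys are invented, not proposed API.

```go
// Illustrative only: hypothetical hierarchical topology constraints that
// mirror the two-level gang structure; keys and field names are invented.
package sketch

// TopologyConstraint requires a set of pods to land within one domain of
// the given topology key (a node label such as a block or rack label).
type TopologyConstraint struct {
	TopologyKey string `json:"topologyKey"` // e.g. "example.com/block"
}

// HierarchicalTopology layers constraints at two levels.
type HierarchicalTopology struct {
	// Workload applies to the whole functional unit, e.g. prefill + decode
	// together within one block.
	Workload *TopologyConstraint `json:"workload,omitempty"`
	// PerPodGroup applies per group replica, e.g. each prefill worker
	// within one NVL72 rack, keyed by PodGroup name.
	PerPodGroup map[string]*TopologyConstraint `json:"perPodGroup,omitempty"`
}
```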
