
Add configurable allocation policy (packed/distributed) for replicated and MIG resources #1621

Open
wkd-woo wants to merge 6 commits into NVIDIA:main from wkd-woo:feature/packed-allocation-policy

Conversation


@wkd-woo wkd-woo commented Feb 11, 2026

Summary

  • Add --allocation-policy flag (ALLOCATION_POLICY env) with distributed (default) and packed options
  • packed mode bin-packs replicated/MIG devices onto fewest physical GPUs, freeing up remaining GPUs for full-GPU workloads
  • Default behavior (distributed) is unchanged — no breaking changes

Motivation

The current distributedAlloc was designed for time-slicing, where distributing replicas across physical GPUs avoids compute contention. However, MIG devices also fall into this code path simply because AlignedAllocationSupported() returns false for them — not because distributed allocation is the right strategy.

MIG instances are hardware-isolated partitions with dedicated SMs and memory. Packing them onto fewer physical GPUs has no performance penalty, and frees up remaining GPUs for full-GPU workloads:

┌─────────────────────────────────────────────────────────────────────┐
│                Distributed Allocation (current default)              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GPU 0            GPU 1            GPU 2            GPU 3           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ ████░░░░ │    │ ████░░░░ │    │ ████░░░░ │    │ ░░░░░░░░ │      │
│  │ MIG 1/5  │    │ MIG 1/5  │    │ MIG 1/5  │    │ MIG 0/5  │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│   ⚠ partial       ⚠ partial       ⚠ partial       ░ empty          │
│                                                                     │
│  → Full GPU request arrives: only 1 GPU available (GPU 3)           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    Packed Allocation (bin-packing)                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GPU 0            GPU 1            GPU 2            GPU 3           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ ████████ │    │ ░░░░░░░░ │    │ ░░░░░░░░ │    │ ░░░░░░░░ │      │
│  │ MIG 3/5  │    │ MIG 0/5  │    │ MIG 0/5  │    │ MIG 0/5  │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│   ✓ packed         ★ free          ★ free          ★ free           │
│                                                                     │
│  → Full GPU request arrives: 3 GPUs available (GPU 1, 2, 3)        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Relates to #491

Design

getPreferredAllocation
├─ Full GPU (AlignedAllocationSupported && no annotations)
│   └─ alignedAlloc (topology-based, unaffected by this change)
│
└─ MIG / time-slicing / MPS
    ├─ allocationPolicy=packed → packedAlloc (new)
    └─ allocationPolicy=distributed (default) → distributedAlloc (existing, unmodified)

Key decisions:

  • distributedAlloc is completely untouched — packedAlloc is a separate function following the existing alignedAlloc/distributedAlloc pattern
  • Full GPU nodes are unaffected — alignedAlloc is selected before allocationPolicy is ever checked, so setting packed on a full-GPU node has no effect
  • Flag applies uniformly — when packed is set, it applies to MIG, time-slicing, and MPS. Silently ignoring a user-set flag for specific device types would be inconsistent
  • Per-node config supported — works with existing config-manager + ConfigMap + node label (nvidia.com/device-plugin.config) mechanism via YAML config

Usage

CLI flag / Environment variable

--allocation-policy=packed
# or
ALLOCATION_POLICY=packed
Config file (per-node via ConfigMap + node label)

version: v1
flags:
  migStrategy: mixed
  plugin:
    allocationPolicy: packed

kubectl label node mig-node nvidia.com/device-plugin.config=mig-packed

Test plan

  • TestDistributedAlloc (6 cases) — existing distributed behavior regression
  • TestDistributedAllocIsDefault — default consistency verified over 10 iterations
  • TestPackedAlloc (6 cases) — bin-packing: same GPU priority, overflow to next GPU
  • TestPackedVsDistributedContrast — two strategies produce different results on same input
  • TestFullGPUNodeIgnoresAllocationPolicy (3 sub-cases) — full GPU / MIG / replicated branch path verification
  • All existing internal/rm/ tests pass unchanged


copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 421e3c9 to 6686d1a (February 11, 2026 09:06)

ottowhite commented Feb 15, 2026

Hi! Very cool idea. I actually really want a very similar custom allocation policy for replicated (non-MIG) resources, but one that enables bin-packing replicas with topology-awareness.

For example, this could enable the allocation of 1.6 GPUs to a single model, with two tensor parallel ranks across 2 GPUs, 0.8 GPUs allocated to each tensor parallel rank, and both within a single NVLink domain (see my pretty diagram for an example). This would avoid excessive communication overhead during matrix multiplications across both tensor parallel model ranks, while leaving the 0.4 GPUs left over to be allocated to a smaller model, decreasing latency and increasing throughput in (surprisingly common) multi-model deployment scenarios including RAG with LLMs + Embedding models.

[image]

I believe the implementation of this policy is essentially a composite of what you've implemented and the alignedAlloc code path, which uses go-gpuallocator (gaining topology-awareness through NVML). At present, all of the replicated resource code paths discard topology awareness in favour of evenly load balancing GPU fractions with the distributed allocation policy, which prevents the use-case that I'm after.

Would be great if you're interested in discussing this more. I wonder if it's possible to slip it in with this PR? Also happy to contribute if that helps.

iid := AnnotatedID(candidates[i]).GetID()
jid := AnnotatedID(candidates[j]).GetID()
idiff := replicas[iid].total - replicas[iid].available
jdiff := replicas[jid].total - replicas[jid].available
return idiff > jdiff


Was the only change in this function and distributedAlloc the comparator between idiff and jdiff? Can we just pass in an enum and then have a condition on the enum to dictate this behavior? There is a lot of code duplication at the moment.

Author

Thanks for the review!

The duplication is intentional — I wanted distributedAlloc to stay completely untouched so reviewers can verify there's zero behavioral change just by looking at the diff.

On merging them: the two functions share the same structure because packed is just the inverse of distributed (flipped comparator). But as you mentioned with topology-aware bin-packing, a future policy would need NVML topology data and different sorting logic entirely — so I'd rather keep them separate now than merge and split again later.

That said, happy to extract the common setup (candidate filtering + replica counting) into a shared helper if that feels cleaner.


No problem :) to be honest I think it would actually be easier to review/maintain if you generalised the allocation function slightly.

It could be generalised to fractionalAlloc, taking an enum that allows distributedFractionPlacement or packedFractionPlacement policies. Then internally you just pick the comparator based on the policy.

What you suggested with pulling out the common helpers such as candidate filtering and replica counting is probably even better than the generalisation I suggested, as it enables the policies to be kept completely separate, and is more extensible in that it enables easier implementation of new policies again such as the topology-aware one.

When I was reviewing the code, I pulled up both distributedAlloc and packedAlloc side by side and was cross-checking for a while before I saw the flipped comparator. I expect other reviewers and contributors to go through a similar process. This is just my two cents though.

You've also improved the state of testing so that's an additional signal that this is functionally equivalent to what there was before.

In terms of the different topology-aware placement policy, I think it makes sense to keep it separate for now. But cool that you're opening this up and making it more configurable! I may build off of this myself.

Author

@ottowhite I've extracted prepareCandidates in the second commit.
Each policy function now only has its sorting logic, so the difference should be immediately visible.

@ottowhite ottowhite Feb 17, 2026

Looking better! There's still quite a bit of duplication across both of the Alloc functions. What is common is that they're both greedy allocation algorithms (repeatedly evaluating and selecting the next best option, without looking further ahead). Could you extract this into another helper, greedyDeviceAlloc or something similar, that makes the difference between these two functions even clearer? Could pass in a candidate comparator. It will also be much clearer to extend with other greedy allocation policies. For example, the one that I spoke about is another greedy allocation algorithm that happens to use topology-awareness in its comparator.

The existing distributedAlloc spreads replicated devices evenly across
physical GPUs, which was designed for time-slicing where workloads
compete for shared compute. MIG instances, however, are hardware-isolated
and do not suffer from contention when packed onto the same GPU.

This adds a --allocation-policy flag (env: ALLOCATION_POLICY) with two
options: "distributed" (default, preserving current behavior) and
"packed" (bin-packing onto fewest physical GPUs). The packed policy
helps free up entire GPUs for full-GPU workloads in mixed clusters.

The flag applies uniformly to all non-aligned allocation paths (MIG,
time-slicing, MPS) and can be configured per-node via ConfigMap and
the nvidia.com/device-plugin.config node label.

Signed-off-by: wkd-woo <wkdwoos@gmail.com>
@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 6686d1a to 8656ef3 (February 17, 2026 12:22)
Refactor common setup logic (candidate filtering, validation, replica
counting) into a shared prepareCandidates helper. Each allocation
policy function now focuses solely on its sorting strategy, making the
difference between distributed and packed immediately visible.

Signed-off-by: wkd-woo <wkdwoos@gmail.com>
@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 8656ef3 to 2839bfc (February 17, 2026 12:22)

ottowhite commented Feb 17, 2026

Also, is there a reason why the alignedAlloc happens to be in nvml_manager.go? It seems like just another allocation policy that would be more suited to be in allocate.go rather than there. I know this is not due to your code change but moving it there could improve the organisation of the code, and ease to extend with different allocation policies.


wkd-woo commented Feb 17, 2026

Thanks @ottowhite for your review :)
On moving alignedAlloc — agreed it could improve organization, but I'd rather keep this PR focused on the new feature and not restructure existing code!

I'd like to hear from a maintainer at this point to see if the current direction works.

@ottowhite

Fair enough! Though I think those refactors would be a minor improvement for organisation/extensibility, I think the testing is strong and implementation is solid.

LGTM!

Will be great to hear what the maintainers have to say. Also really keen on this feature and the more extensible policy opportunities that it opens up.


ottowhite commented Feb 19, 2026

Hey! By the way, we're running your packedAlloc allocation policy and it's working for multi-GPU deployments. We will imminently be forking it and building off of it.


wkd-woo commented Feb 25, 2026

Hi @elezar, just checking in to see if there's anything else needed from my side to move this forward. I'm happy to address any feedback :)
