
Add configurable allocation policy (packed/distributed) for replicated and MIG resources #1621

Open
wkd-woo wants to merge 6 commits into NVIDIA:main from wkd-woo:feature/packed-allocation-policy

Conversation


@wkd-woo wkd-woo commented Feb 11, 2026

Summary

  • Add --allocation-policy flag (ALLOCATION_POLICY env) with distributed (default) and packed options
  • packed mode bin-packs replicated/MIG devices onto fewest physical GPUs, freeing up remaining GPUs for full-GPU workloads
  • Default behavior (distributed) is unchanged — no breaking changes

Motivation

The current distributedAlloc was designed for time-slicing, where distributing replicas across physical GPUs avoids compute contention. However, MIG devices also fall into this code path simply because AlignedAllocationSupported() returns false for them — not because distributed allocation is the right strategy.

MIG instances are hardware-isolated partitions with dedicated SMs and memory. Packing them onto fewer physical GPUs has no performance penalty, and frees up remaining GPUs for full-GPU workloads:

┌─────────────────────────────────────────────────────────────────────┐
│                Distributed Allocation (current default)              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GPU 0            GPU 1            GPU 2            GPU 3           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ ████░░░░ │    │ ████░░░░ │    │ ████░░░░ │    │ ░░░░░░░░ │      │
│  │ MIG 1/5  │    │ MIG 1/5  │    │ MIG 1/5  │    │ MIG 0/5  │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│   ⚠ partial       ⚠ partial       ⚠ partial       ░ empty          │
│                                                                     │
│  → Full GPU request arrives: only 1 GPU available (GPU 3)           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    Packed Allocation (bin-packing)                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GPU 0            GPU 1            GPU 2            GPU 3           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ ████████ │    │ ░░░░░░░░ │    │ ░░░░░░░░ │    │ ░░░░░░░░ │      │
│  │ MIG 3/5  │    │ MIG 0/5  │    │ MIG 0/5  │    │ MIG 0/5  │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│   ✓ packed         ★ free          ★ free          ★ free           │
│                                                                     │
│  → Full GPU request arrives: 3 GPUs available (GPU 1, 2, 3)        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Relates to #491

Design

getPreferredAllocation
├─ Full GPU (AlignedAllocationSupported && no annotations)
│   └─ alignedAlloc (topology-based, unaffected by this change)
│
└─ MIG / time-slicing / MPS
    ├─ allocationPolicy=packed → packedAlloc (new)
    └─ allocationPolicy=distributed (default) → distributedAlloc (existing, unmodified)

Key decisions:

  • distributedAlloc is completely untouched — packedAlloc is a separate function following the existing alignedAlloc/distributedAlloc pattern
  • Full GPU nodes are unaffected — alignedAlloc is selected before allocationPolicy is ever checked, so setting packed on a full-GPU node has no effect
  • Flag applies uniformly — when packed is set, it applies to MIG, time-slicing, and MPS. Silently ignoring a user-set flag for specific device types would be inconsistent
  • Per-node config supported — works with existing config-manager + ConfigMap + node label (nvidia.com/device-plugin.config) mechanism via YAML config

Usage

CLI flag / Environment variable

--allocation-policy=packed
# or
ALLOCATION_POLICY=packed
Config file (per-node via ConfigMap + node label)

version: v1
flags:
  migStrategy: mixed
  plugin:
    allocationPolicy: packed

kubectl label node mig-node nvidia.com/device-plugin.config=mig-packed

Test plan

  • TestDistributedAlloc (6 cases) — existing distributed behavior regression
  • TestDistributedAllocIsDefault — default consistency verified over 10 iterations
  • TestPackedAlloc (6 cases) — bin-packing: same GPU priority, overflow to next GPU
  • TestPackedVsDistributedContrast — two strategies produce different results on same input
  • TestFullGPUNodeIgnoresAllocationPolicy (3 sub-cases) — full GPU / MIG / replicated branch path verification
  • All existing internal/rm/ tests pass unchanged


copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 421e3c9 to 6686d1a (February 11, 2026 09:06)

ottowhite commented Feb 15, 2026

Hi! Very cool idea. I actually really want a very similar custom allocation policy for replicated (non-MIG) resources, but one that enables bin-packing replicas with topology-awareness.

For example, this could enable the allocation of 1.6 GPUs to a single model, with two tensor parallel ranks across 2 GPUs, 0.8 GPUs allocated to each tensor parallel rank, and both within a single NVLink domain (see my pretty diagram for an example). This would avoid excessive communication overhead during matrix multiplications across both tensor parallel model ranks, while leaving the 0.4 GPUs left over to be allocated to a smaller model, decreasing latency and increasing throughput in (surprisingly common) multi-model deployment scenarios including RAG with LLMs + Embedding models.

[image]

I believe the implementation of this policy is essentially a composite of what you've implemented and the alignedAlloc code path, which uses go-gpuallocator (gaining topology-awareness through NVML). At present, all of the replicated resource code paths discard topology awareness in favour of evenly load balancing GPU fractions with the distributed allocation policy, which prevents the use-case that I'm after.

Would be great if you're interested in discussing this more. I wonder if it's possible to slip it in with this PR? Also happy to contribute if that helps.

iid := AnnotatedID(candidates[i]).GetID()
jid := AnnotatedID(candidates[j]).GetID()
idiff := replicas[iid].total - replicas[iid].available
jdiff := replicas[jid].total - replicas[jid].available
return idiff > jdiff


Was the only change in this function and distributedAlloc the comparator between idiff and jdiff? Can we just pass in an enum and then have a condition on the enum to dictate this behavior? There is a lot of code duplication at the moment.

Author

Thanks for the review!

The duplication is intentional — I wanted distributedAlloc to stay completely untouched so reviewers can verify there's zero behavioral change just by looking at the diff.

On merging them: the two functions share the same structure because packed is just the inverse of distributed (flipped comparator). But as you mentioned with topology-aware bin-packing, a future policy would need NVML topology data and different sorting logic entirely — so I'd rather keep them separate now than merge and split again later.

That said, happy to extract the common setup (candidate filtering + replica counting) into a shared helper if that feels cleaner.


No problem :) to be honest I think it would actually be easier to review/maintain if you generalised the allocation function slightly.

It could be generalised to fractionalAlloc, taking an enum that allows distributedFractionPlacement or packedFractionPlacement policies. Then internally you just pick the comparator based on the policy.

What you suggested with pulling out the common helpers such as candidate filtering and replica counting is probably even better than the generalisation I suggested, as it enables the policies to be kept completely separate, and is more extensible in that it enables easier implementation of new policies again such as the topology-aware one.

When I was reviewing the code, I pulled up both distributedAlloc and packedAlloc side by side and was cross-checking for a while before I saw the flipped comparator. I expect other reviewers and contributors to go through a similar process. This is just my two cents though.

You've also improved the state of testing so that's an additional signal that this is functionally equivalent to what there was before.

In terms of the different topology-aware placement policy, I think it makes sense to keep it separate for now. But cool that you're opening this up and making it more configurable! I may build off of this myself.

Author

@ottowhite I've extracted prepareCandidates in the second commit.
Each policy function now only has its sorting logic, so the difference should be immediately visible.

@ottowhite ottowhite Feb 17, 2026

Looking better! There's still quite a bit of duplication across both of the Alloc functions. What is common is that they're both greedy allocation algorithms (repeatedly evaluating and selecting the next best option, without looking further ahead). Could you extract this into another helper, greedyDeviceAlloc or something similar, that makes the difference between these two functions even clearer? Could pass in a candidate comparator. It will also be much clearer to extend with other greedy allocation policies. For example, the one that I spoke about is another greedy allocation algorithm that happens to use topology-awareness in its comparator.

The existing distributedAlloc spreads replicated devices evenly across
physical GPUs, which was designed for time-slicing where workloads
compete for shared compute. MIG instances, however, are hardware-isolated
and do not suffer from contention when packed onto the same GPU.

This adds a --allocation-policy flag (env: ALLOCATION_POLICY) with two
options: "distributed" (default, preserving current behavior) and
"packed" (bin-packing onto fewest physical GPUs). The packed policy
helps free up entire GPUs for full-GPU workloads in mixed clusters.

The flag applies uniformly to all non-aligned allocation paths (MIG,
time-slicing, MPS) and can be configured per-node via ConfigMap and
the nvidia.com/device-plugin.config node label.

Signed-off-by: wkd-woo <wkdwoos@gmail.com>
@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 6686d1a to 8656ef3 (February 17, 2026 12:22)
Refactor common setup logic (candidate filtering, validation, replica
counting) into a shared prepareCandidates helper. Each allocation
policy function now focuses solely on its sorting strategy, making the
difference between distributed and packed immediately visible.

Signed-off-by: wkd-woo <wkdwoos@gmail.com>
@wkd-woo wkd-woo force-pushed the feature/packed-allocation-policy branch from 8656ef3 to 2839bfc (February 17, 2026 12:22)

ottowhite commented Feb 17, 2026

Also, is there a reason why the alignedAlloc happens to be in nvml_manager.go? It seems like just another allocation policy that would be more suited to be in allocate.go rather than there. I know this is not due to your code change but moving it there could improve the organisation of the code, and ease to extend with different allocation policies.


wkd-woo commented Feb 17, 2026

Thanks @ottowhite for your review :)
On moving alignedAlloc — agreed it could improve organization, but I'd rather keep this PR focused on the new feature and not restructure existing code!

I'd like to hear from a maintainer at this point to see if the current direction works.

@ottowhite

Fair enough! Though I think those refactors would be a minor improvement for organisation/extensibility, I think the testing is strong and implementation is solid.

LGTM!

Will be great to hear what the maintainers have to say. Also really keen on this feature and the more extensible policy opportunities that it opens up.


ottowhite commented Feb 19, 2026

Hey! By the way, we're running your packedAlloc allocation policy and it's working for multi-GPU deployments. We will imminently be forking it and building off of it.


wkd-woo commented Feb 25, 2026

Hi @elezar, just checking in to see if there's anything else needed from my side to move this forward. I'm happy to address any feedback :)
