Add configurable allocation policy (packed/distributed) for replicated and MIG resources #1621
wkd-woo wants to merge 6 commits into NVIDIA:main
Conversation
Force-pushed from 421e3c9 to 6686d1a
Hi! Very cool idea, I actually really want a very similar custom allocation policy for replicated (non-MIG) resources, but for enabling bin-packing replicas with topology-awareness. For example, this could enable the allocation of 1.6 GPUs to a single model, with two tensor parallel ranks across 2 GPUs, 0.8 GPUs allocated to each tensor parallel rank, and both within a single NVLink domain (see my diagram for an example). This would avoid excessive communication overhead during matrix multiplications across both tensor parallel model ranks, while leaving the 0.4 GPUs left over to be allocated to a smaller model, decreasing latency and increasing throughput in (surprisingly common) multi-model deployment scenarios such as RAG with LLMs + Embedding models.
I believe the implementation of this policy is essentially a composite between what you've implemented, and the alignedAlloc code path which uses go-gpuallocator (gaining topology-awareness through NVML). At present, any of the replicated resource code paths discard topology awareness in favour of evenly load balancing GPU fractions with the distributed allocation policy, which prevents the use-case that I'm after. Would be great if you're interested in discussing this more. I wonder if it's possible to slip it in with this PR? Also happy to contribute if that helps. |
```go
jid := AnnotatedID(candidates[j]).GetID()
idiff := replicas[iid].total - replicas[iid].available
jdiff := replicas[jid].total - replicas[jid].available
return idiff > jdiff
```
Was the only change in this function and distributedAlloc the comparator between idiff and jdiff? Can we just pass in an enum and then have a condition on the enum to dictate this behavior? There is a lot of code duplication at the moment.
Thanks for the review!
The duplication is intentional — I wanted distributedAlloc to stay completely untouched so reviewers can verify there's zero behavioral change just by looking at the diff.
On merging them: the two functions share the same structure because packed is just the inverse of distributed (flipped comparator). But as you mentioned with topology-aware bin-packing, a future policy would need NVML topology data and different sorting logic entirely — so I'd rather keep them separate now than merge and split again later.
That said, happy to extract the common setup (candidate filtering + replica counting) into a shared helper if that feels cleaner.
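For concreteness, the "flipped comparator" difference between the two policies can be sketched as follows. This is a simplified illustration, not the plugin's actual code: the `replica` struct and `sortCandidates` signature are invented here, modeled loosely on the snippet above.

```go
package main

import (
	"fmt"
	"sort"
)

// replica tracks how many replicas of a physical GPU exist and remain
// available. This is a hypothetical simplification of the plugin's
// internal bookkeeping.
type replica struct {
	total     int
	available int
}

// sortCandidates orders candidate GPU IDs by current load
// (total - available). The distributed policy prefers the least-loaded
// GPU; packed is the identical sort with the comparator flipped, so it
// keeps filling the most-loaded GPU first.
func sortCandidates(ids []string, replicas map[string]replica, packed bool) {
	sort.SliceStable(ids, func(i, j int) bool {
		idiff := replicas[ids[i]].total - replicas[ids[i]].available
		jdiff := replicas[ids[j]].total - replicas[ids[j]].available
		if packed {
			return idiff > jdiff // most-loaded first: bin-pack
		}
		return idiff < jdiff // least-loaded first: spread evenly
	})
}

func main() {
	replicas := map[string]replica{
		"gpu0": {total: 4, available: 1}, // 3 replicas in use
		"gpu1": {total: 4, available: 4}, // 0 replicas in use
	}
	ids := []string{"gpu0", "gpu1"}
	sortCandidates(ids, replicas, true)
	fmt.Println("packed picks:", ids[0]) // packed picks: gpu0
	sortCandidates(ids, replicas, false)
	fmt.Println("distributed picks:", ids[0]) // distributed picks: gpu1
}
```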
No problem :) to be honest I think it would actually be easier to review/maintain if you generalised the allocation function slightly.
It could be generalised to fractionalAlloc, taking an enum that allows distributedFractionPlacement or packedFractionPlacement policies. Then internally you just pick the comparator based on the policy.
What you suggested with pulling out the common helpers such as candidate filtering and replica counting is probably even better than the generalisation I suggested, as it enables the policies to be kept completely separate, and is more extensible in that it enables easier implementation of new policies again such as the topology-aware one.
When I was reviewing the code, I pulled up both distributedAlloc and packedAlloc side by side and was cross-checking for a while before I saw the flipped comparator. I expect other reviewers and contributors to go through a similar process. This is just my two cents though.
You've also improved the state of testing so that's an additional signal that this is functionally equivalent to what there was before.
In terms of the different topology-aware placement policy, I think it makes sense to keep it separate for now. But cool that you're opening this up and making it more configurable! I may build off of this myself.
@ottowhite I've extracted prepareCandidates in the second commit.
Each policy function now only has its sorting logic, so the difference should be immediately visible.
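For readers following along without the diff, a helper like this might look roughly as follows. The name `prepareCandidates` matches the discussion, but the signature and error handling here are illustrative, not the PR's actual code.

```go
package main

import "fmt"

// prepareCandidates is a sketch of the shared setup helper: it
// validates the requested device IDs against the available set and
// returns the candidate list, leaving each allocation policy function
// with nothing but its sorting strategy.
func prepareCandidates(required, available []string) ([]string, error) {
	avail := make(map[string]bool, len(available))
	for _, id := range available {
		avail[id] = true
	}
	candidates := make([]string, 0, len(required))
	for _, id := range required {
		if !avail[id] {
			return nil, fmt.Errorf("requested device %q is not available", id)
		}
		candidates = append(candidates, id)
	}
	return candidates, nil
}

func main() {
	c, err := prepareCandidates([]string{"gpu0"}, []string{"gpu0", "gpu1"})
	fmt.Println(c, err) // [gpu0] <nil>
}
```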
Looking better! There's still quite a bit of duplication across both of the Alloc functions. What is common is that they're both greedy allocation algorithms (repeatedly evaluating and selecting the next best option, without looking further ahead). Could you extract this into another helper, greedyDeviceAlloc or something similar, that makes the difference between these two functions even clearer? You could pass in a candidate comparator. It would also be much easier to extend with other greedy allocation policies; for example, the one I spoke about is another greedy allocation algorithm that happens to use topology-awareness in its comparator.
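The suggested generalisation could be sketched like this. All names (`greedyAlloc`, the comparator shape, the static load map) are illustrative only; a real implementation would update the replica counts after each pick so the comparator sees the in-progress allocation.

```go
package main

import (
	"fmt"
	"sort"
)

// comparator reports whether candidate i should be preferred over j.
// A packed policy, a distributed policy, or a topology-aware policy
// would each supply their own comparator.
type comparator func(i, j string) bool

// greedyAlloc repeatedly re-sorts the remaining candidates with the
// supplied comparator and takes the best one until `size` devices are
// chosen. This is the greedy skeleton common to both Alloc functions.
func greedyAlloc(candidates []string, size int, better comparator) []string {
	chosen := make([]string, 0, size)
	remaining := append([]string(nil), candidates...)
	for len(chosen) < size && len(remaining) > 0 {
		sort.SliceStable(remaining, func(i, j int) bool {
			return better(remaining[i], remaining[j])
		})
		chosen = append(chosen, remaining[0])
		remaining = remaining[1:]
	}
	return chosen
}

func main() {
	// Static per-GPU load, for illustration only.
	load := map[string]int{"gpu0": 3, "gpu1": 0, "gpu2": 1}
	packed := func(i, j string) bool { return load[i] > load[j] }
	fmt.Println(greedyAlloc([]string{"gpu0", "gpu1", "gpu2"}, 2, packed)) // [gpu0 gpu2]
}
```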
The existing distributedAlloc spreads replicated devices evenly across physical GPUs, which was designed for time-slicing where workloads compete for shared compute. MIG instances, however, are hardware-isolated and do not suffer from contention when packed onto the same GPU. This adds a --allocation-policy flag (env: ALLOCATION_POLICY) with two options: "distributed" (default, preserving current behavior) and "packed" (bin-packing onto fewest physical GPUs). The packed policy helps free up entire GPUs for full-GPU workloads in mixed clusters. The flag applies uniformly to all non-aligned allocation paths (MIG, time-slicing, MPS) and can be configured per-node via ConfigMap and the nvidia.com/device-plugin.config node label. Signed-off-by: wkd-woo <wkdwoos@gmail.com>
Force-pushed from 6686d1a to 8656ef3
Refactor common setup logic (candidate filtering, validation, replica counting) into a shared prepareCandidates helper. Each allocation policy function now focuses solely on its sorting strategy, making the difference between distributed and packed immediately visible. Signed-off-by: wkd-woo <wkdwoos@gmail.com>
Force-pushed from 8656ef3 to 2839bfc
Also, is there a reason why alignedAlloc happens to be in nvml_manager.go? It seems like just another allocation policy that would be better suited to allocate.go. I know this is not due to your code change, but moving it there could improve the organisation of the code and make it easier to extend with different allocation policies.
Thanks @ottowhite for your review :) I'd like to hear from a maintainer at this point to see if the current direction works.
Fair enough! Though I think those refactors would be a minor improvement for organisation/extensibility, I think the testing is strong and the implementation is solid. LGTM! Will be great to hear what the maintainers have to say. Also really keen on this feature and the more extensible policy opportunities that it opens up.
Hey! By the way, we're running your packedAlloc allocation policy and it's working for multi-GPU deployments. We will imminently be forking and building off of it.
Hi @elezar, just checking in to see if there's anything else needed from my side to move this forward. I'm happy to address any feedback :)

Summary
- `--allocation-policy` flag (`ALLOCATION_POLICY` env) with `distributed` (default) and `packed` options
- `packed` mode bin-packs replicated/MIG devices onto the fewest physical GPUs, freeing up remaining GPUs for full-GPU workloads
- Default behavior (`distributed`) is unchanged; no breaking changes

Motivation
The current `distributedAlloc` was designed for time-slicing, where distributing replicas across physical GPUs avoids compute contention. However, MIG devices also fall into this code path simply because `AlignedAllocationSupported()` returns `false` for them, not because distributed allocation is the right strategy. MIG instances are hardware-isolated partitions with dedicated SMs and memory. Packing them onto fewer physical GPUs carries no performance penalty, and frees up remaining GPUs for full-GPU workloads.
Relates to #491
Design
Key decisions:
- `distributedAlloc` is completely untouched; `packedAlloc` is a separate function following the existing `alignedAlloc`/`distributedAlloc` pattern
- `alignedAlloc` is selected before `allocationPolicy` is ever checked, so setting `packed` on a full-GPU node has no effect
- When `packed` is set, it applies to MIG, time-slicing, and MPS; silently ignoring a user-set flag for specific device types would be inconsistent
- Works with the existing `config-manager` + ConfigMap + node label (`nvidia.com/device-plugin.config`) mechanism via YAML config

Usage
CLI flag / environment variable:

```
--allocation-policy=packed   # or ALLOCATION_POLICY=packed
```

Config file (per-node via ConfigMap + node label):

```yaml
version: v1
flags:
  migStrategy: mixed
  plugin:
    allocationPolicy: packed
```

```
kubectl label node mig-node nvidia.com/device-plugin.config=mig-packed
```

Test plan
All existing internal/rm/ tests pass unchanged