Add configurable allocation policy (packed/distributed) for replicated and MIG resources #1621
Open: wkd-woo wants to merge 6 commits into NVIDIA:main from wkd-woo:feature/packed-allocation-policy
Commits
1fd1a4e Add configurable allocation policy for replicated and MIG resources (wkd-woo)
2839bfc Extract prepareCandidates helper to reduce duplication (wkd-woo)
ec55e33 Merge branch 'main' into feature/packed-allocation-policy (wkd-woo)
1840bb2 Merge branch 'main' into feature/packed-allocation-policy (wkd-woo)
013bb85 Merge branch 'main' into feature/packed-allocation-policy (wkd-woo)
0d662af Merge branch 'main' into feature/packed-allocation-policy (wkd-woo)
Is the comparison between idiff and jdiff the only difference between this function and distributedAlloc? If so, could we pass in an enum and branch on it to select the behavior? There is a lot of code duplication at the moment.
Thanks for the review!
The duplication is intentional — I wanted distributedAlloc to stay completely untouched so reviewers can verify there's zero behavioral change just by looking at the diff.
On merging them: the two functions share the same structure because packed is just the inverse of distributed (flipped comparator). But as you mentioned with topology-aware bin-packing, a future policy would need NVML topology data and different sorting logic entirely — so I'd rather keep them separate now than merge and split again later.
That said, happy to extract the common setup (candidate filtering + replica counting) into a shared helper if that feels cleaner.
No problem :) to be honest I think it would actually be easier to review/maintain if you generalised the allocation function slightly.
It could be generalised to fractionalAlloc, taking an enum that allows distributedFractionPlacement or packedFractionPlacement policies. Then internally you just pick the comparator based on the policy.
What you suggested, pulling out the common helpers such as candidate filtering and replica counting, is probably even better than the generalisation I suggested: it keeps the policies completely separate and is more extensible, making it easier to implement new policies such as the topology-aware one.
When I was reviewing the code, I pulled up both distributedAlloc and packedAlloc side by side and was cross-checking for a while before I saw the flipped comparator. I expect other reviewers and contributors to go through a similar process. This is just my two cents though.
You've also improved the state of testing so that's an additional signal that this is functionally equivalent to what there was before.
In terms of the different topology-aware placement policy, I think it makes sense to keep it separate for now. But cool that you're opening this up and making it more configurable! I may build off of this myself.
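For reference, the enum-based generalisation discussed above could look roughly like the sketch below. It is not the plugin's code: the Replica type, the totals map, and the function signature are hypothetical, and it assumes idiff and jdiff are each GPU's count of already-allocated replicas (total minus available). The policy names follow the ones proposed in the comment.

```go
package main

import (
	"errors"
	"sort"
)

// PlacementPolicy is a hypothetical enum selecting how replicas are placed.
type PlacementPolicy int

const (
	DistributedFractionPlacement PlacementPolicy = iota
	PackedFractionPlacement
)

// Replica is a hypothetical candidate: one replica of a physical GPU.
type Replica struct {
	ID  string // replica identifier
	GPU string // physical device the replica belongs to
}

// fractionalAlloc greedily picks `needed` replicas from candidates.
// totals maps each GPU to its total replica count (assumed input).
func fractionalAlloc(candidates []Replica, totals map[string]int, needed int, policy PlacementPolicy) ([]string, error) {
	if needed > len(candidates) {
		return nil, errors.New("not enough available devices to satisfy allocation")
	}
	// Work on a copy so the caller's slice is not reordered.
	pool := append([]Replica(nil), candidates...)

	// Count the free replicas per GPU.
	free := make(map[string]int)
	for _, c := range pool {
		free[c.GPU]++
	}

	var picked []string
	for n := 0; n < needed; n++ {
		// Re-sort before every pick because the previous pick changed the counts.
		sort.SliceStable(pool, func(i, j int) bool {
			idiff := totals[pool[i].GPU] - free[pool[i].GPU]
			jdiff := totals[pool[j].GPU] - free[pool[j].GPU]
			if policy == PackedFractionPlacement {
				return idiff > jdiff // fill the busiest GPU first
			}
			return idiff < jdiff // spread onto the least-used GPU first
		})
		picked = append(picked, pool[0].ID)
		free[pool[0].GPU]--
		pool = pool[1:]
	}
	return picked, nil
}
```

With this shape the only policy-specific code is the direction of the comparison, which is the point the review makes about the flipped comparator.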
@ottowhite I've extracted prepareCandidates in the second commit.
Each policy function now only has its sorting logic, so the difference should be immediately visible.
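As an illustration of that split, here is a minimal sketch of a shared setup helper plus one policy function. Only the names prepareCandidates and packedAlloc come from the thread; the signatures, the replicaCount type, and the "<gpu>::<n>" replica ID format are assumptions made for the example, and the actual commit may differ.

```go
package main

import (
	"errors"
	"sort"
	"strings"
)

// replicaCount holds per-GPU totals (hypothetical type).
type replicaCount struct{ total, available int }

// gpuOf strips the replica suffix. The "<gpu>::<n>" format is an assumption.
func gpuOf(id string) string {
	gpu, _, _ := strings.Cut(id, "::")
	return gpu
}

// prepareCandidates does the shared setup: candidate filtering and replica counting.
func prepareCandidates(all, available, required []string, size int) ([]string, map[string]*replicaCount, int, error) {
	req := make(map[string]bool, len(required))
	for _, id := range required {
		req[id] = true
	}
	// Candidates are the available replicas that are not already required.
	var candidates []string
	for _, id := range available {
		if !req[id] {
			candidates = append(candidates, id)
		}
	}
	needed := size - len(required)
	if needed > len(candidates) {
		return nil, nil, 0, errors.New("not enough available devices to satisfy allocation")
	}
	// Count available and total replicas for every GPU that has a candidate.
	counts := make(map[string]*replicaCount)
	for _, id := range candidates {
		gpu := gpuOf(id)
		if counts[gpu] == nil {
			counts[gpu] = &replicaCount{}
		}
		counts[gpu].available++
	}
	for _, id := range all {
		if c := counts[gpuOf(id)]; c != nil {
			c.total++
		}
	}
	return candidates, counts, needed, nil
}

// packedAlloc keeps only its sorting logic: most-allocated GPU first.
// It returns the required replicas followed by the newly picked ones.
func packedAlloc(all, available, required []string, size int) ([]string, error) {
	candidates, counts, needed, err := prepareCandidates(all, available, required, size)
	if err != nil {
		return nil, err
	}
	picked := append([]string(nil), required...)
	for n := 0; n < needed; n++ {
		sort.SliceStable(candidates, func(i, j int) bool {
			ci, cj := counts[gpuOf(candidates[i])], counts[gpuOf(candidates[j])]
			return ci.total-ci.available > cj.total-cj.available
		})
		counts[gpuOf(candidates[0])].available--
		picked = append(picked, candidates[0])
		candidates = candidates[1:]
	}
	return picked, nil
}
```

Under these assumptions a distributedAlloc counterpart would be identical except for using `<` in the comparator.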
Looking better! There's still quite a bit of duplication across both of the Alloc functions. What they have in common is that they're both greedy allocation algorithms (repeatedly evaluating and selecting the next best option, without looking further ahead). Could you extract this into another helper, greedyDeviceAlloc or something similar, that makes the difference between these two functions even clearer? It could take a candidate comparator. It would also be much clearer to extend with other greedy allocation policies. For example, the one that I spoke about is another greedy allocation algorithm that happens to use topology-awareness in its comparator.
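One possible shape for such a helper is sketched below. Only the name greedyDeviceAlloc comes from the comment above; the candidate and gpuUsage types, the preferFunc signature, and the two comparator functions are hypothetical, and the sketch assumes each policy ranks GPUs by their number of already-allocated replicas.

```go
package main

import "sort"

// gpuUsage tracks replica counts for one physical GPU (hypothetical type).
type gpuUsage struct{ total, available int }

func (u gpuUsage) allocated() int { return u.total - u.available }

// candidate is one free replica and the GPU it lives on (hypothetical type).
type candidate struct{ ID, GPU string }

// preferFunc reports whether a GPU with usage a should be picked before one with usage b.
type preferFunc func(a, b gpuUsage) bool

// greedyDeviceAlloc repeatedly takes the best candidate according to prefer,
// updating the usage counts after every pick and never looking further ahead.
func greedyDeviceAlloc(candidates []candidate, usage map[string]gpuUsage, needed int, prefer preferFunc) []string {
	// Copy the inputs so the caller's data is left untouched.
	pool := append([]candidate(nil), candidates...)
	state := make(map[string]gpuUsage, len(usage))
	for gpu, u := range usage {
		state[gpu] = u
	}

	var picked []string
	for n := 0; n < needed && len(pool) > 0; n++ {
		sort.SliceStable(pool, func(i, j int) bool {
			return prefer(state[pool[i].GPU], state[pool[j].GPU])
		})
		best := pool[0]
		u := state[best.GPU]
		u.available--
		state[best.GPU] = u
		picked = append(picked, best.ID)
		pool = pool[1:]
	}
	return picked
}

// distributedFirst prefers the GPU with the fewest allocated replicas.
func distributedFirst(a, b gpuUsage) bool { return a.allocated() < b.allocated() }

// packedFirst prefers the GPU with the most allocated replicas.
func packedFirst(a, b gpuUsage) bool { return a.allocated() > b.allocated() }
```

Each policy then reduces to a one-line comparator, and a further greedy policy such as a topology-aware one would need a richer comparator input than gpuUsage, which this sketch does not attempt to model.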