Label node nvidia.com/gpu.count for MIG Config #1501

@Alja9

Hello,
I have a question about the MIG labels on the node.

When MIG is configured on a node, the k8s-device-plugin adds MIG labels to the node, but the nvidia.com/gpu.count label is not updated to reflect the MIG configuration. Examples:

  • Non-MIG Configuration
...
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=PowerEdge-XE9680
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-disabled
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=mixed
...
Capacity:
  ...
  nvidia.com/gpu:             8
  ...
Allocatable:
  ...
  nvidia.com/gpu:             8
  ...
...
  • MIG Configuration
...
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=PowerEdge-XE9680
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig-1g.10gb.count=14
                    nvidia.com/mig-1g.10gb.engines.copy=1
                    nvidia.com/mig-1g.10gb.engines.decoder=1
                    nvidia.com/mig-1g.10gb.engines.encoder=0
                    nvidia.com/mig-1g.10gb.engines.jpeg=1
                    nvidia.com/mig-1g.10gb.engines.ofa=0
                    nvidia.com/mig-1g.10gb.memory=9984
                    nvidia.com/mig-1g.10gb.multiprocessors=16
                    nvidia.com/mig-1g.10gb.product=NVIDIA-H100-80GB-HBM3-MIG-1g.10gb
                    nvidia.com/mig-1g.10gb.replicas=1
                    nvidia.com/mig-1g.10gb.slices.ci=1
                    nvidia.com/mig-1g.10gb.slices.gi=1
                    nvidia.com/mig-3g.40gb.count=5
                    nvidia.com/mig-3g.40gb.engines.copy=3
                    nvidia.com/mig-3g.40gb.engines.decoder=3
                    nvidia.com/mig-3g.40gb.engines.encoder=0
                    nvidia.com/mig-3g.40gb.engines.jpeg=3
                    nvidia.com/mig-3g.40gb.engines.ofa=0
                    nvidia.com/mig-3g.40gb.memory=40320
                    nvidia.com/mig-3g.40gb.multiprocessors=60
                    nvidia.com/mig-3g.40gb.product=NVIDIA-H100-80GB-HBM3-MIG-3g.40gb
                    nvidia.com/mig-3g.40gb.replicas=1
                    nvidia.com/mig-3g.40gb.slices.ci=3
                    nvidia.com/mig-3g.40gb.slices.gi=3
                    nvidia.com/mig-4g.40gb.count=5
                    nvidia.com/mig-4g.40gb.engines.copy=4
                    nvidia.com/mig-4g.40gb.engines.decoder=4
                    nvidia.com/mig-4g.40gb.engines.encoder=0
                    nvidia.com/mig-4g.40gb.engines.jpeg=4
                    nvidia.com/mig-4g.40gb.engines.ofa=0
                    nvidia.com/mig-4g.40gb.memory=40320
                    nvidia.com/mig-4g.40gb.multiprocessors=64
                    nvidia.com/mig-4g.40gb.product=NVIDIA-H100-80GB-HBM3-MIG-4g.40gb
                    nvidia.com/mig-4g.40gb.replicas=1
                    nvidia.com/mig-4g.40gb.slices.ci=4
                    nvidia.com/mig-4g.40gb.slices.gi=4
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=mig-config-26
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=mixed
...
Capacity:
  ...
  nvidia.com/gpu:             1
  nvidia.com/mig-1g.10gb:     14
  nvidia.com/mig-3g.40gb:     5
  nvidia.com/mig-4g.40gb:     5
  ...
Allocatable:
  ...
  nvidia.com/gpu:             1
  nvidia.com/mig-1g.10gb:     14
  nvidia.com/mig-3g.40gb:     5
  nvidia.com/mig-4g.40gb:     5
  ...
...

Is there a way to make the nvidia.com/gpu.count node label match the nvidia.com/gpu count reported under Capacity, or the GPU count defined in the MIG config?
(In the MIG example above, Capacity shows nvidia.com/gpu: 1, but the node label nvidia.com/gpu.count still shows 8 instead of 1.)
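
To make the mismatch easy to reproduce, here is a minimal sketch that compares the nvidia.com/gpu.count label with the nvidia.com/gpu capacity of a node object. It is not part of the device plugin or the GPU Operator. The function name compare_gpu_count and the sample_node dict are hypothetical; the dict is a trimmed copy of the MIG example above, shaped like the output of `kubectl get node <node-name> -o json`.

```python
import json

GPU_COUNT_LABEL = "nvidia.com/gpu.count"   # node label discussed in this issue
GPU_RESOURCE = "nvidia.com/gpu"            # full-GPU extended resource
MIG_RESOURCE_PREFIX = "nvidia.com/mig-"    # MIG resources, e.g. nvidia.com/mig-1g.10gb


def compare_gpu_count(node: dict) -> dict:
    """Compare the gpu.count label with the full-GPU capacity of a node object.

    `node` is assumed to have the shape of a Kubernetes Node in JSON form
    (metadata.labels and status.capacity are string-to-string maps).
    """
    labels = node.get("metadata", {}).get("labels", {})
    capacity = node.get("status", {}).get("capacity", {})

    label_count = int(labels.get(GPU_COUNT_LABEL, 0))
    capacity_count = int(capacity.get(GPU_RESOURCE, 0))

    # Collect the MIG device resources advertised in capacity.
    mig_devices = {
        name: int(value)
        for name, value in capacity.items()
        if name.startswith(MIG_RESOURCE_PREFIX)
    }

    return {
        "label_count": label_count,
        "capacity_count": capacity_count,
        "mig_devices": mig_devices,
        "matches": label_count == capacity_count,
    }


# Hypothetical sample: values copied from the MIG example in this issue.
sample_node = {
    "metadata": {
        "labels": {
            "nvidia.com/gpu.count": "8",
            "nvidia.com/mig.config": "mig-config-26",
            "nvidia.com/mig.strategy": "mixed",
        }
    },
    "status": {
        "capacity": {
            "nvidia.com/gpu": "1",
            "nvidia.com/mig-1g.10gb": "14",
            "nvidia.com/mig-3g.40gb": "5",
            "nvidia.com/mig-4g.40gb": "5",
        }
    },
}

result = compare_gpu_count(sample_node)
print(json.dumps(result, indent=2))
```

With the values from the MIG example, the label reports 8 while the capacity reports 1, so "matches" is false. For a live node, the same function could be fed the parsed JSON of the node object instead of sample_node.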

Labels: feature, needs-triage, question
