4 changes: 3 additions & 1 deletion assets/state-cc-manager/0500_daemonset.yaml
@@ -34,7 +34,9 @@ spec:
fieldRef:
fieldPath: spec.nodeName
- name: CC_CAPABLE_DEVICE_IDS
Author:
I wonder where this CC_CAPABLE_DEVICE_IDS variable is being referenced.

Member:

Author:
Thank you! It looks like we may need a change in k8s-cc-manager as well, then. If we want to allow all 0x23xx (Hopper) and 0x2bxx (Blackwell) GPUs, we would rather not pass a list of specific device IDs.
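To make the trade-off concrete, here is a minimal sketch contrasting an explicit ID list with a family-prefix check. It is not the k8s-cc-manager implementation; the function names and the `family_prefixes` parameter are hypothetical, and the assumption is only that `CC_CAPABLE_DEVICE_IDS` is a comma-separated list of hex PCI device IDs, as it appears in this diff.

```python
def parse_device_ids(env_value: str) -> set[int]:
    """Parse a comma-separated list such as "0x2322,0x2331" into integers."""
    return {int(tok.strip(), 16) for tok in env_value.split(",") if tok.strip()}


def is_listed(device_id: int, env_value: str) -> bool:
    """Explicit-list check: every supported device must be enumerated."""
    return device_id in parse_device_ids(env_value)


def in_family(device_id: int, family_prefixes: tuple[int, ...] = (0x23, 0x2B)) -> bool:
    """Prefix check: compare only the high byte of the 16-bit device ID.

    The default prefixes (0x23, 0x2b) are taken from the comment above;
    they are an assumption about which ranges should count as CC capable.
    """
    return (device_id >> 8) in family_prefixes
```

With the old value `"0x2322,0x2331"`, `is_listed(0x2321, ...)` is false, which is why this PR has to add `0x2321` by hand; a prefix check would accept it without a list change, at the cost of also accepting every other ID in the range.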

-  value: "0x2322,0x2331"
+  # TODO - revisit: This list was reduced in 03688e3f61433cbf3bb8e2fad241d12672b04836
+  # We should align with deployments\gpu-operator\templates\nodefeaturerules.yaml
+  value: "0x2322,0x2321,0x2331"
# always use runc for driver containers
- name: NVIDIA_VISIBLE_DEVICES
value: void
10 changes: 9 additions & 1 deletion deployments/gpu-operator/templates/nodefeaturerules.yaml
@@ -89,6 +89,15 @@ spec:
matchExpressions:
vendor: {op: In, value: ["10de"]}
device: {op: In, value: ["2322"]}
+- name: "NVIDIA H100L 94GB"
+  labels:
+    "nvidia.com/gpu.H100L": "true"
+    "nvidia.com/gpu.family": "hopper"
+  matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        vendor: {op: In, value: ["10de"]}
+        device: {op: In, value: ["2321"]}
Author (@manuelh-dev, Oct 27, 2025):

Ideally we won't add devices piecemeal. Could we "whitelist" whole ranges instead: 0x23**, 0x2b** (GBXXX -- Blackwell, GBXXX -- Hopper)? Some devices may then need to be excluded via a "blacklist", e.g. exclude 2b00 TA1090SA [THOR].
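A rough sketch of what such an allow/deny scheme could look like, assuming the `0x23**` wildcard notation from this comment (each `*` standing for one hex digit). `ALLOW_PATTERNS`, `DENY_IDS`, and all function names are hypothetical and do not correspond to any existing option in the operator or in k8s-cc-manager.

```python
import re

# Wildcard ranges and the single exclusion named in the comment above.
ALLOW_PATTERNS = ["0x23**", "0x2b**"]
DENY_IDS = {"2b00"}  # TA1090SA [THOR]


def normalize(device_id: str) -> str:
    """Lower-case the ID and drop an optional 0x prefix."""
    return device_id.strip().lower().removeprefix("0x")


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate "0x23**" into an anchored regex; "*" matches one hex digit."""
    body = normalize(pattern)
    parts = ["[0-9a-f]" if ch == "*" else re.escape(ch) for ch in body]
    return re.compile("^" + "".join(parts) + "$")


def is_allowed(device_id: str) -> bool:
    """Deny list wins; otherwise the ID must match some allowed range."""
    dev = normalize(device_id)
    if dev in DENY_IDS:
        return False
    return any(wildcard_to_regex(p).match(dev) for p in ALLOW_PATTERNS)
```

Checking the deny list first keeps the exclusion effective even when the excluded ID also falls inside an allowed range, as `2b00` does here.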

Member:

One thing to note is that matchExpressions don't allow wildcards (as far as I am aware). Is there another component that could or should create these labels instead of a NodeFeatureRule directly?

Author (@manuelh-dev, Nov 4, 2025):

By the way, we could do something like the following:

- name: "NVIDIA Hopper GPU Family"
  labels:
    "nvidia.com/gpu.family": "hopper"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^23[0-9a-f]{2}$"]}
- name: "NVIDIA Blackwell GPU Family"
  labels:
    "nvidia.com/gpu.family": "blackwell"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^2b[0-9a-f]{2}$"]}
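The two patterns above can be sanity-checked against the device IDs that appear in this PR. This is only a sketch: it uses Python's `re` module rather than the Go regexp engine that NFD uses (the two should agree for patterns this simple), and it assumes the device attribute is reported as four lower-case hex digits without a `0x` prefix, as in the existing rules. The `family` helper is hypothetical.

```python
import re

# Patterns copied from the proposed rules above.
HOPPER = re.compile(r"^23[0-9a-f]{2}$")
BLACKWELL = re.compile(r"^2b[0-9a-f]{2}$")


def family(device: str):
    """Return the family label the proposed rules would assign, or None."""
    if HOPPER.match(device):
        return "hopper"
    if BLACKWELL.match(device):
        return "blackwell"
    return None


if __name__ == "__main__":
    for dev in ["2321", "2322", "233d", "2b00", "2B00", "0x2321"]:
        print(dev, family(dev))
```

Two things stand out when running this: the patterns are case-sensitive and reject a `0x` prefix, and `2b00` (the THOR device that the Oct 27 comment wants excluded) matches the Blackwell pattern, so a family-wide rule would still need a separate exclusion.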

- name: "NVIDIA CC Enabled"
labels:
"nvidia.com/cc.capable": "true"
@@ -104,4 +113,3 @@ spec:
nvidia.com/gpu.family: {op: In, value: ["hopper"]}
tdx.enabled: {op: IsTrue}
{{- end }}

4 changes: 3 additions & 1 deletion deployments/gpu-operator/values.yaml
@@ -511,7 +511,9 @@ ccManager:
imagePullSecrets: []
env:
- name: CC_CAPABLE_DEVICE_IDS
-  value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d"
Member:

As mentioned offline: The envvars from the values file should probably be removed so that a user can properly override them. The defaults should be specified in the daemonset template instead.

Author:

Thank you! As we discussed offline, the change where some of the dev defaults were removed from values.yaml was #1580; the ccManager envvars may have been missed in that removal.

+  # TODO: 0x233d does not seem to be listed in deployments\gpu-operator\templates\nodefeaturerules.yaml, or 0500_daemonset.yaml
+  # The value was at least added to assets\state-cc-manager\0500_daemonset.yaml in 094e28f5056cf5ebfcf5c0a6277672cdda2c9e08 (but later on removed)
+  value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x2321,0x233d"
resources: {}

node-feature-discovery:
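The TODO in this hunk points at a drift between the ID lists in different files. A small sketch of how that drift could be checked, using the two new `CC_CAPABLE_DEVICE_IDS` values exactly as they appear in this diff; the constant and function names are hypothetical, and the IDs in nodefeaturerules.yaml are left out because only part of that file is visible here.

```python
# New values copied from this diff.
DAEMONSET_IDS = "0x2322,0x2321,0x2331"  # assets/state-cc-manager/0500_daemonset.yaml
VALUES_IDS = "0x2339,0x2331,0x2330,0x2324,0x2322,0x2321,0x233d"  # values.yaml


def parse(csv: str) -> set[str]:
    """Split a comma-separated ID list into a set of lower-case tokens."""
    return {tok.strip().lower() for tok in csv.split(",") if tok.strip()}


def compare(a: str, b: str) -> tuple[list[str], list[str]]:
    """Return (IDs only in a, IDs only in b), each sorted."""
    only_a = sorted(parse(a) - parse(b))
    only_b = sorted(parse(b) - parse(a))
    return only_a, only_b


if __name__ == "__main__":
    only_daemonset, only_values = compare(DAEMONSET_IDS, VALUES_IDS)
    print("only in daemonset:", only_daemonset)
    print("only in values.yaml:", only_values)
```

For the values in this PR, nothing is unique to the daemonset, while `0x2324`, `0x2330`, `0x2339`, and `0x233d` appear only in values.yaml, which matches the concern raised in the TODO.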