Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Jan 7, 2026

Summary

  • Add explicit default value handling for device attributes that may not be supported by older CUDA drivers
  • When cuDeviceGetAttribute returns CUDA_ERROR_INVALID_VALUE, return a sensible default instead of raising an error
  • Enables forward compatibility: cuda-core compiled against CUDA 12.9 works with CUDA 12.0 drivers

Changes

  • Add default parameter to _get_attribute() and _get_cached_attribute() with default value 0
  • Use default=1 for mem_sync_domain_count (single domain is traditional behavior)
  • Use default=-1 for host_numa_id (indicates NUMA not supported)
  • Document that gpu_pci_device_id/gpu_pci_subsystem_id return 0 if unsupported (added in CUDA 12.8)
  • Add comments marking the start of CUDA 12 and CUDA 13 device attributes
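The fallback behavior listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the cuda.core implementation: `FakeDriver`, `InvalidValueError`, and `get_attribute` are hypothetical stand-ins for the driver call and the Cython helpers, and the attribute names are used only as dictionary keys.

```python
class InvalidValueError(Exception):
    """Hypothetical stand-in for CUDA_ERROR_INVALID_VALUE."""


class FakeDriver:
    """Hypothetical driver that only knows a fixed set of attributes."""

    def __init__(self, supported):
        self._supported = dict(supported)

    def query(self, attr):
        # An older driver rejects attributes it has never heard of.
        if attr not in self._supported:
            raise InvalidValueError(attr)
        return self._supported[attr]


def get_attribute(driver, attr, default=0):
    """Return the attribute value, or `default` if the driver rejects it."""
    try:
        return driver.query(attr)
    except InvalidValueError:
        return default


# Simulate an older driver that predates the newer attributes.
old_driver = FakeDriver({"max_threads_per_block": 1024})

print(get_attribute(old_driver, "max_threads_per_block"))             # supported attribute
print(get_attribute(old_driver, "mem_sync_domain_count", default=1))  # single domain
print(get_attribute(old_driver, "host_numa_id", default=-1))          # NUMA not supported
print(get_attribute(old_driver, "gpu_pci_device_id"))                 # generic fallback
```

Choosing the default per attribute keeps the fallback meaningful for each query instead of forcing every unsupported attribute to the same value.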

Closes #1420

@Andy-Jost Andy-Jost added this to the cuda.core beta 11 milestone Jan 7, 2026
@Andy-Jost Andy-Jost added the bug, P0, and cuda.core labels Jan 7, 2026
@Andy-Jost Andy-Jost self-assigned this Jan 7, 2026
@Andy-Jost Andy-Jost requested review from leofang, rparolin and rwgk January 7, 2026 22:31
@Andy-Jost
Contributor Author

/ok to test 3af415c

@Andy-Jost
Contributor Author

/ok to test 5f957b1

             err = cydriver.cuDeviceGetAttribute(&val, attr, self._handle)
-        if err == cydriver.CUresult.CUDA_ERROR_INVALID_VALUE:
-            return 0
+        if err == cydriver.CUresult.CUDA_ERROR_INVALID_VALUE and default is not None:
Collaborator

This tiny tweak will make it possible to specify None as the default.

@@ -32,6 +32,7 @@ if TYPE_CHECKING:
 # but it seems it is very convenient to expose them for testing purposes...
 _tls = threading.local()
 _lock = threading.Lock()
+_NO_DEFAULT = object()
 cdef bint _is_cuInit = False


@@ -61,7 +62,7 @@ cdef class DeviceProperties:
         cdef cydriver.CUresult err
         with nogil:
             err = cydriver.cuDeviceGetAttribute(&val, attr, self._handle)
-        if err == cydriver.CUresult.CUDA_ERROR_INVALID_VALUE and default is not None:
+        if err == cydriver.CUresult.CUDA_ERROR_INVALID_VALUE and default is not _NO_DEFAULT:
             return default
         HANDLE_RETURN(err)
         return val
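
The sentinel in the suggestion above exists so that `None` can itself be passed as a default. A pure-Python sketch of the idea follows; `lookup` and its dictionary argument are hypothetical and only mirror the shape of the check, not the Cython helper.

```python
_NO_DEFAULT = object()  # unique marker meaning "caller supplied no default"


def lookup(values, key, default=_NO_DEFAULT):
    """Return values[key]; fall back to `default` only if one was given."""
    if key in values:
        return values[key]
    if default is not _NO_DEFAULT:
        # Any object, including None, is a legitimate default here.
        return default
    raise KeyError(key)


attrs = {"known": 7}
assert lookup(attrs, "known") == 7
assert lookup(attrs, "missing", default=None) is None

try:
    lookup(attrs, "missing")
except KeyError:
    print("no default supplied, so the error propagates")
```

An identity check against a private `object()` cannot collide with any value a caller might pass, which is why it is preferred over comparing against `None`.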

Contributor Author

After updating the type signatures to return int for slight efficiency, it was no longer possible to return None, so I updated the two related functions locally to convert 0 to None.

@Andy-Jost Andy-Jost force-pushed the fix-attribute-handling branch from 5f957b1 to d946b9a Compare January 8, 2026 00:56
@Andy-Jost
Contributor Author

Latest upload changes the PCI device IDs to None (from 0) when the driver query fails. The implementation is slightly different than what was suggested because I also updated the cdef helper function to return int and include an error spec.

@Andy-Jost Andy-Jost force-pushed the fix-attribute-handling branch from 2e4caeb to a7d4c22 Compare January 8, 2026 01:04
Add explicit default value handling for device attributes that may not
be supported by older CUDA drivers. When cuDeviceGetAttribute returns
CUDA_ERROR_INVALID_VALUE, return a sensible default instead of raising
an error.

- Add default parameter to _get_attribute() and _get_cached_attribute()
- Use default=0 for boolean/enablement attributes (returns False)
- Use default=1 for mem_sync_domain_count (single domain is traditional behavior)
- Use default=-1 for host_numa_id (indicates NUMA not supported)
- Document that gpu_pci_device_id/gpu_pci_subsystem_id return 0 if unsupported

Closes NVIDIA#1420
@Andy-Jost Andy-Jost force-pushed the fix-attribute-handling branch from 0a08b87 to 9945810 Compare January 8, 2026 01:07
@Andy-Jost
Contributor Author

I'm not 100% sold on using None to indicate when the NUMA device ID queries fail. That's because every attribute returns an integral type (int or bool) and the driver itself uses integer sentinels to indicate "failure" or "not supported," such as -1 for host_numa_id when the system does not support NUMA. Now users are in the position of sometimes expecting None (Pythonic way) and at other times expecting integer sentinels (driver way).

@Andy-Jost
Contributor Author

/ok to test fdff347

@rwgk
Collaborator

rwgk commented Jan 8, 2026

> I'm not 100% sold on using None to indicate when the NUMA device ID queries fail. That's because every attribute returns an integral type (int or bool) and the driver itself uses integer sentinels to indicate "failure" or "not supported," such as -1 for host_numa_id when the system does not support NUMA. Now users are in the position of sometimes expecting None (Pythonic way) and at other times expecting integer sentinels (driver way).

I think that's fine (Pythonic as you said, i.e. what people expect), but I'd also be happy if we made it -1.

To be explicit about the context I have in mind:

    def gpu_pci_device_id(self) -> int:
        """int: The combined 16-bit PCI device ID and 16-bit PCI vendor ID.

        Returns -1 if the driver does not support this query.
        """

    def gpu_pci_subsystem_id(self) -> int:
        """int: The combined 16-bit PCI subsystem ID and 16-bit PCI subsystem vendor ID.

        Returns -1 if the driver does not support this query.
        """

Why not 0? — Because that's too easily interpreted as "success and the actual ID", and can then lead to invalid conclusions.

I'm assuming: Any actual IDs will be greater than zero. — Is that a valid assumption?

But I'd avoid the 0 anyway, to help humans seeing the numbers flying by while they are busy chasing other things, and probably not being aware of the subtlety that 0 is not a valid ID in these particular cases.

@Andy-Jost
Contributor Author

Andy-Jost commented Jan 8, 2026

The format is (device_id << 16) | vendor_id where the NVIDIA vendor ID is 0x10DE.

Typical NVIDIA PCI device IDs are non-zero values like:

  • GeForce RTX 4090: 0x2684
  • GeForce RTX 3080: 0x2206
  • Tesla V100: 0x1DB1
  • A100: 0x20B0

The combined value is an unsigned int with several bits set (never zero). I think either 0 or -1 would work, but I'd prefer 0 because the value is normally unsigned.

Aside from zero, users may see numbers such as the following in logs:

  • A100: 0x20B010DE = 548409566
  • -1 (32-bit): 0xFFFFFFFF = 4294967295
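
The packing described above can be checked with a few lines of Python. This is only an illustrative sketch: `pack_pci_id` and `unpack_pci_id` are hypothetical helper names, not part of cuda.core or the driver API.

```python
NVIDIA_VENDOR_ID = 0x10DE


def pack_pci_id(device_id, vendor_id=NVIDIA_VENDOR_ID):
    """Combine two 16-bit IDs into one 32-bit value."""
    return ((device_id & 0xFFFF) << 16) | (vendor_id & 0xFFFF)


def unpack_pci_id(combined):
    """Split a combined 32-bit value into (device_id, vendor_id)."""
    return (combined >> 16) & 0xFFFF, combined & 0xFFFF


a100 = pack_pci_id(0x20B0)
print(hex(a100), a100)      # 0x20b010de 548409566
print(unpack_pci_id(a100))  # (8368, 4318), i.e. (0x20B0, 0x10DE)
print(-1 & 0xFFFFFFFF)      # 4294967295, how -1 appears as an unsigned 32-bit value
```

Because the vendor half is always non-zero for a real device, a combined value of 0 cannot be mistaken for a valid ID.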

- Add except? -2 to _get_attribute and _get_cached_attribute for proper
  exception propagation (-2 never clashes with valid return values)
- Keep default parameter untyped to allow None, cast to int when used
- Simplify gpu_pci_device_id/gpu_pci_subsystem_id to return 0 when
  unsupported (0 is never a valid PCI ID)
@Andy-Jost Andy-Jost force-pushed the fix-attribute-handling branch from fdff347 to 0ebcce9 Compare January 8, 2026 17:03
@Andy-Jost
Contributor Author

/ok to test 0ebcce9

@Andy-Jost Andy-Jost merged commit 2ed3f98 into NVIDIA:main Jan 8, 2026
80 checks passed
@Andy-Jost Andy-Jost deleted the fix-attribute-handling branch January 8, 2026 17:46
Development

Successfully merging this pull request may close these issues.

Fix device attribute handling

2 participants