cuda.core.system: add conveniences to convert between device types #1508
Changes from all commits: 23aefa3, c995aec, 784ba3c, b4df1cd, 20875b3, a10f3bc
```diff
@@ -722,6 +722,30 @@ cdef class Device:
         pci_bus_id = pci_bus_id.decode("ascii")
         self._handle = nvml.device_get_handle_by_pci_bus_id_v2(pci_bus_id)
 
+    def to_cuda_device(self) -> "cuda.core.Device":
+        """
+        Get the corresponding :class:`cuda.core.Device` (which is used for CUDA
+        access) for this :class:`cuda.core.system.Device` (which is used for
+        NVIDIA Management Library (NVML) access).
+
+        The devices are mapped to one another by their UUID.
+
+        Returns
+        -------
+        cuda.core.Device
+            The corresponding CUDA device.
+        """
+        from cuda.core import Device as CudaDevice
+
+        # CUDA does not have an API to get a device by its UUID, so we just
+        # search all the devices for one with a matching UUID.
+
+        for cuda_device in CudaDevice.get_all_devices():
+            if cuda_device.uuid == self.uuid:
+                return cuda_device
+
+        raise RuntimeError("No corresponding CUDA device found for this NVML device.")
+
     @classmethod
     def get_device_count(cls) -> int:
         """
```

**Collaborator:** Consider: can we memoize this call so it only does the linear search once and caches the result?

**Contributor (author):** Probably in most cases that would be fine, but hot-plugging devices is possible, so the device with the same UUID can change device handles after a replug. Hot-plugging a device during a running process that cares about it probably has all kinds of other problems, but since we can't predict how the application works, it's probably better to put caching in the hands of the user of this API.

**Contributor:** More of a rant I guess than anything: I know this is pre-existing, but I really dislike that we raise `RuntimeError` here.

**Contributor (author):** What would be more appropriate here? A custom exception along the lines of
```diff
@@ -1036,8 +1060,16 @@ cdef class Device:
         Retrieves the globally unique immutable UUID associated with this
         device, as a 5 part hexadecimal string, that augments the immutable,
         board serial identifier.
+
+        In the upstream NVML C++ API, the UUID includes a ``gpu-`` or ``mig-``
+        prefix. That is not included in ``cuda.core.system``.
         """
-        return nvml.device_get_uuid(self._handle)
+        # NVML UUIDs have a `GPU-` or `MIG-` prefix. We remove that here.
+
+        # TODO: If the user cares about the prefix, we will expose that in the
+        # future using the MIG-related APIs in NVML.
+
+        return nvml.device_get_uuid(self._handle)[4:]
 
     def register_events(self, events: EventType | int | list[EventType | int]) -> DeviceEvents:
         """
```

Comment on lines +1069 to +1072:

**Member:** I am thinking the system device should return the full UUID and we document the different expectations between cuda.core and cuda.core.system (or CUDA vs NVML). @mdboom thoughts?

**Contributor (author):** I totally can see it both ways. My original implementation did what you suggested (following upstream NVML behavior). But @cpcloud convinced me this is weird -- UUID has a well-defined meaning in our field that NVML deviates from. I don't feel super strongly either way -- we just need to break the tie ;)

**Member:** IIUC, NVML is the only way for us to tell, from inside a running process, whether we are using MIG instances or otherwise (bare-metal GPU, MPS, etc.). CUDA purposely hides MIG from end users. So my thinking is that if we don't follow NVML, there is no other way for Python users to query. Could you check if my impression is correct?

**Contributor (author):** There is another API,
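The `[4:]` slice in the diff works because both NVML prefixes, `GPU-` and `MIG-`, are exactly four characters long. A quick illustration with made-up UUID strings (the values below are placeholders, not real device UUIDs):

```python
# Mirror the diff's behavior: drop the 4-character "GPU-"/"MIG-" prefix that
# NVML prepends, returning the bare UUID string.
def strip_nvml_uuid_prefix(nvml_uuid: str) -> str:
    return nvml_uuid[4:]


gpu_uuid = "GPU-11111111-2222-3333-4444-555555555555"  # placeholder value
mig_uuid = "MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # placeholder value

print(strip_nvml_uuid_prefix(gpu_uuid))  # 11111111-2222-3333-4444-555555555555
print(strip_nvml_uuid_prefix(mig_uuid))  # aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```

Note that stripping the prefix is what makes the UUID comparable with `cuda.core` device UUIDs in `to_cuda_device()`, at the cost of discarding the GPU-vs-MIG distinction the reviewers debate above.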