supports_fp8_compute() and supports_nvfp4_compute() used the global
is_nvidia() check, which ignores the device argument, and then
defaulted to cuda:0 when device was None. In heterogeneous multi-GPU
setups (e.g. RTX 5070 + RTX 3090 Ti) this caused the wrong GPU's
compute capability to be checked, incorrectly disabling fp8 on
capable devices.
Replace the global is_nvidia() gate with per-device checks:
- Default device=None to get_torch_device() explicitly
- Early-return False for CPU/MPS devices
- Use is_device_cuda(device) + torch.version.cuda instead of
the global is_nvidia()
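
A minimal sketch of the per-device gating described above. Helper
names mirror the ones listed, but the Device stand-in, the capability
table, and the 8.9 threshold are illustrative assumptions; the real
code would consult torch.cuda.get_device_capability(device):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    type: str       # "cuda", "cpu", "mps", ...
    index: int = 0

def get_torch_device():
    # Stand-in for the framework's get_torch_device()
    return Device("cuda", 0)

def is_device_cuda(device):
    return device.type == "cuda"

# Illustrative device-index -> (major, minor) compute capability map,
# mimicking torch.cuda.get_device_capability(device) on a mixed rig
CAPABILITIES = {0: (12, 0), 1: (8, 6)}  # e.g. RTX 5070 + RTX 3090 Ti

def supports_fp8_compute(device=None):
    # Default explicitly rather than silently probing cuda:0
    if device is None:
        device = get_torch_device()
    # Early-return False for non-CUDA devices (CPU/MPS)
    if not is_device_cuda(device):
        return False
    # Check the capability of *this* device, not the global default
    major, minor = CAPABILITIES[device.index]
    return (major, minor) >= (8, 9)  # assumed fp8 threshold (Ada+)

print(supports_fp8_compute(Device("cuda", 0)))  # capable device
print(supports_fp8_compute(Device("cuda", 1)))  # 8.6 lacks fp8
print(supports_fp8_compute(Device("cpu")))      # non-CUDA: False
```

With the old global gate, the cuda:1 and cpu cases would have been
answered using cuda:0's capability instead of their own.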
Fixes #4589, relates to #4577, #12405
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>