Fix OOM regression in _apply() for quantized models during inference

Skip unnecessary clone of inference-mode tensors when already inside torch.inference_mode(), matching the existing guard in set_attr_param. The unconditional clone introduced in 20561aa9 caused transient VRAM doubling during model movement for FP8/quantized models.
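
A minimal sketch of the guarded condition on CPU tensors, assuming only the public PyTorch calls `torch.inference_mode()`, `torch.is_inference_mode_enabled()`, `Tensor.is_inference()` and `Tensor.clone()`. The helper name `maybe_clone` is hypothetical and is not part of comfy/ops.py. It only illustrates when the clone is skipped and when it still happens.

```python
import torch


def maybe_clone(p: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper mirroring the condition added in this commit."""
    # Clone only when we are NOT inside inference mode and the tensor
    # was allocated as an inference tensor.
    if (not torch.is_inference_mode_enabled()) and p.is_inference():
        return p.clone()
    return p


with torch.inference_mode():
    t = torch.ones(4)        # allocated in inference mode: an inference tensor
    same = maybe_clone(t)    # guard is False here, so no second copy is made

# Outside inference mode the guard is True, so the tensor is cloned.
copied = maybe_clone(t)

print("skipped clone inside inference mode:", same is t)
print("clone outside is an inference tensor:", copied.is_inference())
print("clone has its own storage:", copied.data_ptr() != t.data_ptr())
```

With the old unconditional check, the call inside `torch.inference_mode()` would also have allocated a second copy of the tensor, which is the transient doubling the message describes.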
2026-05-05 23:02:49 +08:00 · 2026-04-12 09:14:06 +00:00 · b7872e24f4
commit b7872e24f4
parent 31283d2892
1 changed file with 1 addition and 1 deletion
--- a/comfy/ops.py
+++ b/comfy/ops.py
@@ -1151,7 +1151,7 @@ def mixed_precision_ops(quant_config={}, compute_dtype=torch.bfloat16, full_prec
                    if param is None:
                        continue
                    p = fn(param)
-                    if p.is_inference():
+                    if (not torch.is_inference_mode_enabled()) and p.is_inference():
                        p = p.clone()
                    self.register_parameter(key, torch.nn.Parameter(p, requires_grad=False))
                for key, buf in self._buffers.items():