Compare commits

...

2 Commits

Author SHA1 Message Date
comfyanonymous
0e6221cc79
Add some warnings for pin and unpin errors. (#11561)
Some checks are pending
Python Linting / Run Ruff (push) Waiting to run
Python Linting / Run Pylint (push) Waiting to run
Full Comfy CI Workflow Runs / test-stable (12.1, , linux, 3.10, [self-hosted Linux], stable) (push) Waiting to run
Full Comfy CI Workflow Runs / test-stable (12.1, , linux, 3.11, [self-hosted Linux], stable) (push) Waiting to run
Full Comfy CI Workflow Runs / test-stable (12.1, , linux, 3.12, [self-hosted Linux], stable) (push) Waiting to run
Full Comfy CI Workflow Runs / test-unix-nightly (12.1, , linux, 3.11, [self-hosted Linux], nightly) (push) Waiting to run
Execution Tests / test (macos-latest) (push) Waiting to run
Execution Tests / test (ubuntu-latest) (push) Waiting to run
Execution Tests / test (windows-latest) (push) Waiting to run
Test server launches without errors / test (push) Waiting to run
Unit Tests / test (macos-latest) (push) Waiting to run
Unit Tests / test (ubuntu-latest) (push) Waiting to run
Unit Tests / test (windows-2022) (push) Waiting to run
2025-12-29 18:26:42 -05:00
rattus
9ca7e143af
mm: discard async errors from pinning failures (#10738)
Pretty much every error cudaHostRegister can throw also queues the same
error on the async GPU queue. This was fixed for repinning error case,
but there is the bad mmap and just enomem cases that are harder to
detect.

Do some dummy GPU work to clean the error state.
2025-12-29 18:19:34 -05:00

View File

@ -1126,6 +1126,16 @@ if not args.disable_pinned_memory:
PINNING_ALLOWED_TYPES = set(["Parameter", "QuantizedTensor"])
def discard_cuda_async_error():
try:
a = torch.tensor([1], dtype=torch.uint8, device=get_torch_device())
b = torch.tensor([1], dtype=torch.uint8, device=get_torch_device())
_ = a + b
torch.cuda.synchronize()
except torch.AcceleratorError:
#Dump it! We already know about it from the synchronous return
pass
def pin_memory(tensor):
global TOTAL_PINNED_MEMORY
if MAX_PINNED_MEMORY <= 0:
@ -1158,6 +1168,9 @@ def pin_memory(tensor):
PINNED_MEMORY[ptr] = size
TOTAL_PINNED_MEMORY += size
return True
else:
logging.warning("Pin error.")
discard_cuda_async_error()
return False
@ -1186,6 +1199,9 @@ def unpin_memory(tensor):
if len(PINNED_MEMORY) == 0:
TOTAL_PINNED_MEMORY = 0
return True
else:
logging.warning("Unpin error.")
discard_cuda_async_error()
return False