Commit Graph

442 Commits

Author SHA1 Message Date
patientx
6420b47885
Merge branch 'comfyanonymous:master' into master 2025-12-07 14:10:24 +03:00
comfyanonymous
50ca97e776
Speed up lora compute and lower memory usage by doing it in fp16. (#11161) 2025-12-06 18:36:20 -05:00
patientx
f073f115e0
Merge branch 'comfyanonymous:master' into master 2025-11-29 18:19:41 +03:00
rattus
0ff0457892
mm: wrap the raw stream in context manager (#10958)
The documentation for torch.foo.Stream suggests that using it directly as a
with: context manager only starts at version 2.7. Use the old API for
backwards compatibility.
2025-11-28 16:38:12 -05:00
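As a rough illustration of the backwards-compatible route described in the commit above (helper name and structure are my own, not the actual model management code), the raw stream can be wrapped in the long-standing torch.cuda.stream() context manager instead of being used directly as a with: target:

import torch

def run_on_offload_stream(fn, *args):
    # Hypothetical helper: instead of "with stream:" (only documented from
    # torch 2.7 onwards), wrap the raw stream in torch.cuda.stream(), which
    # has been available for many releases.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        out = fn(*args)
    # make the default stream wait for the work queued on the side stream
    torch.cuda.current_stream().wait_stream(stream)
    return out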
comfyanonymous
f55c98a89f
Disable offload stream when torch compile. (#10961) 2025-11-28 16:16:46 -05:00
patientx
4094cbf867
Merge branch 'comfyanonymous:master' into master 2025-11-28 13:43:35 +03:00
comfyanonymous
9d8a817985
Enable async offloading by default on Nvidia. (#10953)
Add --disable-async-offload to disable it.

If this causes OOMs that go away when you use --disable-async-offload, please
report it.
2025-11-27 17:46:12 -05:00
patientx
da822b5057
Merge branch 'comfyanonymous:master' into master 2025-11-27 14:43:23 +03:00
rattus
f17251bec6
Account for the VRAM cost of weight offloading (#10733)
* mm: default to 0 for NUM_STREAMS

Don't count the compute stream as an offload stream. This makes async
offload accounting easier.

* mm: remove 128MB minimum

This is from a previous offloading system requirement. Remove it to make the
behaviour of the loader and the partial unloader consistent.

* mp: order the module list by offload expense

Calculate an approximate temporary VRAM cost to offload a weight and
primarily order the module load list by that. In the simple case this is just
the weight's own size, but with Loras, a weight with a lora consumes
considerably more VRAM to do the Lora application on-the-fly.

This will slightly prioritize lora weights, but is really there for proper
VRAM offload accounting.

* mp: Account for the VRAM cost of weight offloading

When checking the VRAM headroom, assume that the weight needs to be
offloaded, and only load it if there is space for both the load and the
offload cost times the number of streams.

As the weights are ordered from largest to smallest by offload cost, this is
guaranteed to fit in VRAM (tm), as all weights that follow will be smaller.

Make the partial unload aware of this system as well by saving the budget for
offload VRAM to the model state and accounting accordingly. It's possible
that partial unload increases the size of the largest offloaded weights, and
thus needs to unload a little bit more than asked to accommodate the bigger
temp buffers.

Honor the existing code's floor on model weight loading of 128MB by having
the patcher honor this separately, without regard to offloading. Otherwise,
when MM specifies its 128MB minimum, MP will see the biggest weights and
budget that 128MB to the offload buffer alone and load nothing, which isn't
the intent of these minimums. The same clamp applies in case of partial
offload of the currently loading model.
2025-11-27 01:03:03 -05:00
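As a hedged sketch of the accounting described in the commit above (every name below is hypothetical, not the real model management code), a weight only gets loaded when the free VRAM covers both the weight itself and one offload temp buffer per stream, with modules ordered by their estimated offload cost:

def estimate_offload_cost(module, has_lora):
    # Rough estimate of the temporary VRAM needed to offload one weight.
    # A plain weight roughly costs its own size; a weight with a lora needs
    # extra room to apply the lora on the fly (the factor is illustrative).
    size = sum(p.numel() * p.element_size() for p in module.parameters())
    return size * 3 if has_lora else size

def can_load(weight_size, offload_cost, free_vram, num_streams):
    # Only load if there is room for the weight itself plus one offload
    # temp buffer per stream, as the commit above describes.
    return weight_size + offload_cost * num_streams <= free_vram

Ordering the module list from largest to smallest estimated offload cost is what makes a check like this sufficient: every weight that follows needs at most the offload space already budgeted.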
patientx
de43f7b30d
Merge branch 'comfyanonymous:master' into master 2025-11-25 14:05:26 +03:00
comfyanonymous
b6805429b9
Allow pinning quantized tensors. (#10873) 2025-11-25 02:48:20 -05:00
patientx
00609a5102
Merge branch 'comfyanonymous:master' into master 2025-11-13 00:55:19 +03:00
rattus
18e7d6dba5
mm/mp: always unload re-used but modified models (#10724)
The partial unloader path in the model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that if you
do an upscale flow with a model patch on an existing model, it will not apply
your patches.

Fix by delaying the partial_unload until after the uuid checks. This is done
by making partial_unload a mode of partial_load where extra_mem is negative.
2025-11-12 16:19:53 -05:00
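The following is a purely illustrative sketch of that idea (none of these names are the real ComfyUI functions): expressing partial_unload as a partial_load with negative extra_mem means the patch/uuid check always runs before any unload decision.

def partial_load(model, extra_mem):
    # Hypothetical sketch: negative extra_mem means "free this much VRAM",
    # i.e. a partial unload routed through the same entry point.
    if model.patch_uuid != model.applied_uuid:
        model.apply_patches()            # re-patch before any unload decision
    if extra_mem < 0:
        model.unload_weights(-extra_mem)
    else:
        model.load_weights(extra_mem)

def partial_unload(model, mem_to_free):
    # a partial unload is just a partial_load with negative extra_mem
    partial_load(model, -mem_to_free)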
patientx
644778be49
Merge branch 'comfyanonymous:master' into master 2025-11-12 17:55:30 +03:00
comfyanonymous
1199411747
Don't pin tensor if not a torch.nn.parameter.Parameter (#10718) 2025-11-11 19:33:30 -05:00
patientx
3662d0a2ce
Merge branch 'comfyanonymous:master' into master 2025-11-10 14:05:53 +03:00
comfyanonymous
dea899f221
Unload weights if vram usage goes up between runs. (#10690) 2025-11-09 18:51:33 -05:00
patientx
8e02689534
Merge branch 'comfyanonymous:master' into master 2025-11-07 20:30:21 +03:00
comfyanonymous
a1a70362ca
Only unpin tensor if it was pinned by ComfyUI (#10677) 2025-11-07 11:15:05 -05:00
patientx
d29dbbd829
Merge branch 'comfyanonymous:master' into master 2025-11-07 14:27:13 +03:00
rattus
cf97b033ee
mm: guard against double pin and unpin explicitly (#10672)
As commented, if you let cuda be the one to detect the double pin/unpin, it
actually creates an async GPU error.
2025-11-06 21:20:48 -05:00
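A minimal sketch of what guarding explicitly can look like, assuming a simple set of data pointers for bookkeeping; do_pin and do_unpin stand in for whatever actually registers and unregisters the host memory and are hypothetical:

_pinned_ptrs = set()  # data pointers we pinned ourselves (illustrative bookkeeping)

def safe_pin(tensor, do_pin):
    # Check our own bookkeeping first: letting CUDA detect a double pin
    # surfaces later as an async GPU error.
    ptr = tensor.data_ptr()
    if ptr in _pinned_ptrs:
        return False
    do_pin(tensor)
    _pinned_ptrs.add(ptr)
    return True

def safe_unpin(tensor, do_unpin):
    ptr = tensor.data_ptr()
    if ptr not in _pinned_ptrs:
        return False  # never unpin something we did not pin ourselves
    do_unpin(tensor)
    _pinned_ptrs.discard(ptr)
    return True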
patientx
3ab45ae725
Merge branch 'comfyanonymous:master' into master 2025-11-06 15:35:41 +03:00
comfyanonymous
09dc24c8a9
Pinned mem also seems to work on AMD. (#10658) 2025-11-05 19:11:15 -05:00
comfyanonymous
1d69245981
Enable pinned memory by default on Nvidia. (#10656)
Removed the --fast pinned_memory flag.

You can use --disable-pinned-memory to disable it. Please report if it
causes any issues.
2025-11-05 18:08:13 -05:00
patientx
84faf45f09
Merge branch 'comfyanonymous:master' into master 2025-11-05 13:07:02 +03:00
comfyanonymous
7f3e4d486c
Limit amount of pinned memory on windows to prevent issues. (#10638) 2025-11-04 17:37:50 -05:00
patientx
7907b8d6be
Merge branch 'comfyanonymous:master' into master 2025-10-30 03:16:55 +03:00
rattus
ab7ab5be23
Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is necessary as pytorch will queue cuda async frees on the same
stream that created the tensor. In the case of async offload, this will be on
the offload stream.

Weights and biases can go out of scope in python, which then triggers the
pytorch garbage collector to queue the free operation on the offload stream,
possibly before the compute stream has used the weight. This causes a
use-after-free on the weight data, leading to total corruption of some
workflows.

So sync the offload stream with the compute stream after the weight
has been used so the free has to wait for the weight to be used.

cast_bias_weight is extended in a backwards-compatible way, with the new
behaviour opted into via a defaulted parameter. This handles custom node
packs that call cast_bias_weight and disables async-offload for them (as they
do not handle the race).

The pattern is now:

cast_bias_weight(... , offloadable=True) #This might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is necessary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of streams
with the compute stream. This doesn't allow a lot of parallelization, as the
end result is that you can only get one weight load ahead regardless of how
many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute stream's position.
2025-10-29 17:17:46 -04:00
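The sketch below is a hedged, self-contained illustration of the synchronization pattern named above; the real cast_bias_weight/uncast_bias_weight are ComfyUI ops helpers, and this body is a stand-in, not the actual implementation:

import torch

def use_offloaded_weight(layer, x, compute_stream, offload_stream):
    # Cast on the offload stream, consume on the compute stream, then make
    # the offload stream wait so the async frees pytorch queues there cannot
    # race the consumer.
    with torch.cuda.stream(offload_stream):
        weight = layer.weight.to("cuda", non_blocking=True)   # "cast_bias_weight"
        bias = layer.bias.to("cuda", non_blocking=True) if layer.bias is not None else None

    compute_stream.wait_stream(offload_stream)  # weight must be ready before use
    with torch.cuda.stream(compute_stream):
        out = torch.nn.functional.linear(x, weight, bias)      # "thing(weight, bias, ...)"

    # "uncast_bias_weight": the offload stream waits on the compute stream,
    # so freeing the casted weight cannot become a use-after-free.
    offload_stream.wait_stream(compute_stream)
    return out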
patientx
d8528ac31e
Merge branch 'comfyanonymous:master' into master 2025-10-29 12:42:07 +03:00
comfyanonymous
3fa7a5c04a
Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
2025-10-29 00:21:01 -04:00
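For context, a minimal sketch of why pinned memory speeds this up (illustrative helpers, not the actual ComfyUI code): copies between page-locked host memory and the GPU can be issued with non_blocking=True and genuinely overlap with compute.

import torch

def stash_to_host(weight):
    # Allocate a pinned (page-locked) host buffer and stash the weight there.
    host = torch.empty_like(weight, device="cpu").pin_memory()
    host.copy_(weight)
    return host

def load_back_async(host_buffer, offload_stream):
    with torch.cuda.stream(offload_stream):
        # non_blocking copies only truly overlap when the source is pinned
        return host_buffer.to("cuda", non_blocking=True)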
patientx
8590e1f713
Merge branch 'comfyanonymous:master' into master 2025-10-26 14:29:29 +03:00
comfyanonymous
098a352f13
Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
comfyanonymous
426cde37f1
Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
patientx
d4bcb93575
Merge branch 'comfyanonymous:master' into master 2025-10-22 11:34:33 +03:00
comfyanonymous
9cdc64998f
Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
patientx
5bf1c8be44
Merge branch 'comfyanonymous:master' into master 2025-10-21 03:49:14 +03:00
comfyanonymous
2c2aa409b0
Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
patientx
657a7872ab
Merge branch 'comfyanonymous:master' into master 2025-10-19 15:20:17 +03:00
comfyanonymous
5b80addafd
Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
patientx
26589a3a0b
Merge branch 'comfyanonymous:master' into master 2025-10-15 12:18:21 +03:00
comfyanonymous
1c10b33f9b
gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
comfyanonymous
c8674bc6e9
Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332) 2025-10-13 21:19:03 -04:00
patientx
fa7942933b
Merge branch 'comfyanonymous:master' into master 2025-10-12 13:56:39 +03:00
comfyanonymous
a125cd84b0
Improve AMD performance. (#10302)
I honestly have no idea why this improves things but it does.
2025-10-12 00:28:01 -04:00
patientx
258da26c98
Merge branch 'comfyanonymous:master' into master 2025-09-25 15:08:16 +03:00
Guy Niv
c8d2117f02
Fix memory leak by properly detaching model finalizer (#9979)
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused a linear
increase in memory consumption over time as models were repeatedly loaded
and unloaded.

This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
2025-09-24 22:35:12 -04:00
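As a hedged illustration of the fix described above (class and method names are invented for the example; the real code lives in ComfyUI's model management), weakref.finalize objects expose detach(), which is the standard way to drop a finalizer so explicit unloads don't leave orphaned references behind:

import weakref

class LoadedModel:
    def __init__(self, model):
        self.model = model
        # run cleanup when the loaded model is garbage collected
        self._finalizer = weakref.finalize(self, LoadedModel._cleanup, id(model))

    @staticmethod
    def _cleanup(model_id):
        print(f"releasing resources for model {model_id}")

    def unload(self):
        # detach the finalizer on explicit unload; otherwise every
        # load/unload cycle leaves an orphaned finalizer reference behind
        self._finalizer.detach()
        self.model = None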
patientx
c62e820d45
Merge branch 'comfyanonymous:master' into master 2025-09-20 01:51:06 +03:00
DELUXA
8d6653fca6
Enable fp8 ops by default on gfx1200 (#9926) 2025-09-18 19:50:37 -04:00
patientx
b46622ffa5
Merge branch 'comfyanonymous:master' into master 2025-09-08 11:14:04 +03:00
comfyanonymous
fb763d4333
Fix amd_min_version crash when cpu device. (#9754) 2025-09-07 21:16:29 -04:00