Commit Graph

427 Commits

Author SHA1 Message Date
patientx
3662d0a2ce
Merge branch 'comfyanonymous:master' into master 2025-11-10 14:05:53 +03:00
comfyanonymous
dea899f221
Unload weights if vram usage goes up between runs. (#10690) 2025-11-09 18:51:33 -05:00
patientx
8e02689534
Merge branch 'comfyanonymous:master' into master 2025-11-07 20:30:21 +03:00
comfyanonymous
a1a70362ca
Only unpin tensor if it was pinned by ComfyUI (#10677) 2025-11-07 11:15:05 -05:00
patientx
d29dbbd829
Merge branch 'comfyanonymous:master' into master 2025-11-07 14:27:13 +03:00
rattus
cf97b033ee
mm: guard against double pin and unpin explicitly (#10672)
As commented, if you let cuda be the one to detect double pinning/unpinning,
it actually creates an async GPU error.
2025-11-06 21:20:48 -05:00
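A minimal sketch of the guard described above (illustrative only, not ComfyUI's actual model_management code): track which tensors this process has pinned and make pin/unpin idempotent, instead of letting CUDA detect the double call. The do_pin/do_unpin callables are assumed stand-ins for the real host-register calls.

import torch

_pinned_ptrs: set[int] = set()    # data_ptr() of tensors we pinned ourselves

def pin_once(t: torch.Tensor, do_pin) -> None:
    ptr = t.data_ptr()
    if ptr in _pinned_ptrs:       # already pinned by us: return instead of
        return                    # letting CUDA raise an async error later
    do_pin(t)
    _pinned_ptrs.add(ptr)

def unpin_once(t: torch.Tensor, do_unpin) -> None:
    ptr = t.data_ptr()
    if ptr not in _pinned_ptrs:   # only unpin what we pinned (cf. #10677 above)
        return
    do_unpin(t)
    _pinned_ptrs.discard(ptr)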
patientx
3ab45ae725
Merge branch 'comfyanonymous:master' into master 2025-11-06 15:35:41 +03:00
comfyanonymous
09dc24c8a9
Pinned mem also seems to work on AMD. (#10658) 2025-11-05 19:11:15 -05:00
comfyanonymous
1d69245981
Enable pinned memory by default on Nvidia. (#10656)
Removed the --fast pinned_memory flag.

You can use --disable-pinned-memory to disable it. Please report if it
causes any issues.
2025-11-05 18:08:13 -05:00
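The change amounts to a default-on feature with a disable switch. A hypothetical argparse wiring (only the --disable-pinned-memory flag name comes from the commit; the rest is illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--disable-pinned-memory", action="store_true",
                    help="Disable pinned memory for weight offloading.")
args = parser.parse_args()

use_pinned_memory = not args.disable_pinned_memory   # enabled unless disabled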
patientx
84faf45f09
Merge branch 'comfyanonymous:master' into master 2025-11-05 13:07:02 +03:00
comfyanonymous
7f3e4d486c
Limit amount of pinned memory on windows to prevent issues. (#10638) 2025-11-04 17:37:50 -05:00
patientx
7907b8d6be
Merge branch 'comfyanonymous:master' into master 2025-10-30 03:16:55 +03:00
rattus
ab7ab5be23
Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is necessary as pytorch will queue cuda async frees on the
same stream that created the tensor. In the case of async offload, this
will be on the offload stream.

Weights and biases can go out of scope in python, which then
triggers the pytorch garbage collector to queue the free operation on
the offload stream, possibly before the compute stream has used the
weight. This causes a use-after-free on the weight data, leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight
has been used so the free has to wait for the weight to be used.

cast_bias_weight is extended in a backwards-compatible way, with
the new behaviour opt-in via a defaulted parameter. This handles
custom node packs that call cast_bias_weight by disabling
async-offload for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(... , offloadable=True) #This might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is necessary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesn't allow much
parallelization, as the end result is that you can only get one weight load
ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute stream's position.
2025-10-29 17:17:46 -04:00
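The synchronization described in this commit can be sketched with plain PyTorch streams (names and shapes are illustrative; this is not the actual cast_bias_weight/uncast_bias_weight implementation). It assumes a CUDA device.

import torch

weight_cpu = torch.randn(1024, 1024, pin_memory=True)   # stand-in for a layer weight
x = torch.randn(8, 1024, device="cuda")

compute_stream = torch.cuda.current_stream()
offload_stream = torch.cuda.Stream()

with torch.cuda.stream(offload_stream):
    # The weight is uploaded on the offload stream, so PyTorch will also queue
    # its eventual async free on that same stream.
    weight_gpu = weight_cpu.to("cuda", non_blocking=True)

# Compute must wait for the upload before consuming the weight ("cast") ...
compute_stream.wait_stream(offload_stream)
out = torch.nn.functional.linear(x, weight_gpu)          # "thing(weight, bias, ...)"

# ... and the offload stream must wait for the compute stream before the weight
# can be freed ("uncast"), otherwise the queued free races with the matmul.
offload_stream.wait_stream(compute_stream)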
patientx
d8528ac31e
Merge branch 'comfyanonymous:master' into master 2025-10-29 12:42:07 +03:00
comfyanonymous
3fa7a5c04a
Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
2025-10-29 00:21:01 -04:00
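A rough illustration of why pinned (page-locked) host memory speeds up offloading; this is not ComfyUI's code, just the underlying PyTorch behaviour on a CUDA device:

import torch

gpu_t = torch.randn(4096, 4096, device="cuda")

pageable = torch.empty(4096, 4096)                  # ordinary host memory
pinned = torch.empty(4096, 4096, pin_memory=True)   # page-locked host memory

pageable.copy_(gpu_t)                    # blocks; the driver stages the copy
pinned.copy_(gpu_t, non_blocking=True)   # async DMA, overlaps with other GPU work
torch.cuda.synchronize()                 # wait before touching the pinned copy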
patientx
8590e1f713
Merge branch 'comfyanonymous:master' into master 2025-10-26 14:29:29 +03:00
comfyanonymous
098a352f13
Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
comfyanonymous
426cde37f1
Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
patientx
d4bcb93575
Merge branch 'comfyanonymous:master' into master 2025-10-22 11:34:33 +03:00
comfyanonymous
9cdc64998f
Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
patientx
5bf1c8be44
Merge branch 'comfyanonymous:master' into master 2025-10-21 03:49:14 +03:00
comfyanonymous
2c2aa409b0
Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
patientx
657a7872ab
Merge branch 'comfyanonymous:master' into master 2025-10-19 15:20:17 +03:00
comfyanonymous
5b80addafd
Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
patientx
26589a3a0b
Merge branch 'comfyanonymous:master' into master 2025-10-15 12:18:21 +03:00
comfyanonymous
1c10b33f9b
gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
comfyanonymous
c8674bc6e9
Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332) 2025-10-13 21:19:03 -04:00
patientx
fa7942933b
Merge branch 'comfyanonymous:master' into master 2025-10-12 13:56:39 +03:00
comfyanonymous
a125cd84b0
Improve AMD performance. (#10302)
I honestly have no idea why this improves things but it does.
2025-10-12 00:28:01 -04:00
patientx
258da26c98
Merge branch 'comfyanonymous:master' into master 2025-09-25 15:08:16 +03:00
Guy Niv
c8d2117f02
Fix memory leak by properly detaching model finalizer (#9979)
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
memory consumption to increase linearly over time as models were repeatedly
loaded and unloaded.

This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
2025-09-24 22:35:12 -04:00
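A minimal illustration of the leak pattern (assumed, not the actual load_models_gpu code): weakref.finalize objects stay registered until they fire or are detached, so deliberate unloads should detach them explicitly.

import weakref

class Model:
    pass

def on_unload(name):
    print(f"cleanup for {name}")

model = Model()
finalizer = weakref.finalize(model, on_unload, "model-a")

# When unloading the model deliberately, detach its finalizer as well; otherwise
# every load/unload cycle leaves another live finalizer (plus whatever its
# callback captures) behind, and memory grows over time.
finalizer.detach()
del model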
patientx
c62e820d45
Merge branch 'comfyanonymous:master' into master 2025-09-20 01:51:06 +03:00
DELUXA
8d6653fca6
Enable fp8 ops by default on gfx1200 (#9926) 2025-09-18 19:50:37 -04:00
patientx
b46622ffa5
Merge branch 'comfyanonymous:master' into master 2025-09-08 11:14:04 +03:00
comfyanonymous
fb763d4333
Fix amd_min_version crash when cpu device. (#9754) 2025-09-07 21:16:29 -04:00
patientx
9417753a6c
Merge branch 'comfyanonymous:master' into master 2025-09-07 13:16:57 +03:00
comfyanonymous
bcbd7884e3
Don't enable pytorch attention on AMD if triton isn't available. (#9747) 2025-09-07 00:29:38 -04:00
comfyanonymous
27a0fcccc3
Enable bf16 VAE on RDNA4. (#9746) 2025-09-06 23:25:22 -04:00
patientx
7ff01ded58
Merge branch 'comfyanonymous:master' into master 2025-08-21 09:24:26 +03:00
comfyanonymous
0963493a9c
Support for Qwen Diffsynth Controlnets canny and depth. (#9465)
These are not real controlnets but actually a patch on the model, so they
will be treated as such.

Put them in the models/model_patches/ folder.

Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.
2025-08-20 22:26:37 -04:00
patientx
a927fbd99b
Merge branch 'comfyanonymous:master' into master 2025-08-14 12:16:50 +03:00
Simon Lui
c991a5da65
Fix XPU iGPU regressions (#9322)
* Change the bf16 check, switch non-blocking off by default with an option to force it on to regain speed on certain classes of iGPUs, and refactor the xpu check.

* Turn non_blocking off by default for xpu.

* Update README.md for Intel GPUs.
2025-08-13 19:13:35 -04:00
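A hedged sketch of the non-blocking change (function names are illustrative, not the actual ComfyUI API): host-to-device copies only pass non_blocking=True when the backend is known to handle it well, with an option to force it back on.

import torch

def use_non_blocking(device: torch.device, force: bool = False) -> bool:
    if force:
        return True               # opt back in to regain speed where it is safe
    if device.type == "xpu":
        return False              # off by default on XPU per the commit above
    return device.type == "cuda"

def to_device(t: torch.Tensor, device: torch.device, force: bool = False) -> torch.Tensor:
    return t.to(device, non_blocking=use_non_blocking(device, force))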
patientx
c2686a3968
Merge branch 'comfyanonymous:master' into master 2025-08-10 12:09:19 +03:00
comfyanonymous
5828607ccf
Not sure if AMD actually supports fp16 acc but it doesn't crash. (#9258) 2025-08-09 12:49:25 -04:00
patientx
89499c6fae
Merge branch 'comfyanonymous:master' into master 2025-08-08 11:40:07 +03:00
comfyanonymous
735bb4bdb1
Users report gfx1201 is buggy on flux with pytorch attention. (#9244) 2025-08-08 04:21:00 -04:00
patientx
d8ca8134c3
Merge branch 'comfyanonymous:master' into master 2025-07-29 11:56:59 +03:00
comfyanonymous
7d593baf91
Extra reserved vram on large cards on windows. (#9093) 2025-07-29 04:07:45 -04:00
patientx
970b7fb84f
Merge branch 'comfyanonymous:master' into master 2025-07-24 22:30:55 +03:00
comfyanonymous
69cb57b342
Print xpu device name. (#9035) 2025-07-24 15:06:25 -04:00