Commit Graph

310 Commits

Author SHA1 Message Date
Sasbom
0ef5557d6a Add QOL feature for changing the custom nodes folder location through cli args.
bugfix: fix typo in apply_directory for custom_nodes_directory

allow for PATH-style ';'-delimited custom_node directories.

change delimiter type for separate folders per platform.
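
A minimal sketch of what the platform-specific splitting could look like; parse_custom_node_dirs and the example flag value are illustrative assumptions, not the actual ComfyUI code:

    import os

    def parse_custom_node_dirs(arg_value: str) -> list[str]:
        # os.pathsep is ';' on Windows and ':' elsewhere, matching PATH semantics.
        return [p for p in arg_value.split(os.pathsep) if p.strip()]

    print(parse_custom_node_dirs("nodes_a" + os.pathsep + "nodes_b"))  # ['nodes_a', 'nodes_b']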

feat(API-nodes): move Rodin3D nodes to new client; removed old api client.py (#10645)

Fix qwen controlnet regression. (#10657)

Enable pinned memory by default on Nvidia. (#10656)

Removed the --fast pinned_memory flag.

You can use --disable-pinned-memory to disable it. Please report if it
causes any issues.

Pinned mem also seems to work on AMD. (#10658)

Remove environment variable.

Removed environment variable fallback for custom nodes directory.

Update documentation for custom nodes directory

Clarified documentation on the custom nodes directory argument; removed documentation on the environment variable.

Clarify release cycle. (#10667)

Tell users they need to upload their logs in bug reports. (#10671)

mm: guard against double pin and unpin explicitly (#10672)

As commented, if you let CUDA be the one to detect a double pin/unpin
it actually creates an async GPU error.

Only unpin tensor if it was pinned by ComfyUI (#10677)
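
A hedged sketch of the kind of explicit bookkeeping these two commits describe; the helper names and the set-based tracking are illustrative, not ComfyUI's model management code:

    import torch

    _pinned_by_us: set = set()  # ids of tensors this code pinned itself

    def pin_if_needed(t: torch.Tensor) -> torch.Tensor:
        # Guard explicitly instead of letting CUDA report a double pin as an async error.
        if not torch.cuda.is_available() or t.is_pinned():
            return t
        pinned = t.pin_memory()  # copy backed by page-locked host memory
        _pinned_by_us.add(id(pinned))
        return pinned

    def release_if_ours(t: torch.Tensor) -> None:
        # Only drop bookkeeping for tensors we pinned; never touch tensors pinned elsewhere.
        _pinned_by_us.discard(id(t))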

Make ScaleROPE node work on Flux. (#10686)

Add logging for model unloading. (#10692)

Unload weights if vram usage goes up between runs. (#10690)

ops: Put weight cast on the offload stream (#10697)

This needs to be on the offload stream. The bug reproduced as a black screen
with low-resolution images on a slow bus when using FP8.
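
A minimal sketch of performing the cast on a separate offload stream and ordering it before first use; this is illustrative, not the actual ops change:

    import torch

    def cast_on_offload_stream(weight, dtype, offload_stream, compute_stream):
        with torch.cuda.stream(offload_stream):
            casted = weight.to(device="cuda", dtype=dtype, non_blocking=True)
        compute_stream.wait_stream(offload_stream)  # cast must finish before compute uses it
        return casted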

Update CI workflow to remove dead macOS runner. (#10704)

* Update CI workflow to remove dead macOS runner.

* revert

* revert

Don't pin tensor if not a torch.nn.parameter.Parameter (#10718)

Update README.md for Intel Arc GPU installation, remove IPEX (#10729)

IPEX is no longer needed for Intel Arc GPUs. Removing the instructions to set up IPEX.

mm/mp: always unload re-used but modified models (#10724)

The partial unloader path in the model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that
if you do an upscale flow with a model patch on an existing model, it
will not apply your patches.

Fix by delaying the partial_unload until after the UUID checks. This
is done by making partial_unload a mode of partial_load where extra_mem
is negative.
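
A hedged sketch of the re-use flow described above; the class and method names are assumptions for illustration, not the real model management API:

    def reuse_loaded_model(loaded, new_patcher, extra_mem):
        # Check the patch UUID first so a modified model gets repatched ...
        if loaded.patch_uuid != new_patcher.patch_uuid:
            loaded.repatch(new_patcher)
        # ... and only then partially unload, expressed as a partial load
        # with negative extra memory.
        loaded.partial_load(extra_mem)  # extra_mem < 0 acts as a partial unload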

qwen: reduce VRAM usage (#10725)

Clean up a bunch of stacked and no-longer-needed tensors at the QWEN
VRAM peak (currently the FFN).

With this I go from OOMing at B=37x1328x1328 to being able to
successfully run B=47 (RTX5090).
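
Illustrative only (not the QWEN code): dropping references to large intermediates before the next big allocation is the general trick that lowers the peak:

    import torch

    def ffn_like(x, w_up, w_down):
        h = x @ w_up                       # large intermediate
        a = torch.nn.functional.gelu(h)
        del h                              # release the pre-activation before the next matmul allocates
        out = a @ w_down
        del a
        return out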

Update Python 3.14 compatibility notes in README (#10730)

Quantized Ops fixes (#10715)

* offload support, bug fixes, remove mixins

* add readme

add PR template for API-Nodes (#10736)

feat: add create_time dict to prompt field in /history and /queue (#10741)

flux: reduce VRAM usage (#10737)

Clean up a bunch of stacked tensors on Flux. This takes me from B=19 to B=22
for 1600x1600 on RTX5090.

Better instructions for the portable. (#10743)

Use same code for chroma and flux blocks so that optimizations are shared. (#10746)

Fix custom nodes import error. (#10747)

This should fix the import errors but will break if the custom nodes actually try to use the class.

revert import reordering

revert imports pt 2

Add left padding support to tokenizers. (#10753)

chore(api-nodes): mark OpenAIDalle2 and OpenAIDalle3 nodes as deprecated (#10757)

Revert "chore(api-nodes): mark OpenAIDalle2 and OpenAIDalle3 nodes as deprecated (#10757)" (#10759)

This reverts commit 9a02382568.

Change ROCm nightly install command to 7.1 (#10764)
2025-11-17 06:16:21 +01:00
comfyanonymous
7f3e4d486c
Limit amount of pinned memory on windows to prevent issues. (#10638) 2025-11-04 17:37:50 -05:00
rattus
ab7ab5be23
Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is necessary as pytorch will queue CUDA async frees on the
same stream the tensor was created on. In the case of async offload, this
will be on the offload stream.

Weights and biases can go out of scope in python, which then
triggers the pytorch garbage collector to queue the free operation on
the offload stream, possibly before the compute stream has used the
weight. This causes a use-after-free on the weight data, leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight
has been used, so the free has to wait for the weight to be used.

cast_bias_weight is extended in a backwards-compatible way, with the
new behaviour opt-in via a defaulted parameter. This handles
custom node packs that call cast_bias_weight and disables
async-offload for them (as they do not handle the race).

The pattern is now (see the sketch after this commit entry):

    cast_bias_weight(..., offloadable=True)  # This might be offloaded
    thing(weight, bias, ...)
    uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is necessary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesn't allow a lot of
parallelization, as the end result is that you can only get one weight
load ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute stream's position.
2025-10-29 17:17:46 -04:00
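
A hedged sketch of the opt-in pattern from this commit; the function bodies are simplified illustrations, not ComfyUI's actual ops implementation:

    import torch

    def cast_bias_weight(module, offload_stream, compute_stream, offloadable=False):
        stream = offload_stream if offloadable else compute_stream
        with torch.cuda.stream(stream):
            weight = module.weight.to("cuda", non_blocking=True)
            bias = None if module.bias is None else module.bias.to("cuda", non_blocking=True)
        if offloadable:
            compute_stream.wait_stream(offload_stream)  # weight must be ready before first use
        return weight, bias

    def uncast_bias_weight(offload_stream, compute_stream, offloadable=False):
        if offloadable:
            # The offload stream now waits for compute, so pytorch's queued async
            # free of the weight cannot run before its last use on the compute stream.
            offload_stream.wait_stream(compute_stream)
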
comfyanonymous
3fa7a5c04a
Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
2025-10-29 00:21:01 -04:00
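
Illustrative of why pinning helps (not the ComfyUI implementation): a page-locked host buffer lets the host-to-device copy run truly asynchronously and overlap with compute:

    import torch

    if torch.cuda.is_available():
        cpu_weight = torch.empty(1024, 1024, pin_memory=True)   # page-locked host memory
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)   # async H2D copy can overlap compute
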
comfyanonymous
098a352f13
Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
comfyanonymous
426cde37f1
Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
comfyanonymous
9cdc64998f
Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
comfyanonymous
2c2aa409b0
Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
comfyanonymous
5b80addafd
Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
comfyanonymous
1c10b33f9b
gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
comfyanonymous
c8674bc6e9
Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332) 2025-10-13 21:19:03 -04:00
comfyanonymous
a125cd84b0
Improve AMD performance. (#10302)
I honestly have no idea why this improves things but it does.
2025-10-12 00:28:01 -04:00
Guy Niv
c8d2117f02
Fix memory leak by properly detaching model finalizer (#9979)
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
linear memory consumption increase over time as models are repeatedly
loaded and unloaded.

This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
2025-09-24 22:35:12 -04:00
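
A hedged, self-contained illustration of the fix's idea using weakref.finalize; the names are made up and this is not the load_models_gpu code:

    import weakref

    class DummyModel:
        pass

    def cleanup():
        print("model cleaned up")

    m = DummyModel()
    fin = weakref.finalize(m, cleanup)

    # When unloading deliberately, detach the finalizer so the registry does not
    # keep holding the callback (and anything it references) until interpreter exit.
    fin.detach()
    del m
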
DELUXA
8d6653fca6
Enable fp8 ops by default on gfx1200 (#9926) 2025-09-18 19:50:37 -04:00
comfyanonymous
fb763d4333
Fix amd_min_version crash when cpu device. (#9754) 2025-09-07 21:16:29 -04:00
comfyanonymous
bcbd7884e3
Don't enable pytorch attention on AMD if triton isn't available. (#9747) 2025-09-07 00:29:38 -04:00
comfyanonymous
27a0fcccc3
Enable bf16 VAE on RDNA4. (#9746) 2025-09-06 23:25:22 -04:00
comfyanonymous
0963493a9c
Support for Qwen Diffsynth Controlnets canny and depth. (#9465)
These are not real controlnets but actually patches on the model, so they
will be treated as such.

Put them in the models/model_patches/ folder.

Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.
2025-08-20 22:26:37 -04:00
Simon Lui
c991a5da65
Fix XPU iGPU regressions (#9322)
* Change the bf16 check, switch non-blocking off by default with an option to force it on to regain speed on certain classes of iGPUs, and refactor the XPU check.

* Turn non_blocking off by default for xpu.

* Update README.md for Intel GPUs.
2025-08-13 19:13:35 -04:00
comfyanonymous
5828607ccf
Not sure if AMD actually supports fp16 acc but it doesn't crash. (#9258) 2025-08-09 12:49:25 -04:00
comfyanonymous
735bb4bdb1
Users report gfx1201 is buggy on flux with pytorch attention. (#9244) 2025-08-08 04:21:00 -04:00
comfyanonymous
7d593baf91
Extra reserved vram on large cards on windows. (#9093) 2025-07-29 04:07:45 -04:00
comfyanonymous
69cb57b342
Print xpu device name. (#9035) 2025-07-24 15:06:25 -04:00
honglyua
0ccc88b03f
Support Iluvatar CoreX (#8585)
* Support Iluvatar CoreX
Co-authored-by: mingjiang.li <mingjiang.li@iluvatar.com>
2025-07-24 13:57:36 -04:00
comfyanonymous
d3504e1778
Enable pytorch attention by default for gfx1201 on torch 2.8 (#9029) 2025-07-23 19:21:29 -04:00
comfyanonymous
a86a58c308
Fix xpu function not implemented p2. (#9027) 2025-07-23 18:18:20 -04:00
comfyanonymous
39dda1d40d
Fix xpu function not implemented. (#9026) 2025-07-23 18:10:59 -04:00
comfyanonymous
5ad33787de
Add default device argument. (#9023) 2025-07-23 14:20:49 -04:00
Simon Lui
255f139863
Add xpu version for async offload and some other things. (#9004) 2025-07-22 15:20:09 -04:00
comfyanonymous
a96e65df18
Disable omnigen2 fp16 on older pytorch versions. (#8672) 2025-06-26 03:39:09 -04:00
comfyanonymous
6e28a46454
Apple most likely is never fixing the fp16 attention bug. (#8485) 2025-06-10 13:06:24 -04:00
comfyanonymous
7f800d04fa
Enable AMD fp8 and pytorch attention on some GPUs. (#8474)
Information is from the pytorch source code.
2025-06-09 12:50:39 -04:00
comfyanonymous
97755eed46
Enable fp8 ops by default on gfx1201 (#8464) 2025-06-08 14:15:34 -04:00
comfyanonymous
daf9d25ee2
Cleaner torch version comparisons. (#8453) 2025-06-07 10:01:15 -04:00
comfyanonymous
704fc78854
Put ROCm version in tuple to make it easier to enable stuff based on it. (#8348) 2025-05-30 15:41:02 -04:00
comfyanonymous
89a84e32d2
Disable initial GPU load when novram is used. (#8294) 2025-05-26 16:39:27 -04:00
comfyanonymous
e5799c4899
Enable pytorch attention by default on AMD gfx1151 (#8282) 2025-05-26 04:29:25 -04:00
comfyanonymous
0b50d4c0db
Add argument to explicitly enable fp8 compute support. (#8257)
This can be used to test if your current GPU/pytorch version supports fp8 matrix mult in combination with --fast or the fp8_e4m3fn_fast dtype.
2025-05-23 17:43:50 -04:00
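
A hedged quick check (not ComfyUI's detection logic) for whether your torch build even exposes the fp8 dtypes before experimenting with the new argument:

    import torch

    print("has float8_e4m3fn:", hasattr(torch, "float8_e4m3fn"))
    print("cuda available:", torch.cuda.is_available())
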
comfyanonymous
0a66d4b0af
Per device stream counters for async offload. (#7873) 2025-04-29 20:28:52 -04:00
comfyanonymous
5a50c3c7e5
Fix stream priority to support older pytorch. (#7856) 2025-04-28 13:07:21 -04:00
comfyanonymous
c8cd7ad795
Use stream for casting if enabled. (#7833) 2025-04-27 05:38:11 -04:00
comfyanonymous
0dcc75ca54
Add experimental --async-offload lowvram weight offloading. (#7820)
This should speed up the lowvram mode a bit. It currently is only enabled when --async-offload is used but it will be enabled by default in the future if there are no problems.
2025-04-26 16:11:21 -04:00
comfyanonymous
2d6805ce57
Add option for using fp8_e8m0fnu for model weights. (#7733)
Seems to break every model I have tried but worth testing?
2025-04-22 06:17:38 -04:00
BiologicalExplosion
2222cf67fd
MLU memory optimization (#7470)
Co-authored-by: huzhan <huzhan@cambricon.com>
2025-04-02 19:24:04 -04:00
BVH
301e26b131
Add option to store TE in bf16 (#7461) 2025-04-01 13:48:53 -04:00
comfyanonymous
8edc1f44c1 Support more float8 types. 2025-03-25 05:23:49 -04:00
FeepingCreature
7aceb9f91c
Add --use-flash-attention flag. (#7223)
* Add --use-flash-attention flag.
This is useful on AMD systems, as FA builds are still 10% faster than Pytorch cross-attention.
2025-03-14 03:22:41 -04:00
comfyanonymous
35504e2f93 Fix. 2025-03-13 15:03:18 -04:00
comfyanonymous
299436cfed Print mac version. 2025-03-13 10:05:40 -04:00
comfyanonymous
0952569493 Fix stable cascade VAE on some lowvram machines. 2025-03-08 20:24:04 -05:00