patientx
dda057a4f4
Merge branch 'comfyanonymous:master' into master
2025-11-20 10:07:21 +03:00
comfyanonymous
cb96d4d18c
Disable workaround on newer cudnn. ( #10807 )
2025-11-19 23:56:23 -05:00
patientx
acf5a0ac72
Merge branch 'comfyanonymous:master' into master
2025-11-19 10:36:05 +03:00
comfyanonymous
17027f2a6a
Add a way to disable the final norm in the llama based TE models. ( #10794 )
2025-11-18 22:36:03 -05:00
patientx
19f8286151
Merge branch 'comfyanonymous:master' into master
2025-11-19 02:02:57 +03:00
comfyanonymous
d526974576
Fix hunyuan 3d 2.0 ( #10792 )
2025-11-18 16:46:19 -05:00
patientx
850c5d5db5
Merge branch 'comfyanonymous:master' into master
2025-11-15 17:45:34 +03:00
comfyanonymous
bd01d9f7fd
Add left padding support to tokenizers. ( #10753 )
2025-11-15 06:54:40 -05:00
patientx
9f9ce655f2
Merge branch 'comfyanonymous:master' into master
2025-11-14 12:55:30 +03:00
comfyanonymous
443056c401
Fix custom nodes import error. ( #10747 )
This should fix the import errors but will break if the custom nodes actually try to use the class.
2025-11-14 03:26:05 -05:00
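To illustrate the pattern #10747 describes (keep the import working, fail only if the class is actually used), here is a minimal sketch; the class name is hypothetical and not ComfyUI's actual code:

```python
# Hedged sketch: a stub keeps `from some_module import RemovedHelper` in custom nodes
# importable, but any attempt to actually instantiate or use the class fails loudly.
class RemovedHelper:
    """Placeholder kept only so old import lines in custom nodes don't crash."""

    def __init__(self, *args, **kwargs):
        raise NotImplementedError(
            "RemovedHelper no longer exists; importing it works, using it does not."
        )
```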
comfyanonymous
f60923590c
Use same code for chroma and flux blocks so that optimizations are shared. ( #10746 )
2025-11-14 01:28:05 -05:00
rattus
94c298f962
flux: reduce VRAM usage ( #10737 )
Clean up a bunch of stacked tensors on Flux. This takes me from B=19 to B=22
for 1600x1600 on RTX5090.
2025-11-13 16:02:03 -08:00
patientx
184fdd1103
Merge branch 'comfyanonymous:master' into master
2025-11-13 13:23:21 +03:00
contentis
3b3ef9a77a
Quantized Ops fixes ( #10715 )
* offload support, bug fixes, remove mixins
* add readme
2025-11-12 18:26:52 -05:00
patientx
00609a5102
Merge branch 'comfyanonymous:master' into master
2025-11-13 00:55:19 +03:00
rattus
1c7eaeca10
qwen: reduce VRAM usage ( #10725 )
Clean up a bunch of stacked and no-longer-needed tensors at the QWEN
VRAM peak (currently the FFN).
With this I go from OOMing at B=37x1328x1328 to being able to
successfully run B=47 (RTX5090).
2025-11-12 16:20:53 -05:00
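A rough sketch of the technique behind #10725 and #10737, not the actual model code: dropping references to intermediates that are no longer needed before the FFN allocates its large hidden activation lets PyTorch's caching allocator reuse that memory, which is what lowers the VRAM peak.

```python
import torch
import torch.nn as nn

def block_forward(x, attn, ffn, norm1, norm2):
    h = x + attn(norm1(x))
    del x                     # the stacked input is no longer needed; free it before the FFN peak
    return h + ffn(norm2(h))  # the FFN hidden activation is where VRAM currently peaks

# Toy usage: with real CUDA modules, fewer live tensors at the peak shows up directly
# in torch.cuda.max_memory_allocated().
dim = 64
out = block_forward(torch.randn(2, 16, dim), nn.Identity(), nn.Linear(dim, dim),
                    nn.LayerNorm(dim), nn.LayerNorm(dim))
print(out.shape)  # torch.Size([2, 16, 64])
```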
rattus
18e7d6dba5
mm/mp: always unload re-used but modified models ( #10724 )
The partial unloader path in the model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that
if you run an upscale flow with a model patch on an existing model, it
will not apply your patches.
Fix by delaying the partial_unload until after the UUID checks. This
is done by making partial_unload a mode of partial_load where extra_mem
is negative.
2025-11-12 16:19:53 -05:00
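A minimal sketch of the control flow #10724 describes, with hypothetical names rather than ComfyUI's real model_management API: the memory-freeing step only runs after the patching UUID has been compared, and the partial unload is expressed as a partial load with negative extra memory.

```python
def reuse_loaded_model(loaded, wanted, extra_mem):
    # 1. Check patches first: if the patching UUID changed, the cached weights are stale.
    if loaded.patch_uuid != wanted.patch_uuid:
        loaded.unload()
        return wanted.load()
    # 2. Only now free memory; a negative extra_mem turns partial_load into a partial unload.
    loaded.partial_load(extra_mem)   # extra_mem < 0  ==> unload roughly that many bytes
    return loaded
```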
patientx
644778be49
Merge branch 'comfyanonymous:master' into master
2025-11-12 17:55:30 +03:00
comfyanonymous
1199411747
Don't pin tensor if not a torch.nn.parameter.Parameter ( #10718 )
2025-11-11 19:33:30 -05:00
patientx
3662d0a2ce
Merge branch 'comfyanonymous:master' into master
2025-11-10 14:05:53 +03:00
rattus
c350009236
ops: Put weight cast on the offload stream ( #10697 )
The weight cast needs to be on the offload stream. The issue reproduced as a black
screen with low-resolution images on a slow bus when using FP8.
2025-11-09 22:52:11 -05:00
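In stream terms, #10697 means the dtype cast has to ride on the same stream as the host-to-device copy. A hedged sketch under that assumption; the stream setup and function below are illustrative, not the actual comfy.ops code:

```python
import torch

offload_stream = torch.cuda.Stream()

def load_weight(weight_cpu, device, dtype):
    with torch.cuda.stream(offload_stream):
        w = weight_cpu.to(device, non_blocking=True)  # async H2D copy on the offload stream
        # The cast must happen on this stream too: if it ran on the compute stream it could
        # read the buffer before the copy finished and produce garbage (the black screen above).
        w = w.to(dtype)
    torch.cuda.current_stream().wait_stream(offload_stream)  # compute uses w only after copy + cast
    return w
```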
comfyanonymous
dea899f221
Unload weights if vram usage goes up between runs. ( #10690 )
2025-11-09 18:51:33 -05:00
comfyanonymous
e632e5de28
Add logging for model unloading. ( #10692 )
2025-11-09 18:06:39 -05:00
patientx
d56cb56059
Merge branch 'comfyanonymous:master' into master
2025-11-09 13:41:51 +03:00
comfyanonymous
2abd2b5c20
Make ScaleROPE node work on Flux. ( #10686 )
2025-11-08 15:52:02 -05:00
patientx
8e02689534
Merge branch 'comfyanonymous:master' into master
2025-11-07 20:30:21 +03:00
comfyanonymous
a1a70362ca
Only unpin tensor if it was pinned by ComfyUI ( #10677 )
2025-11-07 11:15:05 -05:00
patientx
d29dbbd829
Merge branch 'comfyanonymous:master' into master
2025-11-07 14:27:13 +03:00
rattus
cf97b033ee
mm: guard against double pin and unpin explicitly ( #10672 )
As commented, if you let CUDA be the one to detect double pinning/unpinning,
it actually creates an async GPU error.
2025-11-06 21:20:48 -05:00
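A sketch of the explicit guard #10672 describes; the pointer-tracking set and the `_register`/`_unregister` placeholders are hypothetical stand-ins for whatever low-level host-register calls are really used. The point is that the double-pin/double-unpin check happens in Python, because letting CUDA reject the second call can surface later as an asynchronous GPU error:

```python
_pinned_ptrs = set()  # data pointers of buffers we pinned ourselves

def pin(tensor, _register=lambda ptr, nbytes: None):
    ptr = tensor.data_ptr()
    if ptr in _pinned_ptrs:
        return False                      # already pinned by us: skip, never ask CUDA twice
    _register(ptr, tensor.numel() * tensor.element_size())
    _pinned_ptrs.add(ptr)
    return True

def unpin(tensor, _unregister=lambda ptr: None):
    ptr = tensor.data_ptr()
    if ptr not in _pinned_ptrs:
        return False                      # never pinned by us (or already unpinned): skip
    _unregister(ptr)
    _pinned_ptrs.discard(ptr)
    return True
```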
patientx
3ab45ae725
Merge branch 'comfyanonymous:master' into master
2025-11-06 15:35:41 +03:00
comfyanonymous
09dc24c8a9
Pinned mem also seems to work on AMD. ( #10658 )
2025-11-05 19:11:15 -05:00
comfyanonymous
1d69245981
Enable pinned memory by default on Nvidia. ( #10656 )
Removed the --fast pinned_memory flag.
You can use --disable-pinned-memory to disable it. Please report if it
causes any issues.
2025-11-05 18:08:13 -05:00
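For context on what pinning buys: a page-locked (pinned) host buffer is what allows `.to(device, non_blocking=True)` to overlap with compute. A hedged sketch of a helper gated the way the commits above describe; `--disable-pinned-memory` is the real flag, everything else here is illustrative:

```python
import torch

def maybe_pin(t: torch.Tensor, pinned_memory_enabled: bool = True) -> torch.Tensor:
    """Return a pinned copy of a CPU tensor so async H2D copies can overlap with compute."""
    if not pinned_memory_enabled or t.is_cuda or t.is_pinned():
        return t
    try:
        return t.pin_memory()
    except RuntimeError:
        # Pinning can fail once too much page-locked memory is in use (hence the Windows cap above).
        return t
```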
comfyanonymous
97f198e421
Fix qwen controlnet regression. ( #10657 )
2025-11-05 18:07:35 -05:00
patientx
84faf45f09
Merge branch 'comfyanonymous:master' into master
2025-11-05 13:07:02 +03:00
comfyanonymous
c4a6b389de
Lower ltxv mem usage to what it was before previous pr. ( #10643 )
Bring qwen behavior back to what it was before the previous PR.
2025-11-04 22:47:35 -05:00
contentis
4cd881866b
Use single apply_rope function across models ( #10547 )
2025-11-04 20:10:11 -05:00
comfyanonymous
7f3e4d486c
Limit amount of pinned memory on windows to prevent issues. ( #10638 )
2025-11-04 17:37:50 -05:00
patientx
11083ab58c
Merge branch 'comfyanonymous:master' into master
2025-11-04 13:09:30 +03:00
comfyanonymous
af4b7b5edb
More fp8 torch.compile regressions fixed. ( #10625 )
2025-11-03 22:14:20 -05:00
comfyanonymous
0f4ef3afa0
This seems to slow things down slightly on Linux. ( #10624 )
2025-11-03 21:47:14 -05:00
comfyanonymous
6b88478f9f
Bring back fp8 torch compile performance to what it should be. ( #10622 )
2025-11-03 19:22:10 -05:00
comfyanonymous
e199c8cc67
Fixes ( #10621 )
2025-11-03 17:58:24 -05:00
comfyanonymous
0652cb8e2d
Speed up torch.compile ( #10620 )
2025-11-03 17:37:12 -05:00
comfyanonymous
958a17199a
People should update their pytorch versions. ( #10618 )
2025-11-03 17:08:30 -05:00
patientx
f130713953
Merge branch 'comfyanonymous:master' into master
2025-11-03 03:07:02 +03:00
comfyanonymous
97ff9fae7e
Clarify help text for --fast argument ( #10609 )
Updated help text for the --fast argument to clarify potential risks.
2025-11-02 13:14:04 -05:00
rattus
135fa49ec2
Small speed improvements to --async-offload ( #10593 )
* ops: don't take an offload stream if you don't need one
* ops: prioritize mem transfer
The async offload stream's reason for existence is to transfer from
RAM to GPU. The post-processing compute steps are a bonus on the side
stream, but if the compute stream is running a long kernel, it can
stall the side stream as it waits to type-cast the bias before
transferring the weight. So do a pure transfer of the weight straight up,
then do everything for the bias, then go back to fix the weight type and
apply the weight patches.
2025-11-01 18:48:53 -04:00
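A sketch of the ordering #10593 describes, using an assumed two-stream setup rather than ComfyUI's actual ops code: the raw weight transfer is issued first so it is never queued behind a bias cast that is itself waiting on a long compute kernel.

```python
import torch

offload_stream = torch.cuda.Stream()

def stage_layer(weight_cpu, bias_cpu, device, dtype):
    with torch.cuda.stream(offload_stream):
        w = weight_cpu.to(device, non_blocking=True)  # 1. pure weight transfer, nothing queued in front
        b = bias_cpu.to(device, non_blocking=True)    # 2. then everything for the bias...
        b = b.to(dtype)
        w = w.to(dtype)                               # 3. ...then fix the weight type (patches would follow)
    torch.cuda.current_stream().wait_stream(offload_stream)
    return w, b
```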
comfyanonymous
44869ff786
Fix issue with pinned memory. ( #10597 )
2025-11-01 17:25:59 -04:00
patientx
a47b0ea003
Merge branch 'comfyanonymous:master' into master
2025-11-01 14:17:11 +03:00
comfyanonymous
c58c13b2ba
Fix torch compile regression on fp8 ops. ( #10580 )
2025-11-01 00:25:17 -04:00