Commit Graph

1748 Commits

Author SHA1 Message Date
Jedrzej Kosinski
61133af772 Add '--flipflop-offload' startup argument 2025-10-13 21:10:44 -07:00
Jedrzej Kosinski
586a8de8da Merge branch 'master' into flipflop-stream 2025-10-13 21:04:37 -07:00
comfyanonymous
3374e900d0
Faster workflow cancelling. (#10301) 2025-10-13 23:43:53 -04:00
comfyanonymous
dfff7e5332
Better memory estimation for the SD/Flux VAE on AMD. (#10334) 2025-10-13 22:37:19 -04:00
comfyanonymous
e4ea393666
Fix loading old stable diffusion ckpt files on newer numpy. (#10333) 2025-10-13 22:18:58 -04:00
comfyanonymous
c8674bc6e9
Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332) 2025-10-13 21:19:03 -04:00
rattus128
95ca2e56c8
WAN2.2: Fix cache VRAM leak on error (#10308)
Same change pattern as 7e8dd275c2
applied to WAN2.2

If this suffers an exception (such as a VRAM OOM), it will leave the
encode() and decode() methods, which skips the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
to this object, which in turn references large tensors from the failed
execution.

The feature cache is currently set up as a class variable on the
encoder/decoder; however, the encode and decode functions always clear
it on both entry and exit of normal execution.

It's likely the design intent is for this to be usable as a streaming
encoder where the input comes in batches, but the functions as they
are today don't support that.

So simplify by bringing the cache back to a local variable, so that if
it does VRAM OOM, the cache itself becomes properly collectible
garbage once the encode()/decode() functions disappear from the stack.
2025-10-13 15:23:11 -04:00
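The shape of this fix, as a minimal Python sketch (the class and method
names here are illustrative stand-ins, not the actual WAN VAE code):

```python
class WanDecoder:
    """Illustrative stand-in for the WAN VAE decoder."""

    def _run_blocks(self, x, feat_cache):
        feat_cache.append(x)  # stand-in for caching large tensors
        return x

    # Before: the cache lives on the object. If _run_blocks() raises
    # (e.g. a VRAM OOM), the final clear never runs, and anything still
    # holding this decoder also holds the cached tensors.
    def decode_leaky(self, x):
        self.feat_cache = []
        out = self._run_blocks(x, self.feat_cache)
        self.feat_cache = []  # skipped when an exception unwinds
        return out

    # After: the cache is a local. When an exception unwinds decode(),
    # the frame is torn down and the cache becomes collectible garbage.
    def decode(self, x):
        feat_cache = []
        return self._run_blocks(x, feat_cache)
```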
comfyanonymous
e693e4db6a
Always set diffusion model to eval() mode. (#10331) 2025-10-13 14:57:27 -04:00
comfyanonymous
a125cd84b0
Improve AMD performance. (#10302)
I honestly have no idea why this improves things but it does.
2025-10-12 00:28:01 -04:00
comfyanonymous
84e9ce32c6
Implement the mmaudio VAE. (#10300) 2025-10-11 22:57:23 -04:00
comfyanonymous
f1dd6e50f8
Fix bug with applying loras on fp8 scaled without fp8 ops. (#10279) 2025-10-09 19:02:40 -04:00
comfyanonymous
139addd53c
More surgical fix for #10267 (#10276) 2025-10-09 16:37:35 -04:00
comfyanonymous
6e59934089
Refactor model sampling sigmas code. (#10250) 2025-10-08 17:49:02 -04:00
comfyanonymous
8aea746212
Implement gemma 3 as a text encoder. (#10241)
Not useful yet.
2025-10-06 22:08:08 -04:00
comfyanonymous
195e0b0639
Remove useless code. (#10223) 2025-10-05 15:41:19 -04:00
Jedrzej Kosinski
5329180fce Made flipflop consider partial_unload and partial_offload, and added flip+flop to mem counters 2025-10-03 16:21:01 -07:00
Jedrzej Kosinski
0fdd327c2f Merge branch 'master' into flipflop-stream 2025-10-03 14:32:56 -07:00
Finn-Hecker
93d859cfaa
Fix type annotation syntax in MotionEncoder_tc __init__ (#10186)
## Summary
Fixed incorrect type hint syntax in `MotionEncoder_tc.__init__()` parameter list.

## Changes
- Line 647: Changed `num_heads=int` to `num_heads: int` 
- This corrects the parameter annotation from a default value assignment to proper type hint syntax

## Details
The parameter was using assignment syntax (`=`) instead of type annotation syntax (`:`), which would incorrectly set the default value to the `int` class itself rather than annotating the expected type.
2025-10-03 14:32:19 -07:00
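For illustration, the difference between the two spellings (a minimal
sketch, not the full MotionEncoder_tc signature):

```python
# Before: "=" gives the parameter a *default value* of the int class
# itself and no type annotation at all.
def __init__(self, num_heads=int):
    ...

# After: ":" annotates the expected type; the parameter has no default.
def __init__(self, num_heads: int):
    ...
```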
Jedrzej Kosinski
ee01002e63 Add flipflop support to (base) WAN; fix issue with loras being applied to flipflop weights on CPU instead of GPU; left some timing functions in, as the lora application time could use some reduction 2025-10-02 22:02:50 -07:00
Jedrzej Kosinski
831c3cf05e Add a temporary workaround for an odd number of blocks not producing expected results 2025-10-02 20:29:11 -07:00
Jedrzej Kosinski
0d8e8abd90 Default to smaller blocks getting flipflopped first 2025-10-02 18:00:21 -07:00
Jedrzej Kosinski
d5001ed90e Make flux support flipflop 2025-10-02 17:53:22 -07:00
Jedrzej Kosinski
8d7b22b720 Fixed FlipFlopModule.execute_blocks having hardcoded strings from Qwen 2025-10-02 17:49:43 -07:00
Jedrzej Kosinski
6d3ec9fcf3 Simplified flipflop setup by adding FlipFlopModule.execute_blocks helper 2025-10-02 16:46:37 -07:00
Jedrzej Kosinski
c4420b6a41 Change log string slightly 2025-10-02 15:34:35 -07:00
Jedrzej Kosinski
a282586995 Merge branch 'master' into flipflop-stream 2025-10-02 15:03:26 -07:00
Jedrzej Kosinski
0df61b5032 Fix improper index slicing for flipflop get blocks, add extra log message 2025-10-01 21:21:36 -07:00
Jedrzej Kosinski
7c896c5567 Initial automatic support for flipflop within ModelPatcher - only Qwen Image diffusion_model uses FlipFlopModule currently 2025-10-01 20:13:50 -07:00
rattus128
4965c0e2ac
WAN: Fix cache VRAM leak on error (#10141)
If this suffers an exception (such as a VRAM OOM), it will leave the
encode() and decode() methods, which skips the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
to this object, which in turn references large tensors from the failed
execution.

The feature cache is currently set up as a class variable on the
encoder/decoder; however, the encode and decode functions always clear
it on both entry and exit of normal execution.

It's likely the design intent is for this to be usable as a streaming
encoder where the input comes in batches, but the functions as they
are today don't support that.

So simplify by bringing the cache back to a local variable, so that if
it does VRAM OOM, the cache itself becomes properly collectible
garbage once the encode()/decode() functions disappear from the stack.
2025-10-01 18:42:16 -04:00
rattus128
911331c06c
sd: fix VAE tiled fallback VRAM leak (#10139)
When the VAE catches a VRAM OOM, it launches the fallback logic
straight from the exception context.

Python, however, keeps references to the entire call stack that caused
the exception, including any local variables, for the sake of
exception reporting and debugging. In the case of tensors, this can
hold onto references to GBs of VRAM and prevent the allocator from
freeing them.

So drop the except context completely before going back to the VAE via
the tiler, by getting out of the except block with nothing but a flag.

This greatly increases the reliability of the tiler fallback,
especially on low-VRAM cards: with the bug, if the leak happened to
retain more than the headroom needed for a single tile, the tiler
fallback would OOM and fail the flow.
2025-10-01 18:40:28 -04:00
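The pattern described above, as a minimal sketch (decode_full and
decode_tiled are hypothetical stand-ins, not the actual comfy VAE API):

```python
import torch

def vae_decode(vae, samples):
    needs_tiled = False
    try:
        return vae.decode_full(samples)
    except torch.cuda.OutOfMemoryError:
        # Keep this block minimal: while inside it, the in-flight
        # exception's traceback pins every frame below us, including
        # the tensor locals of the failed decode.
        needs_tiled = True
    # Out here the exception and the frames it pinned are released, so
    # the failed attempt's VRAM can be freed before the retry allocates.
    if needs_tiled:
        return vae.decode_tiled(samples)
```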
comfyanonymous
a6f83a4a1a
Support the new hunyuan vae. (#10150) 2025-10-01 17:19:13 -04:00
Jedrzej Kosinski
01f4512bf8 In-progress commit on making flipflop async weight streaming native; gave the "loaded partially"/"loaded completely" log messages labels, because having to memorize their meaning for dev work is annoying 2025-09-30 23:08:08 -07:00
Jedrzej Kosinski
8a8162e8da Fix percentage logic, begin adding elements to ModelPatcher to track flip flop compatibility 2025-09-29 22:49:12 -07:00
Jedrzej Kosinski
0e966dcf85 Merge branch 'master' into flipflop-stream 2025-09-27 21:13:26 -07:00
rattus128
653ceab414
Reduce Peak WAN inference VRAM usage - part II (#10062)
* flux: math: Use addcmul_ to avoid an expensive VRAM intermediate

The rope process can be the VRAM peak, and the intermediate allocated
for the addition result before the original is released can OOM.
addcmul_ it.

* wan: Delete the self attention output before cross attention

This saves VRAM when the cross attention and FFN are the VRAM peak.
2025-09-27 18:14:16 -04:00
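For the first change, a minimal sketch of the difference, assuming
equally shaped PyTorch tensors (the shapes below are illustrative, not
the actual flux rope code):

```python
import torch

a = torch.randn(2, 1024, 64)
b = torch.randn_like(a)
c = torch.randn_like(a)

# Before: materializes b * c and then a + (b * c) as fresh tensors
# before the old `a` can be released -- a transient allocation spike.
a = a + b * c

# After: a fused in-place multiply-add accumulates directly into `a`,
# avoiding the intermediate entirely.
a.addcmul_(b, c)
```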
Jedrzej Kosinski
196954ab8c
Add 'input_cond' and 'input_uncond' to the args dictionary passed into sampler_cfg_function (#10044) 2025-09-26 19:55:03 -07:00
comfyanonymous
1e098d6132
Don't add template to qwen2.5vl when template is in prompt. (#10043)
Make the hunyuan image refiner template_end 36.
2025-09-26 18:34:17 -04:00
Jedrzej Kosinski
6b240b0bce Refactored old flip flop into a new implementation that allows for controlling the percentage of blocks getting flip flopped, converted nodes to v3 schema 2025-09-25 22:41:41 -07:00
Jedrzej Kosinski
f9fbf902d5 Added missing Qwen block params, further subdivided blocks function 2025-09-25 17:49:39 -07:00
Jedrzej Kosinski
f083720eb4 Refactored FlipFlopTransformer.__call__ to fully separate out actions between flip and flop 2025-09-25 16:16:51 -07:00
Jedrzej Kosinski
84e73f2aa5 Brought over flip flop prototype from contentis' fork, limiting it to only Qwen to ease the process of adapting it to be a native feature 2025-09-25 16:15:46 -07:00
Guy Niv
c8d2117f02
Fix memory leak by properly detaching model finalizer (#9979)
When unloading models in load_models_gpu(), the model finalizer was
not being explicitly detached, leading to a memory leak. This caused
memory consumption to increase linearly over time as models were
repeatedly loaded and unloaded.

This change prevents orphaned finalizer references from accumulating
in memory during model switching operations.
2025-09-24 22:35:12 -04:00
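One way such a finalizer leak arises and is fixed, sketched with
weakref.finalize (LoadedModel and its methods are illustrative, not
the actual comfy model management code):

```python
import weakref

class LoadedModel:
    def __init__(self, model):
        self.model = model
        # weakref.finalize keeps every registered finalizer alive in an
        # internal registry until it either fires or is detached.
        self._finalizer = weakref.finalize(self, LoadedModel._on_collect)

    @staticmethod
    def _on_collect():
        pass  # release VRAM, drop caches, etc.

    def unload(self):
        # Explicit unload already performs the cleanup, so detach the
        # finalizer; otherwise every load/unload cycle leaves another
        # live finalizer (plus whatever it captured) behind.
        self._finalizer.detach()
        self.model = None
```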
comfyanonymous
fccab99ec0
Fix issue with .view() in HuMo. (#10014) 2025-09-24 20:09:42 -04:00
comfyanonymous
1fee8827cb
Support for qwen edit plus model. Use the new TextEncodeQwenImageEditPlus. (#9986) 2025-09-22 16:49:48 -04:00
comfyanonymous
d1d9eb94b1
Lower wan memory estimation value a bit. (#9964)
The previous PR reduced the peak memory requirement.
2025-09-20 22:09:35 -04:00
Kohaku-Blueleaf
7be2b49b6b
Fix LoRA Trainer bugs with FP8 models. (#9854)
* Fix adapter weight init

* Fix fp8 model training

* Avoid inference tensor
2025-09-20 21:24:48 -04:00
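For the "avoid inference tensor" point, a minimal sketch of the usual
workaround (assuming the problem is a weight created under
torch.inference_mode()):

```python
import torch

with torch.inference_mode():
    w = torch.randn(4, 4)  # w is an "inference tensor"

# Inference tensors cannot participate in autograd, so training on w
# directly raises a RuntimeError. Cloning outside inference mode yields
# a normal tensor that can require gradients.
w_train = w.clone().requires_grad_(True)
```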
comfyanonymous
e8df53b764
Update WanAnimateToVideo to more easily extend videos. (#9959) 2025-09-19 18:48:56 -04:00
comfyanonymous
dc95b6acc0
Basic WIP support for the wan animate model. (#9939) 2025-09-19 03:07:17 -04:00
comfyanonymous
24b0fce099
Do padding of audio embed in model for humo for more flexibility. (#9935) 2025-09-18 19:54:16 -04:00
DELUXA
8d6653fca6
Enable fp8 ops by default on gfx1200 (#9926) 2025-09-18 19:50:37 -04:00