* causal_video_ae: Remove attention ResNet
The attention_head_dim argument does not exist on this constructor, so this
is dead code. Remove it, since generic attention in the middle of the VAE
conflicts with the temporal roll.
* ltx-vae: consolidate causal/non-causal code paths
* ltx-vae: add cache rolling adder
* ltx-vae: use cached adder for resnet
* ltx-vae: Implement rolling VAE
Implement a temporal rolling VAE for the LTX2 VAE.
Usually when doing temporal rolling VAEs you can just chunk on time, relying
on causality and caching behind you as you go. The LTX VAE, however, is
non-causal.
So go whole hog and implement per-layer run-ahead and back-pressure between
the decoder layers, using recursive state between the layers.
Operations are amended with a temporal_cache_state{} which they can use to
hold any state they need for partial execution. Convolutions cache up to N-1
trailing frames of their input, and skip connections need to cache the
mismatch between convolution input and output that happens due to missing
future (non-causal) input.
Each call to run_up() processes a layer across a range of input that may or
may not be complete. It goes depth first to process as much as possible,
trying to digest frames to the final output ASAP. If layers run out of input
due to convolution losses, they simply return without action, effectively
applying back-pressure to the earlier layers. As the earlier layers do more
work and call deeper, the partial states are reconciled and output continues
to be digested depth first as much as possible (see the sketch below).
Chunking is done using a size quota rather than a fixed frame length; any
layer can initiate chunking, and multiple layers can chunk at different
granularities. This removes the old limitation of always having to process
one latent frame in its entirety and having to hold 8 full decoded frames as
the VRAM peak.
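A minimal, self-contained control-flow sketch of the run-ahead/back-pressure
idea (invented names, plain valid convolutions, no size-quota logic and no
skip-connection reconciliation; not the actual LTX implementation):

import torch
import torch.nn as nn

def run_up(layers, frames, cache, depth=0):
    """Depth-first roll. layers: list of (module, temporal_kernel) pairs;
    frames: (B, C, T, H, W) chunk for layer `depth`; cache: depth -> trailing
    input frames, standing in for the per-layer temporal_cache_state."""
    if depth == len(layers):
        return frames                              # reached the final output
    layer, kt = layers[depth]
    state = torch.cat([cache.setdefault(depth, frames[:, :, :0]), frames], dim=2)
    usable = state.shape[2] - (kt - 1)             # frames the conv can finish now
    if usable <= 0:
        cache[depth] = state                       # not enough look-ahead yet:
        return frames[:, :, :0]                    # back-pressure, do nothing
    out = layer(state)                             # valid temporal conv -> `usable` frames
    cache[depth] = state[:, :, usable:]            # carry the last kt-1 input frames
    return run_up(layers, out, cache, depth + 1)   # digest depth first ASAP

# Feed 2-frame chunks; output only appears once enough look-ahead has arrived.
layers = [(nn.Conv3d(4, 4, (3, 3, 3), padding=(0, 1, 1)), 3),
          (nn.Conv3d(4, 4, (3, 3, 3), padding=(0, 1, 1)), 3)]
cache, video = {}, torch.randn(1, 4, 12, 8, 8)
for t in range(0, video.shape[2], 2):
    print(run_up(layers, video[:, :, t:t + 2], cache).shape)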
* re-init
* Update model_multitalk.py
* whitespace...
* Update model_multitalk.py
* remove print
* this is redundant
* remove import
* Restore preview functionality
* Move block_idx to transformer_options
* Remove LoopingSamplerCustomAdvanced
* Remove looping functionality, keep extension functionality
* Update model_multitalk.py
* Handle ref_attn_mask with separate patch to avoid having to always return q and k from self_attn
* Chunk attention map calculation for multiple speakers to reduce peak VRAM usage
* Update model_multitalk.py
* Add ModelPatch type back
* Fix for latest upstream
* Use DynamicCombo for cleaner node
Basically just so that single_speaker mode hides mask inputs and 2nd audio input
* Update nodes_wan.py
For LTX Audio VAE, remove normalization of audio during MEL spectrogram creation.
This aligns inference with training and prevents loud audio from being attenuated.
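As a hedged illustration of the change (the transform parameters and the
normalize flag are assumptions, not the actual LTX audio VAE code), the idea
is to stop peak-normalizing the waveform before the mel transform:

import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=1024,
                                            hop_length=256, n_mels=128)

def audio_to_mel(waveform: torch.Tensor, normalize: bool = False) -> torch.Tensor:
    if normalize:  # old behaviour (removed): attenuates loud audio vs. training
        waveform = waveform / waveform.abs().max().clamp(min=1e-8)
    return mel(waveform)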
* Add support for sage attention 3 in comfyui, enable via new cli arg
--use-sage-attiention3
* Fix some bugs found in PR review. The N dimension at which Sage
Attention 3 takes effect is reduced to 1024 (although the improvement is
not significant at this scale).
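A hedged sketch of the kind of threshold dispatch described (the constant
name is invented, N is read here as the key/value sequence length, and the
generic sageattn entry point from the sageattention package stands in for the
actual Sage Attention 3 kernel):

import torch
import torch.nn.functional as F

SAGE3_MIN_N = 1024  # hypothetical threshold below which the gain is negligible

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, heads, seq_len, head_dim)
    if k.shape[2] >= SAGE3_MIN_N:
        try:
            from sageattention import sageattn  # stand-in for the sage-3 call
            return sageattn(q, k, v)
        except ImportError:
            pass
    return F.scaled_dot_product_attention(q, k, v)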
* Remove the Sage Attention3 switch, but retain the attention function
registration.
* Fix a ruff check issue in attention.py
index_timestep_zero can now be selected in the
FluxKontextMultiReferenceLatentMethod node, whose display name is set to the
more generic "Edit Model Reference Method".
The inpaint part is currently missing and will be implemented later.
I think they messed up this model pretty bad. They added some
control_noise_refiner blocks but don't actually use them. There is a typo
in their code so instead of doing control_noise_refiner -> control_layers
it runs the whole control_layers twice.
Unfortunately they trained with this typo so the model works but is kind
of slow and would probably perform a lot better if they corrected their
code and trained it again.
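A paraphrase of the typo described above (the block names come from the
text; the surrounding structure is guessed purely for illustration):

def intended_control_path(x, control_noise_refiner, control_layers):
    x = control_noise_refiner(x)   # refine the control noise first...
    return control_layers(x)       # ...then run the control layers once

def control_path_as_trained(x, control_noise_refiner, control_layers):
    x = control_layers(x)          # typo: the refiner is skipped entirely and
    return control_layers(x)       # the full control_layers stack runs twice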
* Add Kandinsky5 model support
lite and pro T2V tested to work
* Update kandinsky5.py
* Fix fp8
* Fix fp8_scaled text encoder
* Add transformer_options for attention
* Code cleanup, optimizations, use fp32 for all layers originally at fp32
* ImageToVideo -node
* Fix I2V, add necessary latent post process nodes
* Support text to image model
* Support block replace patches (SLG mostly)
* Support official LoRAs
* Don't scale RoPE for lite model as that just doesn't work...
* Update supported_models.py
* Revert RoPE scaling to a simpler one
* Fix typo
* Handle latent dim difference for image model in the VAE instead
* Add node to use different prompts for clip_l and qwen25_7b
* Reduce peak VRAM usage a bit
* Further reduce peak VRAM consumption by chunking ffn
* Update chunking
* Update memory_usage_factor
* Code cleanup, don't force the fp32 layers as it has minimal effect
* Allow for stronger changes with first frames normalization
Default values are too weak for any meaningful changes; these should probably be exposed as advanced node options when that's available.
* Add image model's own chat template, remove unused image2video template
* Remove hard error in ReplaceVideoLatentFrames -node
* Update kandinsky5.py
* Update supported_models.py
* Fix typos in prompt template
They have now been fixed in the original repository as well
* Update ReplaceVideoLatentFrames
Add tooltips
Make source optional
Better handle negative index
* Rename NormalizeVideoLatentFrames -node
For a bit better clarity about what it does
* Fix NormalizeVideoLatentStart node output on no-op
* hunyuan upsampler: rework imports
Remove the transitive imports of VideoConv3d and Resnet and take these
from the actual implementation source.
* model: remove unused give_pre_end
According to git grep, this is not used now, and was not used in the
initial commit that introduced it (see below).
This semantic is difficult to support in a temporal roll VAE (and would
defeat the purpose). Rather than implement the complex conditional, just
delete the unused feature (a toy paraphrase of the removed branch follows
the grep output below).
(venv) rattus@rattus-box2:~/ComfyUI$ git log --oneline
220afe33 (HEAD) Initial commit.
(venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:
(venv) rattus@rattus-box2:~/ComfyUI$ git co origin/master
Previous HEAD position was 220afe33 Initial commit.
HEAD is now at 9d8a8179 Enable async offloading by default on Nvidia. (#10953)
(venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:
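A toy paraphrase of the branch being deleted (a minimal stand-in for the
Decoder in model.py, not the real class):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    def __init__(self, give_pre_end: bool = False):
        super().__init__()
        self.give_pre_end = give_pre_end
        self.body = nn.Conv2d(4, 4, 3, padding=1)    # stands in for conv_in/mid/up
        self.norm_out = nn.GroupNorm(1, 4)
        self.conv_out = nn.Conv2d(4, 3, 3, padding=1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.body(z)
        if self.give_pre_end:    # the unused branch: return features before
            return h             # the final norm + conv_out
        return self.conv_out(F.silu(self.norm_out(h)))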
* move refiner VAE temporal roller to core
Move the carrying conv op to the common VAE code and give it a better
name. Roll the carry implementation logic for Resnet into the base
class and scrap the Hunyuan-specific subclass.
* model: Add temporal roll to main VAE decoder
If there are no attention layers (i.e. it is a standard ResNet) and VideoConv3d
is asked for, substitute in the temporal rolling VAE algorithm. This reduces
VAE memory usage by the temporal dimension (which can be a huge VRAM saving).
* model: Add temporal roll to main VAE encoder
If there are no attention layers (i.e. it is a standard ResNet) and VideoConv3d
is asked for, substitute in the temporal rolling VAE algorithm. This reduces
VAE memory usage by the temporal dimension (which can be a huge VRAM saving);
see the sketch below.
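A hedged sketch of the substitution rule described in these two commits (the
class names are trivial stand-ins, not the real VideoConv3d or the rolling
implementation):

import torch.nn as nn

class VideoConv3dStub(nn.Conv3d):               # stand-in for the real VideoConv3d
    pass

class RollingVideoConv3dStub(VideoConv3dStub):  # stand-in for the rolling variant
    pass

def choose_conv_op(requested_op, attn_resolutions):
    """Only swap in the rolling conv when the stack is a plain ResNet: the
    caller asked for VideoConv3d and there are no attention resolutions, so no
    layer ever needs the full temporal extent in memory at once."""
    if requested_op is VideoConv3dStub and not attn_resolutions:
        return RollingVideoConv3dStub
    return requested_op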
These are not actual controlnets, so put them in the models/model_patches
folder and use the ModelPatchLoader + QwenImageDiffsynthControlnet nodes to
use them.
* init
* update
* Update model.py
* Update model.py
* remove print
* Fix text encoding
* Prevent empty negative prompt
Really doesn't work otherwise
* fp16 works
* I2V
* Update model_base.py
* Update nodes_hunyuan.py
* Better latent rgb factors
* Use the correct sigclip output...
* Support HunyuanVideo1.5 SR model
* whitespaces...
* Proper latent channel count
* SR model fixes
This also still needs timestep scheduling based on the noise scale; it can already be used with two samplers.
* vae_refiner: roll the convolution through temporal
Work in progress.
Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.
* Support HunyuanVideo15 latent resampler
* fix
* Some cleanup
Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
* Proper hyvid15 I2V channels
Co-Authored-By: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
* Fix TokenRefiner for fp16
Otherwise x.sum has infs. Just in case, only casting when the input is fp16; I don't know if that is necessary.
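A hedged sketch of that kind of fix (the masked-mean pooling and names are
illustrative, not the actual TokenRefiner code):

import torch

def fp16_safe_masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # x: (B, L, D), mask: (B, L). Sum in fp32 when x is fp16 so the reduction
    # cannot overflow to inf, then cast the pooled result back.
    m = mask.to(torch.float32).unsqueeze(-1)
    if x.dtype == torch.float16:
        pooled = (x.float() * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return pooled.to(x.dtype)
    m = m.to(x.dtype)
    return (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)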
* Bugfix for the HunyuanVideo15 SR model
* vae_refiner: roll the convolution through temporal II
Roll the convolution through time using 2-latent-frame chunks and a
FIFO queue for the convolution seams.
Added support for the encoder, lowered to 1 latent frame to save more
VRAM, and made it work for Hunyuan Image 3.0 (as the code is shared).
Fixed names, cleaned up code.
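A minimal sketch of the FIFO seam idea from these two commits (invented
class, first chunk padded by replicating the first frame; not the actual
refiner code):

import torch
import torch.nn as nn

class RollingConv3d(nn.Module):
    """Temporal conv fed chunk by chunk; the last kernel_t - 1 input frames
    are kept in a FIFO so the next chunk's convolution is seamless."""
    def __init__(self, c_in: int, c_out: int, kernel_t: int = 3):
        super().__init__()
        self.kernel_t = kernel_t
        self.conv = nn.Conv3d(c_in, c_out, (kernel_t, 3, 3), padding=(0, 1, 1))
        self.seam = None  # cached trailing input frames, carried between chunks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T_chunk, H, W)
        if self.seam is None:
            pad = x[:, :, :1].expand(-1, -1, self.kernel_t - 1, -1, -1)
        else:
            pad = self.seam
        x = torch.cat([pad, x], dim=2)
        self.seam = x[:, :, -(self.kernel_t - 1):].detach()
        return self.conv(x)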
* Allow any number of input frames in VAE.
* Better VAE encode mem estimation.
* Lowvram fix.
* Fix hunyuan image 2.1 refiner.
* Fix mistake.
* Name changes.
* Rename.
* Whitespace.
* Fix.
* Fix.
---------
Co-authored-by: kijai <40791699+kijai@users.noreply.github.com>
Co-authored-by: Rattus <rattus128@gmail.com>