If this suffers an exception (such as a VRAM oom) it will leave the
encode() and decode() methods which skips the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
this object which is in turn reffing large tensors from the failed
execution.
The feature cache is currently setup at a class variable on the
encoder/decoder however, the encode and decode functions always clear
it on both entry and exit of normal execution.
Its likely the design intent is this is usable as a streaming encoder
where the input comes in batches, however the functions as they are
today don't support that.
So simplify by bringing the cache back to local variable, so that if
it does VRAM OOM the cache itself is properly garbage when the
encode()/decode() functions dissappear from the stack.
When the VAE catches this VRAM OOM, it launches the fallback logic
straight from the exception context.
Python however refs the entire call stack that caused the exception
including any local variables for the sake of exception report and
debugging. In the case of tensors, this can hold on the references
to GBs of VRAM and inhibit the VRAM allocated from freeing them.
So dump the except context completely before going back to the VAE
via the tiler by getting out of the except block with nothing but
a flag.
The greately increases the reliability of the tiler fallback,
especially on low VRAM cards, as with the bug, if the leak randomly
leaked more than the headroom needed for a single tile, the tiler
would fallback would OOM and fail the flow.
* flux: math: Use _addcmul to avoid expensive VRAM intermediate
The rope process can be the VRAM peak and this intermediate
for the addition result before releasing the original can OOM.
addcmul_ it.
* wan: Delete the self attention before cross attention
This saves VRAM when the cross attention and FFN are in play as the
VRAM peak.
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
linear memory consumption increase over time as models are repeatedly
loaded and unloaded.
This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
* flux: Do the xq and xk ropes one at a time
This was doing independendent interleaved tensor math on the q and k
tensors, leading to the holding of more than the minimum intermediates
in VRAM. On a bad day, it would VRAM OOM on xk intermediates.
Do everything q and then everything k, so torch can garbage collect
all of qs intermediates before k allocates its intermediates.
This reduces peak VRAM usage for some WAN2.2 inferences (at least).
* wan: Optimize qkv intermediates on attention
As commented. The former logic computed independent pieces of QKV in
parallel which help more inference intermediates in VRAM spiking
VRAM usage. Fully roping Q and garbage collecting the intermediates
before touching K reduces the peak inference VRAM usage.
* Initial Chroma Radiance support
* Minor Chroma Radiance cleanups
* Update Radiance nodes to ensure latents/images are on the intermediate device
* Fix Chroma Radiance memory estimation.
* Increase Chroma Radiance memory usage factor
* Increase Chroma Radiance memory usage factor once again
* Ensure images are multiples of 16 for Chroma Radiance
Add batch dimension and fix channels when necessary in ChromaRadianceImageToLatent node
* Tile Chroma Radiance NeRF to reduce memory consumption, update memory usage factor
* Update Radiance to support conv nerf final head type.
* Allow setting NeRF embedder dtype for Radiance
Bump Radiance nerf tile size to 32
Support EasyCache/LazyCache on Radiance (maybe)
* Add ChromaRadianceStubVAE node
* Crop Radiance image inputs to multiples of 16 instead of erroring to be in line with existing VAE behavior
* Convert Chroma Radiance nodes to V3 schema.
* Add ChromaRadianceOptions node and backend support.
Cleanups/refactoring to reduce code duplication with Chroma.
* Fix overriding the NeRF embedder dtype for Chroma Radiance
* Minor Chroma Radiance cleanups
* Move Chroma Radiance to its own directory in ldm
Minor code cleanups and tooltip improvements
* Fix Chroma Radiance embedder dtype overriding
* Remove Radiance dynamic nerf_embedder dtype override feature
* Unbork Radiance NeRF embedder init
* Remove Chroma Radiance image conversion and stub VAE nodes
Add a chroma_radiance option to the VAELoader builtin node which uses comfy.sd.PixelspaceConversionVAE
Add a PixelspaceConversionVAE to comfy.sd for converting BHWC 0..1 <-> BCHW -1..1
* Looking into a @wrap_attn decorator to look for 'optimized_attention_override' entry in transformer_options
* Created logging code for this branch so that it can be used to track down all the code paths where transformer_options would need to be added
* Fix memory usage issue with inspect
* Made WAN attention receive transformer_options, test node added to wan to test out attention override later
* Added **kwargs to all attention functions so transformer_options could potentially be passed through
* Make sure wrap_attn doesn't make itself recurse infinitely, attempt to load SageAttention and FlashAttention if not enabled so that they can be marked as available or not, create registry for available attention
* Turn off attention logging for now, make AttentionOverrideTestNode have a dropdown with available attention (this is a test node only)
* Make flux work with optimized_attention_override
* Add logs to verify optimized_attention_override is passed all the way into attention function
* Make Qwen work with optimized_attention_override
* Made hidream work with optimized_attention_override
* Made wan patches_replace work with optimized_attention_override
* Made SD3 work with optimized_attention_override
* Made HunyuanVideo work with optimized_attention_override
* Made Mochi work with optimized_attention_override
* Made LTX work with optimized_attention_override
* Made StableAudio work with optimized_attention_override
* Made optimized_attention_override work with ACE Step
* Made Hunyuan3D work with optimized_attention_override
* Make CosmosPredict2 work with optimized_attention_override
* Made CosmosVideo work with optimized_attention_override
* Made Omnigen 2 work with optimized_attention_override
* Made StableCascade work with optimized_attention_override
* Made AuraFlow work with optimized_attention_override
* Made Lumina work with optimized_attention_override
* Made Chroma work with optimized_attention_override
* Made SVD work with optimized_attention_override
* Fix WanI2VCrossAttention so that it expects to receive transformer_options
* Fixed Wan2.1 Fun Camera transformer_options passthrough
* Fixed WAN 2.1 VACE transformer_options passthrough
* Add optimized to get_attention_function
* Disable attention logs for now
* Remove attention logging code
* Remove _register_core_attention_functions, as we wouldn't want someone to call that, just in case
* Satisfy ruff
* Remove AttentionOverrideTest node, that's something to cook up for later
Load the projector.safetensors file with the ModelPatchLoader node and use
the siglip_vision_patch14_384.safetensors "clip vision" model and the
USOStyleReferenceNode.
* Attempting a universal implementation of EasyCache, starting with flux as test; I screwed up the math a bit, but when I set it just right it works.
* Fixed math to make threshold work as expected, refactored code to use EasyCacheHolder instead of a dict wrapped by object
* Use sigmas from transformer_options instead of timesteps to be compatible with a greater amount of models, make end_percent work
* Make log statement when not skipping useful, preparing for per-cond caching
* Added DIFFUSION_MODEL wrapper around forward function for wan model
* Add subsampling for heuristic inputs
* Add subsampling to output_prev (output_prev_subsampled now)
* Properly consider conds in EasyCache logic
* Created SuperEasyCache to test what happens if caching and reuse is moved outside the scope of conds, added PREDICT_NOISE wrapper to facilitate this test
* Change max reuse_threshold to 3.0
* Mark EasyCache/SuperEasyCache as experimental (beta)
* Make Lumina2 compatible with EasyCache
* Add EasyCache support for Qwen Image
* Fix missing comma, curse you Cursor
* Add EasyCache support to AceStep
* Add EasyCache support to Chroma
* Added EasyCache support to Cosmos Predict t2i
* Make EasyCache not crash with Cosmos Predict ImagToVideo latents, but does not work well at all
* Add EasyCache support to hidream
* Added EasyCache support to hunyuan video
* Added EasyCache support to hunyuan3d
* Added EasyCache support to LTXV (not very good, but does not crash)
* Implemented EasyCache for aura_flow
* Renamed SuperEasyCache to LazyCache, hardcoded subsample_factor to 8 on nodes
* Eatra logging when verbose is true for EasyCache
These are not real controlnets but actually a patch on the model so they
will be treated as such.
Put them in the models/model_patches/ folder.
Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.
* P2 of qwen edit model.
* Typo.
* Fix normal qwen.
* Fix.
* Make the TextEncodeQwenImageEdit also set the ref latent.
If you don't want it to set the ref latent and want to use the
ReferenceLatent node with your custom latent instead just disconnect the
VAE.
This node is only useful if someone trains the kontext model to properly
use multiple reference images via the index method.
The default is the offset method which feeds the multiple images like if
they were stitched together as one. This method works with the current
flux kontext model.
Turns out torch.compile has some gaps in context manager decorator
syntax support. I've sent patches to fix that in PyTorch, but it won't
be available for all the folks running older versions of PyTorch, hence
this trivial patch.
* Added initial support for basic context windows - in progress
* Add prepare_sampling wrapper for context window to more accurately estimate latent memory requirements, fixed merging wrappers/callbacks dicts in prepare_model_patcher
* Made context windows compatible with different dimensions; works for WAN, but results are bad
* Fix comfy.patcher_extension.merge_nested_dicts calls in prepare_model_patcher in sampler_helpers.py
* Considering adding some callbacks to context window code to allow extensions of behavior without the need to rewrite code
* Made dim slicing cleaner
* Add Wan Context WIndows node for testing
* Made context schedule and fuse method functions be stored on the handler instead of needing to be registered in core code to be found
* Moved some code around between node_context_windows.py and context_windows.py
* Change manual context window nodes names/ids
* Added callbacks to IndexListContexHandler
* Adjusted default values for context_length and context_overlap, made schema.inputs definition for WAN Context Windows less annoying
* Make get_resized_cond more robust for various dim sizes
* Fix typo
* Another small fix
* Change bf16 check and switch non-blocking to off default with option to force to regain speed on certain classes of iGPUs and refactor xpu check.
* Turn non_blocking off by default for xpu.
* Update README.md for Intel GPUs.
* Add factorization utils for lokr
* Add lokr train impl
* Add loha train impl
* Add adapter map for algo selection
* Add optional grad ckpt and algo selection
* Update __init__.py
* correct key name for loha
* Use custom fwd/bwd func and better init for loha
* Support gradient accumulation
* Fix bugs of loha
* use more stable init
* Add OFT training
* linting
* Support for async execution functions
This commit adds support for node execution functions defined as async. When
a node's execution function is defined as async, we can continue
executing other nodes while it is processing.
Standard uses of `await` should "just work", but people will still have
to be careful if they spawn actual threads. Because torch doesn't really
have async/await versions of functions, this won't particularly help
with most locally-executing nodes, but it does work for e.g. web
requests to other machines.
In addition to the execute function, the `VALIDATE_INPUTS` and
`check_lazy_status` functions can also be defined as async, though we'll
only resolve one node at a time right now for those.
* Add the execution model tests to CI
* Add a missing file
It looks like this got caught by .gitignore? There's probably a better
place to put it, but I'm not sure what that is.
* Add the websocket library for automated tests
* Add additional tests for async error cases
Also fixes one bug that was found when an async function throws an error
after being scheduled on a task.
* Add a feature flags message to reduce bandwidth
We now only send 1 preview message of the latest type the client can
support.
We'll add a console warning when the client fails to send a feature
flags message at some point in the future.
* Add async tests to CI
* Don't actually add new tests in this PR
Will do it in a separate PR
* Resolve unit test in GPU-less runner
* Just remove the tests that GHA can't handle
* Change line endings to UNIX-style
* Avoid loading model_management.py so early
Because model_management.py has a top-level `logging.info`, we have to
be careful not to import that file before we call `setup_logging`. If we
do, we end up having the default logging handler registered in addition
to our custom one.
* feat: “--whitelist-custom-nodes” args for comfy core to go with “--disable-all-custom-nodes” for development purposes
* feat: Simplify custom nodes whitelist logic to use consistent code paths
* support wan camera models
* fix by ruff check
* change camera_condition type; make camera_condition optional
* support camera trajectory nodes
* fix camera direction
---------
Co-authored-by: Qirui Sun <sunqr0667@126.com>
* [Luma] Print download URL of successful task result directly on nodes (#177)
[Veo] Print download URL of successful task result directly on nodes (#184)
[Recraft] Print download URL of successful task result directly on nodes (#183)
[Pixverse] Print download URL of successful task result directly on nodes (#182)
[Kling] Print download URL of successful task result directly on nodes (#181)
[MiniMax] Print progress text and download URL of successful task result directly on nodes (#179)
[Docs] Link to docs in `API_NODE` class property type annotation comment (#178)
[Ideogram] Print download URL of successful task result directly on nodes (#176)
[Kling] Print download URL of successful task result directly on nodes (#181)
[Veo] Print download URL of successful task result directly on nodes (#184)
[Recraft] Print download URL of successful task result directly on nodes (#183)
[Pixverse] Print download URL of successful task result directly on nodes (#182)
[MiniMax] Print progress text and download URL of successful task result directly on nodes (#179)
[Docs] Link to docs in `API_NODE` class property type annotation comment (#178)
[Luma] Print download URL of successful task result directly on nodes (#177)
[Ideogram] Print download URL of successful task result directly on nodes (#176)
Show output URL and progress text on Pika nodes (#168)
[BFL] Print download URL of successful task result directly on nodes (#175)
[OpenAI ] Print download URL of successful task result directly on nodes (#174)
* fix ruff errors
* fix 3.10 syntax error