ComfyUI/comfy/ldm/wan
rattus128 e42682b24e
Reduce Peak WAN inference VRAM usage (#9898)
* flux: Do the xq and xk ropes one at a time

This was doing independent, interleaved tensor math on the q and k
tensors, keeping more intermediates alive in VRAM than necessary.
On a bad day it would run out of VRAM on the xk intermediates.

Do all of the q work first and then all of the k work, so torch can
garbage collect q's intermediates before k allocates its own.

This reduces peak VRAM usage for some WAN2.2 inferences (at least).
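
A minimal sketch of that ordering, assuming a rope variant with complex-valued
frequencies; the shapes and the function name are illustrative, not the actual
flux/ComfyUI code:

```python
import torch

def rope_one_at_a_time(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor):
    # Assumed shapes: xq/xk are (batch, heads, seq, head_dim),
    # freqs_cis is complex with shape (seq, head_dim // 2).

    # All of q first: its float32 temporaries die here ...
    q_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xq_out = torch.view_as_real(q_ * freqs_cis).flatten(-2).type_as(xq)
    del q_  # drop the last reference so the allocator can reuse this memory

    # ... before k allocates its own, so the two sets never coexist.
    k_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    xk_out = torch.view_as_real(k_ * freqs_cis).flatten(-2).type_as(xk)
    return xq_out, xk_out
```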

* wan: Optimize qkv intermediates on attention

As commented in the code. The former logic computed independent pieces of
QKV in parallel, which held more inference intermediates in VRAM and
spiked peak VRAM usage. Fully roping Q and garbage collecting its
intermediates before touching K reduces the peak inference VRAM usage.
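
A similar sketch for the attention side, with illustrative module and
parameter names (to_q, norm_q, rope, etc.) rather than the real WAN attention
code; the point is to finish each projection before starting the next:

```python
import torch
import torch.nn as nn

def qkv_one_at_a_time(x: torch.Tensor,
                      to_q: nn.Linear, to_k: nn.Linear, to_v: nn.Linear,
                      norm_q: nn.Module, norm_k: nn.Module,
                      rope, freqs: torch.Tensor):
    # `rope` is assumed to be a callable that applies rotary embeddings to a
    # single tensor (e.g. a q-only/k-only variant of the sketch above).

    # Project, normalize and rope q completely; its temporaries can be
    # reclaimed before any k tensor is allocated.
    q = rope(norm_q(to_q(x)), freqs)
    # Only now build k the same way, then v (which needs no norm or rope).
    k = rope(norm_k(to_k(x)), freqs)
    v = to_v(x)
    return q, k, v
```

Compared with computing all three projections up front and roping q and k
interleaved, only one set of intermediates is live at a time, which is where
the peak-VRAM saving comes from.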
2025-09-16 19:21:14 -04:00
model.py Reduce Peak WAN inference VRAM usage (#9898) 2025-09-16 19:21:14 -04:00
vae2_2.py Tiny wan vae optimizations. (#9136) 2025-08-01 05:25:38 -04:00
vae.py Tiny wan vae optimizations. (#9136) 2025-08-01 05:25:38 -04:00