* initial gemma4 support
* parity with reference implementation
outputs can 100% match transformers with same sdpa flags, checkpoint this and then optimize
* Cleanup, video fixes
* cleanup, enable fused rms norm by default
* update comment
* Cleanup
* Update sd.py
* Various fixes
* Add fp8 scaled embedding support
* small fixes
* Translate think tokens
* Fix image encoder attention mask type
So it works with basic attention
* Handle thinking tokens different only for Gemma4
* Code cleanup
* Update nodes_textgen.py
* Use embed scale class instead of buffer
Slight difference to HF, but technically more accurate and simpler code
* Default to fused rms_norm
* Update gemma4.py
* mm: Use Aimdo raw allocator for cast buffers
pytorch manages allocation of growing buffers on streams poorly. Pyt
has no windows support for the expandable segments allocator (which is
the right tool for this job), while also segmenting the memory by
stream such that it can be generally re-used. So kick the problem to
aimdo which can just grow a virtual region thats freed per stream.
* plan
* ops: move cpu handler up to the caller
* ops: split up prefetch from weight prep block prefetching API
Split up the casting and weight formating/lora stuff in prep for
arbitrary prefetch support.
* ops: implement block prefetching API
allow a model to construct a prefetch list and operate it for increased
async offload.
* ltxv2: Implement block prefetching
* Implement lora async offload
Implement async offload of loras.
There was an issue where the resample split was too early and dropped one
of the rolling convolutions a frame early. This is most noticable as a
lighting/color change between pixel frames 5->6 (latent 2->3), or as a
lighting change between the first and last frame in an FLF wan flow.
* sd: soft_empty_cache on tiler fallback
This doesnt cost a lot and creates the expected VRAM reduction in
resource monitors when you fallback to tiler.
* wan: vae: Don't recursion in local fns (move run_up)
Moved Decoder3d’s recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.
* ltx: vae: Don't recursion in local fns (move run_up)
Mov the recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.
* ltx: vae: add cache state to downsample block
* ltx: vae: Add time stride awareness to causal_conv_3d
* ltx: vae: Automate truncation for encoder
Other VAEs just truncate without error. Do the same.
* sd/ltx: Make chunked_io a flag in its own right
Taking this bi-direcitonal, so make it a for-purpose named flag.
* ltx: vae: implement chunked encoder + CPU IO chunking
People are doing things with big frame counts in LTX including V2V
flows. Implement the time-chunked encoder to keep the VRAM down, with
the converse of the new CPU pre-allocation technique, where the chunks
are brought from the CPU JIT.
* ltx: vae-encode: round chunk sizes more strictly
Only powers of 2 and multiple of 8 are valid due to cache slicing.
* wan: vae: encoder: Add feature cache layer that corks singles
If a downsample only gives you a single frame, save it to the feature
cache and return nothing to the top level. This increases the
efficiency of cacheability, but also prepares support for going two
by two rather than four by four on the frames.
* wan: remove all concatentation with the feature cache
The loopers are now responsible for ensuring that non-final frames are
processes at least two-by-two, elimiating the need for this cat case.
* wan: vae: recurse and chunk for 2+2 frames on decode
Avoid having to clone off slices of 4 frame chunks and reduce the size
of the big 6 frame convolutions down to 4. Save the VRAMs.
* wan: encode frames 2x2.
Reduce VRAM usage greatly by encoding frames 2 at a time rather than
4.
* wan: vae: remove cloning
The loopers now control the chunking such there is noever more than 2
frames, so just cache these slices directly and avoid the clone
allocations completely.
* wan: vae: free consumer caller tensors on recursion
* wan: vae: restyle a little to match LTX
* ltx: vae: scale the chunk size with the users VRAM
Scale this linearly down for users with low VRAM.
* ltx: vae: free non-chunking recursive intermediates
* ltx: vae: cleanup some intermediates
The conv layer can be the VRAM peak and it does a torch.cat. So cleanup
the pieces of the cat. Also clear our the cache ASAP as each layer detect
its end as this VAE surges in VRAM at the end due to the ended padding
increasing the size of the final frame convolutions off-the-books to
the chunker. So if all the earlier layers free up their cache it can
offset that surge.
Its a fragmentation nightmare, and the chance of it having to recache the
pyt allocator is very high, but you wont OOM.