slow down the CPU on model load to not run ahead. This fixes a VRAM on
flux 2 load.
I went to try and debug this with the memory trace pickles, which needs
--disable-cuda-malloc which made the bug go away. So I tried this
synchronize and it worked.
The has some very complex interactions with the cuda malloc async and
I dont have solid theory on this one yet.
Still debugging but this gets us over the OOM for the moment.
TIL that the WAN TE has a 2GB weight followed by 16MB as the next size
down. This means that team 8GB VRAM would fully offload the TE in async
offload mode as it just multiplied this giant size my the num streams.
Do the more complex logic of summing up the upcoming to-load weight
sizes to avoid triple counting this massive weight.
partial unload does the converse of recording the NS most recent
unloads as they go.
* mp: only count the offload cost of math once
This was previously bundling the combined weight storage and computation
cost
* ops: put all post async transfer compute on the main stream
Some models have massive weights that need either complex
dequantization or lora patching. Don't do these patchings on the offload
stream, instead do them on the main stream to syncrhonize the
potentially large vram spikes for these compute processes. This avoids
having to assume a worst case scenario of multiple offload streams
all spiking VRAM is parallel with whatever the main stream is doing.
* mm: default to 0 for NUM_STREAMS
Dont count the compute stream as an offload stream. This makes async
offload accounting easier.
* mm: remove 128MB minimum
This is from a previous offloading system requirement. Remove it to
make behaviour of the loader and partial unloader consistent.
* mp: order the module list by offload expense
Calculate an approximate offloading temporary VRAM cost to offload a
weight and primary order the module load list by that. In the simple
case this is just the same as the module weight, but with Loras, a
weight with a lora consumes considerably more VRAM to do the Lora
application on-the-fly.
This will slightly prioritize lora weights, but is really for
proper VRAM offload accounting.
* mp: Account for the VRAM cost of weight offloading
when checking the VRAM headroom, assume that the weight needs to be
offloaded, and only load if it has space for both the load and offload
* the number of streams.
As the weights are ordered from largest to smallest by offload cost
this is guaranteed to fit in VRAM (tm), as all weights that follow
will be smaller.
Make the partial unload aware of this system as well by saving the
budget for offload VRAM to the model state and accounting accordingly.
Its possible that partial unload increases the size of the largest
offloaded weights, and thus needs to unload a little bit more than
asked to accomodate the bigger temp buffers.
Honor the existing codes floor on model weight loading of 128MB by
having the patcher honor this separately withough regard to offloading.
Otherwise when MM specifies its 128MB minimum, MP will see the biggest
weights, and budget that 128MB to only offload buffer and load nothing
which isnt the intent of these minimums. The same clamp applies in
case of partial offload of the currently loading model.
The partial unloader path in model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that
if you do an upscale flow with a model patch on an existing model, it
will not apply your patchings.
Fix by delaying the partial_unload until after the uuid checks. This
is done by making partial_unload a model of partial_load where extra_mem
is -ve.
* execution: Roll the UI cache into the outputs
Currently the UI cache is parallel to the output cache with
expectations of being a content superset of the output cache.
At the same time the UI and output cache are maintained completely
seperately, making it awkward to free the output cache content without
changing the behaviour of the UI cache.
There are two actual users (getters) of the UI cache. The first is
the case of a direct content hit on the output cache when executing a
node. This case is very naturally handled by merging the UI and outputs
cache.
The second case is the history JSON generation at the end of the prompt.
This currently works by asking the cache for all_node_ids and then
pulling the cache contents for those nodes. all_node_ids is the nodes
of the dynamic prompt.
So fold the UI cache into the output cache. The current UI cache setter
now writes to a prompt-scope dict. When the output cache is set, just
get this value from the dict and tuple up with the outputs.
When generating the history, simply iterate prompt-scope dict.
This prepares support for more complex caching strategies (like RAM
pressure caching) where less than 1 workflow will be cached and it
will be desirable to keep the UI cache and output cache in sync.
* sd: Implement RAM getter for VAE
* model_patcher: Implement RAM getter for ModelPatcher
* sd: Implement RAM getter for CLIP
* Implement RAM Pressure cache
Implement a cache sensitive to RAM pressure. When RAM headroom drops
down below a certain threshold, evict RAM-expensive nodes from the
cache.
Models and tensors are measured directly for RAM usage. An OOM score
is then computed based on the RAM usage of the node.
Note the due to indirection through shared objects (like a model
patcher), multiple nodes can account the same RAM as their individual
usage. The intent is this will free chains of nodes particularly
model loaders and associate loras as they all score similar and are
sorted in close to each other.
Has a bias towards unloading model nodes mid flow while being able
to keep results like text encodings and VAE.
* execution: Convert the cache entry to NamedTuple
As commented in review.
Convert this to a named tuple and abstract away the tuple type
completely from graph.py.
Load the projector.safetensors file with the ModelPatchLoader node and use
the siglip_vision_patch14_384.safetensors "clip vision" model and the
USOStyleReferenceNode.
These are not real controlnets but actually a patch on the model so they
will be treated as such.
Put them in the models/model_patches/ folder.
Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.
* Add 'sigmas' to transformer_options so that downstream code can know about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
* Cleaned up hooks.py, refactored Hook.should_register and add_hook_patches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
* Refactor WrapperHook into TransformerOptionsHook, as there is no need to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
* Refactored HookGroup to also store a dictionary of hooks separated by hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
* In inner_sample, change "sigmas" to "sampler_sigmas" in transformer_options to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
* Refactored 'registered' to be HookGroup instead of a list of Hooks, made AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
* Made hook clone code sane, made clear ObjectPatchHook and SetInjectionsHook are not yet operational
* Fix performance of hooks when hooks are appended via Cond Pair Set Props nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
* Filter only registered hooks on self.conds in CFGGuider.sample
* Make hook_scope functional for TransformerOptionsHook
* removed 4 whitespace lines to satisfy Ruff,
* Add a get_injections function to ModelPatcher
* Made TransformerOptionsHook contribute to registered hooks properly, added some doc strings and removed a so-far unused variable
* Rename AddModelsHooks to AdditionalModelsHook, rename SetInjectionsHook to InjectionsHook (not yet implemented, but at least getting the naming figured out)
* Clean up a typehint