* Implement seek and read for pins
Source pins from an mmap is pad because its its a CPU->CPU copy that
attempts to fully buffer the same data twice. Instead, use seek and
read which avoids the mmap buffering while usually being a faster
read in the first place (avoiding mmap faulting etc).
* pinned_memory: Use Aimdo pinner
The aimdo pinner bypasses pytorches CPU allocator which can leak
windows commit charge.
* ops: bypass init() of weight for embedding layer
This similarly consumes large commit charge especially for TEs. It can
cause a permanement leaked commit charge which can destabilize on
systems close to the commit ceiling and generally confuses the RAM
stats.
* model_patcher: implement pinned memory counter
Implement a pinned memory counter for better accounting of what volume
of memory pins have.
* implement touch accounting
Implement accounting of touching mmapped tensors.
* mm+mp: add residency mmap getter
* utils: use the aimdo mmap to load sft files
* model_management: Implement tigher RAM pressure semantics
Implement a pressure release on entire MMAPs as windows does perform
faster when mmaps are unloaded and model loads free ramp into fully
unallocated RAM.
Make the concept of freeing for pins a completely separate concept.
Now that pins are loadable directly from original file and don' touch
the mmap, tighten the freeing budget to just the current loaded model
- what you have left over. This still over-frees pins, but its a lot
better than before.
So after the pins are freed with that algorithm, bounce entire MMAPs
to free RAM based on what the model needs, deducting off any known
resident-in-mmap tensors to the free quota to keep it as tight as
possible.
* comfy-aimdo 0.2.11
Comfy aimdo 0.2.11
* mm: Implement file_slice path for QT
* ruff
* ops: put meta-tensors in place to allow custom nodes to check geo
* fix: guard torch.AcceleratorError for compatibility with torch < 2.8.0
torch.AcceleratorError was introduced in PyTorch 2.8.0. Accessing it
directly raises AttributeError on older versions. Use a try/except
fallback at module load time, consistent with the existing pattern used
for OOM_EXCEPTION.
* fix: address review feedback for AcceleratorError compat
- Fall back to RuntimeError instead of type(None) for ACCELERATOR_ERROR,
consistent with OOM_EXCEPTION fallback pattern and valid for except clauses
- Add "out of memory" message introspection for RuntimeError fallback case
- Use RuntimeError directly in discard_cuda_async_error except clause
---------
Pytorch only filters for OOMs in its own allocators however there are
paths that can OOM on allocators made outside the pytorch allocators.
These manifest as an AllocatorError as pytorch does not have universal
error translation to its OOM type on exception. Handle it. A log I have
for this also shows a double report of the error async, so call the
async discarder to cleanup and make these OOMs look like OOMs.
Sync the compute stream before freeing the cast buffers. This can cause
use after free issues when the cast stream frees the buffer while the
compute stream is behind enough to still needs a casted weight.
* mp: respect model_defined_dtypes in default caster
This is needed for parametrizations when the dtype changes between sd
and model.
* audio_encoders: archive model dtypes
Archive model dtypes to stop the state dict load override the dtypes
defined by the core for compute etc.
* ops: dont unpin nothing
This was calling into aimdo in the none case (offloaded weight). Whats worse,
is aimdo syncs for unpinning an offloaded weight, as that is the corner case of
a weight getting evicted by its own use which does require a sync. But this
was heppening every offloaded weight causing slowdown.
* mp: fix get_free_memory policy
The ModelPatcherDynamic get_free_memory was deducting the model from
to try and estimate the conceptual free memory with doing any
offloading. This is kind of what the old memory_memory_required
was estimating in ModelPatcher load logic, however in practical
reality, between over-estimates and padding, the loader usually
underloaded models enough such that sampling could send CFG +/-
through together even when partially loaded.
So don't regress from the status quo and instead go all in on the
idea that offloading is less of an issue than debatching. Tell the
sampler it can use everything.
Define a threshold below which a weight loading takes priority. This
actually makes the offload consistent with non-dynamic, because what
happens, is when non-dynamic fills ints to_load list, it will fill-up
any left-over pieces that could fix large weights with small weights
and load them, even though they were lower priority. This actually
improves performance because the timy weights dont cost any VRAM and
arent worth the control overhead of the DMA etc.
* respect model dtype in non-comfy caster
* utils: factor out parent and name functionality of set_attr
* utils: implement set_attr_buffer for torch buffers
* ModelPatcherDynamic: Implement torch Buffer loading
If there is a buffer in dynamic - force load it.
* model_management: Remove non-comfy dynamic _v caster
* Force pre-load non-comfy weights to GPU in ModelPatcherDynamic
Non-comfy weights may expect to be pre-cast to the target
device without in-model casting. Previously they were allocated in
the vbar with _v which required the _v fault path in cast_to.
Instead, back up the original CPU weight and move it directly to GPU
at load time.
* draft zeta (z-image pixel space)
* revert gitignore
* model loaded and able to run however vector direction still wrong tho
* flip the vector direction to original again this time
* Move wrongly positioned Z image pixel space class
* inherit Radiance LatentFormat class
* Fix parameters in classes for Zeta x0 dino
* remove arbitrary nn.init instances
* Remove unused import of lru_cache
---------
Co-authored-by: silveroxides <ishimarukaito@gmail.com>
This was previously considering the pool of dynamic models as one giant
entity for the sake of smart memory, but that isnt really the useful
or what a user would reasonably expect. Make Dynamic VRAM properly purge
its models just like the old --disable-smart-memory but conditioning
the dynamic-for-dynamic bypass on smart memory.
Re-enable dynamic smart memory.
Multi-step samplers (eg. dpmpp_2s_ancestral) call the model at intermediate sigma values not present in the schedule. This caused set_step to crash with "No sample_sigmas matched current timestep" when context windows were enabled.
The fix is to keep self._step from the last exact match when a substep sigma is encountered, since substeps are still logically part of their parent step and should use the same context windows.
Co-authored-by: ozbayb <17261091+ozbayb@users.noreply.github.com>
* sd: add support for clip model reconstruction
* nodes: SetClipHooks: Demote the dynamic model patcher
* mp: Make dynamic_disable more robust
The backup need to not be cloned. In addition add a delegate object
to ModelPatcherDynamic so that non-cloning code can do
ModelPatcherDynamic demotion
* sampler_helpers: Demote to non-dynamic model patcher when hooking
* code rabbit review comments
Allow non QuantizedTensor layer to set want_requant to get the post lora
calculation stochastic cast down to the original input dtype.
This is then used by the legacy fp8 Linear implementation to set the
compute_dtype to the preferred lora dtype but then want_requant it back
down to fp8.
This fixes the issue with --fast fp8_matrix_mult is combined with
--fast dynamic_vram which doing a lora on an fp8_ non QT model.
Implements per-guide attention attenuation via log-space additive bias
in self-attention. Each guide reference tracks its own strength and
optional spatial mask in conditioning metadata (guide_attention_entries).