* feat: add CI container version bump automation
Adds a workflow that triggers on releases to create PRs in the
comfyui-ci-container repo, updating the ComfyUI version in the Dockerfile.
Supports both release events and manual workflow dispatch for testing.
* ci: update CI container repository owner
* refactor: rename `update-ci-container.yaml` workflow to `update-ci-container.yml`
* Remove post-merge instructions from the CI container update workflow.
* api nodes: price badges moved to nodes code
* added price badges for 4 more node-packs
* added price badges for 10 more node-packs
* added new price badges for Omni STD mode
* add support for autogrow groups
* use full names for "widgets", "inputs" and "groups"
* add strict typing for JSONata rules
* add price badge for WanReferenceVideoApi node
* add support for DynamicCombo
* sync price badges changes (https://github.com/Comfy-Org/ComfyUI_frontend/pull/7900)
* sync badges for Vidu2 nodes
* fixed incorrect price for RecraftCrispUpscaleNode
* fixed incorrect price badges for LTXV nodes
* fixed price badge for MinimaxHailuoVideoNode
* fixed price badges for PixVerse nodes
This is needed for aimdo, where the cache can't self-recover from
fragmentation. It is a good thing to do after an OOM anyway, so make it
unconditional.
Add the optional command line switch --fast dynamic_vram.
This is mutually exclusive with --high-vram and --gpu-only, which contradict
aimdo's underlying feature.
Add an appropriate installation warning and a startup message, and match the
comfy debug level when configuring aimdo.
Add the comfy-aimdo pip requirement. This will safely stub to a no-op on
unsupported platforms.
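A rough sketch of how the flag validation could be wired, assuming argparse-style options; the option names, the `comfy_aimdo` module name and the check itself are illustrative, not the actual ComfyUI cli_args code:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--fast", nargs="*", default=[],
                    help="Enable optional performance features, e.g. dynamic_vram.")
parser.add_argument("--high-vram", action="store_true")
parser.add_argument("--gpu-only", action="store_true")

args = parser.parse_args()

if "dynamic_vram" in args.fast:
    # dynamic_vram hands VRAM budgeting to aimdo, which contradicts the
    # "keep everything resident" semantics of --high-vram / --gpu-only.
    if args.high_vram or args.gpu_only:
        parser.error("--fast dynamic_vram cannot be combined with --high-vram or --gpu-only")
    try:
        import comfy_aimdo  # assumed module name; stubs to a no-op on unsupported platforms
        logging.info("dynamic_vram enabled via comfy-aimdo")
    except ImportError:
        logging.warning("comfy-aimdo is not installed; dynamic_vram disabled")
```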
We need general PyTorch cache defragmentation at an appropriate level for
aimdo. Do it here on a per-node basis, which has a reasonable chance of
purging stale shapes out of the PyTorch caching allocator and saving VRAM
without costing too much garbage collector thrash.
This looks like a lot of GC, but aimdo never fails an allocation from PyTorch
and saves the PyTorch allocator from ever needing to defragment on demand; it
just needs an oil change every now and then, so we do it here. Doing it here
also means the PyTorch temporaries are cleared from Task Manager VRAM usage,
so user anxiety can go down a little when they see their VRAM drop back at the
end of a workflow in line with inference usage (rather than assuming full VRAM
leaks).
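As a sketch, the per-node defrag amounts to something like the following, run by the executor after each node finishes; the hook name is hypothetical and the torch calls are the standard cache-purge API:

```python
import gc
import torch

def defrag_after_node(node_id):
    # Per-node "oil change": drop Python-side temporaries, then return cached
    # blocks with now-stale shapes from the PyTorch caching allocator to the
    # driver. This is also what makes Task Manager's VRAM reading fall back in
    # line with real inference usage at the end of a workflow.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```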
Use CoreModelPatcher for all internal ModelPatcher implementations. This drives
conditional use of the aimdo feature, while making sure custom node packs get
to keep ModelPatcher unchanged for the moment.
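A sketch of the split, with the internal `CoreModelPatcher` carrying the aimdo opt-in while the public `ModelPatcher` surface stays as-is; the constructor arguments and factory are assumptions:

```python
class ModelPatcher:
    """Public class that custom node packs subclass/instantiate; unchanged for now."""
    def __init__(self, model, load_device, offload_device):
        self.model = model
        self.load_device = load_device
        self.offload_device = offload_device


class CoreModelPatcher(ModelPatcher):
    """Internal base used by ComfyUI itself; this is where the aimdo
    (dynamic VRAM) behavior can be switched on without touching the
    public ModelPatcher contract."""
    def __init__(self, model, load_device, offload_device, dynamic_vram=False):
        super().__init__(model, load_device, offload_device)
        self.dynamic_vram = dynamic_vram


def make_internal_patcher(model, load_device, offload_device, args):
    # Hypothetical factory: internal call sites construct CoreModelPatcher,
    # and only there is the aimdo path enabled.
    return CoreModelPatcher(model, load_device, offload_device,
                            dynamic_vram="dynamic_vram" in getattr(args, "fast", []))
```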
Implement a model patcher and caster for aimdo.
A new ModelPatcher implementation which backs onto comfy-aimdo to implement varying model load levels that can be adjusted during model use. The patcher defers all load processes, lazily loading the model during use (e.g. the first step of a ksampler), and automatically negotiates a load level during inference to maximize VRAM usage without OOMing. If inference requires more VRAM than is available, weights are offloaded to make space before the OOM happens.
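A heavily simplified sketch of the lazy-load and load-level negotiation; the real negotiation lives in comfy-aimdo, so everything below illustrates the idea rather than its API:

```python
import torch

class DynamicModelPatcher:
    """Sketch: defer loading until first use, then negotiate a load level."""

    def __init__(self, model, load_device):
        self.model = model
        self.load_device = load_device
        self.load_level = 0.0       # fraction of weights resident in VRAM
        self._loaded = False

    def ensure_loaded(self):
        # Called from the first sampling step rather than ahead of time.
        if self._loaded:
            return
        free, _total = torch.cuda.mem_get_info(self.load_device)
        budget = int(free * 0.9)    # leave headroom for activations
        self.load_level = self._negotiate(budget)
        self._loaded = True

    def _negotiate(self, budget):
        # Walk weights in priority order until the budget is exhausted;
        # anything past that point stays offloaded and is streamed on use.
        used, level = 0, 0.0
        params = list(self.model.parameters())
        for i, p in enumerate(params):
            nbytes = p.numel() * p.element_size()
            if used + nbytes > budget:
                break
            used += nbytes
            level = (i + 1) / len(params)
        return level
```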
As for loading the weight onto the GPU, that happens via comfy_cast_weights, which is now used in all cases. cast_bias_weight checks whether the VBAR assigned to the model has space for the weight (based on the same load priority semantics as the original ModelPatcher). If it does, the VRAM returned by the aimdo allocator is used as the GPU-side parameter. The caster is responsible for populating the weight data. This is done using the usual offload_stream (which means we now have asynchronous loads overlapping first-use compute).
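A sketch of the casting path; `cast_bias_weight` and `comfy_cast_weights` are the real entry points named above, while `vbar.try_reserve()` stands in for whatever the aimdo VBAR actually exposes:

```python
import torch

def cast_bias_weight_sketch(layer, vbar, offload_stream, device, dtype):
    # Ask the model's VBAR whether this weight fits under the current load
    # level (same load-priority semantics as the original ModelPatcher).
    slot = vbar.try_reserve(layer.weight)  # hypothetical: GPU tensor with the
                                           # weight's shape/dtype, or None
    if slot is None:
        # No VRAM budget for this weight: cast on the fly, it stays offloaded.
        return layer.weight.to(device=device, dtype=dtype, non_blocking=True)

    # The VRAM came from the aimdo allocator and becomes the GPU-side
    # parameter; the caster fills it on the offload stream so the copy
    # overlaps first-use compute.
    with torch.cuda.stream(offload_stream):
        slot.copy_(layer.weight, non_blocking=True)  # copy_ also casts dtype
    torch.cuda.current_stream().wait_stream(offload_stream)
    return slot
```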
Pinning works a little differently. When a weight is detected during load as unable to fit, a pin is allocated at the time of casting and the weight as used by the layer is DMAd back to the pin using the GPU DMA TX engine, also using the asynchronous offload streams. This means you get to pin the Lora-modified and requantized weights, which can be a major speedup for offload+quantize+lora use cases. This works around the JIT Lora + FP8 exclusion and brings FP8MM to heavy offloading users (who probably really need it with more modest GPUs). There is a performance risk in that a CPU+RAM patch has been replaced with a GPU+RAM patch, but my initial performance results look good. Most users are likely to have a GPU that outruns their CPU in these woods.
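A minimal sketch of the pin-back step, assuming the patched/requantized weight is already resident on the GPU; only the torch pinned-allocation and async copy calls are real API:

```python
import torch

def pin_back_patched_weight(weight_gpu: torch.Tensor, offload_stream: torch.cuda.Stream):
    # Allocate the pin at cast time and DMA the already-patched (lora applied,
    # requantized) GPU weight back to it via the GPU's DMA/TX engine, so the
    # next load can reuse the patched bytes instead of re-patching on the CPU.
    pinned = torch.empty(weight_gpu.shape, dtype=weight_gpu.dtype,
                         device="cpu", pin_memory=True)
    with torch.cuda.stream(offload_stream):
        pinned.copy_(weight_gpu, non_blocking=True)
    # The caller must keep `pinned` alive and synchronize the stream before
    # the GPU copy of the weight is freed or reused.
    return pinned
```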
Some common code is written to consolidate a layer's tensors for aimdo mapping, pinning, and DMA transfers. interpret_gathered_like() allows unpacking a raw buffer as a set of tensors. This is used consistently to bundle and pack weights, quantization metadata (QuantizedTensor bits) and biases into one payload for DMA in the load process, reducing CUDA overhead a little. Some quantization metadata was missing async offload in some cases, which is now added. This also pins quantization metadata and consolidates the number of cuda_host_register calls (which can be expensive).
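A sketch of what `interpret_gathered_like()` could look like, assuming the packing side uses the same 16-byte alignment convention; the real signature may differ:

```python
import torch

def _align(n, a=16):
    return (n + a - 1) // a * a

def interpret_gathered_like(templates, buffer: torch.Tensor):
    """Sketch: view one flat uint8 buffer as a list of tensors shaped/typed
    like `templates` (weight, quantization metadata, bias, ...), so the whole
    layer can move with a single DMA transfer and a single pinned allocation."""
    assert buffer.dtype == torch.uint8
    out, offset = [], 0
    for t in templates:
        nbytes = t.numel() * t.element_size()
        chunk = buffer[offset:offset + nbytes]
        # 16-byte aligned offsets keep the dtype reinterpretation valid.
        out.append(chunk.view(t.dtype).reshape(t.shape))
        offset = _align(offset + nbytes)
    return out
```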
Add two API expansions: a flag for whether a model patcher is dynamic
and a very basic RAM freeing system.
Implement the semantics of the dynamic model patcher, which never frees
VRAM ahead of time for the sake of another dynamic model patcher.
At the same time add an API for clearing out pins based on a reservation of
2x the model size, as pins consume RAM in their own right in the
dynamic patcher.
This is actually less about OOMing RAM and more about performance: with
assign=True load semantics there needs to be plenty of headroom for the OS to
load models into the disk cache on demand, so err on the side of kicking old
pins out.
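A sketch of the pin-eviction heuristic; the registry and function name are hypothetical, and the 2x-model-size reservation is the heuristic described above:

```python
import torch

# Hypothetical registry of pinned buffers held by dynamic patchers,
# oldest first.
_pinned_buffers = []

def free_pins_for_model(model_size_bytes: int) -> int:
    # Reserve roughly 2x the incoming model's size: pins consume RAM in their
    # own right, and with assign=True loads the OS also needs headroom to pull
    # the model into the disk cache, so err on the side of kicking old pins out.
    target = 2 * model_size_bytes
    freed = 0
    while _pinned_buffers and freed < target:
        buf = _pinned_buffers.pop(0)           # evict the oldest pin first
        freed += buf.numel() * buf.element_size()
        del buf                                # unpins/frees once unreferenced
    return freed
```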
Add a Python helper for managing pinned memory at the weight/bias module level.
This allocates, pins and attaches a tensor to a module as the pin for that
module. It does not set the weight, just allocates a single RAM buffer
for population and bulk DMA transfer.
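A sketch of the module-level pin helper; the class name, the `_aimdo_pin` attachment point and the packed payload layout are assumptions:

```python
import torch

class ModulePin:
    """Sketch: one pinned host buffer per module, sized for its weight and
    bias, for population and a single bulk DMA transfer. It does not set
    module.weight; it only owns the backing RAM."""

    def __init__(self, module: torch.nn.Module):
        nbytes = sum(p.numel() * p.element_size()
                     for p in (module.weight, getattr(module, "bias", None))
                     if p is not None)
        # One allocation, i.e. one host-register under the hood, instead of
        # one per tensor.
        self.buffer = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
        module._aimdo_pin = self  # hypothetical attachment point

    def copy_from_gpu(self, packed_gpu: torch.Tensor, stream: torch.cuda.Stream):
        # Bulk DMA of the packed uint8 weight+bias payload back to the pin.
        with torch.cuda.stream(stream):
            self.buffer[:packed_gpu.numel()].copy_(packed_gpu, non_blocking=True)
```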
Get the model saving logic away from force_patch_weights and instead do
the patching JIT during safetensors saving.
First, switch off force_patch_weights in the load-for-save, which avoids
creating CPU-side tensors with loras calculated.
Then at save time, wrap the tensor to catch safetensors' call to .to() and
patch it live.
This avoids ever having a lora-calculated copy of offloaded weights on the
CPU.
Also take advantage of the presence of the GPU when doing this Lora
calculation. The former force_patch_weights would just do everything on
the CPU. It's generally faster to go to the GPU and back, even if it's just
a Lora application.
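A sketch of the save-time wrapper; that safetensors reaches the weight through `.to()` is taken from the description above, everything else (class name, patch callback) is illustrative:

```python
import torch

class PatchOnSave:
    """Sketch: stand-in object handed to the save path instead of the raw
    offloaded weight. When safetensors-style code calls .to(...), the lora
    patch is applied on the GPU just-in-time and a plain CPU tensor comes
    back, so no lora-patched copy of the weight ever sits on the CPU ahead
    of time."""

    def __init__(self, base_weight: torch.Tensor, patch_fn, gpu_device="cuda"):
        self.base_weight = base_weight      # offloaded / unpatched weight
        self.patch_fn = patch_fn            # applies lora etc. to a GPU tensor
        self.gpu_device = gpu_device

    def to(self, *args, **kwargs):
        # Round-trip through the GPU: even for a simple lora application this
        # is generally faster than patching on the CPU.
        w = self.base_weight.to(self.gpu_device, non_blocking=True)
        w = self.patch_fn(w)
        return w.to(*args, **kwargs) if (args or kwargs) else w.cpu()

    @property
    def dtype(self):
        return self.base_weight.dtype

    @property
    def shape(self):
        return self.base_weight.shape
```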