Two doc-only changes addressing minor CodeRabbit findings on PR #7063:
* cli_args.py: clarify --cuda-device help text to document the required comma-separated format ('0' or '0,1'), matching how the value is consumed by CUDA_VISIBLE_DEVICES in main.py.
* nodes_multigpu.py: add a docstring NOTE on the (currently unregistered) MultiGPUOptionsNode explaining that its relative_speed input is plumbed through to model_options['multigpu_options'] but is not yet consulted by the cond scheduler, which still uses uniform round-robin via next_available_device(). Wire relative_speed into the scheduler before re-enabling the node.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
The multigpu cond-batching loop called model.memory_required(input_shape) without conditioning shapes, while the single-GPU path at line 279 passes cond_shapes. Large conditioning tensors (e.g. video prompts, control inputs) were therefore under-counted, risking OOM at runtime when the chosen batch size was too large. Match the single-GPU pattern by building cond_shapes from each batched cond's conditioning dict and passing it to memory_required.
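A minimal sketch of the cond_shapes aggregation described above, using plain tuples in place of tensor sizes. The memory_required stand-in below and its cost formula are hypothetical, as are the conditioning keys; only the idea of passing conditioning shapes alongside input_shape comes from this commit.

```python
from collections import defaultdict
from math import prod


def build_cond_shapes(batched_conditioning):
    """Collect the shape of every conditioning entry, grouped by key."""
    cond_shapes = defaultdict(list)
    for conditioning in batched_conditioning:
        for key, shape in conditioning.items():
            cond_shapes[key].append(tuple(shape))
    return dict(cond_shapes)


def memory_required(input_shape, cond_shapes=None, bytes_per_element=2):
    """Hypothetical estimator: counts elements of the input and of all conds."""
    total = prod(input_shape)
    for shapes in (cond_shapes or {}).values():
        total += sum(prod(shape) for shape in shapes)
    return total * bytes_per_element


input_shape = (2, 4, 64, 64)
batched = [
    {"c_crossattn": (1, 77, 768)},
    {"c_crossattn": (1, 77, 768), "control": (1, 3, 512, 512)},
]
cond_shapes = build_cond_shapes(batched)
without_conds = memory_required(input_shape)
with_conds = memory_required(input_shape, cond_shapes=cond_shapes)
```

With the conditioning shapes included, the estimate grows with the size of the conds, which is the under-count the commit addresses.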
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
* model_management: disable non-dynamic smart memory
Disable smart memory outright for non-dynamic models.
This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.
This is needed for estimate-free model development, where new models
can opt-out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers.
This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.
* pinned_memory: implement with aimdo growable buffer
Use a single growable buffer so we can do threaded pre-warming on
pinned memory.
* mm: use aimdo to do transfer from disk to pin
Aimdo implements a faster threaded loader.
* Add stream host pin buffer for AIMDO casts
Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.
Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).
* remove old pin path
* Implement JIT pinned memory pressure
Replace the predictive pin pressure mechanism with JIT PIN memory
pressure.
* LowVRAMPatch: change to two-phase visit
* lora: re-implement as inplace swiss-army-knife operation
* prepare for multiple pin sets
* implement pinned loras
* requirements: comfy-aimdo 0.4.0
* ops: remove unused arg
This was defeatured in an aimdo iteration.
* ops: sync the CPU with only the offload stream activity
This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.
* pins: implement freeing intermediate for pinned memory
Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.
* execution: implement pin eviction on RAM pressure
Add back proper pin freeing on RAM pressure.
* implement pin registration swaps
Uncap the Windows pins from 50% by extending the pool and have a pressure
mechanism to move the pin reservations on demand.
This unfortunately implies a GPU sync to do the freeing, so significant
hysteresis needs to be added to consolidate these pressure events.
* cli_args/execution: Implement lower background cache-ram threshold
Limit the amount of RAM background intermediates can use, so that
switching workflows doesn't degrade performance too much.
* make default
* bump aimdo
* model-patcher: force-cast tiny weights
Flux 2 gets crazy stalls due to a mix of tiny and giant weights
creating lopsided stream buffer rotations.
* ops: refactor in prep for chunking
* mm: delegate pin-on-the-way to aimdo
Aimdo is able to chunk and slice this on the way for better CPU->GPU
overlap. The main advantage is the ability to shorten the bus contention
window between previous weight transfer and the next weights vbar
fault.
* bump aimdo
* pinning updates
* specify hostbuf max allocation size
There are signs of virtual memory exhaustion on some Linux systems when
throwing 128GB for every little piece. Pass the actual size to save aimdo
from over-estimates.
* tests: update execution tests for caching
The default caching changed to ram-cache so update these tests
accordingly.
Remove the LRU 0 test as this also falls through to RAM cache.
create_multigpu_deepclones cloned the existing 'multigpu' additional_models list verbatim and never pruned entries beyond limit_extra_devices. If a workflow was previously prepared for more GPUs, reducing max_gpus would leave stale clones attached and eligible for later scheduling. Replace the TODO block with a real prune that keeps only clones whose load_device is either the model's load_device or in limit_extra_devices, and re-match clones if anything was removed.
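The prune rule can be sketched as follows. The Clone class and the prune_clones helper are hypothetical stand-ins, and devices are shown as plain strings; the actual code operates on model patcher clones and their load_device attributes.

```python
from dataclasses import dataclass


@dataclass
class Clone:
    """Hypothetical stand-in for a cloned model patcher."""
    load_device: str


def prune_clones(clones, model_device, limit_extra_devices):
    """Keep clones on the model's own device or on an allowed extra device."""
    allowed = {model_device, *limit_extra_devices}
    kept = [clone for clone in clones if clone.load_device in allowed]
    removed_any = len(kept) != len(clones)  # signals that a re-match is needed
    return kept, removed_any


clones = [Clone("cuda:1"), Clone("cuda:2"), Clone("cuda:3")]
kept, removed_any = prune_clones(clones, "cuda:0", ["cuda:1"])
```

Here a workflow prepared for three extra GPUs is reduced to one, so the clones on cuda:2 and cuda:3 are dropped and removed_any indicates that the remaining clones should be re-matched.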
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
torch.device(i) defaults to CUDA, so XPU/NPU branches were producing 'cuda:N' devices that don't match get_torch_device() output ('xpu:N'/'npu:N'). This caused devices.remove(get_torch_device()) to raise ValueError when exclude_current=True on non-NVIDIA hardware. Use explicit device strings, and guard the remove() with a membership check for safety.
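A minimal sketch of the two fixes, kept free of torch so it runs anywhere. The list_devices helper and its parameters are hypothetical; in the real code the resulting names would be turned into torch devices and compared against get_torch_device().

```python
def list_devices(backend, count, current, exclude_current=False):
    """Build explicit '<backend>:<index>' names instead of relying on a bare index."""
    devices = [f"{backend}:{i}" for i in range(count)]
    # Membership check: list.remove() raises ValueError if the item is absent.
    if exclude_current and current in devices:
        devices.remove(current)
    return devices


xpu_others = list_devices("xpu", 2, current="xpu:0", exclude_current=True)
mismatched = list_devices("xpu", 2, current="cuda:0", exclude_current=True)
```

The second call shows the guard: a current device that is not in the list leaves the list unchanged instead of raising.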
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
load_checkpoint_guess_config_clip_only() calls load_checkpoint_guess_config() with output_model=False, leaving out[0] as None. The subsequent unconditional assignment of cached_patcher_init crashed with AttributeError, breaking CLIP-only checkpoint loading entirely. Guard the assignment with a None check.
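The guard can be illustrated with a small sketch. The Patcher class, the attach_cached_init helper, and the shape of the init tuple are hypothetical; only the None check on out[0] reflects the fix.

```python
class Patcher:
    """Hypothetical stand-in for the model patcher object."""
    cached_patcher_init = None


def attach_cached_init(out, init_info):
    """Set cached_patcher_init only when a model was actually produced."""
    model = out[0]
    if model is not None:  # out[0] is None when output_model=False
        model.cached_patcher_init = init_info
    return out


init_info = ("loader", ("path",))
clip_only = attach_cached_init((None, "clip"), init_info)
full = attach_cached_init((Patcher(), "clip"), init_info)
```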
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
GPUOptionsGroup.clone() returns a new instance, but the return value was discarded, causing the node to mutate the upstream caller's group in-place. When multiple MultiGPU Options nodes share an input group, each node's additions would leak into earlier siblings. Assign the clone result back to gpu_options so each node owns its own copy.
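A small sketch of the aliasing problem and its fix. OptionsGroup is a hypothetical stand-in for GPUOptionsGroup, and the add() method and node function are assumptions made for illustration.

```python
class OptionsGroup:
    """Hypothetical stand-in for GPUOptionsGroup."""

    def __init__(self, options=None):
        self.options = dict(options or {})

    def clone(self):
        return OptionsGroup(self.options)

    def add(self, device_index, relative_speed):
        self.options[device_index] = relative_speed


def node_add_options(gpu_options, device_index, relative_speed):
    # Rebind to the clone; calling clone() without using its result
    # would leave gpu_options pointing at the caller's shared group.
    gpu_options = gpu_options.clone()
    gpu_options.add(device_index, relative_speed)
    return gpu_options


shared = OptionsGroup()
a = node_add_options(shared, 0, 1.0)
b = node_add_options(shared, 1, 2.0)
```

Both nodes receive the same input group, yet each output carries only its own addition and the shared input stays empty.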
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
Behaviour-equivalent cleanup of _calc_cond_batch_multigpu device
scheduling. No change to batching decisions or memory checks for any
valid input.
Changes:
* Replace re-summed batched_to_run_length with a per-device load
dict (device_load), so capacity checks are O(1) and use a single
source of truth.
* Extract device selection into next_available_device(), which scans
at most len(devices) positions and raises if no device has
remaining capacity. This makes the 'skip a full device' rule live
in one place instead of two and guarantees the outer while loop
cannot spin forever on a scheduling bug.
* Drop the unused current_device assignment before the outer loop
and the index_device % len(devices) modulo dance (now handled
inside next_available_device).
* Minor cleanups: list comprehensions for total_conds, conds_to_batch,
and the devices list.
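The device selection described in the changes above might look roughly like this. The signature, the capacity argument, and the returned start index are assumptions for illustration, not the actual helper in _calc_cond_batch_multigpu.

```python
def next_available_device(devices, device_load, capacity, start):
    """Return (device, next_start), scanning at most len(devices) positions."""
    for offset in range(len(devices)):
        index = (start + offset) % len(devices)
        device = devices[index]
        if device_load[device] < capacity:  # O(1) capacity check
            return device, (index + 1) % len(devices)
    raise RuntimeError("no device has remaining capacity")


devices = ["cuda:0", "cuda:1"]
device_load = {device: 0 for device in devices}
assignments = []
start = 0
for _ in range(3):
    device, start = next_available_device(devices, device_load, 2, start)
    device_load[device] += 1
    assignments.append(device)
```

Because the scan is bounded by len(devices) and raises when every device is full, a caller looping over remaining work cannot spin indefinitely.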
Fixes _calc_cond_batch_multigpu so that:
1. conds_per_device uses real division before math.ceil. The previous
expression math.ceil(total_conds // len(devices)) applied integer
floor division first, making ceil a no-op. For 3 conds across 2
devices this produced conds_per_device=1 instead of 2.
2. The scheduling loop skips devices that have already reached
capacity instead of appending empty batch groups. Without this
guard, the loop could repeatedly emit zero-length groups for a
full device, leaving sampling stuck at 0/N until timeout.
Reproduces with an Omnigen2 image workflow that produces three
condition entries scheduled across two CUDA devices. With the fix
the scheduler assigns conds_per_device=2 and splits the batches as
2 + 1 across the two devices, allowing sampling to complete.
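The arithmetic behind the first fix can be checked directly. The split_counts helper is a hypothetical illustration of how a per-device cap yields the 2 + 1 split; it is not the scheduler itself.

```python
import math

total_conds, num_devices = 3, 2

# Floor division runs first, so ceil receives an int and changes nothing.
buggy = math.ceil(total_conds // num_devices)
# True division keeps the fraction for ceil to round up.
fixed = math.ceil(total_conds / num_devices)


def split_counts(total, n):
    """Fill each device up to ceil(total / n) conds until none remain."""
    per_device = math.ceil(total / n)
    counts, remaining = [], total
    for _ in range(n):
        take = min(per_device, remaining)
        counts.append(take)
        remaining -= take
    return counts


split = split_counts(total_conds, num_devices)
```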
Original fix authored and validated by @pollockjj in
pollockjj/ComfyUI#64.
Co-authored-by: John Pollock <pollockjj@gmail.com>
Aligns the OSS spec with the cloud-side BE-1004 contract:
- createWorkspaceApiKey request body: add maxLength: 5000 to the
description property (matches cloud's hub_profile.description
MaxLen(5000) convention; enforced cloud-side via handler check).
- WorkspaceApiKey + WorkspaceApiKeyCreated response schemas:
mark description as required (cloud's handler always populates
the field, defaulting to empty string when not supplied on create),
drop nullable: true, add maxLength: 5000 for symmetry, and clarify
the doc string ("Always present in responses; empty string when no
description was supplied on create").
Both schemas are tagged x-runtime: [cloud] at the schema level so the
tightening is correctly scoped — OSS-only implementations are not
required to honor the workspace API keys endpoints at all.
Related cloud PR: Comfy-Org/cloud#3747
* feat(openapi): add optional description field to workspace API key schemas
Add an optional `description` property (type: string) to three
workspace API key schemas in openapi.yaml:
- Inline request body of createWorkspaceApiKey (POST /api/workspace/api-keys)
- WorkspaceApiKey (list/info schema)
- WorkspaceApiKeyCreated (creation response schema)
The field is not added to any `required` array, making it fully
backward-compatible with existing clients.
Refs: BE-1005, BE-1004
Co-authored-by: Matt Miller <mattmillerai@users.noreply.github.com>
* fix(openapi): mark description nullable in workspace API key response schemas
Per CodeRabbit review on PR #13993: the underlying DB column is nullable
varchar (default ''), so the response schemas should permit null to match
stored data reality. Without nullable: true the OpenAPI contract would
require coercion on the handler side or risk a contract violation.
Request schema unchanged — clients shouldn't be sending null on create.
These two fields were added recently to the Asset schema as nullable
integers, with the intent of exposing original image dimensions for FE
consumers (cloud-side thumbnailing makes naturalWidth/Height return
the wrong size for an image card's dimension label).
The implementation effort that consumes them subsequently converged on
a different shape — dimensions nested under the existing free-form
`metadata` JSON field as `{kind: "image", width, height}` — to avoid
introducing type-specific flat fields on the canonical Asset shape,
and to leave room for forward-compatible additions (video duration,
fps, etc.) without further schema churn.
This removes the now-unused top-level fields so the spec reflects the
agreed direction. No other schema definitions reference these fields
directly: AssetCreated, AssetUpdated, etc. inherit Asset via allOf and
do not redefine them.
The runtime ingest implementation that would have populated these
fields was not yet shipped, so no clients are relying on the
top-level shape.
Co-authored-by: Alexis Rolland <alexisrolland@hotmail.com>
Mark the uploadMask operation as deprecated and point clients at
/api/upload/image. The mask-compositing behavior the endpoint provides
(alpha-compositing the supplied mask onto an original_ref image) is now
expected to happen client-side, with the composited result uploaded
through the unified /api/upload/image path.
The endpoint continues to function for older clients; no runtime
behavior changes ship with this commit. Only the OpenAPI annotation
and the human-facing description are updated.
* Move detection category under image category
* Add missing categories
* Move detection nodes to detection category
* Move save nodes to image root category
* Rename postprocessors
* Move mask category under image
* Move guiders category to parent level at root of sampling category
* Move custom_sampling category to parent level at the root of sampling category
* Modify description of LoRA loaders
* Fix node id SolidMask
* Move VOID Quadmask under image/mask
* Group compositing nodes under image/compositing
* Move load image as mask to image category for consistency with other load image nodes
* Align display name with Load Checkpoint
* Move dataset category under training category
* Rename Number Convert to Convert Number (verb first)
* Rename Canny node
* Revert wanBlockSwap + description
* Add description to RemoveBackground node
* Revert category update of dataset