Brings in 18 commits from master so worksplit-multigpu does not regress
fixes that landed on main since the last sync:
- #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport;
conflict resolved in favor of the shape>=2 gate that binds
swap_cfg_halves once and reuses it for the output swap-back)
- #14031 ModelPatcherDynamic lora reshape / backup restore fix
- #13802 Multi-threaded model load (memory_management / pinned_memory /
model_management / aimdo plumbing)
- #12679 lanczos single-channel tensor fix
- #14010 Stable Audio 3 support
- assorted partner-node, openapi, workflow-template, and tooling updates
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
CrossAttention.kv.view and Attention.qkv_combined.view both hardcoded
batch=1 in the reshape, crashing or silently mis-shaping whenever the
actual batch dimension was greater than 1. These were fixed on master
in #13699 as part of the same patch that gated the chunk(2) swap, but
worksplit-multigpu only picked up the chunk(2) gate. Bring the two
view() fixes over so we have parity with master.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
The previous gate (len(cond_or_uncond) == 2 and set == {0, 1}) was
intended to skip the cond/uncond swap when only one half was present
under MultiGPU CFG Split, but it was too restrictive: it also skipped
batch_size > 1 + CFG (cond_or_uncond like [0, 0, 1, 1] or [0,0,0,0,
1,1,1,1]), where chunk(2) still splits the batch cleanly into a cond
half and an uncond half and the swap is still required.
Switch to context.shape[0] >= 2, matching the parallel fix landed on
master in #13699. The swap is a permutation-invariant no-op when the
two halves don't form a CFG pair (since the output swap_cfg_halves
block immediately undoes the permutation), so the only thing the gate
actually needs to do is guard against chunk(2) on a batch of one.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Per review feedback on #7063. The two functions share the conds-by-hooks
accumulation, memory-fit batching, and per-chunk output aggregation; the
multigpu variant adds per-device scheduling, .to(device) placement,
per-device patcher/control lookup, and thread-pool dispatch around the
inner loop. Documenting the relationship without extracting helpers --
extraction can land after the initial worksplit-multigpu release once
both paths have settled.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes model.diffusion_model into extra_args,
which the control_model then uses for forward passes (img_in, txt_in,
pe_embedder, time_text_embed). With multigpu, every per-device control
clone was being pre_run with the base model on GPU0, so secondary
devices would invoke those modules with parameters on GPU0 and inputs
on their own device, raising 'Expected all tensors to be on the same
device'. Build a device -> per-device BaseModel lookup from the
patcher's additional multigpu models and pass each clone the model on
its own device. Falls back to the base model when no per-device match
is found (single-GPU path and the case where cnet.multigpu_clones lags
the patcher's clone set).
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes the model's diffusion_model into
self.extra_args['base_model'], but ControlBase.cleanup never clears
extra_args. The diffusion_model reference therefore lingered between
sampling runs, blocking ComfyUI's model offload/eviction logic from
freeing the UNet and -- for multigpu -- holding one such reference per
per-device control clone (defeating the max_gpus pruning added in this
PR). Override cleanup to drop the entry; super().cleanup() already
recurses into multigpu_clones so each per-device clone pops its own.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Drop the new ignore_multigpu positional argument from prepare_state and
from the ON_PREPARE_STATE callbacks; pass the flag via model_options
instead. This restores the original 3-arg callback signature so existing
custom-node ON_PREPARE_STATE handlers keep working unchanged, while
still letting prepare_state's recursive call into multigpu_clones
short-circuit.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
* ModelPatcherDyanmic: purge stale vbar allocs on force cast
* ModelPatcherDynamic: restore backups before load
If doing a clean reload, mutative changes (lora application) could be
applied on-top of the already loaded weight. Restore from backup
unconditionally so that the new load is clean.
The job_ids query parameter on GET /api/assets is tagged x-runtime:
[cloud] and only exists for cloud's variant of this endpoint. Cloud
removed all consumers and the cloud-side handler/codegen/tests in
Comfy-Org/cloud#3778. With cloud no longer accepting this parameter,
the [cloud-only] documentation here is wrong — drop it so the daily
sync to cloud/services/ingest/vendor/openapi.yaml propagates the
removal.
The operation at POST /api/assets/import was defined as `importAssets`
with a URL-list body shape, but no runtime actually serves that
operation at this path. The cloud runtime serves a different operation
here — `importPublishedAssets` — which imports published-workflow
assets into the caller's library by ID, not by URL.
Cloud's URL-based asset ingestion lives at separate paths
(POST /assets/download + GET /assets/remote-metadata) tracked
elsewhere; nothing in this PR affects that work.
Changes:
- Replace the operation at POST /api/assets/import with
`importPublishedAssets`, taking ImportPublishedAssetsRequest
(published_asset_ids + optional share_id) and returning
ImportPublishedAssetsResponse (list of AssetInfo).
- Remove the unused AssetImportRequest component schema (no other
references in the spec).
- Operation and schemas tagged x-runtime: [cloud] with [cloud-only]
description prefix, matching the existing convention for
cloud-runtime-only operations elsewhere in the spec.
Spectral lint passes (0 errors); the two hint-level findings on
the spec are pre-existing and unrelated.
No FE consumer references AssetImportRequest today; this is a pure
spec correction to match what the cloud runtime actually serves.
Add the OAuth 2.1 authorization flow and RFC 7591 Dynamic Client
Registration endpoints to the shared spec, alongside the existing
auth-tagged operations (/api/auth/session, /api/auth/token,
/.well-known/jwks.json). All tagged x-runtime: [cloud] with a
[cloud-only] description prefix, following the established
convention for cloud-runtime-only operations.
Endpoints:
- GET /.well-known/oauth-authorization-server (RFC 8414 metadata)
- GET /.well-known/oauth-protected-resource (RFC 9728 metadata)
- GET /oauth/authorize (consent challenge)
- POST /oauth/authorize (consent submission)
- POST /oauth/token (RFC 6749 §3.2)
- POST /oauth/register (RFC 7591 §3.1 DCR)
Component schemas added:
- OAuthAuthorizationServerMetadata
- OAuthProtectedResourceMetadata
- OAuthConsentChallenge, OAuthConsentChallengeWorkspace
- OAuthAuthorizeRedirectResponse
- OAuthTokenResponse, OAuthTokenError
- OAuthRegisterRequest, OAuthRegisterResponse, OAuthRegisterError
These endpoints are implemented in the cloud runtime today and
are called by browser frontends rendering the consent UI and by
MCP-spec-compliant clients (Claude Desktop, Cursor, etc.) doing
auto-discovery + self-registration. Documenting them in the
shared spec lets the cloud frontend generate types directly from
this spec instead of maintaining a parallel definition.
Spectral lints clean (0 errors). The hint-level findings on
OAuthTokenError / OAuthRegisterError ("standard error schema")
match the same hint on CloudError — these are protocol-specific
RFC-shaped errors, not generic application errors.
Two doc-only changes addressing minor CodeRabbit findings on PR #7063:
* cli_args.py: clarify --cuda-device help text to document the required comma-separated format ('0' or '0,1'), matching how the value is consumed by CUDA_VISIBLE_DEVICES in main.py.
* nodes_multigpu.py: add a docstring NOTE on the (currently unregistered) MultiGPUOptionsNode explaining that its relative_speed input is plumbed through to model_options['multigpu_options'] but is not yet consulted by the cond scheduler, which still uses uniform round-robin via next_available_device(). Wire relative_speed into the scheduler before re-enabling the node.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
The multigpu cond-batching loop called model.memory_required(input_shape) without conditioning shapes, while the single-GPU path at line 279 passes cond_shapes. Large conditioning tensors (e.g. video prompts, control inputs) were therefore under-counted, risking OOM at runtime when the chosen batch size was too large. Match the single-GPU pattern by building cond_shapes from each batched cond's conditioning dict and passing it to memory_required.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
* model_management: disable non-dynamic smart memory
Disable smart memory outright for non dynamic models.
This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.
This is needed for estimate-free model development, where new models
can opt-out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers
This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.
* pinned_memory: implement with aimdo growable buffer
Use a single growable buffer so we can do threaded pre-warming on
pinned memory.
* mm: use aimdo to do transfer from disk to pin
Aimdo implements a faster threaded loader.
* Add stream host pin buffer for AIMDO casts
Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.
Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).
* remove old pin path
* Implement JIT pinned memory pressure
Replace the predictive pin pressure mechanism with JIT PIN memory
pressure.
* LowVRAMPatch: change to two-phase visit
* lora: re-implement as inplace swiss-army-knife operation
* prepare for multiple pin sets
* implement pinned loras
* requirements: comfy-aimdo 0.4.0
* ops: remove unused arg
This was defeatured in aimdo iteration
* ops: sync the CPU with only the offload stream activity
This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.
* pins: implement freeing intermediate for pinned memory
Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.
* execution: implement pin eviction on RAM presure
Add back proper pin freeing on RAM pressure
* implement pin registration swaps
Uncap the windows pins from 50% by extending the pool and have a pressure
mechanism to move the pin reservations om demand.
This unfortunately implies a GPU sync to do the freeing so significant
hysterisis needs to be added to consolidate these pressure events.
* cli_args/execution: Implement lower background cache-ram threshold
Limit the amount of RAM background intermediates can use, so that
switching workflows doesn't degrade performance too much.
* make default
* bump aimdo
* model-patcher: force-cast tiny weights
Flux 2 gets crazy stalls due to a mix of tiny and giant weights
creating lopsided steam buffer rotations which creates stalls.
* ops: refactor in prep for chunking
* mm: delegate pin-on-the-way to aimdo
Aimdo is able to chunk and slice this on the way for better CPU->GPU
overlap. The main advantage is the ability to shorten the bus contention
window between previous weight transfer and the next weights vbar
fault.
* bump aimdo
* pinning updates
* specify hostbuf max allocation size
There a signs of virtual memory exhaustion on some linux systems when
throwing 128GB for every little piece. Pass the actual to save aimdo
from over-estimates
* tests: update execution tests for caching
The default caching changed to ram-cache so update these tests
accordingly.
Remove the LRU 0 test as this also falls through to RAM cache.
create_multigpu_deepclones cloned the existing 'multigpu' additional_models list verbatim and never pruned entries beyond limit_extra_devices. If a workflow was previously prepared for more GPUs, reducing max_gpus would leave stale clones attached and eligible for later scheduling. Replace the TODO block with a real prune that keeps only clones whose load_device is either the model's load_device or in limit_extra_devices, and re-match clones if anything was removed.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
torch.device(i) defaults to CUDA, so XPU/NPU branches were producing 'cuda:N' devices that don't match get_torch_device() output ('xpu:N'/'npu:N'). This caused devices.remove(get_torch_device()) to raise ValueError when exclude_current=True on non-NVIDIA hardware. Use explicit device strings, and guard the remove() with a membership check for safety.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
load_checkpoint_guess_config_clip_only() calls load_checkpoint_guess_config() with output_model=False, leaving out[0] as None. The subsequent unconditional assignment of cached_patcher_init crashed with AttributeError, breaking CLIP-only checkpoint loading entirely. Guard the assignment with a None check.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
GPUOptionsGroup.clone() returns a new instance, but the return value was discarded, causing the node to mutate the upstream caller's group in-place. When multiple MultiGPU Options nodes share an input group, each node's additions would leak into earlier siblings. Assign the clone result back to gpu_options so each node owns its own copy.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
Behaviour-equivalent cleanup of _calc_cond_batch_multigpu device
scheduling. No change to batching decisions or memory checks for any
valid input.
Changes:
* Replace re-summed batched_to_run_length with a per-device load
dict (device_load), so capacity checks are O(1) and use a single
source of truth.
* Extract device selection into next_available_device(), which scans
at most len(devices) positions and raises if no device has
remaining capacity. This makes the 'skip a full device' rule live
in one place instead of two and guarantees the outer while loop
cannot spin forever on a scheduling bug.
* Drop the unused current_device assignment before the outer loop
and the index_device % len(devices) modulo dance (now handled
inside next_available_device).
* Minor cleanups: list comprehensions for total_conds, conds_to_batch,
and the devices list.
Fixes _calc_cond_batch_multigpu so that:
1. conds_per_device uses real division before math.ceil. The previous
expression math.ceil(total_conds // len(devices)) applied integer
floor division first, making ceil a no-op. For 3 conds across 2
devices this produced conds_per_device=1 instead of 2.
2. The scheduling loop skips devices that have already reached
capacity instead of appending empty batch groups. Without this
guard, the loop could repeatedly emit zero-length groups for a
full device, leaving sampling stuck at 0/N until timeout.
Reproduces with an Omnigen2 image workflow that produces three
condition entries scheduled across two CUDA devices. With the fix
the scheduler assigns conds_per_device=2 and splits the batches as
2 + 1 across the two devices, allowing sampling to complete.
Original fix authored and validated by @pollockjj in
pollockjj/ComfyUI#64.
Co-authored-by: John Pollock <pollockjj@gmail.com>
Aligns the OSS spec with the cloud-side BE-1004 contract:
- createWorkspaceApiKey request body: add maxLength: 5000 to the
description property (matches cloud's hub_profile.description
MaxLen(5000) convention; enforced cloud-side via handler check).
- WorkspaceApiKey + WorkspaceApiKeyCreated response schemas:
mark description as required (cloud's handler always populates
the field, defaulting to empty string when not supplied on create),
drop nullable: true, add maxLength: 5000 for symmetry, and clarify
the doc string ("Always present in responses; empty string when no
description was supplied on create").
Both schemas are tagged x-runtime: [cloud] at the schema level so the
tightening is correctly scoped — OSS-only implementations are not
required to honor the workspace API keys endpoints at all.
Related cloud PR: Comfy-Org/cloud#3747
* feat(openapi): add optional description field to workspace API key schemas
Add an optional `description` property (type: string) to three
workspace API key schemas in openapi.yaml:
- Inline request body of createWorkspaceApiKey (POST /api/workspace/api-keys)
- WorkspaceApiKey (list/info schema)
- WorkspaceApiKeyCreated (creation response schema)
The field is not added to any `required` array, making it fully
backward-compatible with existing clients.
Refs: BE-1005, BE-1004
Co-authored-by: Matt Miller <mattmillerai@users.noreply.github.com>
* fix(openapi): mark description nullable in workspace API key response schemas
Per CodeRabbit review on PR #13993: the underlying DB column is nullable
varchar (default ''), so the response schemas should permit null to match
stored data reality. Without nullable: true the OpenAPI contract would
require coercion on the handler side or risk a contract violation.
Request schema unchanged — clients shouldn't be sending null on create.
These two fields were added recently to the Asset schema as nullable
integers, with the intent of exposing original image dimensions for FE
consumers (cloud-side thumbnailing makes naturalWidth/Height return
the wrong size for an image card's dimension label).
The implementation effort that consumes them subsequently converged on
a different shape — dimensions nested under the existing free-form
`metadata` JSON field as `{kind: "image", width, height}` — to avoid
introducing type-specific flat fields on the canonical Asset shape,
and to leave room for forward-compatible additions (video duration,
fps, etc.) without further schema churn.
This removes the now-unused top-level fields so the spec reflects the
agreed direction. No other schema definitions reference these fields
directly: AssetCreated, AssetUpdated, etc. inherit Asset via allOf and
do not redefine them.
The runtime ingest implementation that would have populated these
fields was not yet shipped, so no clients are relying on the
top-level shape.
Co-authored-by: Alexis Rolland <alexisrolland@hotmail.com>
Mark the uploadMask operation as deprecated and point clients at
/api/upload/image. The mask-compositing behavior the endpoint provides
(alpha-compositing the supplied mask onto an original_ref image) is now
expected to happen client-side, with the composited result uploaded
through the unified /api/upload/image path.
The endpoint continues to function for older clients; no runtime
behavior changes ship with this commit. Only the OpenAPI annotation
and the human-facing description are updated.
* Move detection category under image category
* Add missing categories
* Move detection nodes to detection category
* Move save nodes to image root catefory
* Rename postprocessors
* Move mask category under image
* Move guiders category to parent level at root of sampling category
* Move custom_sampling category to parent level at the root of sampling category
* Modify description of LoRA loaders
* Fix node id SolidMask
* Move VOID Quadmask under image/mask
* Group compositing nodes under image/compositing
* Move load image as mask to image category for consistency with other load image nodes
* Align display name with Load Checkpoint
* Move dataset category under training category
* Rename Number Convert to Conver Number (verb first)
* Rename Canny node
* Revert wanBlockSwap + description
* Add description to RemoveBackground node
* Revert category update of dataset