True reset semantics for "default":
- On first selector application, cache the loader's original
load_device / offload_device on the underlying model object (which
is shared across patcher clones) and restore those base values when
the user picks "default". Previously "default" meant "passthrough"
so SelectXDevice(gpu:1) -> SelectXDevice(default) silently kept the
gpu:1 routing.
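A minimal sketch of the reset behaviour, assuming the cache lives on the shared model object. FakeModel, FakePatcher, base_devices and select_device are hypothetical stand-ins for illustration, not the real ComfyUI classes or attribute names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeModel:
    # Shared by every patcher clone, so the cached values survive clone().
    base_devices: Optional[tuple] = None


@dataclass
class FakePatcher:
    model: FakeModel
    load_device: str
    offload_device: str

    def clone(self) -> "FakePatcher":
        # Clones share the same underlying model object.
        return FakePatcher(self.model, self.load_device, self.offload_device)


def select_device(patcher: FakePatcher, choice: str) -> FakePatcher:
    clone = patcher.clone()
    # Cache the loader's original devices on first application only.
    if clone.model.base_devices is None:
        clone.model.base_devices = (patcher.load_device, patcher.offload_device)
    if choice == "default":
        # True reset: restore the cached base values instead of passing through.
        clone.load_device, clone.offload_device = clone.model.base_devices
    else:
        clone.load_device = choice
    return clone


base = FakePatcher(FakeModel(), "cuda:0", "cpu")
on_gpu1 = select_device(base, "cuda:1")
restored = select_device(on_gpu1, "default")
```

With passthrough semantics the last call would have kept cuda:1; with the cache it returns to the loader's original devices.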
CPU + dynamic VRAM:
- When SelectModelDevice / SelectCLIPDevice resolves to CPU on a
ModelPatcherDynamic, also call clone(disable_dynamic=True) so the
result is a plain ModelPatcher, matching ModelPatcherDynamic.__new__'s
intent that CPU loads never run through the dynamic path. Fall back to
the regular dynamic clone if disable_dynamic is unsupported on that
patcher.
MultiGPU collision pruning:
- After SelectModelDevice retargets the primary patcher, drop any
multigpu clone (from a prior MultiGPU CFG Split) whose load_device
now matches the primary; otherwise two patchers would be bound to
the same device. Logs the prune at info level.
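The pruning step can be sketched roughly as follows. Patcher, the multigpu_clones list and prune_colliding_clones are simplified, hypothetical stand-ins; the real patcher stores its clones differently:

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("multigpu_sketch")


@dataclass
class Patcher:
    load_device: str
    multigpu_clones: list = field(default_factory=list)


def prune_colliding_clones(primary: Patcher) -> int:
    """Drop clones whose load_device now equals the primary's; return the count."""
    kept = [c for c in primary.multigpu_clones
            if c.load_device != primary.load_device]
    pruned = len(primary.multigpu_clones) - len(kept)
    if pruned:
        log.info("Pruned %d multigpu clone(s) already bound to %s",
                 pruned, primary.load_device)
    primary.multigpu_clones = kept
    return pruned


primary = Patcher("cuda:0", [Patcher("cuda:1"), Patcher("cuda:2")])
primary.load_device = "cuda:1"  # the selector retargets the primary
removed = prune_colliding_clones(primary)
```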
SelectVAEDevice: reject CPU at runtime:
- The UI uses get_gpu_device_options_no_cpu(), but a workflow opened
from another machine could still pass "cpu" through validate_inputs.
Detect that case explicitly, log a "CPU is not a supported choice"
passthrough message, and leave the VAE unchanged.
Cosmetic:
- Update VAE node docstring to accurately reflect the runtime CPU
rejection rather than the older "intentionally not offered" claim.
- Remove the fallback warnings inside resolve_gpu_device_option
entirely; the Select*Device nodes now own a single context-rich
info-level message per failed lookup, so there is no double logging.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
V3 io.ComfyNode subclasses use the lowercase `validate_inputs` hook for opting out of strict combo validation (execution.py line 862); the uppercase `VALIDATE_INPUTS` is the V1 spelling and is ignored on V3 nodes. The strict combo check at execution.py line 1025 is gated on `if x not in validate_function_inputs`, so renaming to `validate_inputs(cls, device='default')` lets unknown `gpu:N` values pass validation and fall through to the runtime fallback.
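A rough sketch of the runtime fallback that unknown gpu:N values fall through to. resolve_option and select_model_device are hypothetical stand-ins (the model is a plain dict here), not the real resolve_gpu_device_option or node code:

```python
import logging

log = logging.getLogger("select_device_sketch")


def resolve_option(option, available):
    """Map 'cpu' or 'gpu:N' to a device string; return None if unknown."""
    if option == "cpu":
        return "cpu"
    if option.startswith("gpu:"):
        index = option[4:]
        if index.isdigit() and int(index) < len(available):
            return available[int(index)]
    return None


def select_model_device(model, option, available):
    if option == "default":
        return model
    device = resolve_option(option, available)
    if device is None:
        # Unknown option (e.g. gpu:1 on a 1-GPU machine): log and pass through.
        log.info("Device option %r is not available here; passing through", option)
        return model
    return {**model, "load_device": device}


model = {"load_device": "cuda:0"}
one_gpu = select_model_device(model, "gpu:1", ["cuda:0"])
two_gpu = select_model_device(model, "gpu:1", ["cuda:0", "cuda:1"])
```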
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
When --enable-dynamic-vram is on, every ModelPatcher is a
ModelPatcherDynamic whose underlying model has a per-device dynamic_pins
dict, initialized in __init__ for self.load_device only. If a cloned
patcher's load_device is later reassigned (as the Select{Model,CLIP,VAE}
Device nodes do), the new device key is missing and partially_unload_ram
raises KeyError: device(type='cuda', index=N).
Fix:
- Extract the per-device dynamic_pins init in ModelPatcherDynamic.__init__
into a new helper method register_load_device(device) which is now also
called from __init__.
- Each Select*Device node calls clone.patcher.register_load_device(resolved)
after retargeting load_device, guarded by hasattr so non-dynamic
patchers (plain ModelPatcher in non-dynamic-vram installs) skip it.
Caught by happy-path test where SelectCLIPDevice retargeted CLIP from
cuda:0 to cuda:1 and CLIPTextEncode then crashed in
partially_unload_ram -> dynamic_pins[cuda:1].
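The failure mode and the fix can be sketched like this. DynamicModel, DynamicPatcher and retarget are simplified stand-ins, and the empty list stored per device is an assumption about what a dynamic_pins entry looks like:

```python
class DynamicModel:
    def __init__(self):
        # One entry per device the model may be loaded on.
        self.dynamic_pins = {}


class DynamicPatcher:
    def __init__(self, model, load_device):
        self.model = model
        self.load_device = load_device
        self.register_load_device(load_device)

    def register_load_device(self, device):
        # Idempotent: keeps any existing entry for this device.
        self.model.dynamic_pins.setdefault(device, [])

    def partially_unload_ram(self):
        # Raises KeyError if load_device was never registered.
        return self.model.dynamic_pins[self.load_device]


def retarget(patcher, device):
    patcher.load_device = device
    # Plain (non-dynamic) patchers have no such method and skip this step.
    if hasattr(patcher, "register_load_device"):
        patcher.register_load_device(device)


p = DynamicPatcher(DynamicModel(), "cuda:0")
retarget(p, "cuda:1")
pins = p.partially_unload_ram()
```

Without the register_load_device call in retarget, the last line would raise KeyError for the new device.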
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Replace the per-loader device widgets removed in the previous commit
with three small passthrough selector nodes registered under
advanced/multigpu:
- Select Model Device (MODEL in/out) - options: default / cpu / gpu:N
- Select CLIP Device (CLIP in/out) - options: default / cpu / gpu:N
- Select VAE Device (VAE in/out) - options: default / gpu:N (no cpu)
Each node clones the inbound patcher (model.clone() / clip.clone() /
copy.copy(vae)+vae.patcher.clone()) and retargets load_device (and
offload_device for cpu / vae_offload_device for VAE).
Portability across machines with different GPU counts:
- VALIDATE_INPUTS returns True so an unknown gpu:N value (e.g. a
workflow saved on a 2-GPU machine opened on a 1-GPU machine) does
not error at validation time.
- At runtime, resolve_gpu_device_option(...) returns None for
unknown options (with a warning), and each selector then logs a
per-node info message and passes through unchanged, matching the
no-op style used by MultiGPU CFG Split's
"No extra torch devices need initialization..." log.
Also adds comfy.model_management.get_gpu_device_options_no_cpu() which
the VAE selector uses; on a single-GPU box this collapses to just
["default"], which is fine.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Remove the device-selection widgets that were added directly to existing
loader nodes (and the new CheckpointLoaderDevice / ImageOnlyCheckpointLoaderDevice
variants):
- nodes.py:
- delete CheckpointLoaderDevice class and its NODE_CLASS_MAPPINGS /
NODE_DISPLAY_NAME_MAPPINGS entries
- remove the optional `device` input + VALIDATE_INPUTS + resolve logic
from UNETLoader, VAELoader, CLIPLoader, DualCLIPLoader
- restore CLIPLoader/DualCLIPLoader `device` options to ["default", "cpu"]
- comfy_extras/nodes_video_model.py:
- delete ImageOnlyCheckpointLoaderDevice class + its mapping entries
- comfy_extras/nodes_lt_audio.py:
- restore LTXAVTextEncoderLoader `device` options to ["default", "cpu"]
and revert the resolve logic back to the simple `if device == "cpu"`
branch
The replacement approach is a small set of passthrough Select*Device
nodes (added in the next commit) that retarget MODEL/CLIP/VAE devices
without bloating every loader's UI or duplicating loaders.
The cuda_device_context helper and the model_management helpers
(get_gpu_device_options / resolve_gpu_device_option) from #13483 are
kept; they are still used by the new selector nodes.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
get_all_torch_devices() only enumerates one vendor at a time (the
is_nvidia/is_intel_xpu/is_ascend_npu branches are exclusive and each
constructs devices via torch.device("type", i) with a real integer
index), and aimdo_control.init_devices short-circuits on lib is None
before iterating, so the d.type == "cuda" and d.index is not None
filter cannot ever change the result. Match master's trust level and
just pass the indices directly.
Reduces the divergence from master to a single line:
init_device(get_torch_device().index)
-> init_devices(d.index for d in get_all_torch_devices())
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
The aimdo init call on worksplit-multigpu was using
comfy_aimdo.control.init_devices(range(torch.cuda.device_count()))
which required adding `import torch` at the top of main.py (violating the
"torch should never be imported before this point" expectation) and an
inner is_nvidia() guard added in PR #14068 to defend the raw cuda call
on non-NVIDIA systems where --enable-dynamic-vram is explicitly passed.
Replace the call with
comfy_aimdo.control.init_devices(
d.index for d in comfy.model_management.get_all_torch_devices()
if d.type == "cuda" and d.index is not None
)
comfy_aimdo.control.init_devices accepts any iterable of int-coercible
device indices and returns False on an empty iterable, so on non-cuda
systems the elif naturally falls through to the existing "No working
comfy-aimdo install detected" fallback - no extra vendor gate needed.
HIP devices appear as type "cuda" in torch, so ROCm setups (which
comfy-aimdo supports via aimdo_rocm.so) are handled correctly too.
This lets us drop both the `import torch` at the top of main.py and the
inner is_nvidia() guard, leaving a single logical-line divergence from
master (init_device(single index) -> init_devices(generator of cuda
indices)) for multi-GPU aimdo support.
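A small sketch of why the filter makes the extra vendor gate unnecessary. Device and this init_devices are stand-ins that only mimic the behaviour described above (int-coercible indices, False on an empty iterable); they are not the comfy_aimdo implementation:

```python
from collections import namedtuple

Device = namedtuple("Device", ["type", "index"])


def init_devices(indices):
    """Stand-in: succeed only when at least one device index is supplied."""
    ids = [int(i) for i in indices]
    return bool(ids)


def cuda_indices(devices):
    # Generator of indices for cuda-type devices that have a concrete index.
    return (d.index for d in devices
            if d.type == "cuda" and d.index is not None)


nvidia_ok = init_devices(cuda_indices([Device("cuda", 0), Device("cuda", 1)]))
xpu_ok = init_devices(cuda_indices([Device("xpu", 0)]))
```

On the non-cuda box the generator is empty, so the call returns False and the caller can take its existing fallback branch.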
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Master commit cf758bd2 (PR #13663, "chore(api-nodes): increase default
timeout for partner API node tasks") removed three explicit
max_poll_attempts=280 overrides from nodes_kling.py so the new 480
default in util/client.py would take effect.
The May 19 merge of master into worksplit-multigpu (ff766e5c) silently
discarded those three deletions in the 3-way resolve - nodes_kling.py
had no textual conflict but the resolution kept the pre-cf758bd2 lines.
The other seven files cf758bd2 touched were merged correctly; this
restores nodes_kling.py to match master.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
* openapi: add enum values + FeedbackRequest schema for cloud cutover (PR E)
Adds missing cloud-runtime enum values to vendor schemas that the
cloud runtime emits but vendor declared as plain strings.
Changes:
- JobEntry.status: enum [pending, in_progress, completed, failed, cancelled]
- JobDetailResponse.status: same enum
- BillingStatus: enum [awaiting_payment_method, pending_payment, paid,
payment_failed, inactive]
- FeedbackRequest schema added (with type enum)
- /api/feedback POST: requestBody now $refs FeedbackRequest
All cloud-runtime-emitted; no impact on OSS-local semantics.
Identified via Comfy-Org/cloud's TestCutoverSafe gate (BE-1106) as
the remaining schema-level divergences after PRs A-D landed and got
synced.
* openapi: add type enum to Workspace schema (cutover follow-up)
Cloud's Workspace runtime shape includes a 'type' field with enum
[personal, team] that vendor's Workspace was missing. Cloud handlers
reference the generated ingest.WorkspaceType Go enum.
Same kind of surgical addition as JobEntry.status / BillingStatus /
JobDetailResponse.status in this PR — adds cloud-runtime field to
existing vendor schema.
Two fixes for single-GPU users on non-NVIDIA backends; multi-GPU
non-CUDA support is intentionally out of scope here (tracked separately).
1. get_all_torch_devices: add AMD/ROCm, MLU, and a generic fallback arm.
Previously the function only enumerated NVIDIA, Intel XPU, and Ascend
NPU when cpu_state==GPU; on AMD/ROCm (which exposes its GPU through
torch.cuda.*) and DirectML it fell through to an empty list. The
biggest user-visible regression: unload_all_models() iterates this
list, so it became a silent no-op on AMD/ROCm. /free, manager
unloads, and shutdown stopped releasing VRAM.
- is_amd() now shares the torch.cuda.* arm with is_nvidia(), since
ROCm reuses the CUDA API surface.
- is_mlu() gets its own arm using torch.mlu.device_count().
- A final fallback appends get_torch_device() for any GPU backend
the explicit arms miss (notably DirectML), so callers see at
least the current device and unload_all_models works.
MPS users are unaffected: cpu_state==MPS already routes to the
else branch which appends get_torch_device() returning mps.
2. main.py DynamicVRAM init: guard the comfy_aimdo branch with an
explicit is_nvidia() check.
The outer condition allows entering the DynamicVRAM init block when
the user passes --enable-dynamic-vram explicitly, bypassing the
implicit is_nvidia() gate. On non-NVIDIA backends this then runs
comfy_aimdo.control.init_devices(range(torch.cuda.device_count())),
which is comfy-aimdo-only territory and may crash at startup. Add a
leading is_nvidia() check that logs a clean warning and falls back
to the legacy ModelPatcher path.
* Revert "Add tiled VAE lane to MultiGPU Work Units"
This reverts commit 4d3d68e473.
The tiled VAE lane will land as part of a follow-up PR alongside the
UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052)
to keep the upstream merge focused.
* Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split"
This reverts commit 74b0a826ea.
The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the
tiled VAE lane, separated from the threaded-loader fix PR (#14052) to
keep the upstream merge focused.
---------
Co-authored-by: John Pollock <pollockjj@gmail.com>
* openapi: rename cloud-side response schemas to match runtime (PR D)
Follow-up to the BE-1106 stack (#14060/61/63). Cloud's Go handlers
reference response schemas by name (e.g., ingest.WorkflowResponse,
ingest.SubscribeResponse), but vendor's matching operations were
declaring those responses against differently-named vendor-side
schemas (CloudWorkflow, BillingSubscription, etc.). After the stack
landed, schemas like WorkflowResponse exist in vendor but weren't
referenced by any path, so codegen pruned the unreferenced types.
This PR:
1. Updates 34 operation $refs in cloud-runtime paths to point to
the schema names cloud's handlers expect (e.g., CloudWorkflow →
WorkflowResponse on /api/workflows/{workflow_id}).
2. Adds 12 cloud-only schemas that weren't in vendor yet but are
referenced by these renames (e.g., SubscribeResponse,
CancelSubscriptionResponse, BillingOpStatusResponse). Each
copied verbatim from Comfy-Org/cloud's hand-written ingest spec
and tagged x-runtime: [cloud] with a [cloud-only] description
prefix.
Schema renames span the same domains as the operationId renames in
PR A: billing/subscriptions (7 schemas), workflows (5), userdata (3),
jobs (2), hub (2), history (2), auth/workspace (4), and misc cloud
endpoints (9).
Convergent safety check after this lands (against cloud's
TestCutoverSafe gate, BE-1106):
Pre-PR D: 205 missing handler refs
Post-PR D: 105 missing handler refs (-49%)
Cumulative since the original 938-ref baseline: -89%
The remaining 105 are a Phase 3 follow-up (response headers,
text/plain responses, codegen-derived enum sub-types, and a small
set of inline-response-schema operations that vendor declares
inline where cloud has named-schema $refs).
* openapi: drop PR-label comment from new schemas block
PR-internal labels don't belong in committed code — future readers
won't know what 'PR D' means and the marker stops being useful the
moment this PR merges.
* openapi: rename 55 cloud-side operationIds to match runtime handlers
For the 55 operations below, vendor's operationId did not match the
name cloud's runtime handlers expect. Generated types from vendor
therefore had different names (e.g. CreateSubscription200JSONResponse)
than what cloud handlers reference (Subscribe200JSONResponse), which
blocks the post-cutover combined-spec codegen.
All 55 renames target the cloud-runtime-authoritative name. Several
of these endpoints are shared concepts (queue, settings, userdata,
object_info) that OSS local also serves — the rename aligns vendor
with the longstanding cloud handler-side convention to unblock the
shared codegen. No request/response *shape* changes in this PR; only
operationId labels.
Notable categories:
- Billing/subscriptions: 7 renames (subscribe, getBillingPlans, ...)
- Workspace + workflows: 13 renames (createWorkflow, ...)
- Hub: 3 renames
- Auth/users: 5 renames
- Shared OSS surface (settings, queue, view, userdata): 12 renames
- Misc cloud-only: 15 renames
Identified via Comfy-Org/cloud's TestCutoverSafe build-safety gate
(BE-1106), which compares handler type references against codegen
output from the combined spec.
* fix(openapi): resolve getHistory operationId collision
Spectral flagged: both /api/history (OSS local) and /api/history_v2
(cloud) had operationId 'getHistory' after the rename. Rename vendor's
/api/history to 'getPromptHistory' to disambiguate. Cloud's runtime
denies /api/history at the overlay level so combined codegen is
unaffected by this change.
* openapi: add 41 cloud-runtime schemas to components.schemas (PR B of 3) (#14061)
* openapi: add 41 cloud-runtime schemas to components.schemas (cutover prep)
Adds schemas that exist in Comfy-Org/cloud's hand-written ingest spec
but not yet in this vendored OSS spec. All tagged x-runtime: [cloud]
per the field-drift convention and prefixed with [cloud-only] in the
description.
These schemas are referenced by cloud's Go handlers via the generated
ingest.<Schema> Go type names. Codegen from the vendored spec didn't
produce those types because the schemas weren't declared here. Adding
them unblocks the post-cutover combined-spec codegen.
Schemas added (alphabetical):
AssetDownloadResponse, AssetMetadataResponse, BillingBalanceResponse,
BillingPlansResponse, BillingStatusResponse, GetUserDataResponseFull,
HistoryDetailEntry, HistoryDetailResponse, HistoryResponse,
HubLabelInfo, HubProfileSummary, HubWorkflowListResponse,
HubWorkflowStatus, HubWorkflowSummary, HubWorkflowTemplateEntry,
JobStatusResponse, JobsListResponse, LabelRef, LogsResponse, Member,
OAuthRegisterBadRequestResponse, PendingInvite, Plan, PlanAvailability,
PlanAvailabilityReason, PlanSeatSummary, PreviewPlanInfo,
PreviewSubscribeResponse, PublishedWorkflowDetail, SecretResponse,
SubscriptionDuration, SubscriptionTier, UserDataResponseFull,
ValidationError, ValidationResult, WorkflowForkedFrom, WorkflowResponse,
WorkflowVersionContentResponse, WorkspaceAPIKeyInfo, WorkspaceSummary,
WorkspaceWithRole
Identified via Comfy-Org/cloud's TestCutoverSafe build-safety gate
(BE-1106). Companion to PR #14060 (operationId renames).
* fix(openapi): add BindingErrorResponse schema
OAuthRegisterBadRequestResponse references BindingErrorResponse but
that schema wasn't in the original add. Adding it now as a cloud-only
schema matching the cloud runtime's binding-error shape (single
'message' string field).
* openapi: add missing 4xx/5xx response bodies for cloud-emitting endpoints (#14063)
Vendor declares shared endpoints (e.g. /api/queue, /api/settings,
/api/assets/*, /api/billing/*) with success responses but is missing
many of the 4xx/5xx error response bodies that Comfy-Org/cloud's
runtime actually emits. Cloud's Go handlers reference the generated
ingest.Op<StatusCode>JSONResponse types for these missing statuses,
which currently fail to resolve when codegen runs against the
vendored spec.
This PR adds 237 response entries across 117 operations, restoring
the documented error responses that cloud emits. Bodies are copied
verbatim from Comfy-Org/cloud's hand-written ingest spec
(services/ingest/openapi.yaml) and reference a new ErrorResponse
schema also added in this PR (matches cloud's {code, message} runtime
shape, tagged x-runtime: [cloud]).
ErrorResponse is intentionally separate from the existing CloudError
schema. CloudError's shape ({error}) describes one runtime; cloud
emits a different shape ({code, message}). Existing CloudError refs
in vendor are untouched; new cloud-emitting error references use
ErrorResponse.
Identified via Comfy-Org/cloud's TestCutoverSafe build-safety gate
(BE-1106). Companion to PR #14060 (operationId renames) and PR #14061
(cloud-only schema additions).
* openapi: align response declarations with implementation (5 endpoints)
- POST /api/assets/download: replace 200 with 202 + tracking-task body
(endpoint runs asynchronously and returns task_id/status/message).
- POST /api/assets/export: same 200 → 202 + tracking-task body.
- POST /api/assets/from-workflow: change 201 → 200 (handler responds 200,
not 201; no Location header emitted).
- POST /api/feedback: change 200 → 201 (creates a feedback record).
- /api/jobs and /api/jobs/{job_id}: change timestamp fields from
type: number to type: integer + format: int64. Values are Unix
milliseconds — number causes oapi-codegen to emit float64, losing
precision and producing the wrong Go type. Affected fields:
create_time, update_time, execution_start_time, execution_end_time.
Verification: each change reflects what the endpoint observably returns;
no handler changes required. Backwards-compatible for existing clients
(integer is a subset of number; status code shifts within 2xx).
* openapi: align asset download/export 202 status enum with runtime + sibling schemas
CodeRabbit caught a vocabulary mismatch: the two new 202 response schemas
declared `[pending, running, completed, failed]` while the rest of the same
spec uses `[created, running, completed, failed]` for the identical task
lifecycle (download/export progress WebSocket events, /api/tasks, TaskEntry,
TaskResponse — 4 sites total). Cloud's runtime emits `created` on initial
creation (AssetDownloadResponseStatusCreated; task.Status sourced from the
DB enum whose initial value is Created). `pending` would have introduced a
fifth, contradictory vocabulary for the same lifecycle and pushed the spec
further from the implementation it is meant to align with.
Followup tracked separately: extract a shared TaskStatus enum so all five
sites move in lockstep instead of needing per-site edits.
Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler
that dispatches per-device tile functions through the existing
MultiGPUThreadPool and merges per-device CPU output buffers in deterministic
key order. The worker only catches BaseException at the thread boundary to
funnel errors to the main thread; bare torch.cuda.set_device and
torch.cuda.synchronize calls inside the worker fail loud if the device is
not CUDA, which is part of the primitive's contract.
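The dispatch-and-merge pattern can be sketched with the standard library. This is an illustrative stand-in that uses concurrent.futures rather than MultiGPUThreadPool and plain Python values rather than tensors; run_tiles and its arguments are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor


def run_tiles(tiles_by_device, tile_fn):
    """Run tile_fn per device in worker threads; merge outputs in key order."""

    def worker(device, tiles):
        try:
            return device, [tile_fn(device, t) for t in tiles], None
        except BaseException as exc:
            # Catch only at the thread boundary and hand the error back.
            return device, None, exc

    with ThreadPoolExecutor(max_workers=max(1, len(tiles_by_device))) as pool:
        futures = [pool.submit(worker, d, t) for d, t in tiles_by_device.items()]
        results = [f.result() for f in futures]

    merged = []
    # Sorting by device key keeps the merge order deterministic.
    for device, outputs, exc in sorted(results, key=lambda r: r[0]):
        if exc is not None:
            raise exc  # surface the worker failure on the main thread
        merged.extend(outputs)
    return merged


doubled = run_tiles({"cuda:1": [3, 4], "cuda:0": [1, 2]}, lambda d, t: t * 2)
```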
Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model
descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident
until execute time and are returned to CPU afterward.
ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when
a multigpu descriptor is attached; the single-device path runs unchanged
when no clones are present.
Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf
after full truncation.
This pattern doesn't happen as a general rule, but does happen in the
upcoming worksplit-multigpu branch.
This was an attempt at a fast path: ensure the file slice was
created by the owning thread and refuse otherwise, without needing a mutex,
but worksplit-multigpu doesn't work that way. Use a mutex.
Shoot me for overthinking next time.
The /system_stats endpoint was returning a hardcoded single-element
devices list built from get_torch_device(), which only reflects the
primary CUDA device. On multi-GPU systems this hides the additional
devices from frontends / tooling (the API surface that enables multigpu
support discovery). Switch to iterating get_all_torch_devices(), with
the primary device kept first so existing clients reading devices[0]
keep working.
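The ordering rule is simple enough to sketch. ordered_devices is a hypothetical helper operating on device names, not the actual endpoint code:

```python
def ordered_devices(primary, all_devices):
    """Return all devices with the primary first and the rest in original order."""
    rest = [d for d in all_devices if d != primary]
    return [primary] + rest


devices = ordered_devices("cuda:1", ["cuda:0", "cuda:1", "cuda:2"])
```

Clients that only read devices[0] still see the primary device.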
(Worksplit-multigpu-only: get_all_torch_devices is the multigpu helper
introduced on this branch; master's /system_stats remains unchanged.)
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Two CodeRabbit findings from #7063 (#13 and #14) are deferred because
worksplit-multigpu's initial release scope is NVIDIA-only QA. Leave a
TODO at the unconditional torch.cuda.set_device call and at the
post-aggregation point so the required guards/synchronize are easy to
find when multigpu support is extended to XPU/NPU/MPS/CPU/DirectML.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Brings in 18 commits from master so worksplit-multigpu does not regress
fixes that landed on main since the last sync:
- #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport;
conflict resolved in favor of the shape>=2 gate that binds
swap_cfg_halves once and reuses it for the output swap-back)
- #14031 ModelPatcherDynamic lora reshape / backup restore fix
- #13802 Multi-threaded model load (memory_management / pinned_memory /
model_management / aimdo plumbing)
- #12679 lanczos single-channel tensor fix
- #14010 Stable Audio 3 support
- assorted partner-node, openapi, workflow-template, and tooling updates
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
CrossAttention.kv.view and Attention.qkv_combined.view both hardcoded
batch=1 in the reshape, crashing or silently mis-shaping whenever the
actual batch dimension was greater than 1. These were fixed on master
in #13699 as part of the same patch that gated the chunk(2) swap, but
worksplit-multigpu only picked up the chunk(2) gate. Bring the two
view() fixes over so we have parity with master.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
The previous gate (len(cond_or_uncond) == 2 and set == {0, 1}) was
intended to skip the cond/uncond swap when only one half was present
under MultiGPU CFG Split, but it was too restrictive: it also skipped
batch_size > 1 + CFG (cond_or_uncond like [0, 0, 1, 1] or [0,0,0,0,
1,1,1,1]), where chunk(2) still splits the batch cleanly into a cond
half and an uncond half and the swap is still required.
Switch to context.shape[0] >= 2, matching the parallel fix landed on
master in #13699. The swap is a permutation-invariant no-op when the
two halves don't form a CFG pair (since the output swap_cfg_halves
block immediately undoes the permutation), so the only thing the gate
actually needs to do is guard against chunk(2) on a batch of one.
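A list-based sketch of the two gates, assuming an even batch size so that the split matches chunk(2). The function names are illustrative and the lists stand in for tensors:

```python
def old_gate(cond_or_uncond):
    # Previous check: exactly one cond and one uncond entry.
    return len(cond_or_uncond) == 2 and set(cond_or_uncond) == {0, 1}


def new_gate(batch):
    # New check: only require that the batch can be split into two halves.
    return len(batch) >= 2


def swap_halves(batch):
    # Equivalent to swapping the two chunk(2) pieces for an even-length batch.
    half = len(batch) // 2
    return batch[half:] + batch[:half]


batch = ["c0", "c1", "u0", "u1"]  # batch_size 2 with CFG
swapped = swap_halves(batch)
round_trip = swap_halves(swapped)
```

The old gate rejects cond_or_uncond == [0, 0, 1, 1] even though the swap is still needed; the new gate accepts it, and applying the swap twice restores the original order.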
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Per review feedback on #7063. The two functions share the conds-by-hooks
accumulation, memory-fit batching, and per-chunk output aggregation; the
multigpu variant adds per-device scheduling, .to(device) placement,
per-device patcher/control lookup, and thread-pool dispatch around the
inner loop. Documenting the relationship without extracting helpers --
extraction can land after the initial worksplit-multigpu release once
both paths have settled.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes model.diffusion_model into extra_args,
which the control_model then uses for forward passes (img_in, txt_in,
pe_embedder, time_text_embed). With multigpu, every per-device control
clone was being pre_run with the base model on GPU0, so secondary
devices would invoke those modules with parameters on GPU0 and inputs
on their own device, raising 'Expected all tensors to be on the same
device'. Build a device -> per-device BaseModel lookup from the
patcher's additional multigpu models and pass each clone the model on
its own device. Falls back to the base model when no per-device match
is found (single-GPU path and the case where cnet.multigpu_clones lags
the patcher's clone set).
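The lookup with fallback can be sketched as below. ModelSketch, build_device_lookup and model_for_clone are hypothetical names and devices are plain strings here:

```python
from dataclasses import dataclass


@dataclass
class ModelSketch:
    name: str
    device: str


def build_device_lookup(models):
    """Map each per-device model to the device it lives on."""
    return {m.device: m for m in models}


def model_for_clone(clone_device, base_model, lookup):
    # Fall back to the base model when no per-device match exists.
    return lookup.get(clone_device, base_model)


base = ModelSketch("base", "cuda:0")
lookup = build_device_lookup([ModelSketch("copy-1", "cuda:1")])
matched = model_for_clone("cuda:1", base, lookup)
fallback = model_for_clone("cuda:2", base, lookup)
```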
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes the model's diffusion_model into
self.extra_args['base_model'], but ControlBase.cleanup never clears
extra_args. The diffusion_model reference therefore lingered between
sampling runs, blocking ComfyUI's model offload/eviction logic from
freeing the UNet and -- for multigpu -- holding one such reference per
per-device control clone (defeating the max_gpus pruning added in this
PR). Override cleanup to drop the entry; super().cleanup() already
recurses into multigpu_clones so each per-device clone pops its own.
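A simplified sketch of the override. ControlBaseSketch and QwenFunControlSketch are stand-ins; the real ControlBase.cleanup does more than recurse into the clones:

```python
class ControlBaseSketch:
    def __init__(self):
        self.extra_args = {}
        self.multigpu_clones = {}

    def cleanup(self):
        # Base cleanup recurses into per-device clones but leaves extra_args alone.
        for clone in self.multigpu_clones.values():
            clone.cleanup()


class QwenFunControlSketch(ControlBaseSketch):
    def pre_run(self, diffusion_model):
        self.extra_args["base_model"] = diffusion_model

    def cleanup(self):
        # Drop the stashed reference so the model can be offloaded or evicted.
        self.extra_args.pop("base_model", None)
        super().cleanup()


main = QwenFunControlSketch()
clone = QwenFunControlSketch()
main.multigpu_clones["cuda:1"] = clone
main.pre_run(object())
clone.pre_run(object())
main.cleanup()
```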
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
Drop the new ignore_multigpu positional argument from prepare_state and
from the ON_PREPARE_STATE callbacks; pass the flag via model_options
instead. This restores the original 3-arg callback signature so existing
custom-node ON_PREPARE_STATE handlers keep working unchanged, while
still letting prepare_state's recursive call into multigpu_clones
short-circuit.
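A sketch of carrying the flag in model_options. The callback argument order (patcher, timestep, model_options) and the "ignore_multigpu" key are assumptions made for illustration; only the idea of keeping a 3-argument callback comes from the change above:

```python
from dataclasses import dataclass, field


@dataclass
class PatcherSketch:
    name: str
    multigpu_clones: list = field(default_factory=list)


def prepare_state(patcher, timestep, model_options, callbacks):
    for callback in callbacks:
        callback(patcher, timestep, model_options)  # unchanged 3-arg call
    if model_options.get("ignore_multigpu", False):
        return  # recursive call into a clone: do not descend further
    clone_options = {**model_options, "ignore_multigpu": True}
    for clone in patcher.multigpu_clones:
        prepare_state(clone, timestep, clone_options, callbacks)


seen = []


def legacy_callback(patcher, timestep, model_options):
    # An existing handler that knows nothing about the new flag.
    seen.append(patcher.name)


inner = PatcherSketch("gpu1", [PatcherSketch("nested")])
root = PatcherSketch("gpu0", [inner])
prepare_state(root, 0.5, {}, [legacy_callback])
```

The handler runs for the root and its direct clone, while the flag stops the recursion before it reaches the nested patcher.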
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
* ModelPatcherDynamic: purge stale vbar allocs on force cast
* ModelPatcherDynamic: restore backups before load
If doing a clean reload, mutative changes (lora application) could be
applied on top of the already loaded weight. Restore from backup
unconditionally so that the new load is clean.
The job_ids query parameter on GET /api/assets is tagged x-runtime:
[cloud] and only exists for cloud's variant of this endpoint. Cloud
removed all consumers and the cloud-side handler/codegen/tests in
Comfy-Org/cloud#3778. With cloud no longer accepting this parameter,
the [cloud-only] documentation here is wrong — drop it so the daily
sync to cloud/services/ingest/vendor/openapi.yaml propagates the
removal.