- ModelPatcher.deepclone_multigpu: remove copy.deepcopy fallback. Require
cached_patcher_init (raise a descriptive RuntimeError if missing) and
always go through clone(model_override=...) with empty backup containers
so the per-device clone owns a pristine, unpatched module instead of a
deepcopy of an already-loaded/already-patched one. Also call
register_load_device on the new patcher so ModelPatcherDynamic per-device
bookkeeping (e.g. dynamic_pins) is populated for the requested load
device.
- comfy/sd.py: register cached_patcher_init on the CLIP and VAE patchers
returned by load_checkpoint_guess_config, and on the patcher returned by
load_diffusion_model's companion paths. Add load_checkpoint_clip_patcher,
load_checkpoint_vae_patcher, and load_vae_patcher reload helpers so the
same loader context can be reused to produce per-device clones.
- nodes.py: VAELoader registers cached_patcher_init on the produced VAE's
patcher when there is a single backing file (skipped for pixel_space and
composite image-TAESDs, which aren't addressable by a single path).
- comfy_extras/nodes_multigpu.py: SelectModelDevice / SelectCLIPDevice /
SelectVAEDevice now retarget via deepclone_multigpu when the requested
device differs from the current load_device, so the consumed model is
not just relabeled but actually rehomed onto the chosen device.
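
For reviewers, the clone flow described in the first bullet can be sketched
roughly as below. This is a toy model, not the actual ModelPatcher code:
ToyPatcher, load_toy, the (factory, args) shape of cached_patcher_init, and
the "model.safetensors" path are all hypothetical stand-ins chosen for
illustration.

```python
from typing import Callable, Optional, Tuple


class ToyPatcher:
    """Hypothetical stand-in for a model patcher; NOT ComfyUI's ModelPatcher."""

    def __init__(self, model: dict, load_device: str):
        self.model = model
        self.load_device = load_device
        self.backup: dict = {}  # original weights saved while patches are applied
        self.registered_devices: set = set()
        self.register_load_device(load_device)
        # Assumed shape: (factory, args) that rebuilds an unpatched patcher.
        self.cached_patcher_init: Optional[Tuple[Callable[..., "ToyPatcher"], tuple]] = None

    def register_load_device(self, device: str) -> None:
        self.registered_devices.add(device)

    def clone(self, model_override: Optional[dict] = None) -> "ToyPatcher":
        model = self.model if model_override is None else model_override
        n = ToyPatcher(model, self.load_device)
        n.cached_patcher_init = self.cached_patcher_init
        return n

    def deepclone_multigpu(self, new_load_device: str) -> "ToyPatcher":
        # No deepcopy fallback: a reload factory is mandatory.
        if self.cached_patcher_init is None:
            raise RuntimeError(
                "deepclone_multigpu requires cached_patcher_init; register a "
                "reload factory before requesting a per-device clone"
            )
        factory, args = self.cached_patcher_init
        fresh = factory(*args)  # pristine, unpatched module
        n = self.clone(model_override=fresh.model)
        n.backup = {}  # empty backup containers: nothing is patched yet
        n.load_device = new_load_device
        n.register_load_device(new_load_device)
        return n


def load_toy(path: str) -> ToyPatcher:
    """Hypothetical loader that registers itself as the reload factory."""
    p = ToyPatcher({"weight": 1.0, "source": path}, "cuda:0")
    p.cached_patcher_init = (load_toy, (path,))
    return p


base = load_toy("model.safetensors")
base.backup["weight"] = base.model["weight"]
base.model["weight"] = 1.5  # simulate an in-place patch on the loaded model
other = base.deepclone_multigpu("cuda:1")
```

In this sketch the clone is rebuilt from the factory, so the patched weight on
`base` does not carry over to `other`, which a deepcopy of `base` would have
inherited.
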

Verified on runner-2 (2x RTX 4090, comfy-aimdo 0.4.4):
- 10/10 focused unit tests (deepclone behavior, missing-factory error path,
Select*Device behavior).
- Device-switch-after-consumption end-to-end run (SD1.5) produces
bit-identical PNGs on cuda:0 and cuda:1.
- Z Image multigpu CFG split: ~1.90x speedup (10.5s vs 19.9s steady).
- Qwen Image multigpu CFG split (real text negative, cfg=4): ~1.69x
speedup (32.5s vs 54.8s steady), which matches pre-refactor numbers.
- Baseline (patch stashed) and patched produce identical timings on both
models, so the refactor is performance-neutral.

Amp-Thread-ID: https://ampcode.com/threads/T-019e5783-b810-74b1-8ca9-09d675de1479
Co-authored-by: Amp <amp@ampcode.com>