ComfyUI/comfy
Jedrzej Kosinski aa464b36b3
Multi-GPU device selection for loader nodes + CUDA context fixes (#13483)
* Fix Hunyuan 3D 2.1 multi-GPU worksplit: use cond_or_uncond instead of a hardcoded chunk(2)
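
A rough illustration of the idea behind the fix, as a hypothetical helper
(not the actual Hunyuan 3D model code):

    import torch

    def split_by_cond_or_uncond(x, cond_or_uncond):
        # chunk(2) assumes the batch always holds exactly one cond half and
        # one uncond half, which breaks when the multi-GPU worksplit hands a
        # worker only one of the two. Splitting by the cond_or_uncond list
        # works for any combination present in the batch.
        per_group = x.shape[0] // len(cond_or_uncond)
        chunks = torch.split(x, per_group, dim=0)
        return dict(zip(cond_or_uncond, chunks))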

Amp-Thread-ID: https://ampcode.com/threads/T-019da964-2cc8-77f9-9aae-23f65da233db
Co-authored-by: Amp <amp@ampcode.com>

* Add GPU device selection to all loader nodes

- Add get_gpu_device_options() and resolve_gpu_device_option() helpers
  in model_management.py for vendor-agnostic GPU device selection (see
  the sketch after this list)
- Add device widget to CheckpointLoaderSimple, UNETLoader, VAELoader
- Expand device options in CLIPLoader, DualCLIPLoader, LTXAVTextEncoderLoader
  from [default, cpu] to include gpu:0, gpu:1, etc. on multi-GPU systems
- Wire load_diffusion_model_state_dict and load_state_dict_guess_config
  to respect model_options['load_device']
- Graceful fallback: unrecognized devices (e.g. gpu:1 on a single-GPU
  machine) silently fall back to the default
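
A minimal sketch of how the two helpers could look; the names come from
this commit, but the signatures and the CUDA-only handling are assumptions:

    import torch

    def get_gpu_device_options():
        # "default" keeps current behavior, "cpu" forces CPU, then one entry
        # per visible GPU (CUDA-only simplification of a vendor-agnostic list).
        options = ["default", "cpu"]
        if torch.cuda.is_available():
            options += ["gpu:%d" % i for i in range(torch.cuda.device_count())]
        return options

    def resolve_gpu_device_option(device):
        # Unknown or out-of-range selections (gpu:1 on a single-GPU machine,
        # a negative index) return None, which the caller treats as "default".
        if device == "cpu":
            return torch.device("cpu")
        if isinstance(device, str) and device.startswith("gpu:"):
            try:
                index = int(device.split(":", 1)[1])
            except ValueError:
                return None
            if 0 <= index < torch.cuda.device_count():
                return torch.device("cuda", index)
        return None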

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Add VALIDATE_INPUTS to skip device combo validation for workflow portability

When a workflow saved on a 2-GPU machine (with device=gpu:1) is loaded
on a 1-GPU machine, the combo validation would reject the unknown value.
Defining VALIDATE_INPUTS with a device parameter bypasses combo validation
for that input only, letting resolve_gpu_device_option handle the graceful
fallback at runtime.
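
A trimmed-down sketch of the pattern on one of the loader nodes
(hypothetical class, validation hook only):

    class UNETLoaderSketch:
        @classmethod
        def VALIDATE_INPUTS(cls, device):
            # Listing `device` as a parameter makes ComfyUI's validator skip
            # the normal combo-membership check for that input, so a workflow
            # saved with device="gpu:1" still validates on a one-GPU machine;
            # resolve_gpu_device_option() falls back to the default at runtime.
            return True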

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Set CUDA device context in outer_sample to match model load_device

Custom CUDA kernels (comfy_kitchen fp8 quantization) use
torch.cuda.current_device() for DLPack tensor export. When a model is
loaded on a non-default GPU (e.g. cuda:1), the CUDA context must match
or the kernel fails with 'Can't export tensors on a different CUDA
device index'. Save and restore the previous device around sampling.
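
As a hedged sketch, the pattern amounts to a wrapper like this (hypothetical
helper, not the literal samplers.py change):

    import torch

    def run_on_load_device(load_device, fn, *args, **kwargs):
        # Switch the CUDA context to the model's load_device for the duration
        # of fn(), then restore whichever device was current before.
        if not torch.cuda.is_available() or getattr(load_device, "type", "") != "cuda":
            return fn(*args, **kwargs)
        prev = torch.cuda.current_device()
        try:
            torch.cuda.set_device(load_device)
            return fn(*args, **kwargs)
        finally:
            torch.cuda.set_device(prev)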

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix code review bugs: negative index guard, CPU offload_device, checkpoint te_model_options

- resolve_gpu_device_option: reject negative indices (gpu:-1)
- UNETLoader: set offload_device when cpu is selected
- CheckpointLoaderSimple: pass te_model_options for CLIP device,
  set offload_device for cpu, pass load_device to VAE
- load_diffusion_model_state_dict: respect offload_device from model_options
- load_state_dict_guess_config: respect offload_device, pass load_device to VAE

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix CUDA device context for CLIP encoding and VAE encode/decode

Add torch.cuda.set_device() calls to match the model's load device in:
- CLIP.encode_from_tokens: fixes 'Can't export tensors on a different
  CUDA device index' when CLIP is loaded on a non-default GPU
- CLIP.encode_from_tokens_scheduled: same fix for the hooks code path
- CLIP.generate: same fix for text generation
- VAE.decode: fixes VAE decoding on non-default GPU
- VAE.encode: fixes VAE encoding on non-default GPU

Same pattern as the existing outer_sample fix in samplers.py: save and
restore the previous CUDA device in a try/finally block.

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Extract cuda_device_context manager, fix tiled VAE methods

Add model_management.cuda_device_context(), a context manager that saves
and restores the current CUDA device when operating on a non-default GPU.
It replaces six copies of the manual save/set/restore boilerplate.
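
Roughly, the context manager amounts to the following (a sketch; the
shipped version may differ in details):

    import contextlib
    import torch

    @contextlib.contextmanager
    def cuda_device_context(device):
        # No-op unless we are actually switching onto a CUDA device.
        if not torch.cuda.is_available() or getattr(device, "type", "") != "cuda":
            yield
            return
        prev = torch.cuda.current_device()
        try:
            torch.cuda.set_device(device)
            yield
        finally:
            torch.cuda.set_device(prev)

Call sites then wrap their work in "with cuda_device_context(device): ..."
instead of repeating the try/finally.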

Refactored call sites:
- CLIP.encode_from_tokens
- CLIP.encode_from_tokens_scheduled (hooks path)
- CLIP.generate
- VAE.decode
- VAE.encode
- samplers.outer_sample

Bug fixes (newly wrapped):
- VAE.decode_tiled: was missing device context entirely, would fail
  on non-default GPU when called from 'VAE Decode (Tiled)' node
- VAE.encode_tiled: same issue for 'VAE Encode (Tiled)' node

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Restore CheckpointLoaderSimple, add CheckpointLoaderDevice

Revert CheckpointLoaderSimple to its original form (no device input)
so it remains the simple default loader.

Add new CheckpointLoaderDevice node (advanced/loaders) with separate
model_device, clip_device, and vae_device inputs for per-component
GPU placement in multi-GPU setups.
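
A sketch of what the node's interface could look like (illustrative;
defaults and the loading logic itself are omitted and assumed):

    import comfy.model_management
    import folder_paths

    class CheckpointLoaderDevice:
        @classmethod
        def INPUT_TYPES(cls):
            devices = comfy.model_management.get_gpu_device_options()
            return {"required": {
                "ckpt_name": (folder_paths.get_filename_list("checkpoints"),),
                "model_device": (devices,),
                "clip_device": (devices,),
                "vae_device": (devices,),
            }}

        RETURN_TYPES = ("MODEL", "CLIP", "VAE")
        FUNCTION = "load_checkpoint"
        CATEGORY = "advanced/loaders"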

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

---------

Co-authored-by: Amp <amp@ampcode.com>
2026-04-23 19:10:33 -07:00
audio_encoders Fix fp16 audio encoder models (#12811) 2026-03-06 18:20:07 -05:00
cldm Add better error message for common error. (#10846) 2025-11-23 04:55:22 -05:00
comfy_types fix: use frontend-compatible format for Float gradient_stops (#12789) 2026-03-12 10:14:28 -07:00
extra_samplers Uni pc sampler now works with audio and video models. 2025-01-18 05:27:58 -05:00
image_encoders Add Hunyuan 3D 2.1 Support (#8714) 2025-09-04 20:36:20 -04:00
k_diffusion ace15: Use dynamic_vram friendly trange (#12409) 2026-02-11 14:53:42 -05:00
ldm Merge remote-tracking branch 'origin/master' into worksplit-multigpu 2026-04-20 02:38:33 -07:00
sd1_tokenizer Silence clip tokenizer warning. (#8934) 2025-07-16 14:42:07 -04:00
t2i_adapter Controlnet refactor. 2024-06-27 18:43:11 -04:00
taesd Support LTX2 tiny vae (taeltx_2) (#11929) 2026-01-21 23:03:51 -05:00
text_encoders Use ErnieTEModel_ not ErnieTEModel. (#13431) 2026-04-16 10:11:58 -04:00
weight_adapter MPDynamic: force load flux img_in weight (Fixes flux1 canny+depth lora crash) (#12446) 2026-02-15 20:30:09 -05:00
cli_args.py Merge branch 'master' into worksplit-multigpu 2026-03-30 06:24:55 -07:00
clip_config_bigg.json Fix potential issue with non clip text embeddings. 2024-07-30 14:41:13 -04:00
clip_model.py Support the siglip 2 naflex model as a clip vision model. (#11831) 2026-01-12 17:05:54 -05:00
clip_vision_config_g.json Add support for clip g vision model to CLIPVisionLoader. 2023-08-18 11:13:29 -04:00
clip_vision_config_h.json Add support for unCLIP SD2.x models. 2023-04-01 23:19:15 -04:00
clip_vision_config_vitl_336_llava.json Support llava clip vision model. 2025-03-06 00:24:43 -05:00
clip_vision_config_vitl_336.json support clip-vit-large-patch14-336 (#4042) 2024-07-17 13:12:50 -04:00
clip_vision_config_vitl.json Add support for unCLIP SD2.x models. 2023-04-01 23:19:15 -04:00
clip_vision_siglip2_base_naflex.json Support the siglip 2 naflex model as a clip vision model. (#11831) 2026-01-12 17:05:54 -05:00
clip_vision_siglip_384.json Support new flux model variants. 2024-11-21 08:38:23 -05:00
clip_vision_siglip_512.json Support 512 siglip model. 2025-04-05 07:01:01 -04:00
clip_vision.py Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading (#11845) 2026-02-01 01:01:11 -05:00
conds.py Cleanups to the last PR. (#12646) 2026-02-26 01:30:31 -05:00
context_windows.py Add slice_cond and per-model context window cond resizing (#12645) 2026-03-19 20:42:42 -07:00
controlnet.py Merge branch 'master' into worksplit-multigpu 2026-02-17 02:53:06 -08:00
diffusers_convert.py Remove useless code. 2025-01-24 06:15:54 -05:00
diffusers_load.py load_unet -> load_diffusion_model with a model_options argument. 2024-08-12 23:20:57 -04:00
float.py feat: Support mxfp8 (#12907) 2026-03-14 18:36:29 -04:00
gligen.py Remove some useless code. (#8812) 2025-07-06 07:07:39 -04:00
hooks.py New Year ruff cleanup. (#11595) 2026-01-01 22:06:14 -05:00
latent_formats.py Feat: z-image pixel space (model still training atm) (#12709) 2026-03-02 19:43:47 -05:00
lora_convert.py Use torch RMSNorm for flux models and refactor hunyuan video code. (#12432) 2026-02-13 15:35:13 -05:00
lora.py Fix text encoder lora loading for wrapped models (#12852) 2026-03-09 16:08:51 -04:00
memory_management.py Integrate RAM cache with model RAM management (#13173) 2026-03-27 21:34:16 -04:00
model_base.py Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
model_detection.py Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
model_management.py Multi-GPU device selection for loader nodes + CUDA context fixes (#13483) 2026-04-23 19:10:33 -07:00
model_patcher.py Merge remote-tracking branch 'origin/master' into worksplit-multigpu 2026-04-20 02:38:33 -07:00
model_sampling.py initial FlowRVS support (#12637) 2026-02-25 23:38:46 -05:00
multigpu.py Implement persistent thread pool for multi-GPU CFG splitting (#13329) 2026-04-08 05:39:07 -07:00
nested_tensor.py WIP way to support multi multi dimensional latents. (#10456) 2025-10-23 21:21:14 -04:00
ops.py Fix OOM regression in _apply() for quantized models during inference (#13372) 2026-04-15 02:10:36 -07:00
options.py Only parse command line args when main.py is called. 2023-09-13 11:38:20 -04:00
patcher_extension.py Merge branch 'master' into worksplit-multigpu 2025-10-15 17:33:02 -07:00
pinned_memory.py Integrate RAM cache with model RAM management (#13173) 2026-03-27 21:34:16 -04:00
pixel_space_convert.py Changes to the previous radiance commit. (#9851) 2025-09-13 18:03:34 -04:00
quant_ops.py Re-enable comfy-kitchen cuda backend for multigpu testing 2026-03-30 08:32:52 -07:00
rmsnorm.py Remove code to support RMSNorm on old pytorch. (#12499) 2026-02-16 20:09:24 -05:00
sample.py Fix fp16 intermediates giving different results. (#13100) 2026-03-21 17:53:25 -04:00
sampler_helpers.py Implement persistent thread pool for multi-GPU CFG splitting (#13329) 2026-04-08 05:39:07 -07:00
samplers.py Multi-GPU device selection for loader nodes + CUDA context fixes (#13483) 2026-04-23 19:10:33 -07:00
sd1_clip_config.json Fix potential issue with non clip text embeddings. 2024-07-30 14:41:13 -04:00
sd1_clip.py feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
sd.py Multi-GPU device selection for loader nodes + CUDA context fixes (#13483) 2026-04-23 19:10:33 -07:00
sdxl_clip.py Add a T5TokenizerOptions node to set options for the T5 tokenizer. (#7803) 2025-04-25 19:36:00 -04:00
supported_models_base.py Fix some custom nodes. (#11134) 2025-12-05 18:25:31 -05:00
supported_models.py Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
utils.py Reduce tiled decode peak memory (#13050) 2026-03-19 13:29:34 -04:00
windows.py Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading (#11845) 2026-02-01 01:01:11 -05:00