EasyAI Code Hosting Platform

mirror of https://github.com/comfyanonymous/ComfyUI.git synced 2026-05-25 00:17:23 +08:00

Author	SHA1	Message	Date
Jedrzej Kosinski	2ed396c769	Mark non-NVIDIA multigpu gaps with TODOs in _handle_batch Two CodeRabbit findings from #7063 (#13 and #14) are deferred because worksplit-multigpu's initial release scope is NVIDIA-only QA. Leave a TODO at the unconditional torch.cuda.set_device call and at the post-aggregation point so the required guards/synchronize are easy to find when multigpu support is extended to XPU/NPU/MPS/CPU/DirectML. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>	2026-05-21 12:47:43 -07:00
Kosinkadink	822a3ecf73	Note _calc_cond_batch and _calc_cond_batch_multigpu must stay in sync Per review feedback on #7063. The two functions share the conds-by-hooks accumulation, memory-fit batching, and per-chunk output aggregation; the multigpu variant adds per-device scheduling, .to(device) placement, per-device patcher/control lookup, and thread-pool dispatch around the inner loop. Documenting the relationship without extracting helpers -- extraction can land after the initial worksplit-multigpu release once both paths have settled. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>	2026-05-21 11:47:53 -07:00
Kosinkadink	a18dd219d5	Pass per-device model to multigpu control clones in pre_run_control QwenFunControlNet.pre_run stashes model.diffusion_model into extra_args, which the control_model then uses for forward passes (img_in, txt_in, pe_embedder, time_text_embed). With multigpu, every per-device control clone was being pre_run with the base model on GPU0, so secondary devices would invoke those modules with parameters on GPU0 and inputs on their own device, raising 'Expected all tensors to be on the same device'. Build a device -> per-device BaseModel lookup from the patcher's additional multigpu models and pass each clone the model on its own device. Falls back to the base model when no per-device match is found (single-GPU path and the case where cnet.multigpu_clones lags the patcher's clone set). Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>	2026-05-21 11:40:49 -07:00
Jedrzej Kosinski	ac0a90c323	Use cond_shapes in multigpu memory-fit check (parity with single-GPU path) The multigpu cond-batching loop called model.memory_required(input_shape) without conditioning shapes, while the single-GPU path at line 279 passes cond_shapes. Large conditioning tensors (e.g. video prompts, control inputs) were therefore under-counted, risking OOM at runtime when the chosen batch size was too large. Match the single-GPU pattern by building cond_shapes from each batched cond's conditioning dict and passing it to memory_required. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>	2026-05-20 19:52:03 -07:00
Jedrzej Kosinski	819c7c0702	Refactor MultiGPU scheduler for readability and termination safety (#14001) Behaviour-equivalent cleanup of _calc_cond_batch_multigpu device scheduling. No change to batching decisions or memory checks for any valid input. Changes: * Replace re-summed batched_to_run_length with a per-device load dict (device_load), so capacity checks are O(1) and use a single source of truth. * Extract device selection into next_available_device(), which scans at most len(devices) positions and raises if no device has remaining capacity. This makes the 'skip a full device' rule live in one place instead of two and guarantees the outer while loop cannot spin forever on a scheduling bug. * Drop the unused current_device assignment before the outer loop and the index_device % len(devices) modulo dance (now handled inside next_available_device). * Minor cleanups: list comprehensions for total_conds, conds_to_batch, and the devices list.	2026-05-19 21:23:56 -07:00
Jedrzej Kosinski	9e3ede1406	Fix MultiGPU scheduler capacity accounting (#14000) Fixes _calc_cond_batch_multigpu so that: 1. conds_per_device uses real division before math.ceil. The previous expression math.ceil(total_conds // len(devices)) applied integer floor division first, making ceil a no-op. For 3 conds across 2 devices this produced conds_per_device=1 instead of 2. 2. The scheduling loop skips devices that have already reached capacity instead of appending empty batch groups. Without this guard, the loop could repeatedly emit zero-length groups for a full device, leaving sampling stuck at 0/N until timeout. Reproduces with an Omnigen2 image workflow that produces three condition entries scheduled across two CUDA devices. With the fix the scheduler assigns conds_per_device=2 and splits the batches as 2 + 1 across the two devices, allowing sampling to complete. Original fix authored and validated by @pollockjj in pollockjj/ComfyUI#64. Co-authored-by: John Pollock <pollockjj@gmail.com>	2026-05-19 20:11:53 -07:00
Jedrzej Kosinski	aa464b36b3	Multi-GPU device selection for loader nodes + CUDA context fixes (#13483 ) * Fix Hunyuan 3D 2.1 multi-GPU worksplit: use cond_or_uncond instead of hardcoded chunk(2) Amp-Thread-ID: https://ampcode.com/threads/T-019da964-2cc8-77f9-9aae-23f65da233db Co-authored-by: Amp <amp@ampcode.com> * Add GPU device selection to all loader nodes - Add get_gpu_device_options() and resolve_gpu_device_option() helpers in model_management.py for vendor-agnostic GPU device selection - Add device widget to CheckpointLoaderSimple, UNETLoader, VAELoader - Expand device options in CLIPLoader, DualCLIPLoader, LTXAVTextEncoderLoader from [default, cpu] to include gpu:0, gpu:1, etc. on multi-GPU systems - Wire load_diffusion_model_state_dict and load_state_dict_guess_config to respect model_options['load_device'] - Graceful fallback: unrecognized devices (e.g. gpu:1 on single-GPU) silently fall back to default Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a Co-authored-by: Amp <amp@ampcode.com> * Add VALIDATE_INPUTS to skip device combo validation for workflow portability When a workflow saved on a 2-GPU machine (with device=gpu:1) is loaded on a 1-GPU machine, the combo validation would reject the unknown value. VALIDATE_INPUTS with the device parameter bypasses combo validation for that input only, allowing resolve_gpu_device_option to handle the graceful fallback at runtime. Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a Co-authored-by: Amp <amp@ampcode.com> * Set CUDA device context in outer_sample to match model load_device Custom CUDA kernels (comfy_kitchen fp8 quantization) use torch.cuda.current_device() for DLPack tensor export. When a model is loaded on a non-default GPU (e.g. cuda:1), the CUDA context must match or the kernel fails with 'Can't export tensors on a different CUDA device index'. Save and restore the previous device around sampling. 
Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a Co-authored-by: Amp <amp@ampcode.com> * Fix code review bugs: negative index guard, CPU offload_device, checkpoint te_model_options - resolve_gpu_device_option: reject negative indices (gpu:-1) - UNETLoader: set offload_device when cpu is selected - CheckpointLoaderSimple: pass te_model_options for CLIP device, set offload_device for cpu, pass load_device to VAE - load_diffusion_model_state_dict: respect offload_device from model_options - load_state_dict_guess_config: respect offload_device, pass load_device to VAE Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a Co-authored-by: Amp <amp@ampcode.com> * Fix CUDA device context for CLIP encoding and VAE encode/decode Add torch.cuda.set_device() calls to match model's load device in: - CLIP.encode_from_tokens: fixes 'Can't export tensors on a different CUDA device index' when CLIP is loaded on a non-default GPU - CLIP.encode_from_tokens_scheduled: same fix for the hooks code path - CLIP.generate: same fix for text generation - VAE.decode: fixes VAE decoding on non-default GPU - VAE.encode: fixes VAE encoding on non-default GPU Same pattern as the existing outer_sample fix in samplers.py - saves and restores previous CUDA device in a try/finally block. Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57 Co-authored-by: Amp <amp@ampcode.com> * Extract cuda_device_context manager, fix tiled VAE methods Add model_management.cuda_device_context() — a context manager that saves/restores torch.cuda.current_device when operating on a non-default GPU. Replaces 6 copies of the manual save/set/restore boilerplate. 
Refactored call sites: - CLIP.encode_from_tokens - CLIP.encode_from_tokens_scheduled (hooks path) - CLIP.generate - VAE.decode - VAE.encode - samplers.outer_sample Bug fixes (newly wrapped): - VAE.decode_tiled: was missing device context entirely, would fail on non-default GPU when called from 'VAE Decode (Tiled)' node - VAE.encode_tiled: same issue for 'VAE Encode (Tiled)' node Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57 Co-authored-by: Amp <amp@ampcode.com> * Restore CheckpointLoaderSimple, add CheckpointLoaderDevice Revert CheckpointLoaderSimple to its original form (no device input) so it remains the simple default loader. Add new CheckpointLoaderDevice node (advanced/loaders) with separate model_device, clip_device, and vae_device inputs for per-component GPU placement in multi-GPU setups. Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57 Co-authored-by: Amp <amp@ampcode.com> --------- Co-authored-by: Amp <amp@ampcode.com>	2026-04-23 19:10:33 -07:00
Jedrzej Kosinski	48deb15c0e	Simplify multigpu dispatch: run all devices on pool threads (#13340) Benchmarked hybrid (main thread + pool) vs all-pool on 2x RTX 4090 with SD1.5 and NetaYume models. No meaningful performance difference (within noise). All-pool is simpler: eliminates the main_device special case, main_batch_tuple deferred execution, and the 3-way branch in the dispatch loop.	2026-04-09 01:15:57 -07:00
Jedrzej Kosinski	4b93c4360f	Implement persistent thread pool for multi-GPU CFG splitting (#13329) Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device() once at startup, preserving compiled kernel caches across diffusion steps. - Add MultiGPUThreadPool class in comfy/multigpu.py - Create pool in CFGGuider.outer_sample(), shut down in finally block - Main thread handles its own device batch directly for zero overhead - Falls back to sequential execution if no pool is available	2026-04-08 05:39:07 -07:00
Jedrzej Kosinski	84f465e791	Set CUDA device at start of multigpu threads to avoid multithreading bugs Amp-Thread-ID: https://ampcode.com/threads/T-019d3ee9-19d5-767a-9d7a-e50cbbef815b Co-authored-by: Amp <amp@ampcode.com>	2026-03-30 07:07:54 -07:00
Jedrzej Kosinski	be35378986	Merge branch 'master' into worksplit-multigpu Amp-Thread-ID: https://ampcode.com/threads/T-019d3ee9-19d5-767a-9d7a-e50cbbef815b Co-authored-by: Amp <amp@ampcode.com> # Conflicts: # comfy/samplers.py	2026-03-30 06:24:55 -07:00
comfyanonymous	b5d32e6ad2	Fix sampling issue with fp16 intermediates. (#13099)	2026-03-21 17:47:42 -04:00
Jedrzej Kosinski	f410d28b33	Merge origin/master into worksplit-multigpu Amp-Thread-ID: https://ampcode.com/threads/T-019d009d-e059-7623-85ca-401168168516 Co-authored-by: Amp <amp@ampcode.com>	2026-03-18 04:21:30 -07:00
rattus	5f41584e96	Disable dynamic_vram when weight hooks applied (#12653) * sd: add support for clip model reconstruction * nodes: SetClipHooks: Demote the dynamic model patcher * mp: Make dynamic_disable more robust The backup needs to not be cloned. In addition add a delegate object to ModelPatcherDynamic so that non-cloning code can do ModelPatcherDynamic demotion * sampler_helpers: Demote to non-dynamic model patcher when hooking * code rabbit review comments	2026-02-28 16:50:18 -05:00
Jedrzej Kosinski	f4b99bc623	Made multigpu deepclone load model from disk to avoid needing to deepclone actual model object, fixed issues with merge, turn off cuda backend as it causes device mismatch issue with rope (and potentially other ops), will investigate	2026-02-17 04:55:00 -08:00
Jedrzej Kosinski	df2fd4c869	Merge branch 'master' into worksplit-multigpu	2026-02-17 02:53:06 -08:00
rattus	f8acd9c402	Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading (#11845)	2026-02-01 01:01:11 -05:00
comfyanonymous	809ce68749	Support nested tensor denoise masks. (#11431)	2025-12-19 19:59:25 -05:00
chaObserv	827bb1512b	Add exp_heun_2_x0 sampler series (#11360)	2025-12-16 23:35:43 -05:00
comfyanonymous	1bcda6df98	WIP way to support multi multi dimensional latents. (#10456)	2025-10-23 21:21:14 -04:00
Jedrzej Kosinski	4661d1db5a	Bring patches changes from _calc_cond_batch into _calc_cond_batch_multigpu	2025-10-15 17:34:36 -07:00
Jedrzej Kosinski	b326a544d5	Merge branch 'master' into worksplit-multigpu	2025-10-15 17:33:02 -07:00
Faych	afa8a24fe1	refactor: Replace manual patches merging with merge_nested_dicts (#10360)	2025-10-15 17:16:09 -07:00
Jedrzej Kosinski	8cbbf0be6c	Merge branch 'master' into worksplit-multigpu	2025-10-13 21:53:14 -07:00
Jedrzej Kosinski	196954ab8c	Add 'input_cond' and 'input_uncond' to the args dictionary passed into sampler_cfg_function (#10044)	2025-09-26 19:55:03 -07:00
Jedrzej Kosinski	9e9c129cd0	Merge remote-tracking branch 'origin/master' into worksplit-multigpu	2025-08-29 23:36:19 -07:00
Gangin Park	3aad339b63	Add DPM++ 2M SDE Heun (RES) sampler (#9542)	2025-08-27 19:07:31 -04:00
comfyanonymous	41048c69b4	Fix Conditioning masks on 3d latents. (#9506)	2025-08-22 23:15:44 -04:00
Jedrzej Kosinski	fc247150fe	Implement EasyCache and Invent LazyCache (#9496) * Attempting a universal implementation of EasyCache, starting with flux as test; I screwed up the math a bit, but when I set it just right it works. * Fixed math to make threshold work as expected, refactored code to use EasyCacheHolder instead of a dict wrapped by object * Use sigmas from transformer_options instead of timesteps to be compatible with a greater amount of models, make end_percent work * Make log statement when not skipping useful, preparing for per-cond caching * Added DIFFUSION_MODEL wrapper around forward function for wan model * Add subsampling for heuristic inputs * Add subsampling to output_prev (output_prev_subsampled now) * Properly consider conds in EasyCache logic * Created SuperEasyCache to test what happens if caching and reuse is moved outside the scope of conds, added PREDICT_NOISE wrapper to facilitate this test * Change max reuse_threshold to 3.0 * Mark EasyCache/SuperEasyCache as experimental (beta) * Make Lumina2 compatible with EasyCache * Add EasyCache support for Qwen Image * Fix missing comma, curse you Cursor * Add EasyCache support to AceStep * Add EasyCache support to Chroma * Added EasyCache support to Cosmos Predict t2i * Make EasyCache not crash with Cosmos Predict ImageToVideo latents, but does not work well at all * Add EasyCache support to hidream * Added EasyCache support to hunyuan video * Added EasyCache support to hunyuan3d * Added EasyCache support to LTXV (not very good, but does not crash) * Implemented EasyCache for aura_flow * Renamed SuperEasyCache to LazyCache, hardcoded subsample_factor to 8 on nodes * Extra logging when verbose is true for EasyCache	2025-08-22 22:41:08 -04:00
Jedrzej Kosinski	1489399cb5	Merge branch 'master' into worksplit-multigpu	2025-08-13 19:47:08 -07:00
Jedrzej Kosinski	e4f7ea105f	Added context window support to core sampling code (#9238) * Added initial support for basic context windows - in progress * Add prepare_sampling wrapper for context window to more accurately estimate latent memory requirements, fixed merging wrappers/callbacks dicts in prepare_model_patcher * Made context windows compatible with different dimensions; works for WAN, but results are bad * Fix comfy.patcher_extension.merge_nested_dicts calls in prepare_model_patcher in sampler_helpers.py * Considering adding some callbacks to context window code to allow extensions of behavior without the need to rewrite code * Made dim slicing cleaner * Add Wan Context Windows node for testing * Made context schedule and fuse method functions be stored on the handler instead of needing to be registered in core code to be found * Moved some code around between node_context_windows.py and context_windows.py * Change manual context window nodes names/ids * Added callbacks to IndexListContexHandler * Adjusted default values for context_length and context_overlap, made schema.inputs definition for WAN Context Windows less annoying * Make get_resized_cond more robust for various dim sizes * Fix typo * Another small fix	2025-08-13 21:33:05 -04:00
Jedrzej Kosinski	b4f559b34d	Merge branch 'master' into worksplit-multigpu	2025-08-04 20:23:19 -07:00
comfyanonymous	182f90b5ec	Lower cond vram use by casting at the same time as device transfer. (#9159)	2025-08-04 03:11:53 -04:00
kosinkadink1@gmail.com	9855baaab3	Merge branch 'master' into worksplit-multigpu	2025-07-09 03:57:30 -05:00
chaObserv	aac10ad23a	Add SA-Solver sampler (#8834)	2025-07-08 16:17:06 -04:00
City	d9277301d2	Initial code for new SLG node (#8759)	2025-07-02 20:13:43 -04:00
Jedrzej Kosinski	d53479a197	Merge branch 'master' into worksplit-multigpu	2025-07-01 17:33:05 -05:00
comfyanonymous	396454fa41	Reorder the schedulers so simple is the default one. (#8722)	2025-06-28 18:12:56 -04:00
Jedrzej Kosinski	431dec8e53	Merge branch 'worksplit-multigpu' of https://github.com/comfyanonymous/ComfyUI into worksplit-multigpu	2025-06-24 00:48:58 -05:00
Jedrzej Kosinski	44e053c26d	Improve error handling for multigpu threads	2025-06-24 00:48:51 -05:00
kosinkadink1@gmail.com	0336b0ace8	Merge branch 'master' into worksplit-multigpu	2025-06-01 02:39:26 -07:00
comfyanonymous	06c661004e	Memory estimation code can now take into account conds. (#8307)	2025-05-27 15:09:05 -04:00
Jedrzej Kosinski	9726eac475	Merge branch 'master' into worksplit-multigpu	2025-05-12 19:29:13 -05:00
chaObserv	c15909bb62	CFG++ for gradient estimation sampler (#7809)	2025-04-28 13:51:35 -04:00
Jedrzej Kosinski	adc66c0698	Merge branch 'master' into worksplit-multigpu	2025-04-16 14:23:56 -05:00
chaObserv	e51d9ba5fc	Add SEEDS (stage 2 & 3 DP) sampler (#7580) * Add seeds stage 2 & 3 (DP) sampler * Change the name to SEEDS in comment	2025-04-12 18:36:08 -04:00
Jedrzej Kosinski	cc928a786d	Merge branch 'master' into worksplit-multigpu	2025-03-13 20:59:11 -05:00
chaObserv	01015bff16	Add er_sde sampler (#7187)	2025-03-12 02:42:37 -04:00
Jedrzej Kosinski	6dca17bd2d	Satisfy ruff linting	2025-03-03 23:08:29 -06:00
Jedrzej Kosinski	5080105c23	Merge branch 'master' into worksplit-multigpu	2025-03-03 22:56:53 -06:00
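
The capacity fix described in commit 9e3ede1406 comes down to operator order: floor division runs before math.ceil, so the ceiling has nothing left to round up. The sketch below isolates that one expression. The two function names are hypothetical and exist only for illustration; the real expression lives inside _calc_cond_batch_multigpu, whose body is not shown in this log.

```python
import math


def conds_per_device_buggy(total_conds: int, num_devices: int) -> int:
    # Hypothetical helper: reproduces the old expression from the commit message.
    # Floor division runs first, so math.ceil receives an int and changes nothing.
    return math.ceil(total_conds // num_devices)


def conds_per_device_fixed(total_conds: int, num_devices: int) -> int:
    # Hypothetical helper: true division keeps the fraction for math.ceil to round up.
    return math.ceil(total_conds / num_devices)


# The case cited in the commit message: 3 conds across 2 devices.
print(conds_per_device_buggy(3, 2))  # 1, so total capacity 1 + 1 cannot hold 3 conds
print(conds_per_device_fixed(3, 2))  # 2, which allows the 2 + 1 split
```

With the buggy value the two devices can hold only two conds between them, which matches the commit's description of a scheduler that keeps revisiting full devices.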

238 Commits
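
Commit 819c7c0702 describes a next_available_device() helper that scans at most len(devices) positions and raises when no device has remaining capacity, backed by a per-device load dict (device_load). A minimal sketch of that idea follows. The signature, the schedule() wrapper, and the one-cond-at-a-time round-robin policy are assumptions made for illustration; the actual scheduler also groups conds into memory-fit batches, which is omitted here.

```python
import math


def next_available_device(devices, device_load, conds_per_device, start_index):
    """Return (index, device) for the first device with spare capacity.

    Assumed signature, not the real one. Scans at most len(devices)
    positions from start_index and raises if every device is full, so a
    calling loop cannot spin forever.
    """
    for offset in range(len(devices)):
        index = (start_index + offset) % len(devices)
        device = devices[index]
        if device_load[device] < conds_per_device:
            return index, device
    raise RuntimeError("no device has remaining capacity")


def schedule(conds, devices):
    """Hypothetical wrapper: assign conds round-robin, skipping full devices."""
    conds_per_device = math.ceil(len(conds) / len(devices))
    device_load = {device: 0 for device in devices}  # single source of truth
    assignment = {device: [] for device in devices}
    index = 0
    for cond in conds:
        index, device = next_available_device(
            devices, device_load, conds_per_device, index
        )
        assignment[device].append(cond)
        device_load[device] += 1
        index += 1  # start the next scan at the following device
    return assignment


result = schedule(["cond_a", "cond_b", "cond_c"], ["cuda:0", "cuda:1"])
print({device: len(conds) for device, conds in result.items()})
```

Keeping the load count in one dict means the capacity check is a single lookup, and the bounded scan turns a scheduling bug into an immediate exception rather than a hang.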