mirror of https://github.com/comfyanonymous/ComfyUI.git synced 2026-02-11 05:52:33 +08:00

Author	SHA1	Message	Date
Sasbom	0ef5557d6a	Add QOL feature for changing the custom nodes folder location through cli args. bugfix: fix typo in apply_directory for custom_nodes_directory allow for PATH style ';' delimited custom_node directories. change delimiter type for separate folders per platform. feat(API-nodes): move Rodin3D nodes to new client; removed old api client.py (#10645) Fix qwen controlnet regression. (#10657) Enable pinned memory by default on Nvidia. (#10656) Removed the --fast pinned_memory flag. You can use --disable-pinned-memory to disable it. Please report if it causes any issues. Pinned mem also seems to work on AMD. (#10658) Remove environment variable. Removed environment variable fallback for custom nodes directory. Update documentation for custom nodes directory Clarified documentation on custom nodes directory argument, removed documentation on environment variable Clarify release cycle. (#10667) Tell users they need to upload their logs in bug reports. (#10671) mm: guard against double pin and unpin explicitly (#10672) As commented, if you let cuda be the one to detect double pin/unpinning it actually creates an async GPU error. Only unpin tensor if it was pinned by ComfyUI (#10677) Make ScaleROPE node work on Flux. (#10686) Add logging for model unloading. (#10692) Unload weights if vram usage goes up between runs. (#10690) ops: Put weight cast on the offload stream (#10697) This needs to be on the offload stream. This reproduced a black screen with low resolution images on a slow bus when using FP8. Update CI workflow to remove dead macOS runner. (#10704) * Update CI workflow to remove dead macOS runner. * revert * revert Don't pin tensor if not a torch.nn.parameter.Parameter (#10718) Update README.md for Intel Arc GPU installation, remove IPEX (#10729) IPEX is no longer needed for Intel Arc GPUs. Removing instruction to setup ipex.
mm/mp: always unload re-used but modified models (#10724) The partial unloader path in model re-use flow skips straight to the actual unload without any check of the patching UUID. This means that if you do an upscale flow with a model patch on an existing model, it will not apply your patchings. Fix by delaying the partial_unload until after the uuid checks. This is done by making partial_unload a model of partial_load where extra_mem is -ve. qwen: reduce VRAM usage (#10725) Clean up a bunch of stacked and no-longer-needed tensors on the QWEN VRAM peak (currently FFN). With this I go from OOMing at B=37x1328x1328 to being able to successfully run B=47 (RTX5090). Update Python 3.14 compatibility notes in README (#10730) Quantized Ops fixes (#10715) * offload support, bug fixes, remove mixins * add readme add PR template for API-Nodes (#10736) feat: add create_time dict to prompt field in /history and /queue (#10741) flux: reduce VRAM usage (#10737) Cleanup a bunch of stack tensors on Flux. This takes me from B=19 to B=22 for 1600x1600 on RTX5090. Better instructions for the portable. (#10743) Use same code for chroma and flux blocks so that optimizations are shared. (#10746) Fix custom nodes import error. (#10747) This should fix the import errors but will break if the custom nodes actually try to use the class. revert import reordering revert imports pt 2 Add left padding support to tokenizers. (#10753) chore(api-nodes): mark OpenAIDalle2 and OpenAIDalle3 nodes as deprecated (#10757) Revert "chore(api-nodes): mark OpenAIDalle2 and OpenAIDalle3 nodes as deprecated (#10757)" (#10759) This reverts commit `9a02382568`. Change ROCm nightly install command to 7.1 (#10764)	2025-11-17 06:16:21 +01:00
comfyanonymous	0f4ef3afa0	This seems to slow things down slightly on Linux. (#10624)	2025-11-03 21:47:14 -05:00
comfyanonymous	0652cb8e2d	Speed up torch.compile (#10620)	2025-11-03 17:37:12 -05:00
rattus	135fa49ec2	Small speed improvements to --async-offload (#10593) * ops: don't take an offload stream if you don't need one * ops: prioritize mem transfer The async offload stream's reason for existence is to transfer from RAM to GPU. The post processing compute steps are a bonus on the side stream, but if the compute stream is running a long kernel, it can stall the side stream, as it waits to type-cast the bias before transferring the weight. So do a pure xfer of the weight straight up, then do everything bias, then go back to fix the weight type and do weight patches.	2025-11-01 18:48:53 -04:00
comfyanonymous	c58c13b2ba	Fix torch compile regression on fp8 ops. (#10580)	2025-11-01 00:25:17 -04:00
comfyanonymous	906c089957	Fix small performance regression with fp8 fast and scaled fp8. (#10537)	2025-10-29 19:29:01 -04:00
rattus	ab7ab5be23	Fix Race condition in --async-offload that can cause corruption (#10501) * mm: factor out the current stream getter Make this a reusable function. * ops: sync the offload stream with the consumption of w&b This sync is necessary as pytorch will queue cuda async frees on the same stream that created the tensor. In the case of async offload, this will be on the offload stream. Weights and biases can go out of scope in python which then triggers the pytorch garbage collector to queue the free operation on the offload stream, possibly before the compute stream has used the weight. This causes a use-after-free on weight data leading to total corruption of some workflows. So sync the offload stream with the compute stream after the weight has been used so the free has to wait for the weight to be used. The cast_bias_weight is extended in a backwards compatible way with the new behaviour opt-in on a defaulted parameter. This handles custom node packs calling cast_bias_weight and defeatures async-offload for them (as they do not handle the race). The pattern is now: cast_bias_weight(... , offloadable=True) #This might be offloaded thing(weight, bias, ...) uncast_bias_weight(...) * controlnet: adopt new cast_bias_weight synchronization scheme This is necessary for safe async weight offloading. * mm: sync the last stream in the queue, not the next Currently this peeks ahead to sync the next stream in the queue of streams with the compute stream. This doesn't allow a lot of parallelization, as the end result is you can only get one weight load ahead regardless of how many streams you have. Rotate the loop logic here to synchronize the end of the queue before returning the next stream. This allows weights to be loaded ahead of the compute stream's position.	2025-10-29 17:17:46 -04:00
contentis	8817f8fc14	Mixed Precision Quantization System (#10498) * Implement mixed precision operations with a registry design and metadata for quant spec in checkpoint. * Updated design using Tensor Subclasses * Fix FP8 MM * An actually functional POC * Remove CK reference and ensure correct compute dtype * Update unit tests * ruff lint * Implement mixed precision operations with a registry design and metadata for quant spec in checkpoint. * Updated design using Tensor Subclasses * Fix FP8 MM * An actually functional POC * Remove CK reference and ensure correct compute dtype * Update unit tests * ruff lint * Fix missing keys * Rename quant dtype parameter * Rename quant dtype parameter * Fix unittests for CPU build	2025-10-28 16:20:53 -04:00
comfyanonymous	b4f30bd408	Pytorch is stupid. (#10398)	2025-10-19 01:25:35 -04:00
comfyanonymous	5b80addafd	Turn off cuda malloc by default when --fast autotune is turned on. (#10393)	2025-10-18 22:35:46 -04:00
comfyanonymous	9da397ea2f	Disable torch compiler for cast_bias_weight function (#10384) * Disable torch compiler for cast_bias_weight function * Fix torch compile.	2025-10-17 20:03:28 -04:00
comfyanonymous	b1293d50ef	workaround also works on cudnn 91200 (#10375)	2025-10-16 19:59:56 -04:00
comfyanonymous	19b466160c	Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (#10373)	2025-10-16 18:16:03 -04:00
comfyanonymous	3374e900d0	Faster workflow cancelling. (#10301)	2025-10-13 23:43:53 -04:00
comfyanonymous	139addd53c	More surgical fix for #10267 (#10276)	2025-10-09 16:37:35 -04:00
Kohaku-Blueleaf	7be2b49b6b	Fix LoRA Trainer bugs with FP8 models. (#9854) * Fix adapter weight init * Fix fp8 model training * Avoid inference tensor	2025-09-20 21:24:48 -04:00
contentis	e2d1e5dad9	Enable Convolution AutoTuning (#9301)	2025-09-01 20:33:50 -04:00
comfyanonymous	4e5c230f6a	Fix last commit not working on older pytorch. (#9346)	2025-08-14 23:44:02 -04:00
Xiangxi Guo (Ryan)	f0d5d0111f	Avoid torch compile graphbreak for older pytorch versions (#9344) Turns out torch.compile has some gaps in context manager decorator syntax support. I've sent patches to fix that in PyTorch, but it won't be available for all the folks running older versions of PyTorch, hence this trivial patch.	2025-08-14 23:41:37 -04:00
comfyanonymous	9df8792d4b	Make last PR not crash comfy on old pytorch. (#9324)	2025-08-13 15:12:41 -04:00
contentis	3da5a07510	SDPA backend priority (#9299)	2025-08-13 14:53:27 -04:00
comfyanonymous	111f583e00	Fallback to regular op when fp8 op throws exception. (#8761)	2025-07-02 00:57:13 -04:00
comfyanonymous	d42613686f	Fix issue with fp8 ops on some models. (#8045) _scaled_mm errors when an input is non contiguous.	2025-05-10 07:52:56 -04:00
comfyanonymous	ac10a0d69e	Make loras work with --async-offload (#7824)	2025-04-26 19:56:22 -04:00
comfyanonymous	0dcc75ca54	Add experimental --async-offload lowvram weight offloading. (#7820) This should speed up the lowvram mode a bit. It currently is only enabled when --async-offload is used but it will be enabled by default in the future if there are no problems.	2025-04-26 16:11:21 -04:00
comfyanonymous	9ad792f927	Basic support for hidream i1 model.	2025-04-15 17:35:05 -04:00
comfyanonymous	8a438115fb	add RMSNorm to comfy.ops	2025-04-14 18:00:33 -04:00
catboxanon	1714a4c158	Add CublasOps support (#7574) * CublasOps support * Guard CublasOps behind --fast arg	2025-04-12 18:29:15 -04:00
comfyanonymous	70e15fd743	No need for scale_input when fp8 matrix mult is disabled.	2025-03-07 04:49:20 -05:00
comfyanonymous	e1474150de	Support fp8_scaled diffusion models that don't use fp8 matrix mult.	2025-03-07 04:39:21 -05:00
comfyanonymous	4dc6709307	Rename argument in last commit and document the options.	2025-03-01 02:43:49 -05:00
Chenlei Hu	4d55f16ae8	Use enum list for --fast options (#7024)	2025-03-01 02:37:35 -05:00
comfyanonymous	cf0b549d48	--fast now takes a number as argument to indicate how fast you want it. The idea is that you can indicate how much quality vs speed you want. At the moment: --fast 2 enables fp16 accumulation if your pytorch supports it. --fast 5 enables fp8 matrix mult on fp8 models and the optimization above. --fast without a number enables all optimizations.	2025-02-28 02:48:20 -05:00
comfyanonymous	ab888e1e0b	Add add_weight_wrapper function to model patcher. Functions can now easily be added to wrap/modify model weights.	2025-02-12 05:55:35 -05:00
comfyanonymous	99a1fb6027	Make fast fp8 take a bit less peak memory.	2024-12-24 18:05:19 -05:00
Haoming	fbf68c4e52	clamp input (#5928)	2024-12-07 14:00:31 -05:00
comfyanonymous	915fdb5745	Fix lowvram edge case.	2024-10-22 16:34:50 -04:00
comfyanonymous	8ce2a1052c	Optimizations to --fast and scaled fp8.	2024-10-22 02:12:28 -04:00
comfyanonymous	0075c6d096	Mixed precision diffusion models with scaled fp8. This change adds support for diffusion models where all the linears are scaled fp8 while the other weights are the original precision.	2024-10-21 18:12:51 -04:00
comfyanonymous	83ca891118	Support scaled fp8 t5xxl model.	2024-10-20 22:27:00 -04:00
comfyanonymous	f9f9faface	Fixed model merging issue with scaled fp8.	2024-10-20 06:24:31 -04:00
comfyanonymous	a68bbafddb	Support diffusion models with scaled fp8 weights.	2024-10-19 23:47:42 -04:00
comfyanonymous	67158994a4	Use the lowvram cast_to function for everything.	2024-10-17 17:25:56 -04:00
comfyanonymous	e38c94228b	Add a weight_dtype fp8_e4m3fn_fast to the Diffusion Model Loader node. This is used to load weights in fp8 and use fp8 matrix multiplication.	2024-10-09 19:43:17 -04:00
comfyanonymous	9c41bc8d10	Remove useless line.	2024-09-23 02:32:29 -04:00
comfyanonymous	dc96a1ae19	Load controlnet in fp8 if weights are in fp8.	2024-09-21 04:50:12 -04:00
comfyanonymous	8ae23d8e80	Fix onnx export.	2024-08-23 17:52:47 -04:00
comfyanonymous	c7ee4b37a1	Try to fix some lora issues.	2024-08-22 15:32:18 -04:00
comfyanonymous	904bf58e7d	Make --fast work on pytorch nightly.	2024-08-21 14:01:41 -04:00
Svein Ove Aas	5f50263088	Replace use of .view with .reshape (#4522) When generating images with fp8_e4_m3 Flux and batch size >1, using --fast, ComfyUI throws a "view size is not compatible with input tensor's size and stride" error pointing at the first of these two calls to view. As reshape is semantically equivalent to view except for working on a broader set of inputs, there should be no downside to changing this. The only difference is that it clones the underlying data in cases where .view would error out. I have confirmed that the output still looks as expected, but cannot confirm that no mutable use is made of the tensors anywhere. Note that --fast is only marginally faster than the default.	2024-08-21 11:21:48 -04:00
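
The numeric `--fast` levels described in commit cf0b549d48 (2025-02-28) can be illustrated with a small standalone sketch. This is not ComfyUI's code: the function names, the threshold constants, and the `>=` comparison are assumptions made for illustration, based only on the wording of that commit message ("--fast 2", "--fast 5", and "--fast without a number enables all optimizations"). The later commit 4d55f16ae8 replaced the number with an enum list, which this sketch does not cover.

```python
import argparse

# Assumed thresholds, read off the commit message; the real values and
# comparison logic in ComfyUI at that commit may differ.
FP16_ACCUMULATION_LEVEL = 2
FP8_MATRIX_MULT_LEVEL = 5
ALL_OPTIMIZATIONS = 10**6  # hypothetical stand-in for "--fast" with no number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # nargs="?" makes the value optional: absent flag -> default,
    # bare "--fast" -> const, "--fast N" -> int(N).
    parser.add_argument(
        "--fast",
        nargs="?",
        type=int,
        default=0,
        const=ALL_OPTIMIZATIONS,
        metavar="LEVEL",
    )
    return parser


def enabled_optimizations(level: int) -> list[str]:
    """Return the (hypothetical) optimization names active at a given level."""
    enabled = []
    if level >= FP16_ACCUMULATION_LEVEL:
        enabled.append("fp16_accumulation")
    if level >= FP8_MATRIX_MULT_LEVEL:
        enabled.append("fp8_matrix_mult")
    return enabled


def parse_fast(argv: list[str]) -> list[str]:
    """Parse an argument list and report which optimizations it would enable."""
    return enabled_optimizations(build_parser().parse_args(argv).fast)
```

For example, `parse_fast(["--fast", "2"])` yields only the fp16 accumulation entry, while `parse_fast(["--fast"])` yields both entries, matching the "enables all optimizations" wording.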

76 Commits