Commit Graph

76 Commits

Author SHA1 Message Date
doctorpangloss
82388d51a2 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2025-06-17 10:35:10 -07:00
comfyanonymous
d42613686f
Fix issue with fp8 ops on some models. (#8045)
_scaled_mm errors when an input is non contiguous.
2025-05-10 07:52:56 -04:00
comfyanonymous
ac10a0d69e
Make loras work with --async-offload (#7824) 2025-04-26 19:56:22 -04:00
comfyanonymous
0dcc75ca54
Add experimental --async-offload lowvram weight offloading. (#7820)
This should speed up the lowvram mode a bit. It currently is only enabled when --async-offload is used but it will be enabled by default in the future if there are no problems.
2025-04-26 16:11:21 -04:00
doctorpangloss
5823497d55 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2025-04-21 13:14:36 -07:00
comfyanonymous
9ad792f927 Basic support for hidream i1 model. 2025-04-15 17:35:05 -04:00
comfyanonymous
8a438115fb add RMSNorm to comfy.ops 2025-04-14 18:00:33 -04:00
catboxanon
1714a4c158
Add CublasOps support (#7574)
* CublasOps support

* Guard CublasOps behind --fast arg
2025-04-12 18:29:15 -04:00
doctorpangloss
040a324346 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2025-03-29 15:57:24 -07:00
comfyanonymous
70e15fd743 No need for scale_input when fp8 matrix mult is disabled. 2025-03-07 04:49:20 -05:00
comfyanonymous
e1474150de Support fp8_scaled diffusion models that don't use fp8 matrix mult. 2025-03-07 04:39:21 -05:00
doctorpangloss
3c82be86d1 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2025-03-05 14:38:50 -08:00
comfyanonymous
4dc6709307 Rename argument in last commit and document the options. 2025-03-01 02:43:49 -05:00
Chenlei Hu
4d55f16ae8
Use enum list for --fast options (#7024) 2025-03-01 02:37:35 -05:00
comfyanonymous
cf0b549d48 --fast now takes a number as argument to indicate how fast you want it.
The idea is that you can indicate how much quality vs speed you want.

At the moment:

--fast 2 enables fp16 accumulation if your pytorch supports it.
--fast 5 enables fp8 matrix mult on fp8 models and the optimization above.

--fast without a number enables all optimizations.
2025-02-28 02:48:20 -05:00
doctorpangloss
693038738a Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2025-02-24 09:39:26 -08:00
comfyanonymous
ab888e1e0b Add add_weight_wrapper function to model patcher.
Functions can now easily be added to wrap/modify model weights.
2025-02-12 05:55:35 -05:00
doctorpangloss
9d5a5dd533 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2024-12-28 14:24:27 -08:00
comfyanonymous
99a1fb6027 Make fast fp8 take a bit less peak memory. 2024-12-24 18:05:19 -05:00
doctorpangloss
2d1676c717 Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2024-12-09 15:54:37 -08:00
Haoming
fbf68c4e52
clamp input (#5928) 2024-12-07 14:00:31 -05:00
doctorpangloss
31eacb6ac9 Improve compilation of models, adding support for triton 2024-11-01 10:40:58 -07:00
doctorpangloss
a8d8bff548 Improve support for torch compilation and sage attention 2024-10-29 19:22:26 -07:00
doctorpangloss
76a80a65ea Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2024-10-29 15:35:39 -07:00
comfyanonymous
915fdb5745 Fix lowvram edge case. 2024-10-22 16:34:50 -04:00
comfyanonymous
8ce2a1052c Optimizations to --fast and scaled fp8. 2024-10-22 02:12:28 -04:00
comfyanonymous
0075c6d096 Mixed precision diffusion models with scaled fp8.
This change allows supports for diffusion models where all the linears are
scaled fp8 while the other weights are the original precision.
2024-10-21 18:12:51 -04:00
comfyanonymous
83ca891118 Support scaled fp8 t5xxl model. 2024-10-20 22:27:00 -04:00
comfyanonymous
f9f9faface Fixed model merging issue with scaled fp8. 2024-10-20 06:24:31 -04:00
comfyanonymous
a68bbafddb Support diffusion models with scaled fp8 weights. 2024-10-19 23:47:42 -04:00
comfyanonymous
67158994a4 Use the lowvram cast_to function for everything. 2024-10-17 17:25:56 -04:00
doctorpangloss
8512f361fe Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2024-10-14 15:26:27 -07:00
doctorpangloss
f3da381869 Fix inference mode execution issues 2024-10-10 21:00:09 -07:00
doctorpangloss
a38968f098 Improvements to execution
- Validation errors that occur early in the lifecycle of prompt
   execution now get propagated to their callers in the
   EmbeddedComfyClient. This includes error messages about missing node
   classes.
 - The execution context now includes the node_id and the prompt_id
 - Latent previews are now sent with a node_id. This is not backwards
   compatible with old frontends.
 - Dependency execution errors are now modeled correctly.
 - Distributed progress encodes image previews with node and prompt IDs.
 - Typing for models
 - The frontend was updated to use node IDs with previews
 - Improvements to torch.compile experiments
 - Some controlnet_aux nodes were upstreamed
2024-10-10 19:30:18 -07:00
doctorpangloss
69e523b89d Experimental quantization support. Only Linux is meaningfully supported 2024-10-10 13:43:06 -07:00
comfyanonymous
e38c94228b Add a weight_dtype fp8_e4m3fn_fast to the Diffusion Model Loader node.
This is used to load weights in fp8 and use fp8 matrix multiplication.
2024-10-09 19:43:17 -04:00
doctorpangloss
fa3176f96f Merge branch 'master' of github.com:comfyanonymous/ComfyUI 2024-09-23 12:50:31 -07:00
comfyanonymous
9c41bc8d10 Remove useless line. 2024-09-23 02:32:29 -04:00
comfyanonymous
dc96a1ae19 Load controlnet in fp8 if weights are in fp8. 2024-09-21 04:50:12 -04:00
doctorpangloss
5155a3e248 Merge WIP 2024-08-25 18:52:29 -07:00
comfyanonymous
8ae23d8e80 Fix onnx export. 2024-08-23 17:52:47 -04:00
comfyanonymous
c7ee4b37a1 Try to fix some lora issues. 2024-08-22 15:32:18 -04:00
comfyanonymous
904bf58e7d Make --fast work on pytorch nightly. 2024-08-21 14:01:41 -04:00
Svein Ove Aas
5f50263088
Replace use of .view with .reshape (#4522)
When generating images with fp8_e4_m3 Flux and batch size >1, using --fast, ComfyUI throws a "view size is not compatible with input tensor's size and stride" error pointing at the first of these two calls to view.

As reshape is semantically equivalent to view except for working on a broader set of inputs, there should be no downside to changing this. The only difference is that it clones the underlying data in cases where .view would error out. I have confirmed that the output still looks as expected, but cannot confirm that no mutable use is made of the tensors anywhere.

Note that --fast is only marginally faster than the default.
2024-08-21 11:21:48 -04:00
comfyanonymous
03ec517afb Remove useless line, adjust windows default reserved vram. 2024-08-21 00:47:19 -04:00
comfyanonymous
510f3438c1 Speed up fp8 matrix mult by using better code. 2024-08-20 22:53:26 -04:00
comfyanonymous
9953f22fce Add --fast argument to enable experimental optimizations.
Optimizations that might break things/lower quality will be put behind
this flag first and might be enabled by default in the future.

Currently the only optimization is float8_e4m3fn matrix multiplication on
4000/ADA series Nvidia cards or later. If you have one of these cards you
will see a speed boost when using fp8_e4m3fn flux for example.
2024-08-20 11:55:51 -04:00
comfyanonymous
538cb068bc Make cast_to a nop if weight is already good. 2024-08-20 10:46:36 -04:00
comfyanonymous
39f114c44b Less broken non blocking? 2024-08-18 16:53:17 -04:00
comfyanonymous
6730f3e1a3
Disable non blocking.
It fixed some perf issues but caused other issues that need to be debugged.
2024-08-18 14:38:09 -04:00