Commit Graph

1774 Commits

Author SHA1 Message Date
Yousef Rafat
ea176bb87d basic support for hunyuan model 2025-11-20 23:47:57 +02:00
Yousef Rafat
b84af5b947 small attention fix 2025-11-17 23:03:52 +02:00
Yousef Rafat
3f71760913 resblock fix 2025-11-17 06:50:54 +02:00
Yousef Rafat
61b1efdaf0 vectrozied correct implementation of moe forward 2025-11-16 19:25:37 +02:00
Yousef Rafat
4a5509a4c5 . 2025-11-16 16:20:35 +02:00
Yousef Rafat
d731c58353 improving performance and fixing race condition 2025-11-16 16:19:39 +02:00
Yousef Rafat
12cc6924ac meta init 2025-11-14 20:10:52 +02:00
Yousef Rafat
7b4c1e8031 async cache revamp
Added an async loading and offloading of moe layers, having consistent memory with oom errors.
Used to give oom error after the third layer with 24 giga bytes gpu, now goes to the end with consistent memory with minimal latency
2025-11-14 09:15:16 +02:00
Yousef Rafat
44346c4251 removed all errors 2025-11-08 19:49:02 +02:00
Yousef Rafat
5056a1f4d4 important fixes 2025-11-06 00:24:49 +02:00
Yousef Rafat
9e9c536c8e fixes from testing 2025-11-04 23:55:16 +02:00
Yousef Rafat
ca119c44fb returned kv cache for image generation 2025-11-01 23:06:11 +02:00
Yousef Rafat
10a17dc85d a bunch of fixes 2025-11-01 16:40:49 +02:00
Yousef Rafat
1a25a0ad69 Merge branch 'yousef-hunyuan-image-3' of https://github.com/yousef-rafat/ComfyUI into yousef-hunyuan-image-3 2025-10-31 23:58:22 +02:00
Yousef Rafat
70f216bbd0 tiny bug 2025-10-31 23:58:01 +02:00
Yousef R. Gamaleldin
575fe3e92e
Merge branch 'master' into yousef-hunyuan-image-3 2025-10-31 23:55:42 +02:00
Yousef Rafat
a2fff60d4c vectorized implementation of moe/fixes for issues 2025-10-31 23:53:13 +02:00
comfyanonymous
7f374e42c8
ScaleROPE now works on Lumina models. (#10578) 2025-10-31 15:41:40 -04:00
Yousef Rafat
de43880bdb Hunyuan Image 3.0 2025-10-31 18:56:20 +02:00
comfyanonymous
27d1bd8829
Fix rope scaling. (#10560) 2025-10-30 22:51:58 -04:00
comfyanonymous
614cf9805e
Add a ScaleROPE node. Currently only works on WAN models. (#10559) 2025-10-30 22:11:38 -04:00
rattus
513b0c46fb
Add RAM Pressure cache mode (#10454)
* execution: Roll the UI cache into the outputs

Currently the UI cache is parallel to the output cache with
expectations of being a content superset of the output cache.
At the same time the UI and output cache are maintained completely
seperately, making it awkward to free the output cache content without
changing the behaviour of the UI cache.

There are two actual users (getters) of the UI cache. The first is
the case of a direct content hit on the output cache when executing a
node. This case is very naturally handled by merging the UI and outputs
cache.

The second case is the history JSON generation at the end of the prompt.
This currently works by asking the cache for all_node_ids and then
pulling the cache contents for those nodes. all_node_ids is the nodes
of the dynamic prompt.

So fold the UI cache into the output cache. The current UI cache setter
now writes to a prompt-scope dict. When the output cache is set, just
get this value from the dict and tuple up with the outputs.

When generating the history, simply iterate prompt-scope dict.

This prepares support for more complex caching strategies (like RAM
pressure caching) where less than 1 workflow will be cached and it
will be desirable to keep the UI cache and output cache in sync.

* sd: Implement RAM getter for VAE

* model_patcher: Implement RAM getter for ModelPatcher

* sd: Implement RAM getter for CLIP

* Implement RAM Pressure cache

Implement a cache sensitive to RAM pressure. When RAM headroom drops
down below a certain threshold, evict RAM-expensive nodes from the
cache.

Models and tensors are measured directly for RAM usage. An OOM score
is then computed based on the RAM usage of the node.

Note the due to indirection through shared objects (like a model
patcher), multiple nodes can account the same RAM as their individual
usage. The intent is this will free chains of nodes particularly
model loaders and associate loras as they all score similar and are
sorted in close to each other.

Has a bias towards unloading model nodes mid flow while being able
to keep results like text encodings and VAE.

* execution: Convert the cache entry to NamedTuple

As commented in review.

Convert this to a named tuple and abstract away the tuple type
completely from graph.py.
2025-10-30 17:39:02 -04:00
Jedrzej Kosinski
998bf60beb
Add units/info for the numbers displayed on 'load completely' and 'load partially' log messages (#10538) 2025-10-29 19:37:06 -04:00
comfyanonymous
906c089957
Fix small performance regression with fp8 fast and scaled fp8. (#10537) 2025-10-29 19:29:01 -04:00
comfyanonymous
25de7b1bfa
Try to fix slow load issue on low ram hardware with pinned mem. (#10536) 2025-10-29 17:20:27 -04:00
rattus
ab7ab5be23
Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is nessacary as pytorch will queue cuda async frees on the
same stream as created to tensor. In the case of async offload, this
will be on the offload stream.

Weights and biases can go out of scope in python which then
triggers the pytorch garbage collector to queue the free operation on
the offload stream possible before the compute stream has used the
weight. This causes a use after free on weight data leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight
has been used so the free has to wait for the weight to be used.

The cast_bias_weight is extended in a backwards compatible way with
the new behaviour opt-in on a defaulted parameter. This handles
custom node packs calling cast_bias_weight and defeatures
async-offload for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(... , offloadable=True) #This might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is nessacary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesnt allow a lot of
parallelization, as then end result is you can only get one weight load
ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute streams position.
2025-10-29 17:17:46 -04:00
comfyanonymous
ec4fc2a09a
Fix case of weights not being unpinned. (#10533) 2025-10-29 15:48:06 -04:00
comfyanonymous
1a58087ac2
Reduce memory usage for fp8 scaled op. (#10531) 2025-10-29 15:43:51 -04:00
comfyanonymous
e525673f72
Fix issue. (#10527) 2025-10-29 00:37:00 -04:00
comfyanonymous
3fa7a5c04a
Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
2025-10-29 00:21:01 -04:00
contentis
8817f8fc14
Mixed Precision Quantization System (#10498)
* Implement mixed precision operations with a registry design and metadate for quant spec in checkpoint.

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Implement mixed precision operations with a registry design and metadate for quant spec in checkpoint.

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Fix missing keys

* Rename quant dtype parameter

* Rename quant dtype parameter

* Fix unittests for CPU build
2025-10-28 16:20:53 -04:00
comfyanonymous
f6bbc1ac84
Fix mistake. (#10484) 2025-10-25 23:07:29 -04:00
comfyanonymous
098a352f13
Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
comfyanonymous
426cde37f1
Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
comfyanonymous
1bcda6df98
WIP way to support multi multi dimensional latents. (#10456) 2025-10-23 21:21:14 -04:00
comfyanonymous
9cdc64998f
Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
comfyanonymous
2c2aa409b0
Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
comfyanonymous
b4f30bd408
Pytorch is stupid. (#10398) 2025-10-19 01:25:35 -04:00
comfyanonymous
dad076aee6
Speed up chroma radiance. (#10395) 2025-10-18 23:19:52 -04:00
comfyanonymous
0cf33953a7
Fix batch size above 1 giving bad output in chroma radiance. (#10394) 2025-10-18 23:15:34 -04:00
comfyanonymous
5b80addafd
Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
comfyanonymous
9da397ea2f
Disable torch compiler for cast_bias_weight function (#10384)
* Disable torch compiler for cast_bias_weight function

* Fix torch compile.
2025-10-17 20:03:28 -04:00
comfyanonymous
b1293d50ef
workaround also works on cudnn 91200 (#10375) 2025-10-16 19:59:56 -04:00
comfyanonymous
19b466160c
Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (#10373) 2025-10-16 18:16:03 -04:00
Faych
afa8a24fe1
refactor: Replace manual patches merging with merge_nested_dicts (#10360) 2025-10-15 17:16:09 -07:00
Jedrzej Kosinski
493b81e48f
Fix order of inputs nested merge_nested_dicts (#10362) 2025-10-15 16:47:26 -07:00
comfyanonymous
1c10b33f9b
gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
comfyanonymous
3374e900d0
Faster workflow cancelling. (#10301) 2025-10-13 23:43:53 -04:00
comfyanonymous
dfff7e5332
Better memory estimation for the SD/Flux VAE on AMD. (#10334) 2025-10-13 22:37:19 -04:00
comfyanonymous
e4ea393666
Fix loading old stable diffusion ckpt files on newer numpy. (#10333) 2025-10-13 22:18:58 -04:00