# ComfyUI Global Memory Trim

Global native heap trimming for ComfyUI on Linux/WSL.

This custom node repo installs a small global execution patch when ComfyUI loads custom nodes. The patch can call Python `gc.collect()` and glibc `malloc_trim(0)` before and/or after node execution. It is meant for workflows that repeatedly create large CPU image/video buffers through PyTorch, NumPy, OpenCV, Pillow, or native custom nodes and then stall or wedge under WSL2 memory pressure.

It also provides two optional workflow nodes:

- **Global Memory Trim Now**: manually run a trim and return RSS metrics.
- **Global Memory Trim Status**: return current config and last trim result.

The global patch does **not** require adding either node to your workflow.

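To make the before/after hook idea concrete, here is a minimal sketch of wrapping a callable so that a trim step runs around it. This is not the repo's actual patch and does not use ComfyUI's executor API; `with_trim` and its `trim` argument are hypothetical names chosen for illustration.

```python
import gc
from functools import wraps


def with_trim(fn, trim, before=False, after=True, run_gc=True):
    """Wrap fn so that trim() runs before and/or after each call.

    Hypothetical helper: `trim` is any zero-argument callable, for example
    a function that calls glibc malloc_trim(0).
    """

    def _collect_and_trim():
        if run_gc:
            # Drop unreachable Python objects first so their native
            # buffers are actually free by the time trim() runs.
            gc.collect()
        trim()

    @wraps(fn)
    def wrapper(*args, **kwargs):
        if before:
            _collect_and_trim()
        try:
            return fn(*args, **kwargs)
        finally:
            # Trim even if the wrapped call raised.
            if after:
                _collect_and_trim()

    return wrapper
```

The `finally` block is a design choice in this sketch: a node that fails midway may still have freed large buffers, so the trim runs either way.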
## Why this exists

Some WSL2 workloads can stall when native libraries repeatedly allocate and free large CPU buffers. The Python objects may already be gone, but glibc arenas can retain the freed pages. Under a WSL memory cap, that can trigger heavy reclaim or what looks like a hard VM stall. `malloc_trim(0)` asks glibc to return free heap pages to the OS.

This repo is intentionally CPU/native-heap focused. It does **not** directly free CUDA VRAM, unload ComfyUI models, delete ComfyUI caches, or change workflow outputs.

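For readers unfamiliar with `malloc_trim`, the following is a minimal sketch of calling it from Python through `ctypes`. It assumes glibc is available as `libc.so.6`; the function names `load_malloc_trim` and `trim_native_heap` are hypothetical and are not taken from this repo's code.

```python
import ctypes
import gc
import sys


def load_malloc_trim():
    """Return glibc's malloc_trim as a callable, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6")  # assumption: glibc soname
        fn = libc.malloc_trim            # missing on non-glibc libcs
    except (OSError, AttributeError):
        return None
    fn.argtypes = [ctypes.c_size_t]
    fn.restype = ctypes.c_int
    return fn


def trim_native_heap(run_gc=True):
    """Optionally run gc.collect(), then call malloc_trim(0).

    Returns True if glibc reports that memory was released, False if not,
    and None when malloc_trim is not available on this platform.
    """
    if run_gc:
        gc.collect()
    fn = load_malloc_trim()
    if fn is None:
        return None
    # malloc_trim returns 1 if memory was released to the system, else 0.
    return fn(0) == 1
```

A `False` result is not an error: it only means glibc had no free pages it could hand back at that moment.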
## Installation

From your ComfyUI directory:

```bash
git clone https://github.com/xmarre/ComfyUI-Global-Memory-Trim custom_nodes/ComfyUI-Global-Memory-Trim
```

Or copy this folder into:

```text
ComfyUI/custom_nodes/ComfyUI-Global-Memory-Trim
```

Restart ComfyUI. On startup you should see a log line similar to:

```text
Installed global memory trim patch: enabled=True before=False after=True ...
```

## Performance-oriented WSL setup

This is the setup I currently use for a large WSL2 ComfyUI workflow with heavy model switching, Flux/SDXL/SeedVR2/detailer passes, and large CPU image buffers.

The important parts are:

- Keep ComfyUI on `--highvram` for performance.
- Disable async weight offload and pinned memory on WSL.
- Do **not** force `--disable-cuda-malloc` here; the normal CUDA allocator path avoids the VRAM over-reservation/overflow seen with the native allocator path in this workflow.
- Keep `PYTORCH_CUDA_ALLOC_CONF` unset.
- Use glibc trim thresholds and the global trim hook to reduce CPU/native heap retention.
- If you use the patched SeedVR2 import-probe workaround and want the higher-quality 7B path, keep SeedVR2 BF16 forced on.

```bash
#!/usr/bin/env bash
set -e

_hold_terminal_on_failure() {
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    printf '\nComfyUI launcher exited with status %d\n' "$rc" >&2
    printf 'Dropping into interactive shell so the terminal stays open.\n' >&2
    exec bash -i
  fi
}
trap _hold_terminal_on_failure EXIT

source ~/miniconda3/etc/profile.d/conda.sh
conda activate comfy312

# Native/CPU heap behavior. These do not free CUDA VRAM directly.
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536

# Global trim hook.
# BEFORE=1 is more aggressive and can help before large model/node transitions.
# LOG=1 is useful while validating. Set it to 0 once stable.
export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=1
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=1
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192

# Optional, workflow-specific: keep SeedVR2 on BF16 without running an import-time CUDA probe.
export SEEDVR2_FORCE_BFLOAT16=1
unset SEEDVR2_IMPORT_BFLOAT16_PROBE

# Do not force PyTorch's allocator through the environment.
unset PYTORCH_CUDA_ALLOC_CONF

# Optional, workflow-specific memory reduction for SuperBeasts.
export SUPERBEASTS_SPCA_RETURN_RESIDUALS=false
export SUPERBEASTS_HDR_MALLOC_TRIM=true

export PYTHONFAULTHANDLER=1

cd ~/ComfyUI

set +e
python main.py \
  --listen 0.0.0.0 \
  --port 8188 \
  --fast fp16_accumulation \
  --highvram \
  --use-pytorch-cross-attention \
  --disable-async-offload \
  --disable-pinned-memory \
  "$@"
status=$?
set -e

exit "$status"
```

### After validating stability

Once the workflow is stable, reduce log overhead first:

```bash
export COMFYUI_GLOBAL_TRIM_LOG=0
```

Then, if performance still needs tuning, test one change at a time:

```bash
export COMFYUI_GLOBAL_TRIM_BEFORE=0
```

or:

```bash
export COMFYUI_GLOBAL_TRIM_INTERVAL=2
```

If wedges return, restore the previous value.

## Conservative diagnostic WSL setup

For reproducing or isolating CPU/native heap stalls, use the more conservative version below. It clamps native CPU thread pools and limits glibc arenas, which can improve WSL stability but may slow CPU-heavy nodes.

```bash
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OPENCV_OPENCL_RUNTIME=disabled

export MALLOC_ARENA_MAX=1
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536

export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=0
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=0
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192
```

Use this when the problem is clearly CPU/native memory pressure rather than VRAM pressure.

## Configuration

All configuration is via environment variables.

| Variable | Default | Meaning |
|---|---:|---|
| `COMFYUI_GLOBAL_TRIM` | `1` | Enable/disable the global patch. |
| `COMFYUI_GLOBAL_TRIM_AFTER` | `1` | Trim after node execution. |
| `COMFYUI_GLOBAL_TRIM_BEFORE` | `0` | Also trim before node execution. More aggressive, useful for testing or fragile WSL setups. |
| `COMFYUI_GLOBAL_TRIM_GC` | `1` | Run `gc.collect()` before `malloc_trim(0)`. |
| `COMFYUI_GLOBAL_TRIM_INTERVAL` | `1` | Trim every N trim opportunities. Use `2`, `4`, etc. to reduce overhead. |
| `COMFYUI_GLOBAL_TRIM_MIN_RSS_MB` | `0` | Only trim when process RSS is at least this value. `0` means always. |
| `COMFYUI_GLOBAL_TRIM_LOG` | `0` | Log every trim with RSS before/after. Very noisy; enable only while diagnosing. |
| `COMFYUI_GLOBAL_TRIM_WARN_NO_LIBC` | `1` | Warn when glibc `malloc_trim` cannot be loaded. |

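As an illustration of how the variables in the table could interact, here is a minimal sketch that reads them with the documented defaults and decides whether a given trim opportunity should run. It is not the repo's implementation. `TrimConfig`, `should_trim`, the accepted truthy strings, and the reading of "every N opportunities" as "every Nth opportunity, counting from 1" are all assumptions.

```python
import os
from dataclasses import dataclass

_TRUTHY = ("1", "true", "yes", "on")  # assumption: accepted "enabled" spellings


def _env_bool(env, name, default):
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUTHY


def _env_int(env, name, default):
    try:
        return int(env.get(name, default))
    except (TypeError, ValueError):
        return default  # fall back to the default on malformed input


@dataclass
class TrimConfig:
    enabled: bool = True
    after: bool = True
    before: bool = False
    run_gc: bool = True
    interval: int = 1
    min_rss_mb: int = 0
    log: bool = False
    warn_no_libc: bool = True

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            enabled=_env_bool(env, "COMFYUI_GLOBAL_TRIM", True),
            after=_env_bool(env, "COMFYUI_GLOBAL_TRIM_AFTER", True),
            before=_env_bool(env, "COMFYUI_GLOBAL_TRIM_BEFORE", False),
            run_gc=_env_bool(env, "COMFYUI_GLOBAL_TRIM_GC", True),
            interval=max(1, _env_int(env, "COMFYUI_GLOBAL_TRIM_INTERVAL", 1)),
            min_rss_mb=max(0, _env_int(env, "COMFYUI_GLOBAL_TRIM_MIN_RSS_MB", 0)),
            log=_env_bool(env, "COMFYUI_GLOBAL_TRIM_LOG", False),
            warn_no_libc=_env_bool(env, "COMFYUI_GLOBAL_TRIM_WARN_NO_LIBC", True),
        )


def should_trim(cfg, opportunity_index, rss_mb):
    """Decide whether to trim at this opportunity (index counts from 1)."""
    if not cfg.enabled:
        return False
    if opportunity_index % cfg.interval != 0:
        return False  # skip until the Nth opportunity
    return rss_mb >= cfg.min_rss_mb  # a threshold of 0 always passes
```

With `COMFYUI_GLOBAL_TRIM_INTERVAL=2` and `COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192`, this sketch would trim on every second opportunity, and only while the process RSS is at or above 8 GiB.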
## Notes

- Linux/WSL only. On non-Linux platforms the patch becomes a no-op.
- `malloc_trim(0)` only returns already-free native heap pages. It does not free live tensors, ComfyUI outputs, model weights, or Python objects that are still referenced.
- This is **not** a VRAM fixer. It targets CPU/native heap retention.
- `--disable-cuda-malloc` can change CUDA allocator behavior and may increase VRAM reservation/fragmentation in some workflows. Do not assume it is safer unless you specifically need it.
- `--disable-async-offload` and `--disable-pinned-memory` can be useful on WSL when async offload/pinned-memory paths cause wedges.
- `COMFYUI_GLOBAL_TRIM_LOG=1` is diagnostic only. Turn it off for normal use.

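To check for yourself how much a trim actually returns to the OS, one option is to read the process RSS before and after. The sketch below parses `VmRSS` from `/proc/self/status` on Linux. The helper names are hypothetical, and this is not necessarily how the repo's nodes compute their RSS metrics.

```python
def parse_vm_rss_mb(status_text):
    """Extract VmRSS from /proc/<pid>/status text; return MiB or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            parts = line.split()
            # Expected shape: "VmRSS:    123456 kB"
            if len(parts) >= 2 and parts[1].isdigit():
                return int(parts[1]) / 1024.0
    return None


def current_rss_mb(path="/proc/self/status"):
    """Return this process's RSS in MiB, or None if it cannot be read."""
    try:
        with open(path, "r", encoding="ascii", errors="replace") as fh:
            return parse_vm_rss_mb(fh.read())
    except OSError:
        return None  # e.g. non-Linux platforms without /proc


def rss_delta_mb(action):
    """Run action() and return (before, after, freed) in MiB.

    `freed` is None when RSS is unavailable; it can be negative if the
    process grew while action() was running.
    """
    before = current_rss_mb()
    action()
    after = current_rss_mb()
    freed = None if before is None or after is None else before - after
    return before, after, freed
```

Passing a trim callable as `action` gives a rough before/after comparison. A small or zero `freed` value is consistent with the note above: only pages that were already free can be returned.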
## License

MIT