| .ci | ||
| .github | ||
| alembic_db | ||
| api_server | ||
| app | ||
| blueprints | ||
| comfy | ||
| comfy_api | ||
| comfy_api_nodes | ||
| comfy_config | ||
| comfy_execution | ||
| comfy_extras | ||
| custom_nodes | ||
| input | ||
| middleware | ||
| models | ||
| output | ||
| script_examples | ||
| tests | ||
| tests-unit | ||
| utils | ||
| .coderabbit.yaml | ||
| .gitattributes | ||
| .gitignore | ||
| .spectral.yaml | ||
| alembic.ini | ||
| CODEOWNERS | ||
| comfyui_version.py | ||
| CONTRIBUTING.md | ||
| cuda_malloc.py | ||
| execution.py | ||
| extra_model_paths.yaml.example | ||
| folder_paths.py | ||
| hook_breaker_ac10a0.py | ||
| latent_preview.py | ||
| LICENSE | ||
| main.py | ||
| manager_requirements.txt | ||
| node_helpers.py | ||
| nodes.py | ||
| openapi.yaml | ||
| protocol.py | ||
| pyproject.toml | ||
| pytest.ini | ||
| QUANTIZATION.md | ||
| README.md | ||
| requirements.txt | ||
| SECURITY.md | ||
| server.py | ||
ComfyUI Global Memory Trim
Global native heap trimming for ComfyUI on Linux/WSL.
This custom node repo installs a small global execution patch when ComfyUI loads custom nodes. The patch can call Python gc.collect() and glibc malloc_trim(0) before and/or after node execution. It is meant for workflows that repeatedly create large CPU image/video buffers through PyTorch, NumPy, OpenCV, Pillow, or native custom nodes and then stall or wedge under WSL2 memory pressure.
It also provides two optional workflow nodes:
- Global Memory Trim Now: manually run a trim and return RSS metrics.
- Global Memory Trim Status: return current config and last trim result.
The global patch does not require adding either node to your workflow.
Why this exists
Some WSL2 workloads can stall when native libraries repeatedly allocate and free large CPU buffers. Python objects may be gone, but glibc arenas can retain pages. Under a WSL memory cap, that can trigger heavy reclaim or a hard-looking VM stall. malloc_trim(0) asks glibc to return free heap pages to the OS.
This repo is intentionally CPU/native-heap focused. It does not directly free CUDA VRAM, unload ComfyUI models, delete ComfyUI caches, or change workflow outputs.
Installation
From your ComfyUI directory:
git clone https://github.com/xmarre/ComfyUI-Global-Memory-Trim custom_nodes/ComfyUI-Global-Memory-Trim
Or copy this folder into:
ComfyUI/custom_nodes/ComfyUI-Global-Memory-Trim
Restart ComfyUI. On startup you should see a log line similar to:
Installed global memory trim patch: enabled=True before=False after=True ...
Performance-oriented WSL setup
This is the current practical setup I use for a large WSL2 ComfyUI workflow with heavy model switching, Flux/SDXL/SeedVR2/detailer passes, and large CPU image buffers.
The important parts are:
- Keep ComfyUI on
--highvramfor performance. - Disable async weight offload and pinned memory on WSL.
- Do not force
--disable-cuda-mallochere; the normal CUDA allocator path avoids the VRAM over-reservation/overflow seen with the native allocator path in this workflow. - Keep
PYTORCH_CUDA_ALLOC_CONFunset. - Use glibc trim thresholds and the global trim hook to reduce CPU/native heap retention.
- Keep SeedVR2 BF16 forced on if using the patched SeedVR2 import probe workaround and wanting the higher-quality 7B path.
#!/usr/bin/env bash
set -e
_hold_terminal_on_failure() {
local rc=$?
if [ "$rc" -ne 0 ]; then
printf '\nComfyUI launcher exited with status %d\n' "$rc" >&2
printf 'Dropping into interactive shell so the terminal stays open.\n' >&2
exec bash -i
fi
}
trap _hold_terminal_on_failure EXIT
source ~/miniconda3/etc/profile.d/conda.sh
conda activate comfy312
# Native/CPU heap behavior. These do not free CUDA VRAM directly.
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536
# Global trim hook.
# BEFORE=1 is more aggressive and can help before large model/node transitions.
# LOG=1 is useful while validating. Set it to 0 once stable.
export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=1
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=1
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192
# Optional, workflow-specific: keep SeedVR2 on BF16 without running an import-time CUDA probe.
export SEEDVR2_FORCE_BFLOAT16=1
unset SEEDVR2_IMPORT_BFLOAT16_PROBE
# Do not force PyTorch's allocator through the environment.
unset PYTORCH_CUDA_ALLOC_CONF
# Optional, workflow-specific memory reduction for SuperBeasts.
export SUPERBEASTS_SPCA_RETURN_RESIDUALS=false
export SUPERBEASTS_HDR_MALLOC_TRIM=true
export PYTHONFAULTHANDLER=1
cd ~/ComfyUI
set +e
python main.py \
--listen 0.0.0.0 \
--port 8188 \
--fast fp16_accumulation \
--highvram \
--use-pytorch-cross-attention \
--disable-async-offload \
--disable-pinned-memory \
"$@"
status=$?
set -e
exit "$status"
After validating stability
Once the workflow is stable, reduce log overhead first:
export COMFYUI_GLOBAL_TRIM_LOG=0
Then, if performance still needs tuning, test one change at a time:
export COMFYUI_GLOBAL_TRIM_BEFORE=0
or:
export COMFYUI_GLOBAL_TRIM_INTERVAL=2
If wedges return, restore the previous value.
Conservative diagnostic WSL setup
For reproducing or isolating CPU/native heap stalls, use the more conservative version below. It clamps native CPU thread pools and limits glibc arenas, which can improve WSL stability but may slow CPU-heavy nodes.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OPENCV_OPENCL_RUNTIME=disabled
export MALLOC_ARENA_MAX=1
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536
export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=0
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=0
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192
Use this when the problem is clearly CPU/native memory pressure rather than VRAM pressure.
Configuration
All configuration is via environment variables.
| Variable | Default | Meaning |
|---|---|---|
COMFYUI_GLOBAL_TRIM |
1 |
Enable/disable the global patch. |
COMFYUI_GLOBAL_TRIM_AFTER |
1 |
Trim after node execution. |
COMFYUI_GLOBAL_TRIM_BEFORE |
0 |
Also trim before node execution. More aggressive, useful for testing or fragile WSL setups. |
COMFYUI_GLOBAL_TRIM_GC |
1 |
Run gc.collect() before malloc_trim(0). |
COMFYUI_GLOBAL_TRIM_INTERVAL |
1 |
Trim every N trim opportunities. Use 2, 4, etc. to reduce overhead. |
COMFYUI_GLOBAL_TRIM_MIN_RSS_MB |
0 |
Only trim when process RSS is at least this value. 0 means always. |
COMFYUI_GLOBAL_TRIM_LOG |
0 |
Log every trim with RSS before/after. Very noisy; enable only while diagnosing. |
COMFYUI_GLOBAL_TRIM_WARN_NO_LIBC |
1 |
Warn when glibc malloc_trim cannot be loaded. |
Notes
- Linux/WSL only. On non-Linux platforms the patch becomes a no-op.
malloc_trim(0)only returns already-free native heap pages. It does not free live tensors, ComfyUI outputs, model weights, or Python objects that are still referenced.- This is not a VRAM fixer. It targets CPU/native heap retention.
--disable-cuda-malloccan change CUDA allocator behavior and may increase VRAM reservation/fragmentation in some workflows. Do not assume it is safer unless you specifically need it.--disable-async-offloadand--disable-pinned-memorycan be useful on WSL when async offload/pinned-memory paths cause wedges.COMFYUI_GLOBAL_TRIM_LOG=1is diagnostic only. Turn it off for normal use.
License
MIT