ComfyUI/README.md

# ComfyUI Global Memory Trim

Global native heap trimming for ComfyUI on Linux/WSL.

This custom node repo installs a small global execution patch when ComfyUI loads custom nodes. The patch can call Python `gc.collect()` and glibc `malloc_trim(0)` before and/or after node execution. It is meant for workflows that repeatedly create large CPU image/video buffers through PyTorch, NumPy, OpenCV, Pillow, or native custom nodes and then stall or wedge under WSL2 memory pressure.

It also provides two optional workflow nodes:

- **Global Memory Trim Now**: manually run a trim and return RSS metrics.
- **Global Memory Trim Status**: return current config and last trim result.

The global patch does **not** require adding either node to your workflow.

## Why this exists

Some WSL2 workloads can stall when native libraries repeatedly allocate and free large CPU buffers. Python objects may be gone, but glibc arenas can retain pages. Under a WSL memory cap, that can trigger heavy reclaim or a hard-looking VM stall. `malloc_trim(0)` asks glibc to return free heap pages to the OS.

This repo is intentionally CPU/native-heap focused. It does **not** directly free CUDA VRAM, unload ComfyUI models, delete ComfyUI caches, or change workflow outputs.

## Installation

From your ComfyUI directory:

```bash
git clone https://github.com/xmarre/ComfyUI-Global-Memory-Trim custom_nodes/ComfyUI-Global-Memory-Trim
```

Or copy this folder into:

```text
ComfyUI/custom_nodes/ComfyUI-Global-Memory-Trim
```

Restart ComfyUI. On startup you should see a log line similar to:

```text
Installed global memory trim patch: enabled=True before=False after=True ...
```

## Performance-oriented WSL setup

This is the current practical setup I use for a large WSL2 ComfyUI workflow with heavy model switching, Flux/SDXL/SeedVR2/detailer passes, and large CPU image buffers.

The important parts are:

- Keep ComfyUI on `--highvram` for performance.
- Disable async weight offload and pinned memory on WSL.
- Do **not** force `--disable-cuda-malloc` here; the normal CUDA allocator path avoids the VRAM over-reservation/overflow seen with the native allocator path in this workflow.
- Keep `PYTORCH_CUDA_ALLOC_CONF` unset.
- Use glibc trim thresholds and the global trim hook to reduce CPU/native heap retention.
- Keep SeedVR2 BF16 forced on if using the patched SeedVR2 import probe workaround and wanting the higher-quality 7B path.

```bash
#!/usr/bin/env bash
set -e

_hold_terminal_on_failure() {
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    printf '\nComfyUI launcher exited with status %d\n' "$rc" >&2
    printf 'Dropping into interactive shell so the terminal stays open.\n' >&2
    exec bash -i
  fi
}
trap _hold_terminal_on_failure EXIT

source ~/miniconda3/etc/profile.d/conda.sh
conda activate comfy312

# Native/CPU heap behavior. These do not free CUDA VRAM directly.
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536

# Global trim hook.
# BEFORE=1 is more aggressive and can help before large model/node transitions.
# LOG=1 is useful while validating. Set it to 0 once stable.
export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=1
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=1
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192

# Optional, workflow-specific: keep SeedVR2 on BF16 without running an import-time CUDA probe.
export SEEDVR2_FORCE_BFLOAT16=1
unset SEEDVR2_IMPORT_BFLOAT16_PROBE

# Do not force PyTorch's allocator through the environment.
unset PYTORCH_CUDA_ALLOC_CONF

# Optional, workflow-specific memory reduction for SuperBeasts.
export SUPERBEASTS_SPCA_RETURN_RESIDUALS=false
export SUPERBEASTS_HDR_MALLOC_TRIM=true

export PYTHONFAULTHANDLER=1

cd ~/ComfyUI

set +e
python main.py \
  --listen 0.0.0.0 \
  --port 8188 \
  --fast fp16_accumulation \
  --highvram \
  --use-pytorch-cross-attention \
  --disable-async-offload \
  --disable-pinned-memory \
  "$@"
status=$?
set -e

exit "$status"
```

### After validating stability

Once the workflow is stable, reduce log overhead first:

```bash
export COMFYUI_GLOBAL_TRIM_LOG=0
```

Then, if performance still needs tuning, test one change at a time:

```bash
export COMFYUI_GLOBAL_TRIM_BEFORE=0
```

or:

```bash
export COMFYUI_GLOBAL_TRIM_INTERVAL=2
```

If wedges return, restore the previous value.

## Conservative diagnostic WSL setup

For reproducing or isolating CPU/native heap stalls, use the more conservative version below. It clamps native CPU thread pools and limits glibc arenas, which can improve WSL stability but may slow CPU-heavy nodes.

```bash
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OPENCV_OPENCL_RUNTIME=disabled

export MALLOC_ARENA_MAX=1
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536

export COMFYUI_GLOBAL_TRIM=1
export COMFYUI_GLOBAL_TRIM_AFTER=1
export COMFYUI_GLOBAL_TRIM_BEFORE=0
export COMFYUI_GLOBAL_TRIM_GC=1
export COMFYUI_GLOBAL_TRIM_INTERVAL=1
export COMFYUI_GLOBAL_TRIM_LOG=0
export COMFYUI_GLOBAL_TRIM_MIN_RSS_MB=8192
```

Use this when the problem is clearly CPU/native memory pressure rather than VRAM pressure.

## Configuration

All configuration is via environment variables.

| Variable | Default | Meaning |
|---|---:|---|
| `COMFYUI_GLOBAL_TRIM` | `1` | Enable/disable the global patch. |
| `COMFYUI_GLOBAL_TRIM_AFTER` | `1` | Trim after node execution. |
| `COMFYUI_GLOBAL_TRIM_BEFORE` | `0` | Also trim before node execution. More aggressive, useful for testing or fragile WSL setups. |
| `COMFYUI_GLOBAL_TRIM_GC` | `1` | Run `gc.collect()` before `malloc_trim(0)`. |
| `COMFYUI_GLOBAL_TRIM_INTERVAL` | `1` | Trim every N trim opportunities. Use `2`, `4`, etc. to reduce overhead. |
| `COMFYUI_GLOBAL_TRIM_MIN_RSS_MB` | `0` | Only trim when process RSS is at least this value. `0` means always. |
| `COMFYUI_GLOBAL_TRIM_LOG` | `0` | Log every trim with RSS before/after. Very noisy; enable only while diagnosing. |
| `COMFYUI_GLOBAL_TRIM_WARN_NO_LIBC` | `1` | Warn when glibc `malloc_trim` cannot be loaded. |

## Notes

- Linux/WSL only. On non-Linux platforms the patch becomes a no-op.
- `malloc_trim(0)` only returns already-free native heap pages. It does not free live tensors, ComfyUI outputs, model weights, or Python objects that are still referenced.
- This is **not** a VRAM fixer. It targets CPU/native heap retention.
- `--disable-cuda-malloc` can change CUDA allocator behavior and may increase VRAM reservation/fragmentation in some workflows. Do not assume it is safer unless you specifically need it.
- `--disable-async-offload` and `--disable-pinned-memory` can be useful on WSL when async offload/pinned-memory paths cause wedges.
- `COMFYUI_GLOBAL_TRIM_LOG=1` is diagnostic only. Turn it off for normal use.

## License

MIT