The _amd_vram_gtt_totals() device match compared str(pci_bus_id) against the
sysfs leaf BDF, but torch reports pci_bus_id as a decimal integer while amdgpu
names its nodes as a hex "domain:bus:device.function" BDF, so the comparison
never matched. A single-GPU host was rescued by the len(candidates) == 1
fallback; a hybrid / multi-GPU host has no fallback and could fall through to
shared-heavy, demoting a dedicated GPU to SHARED (reported for a GPU sitting
behind a PCIe bridge).
Build the canonical hex BDF from torch's integer pci_domain_id / pci_bus_id /
pci_device_id and compare it against the candidate's realpath leaf BDF (PCI
function stripped). realpath already collapses any bridge chain to the leaf,
so this works for directly-attached, behind-a-bridge, and multi-GPU hosts
alike. The len(candidates) == 1 fallback is kept.
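
A minimal sketch of the comparison described above, not the actual patch. The
helper names (canonical_bdf, strip_function, leaf_bdf, matches) are
hypothetical, and the "dddd:bb:dd" zero-padded hex layout is assumed from the
usual sysfs naming:

```python
import os


def canonical_bdf(domain: int, bus: int, device: int) -> str:
    """Format integer PCI ids as a hex 'domain:bus:device' string (no function)."""
    return f"{domain:04x}:{bus:02x}:{device:02x}"


def strip_function(leaf_name: str) -> str:
    """Drop the trailing '.function' from a sysfs leaf such as '0000:c1:00.0'."""
    return leaf_name.rsplit(".", 1)[0].lower()


def leaf_bdf(sysfs_path: str) -> str:
    """Resolve symlinks, then take the last path component as the leaf BDF.

    realpath collapses any bridge chain, so the final component is the
    device itself rather than an upstream bridge.
    """
    return strip_function(os.path.basename(os.path.realpath(sysfs_path)))


def matches(domain: int, bus: int, device: int, sysfs_path: str) -> bool:
    """True when the integer ids name the same device as the sysfs node."""
    return canonical_bdf(domain, bus, device) == leaf_bdf(sysfs_path)
```

With bus 193, str(193) is "193" while the sysfs leaf carries "c1", which is
why the old string comparison could never match.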
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
#14274
* main: implement --vram-headroom
Implement --vram-headroom for dynamic VRAM as a hybrid debug/diagnostic
option for people who still report shared VRAM spills. They can tune the
setting by trial and error to keep a bit more headroom and avoid the
spills.
* main: implement --reserve-vram
Implement --reserve-vram as extra headroom on the simple method which
is semantically as close as possible to the stated functionality and
former behaviour of non-dynamic VRAM.
Add this option for users who know they have so much ram they want
to pin everything or have a pagefile that outruns their disk speed.
This removes the RAM pressure caps completely and pins behind the
primary model load, forcing all models to be permanently committed
to RAM.
Some custom nodes .to() weights completely outside of load context, which
can wreak havoc if it's for a model that is not active. Detect this
condition and just let it fall-through to the non-dynamic loader
straight up.
Some custom nodes try to set this true globally. It messes with dynamic
VRAM with one-off spikes that can OOM but this is also very high risk
for Windows where such allocations might get serviced by shared memory
fallback.
Trump it.
cleanup_models_gc can be called once per load_models_gpu via
free_memory, which in turn can de-activate an active model via
this reset_cast_buffers.
cleanup_models_gc() could also come via obscure garbage collector
paths so limit reset_cast_buffers to the post-node callsite instead.
On AMD APUs (and other integrated GPUs) the "VRAM" reported by
torch.cuda.mem_get_info() is the GTT/shared aperture carved out of host
RAM, not a dedicated board. ComfyUI starts such devices in NORMAL_VRAM and
later sums device VRAM plus system RAM when sizing the model-load budget,
so on a UMA part the same physical RAM is counted twice and the inflated
budget triggers HIGH_VRAM / gpu-only placement that OOMs the shared pool.
Detecting integrated GPUs alone is not enough: integrated parts vary widely
in how memory is split. Some (large BIOS UMA carveout, e.g. Strix Halo)
report most memory as dedicated mem_info_vram_total, where HIGH_VRAM is
right; others report a small VRAM carveout with the bulk in GTT, where
SHARED is right. Demoting every integrated GPU to SHARED would regress the
dedicated-heavy configs.
Key the demotion on the amdgpu mem_info_vram_total vs mem_info_gtt_total
ratio: only when an integrated GPU's shared (GTT) pool is at least as large
as its dedicated VRAM do we switch it to VRAMState.SHARED. Dedicated-heavy
integrated parts and discrete GPUs keep NORMAL_VRAM. When the sysfs totals
cannot be read (e.g. NVIDIA Tegra, which has no dedicated VRAM) the device
is treated as shared-heavy, matching its true unified memory.
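
The decision rule above can be sketched as follows. This is an illustration
only: the VRAMState enum here is a hypothetical stand-in with just the two
states that matter, and pick_state and its parameters are assumed names, not
ComfyUI's API:

```python
from enum import Enum
from typing import Optional, Tuple


class VRAMState(Enum):
    """Hypothetical stand-in for the real enum; only two states shown."""
    NORMAL_VRAM = 1
    SHARED = 2


def pick_state(is_integrated: bool,
               totals: Optional[Tuple[int, int]]) -> VRAMState:
    """Choose a state from (mem_info_vram_total, mem_info_gtt_total) in bytes.

    totals is None when the sysfs values could not be read.
    """
    if not is_integrated:
        # Discrete GPUs are never demoted.
        return VRAMState.NORMAL_VRAM
    if totals is None:
        # Unreadable totals on an integrated part: treat as shared-heavy.
        return VRAMState.SHARED
    vram_total, gtt_total = totals
    # Demote only when the shared pool is at least as large as dedicated VRAM.
    if gtt_total >= vram_total:
        return VRAMState.SHARED
    return VRAMState.NORMAL_VRAM
```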
Fixes #14274
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
* mm: split off registration helper to doer and headroom calc
* pinned_memory: implement registration comfy side
Move away from Aimdo buffer registrations which seem fraught with
danger and do it comfy side. Just start with the basic move.
* pinned_memory: do registrations as portable memory
* pinned_memory: discard async errors on registration fail
Like the good ol days.
* pinned_memory: implement abs shortfall retry
If pinned registration happens to fail despite the previous budget
ensures, consider the allocation shortfall, ensure it again, and
try again. This allows comfy pins to interoperate with other software
that might be doing substantive pinning.
* fix (MultiGPU): prevent freeze on manual abort when using MultiGPU CFG Split
Problem:
Upon manual abort application hangs indefinitely.
`InterruptProcessingException` inherits from `BaseException` and bypasses MultiGPU's worker error handling block, so the thread dies silently, leaving the main thread waiting forever on `result_q.get()`.
Fix:
Catch `comfy.model_management.InterruptProcessingException` instead of `Exception` so it's caught and passed back via `result_q` to unblock the main thread when manual abort signal fires.
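
A self-contained sketch of the failure mode and the fix. The exception class
below is a hypothetical stand-in for the real one, and `worker` /
`run_in_worker` are assumed names. The sketch also keeps a separate
`Exception` handler purely for illustration:

```python
import queue
import threading


class InterruptProcessingException(BaseException):
    """Stand-in: derives from BaseException, so `except Exception` misses it."""


def worker(fn, result_q: queue.Queue) -> None:
    """Run fn and always report an outcome so the waiter is never left hanging."""
    try:
        result_q.put(("ok", fn()))
    except InterruptProcessingException as exc:
        # Without this clause the thread would die silently on abort.
        result_q.put(("interrupted", exc))
    except Exception as exc:
        result_q.put(("error", exc))


def run_in_worker(fn, timeout: float = 5.0):
    """Start fn on a thread and block on the result queue."""
    result_q: queue.Queue = queue.Queue()
    thread = threading.Thread(target=worker, args=(fn, result_q), daemon=True)
    thread.start()
    return result_q.get(timeout=timeout)
```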
* oops
* mm: reinstate smart memory for VRAM
* mm: restore non-dynamic smart memory
By popular demand. We aren't quite ready for the deprecation as non
dynamic enabled GPUs and some high-vram custom model loader setups
prefer the old full hands on.
* memory_management: Add direct read to GPU mode
Make destination optional (or make it optionally GPU) and use aimdo
to file_read direct to GPU.
* ops: Remove stream pin buffers and use aimdo reads
This consumed too much RAM and it's better to just take the hit on
the CPU syncing back the stream on a short ring buffer. Aimdo
implements this so just rip the stream pin buffer from comfy.
* model_management: allow active pin registration movement
It's better to just let the active model load past the pin limit as
pins and let the pins move around. This saves the HDD and SATA
people disk traffic while only costing a few GPU syncs.
* utils: use aimdo file handle
This opens the file on Windows with more favourable flags.
* mp: only count the model proper for loaded_ram and vram
Exclude live loras from the numbers to avoid the case where the reported
loaded memory exceeds the size of the model.
This caused me confusion in the Kijai visualizer when a model looked fully
loaded but was hitting disk due to this accounting discrepancy.
* utils: add bit reverse utility
useful for max scattering something ordered.
* pinned_memory: Implement offload balancing
Use a max scatter algorithm to prioritize pins of the same size such
that when doing a little bit of offloading it gets scattered, allowing
the prefetcher to more evenly swallow the offload.
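
One way such a scatter could be built on a bit-reverse helper, shown purely as
an illustration and not as the code in this change. Both function names and
the padding to the next power of two are assumptions:

```python
from typing import List


def bit_reverse(value: int, bits: int) -> int:
    """Reverse the low `bits` bits of value (e.g. 0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def scatter_order(n: int) -> List[int]:
    """Order indices 0..n-1 so that any short prefix is spread across the range.

    Indices are ranked by their bit-reversed value; taking the first k entries
    then picks positions far apart instead of k neighbours.
    """
    bits = max(1, (n - 1).bit_length())
    return sorted(range(n), key=lambda i: bit_reverse(i, bits))
```

For eight equally sized pins, offloading the first two entries of
scatter_order(8) touches positions 0 and 4 rather than 0 and 1.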
* comfy-aimdo 0.4.7
Aimdo 0.4.7 implements VRAM buffer exhaustion prediction to avoid
early speculative load of weights that definitely won't fit once the
inference gets further in.
* model-prefetch: consolidate pin ensures on the sync point
This could happen mid prefetch block, cause a sync of the entire
block and lose overlap. Get ahead of the problem with a free down
at the natural compute stream sync point.
* mm: Put a 2GB min on the pin ceiling
This is reasonably bad if it starts causing swap pressure, more so than
during normal ram-cache proceedings. Clamp it.
* add --fast-disk