Mirror of https://github.com/comfyanonymous/ComfyUI.git (synced 2026-01-11 23:00:51 +08:00)
Disk tier safetensors streaming design audit (ComfyUI)
Mandatory research audit (verified call sites)
ComfyUI load path + eager materialization sites
- `comfy/utils.py:load_torch_file` currently uses `safetensors.safe_open` and iterates all keys to build a full `sd` dict (eager tensor materialization). It also returns metadata only after reading all tensors.【F:comfy/utils.py†L58-L93】
- `comfy/utils.py:calculate_parameters` and `weight_dtype` iterate `sd.keys()` and then access `sd[k]` to compute `nelement()`/`dtype` (loads tensors).【F:comfy/utils.py†L109-L128】
- `comfy/utils.py:state_dict_prefix_replace` mutates dicts by `pop` + assignment (materializes if used on a streaming mapping).【F:comfy/utils.py†L135-L144】
- `comfy/model_base.py:BaseModel.load_model_weights` builds `to_load = {}` by iterating keys and popping tensors, then passes a fully materialized dict to `load_state_dict` (RAM spike).【F:comfy/model_base.py†L301-L318】
- `comfy/model_detection.py` reads `state_dict[key].shape` in many branches for detection (must become metadata-only). Example: `calculate_transformer_depth` and numerous `detect_unet_config` branches read shapes directly from `state_dict` values.【F:comfy/model_detection.py†L21-L200】
- `comfy/sd.py` loads checkpoints, then slices, renames, and computes parameters/dtypes by reading tensors (e.g., `calculate_parameters`, `weight_dtype`, `process_*_state_dict`, and the special scaled-FP8 conversion that builds new dicts).【F:comfy/sd.py†L1304-L1519】
- Direct safetensors loads outside `load_torch_file`: `comfy/sd1_clip.py:load_embed` and `nodes.py:LoadLatent.load` use `safetensors.torch.load_file`, bypassing the core loader.【F:comfy/sd1_clip.py†L432-L434】【F:nodes.py†L521-L529】
fastsafetensors capability audit
- Header parsing and metadata: `fastsafetensors/common.py:SafeTensorsMetadata` parses the header and builds a per-tensor `TensorFrame` with `dtype`, `shape`, and `data_offsets` (no tensor allocation).【F:../third_party/fastsafetensors-main/fastsafetensors/common.py†L63-L187】 `TensorFrame` stores dtype/shape/offsets and supports slicing metadata.【F:../third_party/fastsafetensors-main/fastsafetensors/common.py†L238-L338】
- GDS + no-GDS low-level readers: `fastsafetensors/cpp.pyi` exposes `gds_file_reader`, `gds_file_handle`, `nogds_file_reader`, `cpu_malloc`, `gpu_malloc`, and alignment helpers such as `get_alignment_size()`.【F:../third_party/fastsafetensors-main/fastsafetensors/cpp.pyi†L1-L43】
- GDS availability checks live in `fastsafetensors/cpp.pyi`: `is_gds_supported`, `is_cufile_found`, `cufile_version`, and `init_gds`.【F:../third_party/fastsafetensors-main/fastsafetensors/cpp.pyi†L36-L43】
- DLPack wrapping: `fastsafetensors/dlpack.py` provides `from_cuda_buffer()`, which creates DLPack capsules for both CPU and GPU buffers via a device descriptor and is used with `torch.from_dlpack`.【F:../third_party/fastsafetensors-main/fastsafetensors/dlpack.py†L232-L239】
- Torch framework interop: `fastsafetensors/frameworks/_torch.py:TorchOp` provides `alloc_tensor_memory`/`free_tensor_memory` and dtype mapping, and uses `torch.from_dlpack` to wrap raw pointers into tensors.【F:../third_party/fastsafetensors-main/fastsafetensors/frameworks/_torch.py†L131-L205】
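The container layout that `SafeTensorsMetadata` parses is simple enough to sketch with the standard library alone: an 8-byte little-endian header length, a JSON header mapping tensor names to `dtype`/`shape`/`data_offsets`, then the raw tensor bytes. A minimal metadata-only parse (the function name `parse_safetensors_header` is ours, not fastsafetensors API):

```python
import json
import struct

def parse_safetensors_header(path):
    """Return (tensors, metadata) from a .safetensors file, reading only
    the header: per-tensor dtype/shape/data_offsets, no tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    metadata = header.pop("__metadata__", None)  # optional string->string map
    return header, metadata
```

Everything the audit needs for detection and parameter counting (shapes, dtypes, byte ranges) is available from this header without touching tensor data.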
VRAM/RAM offload logic (for extension)
- `comfy/model_management.py` handles VRAM/RAM offload via `free_memory` and keeps track of loaded/offloaded memory (needs integration for the RAM disk tier).【F:comfy/model_management.py†L584-L612】
- `comfy/model_patcher.py` implements module-by-module offload/low-VRAM weight casting (`comfy_cast_weights`) and partial unload/load (needs to integrate the disk tier for RAM eviction).【F:comfy/model_patcher.py†L663-L955】
Strategy summary (implemented)
Streaming safetensors mapping (no full dict materialization)
- Introduce a new module, `comfy/safetensors_stream.py`, with:
  - `TensorMeta` and `SafeTensorIndex` (metadata-only parsing with `fastsafetensors.SafeTensorsMetadata`).
  - `StreamStateDict`, a mapping backed by `SafeTensorIndex`, exposing metadata-only `keys()`/`__iter__` and loading tensors on demand.
  - Lightweight mapping views `PrefixViewStateDict`, `FilterViewStateDict`, and `RenameViewStateDict` for lazy prefix/filter/rename without eager loading.
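A dependency-free sketch of the intended laziness, with the tensor loader abstracted to a callback (the real `StreamStateDict` would be backed by `SafeTensorIndex` range reads; only the class names come from the plan above):

```python
from collections.abc import Mapping

class StreamStateDict(Mapping):
    """Mapping over tensor names; values are loaded on demand via `loader`.
    keys()/__iter__/__len__ touch only metadata."""
    def __init__(self, metas, loader):
        self._metas = metas    # name -> metadata (dtype/shape/offsets)
        self._loader = loader  # (name, meta) -> tensor, via a range read
    def __getitem__(self, key):
        return self._loader(key, self._metas[key])
    def __iter__(self):
        return iter(self._metas)
    def __len__(self):
        return len(self._metas)
    def meta(self, key):
        return self._metas[key]  # metadata-only access, no tensor load

class PrefixViewStateDict(Mapping):
    """Lazy view exposing only keys under `prefix`, with the prefix stripped.
    Building the view inspects keys only; tensors load on item access."""
    def __init__(self, base, prefix):
        self._base, self._prefix = base, prefix
        self._keys = [k[len(prefix):] for k in base if k.startswith(prefix)]
    def __getitem__(self, key):
        return self._base[self._prefix + key]
    def __iter__(self):
        return iter(self._keys)
    def __len__(self):
        return len(self._keys)
```

`FilterViewStateDict` and `RenameViewStateDict` follow the same pattern: rewrite keys, delegate `__getitem__` to the base mapping.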
Range reads and tiering
- Disk→RAM: use `fastsafetensors.cpp.nogds_file_reader` for range reads and wrap the buffers with DLPack.
- Disk→GPU (GDS): use `gds_file_reader` + `gds_file_handle` to read the aligned range directly into GPU memory. If GDS is requested but not supported (e.g., `is_gds_supported == 0` or libcufile missing), raise a hard error with instructions to disable GDS.
- Disk→RAM→GPU: read only the tensor range into (optionally pinned) CPU memory, copy it to the GPU, then release the CPU buffer unless the RAM cache policy keeps it.
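GDS and O_DIRECT-style reads must start and end on alignment boundaries, which is why `cpp.pyi` exposes `get_alignment_size()`. The window arithmetic is a sketch, assuming a hypothetical 4 KiB default alignment:

```python
def aligned_read_window(begin, end, alignment=4096):
    """Expand the tensor's byte range [begin, end) to alignment boundaries
    for a direct/GDS read. Returns (read_offset, read_length, inner_offset),
    where inner_offset is where the tensor's bytes start inside the buffer."""
    read_offset = (begin // alignment) * alignment                   # round down
    aligned_end = ((end + alignment - 1) // alignment) * alignment   # round up
    return read_offset, aligned_end - read_offset, begin - read_offset
```

The caller issues one read of `read_length` bytes at `read_offset` and slices the tensor out starting at `inner_offset`.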
Disk tier integration
- Represent disk-resident weights as meta tensors (`device='meta'`) plus a `DiskRef` registry that maps `(module, param_name)` to a `TensorMeta` and a loader handle.
- Add an LRU cache for RAM-resident weights loaded from disk, with a configurable max byte budget. Eviction replaces RAM tensors with meta tensors and keeps the `DiskRef` for reload.
- Add a general `forward_pre_hook` that materializes any meta + `DiskRef` weights before compute; this covers modules that bypass `comfy.ops`.
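The RAM-tier eviction policy can be sketched as a byte-budgeted LRU (the name `RamWeightCache` and the evicted-keys protocol are illustrative, not existing ComfyUI API):

```python
from collections import OrderedDict

class RamWeightCache:
    """Byte-budgeted LRU for RAM-resident weights streamed from disk.
    Evicted keys are returned so the caller can swap those parameters back
    to meta tensors and keep only their DiskRef entries for reload."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._entries = OrderedDict()  # key -> (tensor, nbytes)
        self._used = 0
    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]
    def put(self, key, tensor, nbytes):
        evicted = []
        if key in self._entries:
            self._used -= self._entries.pop(key)[1]
        self._entries[key] = (tensor, nbytes)
        self._used += nbytes
        # Evict least-recently-used entries until we fit the budget,
        # but never evict the entry we just inserted.
        while self._used > self.max_bytes and len(self._entries) > 1:
            old_key, (_, old_bytes) = self._entries.popitem(last=False)
            self._used -= old_bytes
            evicted.append(old_key)  # demote these back to meta + DiskRef
        return evicted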
Pipeline refactors
- Update
load_torch_fileto returnStreamStateDictfor.safetensors/.sftand return metadata without loading. - Update helpers (
calculate_parameters,weight_dtype,state_dict_prefix_replace) to be metadata-aware and lazy. - Update
BaseModel.load_model_weightsand other load paths to avoid building large dicts; use streaming mappings + view wrappers instead. - Update model detection (
comfy/model_detection.py) to use metadata-based shape/dtype access (no tensor reads). - Update direct safetensors loaders (e.g.,
comfy/sd1_clip.py) to go throughload_torch_fileso everything uses the same streaming loader.
Tests and docs
- Add unit tests for metadata correctness, single-tensor loading, and lazy views (no full materialization), plus integration tests for load behavior and GDS failure path.
- Document new flags for RAM cache size and GPUDirect enablement and how to disable GDS when unsupported.