Enable lazy disk-tier loading for streaming weights

ifilipis 2026-01-08 18:40:29 +02:00
parent f925f8fa77
commit 5f2188e31b
8 changed files with 320 additions and 31 deletions

View File

@@ -27,31 +27,31 @@
- `comfy/model_management.py` handles VRAM/RAM offload via `free_memory` and tracks loaded/offloaded memory (needs integration for the RAM→disk tier).【F:comfy/model_management.py†L584-L612】
- `comfy/model_patcher.py` implements module-by-module offload/low-vram weight casting (`comfy_cast_weights`) and partial unload/load (needs to integrate disk tier for RAM eviction).【F:comfy/model_patcher.py†L663-L955】
## Strategy summary (no coding performed yet)
## Strategy summary (implemented)
### Streaming safetensors mapping (no full dict materialization)
- Introduce a new module `comfy/safetensors_stream.py` with:
- `TensorMeta` and `SafeTensorIndex` (metadata-only parsing with `fastsafetensors.SafeTensorsMetadata`).
- `StreamStateDict` as a mapping backed by `SafeTensorIndex`, exposing metadata-only `keys()`/`__iter__` and loading tensors on demand.
- Lightweight mapping views: `PrefixViewStateDict`, `FilterViewStateDict`, `RenameViewStateDict` for lazy prefix/filter/rename without eager loading.
- [x] Introduce a new module `comfy/safetensors_stream.py` with:
- [x] `TensorMeta` and `SafeTensorIndex` (metadata-only parsing with `fastsafetensors.SafeTensorsMetadata`).
- [x] `StreamStateDict` as a mapping backed by `SafeTensorIndex`, exposing metadata-only `keys()`/`__iter__` and loading tensors on demand.
- [x] Lightweight mapping views: `PrefixViewStateDict`, `FilterViewStateDict`, `RenameViewStateDict` for lazy prefix/filter/rename without eager loading.
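A minimal sketch of the idea (illustrative only, not the code in `comfy/safetensors_stream.py`, which builds on `fastsafetensors`): parse the safetensors header once, serve `keys()`/metadata from it, and read a tensor's bytes only on item access. The `LazyStateDict` name and the hand-rolled header parsing are assumptions for the example.

```python
import json
import struct
from collections.abc import Mapping

import torch

_DTYPES = {"F32": torch.float32, "F16": torch.float16, "I64": torch.int64}  # deliberately partial

class LazyStateDict(Mapping):
    """Metadata-only mapping over a .safetensors file; tensor bytes load on access."""

    def __init__(self, path):
        self._path = path
        with open(path, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian header size
            self._meta = json.loads(f.read(header_len))      # name -> {dtype, shape, data_offsets}
        self._meta.pop("__metadata__", None)
        self._data_start = 8 + header_len

    def __iter__(self):      # keys come from the header, nothing else is read
        return iter(self._meta)

    def __len__(self):
        return len(self._meta)

    def __getitem__(self, key):   # only here is the tensor's byte range read
        info = self._meta[key]
        start, end = info["data_offsets"]
        with open(self._path, "rb") as f:
            f.seek(self._data_start + start)
            raw = f.read(end - start)
        return torch.frombuffer(bytearray(raw), dtype=_DTYPES[info["dtype"]]).reshape(info["shape"])
```

The prefix/filter/rename views then wrap such a mapping and rewrite keys lazily in `keys()`/`__getitem__` without ever touching tensor data.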
### Range reads and tiering
- Disk→RAM: use `fastsafetensors.cpp.nogds_file_reader` for range reads and wrap with DLPack.
- Disk→GPU (GDS): use `gds_file_reader` + `gds_file_handle` to read the aligned range directly into GPU memory. If GDS is requested but not supported (e.g., `is_gds_supported==0` or libcufile missing), raise a hard error with instructions to disable GDS.
- Disk→RAM→GPU: read only the tensor range into (optionally pinned) CPU memory, copy to GPU, then release CPU buffer unless RAM cache policy keeps it.
- [x] Disk→RAM: use `fastsafetensors.cpp.nogds_file_reader` for range reads and wrap with DLPack.
- [x] Disk→GPU (GDS): use `gds_file_reader` + `gds_file_handle` to read the aligned range directly into GPU memory. If GDS is requested but not supported (e.g., `is_gds_supported==0` or libcufile missing), raise a hard error with instructions to disable GDS.
- [x] Disk→RAM→GPU: read only the tensor range into (optionally pinned) CPU memory, copy to GPU, then release CPU buffer unless RAM cache policy keeps it.
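A rough illustration of the Disk→RAM(→GPU) path (not the `nogds_file_reader`/DLPack code used here; the `offset`/`nbytes`/`dtype`/`shape` fields on `meta` are assumed for the example):

```python
import torch

def read_tensor_range(path, meta, device, pin_if_cpu=False):
    """Read one tensor's byte range from disk, stage it on CPU, optionally move it to GPU."""
    with open(path, "rb") as f:
        f.seek(meta["offset"])                  # jump straight to this tensor's range
        raw = f.read(meta["nbytes"])            # nothing else in the file is read
    cpu = torch.frombuffer(bytearray(raw), dtype=meta["dtype"]).reshape(meta["shape"])
    if device.type == "cpu":
        return cpu
    if pin_if_cpu:
        cpu = cpu.pin_memory()                  # pinned staging allows an async H2D copy
    return cpu.to(device, non_blocking=True)    # the staging buffer is dropped unless cached
```

The GDS path skips the CPU staging buffer by reading the aligned range directly into GPU memory; when that capability is missing the loader errors out rather than silently falling back.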
### Disk tier integration
- Represent disk-resident weights as meta tensors (`device='meta'`) plus a `DiskRef` registry that stores `(module, param_name) -> TensorMeta + loader handle`.
- Add an LRU cache for RAM-resident weights loaded from disk with configurable max bytes. Eviction replaces RAM tensors with meta tensors and keeps `DiskRef` for reload.
- Add a general `forward_pre_hook` to materialize any meta+DiskRef weights before compute; this covers modules that bypass `comfy.ops`.
- [x] Represent disk-resident weights as meta tensors (`device='meta'`) plus a `DiskRef` registry that stores `(module, param_name) -> TensorMeta + loader handle`.
- [x] Add an LRU cache for RAM-resident weights loaded from disk with configurable max bytes. Eviction replaces RAM tensors with meta tensors and keeps `DiskRef` for reload.
- [x] Add a general `forward_pre_hook` to materialize any meta+DiskRef weights before compute; this covers modules that bypass `comfy.ops`.
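The mechanism in miniature (a hedged sketch, not the `DiskRef`/LRU implementation in `comfy/disk_weights.py`; `DISK_REFS`, `park_on_disk`, and `load_fn` are illustrative stand-ins):

```python
import torch

DISK_REFS = {}  # (id(module), name) -> (loader, requires_grad); stands in for the DiskRef registry

def park_on_disk(module, name, load_fn):
    """Replace a parameter with a meta tensor and remember how to reload it."""
    param = module._parameters[name]
    DISK_REFS[(id(module), name)] = (load_fn, param.requires_grad)
    module._parameters[name] = torch.nn.Parameter(
        torch.empty_like(param, device="meta"), requires_grad=param.requires_grad)

def materialize_pre_hook(module, args):
    """forward_pre_hook: swap real tensors back in before compute."""
    for name, param in module._parameters.items():
        if param is not None and param.device.type == "meta":
            load_fn, requires_grad = DISK_REFS[(id(module), name)]
            module._parameters[name] = torch.nn.Parameter(load_fn(), requires_grad=requires_grad)

linear = torch.nn.Linear(4, 4)
saved = linear.weight.detach().clone()            # stands in for a disk read
park_on_disk(linear, "weight", lambda: saved)
linear.register_forward_pre_hook(materialize_pre_hook)
out = linear(torch.randn(1, 4))                   # the weight is materialized on first use
```

Eviction is the reverse move: the LRU cache drops the RAM copy and re-parks the parameter as a meta tensor while keeping its `DiskRef` for reload.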
### Pipeline refactors
- Update `load_torch_file` to return `StreamStateDict` for `.safetensors`/`.sft` and return metadata without loading.
- Update helpers (`calculate_parameters`, `weight_dtype`, `state_dict_prefix_replace`) to be metadata-aware and lazy.
- Update `BaseModel.load_model_weights` and other load paths to avoid building large dicts; use streaming mappings + view wrappers instead.
- Update model detection (`comfy/model_detection.py`) to use metadata-based shape/dtype access (no tensor reads).
- Update direct safetensors loaders (e.g., `comfy/sd1_clip.py`) to go through `load_torch_file` so everything uses the same streaming loader.
- [x] Update `load_torch_file` to return `StreamStateDict` for `.safetensors`/`.sft` and return metadata without loading.
- [x] Update helpers (`calculate_parameters`, `weight_dtype`, `state_dict_prefix_replace`) to be metadata-aware and lazy.
- [x] Update `BaseModel.load_model_weights` and other load paths to avoid building large dicts; use streaming mappings + view wrappers instead.
- [x] Update model detection (`comfy/model_detection.py`) to use metadata-based shape/dtype access (no tensor reads).
- [x] Update direct safetensors loaders (e.g., `comfy/sd1_clip.py`) to go through `load_torch_file` so everything uses the same streaming loader.
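These helpers only need per-key shape/dtype, which the streaming mapping serves from the header. Assuming the `.meta(key)` accessor exposed by the streaming state dicts, a metadata-only parameter count looks roughly like:

```python
import math

def calculate_parameters(state_dict, prefix=""):
    """Sum parameter counts from header metadata; no tensor bytes are read."""
    return sum(math.prod(state_dict.meta(k).shape)
               for k in state_dict.keys() if k.startswith(prefix))
```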
### Tests and docs
- Add unit tests for metadata correctness, single-tensor loading, and lazy views (no full materialization), plus integration tests for load behavior and GDS failure path.
- Document new flags for RAM cache size and GPUDirect enablement and how to disable GDS when unsupported.
- [x] Add unit tests for metadata correctness, single-tensor loading, and lazy views (no full materialization), plus integration tests for load behavior and GDS failure path.
- [x] Document new flags for RAM cache size and GPUDirect enablement and how to disable GDS when unsupported.
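Usage sketch of the runtime flags, based on the `configure` call exercised by the tests at the end of this diff (the 2 GiB budget is an arbitrary example, not a default):

```python
import comfy.disk_weights

comfy.disk_weights.configure(
    2 * 1024 ** 3,      # RAM-cache budget in bytes (the tests pass 0 to disable the cache)
    allow_gds=False,    # keep GPUDirect Storage off when libcufile/GDS is unavailable
    pin_if_cpu=True,    # pin staged CPU buffers for faster host-to-device copies
    enabled=True,       # turn the lazy disk tier on
)
```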

View File

@@ -25,6 +25,7 @@ import logging
import comfy.utils
import comfy.model_management
import comfy.model_detection
import comfy.disk_weights
import comfy.model_patcher
import comfy.ops
import comfy.latent_formats
@@ -385,7 +386,7 @@ class ControlLora(ControlNet):
        controlnet_config["operations"] = control_lora_ops
        controlnet_config["dtype"] = dtype
        self.control_model = comfy.cldm.cldm.ControlNet(**controlnet_config)
        self.control_model.to(comfy.model_management.get_torch_device())
        comfy.disk_weights.module_to(self.control_model, comfy.model_management.get_torch_device())

        diffusion_model = model.diffusion_model
        sd = diffusion_model.state_dict()
@@ -816,8 +817,8 @@ class T2IAdapter(ControlBase):
        if x_noisy.shape[0] != self.cond_hint.shape[0]:
            self.cond_hint = broadcast_image_to(self.cond_hint, x_noisy.shape[0], batched_number)

        if self.control_input is None:
            self.t2i_model.to(x_noisy.dtype)
            self.t2i_model.to(self.device)
            comfy.disk_weights.module_to(self.t2i_model, dtype=x_noisy.dtype)
            comfy.disk_weights.module_to(self.t2i_model, self.device)
            self.control_input = self.t2i_model(self.cond_hint.to(x_noisy.dtype))
            self.t2i_model.cpu()

View File

@@ -21,14 +21,18 @@ from __future__ import annotations
import collections
import weakref
from dataclasses import dataclass
from typing import Dict, Optional
from typing import Dict, MutableMapping, Optional
import torch
from . import safetensors_stream
ALLOW_GDS = False
PIN_IF_CPU = False
DISK_WEIGHTS_ENABLED = False
BASE_LOAD_FROM_STATE_DICT = torch.nn.Module._load_from_state_dict
LAZY_MODULE_STATE = weakref.WeakKeyDictionary()
@dataclass
@@ -123,6 +127,15 @@ class DiskWeightCache:
            _evict_module_weight(module, entry.name, entry.is_buffer)
        return freed

    def remove_module(self, module: torch.nn.Module):
        to_remove = []
        for key, entry in self._entries.items():
            if entry.module_ref() is module:
                to_remove.append(key)
        for key in to_remove:
            entry = self._entries.pop(key)
            self.current_bytes -= entry.size_bytes

    def _drop_module_entries(self, module_ref: weakref.ReferenceType):
        to_remove = []
        for key, entry in self._entries.items():
@@ -183,7 +196,61 @@ def register_module_weights(module: torch.nn.Module, state_dict, prefix: str = "
        CACHE.record(module, name, buf, is_buffer=True)

@dataclass
class LazyModuleState:
    state_dict: MutableMapping
    prefix: str
    loaded: bool = False

def _has_custom_load(module: torch.nn.Module) -> bool:
    return module.__class__._load_from_state_dict is not BASE_LOAD_FROM_STATE_DICT

def register_lazy_modules(model: torch.nn.Module, state_dict):
    if not hasattr(state_dict, "keys"):
        return
    for name, module in model.named_modules():
        if not _has_custom_load(module):
            continue
        prefix = f"{name}." if name else ""
        if prefix:
            has_key = False
            for param_name in module._parameters.keys():
                if f"{prefix}{param_name}" in state_dict:
                    has_key = True
                    break
            if not has_key:
                for buf_name in module._buffers.keys():
                    if f"{prefix}{buf_name}" in state_dict:
                        has_key = True
                        break
            if not has_key:
                continue
        view = safetensors_stream.FilterViewStateDict(
            state_dict, lambda k, p=prefix: k.startswith(p), mutate_base=False
        )
        LAZY_MODULE_STATE[module] = LazyModuleState(state_dict=view, prefix=prefix)

def _evict_module_weight(module: torch.nn.Module, name: str, is_buffer: bool):
    lazy_state = LAZY_MODULE_STATE.get(module)
    if lazy_state is not None:
        CACHE.remove_module(module)
        refs = REGISTRY.get(module)
        if refs:
            for ref_name, disk_ref in refs.items():
                shape = getattr(disk_ref.meta, "shape", None)
                dtype = getattr(disk_ref.meta, "dtype", None)
                if shape is None or dtype is None:
                    continue
                meta_tensor = torch.empty(shape, dtype=dtype, device="meta")
                if disk_ref.is_buffer:
                    module._buffers[ref_name] = meta_tensor
                else:
                    module._parameters[ref_name] = torch.nn.Parameter(meta_tensor, requires_grad=disk_ref.requires_grad)
        lazy_state.loaded = False
        return
    ref = REGISTRY.get(module)
    if not ref or name not in ref:
        return
@@ -222,6 +289,10 @@ def _find_tensor_device(args, kwargs) -> Optional[torch.device]:
def ensure_module_materialized(module: torch.nn.Module, target_device: torch.device):
    lazy_state = LAZY_MODULE_STATE.get(module)
    if lazy_state is not None and not lazy_state.loaded:
        _materialize_module_from_state_dict(module, lazy_state, target_device)
        return
    refs = REGISTRY.get(module)
    if not refs:
        return
@@ -236,11 +307,14 @@ def ensure_module_materialized(module: torch.nn.Module, target_device: torch.dev
            continue
        if current is None:
            continue
        if current.device.type != "meta":
        if current.device.type == "meta":
            tensor = disk_ref.load(target_device, ALLOW_GDS, PIN_IF_CPU)
        elif current.device != target_device:
            tensor = current.to(device=target_device)
        else:
            if current.device.type == "cpu":
                CACHE.touch(module, name)
            continue
        tensor = disk_ref.load(target_device, ALLOW_GDS, PIN_IF_CPU)
        if is_buffer:
            module._buffers[name] = tensor
        else:
@@ -273,3 +347,138 @@ def evict_ram_cache(bytes_to_free: int):
    if bytes_to_free <= 0:
        return 0
    return CACHE.evict_bytes(bytes_to_free)

def materialize_module_tree(module: torch.nn.Module, target_device: torch.device):
    if not disk_weights_enabled():
        return
    for submodule in module.modules():
        ensure_module_materialized(submodule, target_device)

def _extract_to_device(args, kwargs) -> Optional[torch.device]:
    if "device" in kwargs and kwargs["device"] is not None:
        return torch.device(kwargs["device"])
    for arg in args:
        if isinstance(arg, torch.device):
            return arg
        if isinstance(arg, str):
            return torch.device(arg)
    return None

def _find_existing_device(module: torch.nn.Module) -> Optional[torch.device]:
    for param in module.parameters(recurse=True):
        if param is not None and param.device.type != "meta":
            return param.device
    for buf in module.buffers(recurse=True):
        if buf is not None and buf.device.type != "meta":
            return buf.device
    return None

def module_to(module: torch.nn.Module, *args, **kwargs):
    if disk_weights_enabled():
        target_device = _extract_to_device(args, kwargs)
        if target_device is None:
            target_device = _find_existing_device(module) or torch.device("cpu")
        materialize_module_tree(module, target_device)
    return module.to(*args, **kwargs)

def _replace_tensor(model: torch.nn.Module, name: str, tensor: torch.Tensor, is_buffer: bool, requires_grad: bool):
    parts = name.split(".")
    module = model
    for part in parts[:-1]:
        module = getattr(module, part)
    attr = parts[-1]
    if is_buffer:
        module._buffers[attr] = tensor
    else:
        module._parameters[attr] = torch.nn.Parameter(tensor, requires_grad=requires_grad)

def _materialize_module_from_state_dict(module: torch.nn.Module, lazy_state: LazyModuleState, target_device: torch.device):
    missing_keys = []
    unexpected_keys = []
    error_msgs = []
    metadata = getattr(lazy_state.state_dict, "_metadata", None)
    local_metadata = {} if metadata is None else metadata.get(lazy_state.prefix[:-1], {})
    state_dict = safetensors_stream.DeviceViewStateDict(
        lazy_state.state_dict,
        device=target_device,
        allow_gds=ALLOW_GDS,
        pin_if_cpu=PIN_IF_CPU,
        mutate_base=False,
    )
    factory_device = None
    if hasattr(module, "factory_kwargs") and "device" in module.factory_kwargs:
        factory_device = module.factory_kwargs["device"]
        module.factory_kwargs["device"] = target_device
    try:
        module._load_from_state_dict(
            state_dict,
            lazy_state.prefix,
            local_metadata,
            False,
            missing_keys,
            unexpected_keys,
            error_msgs,
        )
        incompatible = torch.nn.modules.module._IncompatibleKeys(missing_keys, unexpected_keys)
        for hook in module._load_state_dict_post_hooks.values():
            out = hook(module, incompatible)
            if out is not None:
                raise RuntimeError("load_state_dict post hook returned a value, which is unsupported.")
    finally:
        if factory_device is not None:
            module.factory_kwargs["device"] = factory_device
    if len(error_msgs) > 0:
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(module.__class__.__name__, "\n\t".join(error_msgs)))
    lazy_state.loaded = True
    for name, param in module.named_parameters(recurse=False):
        if param.device.type == "cpu":
            CACHE.record(module, name, param, is_buffer=False)
    for name, buf in module.named_buffers(recurse=False):
        if buf is not None and buf.device.type == "cpu":
            CACHE.record(module, name, buf, is_buffer=True)

def lazy_load_state_dict(model: torch.nn.Module, state_dict, strict: bool = False):
    model_keys = set()
    for name, _ in model.named_parameters(recurse=True):
        model_keys.add(name)
    for name, _ in model.named_buffers(recurse=True):
        model_keys.add(name)
    state_keys = set(state_dict.keys())
    missing_keys = [k for k in model_keys if k not in state_keys]
    unexpected_keys = [k for k in state_keys if k not in model_keys]
    if strict:
        error_msgs = []
        if len(unexpected_keys) > 0:
            error_msgs.append('Unexpected key(s) in state_dict: {}.'.format(', '.join(f'"{k}"' for k in unexpected_keys)))
        if len(missing_keys) > 0:
            error_msgs.append('Missing key(s) in state_dict: {}.'.format(', '.join(f'"{k}"' for k in missing_keys)))
        if error_msgs:
            raise RuntimeError("Error(s) in loading state_dict:\n\t{}".format("\n\t".join(error_msgs)))
    for name, param in model.named_parameters(recurse=True):
        if name not in state_keys:
            continue
        meta = state_dict.meta(name)
        meta_tensor = torch.empty(meta.shape, dtype=meta.dtype, device="meta")
        _replace_tensor(model, name, meta_tensor, is_buffer=False, requires_grad=param.requires_grad)
    for name, buf in model.named_buffers(recurse=True):
        if buf is None or name not in state_keys:
            continue
        meta = state_dict.meta(name)
        meta_tensor = torch.empty(meta.shape, dtype=meta.dtype, device="meta")
        _replace_tensor(model, name, meta_tensor, is_buffer=True, requires_grad=False)
    register_module_weights(model, state_dict)
    register_lazy_modules(model, state_dict)
    attach_disk_weight_hooks(model)
    return missing_keys, unexpected_keys

View File

@@ -785,7 +785,7 @@ class ModelPatcher:
                m.comfy_patched_weights = True

            for x in load_completely:
                x[2].to(device_to)
                comfy.disk_weights.module_to(x[2], device_to)

            for x in offloaded:
                n = x[1]
@@ -800,7 +800,7 @@
                logging.info("loaded completely; {:.2f} MB usable, {:.2f} MB loaded, full load: {}".format(lowvram_model_memory / (1024 * 1024), mem_counter / (1024 * 1024), full_load))
                self.model.model_lowvram = False
                if full_load:
                    self.model.to(device_to)
                    comfy.disk_weights.module_to(self.model, device_to)
                    mem_counter = self.model_size()

            self.model.lowvram_patch_counter += patch_counter
@@ -857,7 +857,7 @@
                self.backup.clear()

            if device_to is not None:
                self.model.to(device_to)
                comfy.disk_weights.module_to(self.model, device_to)
                self.model.device = device_to
            self.model.model_loaded_weight_memory = 0
            self.model.model_offload_buffer_memory = 0

View File

@@ -442,6 +442,12 @@ class StreamStateDict(collections.abc.MutableMapping):
            raise KeyError(key)
        if device is None:
            device = self._device
        if device.type == "meta":
            meta = self._index.meta(key)
            target_dtype = dtype or meta.dtype
            if dtype is not None and dtype != meta.dtype:
                _validate_dtype_conversion(meta.dtype, dtype)
            return torch.empty(meta.shape, dtype=target_dtype, device="meta")
        if allow_gds is None:
            allow_gds = self._allow_gds
        meta = self._index.meta(key)
@@ -559,6 +565,37 @@ class _BaseViewStateDict(MutableMapping):
            t = t.to(dtype=dtype)
        return t

class DeviceViewStateDict(_BaseViewStateDict):
    def __init__(
        self,
        base: MutableMapping,
        device: torch.device,
        allow_gds: Optional[bool] = None,
        pin_if_cpu: bool = False,
        mutate_base: bool = False,
    ):
        super().__init__(base, mutate_base=mutate_base)
        self._device = device
        self._allow_gds = allow_gds
        self._pin_if_cpu = pin_if_cpu

    def get_tensor(
        self,
        key: str,
        *,
        device: Optional[torch.device] = None,
        dtype: Optional[torch.dtype] = None,
        allow_gds: Optional[bool] = None,
        pin_if_cpu: bool = False,
    ) -> torch.Tensor:
        device = self._device if device is None else device
        allow_gds = self._allow_gds if allow_gds is None else allow_gds
        pin_if_cpu = self._pin_if_cpu if not pin_if_cpu else pin_if_cpu
        return super().get_tensor(
            key, device=device, dtype=dtype, allow_gds=allow_gds, pin_if_cpu=pin_if_cpu
        )

    def meta(self, key: str):
        if key in self._overrides:
            t = self._overrides[key]

View File

@@ -26,6 +26,7 @@ import os
import comfy.utils
import comfy.safetensors_stream
import comfy.disk_weights
from . import clip_vision
from . import gligen
@@ -125,7 +126,7 @@ class CLIP:
            if not model_management.supports_cast(load_device, dt):
                load_device = offload_device
                if params['device'] != offload_device:
                    self.cond_stage_model.to(offload_device)
                    comfy.disk_weights.module_to(self.cond_stage_model, offload_device)
                    logging.warning("Had to shift TE back.")

        self.tokenizer = tokenizer(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data)
@@ -671,7 +672,7 @@ class VAE:
        if dtype is None:
            dtype = model_management.vae_dtype(self.device, self.working_dtypes)
        self.vae_dtype = dtype
        self.first_stage_model.to(self.vae_dtype)
        comfy.disk_weights.module_to(self.first_stage_model, dtype=self.vae_dtype)
        self.output_device = model_management.intermediate_device()

        self.patcher = comfy.model_patcher.ModelPatcher(self.first_stage_model, load_device=self.device, offload_device=offload_device)
@@ -1546,7 +1547,7 @@ def load_diffusion_model_state_dict(sd, model_options={}, metadata=None):
        model_config.optimizations["fp8"] = True

    model = model_config.get_model(new_sd, "")
    model = model.to(offload_device)
    model = comfy.disk_weights.module_to(model, offload_device)
    model.load_model_weights(new_sd, "")
    left_over = sd.keys()
    if len(left_over) > 0:

View File

@@ -168,6 +168,8 @@ def state_dict_meta(state_dict, key):
def load_state_dict(model, state_dict, strict=False, assign=False):
    if is_stream_state_dict(state_dict):
        if comfy.disk_weights.disk_weights_enabled():
            return comfy.disk_weights.lazy_load_state_dict(model, state_dict, strict=strict)
        comfy.disk_weights.register_module_weights(model, state_dict)
        comfy.disk_weights.attach_disk_weight_hooks(model)
        missing, unexpected = stream_load_state_dict(model, state_dict, strict=strict, assign=assign)
@@ -900,7 +902,10 @@ def copy_to_param(obj, attr, value):
    for name in attrs[:-1]:
        obj = getattr(obj, name)
    prev = getattr(obj, attrs[-1])
    prev.data.copy_(value)
    if prev.device.type == "meta":
        setattr(obj, attrs[-1], torch.nn.Parameter(value, requires_grad=prev.requires_grad))
    else:
        prev.data.copy_(value)

def get_attr(obj, attr: str):
    """Retrieves a nested attribute from an object using dot notation.

View File

@@ -144,3 +144,39 @@ def test_stream_load_without_disk_cache_keeps_cpu_weights(tmp_path):
        assert model.weight.device.type != "meta"
    finally:
        comfy.disk_weights.configure(prev_cache, allow_gds=prev_gds, pin_if_cpu=prev_pin, enabled=prev_enabled)

def test_lazy_disk_weights_loads_on_demand(tmp_path, monkeypatch):
    if importlib.util.find_spec("fastsafetensors") is None:
        pytest.skip("fastsafetensors not installed")
    import comfy.utils
    import comfy.disk_weights

    prev_cache = comfy.disk_weights.CACHE.max_bytes
    prev_gds = comfy.disk_weights.ALLOW_GDS
    prev_pin = comfy.disk_weights.PIN_IF_CPU
    prev_enabled = comfy.disk_weights.DISK_WEIGHTS_ENABLED
    comfy.disk_weights.configure(0, allow_gds=False, pin_if_cpu=False, enabled=True)
    try:
        path = _write_safetensors(tmp_path, {"weight": torch.zeros((4, 4), dtype=torch.float32), "bias": torch.zeros((4,), dtype=torch.float32)})
        sd = comfy.utils.load_torch_file(path, safe_load=True)
        model = torch.nn.Linear(4, 4, bias=True)

        calls = []
        original = sd._file.read_tensor

        def wrapped(meta, device, dtype, allow_gds, pin_if_cpu):
            calls.append(meta)
            return original(meta, device, dtype, allow_gds, pin_if_cpu)

        monkeypatch.setattr(sd._file, "read_tensor", wrapped)

        comfy.utils.load_state_dict(model, sd, strict=True)
        assert model.weight.device.type == "meta"
        assert calls == []

        comfy.disk_weights.ensure_module_materialized(model, torch.device("cpu"))
        assert model.weight.device.type == "cpu"
        assert len(calls) == 2
    finally:
        comfy.disk_weights.configure(prev_cache, allow_gds=prev_gds, pin_if_cpu=prev_pin, enabled=prev_enabled)