Clear VBAR state on deepclone to prevent shared C pointer race condition

When deepclone_multigpu deep-copies the model, each ModelVBAR object is copied
together with its raw C pointer (_ptr), so the copy and the original refer to
the same C-side allocation. This causes two issues:
1. Double-free: the shared pointer is freed again when the copied ModelVBAR is GC'd
2. Thread-safety crash: concurrent vbar_fault calls from multiple GPU threads
   corrupt the global linked list in model-vbar.c, which is not protected by a mutex

Fix: after the deepcopy, reset the dynamic_vbars dict and delete the _v attribute
from every module, so each GPU clone allocates fresh, independent VBARs during load().
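
The cleanup step can be sketched without PyTorch as follows. Node is a hypothetical minimal module tree whose modules() mimics torch.nn.Module.modules(); the "cuda:0" key and the integer handle values are placeholders, since the real contents of dynamic_vbars and _v are not shown in this commit. clear_vbar_state is a name invented for the example; the commit inlines the same logic in deepclone_multigpu.

```python
import copy


class Node:
    """Hypothetical module tree with a torch-like modules() iterator."""

    def __init__(self, *children):
        self.children = list(children)

    def modules(self):
        # Yield this node first, then all descendants, like nn.Module.modules().
        yield self
        for child in self.children:
            yield from child.modules()


def clear_vbar_state(model):
    """Drop VBAR references from a deep-copied model (same steps as the diff)."""
    if hasattr(model, "dynamic_vbars"):
        model.dynamic_vbars = {}
    for m in model.modules():
        if hasattr(m, "_v"):
            delattr(m, "_v")


leaf = Node()
leaf._v = 0xDEAD  # placeholder for a per-module VBAR handle
root = Node(leaf)
root.dynamic_vbars = {"cuda:0": 0xBEEF}  # placeholder key and value

clone = copy.deepcopy(root)
clear_vbar_state(clone)
```

Only the clone is touched: the original keeps its dynamic_vbars entry and its _v attribute, while the clone starts with none and is expected to create its own during load().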

Amp-Thread-ID: https://ampcode.com/threads/T-019d7168-759b-7090-8f31-c89dc6ac8d28
Co-authored-by: Amp <amp@ampcode.com>
Jedrzej Kosinski 2026-04-09 02:39:43 -07:00
parent 0a23dd8b43
commit f225af22cd


@@ -414,6 +414,14 @@ class ModelPatcher:
             n.model = temp_model_patcher.model
         else:
             n.model = copy.deepcopy(n.model)
+            # Clear VBAR state so the clone gets fresh, device-specific VBARs during load().
+            # deep-copied ModelVBAR objects share raw C pointers with the original, which causes
+            # double-free and thread-safety issues (concurrent vbar_fault on shared global state).
+            if hasattr(n.model, "dynamic_vbars"):
+                n.model.dynamic_vbars = {}
+            for m in n.model.modules():
+                if hasattr(m, "_v"):
+                    delattr(m, "_v")
         # unlike for normal clone, backup dicts that shared same ref should not;
         # otherwise, patchers that have deep copies of base models will erroneously influence each other.
         n.backup = copy.deepcopy(n.backup)