DynamicVRAM's on-demand model loading/offloading conflicted with process isolation in three ways: RPC tensor transport stalls from mid-call GPU offload, race conditions between model lifecycle and active RPC operations, and false positive memory leak detection from changed finalizer patterns.
- Marshal CUDA tensors to CPU before RPC transport for dynamic models
- Add operation state tracking + quiescence waits at workflow boundaries
- Distinguish proxy reference release from actual leaks in cleanup_models_gc
- Fix init order: DynamicVRAM must initialize before isolation proxies
- Add RPC timeouts to prevent indefinite hangs on model unavailability
- Prevent proxy-of-proxy chains from DynamicVRAM model reload cycles
- Add torch.device/torch.dtype serializers for new DynamicVRAM RPC paths
- Guard isolation overhead so non-isolated workflows are unaffected
- Migrate env var to PYISOLATE_CHILD
* feat: create a /jobs api to return queue and history jobs
* update unused vars
* include priority
* create jobs helper file
* fix ruff
* update how we set error message
* include execution error in both responses
* rename error -> failed, fix output shape
* re-use queue and history functions
* set workflow id
* allow srot by exec duration
* fix tests
* send priority and remove error msg
* use ws messages to get start and end times
* revert main.py fully
* refactor: move all /jobs business logic to jobs.py
* fix failing test
* remove some tests
* fix non dict nodes
* address comments
* filter by workflow id and remove null fields
* add clearer typing - remove get("..") or ..
* refactor query params to top get_job(s) doc, add remove_sensitive_from_queue
* add brief comment explaining why we skip animated
* comment that format field is for frontend backward compatibility
* fix whitespace
---------
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
Co-authored-by: guill <jacob.e.segal@gmail.com>