* Add jobs-namespace cancel endpoints
Add two cancel endpoints under the jobs namespace so a job can be
cancelled by id without the caller needing to know whether the job is
running or pending, or branching between /interrupt and /queue.
- POST /api/jobs/{job_id}/cancel cancels one job by id. Idempotent: an
already-finished or unknown id returns 200 {"cancelled": false} rather
than an error.
- POST /api/jobs/cancel takes {"job_ids": [...]} and cancels a batch.
Fail-fast: if any id is unknown the request returns 404 listing the
unknown ids and cancels nothing (no partial side effects).
Both are state-agnostic and map onto the existing queue mechanics: a
running job is interrupted (same path as /interrupt), a pending job is
dequeued (same path as /queue {"delete": [...]}). The cancel logic lives
in comfy_execution.jobs as pure, unit-tested helpers; the server handlers
are thin wrappers. openapi.yaml documents both routes.
* fix: resolve review feedback on cancel endpoints
- Guard cancel_job() against TOCTOU: when dequeue() returns False the
pending job left the queue between snapshot and delete; return
CANCEL_UNKNOWN so callers never report cancelled=True for a remove
that did not happen.
- Validate each job_ids element in the batch cancel endpoint before
any queue access; unhashable or non-UUID values now return 400
instead of raising TypeError (500).
- Update batch HTTP tests to use canonical UUID ids (required now that
the endpoint validates id format) and add tests for the new guards.
* fix: make job cancel atomic and best-effort
Addresses two cancel races/edges raised in review.
Targeted, atomic interrupt. cancel_job's interrupt callback now takes the
prompt id and returns whether it fired; the single-cancel route backs it
with the new PromptQueue.interrupt_if_running, which checks the running set
and signals the interrupt under the queue mutex. This closes the TOCTOU
where a pending job that starts executing between the snapshot and dequeue
(or a running job that finishes between the snapshot and interrupt) could be
missed or, worse, cause an unrelated prompt to be interrupted. The per-prompt
interrupt-flag reset in execute_async keeps a finished job from leaking the
interrupt onto its successor.
Best-effort batch cancel. POST /api/jobs/cancel no longer fails the whole
batch with 404 when one id is unknown/finished; such ids are treated as
no-ops, so "cancel all" still cancels the in-progress jobs even if some
finished between the client's snapshot and the request. Malformed ids are
still rejected with 400.
The job_ids query filter added in #13998 has no live consumer: the
frontend Generated tab kept sourcing from GET /jobs, and the cloud side
removed its equivalent filter from the shared asset spec. Carrying it on
the local server only re-introduces Core<->Cloud drift on the shared
contract, so remove it to match.
Removed: the job_ids field + validator on ListAssetsQuery, the IN(...)
clauses in list_references_page, the service/route passthrough, and the
filter-only tests.
Kept: the canonical-UUID prompt_id enforcement at job creation (also
landed in #13998). It stands on its own -- job ids are matched verbatim
by history keys, websocket correlation, and /interrupt -- and cloud
inherits it by running core for execution, so no divergence is created.
* feat(assets): enrich executed WS message with asset metadata
When --enable-assets is set, each file-type output entry in the
`executed` WebSocket message now includes id, name, asset_hash, size,
and mime_type — matching the shape already returned by /upload/image.
The enrichment lives in comfy_execution/asset_enrichment.py (no torch
dependency) and is called from both send sites in execution.py: freshly
executed nodes register the file inline via register_file_in_place;
cached node re-sends look up the existing AssetReference by file path
to avoid re-hashing. Errors are caught per-entry so a failure never
blocks the WS message from sending.
* fix(assets): inject only id in executed WS message per Asset Identity RFC
Per the Asset Identity RFC, the executed WebSocket payload should carry
id alone — hash is already encoded in the filename, and name/preview_url/
size belong behind GET /api/assets/{id} rather than being pushed eagerly.
Simplifies the DB lookup path: we only need ref.id, so the asset.hash
null-check is no longer required as a fallback trigger.
* fix(assets): reject path traversal when resolving output abs_path
Subfolder/filename were joined and absolutized without containment check,
so '..' segments or an absolute filename could escape the type's base
directory and register an unrelated on-disk file as an asset.
Add commonpath-based containment check; skip enrichment (warn, leave
entry unchanged) when the resolved path escapes base. Catches ValueError
from cross-drive paths on Windows.
* docs(assets): drop Asset Identity RFC reference from docstring
* docs(assets): trim docstring to what enrichment does, not what it doesn't
* test(assets): use real platform paths so containment check works on Windows
The previous test setup patched os.path.abspath to identity and used a
POSIX-style '/output' base, which collided with Windows path separators
in os.path.commonpath. Drop the abspath/join patches and use a real
tempdir-rooted base so the containment check runs against actual
platform paths.
* refactor(assets): enrich at output-processing time, not in the WS send path
Per review: enrichment lived inside the client_id-guarded send sites, so a
headless run (no websocket client) never registered assets at all, and
ui_outputs/history stored the un-enriched entries.
Now output_ui is enriched once, right after the node produces it and before
it is stored in ui_outputs — so registration happens regardless of connected
clients, and the asset id flows into history and the execution cache for
free. _send_cached_ui re-sends the stored (already-enriched) dict verbatim,
which lets the DB-lookup-by-path fallback be deleted: every enrichment is
now a fresh output, and register_file_in_place re-hashes on upsert so an
overwritten path can never carry a stale id.
* feat(assets): add job_ids filter to GET /api/assets
Mirrors the existing cloud `job_ids` query param on the local Python server:
clients can pass a comma-separated list (or repeated query params) of UUIDs
to filter assets by their associated job.
The `AssetReference.job_id` column already exists, so no migration is
needed — this just plumbs the filter through schema → service → query.
Marks the parameter as available in both runtimes by dropping the
`[cloud-only]` description prefix and the `x-runtime: [cloud]` tag from
the OpenAPI spec, per the OSS field-drift convention (absent runtime tag
= populated by both local and cloud).
* fix(assets): tighten job_ids — array schema, max_length, narrow except
From cursor-reviews on the parent commit:
- OpenAPI: declare job_ids as `type: array, items: string format: uuid`
with `style: form, explode: true` so it matches the documented
contract (and matches sibling include_tags/exclude_tags shape).
Description now states both accepted shapes explicitly.
- Schema: cap `job_ids` at 500 entries (max_length on the Pydantic
field) so a client can't splice an unbounded list into the IN clauses.
- Schema: drop `AttributeError` from the except — `raw` only contains
`str` items by construction, so `uuid.UUID(<str>)` raises `ValueError`
exclusively; the second clause was dead code.
* fix(assets): tighten job_ids validator + add schema-level tests
Aligns with the parallel hardening from draft PR #13848 (now closed as
a duplicate). The validator now:
- Raises ValueError on non-string list items (was: silently dropped).
- Raises ValueError on non-string / non-list top-level values like dict
or int (was: silently passed through to Pydantic's downstream coercion).
Adds tests-unit/assets_test/queries/test_list_assets_query.py covering
the validator end-to-end: CSV canonicalization, dedup order, default
empty, invalid UUID, non-string list item, non-string non-list value,
and the max_length=500 boundary.
* feat(prompt): enforce canonical UUID prompt_id at job creation
POST /prompt previously accepted any client-supplied prompt_id verbatim,
str()-coercing even non-strings, and minting the literal job id "None"
for an explicit JSON null. The new GET /api/assets job_ids filter matches
stored job ids as canonical UUIDs exactly, so a non-UUID id minted a job
whose assets could never be filtered.
- validate_job_id (comfy_execution/jobs.py): requires a string in the
canonical lowercase hyphenated UUID form; raises ValueError otherwise,
including parseable-but-non-canonical spellings (uppercase, braced, URN,
bare hex), which would otherwise be silently rewritten and then miss
every exact-match lookup downstream (history keys, websocket
correlation, /interrupt, the assets job_ids filter).
- POST /prompt: absent or null prompt_id means the server mints uuid4;
invalid means 400 invalid_prompt_id on the standard error envelope.
- openapi.yaml: document the request-side prompt_id (format uuid,
nullable) on PromptRequest.
- tests: unit matrix for validate_job_id; integration tests against the
booted server covering rejection, acceptance, and null handling.
---------
Co-authored-by: guill <jacob.e.segal@gmail.com>
* pinned_memory: remove JIT RAM pressure release
This doesn't work, as freeing intermediates for pins needs to be
higher-priority than freeing pins-for-pins if and when you are going
to do that. So this is too late as pins-for-pins is model load time
and we dont have JIT pins-for-pins.
* cacheing: Add a filter to only free intermediates from inactive wfs
This is to get priorities in amongst pins straight.
* mm: free inactive-ram from RAM cache first
Stuff from inactive workflows should be freed before anything else.
* caching: purge old ModelPatchers first
Dont try and score them, just dump them at the first sign of trouble
if they arent part of the workflow.
* Revert "Revert "feat: Add CacheProvider API for external distributed caching …"
This reverts commit d1d53c14be.
* fix: gate provider lookups to outputs cache and fix UI coercion
- Add `enable_providers` flag to BasicCache so only the outputs cache
triggers external provider lookups/stores. The objects cache stores
node class instances, not CacheEntry values, so provider calls were
wasted round-trips that always missed.
- Remove `or {}` coercion on `result.ui` — an empty dict passes the
`is not None` gate in execution.py and causes KeyError when the
history builder indexes `["output"]` and `["meta"]`. Preserving
`None` correctly skips the ui_node_outputs addition.
* feat: Add CacheProvider API for external distributed caching
Introduces a public API for external cache providers, enabling distributed
caching across multiple ComfyUI instances (e.g., Kubernetes pods).
New files:
- comfy_execution/cache_provider.py: CacheProvider ABC, CacheContext/CacheValue
dataclasses, thread-safe provider registry, serialization utilities
Modified files:
- comfy_execution/caching.py: Add provider hooks to BasicCache (_notify_providers_store,
_check_providers_lookup), subcache exclusion, prompt ID propagation
- execution.py: Add prompt lifecycle hooks (on_prompt_start/on_prompt_end) to
PromptExecutor, set _current_prompt_id on caches
Key features:
- Local-first caching (check local before external for performance)
- NaN detection to prevent incorrect external cache hits
- Subcache exclusion (ephemeral subgraph results not cached externally)
- Thread-safe provider snapshot caching
- Graceful error handling (provider errors logged, never break execution)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: use deterministic hash for cache keys instead of pickle
Pickle serialization is NOT deterministic across Python sessions due
to hash randomization affecting frozenset iteration order. This causes
distributed caching to fail because different pods compute different
hashes for identical cache keys.
Fix: Use _canonicalize() + JSON serialization which ensures deterministic
ordering regardless of Python's hash randomization.
This is critical for cross-pod cache key consistency in Kubernetes
deployments.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* test: add unit tests for CacheProvider API
- Add comprehensive tests for _canonicalize deterministic ordering
- Add tests for serialize_cache_key hash consistency
- Add tests for contains_nan utility
- Add tests for estimate_value_size
- Add tests for provider registry (register, unregister, clear)
- Move json import to top-level (fix inline import)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* style: remove unused imports in test_cache_provider.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: move _torch_available before usage and use importlib.util.find_spec
Fixes ruff F821 (undefined name) and F401 (unused import) errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: use hashable types in frozenset test and add dict test
Frozensets can only contain hashable types, so use nested frozensets
instead of dicts. Added separate test for dict handling via serialize_cache_key.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor: expose CacheProvider API via comfy_api.latest.Caching
- Add Caching class to comfy_api/latest/__init__.py that re-exports
from comfy_execution.cache_provider (source of truth)
- Fix docstring: "Skip large values" instead of "Skip small values"
(small compute-heavy values are good cache targets)
- Maintain backward compatibility: comfy_execution.cache_provider
imports still work
Usage:
from comfy_api.latest import Caching
class MyProvider(Caching.CacheProvider):
def on_lookup(self, context): ...
def on_store(self, context, value): ...
Caching.register_provider(MyProvider())
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: clarify should_cache filtering criteria
Change docstring from "Skip large values" to "Skip if download time > compute time"
which better captures the cost/benefit tradeoff for external caching.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: make should_cache docstring implementation-agnostic
Remove prescriptive filtering suggestions - let implementations
decide their own caching logic based on their use case.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: add optional ui field to CacheValue
- Add ui field to CacheValue dataclass (default None)
- Pass ui when creating CacheValue for external providers
- Use result.ui (or default {}) when returning from external cache lookup
This allows external cache implementations to store/retrieve UI data
if desired, while remaining optional for implementations that skip it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor: rename _is_cacheable_value to _is_external_cacheable_value
Clearer name since objects are also cached locally - this specifically
checks for external caching eligibility.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor: async CacheProvider API + reduce public surface
- Make on_lookup/on_store async on CacheProvider ABC
- Simplify CacheContext: replace cache_key + cache_key_bytes with
cache_key_hash (str hex digest)
- Make registry/utility functions internal (_prefix)
- Trim comfy_api.latest.Caching exports to core API only
- Make cache get/set async throughout caching.py hierarchy
- Use asyncio.create_task for fire-and-forget on_store
- Add NaN gating before provider calls in Core
- Add await to 5 cache call sites in execution.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: remove unused imports (ruff) and update tests for internal API
- Remove unused CacheContext and _serialize_cache_key imports from
caching.py (now handled by _build_context helper)
- Update test_cache_provider.py to use _-prefixed internal names
- Update tests for new CacheContext.cache_key_hash field (str)
- Make MockCacheProvider methods async to match ABC
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address coderabbit review feedback
- Add try/except to _build_context, return None when hash fails
- Return None from _serialize_cache_key on total failure (no id()-based fallback)
- Replace hex-like test literal with non-secret placeholder
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use _-prefixed imports in _notify_prompt_lifecycle
The lifecycle notification method was importing the old non-prefixed
names (has_cache_providers, get_cache_providers, logger) which no
longer exist after the API cleanup.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add sync get_local/set_local for graph traversal
ExecutionList in graph.py calls output_cache.get() and .set() from
sync methods (is_cached, cache_link, get_cache). These cannot await
the now-async get/set. Add get_local/set_local that bypass external
providers and only access the local dict — which is all graph
traversal needs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: remove cloud-specific language from cache provider API
Make all docstrings and comments generic for the OSS codebase.
Remove references to Kubernetes, Redis, GCS, pods, and other
infrastructure-specific terminology.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* style: align documentation with codebase conventions
Strip verbose docstrings and section banners to match existing minimal
documentation style used throughout the codebase.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add usage example to Caching class, remove pickle fallback
- Add docstring with usage example to Caching class matching the
convention used by sibling APIs (Execution.set_progress, ComfyExtension)
- Remove non-deterministic pickle fallback from _serialize_cache_key;
return None on JSON failure instead of producing unretrievable hashes
- Move cache_provider imports to top of execution.py (no circular dep)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* refactor: move public types to comfy_api, eager provider snapshot
Address review feedback:
- Move CacheProvider/CacheContext/CacheValue definitions to
comfy_api/latest/_caching.py (source of truth for public API)
- comfy_execution/cache_provider.py re-exports types from there
- Build _providers_snapshot eagerly on register/unregister instead
of lazy memoization in _get_cache_providers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: generalize self-inequality check, fail-closed canonicalization
Address review feedback from guill:
- Rename _contains_nan to _contains_self_unequal, use not (x == x)
instead of math.isnan to catch any self-unequal value
- Remove Unhashable and repr() fallbacks from _canonicalize; raise
ValueError for unknown types so _serialize_cache_key returns None
and external caching is skipped (fail-closed)
- Update tests for renamed function and new fail-closed behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: suppress ruff F401 for re-exported CacheContext
CacheContext is imported from _caching and re-exported for use by
caching.py. Add noqa comment to satisfy the linter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: enable external caching for subcache (expanded) nodes
Subcache nodes (from node expansion) now participate in external
provider store/lookup. Previously skipped to avoid duplicates, but
the cost of missing partial-expansion cache hits outweighs redundant
stores — especially with looping behavior on the horizon.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: wrap register/unregister as explicit static methods
Define register_provider and unregister_provider as wrapper functions
in the Caching class instead of re-importing. This locks the public
API signature in comfy_api/ so internal changes can't accidentally
break it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use debug-level logging for provider registration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: follow ProxiedSingleton pattern for Caching class
Add Caching as a nested class inside ComfyAPI_latest inheriting from
ProxiedSingleton with async instance methods, matching the Execution
and NodeReplacement patterns. Retains standalone Caching class for
direct import convenience.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: inline registration logic in Caching class
Follow the Execution/NodeReplacement pattern — the public API methods
contain the actual logic operating on cache_provider module state,
not wrapper functions delegating to free functions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: single Caching definition inside ComfyAPI_latest
Remove duplicate standalone Caching class. Define it once as a nested
class in ComfyAPI_latest (matching Execution/NodeReplacement pattern),
with a module-level alias for import convenience.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: remove prompt_id from CacheContext, type-safe canonicalization
Remove prompt_id from CacheContext — it's not relevant for cache
matching and added unnecessary plumbing (_current_prompt_id on every
cache). Lifecycle hooks still receive prompt_id directly.
Include type name in canonicalized primitives so that int 7 and
str "7" produce distinct hashes. Also canonicalize dict keys properly
instead of str() coercion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address review feedback on cache provider API
- Hold references to pending store tasks to prevent "Task was destroyed
but it is still pending" warnings (bigcat88)
- Parallel cache lookups with asyncio.gather instead of sequential
awaits for better performance (bigcat88)
- Delegate Caching.register/unregister_provider to existing functions
in cache_provider.py instead of reimplementing (bigcat88)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>