* Add jobs-namespace cancel endpoints
Add two cancel endpoints under the jobs namespace so a job can be
cancelled by id without the caller needing to know whether the job is
running or pending, or branching between /interrupt and /queue.
- POST /api/jobs/{job_id}/cancel cancels one job by id. Idempotent: an
already-finished or unknown id returns 200 {"cancelled": false} rather
than an error.
- POST /api/jobs/cancel takes {"job_ids": [...]} and cancels a batch.
Fail-fast: if any id is unknown the request returns 404 listing the
unknown ids and cancels nothing (no partial side effects).
Both are state-agnostic and map onto the existing queue mechanics: a
running job is interrupted (same path as /interrupt), a pending job is
dequeued (same path as /queue {"delete": [...]}). The cancel logic lives
in comfy_execution.jobs as pure, unit-tested helpers; the server handlers
are thin wrappers. openapi.yaml documents both routes.
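A minimal sketch of the state-agnostic dispatch described above. The result
labels other than CANCEL_UNKNOWN, the helper signature, and to_response are
assumptions made for illustration, not the actual comfy_execution.jobs API.

```python
from typing import Callable, Collection

# Result labels are illustrative; only CANCEL_UNKNOWN is named in this log.
CANCEL_INTERRUPTED = "interrupted"
CANCEL_DEQUEUED = "dequeued"
CANCEL_UNKNOWN = "unknown"


def cancel_job(
    job_id: str,
    running: Collection[str],
    pending: Collection[str],
    interrupt: Callable[[], None],
    dequeue: Callable[[str], None],
) -> str:
    """Cancel one job without the caller knowing its state."""
    if job_id in running:
        interrupt()  # same effect as POST /interrupt
        return CANCEL_INTERRUPTED
    if job_id in pending:
        dequeue(job_id)  # same effect as POST /queue {"delete": [job_id]}
        return CANCEL_DEQUEUED
    return CANCEL_UNKNOWN  # finished or never existed: a no-op


def to_response(result: str) -> dict:
    """Body of the idempotent 200 response for the single-cancel route."""
    return {"cancelled": result != CANCEL_UNKNOWN}
```

Keeping the helper free of server state (the running/pending snapshots and
both callbacks are passed in) is what makes it unit-testable with plain
sets and lambdas.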
* fix: resolve review feedback on cancel endpoints
- Guard cancel_job() against TOCTOU: when dequeue() returns False the
pending job left the queue between snapshot and delete; return
CANCEL_UNKNOWN so callers never report cancelled=True for a remove
that did not happen.
- Validate each job_ids element in the batch cancel endpoint before
any queue access; unhashable or non-UUID values now return 400
instead of raising TypeError (500).
- Update batch HTTP tests to use canonical UUID ids (required now that
the endpoint validates id format) and add tests for the new guards.
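The id validation could look roughly like the sketch below. The function
name, the error messages, and the choice to accept upper-case hex digits are
assumptions; the point is only that every element is checked before any
queue lookup, so bad input becomes a 400 rather than a TypeError.

```python
import uuid


def validate_job_ids(job_ids: object) -> list:
    """Return the ids unchanged if every one is a canonical UUID string.

    Raises ValueError otherwise; a handler would map that to HTTP 400.
    """
    if not isinstance(job_ids, list):
        raise ValueError("job_ids must be a list")
    for value in job_ids:
        # Reject non-strings first, so unhashable values such as lists or
        # dicts never reach a set or dict lookup and raise TypeError.
        if not isinstance(value, str):
            raise ValueError(f"job id must be a string, got {type(value).__name__}")
        try:
            parsed = uuid.UUID(value)
        except ValueError:
            raise ValueError(f"job id is not a UUID: {value!r}") from None
        # uuid.UUID also accepts braces, urn: prefixes and missing hyphens;
        # only the canonical hyphenated form is allowed here.
        if str(parsed) != value.lower():
            raise ValueError(f"job id is not in canonical form: {value!r}")
    return job_ids
```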
* fix: make job cancel atomic and best-effort
Addresses two cancel races/edges raised in review.
Targeted, atomic interrupt. cancel_job's interrupt callback now takes the
prompt id and returns whether it fired; the single-cancel route backs it
with the new PromptQueue.interrupt_if_running, which checks the running set
and signals the interrupt under the queue mutex. This closes the TOCTOU
where a pending job that starts executing between the snapshot and dequeue
(or a running job that finishes between the snapshot and interrupt) could be
missed or, worse, cause an unrelated prompt to be interrupted. The per-prompt
interrupt-flag reset in execute_async keeps a finished job from leaking the
interrupt onto its successor.
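A toy illustration of the targeted interrupt. ToyPromptQueue and cancel are
hypothetical stand-ins, not the real PromptQueue; only the
interrupt_if_running name comes from this commit. The check and the signal
happen under one lock, so the interrupt can only land on the requested
prompt.

```python
import threading
from typing import List, Optional, Set


class ToyPromptQueue:
    """Minimal stand-in for a prompt queue; not the real PromptQueue."""

    def __init__(self) -> None:
        self.mutex = threading.Lock()
        self.running: Set[str] = set()
        self.pending: List[str] = []
        self.interrupted: Optional[str] = None  # id the interrupt was aimed at

    def interrupt_if_running(self, prompt_id: str) -> bool:
        """Signal an interrupt only if prompt_id is executing right now."""
        with self.mutex:
            if prompt_id not in self.running:
                return False
            self.interrupted = prompt_id
            return True

    def dequeue(self, prompt_id: str) -> bool:
        """Remove a pending prompt; False if it already left the queue."""
        with self.mutex:
            if prompt_id in self.pending:
                self.pending.remove(prompt_id)
                return True
            return False


def cancel(queue: ToyPromptQueue, prompt_id: str) -> bool:
    # Jobs only move pending -> running, so trying dequeue first means a job
    # that starts in between is still caught by the targeted interrupt.
    return queue.dequeue(prompt_id) or queue.interrupt_if_running(prompt_id)
```

The dequeue-then-interrupt ordering in cancel is a design choice of this
sketch, not a statement about how the route orders the two steps.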
Best-effort batch cancel. POST /api/jobs/cancel no longer fails the whole
batch with 404 when one id is unknown/finished; such ids are treated as
no-ops, so "cancel all" still cancels the in-progress jobs even if some
finished between the client's snapshot and the request. Malformed ids are
still rejected with 400.
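The best-effort behaviour can be sketched as a loop that never aborts on a
miss. cancel_batch, its cancel_one callback, and the returned dict shape are
assumptions for illustration; the actual response body is not specified
here.

```python
from typing import Callable, Iterable


def cancel_batch(job_ids: Iterable[str], cancel_one: Callable[[str], bool]) -> dict:
    """Try every id; unknown or finished ids are no-ops, never errors."""
    cancelled, skipped = [], []
    for job_id in job_ids:
        # cancel_one returns True only if it actually interrupted or dequeued.
        (cancelled if cancel_one(job_id) else skipped).append(job_id)
    return {"cancelled": cancelled, "skipped": skipped}
```

With live jobs "a" and "c", a request for ["a", "b", "c"] still cancels
both live jobs and simply reports "b" as skipped.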
LoadTrainingDataset was the only torch.load call in the codebase without
weights_only=True; comfy/utils.py and comfy/sd1_clip.py already pass it.
PyTorch 2.6 and later default to weights_only=True, so this is defense-in-depth
for installs pinned to older PyTorch. Verified a typical shard (latents +
standard conditioning) round-trips cleanly under weights_only=True.
Upstream merged native Qwen3-VL support (#14298), adding
comfy/text_encoders/qwen3vl.py plus helpers in qwen_vl.py / llama.py /
qwen35.py. The JoyImage port previously shipped its own duplicate
Qwen3-VL implementation (comfy/text_encoders/qwen3_vl.py); that
duplication is now removed and the JoyImage text encoder rides on the
upstream stack.
- Delete comfy/text_encoders/qwen3_vl.py.
- Rewrite comfy/text_encoders/joyimage.py to subclass upstream
comfy.text_encoders.qwen3vl. The JoyImage checkpoint is a stock
qwen3vl_8b, so only JoyImage-specific behavior is overridden:
* Qwen3VL8B_JoyImage.forward builds the 3D MRoPE position ids and
injects deepstack visual features on the conditioning path. Upstream
Qwen3VL only does this inside generate() via build_image_inputs;
SDClipModel.forward never passes those kwargs. The JoyImage node
feeds an image through the encoder (clip.tokenize(prompt, images=[..])),
so the override reuses build_image_inputs to reproduce the multimodal
conditioning that Llama2_.forward already accepts kwargs for.
* preprocess_embed keeps JoyImage's bicubic+clamp image preprocessing
(process_qwen3vl_image) instead of upstream's bilinear path, to
preserve validated DiT numerics.
* JoyImageTokenizer keeps the JoyImage system-prompt templates,
suppresses the Qwen3 <think> block, and raises on image-placeholder
count mismatch.
* JoyImageTEModel keeps the drop_idx=34 system-prompt strip and the
pre-final-norm layer tap (layer="hidden", layer_idx=-1).
- sd.py QWEN3VL_8B_JOYIMAGE branch: apply the same state-dict prefix
remap the sibling QWEN3VL branch uses (model.language_model.->model.,
model.visual.->visual., lm_head.->model.lm_head.) so the checkpoint
loads into the upstream Qwen3VL namespace, then use the module-level
llama_detect. Detection ordering is preserved: the JoyImage
discriminator is checked before the generic Qwen3-VL deepstack key.
No changes to llama.py / qwen3vl.py / qwen_vl.py / qwen35.py.
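For reference, the prefix remap listed above amounts to the following. The
three prefix pairs are taken from this message; the function name and the
first-match-only rule are assumptions of this sketch rather than the code in
sd.py.

```python
# (old prefix, new prefix) pairs as listed in the commit message.
PREFIX_REMAP = (
    ("model.language_model.", "model."),
    ("model.visual.", "visual."),
    ("lm_head.", "model.lm_head."),
)


def remap_prefixes(state_dict: dict) -> dict:
    """Return a new dict with at most one prefix rewrite applied per key."""
    out = {}
    for key, value in state_dict.items():
        for old, new in PREFIX_REMAP:
            if key.startswith(old):
                key = new + key[len(old):]
                break  # first match only, so a rewritten key is not remapped again
        out[key] = value
    return out
```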
JoyImageEdit is an image-edit diffusion transformer from JD (jd-opensource),
Apache 2.0. This adds native ComfyUI support so it loads and runs like other
edit models (load checkpoint -> TextEncode + ReferenceLatent -> KSampler ->
VAEDecode), with no diffusers dependency.
Architecture:
- Transformer (comfy/ldm/joyimage/model.py): dual-stream (img/txt) DiT with a
Conv3d patch embed (patch_size [1,2,2]), Wan-style learnable modulation,
and 3D RoPE (rope_dim_list [16,56,56]). All attention goes through
comfy.ldm.modules.attention.optimized_attention.
- Text encoder (comfy/text_encoders/{qwen3_vl,joyimage}.py): a reusable
Qwen3-VL multimodal stack (vision tower + LM) in qwen3_vl.py, plus a thin
JoyImage-specific layer (prompt templates, drop_idx, tokenizer, te() factory)
in joyimage.py that depends on it. text_dim 4096.
- VAE: reuses the existing Wan 2.1 latent format (AutoencoderKLWan), no new
latent format.
- Edit conditioning: reuses the reference_latents mechanism. Reference and
noise latents are stacked on a new n-slot dimension and rotated at the model
boundary (model_base.JoyImage), so the transformer stays 5D-in/5D-out.
Guidance-rescale is built into the CFG path.
Model wiring:
- model_base.JoyImage uses ModelType.FLOW with sampling_settings
multiplier=1000 (the time embedding is trained on t in [0,1000]) and
shift=1.5; FLOW's linear time_snr_shift matches the diffusers
FlowMatchEuler sigma schedule.
- model_detection sniffs the transformer state-dict (double_blocks.*,
condition_embedder.*, 5D img_in Conv3d) to route image_model="joyimage".
- supported_models.JoyImage and the CLIPLoader "joyimage" type register it.
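The sampling settings above can be read as the arithmetic below. This is a
sketch that assumes the shift has the same form as the diffusers
FlowMatchEuler schedule, sigma' = shift * sigma / (1 + (shift - 1) * sigma);
the helper names sigma_from_timestep and model_time are hypothetical.

```python
SHIFT = 1.5        # sampling_settings shift
MULTIPLIER = 1000  # the time embedding is trained on t in [0, 1000]


def time_snr_shift(alpha: float, t: float) -> float:
    """Shift t in [0, 1]; alpha == 1 leaves the schedule unchanged."""
    if alpha == 1.0:
        return t
    return alpha * t / (1 + (alpha - 1) * t)


def sigma_from_timestep(timestep: float) -> float:
    """Map a timestep in [0, 1000] to a shifted sigma in [0, 1]."""
    return time_snr_shift(SHIFT, timestep / MULTIPLIER)


def model_time(sigma: float) -> float:
    """Value fed to the time embedding for a given sigma."""
    return sigma * MULTIPLIER
```

For example, timestep 500 gives 1.5 * 0.5 / (1 + 0.5 * 0.5) = 0.6, and the
endpoints 0 and 1000 map to 0 and 1.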
User-facing node TextEncodeJoyImageEdit (comfy_extras/nodes_joyimage.py)
bucket-resizes the input image to the nearest 1024-base bucket, encodes the
prompt with the image, and emits both the conditioning and the bucketed image
so the same pixels feed VAEEncode and the negative encode (JoyImage requires
noise and reference latents to share spatial dims).
a1d95f3f padded the decode width to the next multiple of 32 with the pad filter to fix libswscale's float YUV->GBR edge corruption, but kept the pad target height equal to the source height. The pad filter requires the target height to be a multiple of the input's vertical chroma subsampling factor, so a chroma-subsampled input such as yuv420p (the format the gbrpf32le float branch decodes) with an odd height makes the filter round the target below the input height and fail to configure: 'Padded dimensions cannot be smaller than input dimensions' (Errno 22). This is reachable from LoadImage, which routes static images through VideoFromFile, on a lossy WebP whose width is not a multiple of 32 and whose height is odd.
The pad filter also fills the added border with black, and chroma upsampling bleeds that black into the cropped edge of every unaligned-width subsampled decode.
Pad both axes to the next multiple of 32 (32 is a multiple of every vertical subsampling factor, including yuv410p's 4 that a plain even rounding misses) and run fillborders mode=smear to replicate the real edge into the padding so it never bleeds into the cropped output, then crop both axes back to the source size. Aligned-width and uint8 paths run the identical to_ndarray call as before and are byte-identical to master; only unaligned-width subsampled inputs change, from a crash or edge artifact to a clean, deterministic decode.
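The padding arithmetic can be sketched as follows. The helper names and the
exact filter strings are illustrative assumptions (the decode path may build
its filter graph differently); pad, fillborders and crop are standard FFmpeg
filters.

```python
ALIGN = 32  # a multiple of every vertical subsampling factor (1, 2, 4)


def align_up(n: int, multiple: int = ALIGN) -> int:
    """Round n up to the next multiple (n itself if already aligned)."""
    return -(-n // multiple) * multiple


def pad_plan(width: int, height: int) -> dict:
    """Padded size, border widths, and an illustrative filter chain."""
    padded_w, padded_h = align_up(width), align_up(height)
    right, bottom = padded_w - width, padded_h - height
    return {
        "padded": (padded_w, padded_h),
        "border": (right, bottom),
        "filters": [
            # Source stays at the top-left; padding goes right and bottom.
            f"pad={padded_w}:{padded_h}:0:0",
            # Replicate the real edge pixels into the padded border.
            f"fillborders=left=0:right={right}:top=0:bottom={bottom}:mode=smear",
            # Crop back to the source size after the format conversion.
            f"crop={width}:{height}:0:0",
        ],
    }
```

A 500x333 input is padded to 512x352, with 12 columns and 19 rows of
smeared border that are cropped away again.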
The aimdo 0.4.10 protocol causes the startup failure to happen too early,
before the aimdo version warning can be shown. This confuses users. Limp
on with 0.4.9, which still works, so that users see the version warning.
* main: implement --vram-headroom
Implement --vram-headroom for dynamic VRAM as a hybrid debug/diagnostic
option for users who still report shared VRAM spills. They can tune the
setting by trial and error to keep a bit more headroom and avoid the
spills.
* main: implement --reserve-vram
Implement --reserve-vram as extra headroom on the simple method, which
is semantically as close as possible to the stated functionality and
former behaviour of non-dynamic VRAM.
Create Video gets a bit_depth option (8-bit/10-bit); the selected depth is carried by the video and applied when it is encoded. Save Video and Video Slice now keep the source bit depth instead of always quantizing to 8-bit, so 10-bit videos stay 10-bit. 10-bit uses h264 with the yuv420p10le pixel format, so there is no new codec or container.
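As a rough illustration of how the carried bit depth selects the encode
settings: the mapping and the function name below are hypothetical, and
yuv420p for the 8-bit case is an assumption (only the 10-bit format is
stated above).

```python
# 10-bit format is from this message; the 8-bit entry is an assumption.
PIX_FMT_BY_BIT_DEPTH = {8: "yuv420p", 10: "yuv420p10le"}


def encode_settings(bit_depth: int) -> dict:
    """Pick the pixel format for a bit depth; the codec stays h264."""
    try:
        pix_fmt = PIX_FMT_BY_BIT_DEPTH[bit_depth]
    except KeyError:
        raise ValueError(f"unsupported bit depth: {bit_depth}") from None
    return {"codec": "h264", "pix_fmt": pix_fmt}
```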
Signed-off-by: bigcat88 <bigcat88@icloud.com>
Add this option for users who know they have so much RAM that they want
to pin everything, or who have a pagefile that outruns their disk speed.
This removes the RAM pressure caps completely and pins behind the
primary model load, forcing all models to be permanently committed
to RAM.