EasyAI代码托管平台

wangbo/ComfyUI

Fork 0

mirror of https://github.com/comfyanonymous/ComfyUI.git synced 2026-07-05 06:01:39 +08:00

Commit Graph

Author	SHA1	Message	Date
huangfeice	e29384be0d	Add JoyImageEditPlus multi-image edit support (unify onto Plus-style forward) JoyImageEditPlus is the multi-image (1-6 reference images) variant of JoyImageEdit, trained from the same base. Its diffusers transformer shares byte-identical weight structure with the single-image variant (894 keys, zero rename) but injects references differently: instead of the single-image slot-stack (stack refs + noise into a 6D tensor and rotate on the frame dim, which forces all items to share resolution), each reference is independently patchified and concatenated on the sequence dim with per-image temporal-offset 3D RoPE, allowing references at different resolutions. Since the single-image port is not yet upstream, this unifies both variants onto the Plus-style forward rather than keeping two paths; single-image is now the ref=1 special case. Verified numerically: at ref=1 with equal resolution the new path's RoPE is bit-identical to the old slot-stack layout, and the transformer output matches the diffusers Plus reference (fp32, incl. the different-resolution case). ComfyUI runs cond/uncond in one forward with a shared reference configuration, so the diffusers Plus batched RoPE, padding attention_mask, and dedicated attention processor are unnecessary here: the unified forward reuses the existing unbatched _apply_rotary_emb and JoyImageAttention. Confirmed equivalent to the diffusers batched+mask path for a single sample. - comfy/ldm/joyimage/model.py: forward takes ref_latents and builds components=[target, ref0, ...]; per-component patchify + temporal-offset RoPE; output keeps only the target segment. Old single-grid RoPE removed. - comfy/model_base.py: JoyImage drops the slot-stack / frame-rotation / shape-equality path in _apply_model, passing ref_latents straight to the transformer. Guidance-rescale and the reference_latents requirement are kept. - comfy/text_encoders/joyimage.py: the image template emits one vision block per reference (N = image count); N=1 is byte-for-byte the old template. - comfy_extras/nodes_joyimage.py: add TextEncodeJoyImageEditPlus with optional image1..image6 inputs, each bucket-resized and VAE-encoded into the reference_latents list. Detection, supported_models, and sd.py need no changes: the identical weight structure routes both variants through image_model="joyimage".	2026-07-01 18:36:43 +08:00
huangfeice	e96bd48e2d	Adapt JoyImageEdit text encoder onto upstream Qwen3-VL stack Upstream merged native Qwen3-VL support (#14298), adding comfy/text_encoders/qwen3vl.py plus helpers in qwen_vl.py / llama.py / qwen35.py. The JoyImage port previously shipped its own duplicate Qwen3-VL implementation (comfy/text_encoders/qwen3_vl.py); that duplication is now removed and the JoyImage text encoder rides on the upstream stack. - Delete comfy/text_encoders/qwen3_vl.py. - Rewrite comfy/text_encoders/joyimage.py to subclass upstream comfy.text_encoders.qwen3vl. The JoyImage checkpoint is a stock qwen3vl_8b, so only JoyImage-specific behavior is overridden: * Qwen3VL8B_JoyImage.forward builds the 3D MRoPE position ids and injects deepstack visual features on the conditioning path. Upstream Qwen3VL only does this inside generate() via build_image_inputs; SDClipModel.forward never passes those kwargs. The JoyImage node feeds an image through the encoder (clip.tokenize(prompt, images=[..])), so the override reuses build_image_inputs to reproduce the multimodal conditioning that Llama2_.forward already accepts kwargs for. * preprocess_embed keeps JoyImage's bicubic+clamp image preprocessing (process_qwen3vl_image) instead of upstream's bilinear path, to preserve validated DiT numerics. * JoyImageTokenizer keeps the JoyImage system-prompt templates, suppresses the Qwen3 <think> block, and raises on image-placeholder count mismatch. * JoyImageTEModel keeps the drop_idx=34 system-prompt strip and the pre-final-norm layer tap (layer="hidden", layer_idx=-1). - sd.py QWEN3VL_8B_JOYIMAGE branch: apply the same state-dict prefix remap the sibling QWEN3VL branch uses (model.language_model.->model., model.visual.->visual., lm_head.->model.lm_head.) so the checkpoint loads into the upstream Qwen3VL namespace, then use the module-level llama_detect. Detection ordering is preserved: the JoyImage discriminator is checked before the generic Qwen3-VL deepstack key. No changes to llama.py / qwen3vl.py / qwen_vl.py / qwen35.py.	2026-06-17 21:29:33 +08:00
huangfeice	5260e18cdf	Add JoyImageEdit native model support JoyImageEdit is an image-edit diffusion transformer from JD (jd-opensource), Apache 2.0. This adds native ComfyUI support so it loads and runs like other edit models (load checkpoint -> TextEncode + ReferenceLatent -> KSampler -> VAEDecode), with no diffusers dependency. Architecture: - Transformer (comfy/ldm/joyimage/model.py): dual-stream (img/txt) DiT with a Conv3d patch embed (patch_size [1,2,2]), Wan-style learnable modulation, and 3D RoPE (rope_dim_list [16,56,56]). All attention goes through comfy.ldm.modules.attention.optimized_attention. - Text encoder (comfy/text_encoders/{qwen3_vl,joyimage}.py): a reusable Qwen3-VL multimodal stack (vision tower + LM) in qwen3_vl.py, plus a thin JoyImage-specific layer (prompt templates, drop_idx, tokenizer, te() factory) in joyimage.py that depends on it. text_dim 4096. - VAE: reuses the existing Wan 2.1 latent format (AutoencoderKLWan), no new latent format. - Edit conditioning: reuses the reference_latents mechanism. Reference and noise latents are stacked on a new n-slot dimension and rotated at the model boundary (model_base.JoyImage), so the transformer stays 5D-in/5D-out. Guidance-rescale is built into the CFG path. Model wiring: - model_base.JoyImage uses ModelType.FLOW with sampling_settings multiplier=1000 (the time embedding is trained on t in [0,1000]) and shift=1.5; FLOW's linear time_snr_shift matches the diffusers FlowMatchEuler sigma schedule. - model_detection sniffs the transformer state-dict (double_blocks., condition_embedder., 5D img_in Conv3d) to route image_model="joyimage". - supported_models.JoyImage and the CLIPLoader "joyimage" type register it. User-facing node TextEncodeJoyImageEdit (comfy_extras/nodes_joyimage.py) bucket-resizes the input image to the nearest 1024-base bucket, encodes the prompt with the image, and emits both the conditioning and the bucketed image so the same pixels feed VAEEncode and the negative encode (JoyImage requires noise and reference latents to share spatial dims).	2026-06-17 18:53:36 +08:00

Author

SHA1

Message

Date

huangfeice

e29384be0d

Add JoyImageEditPlus multi-image edit support (unify onto Plus-style forward)

JoyImageEditPlus is the multi-image (1-6 reference images) variant of
JoyImageEdit, trained from the same base. Its diffusers transformer shares
byte-identical weight structure with the single-image variant (894 keys, zero
rename) but injects references differently: instead of the single-image
slot-stack (stack refs + noise into a 6D tensor and rotate on the frame dim,
which forces all items to share resolution), each reference is independently
patchified and concatenated on the sequence dim with per-image temporal-offset
3D RoPE, allowing references at different resolutions.

Since the single-image port is not yet upstream, this unifies both variants
onto the Plus-style forward rather than keeping two paths; single-image is now
the ref=1 special case. Verified numerically: at ref=1 with equal resolution
the new path's RoPE is bit-identical to the old slot-stack layout, and the
transformer output matches the diffusers Plus reference (fp32, incl. the
different-resolution case).

ComfyUI runs cond/uncond in one forward with a shared reference configuration,
so the diffusers Plus batched RoPE, padding attention_mask, and dedicated
attention processor are unnecessary here: the unified forward reuses the
existing unbatched _apply_rotary_emb and JoyImageAttention. Confirmed
equivalent to the diffusers batched+mask path for a single sample.

- comfy/ldm/joyimage/model.py: forward takes ref_latents and builds
  components=[target, ref0, ...]; per-component patchify + temporal-offset
  RoPE; output keeps only the target segment. Old single-grid RoPE removed.
- comfy/model_base.py: JoyImage drops the slot-stack / frame-rotation /
  shape-equality path in _apply_model, passing ref_latents straight to the
  transformer. Guidance-rescale and the reference_latents requirement are kept.
- comfy/text_encoders/joyimage.py: the image template emits one vision block
  per reference (N = image count); N=1 is byte-for-byte the old template.
- comfy_extras/nodes_joyimage.py: add TextEncodeJoyImageEditPlus with optional
  image1..image6 inputs, each bucket-resized and VAE-encoded into the
  reference_latents list.

Detection, supported_models, and sd.py need no changes: the identical weight
structure routes both variants through image_model="joyimage".

2026-07-01 18:36:43 +08:00

huangfeice

e96bd48e2d

Adapt JoyImageEdit text encoder onto upstream Qwen3-VL stack

Upstream merged native Qwen3-VL support (#14298), adding
comfy/text_encoders/qwen3vl.py plus helpers in qwen_vl.py / llama.py /
qwen35.py. The JoyImage port previously shipped its own duplicate
Qwen3-VL implementation (comfy/text_encoders/qwen3_vl.py); that
duplication is now removed and the JoyImage text encoder rides on the
upstream stack.

- Delete comfy/text_encoders/qwen3_vl.py.
- Rewrite comfy/text_encoders/joyimage.py to subclass upstream
  comfy.text_encoders.qwen3vl. The JoyImage checkpoint is a stock
  qwen3vl_8b, so only JoyImage-specific behavior is overridden:
  * Qwen3VL8B_JoyImage.forward builds the 3D MRoPE position ids and
    injects deepstack visual features on the conditioning path. Upstream
    Qwen3VL only does this inside generate() via build_image_inputs;
    SDClipModel.forward never passes those kwargs. The JoyImage node
    feeds an image through the encoder (clip.tokenize(prompt, images=[..])),
    so the override reuses build_image_inputs to reproduce the multimodal
    conditioning that Llama2_.forward already accepts kwargs for.
  * preprocess_embed keeps JoyImage's bicubic+clamp image preprocessing
    (process_qwen3vl_image) instead of upstream's bilinear path, to
    preserve validated DiT numerics.
  * JoyImageTokenizer keeps the JoyImage system-prompt templates,
    suppresses the Qwen3 <think> block, and raises on image-placeholder
    count mismatch.
  * JoyImageTEModel keeps the drop_idx=34 system-prompt strip and the
    pre-final-norm layer tap (layer="hidden", layer_idx=-1).
- sd.py QWEN3VL_8B_JOYIMAGE branch: apply the same state-dict prefix
  remap the sibling QWEN3VL branch uses (model.language_model.->model.,
  model.visual.->visual., lm_head.->model.lm_head.) so the checkpoint
  loads into the upstream Qwen3VL namespace, then use the module-level
  llama_detect. Detection ordering is preserved: the JoyImage
  discriminator is checked before the generic Qwen3-VL deepstack key.

No changes to llama.py / qwen3vl.py / qwen_vl.py / qwen35.py.

2026-06-17 21:29:33 +08:00

huangfeice

5260e18cdf

Add JoyImageEdit native model support

JoyImageEdit is an image-edit diffusion transformer from JD (jd-opensource),
Apache 2.0. This adds native ComfyUI support so it loads and runs like other
edit models (load checkpoint -> TextEncode + ReferenceLatent -> KSampler ->
VAEDecode), with no diffusers dependency.

Architecture:
- Transformer (comfy/ldm/joyimage/model.py): dual-stream (img/txt) DiT with a
  Conv3d patch embed (patch_size [1,2,2]), Wan-style learnable modulation,
  and 3D RoPE (rope_dim_list [16,56,56]). All attention goes through
  comfy.ldm.modules.attention.optimized_attention.
- Text encoder (comfy/text_encoders/{qwen3_vl,joyimage}.py): a reusable
  Qwen3-VL multimodal stack (vision tower + LM) in qwen3_vl.py, plus a thin
  JoyImage-specific layer (prompt templates, drop_idx, tokenizer, te() factory)
  in joyimage.py that depends on it. text_dim 4096.
- VAE: reuses the existing Wan 2.1 latent format (AutoencoderKLWan), no new
  latent format.
- Edit conditioning: reuses the reference_latents mechanism. Reference and
  noise latents are stacked on a new n-slot dimension and rotated at the model
  boundary (model_base.JoyImage), so the transformer stays 5D-in/5D-out.
  Guidance-rescale is built into the CFG path.

Model wiring:
- model_base.JoyImage uses ModelType.FLOW with sampling_settings
  multiplier=1000 (the time embedding is trained on t in [0,1000]) and
  shift=1.5; FLOW's linear time_snr_shift matches the diffusers
  FlowMatchEuler sigma schedule.
- model_detection sniffs the transformer state-dict (double_blocks.*,
  condition_embedder.*, 5D img_in Conv3d) to route image_model="joyimage".
- supported_models.JoyImage and the CLIPLoader "joyimage" type register it.

User-facing node TextEncodeJoyImageEdit (comfy_extras/nodes_joyimage.py)
bucket-resizes the input image to the nearest 1024-base bucket, encodes the
prompt with the image, and emits both the conditioning and the bucketed image
so the same pixels feed VAEEncode and the negative encode (JoyImage requires
noise and reference latents to share spatial dims).

2026-06-17 18:53:36 +08:00

3 Commits