mirror of https://github.com/comfyanonymous/ComfyUI.git synced 2026-07-03 21:20:49 +08:00

model.py	Add JoyImageEdit native model support	2026-06-17 18:53:36 +08:00

huangfeice 5260e18cdf Add JoyImageEdit native model support

JoyImageEdit is an image-edit diffusion transformer from JD (jd-opensource),
licensed under Apache 2.0. This commit adds native ComfyUI support so it loads
and runs like other edit models (load checkpoint -> TextEncode +
ReferenceLatent -> KSampler -> VAEDecode), with no diffusers dependency.

Architecture:
- Transformer (comfy/ldm/joyimage/model.py): dual-stream (img/txt) DiT with a
  Conv3d patch embed (patch_size [1,2,2]), Wan-style learnable modulation,
  and 3D RoPE (rope_dim_list [16,56,56]). All attention goes through
  comfy.ldm.modules.attention.optimized_attention.
- Text encoder (comfy/text_encoders/{qwen3_vl,joyimage}.py): a reusable
  Qwen3-VL multimodal stack (vision tower + LM) in qwen3_vl.py, plus a thin
  JoyImage-specific layer (prompt templates, drop_idx, tokenizer, te() factory)
  in joyimage.py that depends on it. text_dim 4096.
- VAE: reuses the existing Wan 2.1 latent format (AutoencoderKLWan); no new
  latent format is added.
- Edit conditioning: reuses the reference_latents mechanism. Reference and
  noise latents are stacked on a new n-slot dimension and rotated at the model
  boundary (model_base.JoyImage), so the transformer stays 5D-in/5D-out.
  Guidance-rescale is built into the CFG path.
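
To make the patch-embed and RoPE numbers above concrete, here is a minimal
sketch in plain Python. It is not code from comfy/ldm/joyimage/model.py: the
helper names token_grid and rope_slices are hypothetical, and it assumes that
the three rope_dim_list entries map to the (t, h, w) axes in that order and
are laid out back to back in the head channels.

```python
from math import prod

PATCH_SIZE = (1, 2, 2)         # patch_size from the commit message
ROPE_DIM_LIST = (16, 56, 56)   # rope_dim_list from the commit message


def token_grid(latent_thw, patch_size=PATCH_SIZE):
    """Return the (t, h, w) token grid for a latent of shape (T, H, W).

    Hypothetical helper: a Conv3d patch embed whose stride equals its kernel
    size yields one token per non-overlapping patch.
    """
    for size, patch in zip(latent_thw, patch_size):
        if size % patch != 0:
            raise ValueError(f"latent size {size} is not divisible by patch {patch}")
    return tuple(size // patch for size, patch in zip(latent_thw, patch_size))


def rope_slices(dim_list=ROPE_DIM_LIST):
    """Split head channels into contiguous (start, stop) ranges, one per axis.

    Assumption: the axes are packed back to back in list order.
    """
    slices, start = [], 0
    for dim in dim_list:
        slices.append((start, start + dim))
        start += dim
    return slices


if __name__ == "__main__":
    grid = token_grid((1, 128, 128))
    print(grid, prod(grid))    # token grid and sequence length
    print(rope_slices(), sum(ROPE_DIM_LIST))
```

The three RoPE dims sum to 128, which would be the per-head channel count if
the rotary embedding covers the whole head; the commit message does not state
the head dim, so treat that as an inference.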

Model wiring:
- model_base.JoyImage uses ModelType.FLOW with sampling_settings
  multiplier=1000 (the time embedding is trained on t in [0,1000]) and
  shift=1.5; FLOW's linear time_snr_shift matches the diffusers
  FlowMatchEuler sigma schedule.
- model_detection sniffs the transformer state-dict (double_blocks.*,
  condition_embedder.*, 5D img_in Conv3d) to route image_model="joyimage".
- supported_models.JoyImage and the CLIPLoader "joyimage" type register it.
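
The sampling settings above can be illustrated with a standalone sketch. It
assumes the shift takes the common flow-matching form
sigma = s * t / (1 + (s - 1) * t), and that the model timestep is
sigma * multiplier. The helpers sigma_to_timestep and shifted_sigmas are
hypothetical names used only for illustration; this is not the ComfyUI
implementation.

```python
def time_snr_shift(shift: float, t: float) -> float:
    """Shift a flow time t in [0, 1]; shift == 1.0 leaves t unchanged.

    Assumed form: shift * t / (1 + (shift - 1) * t).
    """
    if shift == 1.0:
        return t
    return shift * t / (1.0 + (shift - 1.0) * t)


def sigma_to_timestep(sigma: float, multiplier: float = 1000.0) -> float:
    """Map sigma in [0, 1] to the [0, 1000] range the time embedding expects."""
    return sigma * multiplier


def shifted_sigmas(num_steps: int, shift: float = 1.5) -> list:
    """Hypothetical helper: shift a linear 1 -> 0 grid of num_steps + 1 points."""
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [time_snr_shift(shift, t) for t in ts]


if __name__ == "__main__":
    for sigma in shifted_sigmas(4):
        print(f"sigma={sigma:.4f}  timestep={sigma_to_timestep(sigma):.1f}")
```

With shift=1.5 the endpoints stay at 1 and 0 while the interior points move
toward higher noise, so more of the step budget is spent at large sigma.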

User-facing node TextEncodeJoyImageEdit (comfy_extras/nodes_joyimage.py)
bucket-resizes the input image to the nearest 1024-base bucket, encodes the
prompt with the image, and emits both the conditioning and the bucketed image
so the same pixels feed VAEEncode and the negative encode (JoyImage requires
noise and reference latents to share spatial dims).
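
The bucket-resize step can be sketched as below. The commit message does not
list the buckets or say how the nearest one is chosen, so both the BUCKETS
table and the aspect-ratio criterion are assumptions, and nearest_bucket is a
hypothetical helper rather than the function in
comfy_extras/nodes_joyimage.py.

```python
import math

# Hypothetical (width, height) buckets with area near 1024 * 1024.
# NOT taken from the source; the real table may differ.
BUCKETS = [
    (1024, 1024),
    (1152, 896), (896, 1152),
    (1216, 832), (832, 1216),
    (1344, 768), (768, 1344),
]


def nearest_bucket(width: int, height: int, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the input's.

    Assumption: distance is measured on log(width / height), so that a
    ratio and its reciprocal are treated symmetrically.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    for size in [(2000, 2000), (1920, 1080), (1080, 1920), (800, 600)]:
        print(size, "->", nearest_bucket(*size))
```

Resizing the image once and reusing the result for both VAEEncode and the
text encode is what keeps the reference and noise latents at the same spatial
size.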

2026-06-17 18:53:36 +08:00
