ComfyUI/comfy/text_encoders
huangfeice 5260e18cdf Add JoyImageEdit native model support
JoyImageEdit is an image-edit diffusion transformer from JD (jd-opensource),
released under Apache 2.0. This adds native ComfyUI support so it loads and
runs like other edit models (load checkpoint -> TextEncode + ReferenceLatent
-> KSampler -> VAEDecode), with no diffusers dependency.

Architecture:
- Transformer (comfy/ldm/joyimage/model.py): dual-stream (img/txt) DiT with a
  Conv3d patch embed (patch_size [1,2,2]), Wan-style learnable modulation,
  and 3D RoPE (rope_dim_list [16,56,56]). All attention goes through
  comfy.ldm.modules.attention.optimized_attention.
- Text encoder (comfy/text_encoders/{qwen3_vl,joyimage}.py): a reusable
  Qwen3-VL multimodal stack (vision tower + LM) in qwen3_vl.py, plus a thin
  JoyImage-specific layer (prompt templates, drop_idx, tokenizer, te() factory)
  in joyimage.py that depends on it. text_dim is 4096.
- VAE: reuses the existing Wan 2.1 latent format (AutoencoderKLWan), no new
  latent format.
- Edit conditioning: reuses the reference_latents mechanism. Reference and
  noise latents are stacked on a new n-slot dimension and rotated at the model
  boundary (model_base.JoyImage), so the transformer stays 5D-in/5D-out.
  Guidance-rescale is built into the CFG path.
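
As a rough illustration of the patch embed and 3D RoPE numbers above, the sketch below computes the token grid for a latent and the rotary angles for one token. It is not the code in comfy/ldm/joyimage/model.py: the helper names, the RoPE base of 10000, and the example latent shape are assumptions made for illustration only.

```python
import math

# Values quoted in the commit message.
PATCH_SIZE = (1, 2, 2)         # Conv3d patch embed over (T, H, W)
ROPE_DIM_LIST = (16, 56, 56)   # rotary channels assigned to the (T, H, W) axes

# Assumption: the conventional RoPE base; the commit message does not state it.
ROPE_THETA = 10000.0


def token_grid(latent_thw):
    """Return the (T, H, W) token grid after patch embedding (hypothetical helper)."""
    grid = []
    for size, patch in zip(latent_thw, PATCH_SIZE):
        if size % patch != 0:
            raise ValueError(f"latent size {size} is not divisible by patch {patch}")
        grid.append(size // patch)
    return tuple(grid)


def rope_angles(position_thw):
    """Return the rotation angles for one token, one angle per channel pair."""
    angles = []
    for pos, dim in zip(position_thw, ROPE_DIM_LIST):
        for i in range(dim // 2):
            inv_freq = ROPE_THETA ** (-2.0 * i / dim)
            angles.append(pos * inv_freq)
    return angles


grid = token_grid((1, 128, 128))          # hypothetical single-frame latent
num_tokens = grid[0] * grid[1] * grid[2]  # tokens seen by the transformer
angles = rope_angles((0, 3, 5))           # token at t=0, h=3, w=5
```

With rope_dim_list [16,56,56] the three axes contribute 8, 28, and 28 channel pairs, so each token gets 64 angles covering 128 rotary channels.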

Model wiring:
- model_base.JoyImage uses ModelType.FLOW with sampling_settings
  multiplier=1000 (the time embedding is trained on t in [0,1000]) and
  shift=1.5; FLOW's linear time_snr_shift matches the diffusers
  FlowMatchEuler sigma schedule.
- model_detection sniffs the transformer state-dict (double_blocks.*,
  condition_embedder.*, 5D img_in Conv3d) to route image_model="joyimage".
- supported_models.JoyImage and the CLIPLoader "joyimage" type register it.
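
The sampling settings above can be sketched as follows. This is a minimal illustration of the linear shift with shift=1.5 and multiplier=1000, not the ComfyUI sampler itself: sigma_schedule and sigma_to_timestep are hypothetical helpers, and the uniformly spaced base sigmas are an assumption.

```python
SHIFT = 1.5        # from the commit message
MULTIPLIER = 1000  # from the commit message: time embedding sees t in [0, 1000]


def time_snr_shift(shift, t):
    """Linear flow-matching shift of a sigma t in [0, 1]."""
    if shift == 1.0:
        return t
    return shift * t / (1.0 + (shift - 1.0) * t)


def sigma_schedule(num_steps):
    """Shifted sigmas from 1.0 down to 0.0 (assumes uniform base spacing)."""
    base = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [time_snr_shift(SHIFT, s) for s in base]


def sigma_to_timestep(sigma):
    """Scale a sigma in [0, 1] to the timestep range the model was trained on."""
    return sigma * MULTIPLIER


sigmas = sigma_schedule(4)
timesteps = [sigma_to_timestep(s) for s in sigmas]
```

The shift leaves the endpoints 0 and 1 fixed and pushes intermediate sigmas upward, so more of the schedule is spent at high noise levels.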

The user-facing node TextEncodeJoyImageEdit (comfy_extras/nodes_joyimage.py)
resizes the input image to the nearest 1024-base bucket, encodes the prompt
with the image, and emits both the conditioning and the bucketed image so
the same pixels feed VAEEncode and the negative encode (JoyImage requires
noise and reference latents to share spatial dims).
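
One way to picture the bucket selection described above is a nearest-aspect-ratio lookup. The sketch below is only an assumption about how such a lookup could work: the bucket table and the nearest_bucket helper are hypothetical and are not taken from comfy_extras/nodes_joyimage.py.

```python
import math

BASE = 1024  # from the commit message: buckets are "1024-base"

# Assumption: a hypothetical bucket table with roughly BASE * BASE pixels each
# and sides divisible by 16. The node's real table may differ.
BUCKETS = [
    (1024, 1024),
    (1152, 896), (896, 1152),
    (1216, 832), (832, 1216),
    (1344, 768), (768, 1344),
    (1536, 640), (640, 1536),
]


def nearest_bucket(width, height):
    """Pick the (width, height) bucket closest to the input in log aspect ratio."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


landscape = nearest_bucket(1920, 1080)
portrait = nearest_bucket(1080, 1920)
square = nearest_bucket(2000, 2000)
```

Comparing aspect ratios in log space treats landscape and portrait inputs symmetrically, which is why the two 16:9 examples map to mirrored buckets.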
2026-06-17 18:53:36 +08:00
ace_lyrics_tokenizer Initial ACE-Step model implementation. (#7972) 2025-05-07 08:33:34 -04:00
byt5_tokenizer Support hunyuan image 2.1 regular model. (#9792) 2025-09-10 02:05:07 -04:00
hydit_clip_tokenizer
llama_tokenizer
qwen25_tokenizer Update qwen tokenizer to add qwen 3 tokens. (#11029) 2025-12-01 17:13:48 -05:00
qwen35_tokenizer feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
t5_pile_tokenizer
t5_tokenizer
ace15.py fix(ace15): handle missing lm_metadata in memory estimation during checkpoint export #12669 (#12686) 2026-02-28 01:18:40 -05:00
ace_text_cleaners.py Make japanese hiragana and katakana characters work with ACE. (#7997) 2025-05-08 03:32:36 -04:00
ace.py Make japanese hiragana and katakana characters work with ACE. (#7997) 2025-05-08 03:32:36 -04:00
anima.py Small cleanup and try to get qwen 3 work with the text gen. (#12537) 2026-02-19 22:42:28 -05:00
aura_t5.py More flexible long clip support. 2025-04-15 10:32:21 -04:00
bert.py P2 of qwen edit model. (#9412) 2025-08-18 22:38:34 -04:00
byt5_config_small_glyph.json Support hunyuan image 2.1 regular model. (#9792) 2025-09-10 02:05:07 -04:00
cogvideo.py Void model - pass 1 & 2 (CORE-38) (#13403) 2026-05-05 19:59:04 -07:00
cosmos.py Fix chroma fp8 te being treated as fp16. (#11795) 2026-01-10 14:40:42 -08:00
ernie.py Use ErnieTEModel_ not ErnieTEModel. (#13431) 2026-04-16 10:11:58 -04:00
flux.py Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
gemma4.py feat: Gemma4 text generation support (CORE-30) (#13376) 2026-05-02 22:46:15 -04:00
genmo.py Fix chroma fp8 te being treated as fp16. (#11795) 2026-01-10 14:40:42 -08:00
gpt_oss.py feat: Microsoft Lens support (CORE-248) (#14077) 2026-05-25 23:01:51 -07:00
hidream_o1.py feat: Support HiDream-O1-Image (CORE-187) (#13817) 2026-05-11 20:35:53 -07:00
hidream.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
hunyuan_image.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
hunyuan_video.py Support loading flux 2 klein checkpoints saved with SaveCheckpoint. (#12033) 2026-01-22 18:20:48 -05:00
hydit_clip.json
hydit.py Add a T5TokenizerOptions node to set options for the T5 tokenizer. (#7803) 2025-04-25 19:36:00 -04:00
ideogram4.py feat: Support text generation with Qwen3-VL (CORE-276) (#14298) 2026-06-17 08:12:44 +08:00
jina_clip_2.py Implement Jina CLIP v2 and NewBie dual CLIP (#11415) 2025-12-20 00:57:22 -05:00
joyimage.py Add JoyImageEdit native model support 2026-06-17 18:53:36 +08:00
kandinsky5.py Fix qwen scaled fp8 not working with kandinsky. Make basic t2i wf work. (#11162) 2025-12-06 17:50:10 -08:00
llama.py feat: Support text generation with Qwen3-VL (CORE-276) (#14298) 2026-06-17 08:12:44 +08:00
long_clipl.py Cleanup. 2025-04-15 12:13:28 -04:00
longcat_image.py LongCat-Image edit (#13003) 2026-03-21 23:51:05 -04:00
lt.py feat: Gemma4 text generation support (CORE-30) (#13376) 2026-05-02 22:46:15 -04:00
lumina2.py feat: Gemma4 text generation support (CORE-30) (#13376) 2026-05-02 22:46:15 -04:00
mt5_config_xl.json
newbie.py Only apply gemma quant config to gemma model for newbie. (#11436) 2025-12-20 01:02:43 -05:00
omnigen2.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
ovis.py Fix #11963 (#11982) 2026-01-19 22:32:40 -05:00
pixart_t5.py Fix chroma fp8 te being treated as fp16. (#11795) 2026-01-10 14:40:42 -08:00
pixeldit.py feat: Support NVIDIA PixelDiT and PiD (CORE-201) (#14103) 2026-05-26 17:50:14 -07:00
qwen3_vl.py Add JoyImageEdit native model support 2026-06-17 18:53:36 +08:00
qwen3vl.py feat: Support text generation with Qwen3-VL (CORE-276) (#14298) 2026-06-17 08:12:44 +08:00
qwen35.py feat: Support text generation with Qwen3-VL (CORE-276) (#14298) 2026-06-17 08:12:44 +08:00
qwen_image.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
qwen_vl.py feat: Support text generation with Qwen3-VL (CORE-276) (#14298) 2026-06-17 08:12:44 +08:00
sa3.py Support Stable Audio 3 model. (#14010) 2026-05-20 11:34:22 -04:00
sa_t5.py More flexible long clip support. 2025-04-15 10:32:21 -04:00
sam3_clip.py feat: SAM (segment anything) 3.1 support (CORE-34) (#13408) 2026-04-23 00:07:43 -04:00
sd2_clip_config.json
sd2_clip.py More flexible long clip support. 2025-04-15 10:32:21 -04:00
sd3_clip.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
spiece_tokenizer.py feat: Add basic text generation support with native models, initially supporting Gemma3 (#12392) 2026-02-18 20:49:43 -05:00
t5_config_base.json
t5_config_xxl.json
t5_old_config_xxl.json
t5_pile_config_xl.json
t5.py P2 of qwen edit model. (#9412) 2025-08-18 22:38:34 -04:00
umt5_config_base.json Initial ACE-Step model implementation. (#7972) 2025-05-07 08:33:34 -04:00
umt5_config_xxl.json
wan.py Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
z_image.py Enable embeddings for some qwen 3 models. (#12218) 2026-02-02 03:51:09 -05:00