mirror of
https://github.com/comfyanonymous/ComfyUI.git
synced 2026-07-03 21:20:49 +08:00
Upstream merged native Qwen3-VL support (#14298), adding comfy/text_encoders/qwen3vl.py plus helpers in qwen_vl.py / llama.py / qwen35.py. The JoyImage port previously shipped its own duplicate Qwen3-VL implementation (comfy/text_encoders/qwen3_vl.py); that duplication is now removed and the JoyImage text encoder rides on the upstream stack.

- Delete comfy/text_encoders/qwen3_vl.py.
- Rewrite comfy/text_encoders/joyimage.py to subclass upstream comfy.text_encoders.qwen3vl. The JoyImage checkpoint is a stock qwen3vl_8b, so only JoyImage-specific behavior is overridden:
  * Qwen3VL8B_JoyImage.forward builds the 3D MRoPE position ids and injects deepstack visual features on the conditioning path. Upstream Qwen3VL only does this inside generate() via build_image_inputs; SDClipModel.forward never passes those kwargs. The JoyImage node feeds an image through the encoder (clip.tokenize(prompt, images=[..])), so the override reuses build_image_inputs to reproduce the multimodal conditioning that Llama2_.forward already accepts kwargs for.
  * preprocess_embed keeps JoyImage's bicubic+clamp image preprocessing (process_qwen3vl_image) instead of upstream's bilinear path, to preserve validated DiT numerics.
  * JoyImageTokenizer keeps the JoyImage system-prompt templates, suppresses the Qwen3 <think> block, and raises on image-placeholder count mismatch.
  * JoyImageTEModel keeps the drop_idx=34 system-prompt strip and the pre-final-norm layer tap (layer="hidden", layer_idx=-1).
- sd.py QWEN3VL_8B_JOYIMAGE branch: apply the same state-dict prefix remap the sibling QWEN3VL branch uses (model.language_model. -> model., model.visual. -> visual., lm_head. -> model.lm_head.) so the checkpoint loads into the upstream Qwen3VL namespace, then use the module-level llama_detect. Detection ordering is preserved: the JoyImage discriminator is checked before the generic Qwen3-VL deepstack key.

No changes to llama.py / qwen3vl.py / qwen_vl.py / qwen35.py.
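The state-dict prefix remap named in the sd.py item can be illustrated with a small standalone sketch. This is not the code in comfy/sd.py: the helper name `remap_prefixes` and the constant `PREFIX_REMAP` are hypothetical, and only the three prefix pairs come from the commit message. It assumes each key is renamed by at most one rule.

```python
# Hypothetical illustration of the prefix remap described in the commit
# message; the real logic lives in comfy/sd.py and may differ.
PREFIX_REMAP = (
    ("model.language_model.", "model."),
    ("model.visual.", "visual."),
    ("lm_head.", "model.lm_head."),
)


def remap_prefixes(state_dict, remap=PREFIX_REMAP):
    """Return a new dict whose keys have the listed prefixes renamed."""
    out = {}
    for key, value in state_dict.items():
        for old, new in remap:
            if key.startswith(old):
                key = new + key[len(old):]
                break  # assumption: apply at most one rule per key
        out[key] = value
    return out


if __name__ == "__main__":
    # Toy keys only; a real checkpoint holds tensors, not integers.
    example = {
        "model.language_model.layers.0.weight": 1,
        "model.visual.blocks.0.weight": 2,
        "lm_head.weight": 3,
    }
    for name in remap_prefixes(example):
        print(name)
```

Stopping after the first matching rule keeps a key that has just been moved under `model.` from being matched again by a later rule, and keys with none of the listed prefixes pass through unchanged.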
||
|---|---|---|
| audio_encoders | ||
| background_removal | ||
| cldm | ||
| comfy_types | ||
| extra_samplers | ||
| image_encoders | ||
| k_diffusion | ||
| ldm | ||
| sd1_tokenizer | ||
| t2i_adapter | ||
| taesd | ||
| text_encoders | ||
| weight_adapter | ||
| bg_removal_model.py | ||
| cli_args.py | ||
| clip_config_bigg.json | ||
| clip_model.py | ||
| clip_vision_config_g.json | ||
| clip_vision_config_h.json | ||
| clip_vision_config_vitl_336_llava.json | ||
| clip_vision_config_vitl_336.json | ||
| clip_vision_config_vitl.json | ||
| clip_vision_siglip2_base_naflex.json | ||
| clip_vision_siglip_384.json | ||
| clip_vision_siglip_512.json | ||
| clip_vision.py | ||
| conds.py | ||
| context_windows.py | ||
| controlnet.py | ||
| deploy_environment.py | ||
| diffusers_convert.py | ||
| diffusers_load.py | ||
| float.py | ||
| gligen.py | ||
| hooks.py | ||
| latent_formats.py | ||
| lora_convert.py | ||
| lora.py | ||
| memory_management.py | ||
| model_base.py | ||
| model_detection.py | ||
| model_management.py | ||
| model_patcher.py | ||
| model_prefetch.py | ||
| model_sampling.py | ||
| multigpu.py | ||
| nested_tensor.py | ||
| ops.py | ||
| options.py | ||
| patcher_extension.py | ||
| pinned_memory.py | ||
| pixel_space_convert.py | ||
| quant_ops.py | ||
| rmsnorm.py | ||
| sample.py | ||
| sampler_helpers.py | ||
| samplers.py | ||
| sd1_clip_config.json | ||
| sd1_clip.py | ||
| sd.py | ||
| sdxl_clip.py | ||
| supported_models_base.py | ||
| supported_models.py | ||
| utils.py | ||