The vendored Lorensen table emits the opposite base winding from skimage, so
the upstream-style faces[:, [2,1,0]] flip produced inward-facing normals
(negative mesh volume). Drop the flip so normals point outward (positive
volume), matching the upstream output orientation.
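
The link between winding and volume sign can be checked with a small library-free sketch. It is not the vendored code: `signed_volume` and the tetrahedron below are illustrative names and data, and the sketch assumes the usual convention that counter-clockwise triangles (seen from outside) have outward normals.

```python
def signed_volume(vertices, faces):
    """Signed volume of a closed triangle mesh (divergence theorem).

    Positive when triangles are wound counter-clockwise as seen from
    outside (outward normals), negative when the winding is reversed.
    """
    total = 0.0
    for i, j, k in faces:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # Scalar triple product a . (b x c) is six times the signed
        # volume of the tetrahedron (origin, a, b, c).
        total += (
            ax * (by * cz - bz * cy)
            - ay * (bx * cz - bz * cx)
            + az * (bx * cy - by * cx)
        )
    return total / 6.0


# Illustrative unit tetrahedron with outward-facing triangles.
tet_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tet_faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

# Pure-Python equivalent of faces[:, [2, 1, 0]]: reverse each triangle.
tet_faces_flipped = [(k, j, i) for i, j, k in tet_faces]

volume_outward = signed_volume(tet_vertices, tet_faces)
volume_flipped = signed_volume(tet_vertices, tet_faces_flipped)
```

Reversing the index order of every triangle negates the result, which is why applying the flip to a table that already emits outward-facing triangles yields a negative volume.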
Amp-Thread-ID: https://ampcode.com/threads/T-019ec361-addb-70d8-a74b-438ce8a1e096
Co-authored-by: Amp <amp@ampcode.com>
scikit-image was added solely for Cube3D's VAEDecodeCube. Replace it with a
vendored, vectorized pure-PyTorch marching cubes (classic Lorensen tables) in
comfy/ldm/cube/marching_cubes.py. This is the same algorithm family as upstream
cube's default warp.MarchingCubes backend, so geometry is closer to upstream's
default than skimage's Lewiner fallback was.
Validated against skimage method='lorensen': identical face count and surface
(nearest-neighbour distance ~3.8e-6, float precision) on sphere/torus fields.
Vertices are welded (shared grid edges interpolate identically) for a clean
indexed mesh. requirements.txt no longer needs scikit-image.
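
A minimal sketch of the welding idea, assuming duplicate vertices are bit-identical (which is what "shared grid edges interpolate identically" implies). The vendored module may weld differently, for example by grid-edge index on tensors; `weld_vertices` and the sample triangles are hypothetical.

```python
def weld_vertices(triangle_soup):
    """Convert a triangle soup into an indexed mesh.

    `triangle_soup` is a list of triangles, each a tuple of three
    (x, y, z) tuples. Identical coordinates share one vertex index.
    """
    index_of = {}   # coordinate tuple -> index into `vertices`
    vertices = []
    faces = []
    for triangle in triangle_soup:
        face = []
        for vertex in triangle:
            if vertex not in index_of:
                index_of[vertex] = len(vertices)
                vertices.append(vertex)
            face.append(index_of[vertex])
        faces.append(tuple(face))
    return vertices, faces


# Two triangles sharing one edge: six corners, four unique vertices.
soup = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
]
welded_vertices, welded_faces = weld_vertices(soup)
```

Exact-match welding only works because both cells that touch an edge compute the same interpolated position; with independently computed vertices a tolerance-based merge would be needed instead.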
Amp-Thread-ID: https://ampcode.com/threads/T-019ec361-addb-70d8-a74b-438ce8a1e096
Co-authored-by: Amp <amp@ampcode.com>
Stop fighting ComfyUI's model management. VAEDecodeCube was manually
calling load_models_gpu + .to(vae.device) and the VAE forced
disable_offload=True because it bypassed the managed decode path.
Now CubeShapeVAE.decode(samples) is the entry point that comfy.sd.VAE.decode
calls, so loading/device/dtype are handled automatically (like Hunyuan3Dv2):
- removed disable_offload=True (let the offload system manage weights)
- removed manual load_models_gpu + .to(device) from the node
- process_output set to identity (default clamps [0,1] in-place and would
destroy the occupancy isosurface)
- decode() pre-inverts VAE.decode's trailing movedim(1,-1) so the node
receives grid logits unchanged (parity preserved)
- memory_used_decode sized by num_tokens (shape[-1]) for the new latent layout
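
Why the default clamp is harmful can be seen on a one-dimensional slice of logits. This is only an illustration: it assumes the surface is extracted at level 0 on signed logits, and `clamp01` and `count_level_crossings` are hypothetical helpers, not ComfyUI functions.

```python
def clamp01(values):
    """Clamp every value into [0, 1], like the default output processing."""
    return [min(max(v, 0.0), 1.0) for v in values]


def count_level_crossings(values, level=0.0):
    """Count adjacent samples that lie on opposite sides of `level`."""
    crossings = 0
    for a, b in zip(values, values[1:]):
        if (a < level) != (b < level):
            crossings += 1
    return crossings


# Signed occupancy logits along one grid line (assumed level: 0).
logits = [-4.0, -1.5, 0.5, 3.0, 0.8, -2.0]
clamped = clamp01(logits)

crossings_before = count_level_crossings(logits)
crossings_after = count_level_crossings(clamped)
```

After clamping, no sample is below the level any more, so a marching-cubes pass at that level finds no sign change and therefore no surface.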
Amp-Thread-ID: https://ampcode.com/threads/T-019ec361-addb-70d8-a74b-438ce8a1e096
Co-authored-by: Amp <amp@ampcode.com>
Cube3D is an autoregressive VQ-token shape model (DualStreamRoformer) plus a
VQ-VAE shape tokenizer (OneDAutoEncoder), not a diffusion model. It is wired
natively following the Causal-WAN AR-video pattern: the GPT loads as a normal
MODEL and generation runs through a dedicated 'cube' sampler instead of KSampler.
- comfy/ldm/cube/gpt.py: DualStreamRoformer port (dual-stream RoPE attention,
per-head RMSNorm, SwiGLU, KV cache; rope_theta=10000).
- comfy/ldm/cube/vae.py: OneDAutoEncoder decode path (codebook lookup, decoder,
occupancy decoder, dense-grid extraction + skimage marching cubes).
- model_detection/supported_models/model_base: register shape_gpt as Cube3D MODEL
(dims inferred from state dict; apply_model guarded to point at SamplerCube).
- sd.py: detect shape_tokenizer and build CubeShapeVAE.
- k_diffusion/sampling.py: sample_cube autoregressive sampler (decaying CFG +
optional top-p), faithful to upstream Engine.run_gpt.
- comfy_extras/nodes_cube.py: EmptyCubeLatent, CubeCodebookPatch (inject VQ
codebook into wte), SamplerCube, VAEDecodeCube (-> MESH).
Reuses CLIP-L conditioning, CFGGuider/SamplerCustomAdvanced, and SaveGLB.
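
One step of an autoregressive sampler with decaying CFG and optional top-p might look like the sketch below. It is not the code in sample_cube: the linear decay of the guidance weight and the `(1 + gamma) * cond - gamma * uncond` mix are assumptions about the schedule, and all function names are illustrative.

```python
import math
import random


def guidance_at(step, total_steps, guidance_scale):
    """ASSUMED schedule: guidance weight decays linearly towards 0."""
    return guidance_scale * (total_steps - step) / total_steps


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total  # renormalize the kept tokens
    return filtered


def sample_next(cond, uncond, step, total_steps, guidance_scale,
                top_p=None, rng=random):
    """Pick the next token id from conditional/unconditional logits."""
    gamma = guidance_at(step, total_steps, guidance_scale)
    # ASSUMED CFG mix; gamma == 0 reduces to the conditional logits.
    logits = [(1 + gamma) * c - gamma * u for c, u in zip(cond, uncond)]
    if top_p is None:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy
    probs = top_p_filter(softmax(logits), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


greedy_token = sample_next([1.0, 2.0, 0.5], [0.0, 0.0, 0.0],
                           step=0, total_steps=4, guidance_scale=3.0)
```

Early tokens get the strongest guidance under this schedule, and later tokens fall back towards the plain conditional distribution.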
Amp-Thread-ID: https://ampcode.com/threads/T-019ec361-addb-70d8-a74b-438ce8a1e096
Co-authored-by: Amp <amp@ampcode.com>
* Initial HiDream01-image support
* Cleanup nodes
* Cleaner handling of empty placeholder models
* Remove snap_to_predefined, prefer tooltip for the trained resolutions
* Add model and block wrappers
* Fix shift tooltip
* Add node to work around the patch tile issue
Experimental: runs multiple passes with the patch grid offset and blends the results using one of several methods.
* Qwen35 vision rotary_pos_emb cast fix
* Fix embedding layout type
* Some small optimizations
* Cleanup, don't need this fallback
* Prefix KV cache, cleanup
Bit of speed, reduce redundant code
* Get rid of redundant custom sampler, refactor noise scaling
Our existing lcm sampler is mathematically the same, so the missing options were added to it, along with a node to control them. Refactored the noise scaling, fixed it for the stochastic samplers, and added a generic node to control the initial noise scale.
* Update nodes_hidream_o1.py
* Fix some cache validation cases
* Keep existing sampling params
* Remove redundant video vision path
* Replace some numpy ops with torch
* Fix RoPE index for batch size > 1
* Prefer torch preprocessing
* Rename block_type to be compatible with existing patch nodes
* Fixes and tweaks
* initial WanDancer support
* nodes_wandancer: Add list form of chunker.
Create an alternate list form of the node so the chunk gens can be
trivially looped by the comfy executor.
* Closer match to original soxr resampling
* Remove librosa node
* Cleanup
---------
Co-authored-by: Rattus <rattus128@gmail.com>
* initial gemma4 support
* parity with reference implementation
Outputs can match transformers 100% with the same sdpa flags; checkpoint this, then optimize.
* Cleanup, video fixes
* cleanup, enable fused rms norm by default
* update comment
* Cleanup
* Update sd.py
* Various fixes
* Add fp8 scaled embedding support
* small fixes
* Translate think tokens
* Fix image encoder attention mask type
So it works with basic attention
* Handle thinking tokens differently only for Gemma4
* Code cleanup
* Update nodes_textgen.py
* Use embed scale class instead of buffer
Slight difference to HF, but technically more accurate and simpler code
* Default to fused rms_norm
* Update gemma4.py
* mm: Use Aimdo raw allocator for cast buffers
PyTorch manages allocation of growing buffers on streams poorly. It
has no Windows support for the expandable segments allocator (which is
the right tool for this job), while also segmenting the memory by
stream such that it can be generally re-used. So kick the problem to
aimdo, which can just grow a virtual region that's freed per stream.
* plan
* ops: move cpu handler up to the caller
* ops: split up prefetch from weight prep block prefetching API
Split up the casting and weight formatting/lora stuff in prep for
arbitrary prefetch support.
* ops: implement block prefetching API
allow a model to construct a prefetch list and operate it for increased
async offload.
* ltxv2: Implement block prefetching
* Implement lora async offload
Implement async offload of loras.
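
The block prefetching idea described above (a model builds a prefetch list and works through it so weight loading overlaps with compute) can be sketched in plain Python. This is not the ops API from these commits: `run_blocks_with_prefetch`, `load_weights` and `compute` are hypothetical names, and a worker thread stands in for an asynchronous transfer stream.

```python
from concurrent.futures import ThreadPoolExecutor


def run_blocks_with_prefetch(blocks, load_weights, compute, x, depth=1):
    """Run `blocks` in order, loading upcoming weights in the background.

    While block i computes, the weights for block i + depth are already
    being loaded on the worker thread.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = {}
        # Prime the pipeline with the first `depth` blocks.
        for i in range(min(depth, len(blocks))):
            pending[i] = pool.submit(load_weights, blocks[i])
        for i, block in enumerate(blocks):
            ahead = i + depth
            if ahead < len(blocks):
                pending[ahead] = pool.submit(load_weights, blocks[ahead])
            weights = pending.pop(i).result()  # wait only for this block
            x = compute(block, weights, x)
    return x


# Toy stand-ins: "loading" returns a scale, "compute" multiplies by it.
load_log = []


def fake_load(block):
    load_log.append(block)
    return {"scale": block + 1}


def fake_compute(block, weights, x):
    return x * weights["scale"]


result = run_blocks_with_prefetch([0, 1, 2], fake_load, fake_compute, 1)
```

A single worker keeps loads in submission order; a larger `depth` trades memory for more overlap.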