The idea is that you can indicate how much quality vs speed you want.
At the moment:
--fast 2 enables fp16 accumulation if your PyTorch version supports it.
--fast 5 enables fp8 matrix multiplication on fp8 models, plus the optimization above.
--fast without a number enables all optimizations.
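
A minimal sketch of how an optional-number flag like this could be parsed with argparse. The function names, the optimization labels, and the sentinel value -1 are hypothetical and are not taken from the actual implementation. It also assumes the levels are cumulative thresholds (so an unlisted level such as 3 behaves like 2), which the description above does not confirm.

```python
import argparse

# Hypothetical labels for the two optimizations described above.
ALL_OPTIMIZATIONS = {"fp16_accumulation", "fp8_matrix_mult"}


def build_parser():
    parser = argparse.ArgumentParser()
    # nargs="?" allows "--fast" with or without a number.
    # const is used when the flag is given without a value;
    # default is used when the flag is absent.
    parser.add_argument("--fast", type=int, nargs="?", const=-1, default=0)
    return parser


def enabled_optimizations(level):
    # -1 is this sketch's sentinel for "--fast without a number".
    if level < 0:
        return set(ALL_OPTIMIZATIONS)
    enabled = set()
    # Assumption: each level also enables everything below it.
    if level >= 2:
        enabled.add("fp16_accumulation")
    if level >= 5:
        enabled.add("fp8_matrix_mult")
    return enabled


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(sorted(enabled_optimizations(args.fast)))
```

Using a sentinel for the bare flag keeps "all optimizations" open-ended, so optimizations added later are picked up without changing the meaning of existing numbers.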
- Cosmos now fully tested
- Preliminary support for essential Cosmos prompt "upsampler"
- Lumina tests
- Tweaks to language and image resizing nodes
- Fix for #31: all the samplers are now present again
I'm not sure which arches are supported yet. If you see improvements in
memory usage while using --use-pytorch-cross-attention on your AMD GPU let
me know and I will add it to the list.
* Fix for running via DirectML
Fix the DirectML empty image generation issue with Flux1. Add a CPU fallback for the unsupported path. Verified that the model works on AMD GPUs.
* Fix formatting
* Update causal mask calculation
- fix #29: str(model) no longer raises exceptions, as it did with
HyVideoModelLoader
- don't try to format CUDA tensors because that can sometimes raise
exceptions
- cudaAllocAsync has been disabled for now due to 2.6.0 bugs
- improve florence2 support
- add support for paligemma 2. This requires the fix for transformers
that is currently staged in another repo; install with
`uv pip install --no-deps "transformers@git+https://github.com/zucchini-nlp/transformers.git#branch=paligemma-fix-kwargs"`
- triton has been updated
- fix missing __init__.py files
I think the issue this was working around has been solved.
If you notice that this change slows things down or causes stutters on
your AMD GPU with ROCm on Linux please report it.
* Add oneAPI device selector and some other minor changes.
* Fix device selector variable name.
* Flip minor version check sign.
* Undo changes to README.md.
This should make it possible to do higher res images/longer videos by
further offloading weights to CPU memory.
Please report an issue if this slows things down on your system.