Add an MPS-specific operations module to support Float8 tensors on
Apple Silicon. MPS does not natively support Float8 dtypes, so this
implementation stores the data as uint8 and dequantizes it through a
GPU-accelerated lookup table (LUT), keeping the data on the GPU
throughout.
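The LUT idea can be sketched in a few lines. This is a minimal pure-Python illustration, not the code in comfy/mps_ops.py (which builds a cached table as a torch tensor); the names `build_e4m3_lut` and `dequantize` are hypothetical. Because a Float8 value is one byte, every possible bit pattern can be decoded once into a 256-entry table, after which dequantization is a plain index:

```python
def build_e4m3_lut():
    """256-entry float table, one entry per float8_e4m3fn byte pattern."""
    lut = []
    for b in range(256):
        sign = -1.0 if b & 0x80 else 1.0
        exp = (b >> 3) & 0x0F   # 4 exponent bits, bias 7
        man = b & 0x07          # 3 mantissa bits
        if exp == 0x0F and man == 0x07:
            lut.append(float("nan"))  # e4m3fn: only this pattern is NaN, no infinities
        elif exp == 0:
            lut.append(sign * (man / 8.0) * 2.0 ** -6)  # subnormals
        else:
            lut.append(sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7))
    return lut

def dequantize(u8_bytes, lut):
    # Index-based dequantization: each stored byte selects its float value.
    return [lut[b] for b in u8_bytes]
```

On the GPU the same idea is tensor indexing: a 256-element float tensor indexed by the uint8 storage (cast to an integer index dtype), so the table lookup itself runs on the MPS device.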
- Add comfy/mps_ops.py: Implement cached LUT generation and index-based
dequantization for MPS.
- Modify comfy/quant_ops.py: Add logic to view Float8 tensors as uint8
  when moving to MPS, route dequantization to mps_ops, and add a
  fallback for fp8_linear.
- Modify comfy/float.py: Add CPU staging for stochastic rounding to
  prevent MPS casting errors during quantization.
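For context on the stochastic-rounding change: the CPU staging only works around the MPS cast, while the rounding itself is device-independent. A minimal sketch of unbiased stochastic rounding to an arbitrary sorted value grid follows; the function name is illustrative, not the one used in comfy/float.py:

```python
import bisect
import random

def stochastic_round(x, grid):
    """Round x to a neighboring value on a sorted grid, picking the upper
    neighbor with probability equal to the fractional distance, so the
    result is unbiased in expectation for x inside the grid range."""
    hi = bisect.bisect_left(grid, x)
    if hi == 0:
        return grid[0]    # clamp below the grid
    if hi == len(grid):
        return grid[-1]   # clamp above the grid
    lo = hi - 1
    p_up = (x - grid[lo]) / (grid[hi] - grid[lo])
    return grid[hi] if random.random() < p_up else grid[lo]
```

Averaged over many rounds, the quantization error cancels out, which is why stochastic rounding is preferred over nearest-value rounding when quantizing weights to low-precision formats such as Float8.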
Signed-off-by: Macpaul Lin <macpaul@gmail.com>