[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX
New blog post by Daniel Vega-Myhre (Meta/PyTorch) walking through GEMM kernel design for FP8, including deep dives into the constraints and design challenges introduced by MXFP8.
Link: https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html
Original Tweet: https://x.com/vega_myhre/status/2038293614204445039
Additional resources:
MXFP8 and DeepEP for DeepSeek-V3 on B200 w/ TorchTitan: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-torchtitan/
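For readers unfamiliar with the format: in MXFP8 (the OCP Microscaling spec), every block of 32 elements shares one power-of-two scale (E8M0) and the elements themselves are stored in FP8 (typically E4M3, max finite value 448). The sketch below is an illustrative NumPy simulation of that block-scaled quantization, not the bit-exact hardware behavior and not code from the linked post; the function names and the crude 3-mantissa-bit rounding are my own simplifications.

```python
import numpy as np

BLOCK = 32          # MX block size: 32 elements share one scale
E4M3_MAX = 448.0    # largest finite value representable in FP8 E4M3

def mxfp8_quantize(x: np.ndarray):
    """Toy MXFP8-style quantizer for a 1-D array whose length is a
    multiple of 32. Each block gets one power-of-two (E8M0-like)
    scale chosen so the block max fits in the E4M3 range; values are
    then rounded coarsely to mimic E4M3's 3-bit mantissa."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Power-of-two scale: smallest 2^e with amax / 2^e <= E4M3_MAX
    exp = np.ceil(np.log2(np.maximum(amax, 1e-38) / E4M3_MAX))
    scale = 2.0 ** exp
    q = x / scale
    # Crude E4M3 rounding: keep ~3 mantissa bits (illustrative only)
    m, e = np.frexp(q)
    q = np.ldexp(np.round(m * 16) / 16, e)
    return q.reshape(-1), scale.reshape(-1)

def mxfp8_dequantize(q: np.ndarray, scale: np.ndarray):
    """Rescale each 32-element block by its shared scale."""
    return (q.reshape(-1, BLOCK) * scale[:, None]).reshape(-1)
```

The per-block scales are exactly why MXFP8 GEMMs are harder than plain FP8 GEMMs: the kernel must load and apply a scale per 32-element block of each operand while accumulating, which is one of the constraints the post digs into.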
Tagged with
#MXFP8
#GEMM
#FP8
#cuBLAS
#CUDA
#PTX
#performance
#TorchTitan
#design challenges
#pre-training
#constraints
#DeepEP
#DeepSeek-V3
#deep-dives