Introducing AutoMuon, a one-line drop-in for AdamW [P]
Hey everyone, I've been working on a small Python package called AutoMuon that makes the Muon optimizer usable as a drop-in replacement for AdamW in arbitrary PyTorch training pipelines.
The core idea is relatively simple: Muon works best on the 2D weight matrices of hidden layers (linear projections, conv kernels), while embeddings, norms, biases, etc. still need AdamW. AutoMuon scans your model at init and automatically assigns the right optimizer to each parameter.
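For anyone curious what that routing looks like, here's a minimal sketch of the general idea in plain PyTorch (this is my illustration of the technique, not AutoMuon's actual code; `split_params` is a hypothetical helper name):

```python
import torch.nn as nn

def split_params(model: nn.Module):
    """Route 2D weights of hidden Linear/Conv layers to Muon;
    everything else (embeddings, norms, biases) stays on AdamW."""
    muon_params, adamw_params = [], []
    for module in model.modules():
        # recurse=False so each parameter is visited exactly once,
        # under the module that actually owns it
        for name, p in module.named_parameters(recurse=False):
            if not p.requires_grad:
                continue
            is_hidden_matrix = p.ndim >= 2 and isinstance(
                module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)
            )
            (muon_params if is_hidden_matrix else adamw_params).append(p)
    return muon_params, adamw_params

model = nn.Sequential(
    nn.Embedding(10, 8),  # 2D, but goes to AdamW (not a hidden projection)
    nn.Linear(8, 8),      # weight -> Muon, bias -> AdamW
    nn.LayerNorm(8),      # scale + bias -> AdamW
    nn.Linear(8, 2),      # weight -> Muon, bias -> AdamW
)
muon_p, adamw_p = split_params(model)
```

From there you'd build two optimizer instances (or one with two param groups) and step both each iteration.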
I am open to PRs, especially for expanding the module-type exclusion list if you hit edge cases in your architecture. Would love to know if anyone tries it on something other than transformers or CNNs and what they find. I suspect it would struggle with fully custom architectures (flash-linear-attention, for instance), so those would require some user tuning.
I am planning to add more tests for time series forecasting, genomics, language modeling, etc. I want to see how generalizable Muon really is!
https://github.com/SkyeGunasekaran/automuon
pip install git+https://github.com/SkyeGunasekaran/automuon.git