[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task
Update to our previous post. We're two independent researchers. Since the last post we have expanded from modular multiplication alone to six algebraic tasks: modular arithmetic, mixed operations, and S5 permutation composition.
Method (unchanged): per-row ℓ₂ clipping on decoder weights after every optimizer step. No weight decay, no extra memory. Implementation: norms.py.

Median steps to 95% val accuracy (Lion+Clip, n=100 seeds per value per task, optimal max_norm per task):
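For readers who want the gist without opening the repo, the rule above can be sketched as follows. This is a minimal NumPy illustration under our stated method, not the actual norms.py; names like clip_rows are ours, and in training the operation is applied in-place to the decoder weight matrix immediately after each Lion step.

```python
import numpy as np

def clip_rows(W: np.ndarray, max_norm: float) -> np.ndarray:
    """Per-row l2 clipping: any row whose l2 norm exceeds max_norm
    is rescaled onto the ball of radius max_norm; other rows are untouched."""
    # Row norms, shape (rows, 1) so the scale broadcasts over columns.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Scale factor is max_norm / norm, capped at 1.0 so small rows are left alone.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale
```

In a PyTorch training loop the equivalent in-place update would run right after `optimizer.step()`, inside `torch.no_grad()`, on the decoder's weight tensor only; no optimizer state or extra buffers are needed, which is where the "no extra memory" claim comes from.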
The S5 result surprised us. The baseline takes 390,896 steps; the Lion+Clip median is 1,348. The non-abelian structure forced a tighter clipping radius: S5 is sharply optimal at max_norm=1.0 and degrades fast above 1.25, while modular multiplication is happy at 2.0.

The most interesting finding: the optimal max_norm correlates with algebraic complexity. Inverse-dependent operations (div, sub) favor 1.5–1.75. Direct operations (mul, add) tolerate up to 2.0. Mixed and non-abelian tasks pull tighter. The bottom-right panel shows this across all three task types, n=100 seeds per value.

Total experiments (including baselines):

Honest scope: all experiments are algebraic tasks (modular arithmetic and permutation groups). Results may not transfer to other domains, and we're not claiming otherwise.

Code + PDF: an implementation is also available in fast-weight-attention by lucidrains. We're still seeking arXiv endorsement (cs.LG); DM if willing.