[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task
Update to our previous post. We're two independent researchers. Since the last post we have expanded from modular multiplication alone to six algebraic tasks: modular arithmetic, mixed operations, and S5 permutation composition.
Method (unchanged): per-row ℓ₂ clipping on decoder weights after every optimizer step. No weight decay, no extra memory. Implementation: norms.py.

Median steps to 95% val accuracy (Lion+Clip, n=100 seeds per value per task, optimal max_norm per task):
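For readers who want the gist without opening the repo, the rule above can be sketched as follows. This is a minimal NumPy illustration under our stated method, not the actual norms.py; names like clip_rows are ours, and in training the operation is applied in-place to the decoder weight matrix immediately after each Lion step.

```python
import numpy as np

def clip_rows(W: np.ndarray, max_norm: float) -> np.ndarray:
    """Per-row l2 clipping: any row whose l2 norm exceeds max_norm
    is rescaled onto the ball of radius max_norm; other rows are untouched."""
    # Row norms, shape (rows, 1) so the scale broadcasts over columns.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Scale factor is max_norm / norm, capped at 1.0 so small rows are left alone.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale
```

In a PyTorch training loop the equivalent in-place update would run right after `optimizer.step()`, inside `torch.no_grad()`, on the decoder's weight tensor only; no optimizer state or extra buffers are needed, which is where the "no extra memory" claim comes from.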
The S5 result surprised us. The baseline takes 390,896 steps; the Lion+Clip median is 1,348. The non-abelian structure forced a tighter clipping radius: S5 is sharply optimal at max_norm=1.0 and degrades fast above 1.25, while modular multiplication is happy at 2.0.

The most interesting finding: the optimal max_norm correlates with algebraic complexity. Inverse-dependent operations (div, sub) favor 1.5–1.75. Direct operations (mul, add) tolerate up to 2.0. Mixed and non-abelian tasks pull tighter. The bottom-right panel shows this across all three task types, n=100 seeds per value.

Total experiments (including baselines):

Honest scope: all experiments are algebraic tasks (modular arithmetic and permutation groups). Results may not transfer to other domains, and we're not claiming otherwise.

Code + PDF: an implementation is also available in fast-weight-attention by lucidrains. We're still seeking arXiv endorsement (cs.LG); DM if willing.