Pretraining speedrun records
On weekends over the last few months, I've set several pretraining records across modded-nanogpt, OpenAI Parameter Golf, and slowrun. The records are kind of all over the place: kernel improvements, data-mix improvements, optimizer preconditioning improvements, and a novel way of using test-time compute to improve perplexity.
All of the record PRs (some with accompanying blog posts) are below:
modded-nanogpt speedrun
The modded-nanogpt speedrun is a competition to train a language model to a validation loss of 3.28 on FineWeb as quickly as possible. The original run took 45 minutes; the current record is 84.4 seconds (mine, at the time of writing).
- Notable run: test-time training. Cut 30 training steps and reached a 95.9s run by adapting to each validation sequence at test time (sketched after this list). PR #205, blog
- Speedrun record 1: backward transpose kernel. Replaced a large `.T.contiguous()` copy in the cross-entropy backward pass with a custom Triton tiled transpose kernel, saving about 0.4s (sketched after this list). PR #240, blog
- Speedrun record 2: paired-head Muon groups. Split the Q/K paired-head weights into smaller Muon groups, improving loss per step enough to remove 10 steps and save about 0.4s (sketched after this list). PR #253
- Track 3 optimization record: SOAP-style MLP preconditioning. Added Shampoo/SOAP-style preconditioning before the usual Contra-NorMuon path for MLP weights, reaching the Track 3 target in 3150 steps (sketched after this list). PR #278
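
For a flavor of the test-time-training idea: spend a few gradient steps adapting to each validation sequence before scoring it, then reset the weights. Below is a minimal sketch, not the PR's implementation; `evaluate_with_ttt`, the prefix/suffix split, and the plain-SGD inner loop are all my simplifications.

```python
import copy
import torch
import torch.nn.functional as F

def evaluate_with_ttt(model, val_sequences, lr=1e-4, ttt_steps=2):
    """Adapt to each validation sequence's prefix, score its suffix,
    then restore the original weights before the next sequence."""
    base_state = copy.deepcopy(model.state_dict())
    total_loss, total_tokens = 0.0, 0
    for seq in val_sequences:  # LongTensor of token ids, shape (T,)
        half = seq.numel() // 2
        adapt, score = seq[:half], seq[half:]
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(ttt_steps):
            # Next-token loss on the prefix only.
            logits = model(adapt[:-1].unsqueeze(0))
            loss = F.cross_entropy(logits.flatten(0, 1), adapt[1:])
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
        with torch.no_grad():
            # Score the held-out suffix with the adapted weights.
            logits = model(score[:-1].unsqueeze(0))
            loss = F.cross_entropy(logits.flatten(0, 1), score[1:],
                                   reduction="sum")
        total_loss += loss.item()
        total_tokens += score.numel() - 1
        model.load_state_dict(base_state)  # undo the adaptation
    return total_loss / total_tokens  # mean loss over scored tokens
```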
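The backward transpose record replaces the `.T.contiguous()` materialization with a tiled Triton kernel. Here is a minimal sketch of a standalone tiled transpose; the kernel and wrapper names are mine, and the PR's actual kernel is fused into the cross-entropy backward and tuned beyond this.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def transpose_kernel(x_ptr, y_ptr, M, N,
                     stride_xm, stride_xn, stride_ym, stride_yn,
                     BLOCK: tl.constexpr):
    # Each program transposes one BLOCK x BLOCK tile: y[n, m] = x[m, n].
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK + tl.arange(0, BLOCK)
    rn = pid_n * BLOCK + tl.arange(0, BLOCK)
    tile = tl.load(x_ptr + rm[:, None] * stride_xm + rn[None, :] * stride_xn,
                   mask=(rm[:, None] < M) & (rn[None, :] < N), other=0.0)
    tl.store(y_ptr + rn[:, None] * stride_ym + rm[None, :] * stride_yn,
             tl.trans(tile),
             mask=(rn[:, None] < N) & (rm[None, :] < M))

def transpose(x: torch.Tensor) -> torch.Tensor:
    M, N = x.shape
    y = torch.empty(N, M, device=x.device, dtype=x.dtype)
    BLOCK = 64
    grid = (triton.cdiv(M, BLOCK), triton.cdiv(N, BLOCK))
    transpose_kernel[grid](x, y, M, N,
                           x.stride(0), x.stride(1),
                           y.stride(0), y.stride(1),
                           BLOCK=BLOCK)
    return y
```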
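For the paired-head Muon record, the relevant mechanic is that Muon orthogonalizes each gradient matrix with a Newton-Schulz iteration, so how weights are grouped into matrices changes the update. A sketch of the grouping idea, assuming the fused Q/K gradient is split row-wise; the split axis and group count here are illustrative, while the iteration coefficients are the ones modded-nanogpt uses.

```python
import torch

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration Muon uses to approximately
    # orthogonalize a gradient matrix (modded-nanogpt's coefficients).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def orthogonalize_grouped(qk_grad, num_groups):
    # Rather than orthogonalizing the fused Q/K gradient as one big
    # matrix, split it into smaller per-head-pair groups and run
    # Newton-Schulz on each group independently.
    chunks = qk_grad.chunk(num_groups, dim=0)
    return torch.cat([newton_schulz5(g) for g in chunks], dim=0)
```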
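Finally, a sketch of the Shampoo/SOAP-style idea behind the Track 3 record: maintain EMAs of GGᵀ and GᵀG, periodically refresh their eigenbases, and hand the inner optimizer a gradient rotated into that basis. The class and its hyperparameters are my illustration, not the PR's code.

```python
import torch

class SoapStylePreconditioner:
    """EMAs of G @ G.T and G.T @ G define a rotation (their eigenbases);
    the inner optimizer then updates in the rotated coordinates."""

    def __init__(self, shape, beta=0.95, refresh_every=10, device="cuda"):
        m, n = shape
        self.L = torch.zeros(m, m, device=device)   # EMA of G @ G.T
        self.R = torch.zeros(n, n, device=device)   # EMA of G.T @ G
        self.QL = torch.eye(m, device=device)
        self.QR = torch.eye(n, device=device)
        self.beta, self.refresh_every, self.step = beta, refresh_every, 0

    def rotate(self, grad):
        self.L.lerp_(grad @ grad.T, 1 - self.beta)
        self.R.lerp_(grad.T @ grad, 1 - self.beta)
        if self.step % self.refresh_every == 0:
            # Eigendecompositions are expensive, so refresh periodically.
            self.QL = torch.linalg.eigh(self.L).eigenvectors
            self.QR = torch.linalg.eigh(self.R).eigenvectors
        self.step += 1
        return self.QL.T @ grad @ self.QR

    def unrotate(self, update):
        return self.QL @ update @ self.QR.T
```

The inner optimizer (Adam in SOAP proper; the Contra-NorMuon path in the record, per the PR description) would consume the rotated gradient and pass its update back through `unrotate` before applying it.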
OpenAI Parameter Golf
Parameter Golf is an OpenAI competition to train the best language model, also evaluated on FineWeb, under two constraints: the model must fit in a 16MB artifact, and training plus evaluation must finish within 10 minutes on 8xH100s. The competition drew over 2000 attempts.
- Record 1: LoRA test-time training. A 1.195 BPB record using document-aware sliding-window evaluation plus per-document LoRA test-time training (sketched after this list). PR #77, blog
- Record 2: varlen attention, fused MLP, doc-independent TTT. A 1.07336 BPB record combining variable-length attention (sketched after this list), a fused MLP kernel, grouped small parameters, and faster doc-independent LoRA TTT. PR #1530
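
A sketch of the LoRA test-time-training ingredient: freeze the base weights, attach low-rank adapters, and re-initialize the adapters at each document boundary so only a handful of parameters move per document. `LoRALinear` and `reset_adapters` are illustrative names, not the PR's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank adapter, so
    per-document test-time updates touch very few parameters."""

    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B starts at zero, so the adapter is initially a no-op.
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def reset_adapters(model):
    # Call at each document boundary so every document starts
    # from the unadapted base model.
    for m in model.modules():
        if isinstance(m, LoRALinear):
            nn.init.normal_(m.A, std=0.02)
            nn.init.zeros_(m.B)
```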
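The variable-length attention in record 2 packs documents back-to-back and gives the attention kernel cumulative sequence offsets, so no token attends across a document boundary and no padding is needed. A minimal sketch of building those offsets; `build_cu_seqlens` is my name, and the commented call shows the FlashAttention varlen interface such offsets typically feed.

```python
import torch

def build_cu_seqlens(doc_lens, device="cuda"):
    """Cumulative offsets for packed variable-length attention,
    e.g. doc_lens=[3, 5, 2] -> tensor([0, 3, 8, 10])."""
    lens = torch.tensor(doc_lens, dtype=torch.int32, device=device)
    cu = torch.zeros(len(doc_lens) + 1, dtype=torch.int32, device=device)
    cu[1:] = torch.cumsum(lens, dim=0)
    return cu

# Typical use with FlashAttention's varlen kernel, where q/k/v are
# packed as (total_tokens, n_heads, head_dim):
#   from flash_attn import flash_attn_varlen_func
#   cu = build_cu_seqlens(doc_lens)
#   out = flash_attn_varlen_func(q, k, v, cu, cu,
#                                max(doc_lens), max(doc_lens), causal=True)
```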
Slowrun
Slowrun is a pretraining benchmark where you are given lots of compute, but very limited data (only 100M tokens).
- Record 1: document-level shuffling. Shuffled documents before tokenizing into batches (sketched below), improving the 15-minute record from 3.345 to 3.332 and the 1-hour record from 3.211 to 3.204. PR #76
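
A sketch of the document-level shuffle, assuming an in-memory corpus (`pack_shuffled` and its arguments are illustrative): permute the document order first, then tokenize and chop into fixed-length training sequences, so batches mix documents from across the corpus rather than long runs of adjacent ones.

```python
import random

def pack_shuffled(docs, tokenizer, seq_len, seed=0):
    """Shuffle at the document level, then tokenize and chop the
    resulting stream into fixed-length training sequences."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    stream = []
    for i in order:
        stream.extend(tokenizer.encode(docs[i]))
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
```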