Pretraining speedrun records
On weekends over the last few months, I've set several pretraining records across modded-nanogpt, OpenAI Parameter Golf, and slowrun. The records are kind of all over the place: kernel improvements, data-mix improvements, optimizer preconditioning improvements, and a novel way of using test-time compute to improve perplexity.
All of the record PRs (some with accompanying blog posts) are below:
modded-nanogpt speedrun
The modded-nanogpt speedrun is a competition to train a language model to a validation loss of 3.28 on FineWeb as quickly as possible. The original run took 45 minutes; the current record is 84.4 seconds (mine, at the time of writing).
- Notable run: test-time training. Cut 30 training steps and reached a 95.9s run by adapting to each validation sequence at test time (sketched after this list). PR #205, blog
- Speedrun record 1: backward transpose kernel. Replaced a large `.T.contiguous()` copy in the cross-entropy backward pass with a custom Triton tiled transpose kernel, saving about 0.4s (sketched after this list). PR #240, blog
- Speedrun record 2: paired-head Muon groups. Split the Q/K paired-head weights into smaller Muon groups, improving loss per step enough to remove 10 steps and save about 0.4s (sketched after this list). PR #253
- Track 3 optimization record: SOAP-style MLP preconditioning. Added Shampoo/SOAP-style preconditioning before the usual Contra-NorMuon path for MLP weights, reaching the Track 3 target in 3150 steps (sketched after this list). PR #278
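
For a flavor of the test-time-training idea: spend a few gradient steps adapting to each validation sequence before scoring it, then reset the weights. Below is a minimal sketch, not the PR's implementation; `evaluate_with_ttt`, the prefix/suffix split, and the plain-SGD inner loop are all my simplifications.

```python
import copy
import torch
import torch.nn.functional as F

def evaluate_with_ttt(model, val_sequences, lr=1e-4, ttt_steps=2):
    """Adapt to each validation sequence's prefix, score its suffix,
    then restore the original weights before the next sequence."""
    base_state = copy.deepcopy(model.state_dict())
    total_loss, total_tokens = 0.0, 0
    for seq in val_sequences:  # LongTensor of token ids, shape (T,)
        half = seq.numel() // 2
        adapt, score = seq[:half], seq[half:]
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(ttt_steps):
            # Next-token loss on the prefix only.
            logits = model(adapt[:-1].unsqueeze(0))
            loss = F.cross_entropy(logits.flatten(0, 1), adapt[1:])
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
        with torch.no_grad():
            # Score the held-out suffix with the adapted weights.
            logits = model(score[:-1].unsqueeze(0))
            loss = F.cross_entropy(logits.flatten(0, 1), score[1:],
                                   reduction="sum")
        total_loss += loss.item()
        total_tokens += score.numel() - 1
        model.load_state_dict(base_state)  # undo the adaptation
    return total_loss / total_tokens  # mean loss over scored tokens
```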
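The backward transpose record replaces the `.T.contiguous()` materialization with a tiled Triton kernel. Here is a minimal sketch of a standalone tiled transpose; the kernel and wrapper names are mine, and the PR's actual kernel is fused into the cross-entropy backward and tuned beyond this.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def transpose_kernel(x_ptr, y_ptr, M, N,
                     stride_xm, stride_xn, stride_ym, stride_yn,
                     BLOCK: tl.constexpr):
    # Each program transposes one BLOCK x BLOCK tile: y[n, m] = x[m, n].
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK + tl.arange(0, BLOCK)
    rn = pid_n * BLOCK + tl.arange(0, BLOCK)
    tile = tl.load(x_ptr + rm[:, None] * stride_xm + rn[None, :] * stride_xn,
                   mask=(rm[:, None] < M) & (rn[None, :] < N), other=0.0)
    tl.store(y_ptr + rn[:, None] * stride_ym + rm[None, :] * stride_yn,
             tl.trans(tile),
             mask=(rn[:, None] < N) & (rm[None, :] < M))

def transpose(x: torch.Tensor) -> torch.Tensor:
    M, N = x.shape
    y = torch.empty(N, M, device=x.device, dtype=x.dtype)
    BLOCK = 64
    grid = (triton.cdiv(M, BLOCK), triton.cdiv(N, BLOCK))
    transpose_kernel[grid](x, y, M, N,
                           x.stride(0), x.stride(1),
                           y.stride(0), y.stride(1),
                           BLOCK=BLOCK)
    return y
```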
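For the paired-head Muon record, the relevant mechanic is that Muon orthogonalizes each gradient matrix with a Newton-Schulz iteration, so how weights are grouped into matrices changes the update. A sketch of the grouping idea, assuming the fused Q/K gradient is split row-wise; the split axis and group count here are illustrative, while the iteration coefficients are the ones modded-nanogpt uses.

```python
import torch

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration Muon uses to approximately
    # orthogonalize a gradient matrix (modded-nanogpt's coefficients).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def orthogonalize_grouped(qk_grad, num_groups):
    # Rather than orthogonalizing the fused Q/K gradient as one big
    # matrix, split it into smaller per-head-pair groups and run
    # Newton-Schulz on each group independently.
    chunks = qk_grad.chunk(num_groups, dim=0)
    return torch.cat([newton_schulz5(g) for g in chunks], dim=0)
```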
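Finally, a sketch of the Shampoo/SOAP-style idea behind the Track 3 record: maintain EMAs of GGᵀ and GᵀG, periodically refresh their eigenbases, and hand the inner optimizer a gradient rotated into that basis. The class and its hyperparameters are my illustration, not the PR's code.

```python
import torch

class SoapStylePreconditioner:
    """EMAs of G @ G.T and G.T @ G define a rotation (their eigenbases);
    the inner optimizer then updates in the rotated coordinates."""

    def __init__(self, shape, beta=0.95, refresh_every=10, device="cuda"):
        m, n = shape
        self.L = torch.zeros(m, m, device=device)   # EMA of G @ G.T
        self.R = torch.zeros(n, n, device=device)   # EMA of G.T @ G
        self.QL = torch.eye(m, device=device)
        self.QR = torch.eye(n, device=device)
        self.beta, self.refresh_every, self.step = beta, refresh_every, 0

    def rotate(self, grad):
        self.L.lerp_(grad @ grad.T, 1 - self.beta)
        self.R.lerp_(grad.T @ grad, 1 - self.beta)
        if self.step % self.refresh_every == 0:
            # Eigendecompositions are expensive, so refresh periodically.
            self.QL = torch.linalg.eigh(self.L).eigenvectors
            self.QR = torch.linalg.eigh(self.R).eigenvectors
        self.step += 1
        return self.QL.T @ grad @ self.QR

    def unrotate(self, update):
        return self.QL @ update @ self.QR.T
```

The inner optimizer (Adam in SOAP proper; the Contra-NorMuon path in the record, per the PR description) would consume the rotated gradient and pass its update back through `unrotate` before applying it.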
OpenAI Parameter Golf
Parameter Golf is an OpenAI competition to train the best language model, also evaluated on FineWeb, under two constraints: the model must fit in a 16MB artifact, and training plus evaluation must finish within 10 minutes on 8xH100s. The competition drew over 2000 attempts.
- Record 1: LoRA test-time training. A 1.195 BPB record using document-aware sliding-window evaluation plus per-document LoRA test-time training (sketched after this list). PR #77, blog
- Record 2: varlen attention, fused MLP, doc-independent TTT. A 1.07336 BPB record combining variable-length attention (sketched after this list), a fused MLP kernel, grouped small parameters, and faster doc-independent LoRA TTT. PR #1530
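
A sketch of the LoRA test-time-training ingredient: freeze the base weights, attach low-rank adapters, and re-initialize the adapters at each document boundary so only a handful of parameters move per document. `LoRALinear` and `reset_adapters` are illustrative names, not the PR's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank adapter, so
    per-document test-time updates touch very few parameters."""

    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B starts at zero, so the adapter is initially a no-op.
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def reset_adapters(model):
    # Call at each document boundary so every document starts
    # from the unadapted base model.
    for m in model.modules():
        if isinstance(m, LoRALinear):
            nn.init.normal_(m.A, std=0.02)
            nn.init.zeros_(m.B)
```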
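The variable-length attention in record 2 packs documents back-to-back and gives the attention kernel cumulative sequence offsets, so no token attends across a document boundary and no padding is needed. A minimal sketch of building those offsets; `build_cu_seqlens` is my name, and the commented call shows the FlashAttention varlen interface such offsets typically feed.

```python
import torch

def build_cu_seqlens(doc_lens, device="cuda"):
    """Cumulative offsets for packed variable-length attention,
    e.g. doc_lens=[3, 5, 2] -> tensor([0, 3, 8, 10])."""
    lens = torch.tensor(doc_lens, dtype=torch.int32, device=device)
    cu = torch.zeros(len(doc_lens) + 1, dtype=torch.int32, device=device)
    cu[1:] = torch.cumsum(lens, dim=0)
    return cu

# Typical use with FlashAttention's varlen kernel, where q/k/v are
# packed as (total_tokens, n_heads, head_dim):
#   from flash_attn import flash_attn_varlen_func
#   cu = build_cu_seqlens(doc_lens)
#   out = flash_attn_varlen_func(q, k, v, cu, cu,
#                                max(doc_lens), max(doc_lens), causal=True)
```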
Slowrun
Slowrun is a pretraining benchmark where you are given lots of compute, but very limited data (only 100M tokens).
- Record 1: document-level shuffling. Shuffled documents before tokenizing into batches (sketched below), improving the 15-minute record from 3.345 to 3.332 and the 1-hour record from 3.211 to 3.204. PR #76
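
A sketch of the document-level shuffle, assuming an in-memory corpus (`pack_shuffled` and its arguments are illustrative): permute the document order first, then tokenize and chop into fixed-length training sequences, so batches mix documents from across the corpus rather than long runs of adjacent ones.

```python
import random

def pack_shuffled(docs, tokenizer, seq_len, seed=0):
    """Shuffle at the document level, then tokenize and chop the
    resulting stream into fixed-length training sequences."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    stream = []
    for i in order:
        stream.extend(tokenizer.encode(docs[i]))
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
```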