Pretraining speedrun records

On weekends over the last few months, I've set some pretraining records across modded-nanogpt, OpenAI Parameter Golf, and Slowrun. These records are all over the place: kernel improvements, data-mix improvements, optimizer preconditioning improvements, and a novel way of using test-time compute to improve perplexity.

All of the record PRs (some with blog posts) are below:

modded-nanogpt speedrun

The modded-nanogpt speedrun is a competition to train a language model to a validation loss of 3.28 on FineWeb as quickly as possible. The original time was 45 minutes; the current record is 84.4 seconds (which, at the time of writing, is mine).
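
For concreteness, here's a minimal sketch of the scoring rule as I understand it: wall-clock time until validation loss first reaches the target. The `train_step` and `eval_val_loss` callables are hypothetical stand-ins, not the actual modded-nanogpt harness.

```python
import time

TARGET_VAL_LOSS = 3.28  # the fixed target on the FineWeb validation split

def speedrun(train_step, eval_val_loss, max_steps=100_000, eval_every=100):
    """Return wall-clock seconds until validation loss first hits the target."""
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()
        if step % eval_every == 0 and eval_val_loss() <= TARGET_VAL_LOSS:
            return time.perf_counter() - start  # elapsed seconds = the score
    raise RuntimeError("run ended before reaching the target loss")

# Toy demo: a fake loss that decays geometrically toward the target.
loss = [5.0]
elapsed = speedrun(
    train_step=lambda: loss.__setitem__(0, loss[0] * 0.999),
    eval_val_loss=lambda: loss[0],
)
print(f"toy 'record': {elapsed:.3f}s")
```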

OpenAI Parameter Golf

Parameter Golf is an OpenAI competition to train the best language model whose artifact fits in 16MB and which trains and evaluates within 10 minutes on 8xH100s, again evaluated on FineWeb. The competition drew over 2,000 attempts.
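
A quick back-of-the-envelope for what 16MB buys, assuming the artifact is dominated by raw uncompressed weights and that 16MB means decimal megabytes (both assumptions, not the competition's exact rules):

```python
# Rough parameter budgets implied by a 16MB weight artifact.
# Assumes raw, uncompressed weights and decimal megabytes -- both assumptions.
ARTIFACT_BYTES = 16_000_000

for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8/fp8", 1)]:
    print(f"{dtype}: ~{ARTIFACT_BYTES / bytes_per_param / 1e6:.0f}M parameters")
# fp32: ~4M parameters, fp16/bf16: ~8M, int8/fp8: ~16M
```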

Slowrun

Slowrun is a pretraining benchmark where you are given lots of compute but very limited data (only 100M tokens).
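
With compute abundant and the corpus capped at 100M tokens, any serious run makes many passes over the same data, so the game shifts to squeezing more out of repeated epochs without overfitting. A minimal sketch of that regime, with a hypothetical `train_step` and an illustrative epoch count:

```python
import random

def multi_epoch_train(train_step, token_chunks, epochs=40, seed=0):
    """Train repeatedly over a fixed corpus, reshuffling each epoch.

    `train_step` is a hypothetical stand-in for a real optimizer step, and
    `epochs=40` is illustrative -- neither is a Slowrun rule.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(len(token_chunks)))
        rng.shuffle(order)  # a fresh order on each pass through the data
        for i in order:
            train_step(token_chunks[i])

# Toy usage: 1,000 dummy "chunks", a train_step that just records what it saw.
seen = []
multi_epoch_train(seen.append, token_chunks=list(range(1000)), epochs=3)
print(len(seen))  # 3000 -- each chunk visited exactly once per epoch
```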