OpenAI Parameter Golf Record: LoRA Test-Time Training

tl;dr: a record (bpb=1.195) in OpenAI's Parameter Golf competition, set using only inference-time compute via a novel per-document LoRA test-time-training technique.

Ablation chart showing BPB improvements from doc-isolation, sliding window, and LoRA TTT

The competition

Parameter Golf is an OpenAI challenge to train the best language model that fits in a 16MB artifact, trains in under 10 minutes on 8xH100s, and evaluates in under 10 minutes. Models are evaluated by compression on the FineWeb validation set, measured in tokenizer-agnostic bits per byte (BPB). Lower is better.
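Concretely, BPB converts summed token-level cross-entropy (in nats) into bits per byte of raw text, which is what makes it comparable across tokenizers. A minimal sketch of the conversion (the function name and example numbers are mine, not the competition harness):

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed token-level cross-entropy (in nats) to bits per byte.

    Tokenizer-agnostic: a coarser tokenizer pays more nats per token but
    each token covers more bytes, so the ratio stays comparable.
    """
    return total_nats / (math.log(2) * total_bytes)

# e.g. a mean loss of 2.0731 nats/token over 100 tokens spanning 250 bytes:
bpb = bits_per_byte(2.0731 * 100, 250)   # ≈ 1.196
```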

My submission was the first to use additional inference-time compute to improve performance. The idea was the same one I'd used in the NanoGPT speedrun: adapt the model to each validation sequence before scoring it (test-time training, or TTT). This time, however, I used LoRA in a (to my knowledge) novel way that makes inference ~5x faster than my previous implementation.

Method

Training is identical to the naive baseline — nothing changes.
Evaluation adds per-document LoRA test-time training:

```python
# Sort by length, batch for efficiency
model.requires_grad_(False)
docs = split_by_bos(val_tokens)
sorted_docs = sorted(docs, key=len)
for batch in batched(sorted_docs, 64):
    lora = init_per_doc_lora()           # B, A = 0
    optimizer.reset()                    # Adam state m, v = 0
    chunks = split(batch, 256)
    for i, chunk in enumerate(chunks):
        logits = model(chunk, lora=lora)
        loss = cross_entropy(logits, tgt)
        bpb += loss.detach()             # record loss before any update
        if i < len(chunks) - 1:          # last chunk: score only, no train
            loss.backward()              # LoRA grads
            optimizer.step()             # Adam
```
[Diagram: the per-batch TTT loop. For each batch of 64 docs split into 256-token chunks, the LoRA adapters and Adam state are reset; then for each chunk i = 0, 1, ..., n-1: forward, store the loss for BPB, backward through the same graph, and an Adam update touching LoRA parameters only. The last chunk is scored but not trained on. Adapters B_qA_q and B_vA_v sit on the attention projections W_q and W_v, and B_lmA_lm on the LM head; everything else (W_k, output projection, MLP, embeddings) stays frozen and shared. Since each loss is recorded before any gradient update on those tokens, the score is a valid P(chunk_i | θ adapted on chunks 0..i-1).]

The TTT eval loop: for each batch of documents, reset LoRA adapters and iterate over chunks. One forward pass computes the loss (stored for BPB), then backprop updates only LoRA adapters. Base weights (gray) stay frozen and shared; only LoRA adapters (coral) are updated per-document.

The critical invariant: loss on each chunk is computed before any gradient update touches those tokens. The model and LoRA adapters reset between documents. This means the dependency graph is identical to standard autoregressive evaluation — token \(i\) in document \(j\) depends only on tokens \(i' < i\) in the same document \(j\). This is very similar to "Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs" and "End-to-End Test-Time Training for Long Context".

Why LoRA?

In the NanoGPT TTT submission, I updated full model weights per-sequence. This was correct but slow: each document's gradient step modifies the shared weights differently, so you can't batch across documents — document 0's adapted \(W\) is different from document 1's. You need a separate copy of the full weight matrix per document, so you're stuck at batch_size=1.

LoRA sidesteps this by factoring each weight update into a small low-rank perturbation: the base weights \(\theta\) stay frozen and shared across the entire batch, while each document only owns its own tiny \(B_i A_i\) matrices (rank 8) on a subset of the layers. The forward pass becomes \(xW + x B_i A_i\) — the first term is a single batched matmul over shared weights, the second is a small per-document matmul. This lets 64 documents adapt in parallel.
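A sketch of what that batched forward can look like in PyTorch. Names, shapes, and the 64-doc/rank-8 sizes are illustrative, not the submission's actual code:

```python
import torch

def batched_lora_linear(x, W, A, B):
    """Shared frozen weight plus per-document LoRA adapters, in one pass.

    x: (n_docs, seq, d_in)   one document per batch row
    W: (d_in, d_out)         frozen, shared by every document
    A: (n_docs, d_in, r)     per-document, trainable
    B: (n_docs, r, d_out)    per-document, trainable (zero-init => delta = 0)
    """
    shared = x @ W                           # one batched matmul over shared W
    delta = torch.bmm(torch.bmm(x, A), B)    # small per-document correction
    return shared + delta

# Illustrative sizes: 64 docs, 256-token chunks, d = 512, rank 8.
x = torch.randn(64, 256, 512)
W = torch.randn(512, 512)                          # frozen base weight
A = torch.randn(64, 512, 8, requires_grad=True)
B = torch.zeros(64, 8, 512, requires_grad=True)    # zero-init B
out = batched_lora_linear(x, W, A, B)              # (64, 256, 512)
```

Because B starts at zero, the first forward reproduces the frozen base model exactly; the per-document outputs only diverge once the optimizer steps the adapters.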

[Diagram: full-weight TTT vs LoRA TTT. With full-weight updates, W_0 = W - lr·∇L_0 and W_1 = W - lr·∇L_1 differ, so each of N documents needs its own copy of W; memory O(N × d²), batch_size = 1. With LoRA, one frozen copy of W is shared and each document owns only its tiny rank-8 B_iA_i, with output_i = xW + xB_iA_i; memory O(N × r × d), batch_size = 64.]

Memory per document drops from a full weight matrix to rank × (d_in + d_out) per target layer. With rank 8, 64 independent adaptation states fit comfortably. Documents are sorted by length and batched for efficiency; a single Adam optimizer (lr=0.01, betas=(0.9, 0.95)) trains all LoRA parameters with one gradient step per chunk. The result is ~5x faster TTT that uses only ~1/10th of the evaluation time budget.
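The arithmetic behind that claim, with hypothetical layer sizes (the competition model's actual dimensions may differ):

```python
# Hypothetical sizes: a 512x512 target layer, rank-8 adapters, 64 docs.
d_in, d_out, rank, n_docs = 512, 512, 8, 64

full_per_doc = d_in * d_out            # a full weight copy per document
lora_per_doc = rank * (d_in + d_out)   # A (r x d_in) + B (d_out x r)

print(full_per_doc // lora_per_doc)    # 32  (32x fewer params per layer)
print(n_docs * lora_per_doc)           # 524288 floats across all 64 docs
```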

Ablations

Most of the improvement comes not from TTT itself but from two other eval-time changes: isolating documents (not conditioning across document boundaries) and using a sliding window.

Condition                           val_loss   val_bpb   Δ bpb
Baseline (cross-doc, flat stream)   2.0731     1.2278    —
+ Doc-isolated                      2.0561     1.2168    -0.0110
+ Stride (chunk=256)                2.0177     1.1941    -0.0337
+ LoRA TTT                          2.0126     1.1910    -0.0368
Ablation chart showing BPB improvements from doc-isolation, sliding window, and LoRA TTT
Ablation of the three eval-time changes. Sliding window gives the largest single gain; LoRA TTT adds on top.

Document isolation alone gets -0.011 BPB: the baseline evaluates across document boundaries, so the model wastes context attending to tokens from an unrelated document. The sliding window adds -0.023 more by giving later tokens more context via overlapping chunks. LoRA TTT then encodes earlier context into the weights, adding -0.003 on top — a small gain, but one that compounds with longer documents.
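The sliding-window idea can be made concrete as a scoring schedule: every token is scored exactly once, but later windows carry overlapping left context that is attended to without being re-scored. A sketch (function name and the window/stride sizes are illustrative; the submission strides in 256-token chunks):

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) triples: tokens [score_from, end) are
    scored conditioned on context [start, end); tokens [start, score_from)
    are context only. Later tokens thus see up to `window` tokens of context
    instead of starting cold at a chunk boundary."""
    out, score_from = [], 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        out.append((max(0, end - window), end, score_from))
        score_from = end
    if score_from < n_tokens:                       # ragged final window
        out.append((max(0, n_tokens - window), n_tokens, score_from))
    return out

schedule = sliding_windows(1000, window=512, stride=256)
# -> [(0, 512, 0), (256, 768, 512), (488, 1000, 768)]
```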

Results

Validated on the full 50k-document FineWeb validation split across 4 seeds. This was submitted as a record at bpb=1.195.

bpb: [1.1927, 1.1935, 1.1921, 1.1929]
mean: 1.1928
std:  0.0005
p-value < 1.195: 0.00234486

For context, the naive baseline sits at bpb=1.2244. This submission improved it by 0.032 BPB with zero changes to training — purely eval-time.

An aside: a cautionary tale of self-improvement loops

This PR planted a very strange seed in the competition. A day or two after my submission, roughly half of the PRs I saw used TTT. This seemed great! Part of my motivation was to push this "inference-as-optimization" approach into the community.

Sadly, many of these follow-ups implemented TTT incorrectly — training on validation tokens before scoring them, which is literally just training on the test set. Funnily enough, these PRs kept citing each other to claim their approaches were valid.

Why is this happening? Why are so many people training on test?

Looking at the PRs, you can tell most are AI-generated. Each AI-generated PR might have a low probability of introducing a cheating bug — say 1%. But the competition has thousands of people attempting it, and only the ones that beat the current record get submitted. So in the limit of an "unbeatable" baseline, every PR claiming it beats the record will be cheating.

Worse, it's self-reinforcing. Once one buggy submission gets through and lands on the leaderboard, it becomes a reference point. Agents building on prior work are now more likely to cite and copy the buggy implementation — it has the best score, after all.

This points to how careful you have to be with any sort of large-scale incremental improvement loop, independent of "LLM-generated cheats" like this one. For example, if you are implementing a kernel-improvement loop, even small inter-run variance could amplify a sub-optimal kernel due to a "lucky" run.

There are three ways you can deal with this problem:

  1. Wait for [insert lab here] to fix your problem: In my experience, models "cheat" less now than they did 6 months ago, and this trend will likely continue.
  2. Make your success criterion more explicit: Add a test verifying no gradients are calculated on future tokens, or even add a deliberate "don't train on test" instruction to the prompt.
  3. Add a human-in-the-loop: People's failure modes tend to be very different from LLMs'. This is roughly how competitions like this one handle it: the PRs are only merged after maintainers who can evaluate the code review them.
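For option 2, the "no training on future tokens" property is cheap to test mechanically: corrupt everything after a split point and assert that the losses recorded before the split are unchanged. A sketch against a generic per-token-loss interface (the function names are mine, not from any PR):

```python
def assert_no_future_leakage(per_token_loss_fn, tokens, split):
    """Score-first invariant: losses recorded for tokens before `split`
    must be identical when every token after `split` is corrupted.
    Any eval loop that trains on a token before scoring it fails this."""
    base = per_token_loss_fn(tokens)[:split]
    corrupted = tokens[:split] + [t + 1 for t in tokens[split:]]
    again = per_token_loss_fn(corrupted)[:split]
    assert base == again, "early losses depend on future tokens: leakage"

# A toy "model" whose loss on token i uses only the prefix passes:
def causal_loss(toks):
    return [float(sum(toks[:i + 1])) for i in range(len(toks))]

assert_no_future_leakage(causal_loss, [1, 2, 3, 4, 5, 6], split=3)
```

A loop that peeks at the whole stream before scoring (flavor 1 below) returns losses that shift when the future is corrupted, so the assertion fires.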

Generally, I think this points to how claims of recursive self-improvement (RSI) are overstated. Yes, these agents are great for very tightly integrated feedback loops, but the problem is not solved.

I wrote a comment on the leakage issue (#402) categorizing three flavors of TTT from least to most legitimate:

  1. Train on val, then score val — literally training on test. Invalid, obviously.
  2. TTT autoregressively on the token stream — score-first within chunks, but adapts across documents. This means more eval data = lower loss in expectation, which breaks the usual semantics of a validation set.
  3. TTT autoregressively per document, independently — score-first, reset between documents. The dependency graph is identical to standard autoregressive eval. This is the approach used in my PR, and in the TTT literature (Bansal et al., Energy-based Transformers).

Next steps

Train-test mismatch. The model is trained conditioning on a stream of concatenated documents, but at eval time we isolate documents and mask cross-document context. This creates a distribution shift — the model has never seen a sequence that starts cold without prior context. The loss spikes visible at the start of each document in the ablation chart above are a direct symptom. Training with document isolation (or at least sometimes resetting context at document boundaries) would close this gap and likely improve BPB further.
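Training with document isolation needs nothing exotic: a causal attention mask that is additionally block-diagonal over documents, so no token attends across a BOS boundary. A PyTorch sketch (the `bos_id` convention and function name are assumptions, chosen to match the eval-time `split_by_bos`):

```python
import torch

def doc_isolated_causal_mask(token_ids, bos_id):
    """True where query token q may attend to key token k: k <= q (causal)
    AND both tokens belong to the same document, with documents delimited
    by `bos_id`."""
    is_bos = torch.tensor([t == bos_id for t in token_ids], dtype=torch.long)
    doc = torch.cumsum(is_bos, dim=0)        # document index per token
    n = len(token_ids)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    same_doc = doc.unsqueeze(0) == doc.unsqueeze(1)
    return causal & same_doc

mask = doc_isolated_causal_mask([9, 1, 2, 9, 3], bos_id=9)
# token 4 may attend to its own doc's BOS (index 3) but not to tokens 0-2
```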

Hyperparameter tuning. The current method is barely optimized. LoRA rank, learning rate, chunk size, which layers get adapters, number of TTT steps per chunk — none of these were seriously searched. I picked reasonable defaults and moved on. A proper sweep would almost certainly find a better operating point.

I didn't pursue either direction because I'm compute-constrained: this project was limited to inference-only research, which is orders of magnitude cheaper than retraining. Both of these improvements require training runs on 8xH100s, which I don't have easy access to.