The manifesto

Your models don’t have a skill issue. You do.

PromptGolf is a benchmark for the part of AI nobody scores: the spec. You write one. An agent builds exactly that, no more, no less. Then hidden tests grade what you forgot. It’s scored like golf, because golf is honest: the card is a number you can’t argue with.

Everyone benchmarks the models. I’d rather benchmark us.

We spend all day arguing about which model is smartest. MMLU this, SWE-bench that. But I kept watching the same thing happen: a capable agent, handed a vague prompt, ships a vague app. The model did its job. The spec didn’t.

So the interesting variable was never the model. It’s the person at the keyboard. After seeing enough prompts, I figured I really ought to benchmark y’all instead.

The agent builds exactly what you say.

That’s the whole trick, and it’s uncomfortable. When the checkout ships and the totals drift by a cent, when a promo code with a trailing space silently fails, when an out-of-stock item checks out anyway: the agent didn’t hallucinate that. You just never named the rule. The gap is yours.
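
To make those three gaps concrete, here is a minimal sketch of what "naming the rule" could look like in code. Everything in it is an assumption for illustration: the `Line` type, the `PROMO_CODES` and `STOCK` tables, and the round-down discount rule are hypothetical, not PromptGolf's checkout or its hidden tests.

```python
from dataclasses import dataclass


@dataclass
class Line:
    sku: str
    unit_price_cents: int  # named rule: money is integer cents, so totals cannot drift
    quantity: int


# Hypothetical data, for illustration only.
PROMO_CODES = {"SAVE10": 10}     # code -> percent off
STOCK = {"mug": 3, "poster": 0}  # sku -> units available


def checkout_total_cents(lines, promo_raw=""):
    # Named rule: an item cannot check out beyond available stock.
    for line in lines:
        if STOCK.get(line.sku, 0) < line.quantity:
            raise ValueError(f"{line.sku} is out of stock")

    subtotal = sum(line.unit_price_cents * line.quantity for line in lines)

    # Named rule: surrounding whitespace in a promo code is ignored.
    percent = PROMO_CODES.get(promo_raw.strip(), 0)

    # Assumed rule: the discount rounds down to a whole cent.
    discount = subtotal * percent // 100
    return subtotal - discount
```

Each comment marked "named rule" is one sentence a spec either contains or doesn't. Delete the `.strip()` and the trailing-space promo code fails silently again, with no error anywhere.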

Asked to build a car, most people remember the wheels and the engine. A stronger engineer remembers the brakes, the mirrors, the crash safety, the dashboard warning lights, and the quirks of this kind of car. Specs are the same.

[Figure: what "everyone remembers" versus what "the pro remembers"]

The front nine is the part the model already knows. The back nine, the hidden tests, is the part only you can see coming.

Prompting can be taught. Domain knowledge can’t be faked.

Structure is learnable: restate the goal, name the state, enumerate the edge cases, write acceptance criteria a machine can check. I’ll teach you that for free. But no template tells you that free shipping is computed on the pre-discount subtotal, or that a double click can place two orders. That comes from having shipped the thing. Hidden tests are where that knowledge shows up, or doesn’t.
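
Here is a rough sketch of "acceptance criteria a machine can check" for the two rules above. It is written as plain Python assertions rather than browser tests, and it assumes several things the text does not specify: the threshold and fee values, the `shipping_cents` helper, and an idempotency key as the guard against double submission (one common approach, not necessarily the one the hidden tests expect).

```python
FREE_SHIPPING_THRESHOLD_CENTS = 5000  # hypothetical threshold
SHIPPING_FEE_CENTS = 599              # hypothetical flat fee


def shipping_cents(pre_discount_subtotal_cents: int) -> int:
    # Rule from the spec: eligibility is judged on the pre-discount subtotal.
    if pre_discount_subtotal_cents >= FREE_SHIPPING_THRESHOLD_CENTS:
        return 0
    return SHIPPING_FEE_CENTS


class OrderBook:
    """Hypothetical order store that dedupes by idempotency key."""

    def __init__(self):
        self._orders = {}

    def place(self, idempotency_key: str, total_cents: int) -> dict:
        # A repeated submission with the same key returns the first order.
        if idempotency_key not in self._orders:
            order = {"id": len(self._orders) + 1, "total_cents": total_cents}
            self._orders[idempotency_key] = order
        return self._orders[idempotency_key]

    def count(self) -> int:
        return len(self._orders)


def test_free_shipping_uses_pre_discount_subtotal():
    subtotal, discount = 5200, 520
    # The post-discount amount is below the threshold, shipping is still free.
    assert subtotal - discount < FREE_SHIPPING_THRESHOLD_CENTS
    assert shipping_cents(subtotal) == 0


def test_double_click_places_one_order():
    book = OrderBook()
    first = book.place("cart-42", 4680)
    second = book.place("cart-42", 4680)  # the second click
    assert first["id"] == second["id"]
    assert book.count() == 1
```

The point is the shape, not the numbers: each criterion names one rule, sets up the case that breaks a naive build, and states the result a machine can verify.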

Which means the score is defensible. It isn’t measuring vibes or how cool your prompt sounds. It’s measuring coverage: how much of reality your spec accounted for before you ever saw the tests.

Golf is the onboarding. The benchmark is the point.

The arcade is real: the pixel course, the count-up scorecard, the bestiary of bugs you tame one round at a time. That’s there because a number is easier to face when it’s a birdie or a bogey instead of a grade. But under the skin it’s a serious instrument. Every round you play plots a ball on your green, and the shape of that dispersion is the actual asset: a read on how well you steer an AI that builds software.

No mulligans. Every card is signed by Playwright, not by me. Play a hole and find out what your spec forgot.

built by Aditya · a solo indie build