LeetCode for AI prompts
So start promptmaxxing. PromptGolf benchmarks the spec writer, not the model: write the spec, an agent builds it, hidden tests grade what you forgot.
The hidden tests you can’t see at the tee decide your card.
Current technical interviews grade rungs 1 to 4, exactly what an agent now clears for free. The rung that still separates engineers is the twist: the unknown requirement a vague spec ships wrong. That is the only rung PromptGolf grades.
The agent's reach and the interview's reach overlap on rungs 1 to 4. An agent reads, codes, and verifies for free. The only rung still above both is the twist, reached by the rare engineers who carry the domain in their head.
The whole format rests on one filter. When a task passes it, PromptGolf can hide the twist inside the task; when a task fails it, we will not pretend to grade that task.
Can a hidden oracle grade it deterministically in a sandbox?
Playwright drives the live app
a reference oracle diffs every operation
AddressSanitizer grades the memory
a fixed attack corpus fires in a sandbox
CTF-style, sandbox-checkable
schema & queries graded against expected results
outputs are fuzzy and training is slow; no fast deterministic oracle
judging it needs real hardware and a rendered world
the answer lives in the live internet and real people, not a sandbox
needs disk images and artifacts a quick sandbox cannot stage
correctness is data-dependent and noisy; hard to pin to a clean pass or fail
These are not weaknesses. They are the edge of the format. When a task has no fast deterministic oracle, hidden tests cannot grade it, and we will not pretend otherwise.
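As a rough illustration of what "a reference oracle diffs every operation" could look like, here is a minimal sketch of differential grading. Everything in it is an assumption made for illustration: `ReferenceStack`, `BuggyStack`, `make_ops`, and `grade` are hypothetical names, and PromptGolf's real graders are not described in this text.

```python
import random


class ReferenceStack:
    """Hypothetical trusted oracle: a thin wrapper over a Python list."""

    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)

    def pop(self):
        # Assumed rule for this sketch: popping an empty stack returns None.
        return self._items.pop() if self._items else None

    def size(self):
        return len(self._items)


class BuggyStack(ReferenceStack):
    """Hypothetical submission built from a spec that never named the empty-pop rule."""

    def pop(self):
        return self._items.pop() if self._items else -1


def make_ops(seed, steps):
    """Build one fixed, reproducible operation sequence from a seed."""
    rng = random.Random(seed)
    ops = [("pop", ())]  # deliberate edge case: pop before any push
    for _ in range(steps):
        name = rng.choice(["push", "pop", "size"])
        ops.append((name, (rng.randint(0, 9),) if name == "push" else ()))
    return ops


def grade(candidate_cls, seed=0, steps=200):
    """Replay the same operations on oracle and candidate.

    Returns the index of the first diverging operation, or None if none diverge.
    """
    oracle, candidate = ReferenceStack(), candidate_cls()
    for index, (name, args) in enumerate(make_ops(seed, steps)):
        if getattr(oracle, name)(*args) != getattr(candidate, name)(*args):
            return index
    return None


print(grade(ReferenceStack))  # None: no divergence
print(grade(BuggyStack))      # 0: fails on the very first operation
```

Because the operation sequence comes from a fixed seed, the same submission always gets the same verdict. That is the property the filter asks for: fast, deterministic, and checkable inside a sandbox.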
Distance to the pin is hidden tests passed. Naive specs scatter to the rough; tight specs cluster near the cup. Your handicap is how tight you group.
signed @expert · graded by hidden tests, not vibes
I got tired of arguing about which model is smartest. The agent builds exactly the spec you write — so the gap was never the model. It was us. So I built the thing that benchmarks us.
Anyone can prompt an AI to a passing demo. The hidden tests are where domain knowledge shows — the edge cases and arbitrary rules a vague spec never names. That gap is the score.
One spec, an agent builds it, hidden tests grade what you shipped — across UI, algorithms, systems, and security.
Play the course
Every hole maps to a real engineering skill, from cents-math discipline to identity normalization. The map shows the gap you are closing.
Read the map
A familiar checkout brief where vague specs collapse under ecommerce edge cases.
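To make "cents-math discipline" concrete, here is a small sketch of the kind of checkout edge case a vague spec tends to miss. The helper names `to_cents` and `checkout_total_cents` are hypothetical, and the half-up rounding rule for discounts is an assumption for illustration, not a rule stated by PromptGolf.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(price: str) -> int:
    """Parse a decimal price string such as "19.99" into integer cents."""
    return int((Decimal(price) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def checkout_total_cents(items, discount_percent=0):
    """Total a cart in integer cents.

    items: list of (price_string, quantity) pairs.
    Assumed rule: the discount is rounded half-up to a whole cent.
    """
    subtotal = sum(to_cents(price) * qty for price, qty in items)
    discount = int(
        (Decimal(subtotal) * discount_percent / 100).quantize(
            Decimal("1"), rounding=ROUND_HALF_UP
        )
    )
    return subtotal - discount


# Binary floats drift on sums that look trivial in decimal.
print(0.1 + 0.2 == 0.3)  # False

# Integer cents keep the same sum exact.
print(checkout_total_cents([("0.10", 1), ("0.20", 1)]))  # 30

# 3 x 19.99 = 5997 cents; 15% off is 899.55, rounded half-up to 900.
print(checkout_total_cents([("19.99", 3)], discount_percent=15))  # 5097
```

A spec that says only "apply a 15% discount" leaves both the number representation and the rounding rule to the agent. A hidden test that checks the exact cent total would expose either choice.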