leetcode for AI prompts

Your models don’t have a skill issue. You do.

So start promptmaxxing. PromptGolf benchmarks the spec writer, not the model: write the spec, an agent builds it, hidden tests grade what you forgot.

hole 01 · mini checkout · par 8
+7 · BOGEY
your spec · 1 prompt
spec> build a checkout page with a promo box. make it look clean.

the hidden tests you can’t see at the tee decide your card.

your caddie: read the green first. the holes you can’t see at the tee decide the card.
play it for real →
The thesis

Interviews test the rungs AI already climbed.

Current technical interviews grade rungs 1 to 4, all of which an agent now clears for free. The rung that still separates engineers is the twist: the unknown requirement a vague spec ships wrong. That is the only rung PromptGolf grades.

highest rung reached: avg CS grad 2 · technical interview 4 · an AI agent 5 · no AI / superhumans 7
07 · Prove you understand it · explain the why, not just the what
06 · The twist · the unknown requirement a vague spec ships wrong · PromptGolf grades here
(the agent climbs everything below for free)
05 · Verify it · run it, read the failures, confirm · AI clears
04 · Code the exact implementation · translate intent into working code · AI clears
03 · Know the domain deeply · the quirks only practice teaches · AI clears
02 · Know the basics of the domain · the textbook surface · AI clears
01 · Read the task · parse what was asked · AI clears

The agent's reach and the interview's reach overlap on rungs 1 to 4. An agent reads, codes, and verifies for free. The only rung still above both is the twist, reached by the rare engineers who carry the domain in their head.

It is the score
30 · Functional · the rungs AI clears
55 · Hidden · the twist
15 · Style · the proof
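
One way to read those three weights is as a weighted sum out of 100. The sketch below is an assumption, not PromptGolf's published formula: it supposes each bucket scales linearly with the share of checks passed and that the total rounds half up. The function name `round_score` and the `(passed, total)` input shape are hypothetical. Under those assumptions, the round card shown further down the page (5/5 functional, 8/8 hidden, 9/10 style) works out to 98.5, which rounds to the 99 printed on the card.

```python
from fractions import Fraction

# Bucket weights as listed on the page; they sum to 100.
WEIGHTS = {"functional": 30, "hidden": 55, "style": 15}


def round_score(results):
    """Combine per-bucket (passed, total) pairs into a 0-100 score.

    Assumption: linear scaling per bucket and round-half-up on the total.
    The page does not state the real formula.
    """
    total = Fraction(0)
    for bucket, weight in WEIGHTS.items():
        passed, out_of = results[bucket]
        total += Fraction(weight * passed, out_of)
    # Exact round-half-up for non-negative totals (no float error, no banker's rounding).
    return int(total + Fraction(1, 2))


card = {"functional": (5, 5), "hidden": (8, 8), "style": (9, 10)}
print(round_score(card))  # 99
```

With 55 of the 100 points sitting in the hidden bucket, a spec that passes every visible check but misses every hidden one would top out at 45 under this reading.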
The domain map

Where this format reaches, and where it stops.

The whole format rests on one filter. Pass it and PromptGolf can hide the twist inside the task; fail it and we will not pretend to grade it.

the one filter

Can a hidden oracle grade it deterministically in a sandbox?

graded — the format fits
The twist hides here.
  • Web (Next, Laravel) · LIVE

    Playwright drives the live app

  • Algorithms & data structures · LIVE

    a reference oracle diffs every operation

  • Systems programming (C, memory) · LIVE

    AddressSanitizer grades the memory

  • Cybersecurity · web exploit · LIVE

    a fixed attack corpus fires in a sandbox

  • Cybersecurity · pwn, crypto · ROADMAP

    CTF-style, sandbox-checkable

  • Databases · ROADMAP

    schema & queries graded against expected results
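
To make "a reference oracle diffs every operation" concrete, here is a minimal sketch of what such a check could look like for a data-structure hole. Everything in it is hypothetical: `ReferenceQueue`, `CandidateQueue`, `first_divergence`, and the fixed operation list are illustrations, not PromptGolf's actual harness. The point is the filter itself: a fixed input and a trusted reference give the same verdict on every run.

```python
from collections import deque


class ReferenceQueue:
    """Trusted oracle: a FIFO queue backed by collections.deque."""

    def __init__(self):
        self._items = deque()

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.popleft() if self._items else None


class CandidateQueue:
    """Stand-in for agent-built code, with a planted bug: it pops the newest item."""

    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop() if self._items else None


def first_divergence(candidate, reference, ops):
    """Replay ops on both objects; return the index of the first differing result, or None."""
    for i, (name, *args) in enumerate(ops):
        got = getattr(candidate, name)(*args)
        want = getattr(reference, name)(*args)
        if got != want:
            return i
    return None


# A fixed operation sequence keeps the grade deterministic.
OPS = [("push", 1), ("push", 2), ("pop",), ("pop",), ("pop",)]
print(first_divergence(CandidateQueue(), ReferenceQueue(), OPS))  # 2
```

The candidate agrees with the oracle on both pushes and diverges on the first pop, where FIFO order matters.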

out of scope — the loop breaks
No oracle, so no grade.
  • Machine learning

    outputs are fuzzy and training is slow; no fast deterministic oracle

  • AR / 3D

    judging it needs real hardware and a rendered world

  • OSINT

    the answer lives in the live internet and real people, not a sandbox

  • Digital forensics

    needs disk images and artifacts a quick sandbox cannot stage

  • Quant

    correctness is data-dependent and noisy; hard to pin to a clean pass or fail

These are not weaknesses. They are the edge of the format. When a task has no fast deterministic oracle, hidden tests cannot grade it, and we will not pretend otherwise.

Your dispersion

Every round you play drops a ball on the green.

Distance to the pin is hidden tests missed. Naive specs scatter to the rough; tight specs cluster near the cup. Your handicap is how tight you group.

Shot dispersion · checkout/promo-engine
distance to pin = hidden tests missed
legend: under / at par · over par
your shot: 8 / 8 · to pin: sunk
field median: 5 / 8 · to par: E
Round 01 · Mini Checkout + Promo Code Engine
99 / 100 · Shipped clean · 8 of 8 hidden tests passed
5/5 functional · 8/8 hidden · 9/10 style
Prompts: 1 · Score: 99
verified by Playwright

signed @expert · graded by hidden tests, not vibes

Why this exists

Everyone benchmarks the models. I benchmark the people writing the specs.

I got tired of arguing about which model is smartest. The agent builds exactly the spec you write — so the gap was never the model. It was us. So I built the thing that benchmarks us.

How the arena works

Write the spec. The agent builds it. Hidden tests grade what you shipped.

Anyone can prompt an AI to a passing demo. The hidden tests are where domain knowledge shows — the edge cases and arbitrary rules a vague spec never names. That gap is the score.
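
As an illustration of that gap, the sketch below pits a naive promo function against a more careful one on a handful of checks. None of this is taken from PromptGolf's real test suite: the promo code `SAVE10`, both functions, and all four checks are invented to show the kind of rule a vague spec leaves unnamed.

```python
PROMOS = {"SAVE10": 1000}  # hypothetical promo code: 1000 cents off


def naive_total(subtotal_cents, code):
    """What a vague spec might produce: exact-match lookup, plain subtraction."""
    return subtotal_cents - PROMOS.get(code, 0)


def careful_total(subtotal_cents, code):
    """What a tighter spec asks for: normalised codes, total clamped at zero."""
    discount = PROMOS.get(code.strip().upper(), 0)
    return max(subtotal_cents - discount, 0)


# (description, subtotal, code, expected total); all invented for illustration.
HIDDEN_CHECKS = [
    ("lowercase code still applies", 5000, "save10", 4000),
    ("surrounding spaces are ignored", 5000, " SAVE10 ", 4000),
    ("total never drops below zero", 500, "SAVE10", 0),
    ("unknown code changes nothing", 5000, "BOGUS", 5000),
]


def passed(total_fn):
    """Count how many of the hidden checks a checkout function satisfies."""
    return sum(total_fn(sub, code) == want for _, sub, code, want in HIDDEN_CHECKS)


print(passed(naive_total), passed(careful_total))  # 1 4
```

Both functions would demo fine with a correctly typed code on a large cart; only the unnamed cases separate them.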

Today’s hole · the front nine

Mini Checkout + Promo Code Engine

A familiar checkout brief where vague specs collapse under ecommerce edge cases.

Play this hole