leetcode for AI prompts

Your models don’t have a skill issue. You do.

So start promptmaxxing. PromptGolf benchmarks the spec writer, not the model: write the spec, an agent builds it, hidden tests grade what you forgot.

hole 01 · mini checkout · par 8

+7 BOGEY

your spec · 1 prompt

spec> build a checkout page with a promo box. make it look clean.

← hit it, then change clubs

press swing to grade

the hidden tests you can’t see at the tee decide your card.

caddie: read the green first. the holes you can’t see at the tee decide the card.

play it for real →

The thesis

Interviews test the rungs AI already climbed.

Current technical interviews grade rungs 1 to 4, exactly what an agent now clears for free. The rung that still separates engineers is the twist: the unknown requirement a vague spec ships wrong. That is the only rung PromptGolf grades.

avg CS grad ↑2 · technical interview ↑4 · an AI agent ↑5 · no AI / superhumans ↑7

Prove you understand it

explain the why, not just the what

The twist

the unknown requirement a vague spec ships wrong

PromptGolf grades here

the agent climbs everything below for free

Verify it

run it, read the failures, confirm

AI clears

Code the exact implementation

translate intent into working code

AI clears

Know the domain deeply

the quirks only practice teaches

AI clears

Know the basics of the domain

the textbook surface

AI clears

Read the task

parse what was asked

AI clears

AI clears all of this

The agent's reach and the interview's reach overlap on rungs 1 to 4. An agent reads, codes, and verifies for free. The only rung still above both is the twist, reached by the rare engineers who carry the domain in their head.

It is the score

30 · Functional · the rungs AI clears

55 · Hidden · the twist

15 · Style · the proof
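
As a rough sketch of how the three weights could combine into one number: the snippet below assumes the score is a weighted sum of pass ratios, rounded half up. Neither the formula nor the rounding rule is stated on this page, and `round_score` is a hypothetical name. The inputs are the Round 01 card shown further down (5/5 functional, 8/8 hidden, 9/10 style).

```python
from fractions import Fraction
from math import floor

# Weights from the score breakdown above; they sum to 100.
WEIGHTS = {"functional": 30, "hidden": 55, "style": 15}


def round_score(results: dict[str, tuple[int, int]]) -> int:
    """Weighted sum of pass ratios, rounded half up.

    Assumption: the page does not state the formula or the rounding rule.
    """
    total = Fraction(0)
    for category, weight in WEIGHTS.items():
        passed, possible = results[category]
        total += weight * Fraction(passed, possible)
    # Exact fractions avoid float drift; adding 1/2 then flooring rounds half up.
    return floor(total + Fraction(1, 2))


# Round 01 card: 30 + 55 + 13.5 = 98.5
round_01 = {"functional": (5, 5), "hidden": (8, 8), "style": (9, 10)}
print(round_score(round_01))  # prints 99
```

Under these assumptions the hidden tests carry more weight than the other two categories combined, so a spec that passes every functional check and misses every hidden one still scores under half.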

The domain map

Where this format reaches, and where it stops.

The whole format rests on one filter. Pass it and PromptGolf can hide the twist inside the task; fail it and we will not pretend to grade it.

the one filter

Can a hidden oracle grade it deterministically in a sandbox?

graded — the format fits

The twist hides here.

Web (Next, Laravel) · LIVE
Playwright drives the live app
Algorithms & data structures · LIVE
a reference oracle diffs every operation
Systems programming (C, memory) · LIVE
AddressSanitizer grades the memory
Cybersecurity · web exploit · LIVE
a fixed attack corpus fires in a sandbox
Cybersecurity · pwn, crypto · ROADMAP
CTF-style, sandbox-checkable
Databases · ROADMAP
schema & queries graded against expected results
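
To make "a reference oracle diffs every operation" concrete, here is a minimal sketch of that idea, not PromptGolf's actual grader. It replays a fixed list of operations against a trusted implementation and a candidate, and reports the first step where they disagree. `ReferenceStack`, `AgentStack`, `first_divergence`, and the hidden rule ("pop on an empty stack returns None") are all hypothetical, chosen only for illustration.

```python
class ReferenceStack:
    """Trusted oracle. Hidden rule (hypothetical): pop on empty returns None."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        return self._items.pop() if self._items else None

    def size(self):
        return len(self._items)


class AgentStack(ReferenceStack):
    """Hypothetical agent build from a vague spec: nobody mentioned empty pops."""

    def pop(self):
        return self._items.pop()  # raises IndexError when empty


def first_divergence(candidate, reference, operations):
    """Replay a fixed operation list on both; return the first mismatch or None."""
    for step, (name, args) in enumerate(operations):
        expected = getattr(reference, name)(*args)
        try:
            actual = getattr(candidate, name)(*args)
        except Exception as exc:  # a crash counts as a wrong answer
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            return step, name, expected, actual
    return None


# A fixed corpus keeps the grade deterministic: same input, same verdict.
HIDDEN_OPS = [("push", (1,)), ("size", ()), ("pop", ()), ("pop", ())]

print(first_divergence(AgentStack(), ReferenceStack(), HIDDEN_OPS))
# prints (3, 'pop', None, 'raised IndexError')
```

The operation list is fixed rather than random so that two runs of the same build always get the same card, which is the property the filter above asks for.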

out of scope — the loop breaks

No oracle, so no grade.

Machine learning
outputs are fuzzy and training is slow; no fast deterministic oracle
AR / 3D
judging it needs real hardware and a rendered world
OSINT
the answer lives in the live internet and real people, not a sandbox
Digital forensics
needs disk images and artifacts a quick sandbox cannot stage
Quant
correctness is data-dependent and noisy; hard to pin to a clean pass or fail

These are not weaknesses. They are the edge of the format. When a task has no fast deterministic oracle, hidden tests cannot grade it, and we will not pretend otherwise.

Your dispersion

Every round you play drops a ball on the green.

Distance to the pin is hidden tests missed. Naive specs scatter to the rough; tight specs cluster near the cup. Your handicap is how tight you group.
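
A small sketch of the mapping, under loud assumptions: distance to the pin is taken as hidden tests missed, and population standard deviation stands in for "how tight you group". The page does not define the handicap formula, and the `field_passed` list is made-up illustration data, not real results.

```python
from statistics import median, pstdev

HIDDEN_TOTAL = 8  # hidden tests on this hole


def distance_to_pin(passed: int, total: int = HIDDEN_TOTAL) -> int:
    """Hidden tests missed; 0 means the ball is in the cup."""
    return total - passed


# Made-up rounds for illustration only: hidden tests passed out of 8.
field_passed = [8, 5, 3, 6, 5]
distances = [distance_to_pin(p) for p in field_passed]

print(distances)          # prints [0, 3, 5, 2, 3]
print(median(distances))  # prints 3
# Spread of the group; a smaller value means tighter grouping (assumed measure).
print(pstdev(distances))
```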

Shot dispersion · checkout/promo-engine

distance to pin = hidden tests missed

legend: under / at par · over par

your shot · 8 / 8

to pin · sunk

field median · 5 / 8

to par · E

Round 01 · Mini Checkout + Promo Code Engine

99 / 100 · Shipped clean · 8 of 8 hidden tests passed

5/5 functional

8/8 hidden

9/10 style

Prompts · 1

Score · 99

verified by Playwright

signed @expert · graded by hidden tests, not vibes

Why this exists

Everyone benchmarks the models. I benchmark the people writing the specs.

I got tired of arguing about which model is smartest. The agent builds exactly the spec you write — so the gap was never the model. It was us. So I built the thing that benchmarks us.

built by Aditya · GitHub · the manifesto →

How the arena works

Write the spec. The agent builds it. Hidden tests grade what you shipped.

Anyone can prompt an AI to a passing demo. The hidden tests are where domain knowledge shows — the edge cases and arbitrary rules a vague spec never names. That gap is the score.

The Course · specs

Play the arena.

One spec, an agent builds it, hidden tests grade what you shipped — across UI, algorithms, systems, and security.

Play the course

The skill map · tracks

See what each hole tests.

Every hole maps to a real engineering skill — from cents-math discipline to identity normalization. The map shows the gap you are closing.
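
One way to picture "cents-math discipline", as a hedged sketch rather than the hole's actual hidden rules: keep money in integer cents and state the rounding rule in the spec. `apply_percent_promo` and its round-half-up discount are hypothetical; the real checkout hole may use different rules.

```python
def apply_percent_promo(subtotal_cents: int, percent_off: int) -> int:
    """Return the total in integer cents after a percent promo.

    Assumption: the discount rounds half up to a whole cent.
    """
    if not 0 <= percent_off <= 100:
        raise ValueError("percent_off must be between 0 and 100")
    # Integer math only: +50 before dividing by 100 rounds half up.
    discount_cents = (subtotal_cents * percent_off + 50) // 100
    return subtotal_cents - discount_cents


# Floats drift: a dime plus two dimes is not exactly thirty cents.
print(0.10 + 0.20 == 0.30)  # prints False
# Integer cents do not.
print(10 + 20 == 30)        # prints True

# $19.99 with 15% off: discount is 299.85 cents, rounded to 300.
print(apply_percent_promo(1999, 15))  # prints 1699
```

A spec that says "apply the promo" leaves the rounding direction to the agent; a spec that says "integer cents, discount rounds half up" does not.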

Read the map

Today’s hole · the front nine

Mini Checkout + Promo Code Engine

A familiar checkout brief where vague specs collapse under ecommerce edge cases.

Play this hole