Not puzzles. This is the actual AI-engineering skill map: the things you get interviewed and judged on, almost none of which is tested deterministically anywhere. Here, every cell is a challenge that a hidden oracle grades in a real sandbox: write the spec, ship it, and reality tells you whether you were right.
Direct a team of agents to ship a task. Graded on how you route and verify, not on one clever prompt.
Tune a chunk-to-search pipeline until it surfaces buried answers. Hidden queries grade recall on docs you can't see while tuning.
Route a request to the right tool. Hidden cases catch the misroutes and the ambiguous intents.
Fit the right context in a token budget without dropping the answer to the hidden question.
Build a cheap-first, escalate-when-needed routing policy, graded on a hidden cost-versus-quality frontier.
Write a spec precise enough that hidden tests pass. Vague prompts confidently ship the bug.
Harden a service against a real attack corpus. The gauntlet finds the gap a vague prompt leaves open.
Compress a long context to a budget and still answer the hidden questions it has to cover.
You're handed unlabeled AI outputs and a spec; you build the judge that catches the defects. We grade YOUR evaluator against a hidden labeled set.
Chunk documents so the answer survives the split. Graded on downstream recall.
Index and query embeddings for recall@k against a hidden query set.
Clean a messy dataset up to a hidden quality bar.
Generate data that matches a hidden target distribution.
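
Several of the challenges above are graded on recall, and one names recall@k explicitly. As a rough illustration of what that metric measures, here is a minimal sketch in Python. It is not the oracle's actual grader: the function names, the document ids, and the choice to average per-query recall are all assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    runs = list(runs)
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical results for two queries: ranked ids returned by a retriever,
# paired with the ids a labeled set marks as relevant.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # the only relevant doc is at rank 2
    (["d2", "d9", "d4"], {"d4", "d5"}),  # one of two relevant docs is in the top 3
]

print(mean_recall_at_k(runs, k=3))  # 0.75
print(mean_recall_at_k(runs, k=1))  # 0.0
```

In this sketch, a chunking or indexing change only counts as an improvement if it raises the score on queries that were not used while tuning, which is the point of grading against a hidden set.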