eval & reliability · build the oracle

Build the Eval

This is the real AI-engineering job: the oracle doesn't exist until you build it. A generic 'is it good?' rubric scores near random, because a capable judge can't know your refund window is 30 days or that you don't offer phone support. The skill is translating a messy spec into a discriminating eval, and you're graded by how well your eval agrees with ground truth, including cases you never saw while building it.

The eval bench

you build the judge · graded vs a hidden labeled set

A support assistant answered a batch of customer questions, and some answers are wrong in ways that matter. Nobody labeled them. You build the eval that catches the bad ones, then we score your eval against the truth, including cases you never saw. This is the interview question almost nobody can practice for, and the one skill a hidden labeled set lets us grade honestly.

move 1
Read against the policy
You get the company's policy and the assistant's answers, but no labels. Some answers contradict the policy, invent a discount, leak data, or dodge the question. Find them by reading.
move 2
Author the eval
Write the judge rubric that flags the bad answers and passes the good ones. The judge sees only your rubric, not the policy, so a generic 'is it good?' rubric won't catch the factual violations. Encode what matters.
move 3
Run, read precision and recall, tighten
Submit and we grade your eval against the hidden labels. Caught 5 of 8? Two false alarms? Tighten and re-run. Verify on held-out cases before you card it.
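The "5 of 8, two false alarms" case can be worked through as a minimal sketch. The function name, the answer ids, and the convention of reporting 0.0 when nothing is flagged are all assumptions for illustration, not the bench's actual grader.

```python
def precision_recall(flagged: set[str], truly_bad: set[str]) -> tuple[float, float]:
    """Score an eval's flags against ground-truth labels."""
    true_positives = len(flagged & truly_bad)
    # Assumed convention: report 0.0 instead of dividing by zero
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_bad) if truly_bad else 0.0
    return precision, recall


# Hypothetical ids: the hidden set holds 8 bad answers, a1..a8
truly_bad = {f"a{i}" for i in range(1, 9)}
# The eval catches 5 of them and raises 2 false alarms (a9 and a10 are fine)
flagged = {"a1", "a2", "a3", "a4", "a5", "a9", "a10"}

precision, recall = precision_recall(flagged, truly_bad)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints "precision=0.714 recall=0.625"
```

Recall is 5/8 because three bad answers slipped through; precision is 5/7 because two of the seven flags were wrong. Tightening the rubric should move both numbers, not just one.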
Why this is the real thing: in the job, the test set doesn't exist until you write it, and a generic rubric is worthless because a capable judge can't know your refund window or your support channels. The skill is turning a messy spec into a discriminating eval, and you're graded on how well it agrees with ground truth on cases you can't see. Flag everything and precision collapses; flag nothing and recall is zero. The eval is the deliverable.
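To make "encode what matters" concrete, here is a rough sketch contrasting a rubric that carries policy facts with one that carries none. It uses the two facts named earlier (a 30-day refund window, no phone support). Everything else is an assumption: `Rule`, `toy_judge`, the regex patterns, and the sample answers are hypothetical, and the keyword matcher is only a stand-in for a real judge model that would read a written rubric.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    violation: re.Pattern  # matches answer text that breaks the policy


# Hypothetical rubric encoding the two policy facts from the text
SPECIFIC_RUBRIC = [
    # Any refund window stated in days that is not exactly 30
    Rule("refund_window", re.compile(r"\b(?!30\b)\d+[- ]day", re.I)),
    # The company offers no phone support, so promising a call is a violation
    Rule("no_phone_support", re.compile(r"\b(call us|phone support)\b", re.I)),
]
# A generic "is it good?" rubric carries no policy facts at all
GENERIC_RUBRIC: list[Rule] = []


def toy_judge(answer: str, rubric: list[Rule]) -> list[str]:
    """Return the names of the rules an answer violates (empty list = pass)."""
    return [rule.name for rule in rubric if rule.violation.search(answer)]


# Invented sample answers: one correct, two that contradict the policy
answers = [
    "You can return it within 30 days for a full refund.",
    "No problem, refunds are available within 60 days.",
    "Please call us and phone support will sort it out.",
]
for text in answers:
    print(toy_judge(text, SPECIFIC_RUBRIC), toy_judge(text, GENERIC_RUBRIC))
```

With the specific rubric the second and third answers are flagged and the first passes; with the generic one nothing is flagged, which is the "flag nothing and recall is zero" failure in miniature.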