system design · retrieval
The Stacks
One config can't win. A naive pipeline answers almost nothing, and it fails in several different ways, each needing a different fix that you can only find by reading the misses. You're graded on held-out queries you never see while tuning, so a fix has to hold up on questions you didn't tune on, not just the exact queries in front of you. The skill is diagnosing, from the evidence, why a passage was missed, then steering the pipeline to surface it.
The retrieval cockpit
queries × signals grid · live trace · deterministic recall
The instruments an AI engineer actually works in, not a themed game. A grid of every query against raw signals shows the failure pattern at a glance; click a row to trace that retrieval frame by frame and see where the answer falls out. Read the instruments, diagnose, steer the pipeline in plain words, and repeat until every query lands, including the ones you can't see.
move 1
Read the miss
A query came back with the wrong passages. The board shows what the search actually retrieved, where the real answer ranked, and the page it lives on. The UI won't tell you why it failed; that's yours to read off the evidence.
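To make "where the real answer ranked" concrete, here is a minimal sketch of that one instrument. It assumes each retrieval is an ordered list of passage ids and that each query has a single known gold passage; the names `gold_rank` and `is_hit` and the ids like `p42` are hypothetical, not taken from the tool itself.

```python
from typing import Optional, Sequence


def gold_rank(retrieved_ids: Sequence[str], gold_id: str) -> Optional[int]:
    """Return the 1-based rank of the gold passage, or None if it was never retrieved."""
    for rank, passage_id in enumerate(retrieved_ids, start=1):
        if passage_id == gold_id:
            return rank
    return None


def is_hit(retrieved_ids: Sequence[str], gold_id: str, k: int) -> bool:
    """A query counts as a hit when the gold passage appears in the top k results."""
    rank = gold_rank(retrieved_ids, gold_id)
    return rank is not None and rank <= k


# Hypothetical retrieval result for one query (illustrative ids only).
retrieved = ["p17", "p03", "p42", "p08"]
print(gold_rank(retrieved, "p42"))    # 3: retrieved, but ranked below two wrong passages
print(is_hit(retrieved, "p42", k=2))  # False: a miss at k=2
print(gold_rank(retrieved, "p99"))    # None: the passage never surfaced at all
```

The rank alone already separates two kinds of miss: a passage that was retrieved but ranked too low, and one that never surfaced. Those usually point to different fixes, which is why the rank is shown rather than a bare pass/fail.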
move 2
Diagnose, then steer
Work out why this query missed from what you see, then describe it in your own words. Kimi turns your note into a pipeline change. The diagnosis is the skill, not the typing.
move 3
Re-shelve and verify
The pipeline re-runs and recall climbs. Keep going until every query finds its book, then send a patron in to check the queries you can't see, which is where overfitting to the visible ones shows up.
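The held-out check can be sketched as a comparison of recall on the visible queries against recall on the hidden ones. This is an illustration under stated assumptions, not the tool's actual scoring code: the retriever is modeled as a plain function from query text to ranked passage ids, each query has one gold passage, and `recall_at_k`, `generalization_gap`, and the toy `memorizing_retriever` are hypothetical names.

```python
from typing import Callable, Dict, List

# Assumption: a retriever maps a query string to a ranked list of passage ids.
Retriever = Callable[[str], List[str]]


def recall_at_k(retrieve: Retriever, gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold passage appears in the top k results."""
    if not gold:
        return 0.0
    hits = sum(1 for query, gold_id in gold.items() if gold_id in retrieve(query)[:k])
    return hits / len(gold)


def generalization_gap(
    retrieve: Retriever, visible: Dict[str, str], held_out: Dict[str, str], k: int = 5
) -> Dict[str, float]:
    """Compare recall on tuned-against queries with recall on unseen ones."""
    v = recall_at_k(retrieve, visible, k)
    h = recall_at_k(retrieve, held_out, k)
    return {"visible": v, "held_out": h, "gap": v - h}


# A deliberately overfit toy retriever: it has memorized the visible wording.
memorized = {"who wrote dune": ["p1"], "capital of peru": ["p2"]}


def memorizing_retriever(query: str) -> List[str]:
    return memorized.get(query, ["p0"])


visible = {"who wrote dune": "p1", "capital of peru": "p2"}
held_out = {"author of dune": "p1", "peru's capital city": "p2"}

report = generalization_gap(memorizing_retriever, visible, held_out, k=1)
print(report)  # {'visible': 1.0, 'held_out': 0.0, 'gap': 1.0}
```

A large positive gap is the signature of a fix that was fitted to the exact wording of the visible queries rather than to the underlying failure.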
Why one config can't win: a naive pipeline answers almost nothing, and it fails in several different ways that each need a different fix, which you can only tell apart by reading the misses. You're graded on held-out queries you never see while tuning, so a fix has to be real, not memorized to the exact wording. Diagnosing the miss is the skill.