Seal the Vault
A capable model builds the cryptography for free: an authenticated cipher, a fresh nonce, and even binding the token to its context so alice's token won't open under bob's. That is the floor, not the test — the AI clears it unasked. The test is what the AI gets WRONG by doing the literal thing: the context is an identity, and the model binds it as raw bytes. So a token sealed for one spelling of an identity refuses to open under an equivalent spelling — and quietly locks the real owner out. Knowing that the context is an identity with its own equality rules, and demanding the vault honor them, is the domain knowledge a vague 'encrypt it' never encodes.
You are building a small secrets vault for a web service. The service stores per-user secrets — API keys, recovery codes — and hands each one to a storage layer as an opaque token rather than as cleartext. Your job is to write the spec an AI agent will use to build the module. It exposes two functions: seal(secret, context) takes a secret string and the context the secret belongs to and returns a token; open(token, context) takes a token and a context and returns the original secret. The context identifies the owner of the secret — it is the kind of identity a sign-in system uses, the same identifier the user is known by across the product. The token round-trips: opening what you sealed, under that owner's identity, gives the secret back.
The module is built from your spec and then run in an isolated sandbox. It is handed real secrets to seal, and the tokens it produces are captured as bytes. A fixed corpus of checks then exercises those tokens the way the real service will — secrets handed back to their owner, secrets a different user must never reach, and the same owner arriving the way real users actually arrive. Each check either behaves the way the product needs, or it betrays a gap between what your spec said and what the system has to do. The credit you earn is the count of checks your build gets right.
The builder implements your spec like a competent engineer and makes ordinary, conventional choices for anything you leave unspecified — it will not go out of its way to handle a case you did not raise. So the behaviors that hold are the ones your spec actually pins down. A working encrypt-then-decrypt that keeps each owner's secret to themselves is the floor the model clears on its own. Think harder about the identity in 'belongs to a context': how a real product decides whether two identifiers are the same person, and what your vault must do when the owner comes back looking slightly different than when they left. Name that behavior as an explicit, testable requirement.
seal('sk-live-abc123', 'user:alice') returns an opaque token; open(token, 'user:alice') returns 'sk-live-abc123'. A different user's identity must never open it. But real users do not always arrive with a byte-identical identifier — and what the vault does then is where a build that merely 'encrypts the secret correctly' quietly breaks.
- Export seal(secret, context) returning a token string, and open(token, context) returning the secret.
- The token round-trips for its owner, and a different owner's identity cannot open it (a capable build does both for free, plus tamper-detection and a fresh nonce per seal — that is the expected floor, not the graded part).
- The token must not contain the plaintext secret.
The functional tests are shown, and the model usually clears them on its own. The hidden tests are the twists this kind of system is full of. They are not listed. Your spec only passes them if it already knows where this domain breaks.
301 chars