Most LLM projects fail in the same way: someone writes a prompt, it looks good on five examples, it ships, and then it breaks on the sixth — except nobody notices for three months because there's no eval.
The fix is unfashionable. Write the eval before you write the prompt.
What "the eval" actually is
An evaluation set is a list of inputs paired with what good looks like. That's it. The format depends on the task:
- For a classifier: input → expected label.
- For a drafter: input → an exemplar draft or a rubric for grading.
- For a multi-step agent: input → a list of tools that should be called and a final acceptable answer.
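One way to pin those three shapes down is a tagged union. This is a sketch, not a standard schema: every type and field name here (`ClassifierCase`, `grading`, `expectedTools`, and so on) is made up for illustration.

```typescript
// Hypothetical shapes for the three task types above; all names are illustrative.
type ClassifierCase = { kind: 'classifier'; input: string; expectedLabel: string }

type DrafterCase = {
  kind: 'drafter'
  input: string
  // Grade against an exemplar draft, or against a list of rubric items.
  grading: { exemplar: string } | { rubric: string[] }
}

type AgentCase = {
  kind: 'agent'
  input: string
  expectedTools: string[] // tools that should be called
  acceptableAnswer: string // a final answer that counts as correct
}

type EvalCase = ClassifierCase | DrafterCase | AgentCase

// Describe what "good" means for a case, switching on the tag.
function describeExpectation(c: EvalCase): string {
  switch (c.kind) {
    case 'classifier':
      return `label = ${c.expectedLabel}`
    case 'drafter':
      return 'rubric' in c.grading
        ? `rubric with ${c.grading.rubric.length} items`
        : 'compare to exemplar'
    case 'agent':
      return `calls ${c.expectedTools.join(', ')} then answers`
  }
}
```

The `kind` tag lets one harness hold mixed case types while the compiler checks that each grader only reads fields its case actually has.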
Aim for 30 to 100 cases. Fewer than 30 and you can't tell signal from noise. More than 100 and you'll never maintain it.
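To see why small sets are noisy, treat each case as an independent pass/fail trial and look at the standard error of the pass rate. This is a rough binomial approximation, and `standardError` is a name invented for this sketch.

```typescript
// Standard error of an observed pass rate over n independent cases: sqrt(p(1-p)/n).
function standardError(passRate: number, n: number): number {
  if (n <= 0) throw new Error('n must be positive')
  return Math.sqrt((passRate * (1 - passRate)) / n)
}

// Uncertainty around a pass rate of 0.85 at a few set sizes.
const at10 = standardError(0.85, 10) // about 0.113
const at30 = standardError(0.85, 30) // about 0.065
const at100 = standardError(0.85, 100) // about 0.036
```

With 10 cases, one case flipping moves the score by 10 points, which is about one standard error, so you can't distinguish a real change from luck. At 100 cases the same band is under four points.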
Mix three kinds of cases:
- Happy paths. The cases your system will see most often.
- Adversarial. Edge cases, ambiguous inputs, and inputs you already know cause failures.
- Out of scope. Cases the agent should refuse to handle.
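A quick sanity check on the set itself can catch a lopsided mix before you score anything. A minimal sketch, assuming each case carries one of the three category labels above; `checkMix` is a hypothetical helper, and the 30 and 100 limits are the guideline from this post, not hard rules.

```typescript
type Category = 'happy' | 'adversarial' | 'out-of-scope'

// Count cases per category and flag problems with the set as a whole.
function checkMix(categories: Category[]): { counts: Record<Category, number>; problems: string[] } {
  const counts: Record<Category, number> = { happy: 0, adversarial: 0, 'out-of-scope': 0 }
  for (const c of categories) counts[c] += 1

  const problems: string[] = []
  if (categories.length < 30) problems.push('fewer than 30 cases')
  if (categories.length > 100) problems.push('more than 100 cases')
  for (const name of Object.keys(counts) as Category[]) {
    if (counts[name] === 0) problems.push(`no ${name} cases`)
  }
  return { counts, problems }
}
```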
How to score
For anything beyond classification, you have three options. Pick the one that matches your tolerance for cost and ceremony.
Human-graded. Cheapest in dollars, expensive in time. Best for the first version. Two reviewers per case. Disagreements get a third reviewer and a discussion.
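The two-reviewers-plus-tiebreak rule is simple enough to encode. A sketch, assuming pass/fail grades; `resolveGrade` and the `tiebreak` callback (standing in for the third reviewer and the discussion) are hypothetical names.

```typescript
type Grade = 'pass' | 'fail'

// Resolve one case from two independent reviews, escalating only on disagreement.
// `tiebreak` stands in for the third reviewer and is not called when the first two agree.
function resolveGrade(
  first: Grade,
  second: Grade,
  tiebreak: () => Grade,
): { grade: Grade; escalated: boolean } {
  if (first === second) return { grade: first, escalated: false }
  return { grade: tiebreak(), escalated: true }
}
```

It's worth counting how often `escalated` comes back true. If reviewers disagree on a large share of cases, the likely culprit is an ambiguous rubric rather than the reviewers.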
LLM-as-judge. Cheap and fast. Risk: the judge model has biases the agent shares. Mitigation: use a stronger model than the agent, write a tight rubric, and validate the judge against a sample of human grades.
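Validating the judge comes down to grading the same sample twice, once by humans and once by the judge, and comparing. A minimal sketch; the `Graded` shape and `judgeAgreement` are assumptions for illustration, and plain agreement is the simplest possible metric, not the only one.

```typescript
type Graded = { id: string; human: boolean; judge: boolean }

// Compare judge verdicts with human verdicts on the same sample of cases.
function judgeAgreement(sample: Graded[]): { agreement: number; falsePasses: string[] } {
  if (sample.length === 0) throw new Error('need at least one graded case')
  const agree = sample.filter((s) => s.human === s.judge).length
  // False passes are the costly direction: the judge approved what a human rejected.
  const falsePasses = sample.filter((s) => s.judge && !s.human).map((s) => s.id)
  return { agreement: agree / sample.length, falsePasses }
}
```

Reading the false passes one by one usually tells you which rubric line to tighten.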
Rule-based. When the answer is structured (JSON, citation present, field extracted), just check it with code. Most reliable. Use this wherever you can.
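For example, if the output must be JSON carrying a citation, the whole grader can be a few lines. A sketch under that assumption; the `citation` field name is hypothetical, so substitute whatever your task requires.

```typescript
// Rule-based check for a structured answer: valid JSON object with a non-empty `citation` string.
function checkCitation(output: string): { ok: boolean; reason: string } {
  let parsed: unknown
  try {
    parsed = JSON.parse(output)
  } catch {
    return { ok: false, reason: 'output is not valid JSON' }
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, reason: 'output is not a JSON object' }
  }
  const citation = (parsed as Record<string, unknown>).citation
  if (typeof citation !== 'string' || citation.trim() === '') {
    return { ok: false, reason: 'missing citation' }
  }
  return { ok: true, reason: 'citation present' }
}
```

Returning a reason alongside the boolean makes failures readable in the report without rerunning anything.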
A concrete starter
Here's the smallest eval harness that does useful work:
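```typescript
type Category = 'happy' | 'adversarial' | 'out-of-scope'
type Case = {
  id: string
  input: string
  expected: string
  category: Category
}
type Verdict = { id: string; pass: boolean; reason: string }
type Judgment = { ok: boolean; reason: string }

// Not defined here. Supply your own: a rule check, an LLM judge, or a human-review queue.
declare function judge(expected: string, actual: string, category: Category): Promise<Judgment>

// Minimal stand-in for a groupBy helper: bucket items by a computed key.
function groupBy<T>(items: T[], key: (item: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {}
  for (const item of items) (groups[key(item)] ??= []).push(item)
  return groups
}

async function score(cases: Case[], run: (input: string) => Promise<string>) {
  // Look up each verdict's category by id without rescanning the case list.
  const categoryOf = new Map(cases.map((c) => [c.id, c.category] as const))
  const verdicts: Verdict[] = []
  for (const c of cases) {
    const out = await run(c.input)
    const result = await judge(c.expected, out, c.category)
    verdicts.push({ id: c.id, pass: result.ok, reason: result.reason })
  }
  return {
    // An empty set scores 0 instead of NaN.
    overall: verdicts.length ? verdicts.filter((v) => v.pass).length / verdicts.length : 0,
    byCategory: groupBy(verdicts, (v) => categoryOf.get(v.id)!),
    verdicts,
  }
}
```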
Two hundred lines later you have an eval pipeline. Run it from a cron job and track the score over time.
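Watching the score over time mostly means comparing the latest run with the one before it. A minimal sketch, assuming you store one record per run; the `Run` shape, `detectRegression`, and the 0.05 default tolerance are placeholders, not recommendations.

```typescript
type Run = { date: string; overall: number }

// Compare the latest run with the previous one and report drops larger than `tolerance`.
// Returns null when there is nothing to compare or the drop is within tolerance.
function detectRegression(history: Run[], tolerance = 0.05): string | null {
  if (history.length < 2) return null
  const prev = history[history.length - 2]!
  const latest = history[history.length - 1]!
  const drop = prev.overall - latest.overall
  return drop > tolerance
    ? `score fell from ${prev.overall} to ${latest.overall} on ${latest.date}`
    : null
}
```

Set the tolerance with your set size in mind: on a small eval, a drop of a few points can be noise.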
Why this changes everything
Once you have a score, three things become possible.
You can refactor without fear. Switch models, change prompts, rewrite the tool definitions — and you know within minutes whether you broke something.
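"Whether you broke something" is a diff between two runs: which cases passed before the change and fail after it. A sketch using the same `Verdict` shape as the harness above; `newlyFailing` is a hypothetical helper.

```typescript
type Verdict = { id: string; pass: boolean; reason: string }

// List cases that passed in the baseline run but fail in the candidate run.
function newlyFailing(baseline: Verdict[], candidate: Verdict[]): Verdict[] {
  const passedBefore = new Set(baseline.filter((v) => v.pass).map((v) => v.id))
  return candidate.filter((v) => !v.pass && passedBefore.has(v.id))
}
```

The overall score can stay flat while individual cases trade places, so the per-case diff is the thing to read after a model or prompt swap.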
You can set a contract. "We will ship when the eval hits 0.85 on happy paths and 0.65 on adversarial." Now the operator knows what they're getting and you know what you're shipping.
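A contract like that is easy to enforce as a gate in CI. A minimal sketch using the thresholds from the example above; `contract` and `failingCategories` are hypothetical names, and treating a category with no results as failing is a design choice of this sketch.

```typescript
type Category = 'happy' | 'adversarial' | 'out-of-scope'

// Thresholds from the example contract; categories without one are not gated.
const contract: Partial<Record<Category, number>> = { happy: 0.85, adversarial: 0.65 }

// Return the categories whose pass rate is below the contracted threshold.
function failingCategories(
  rates: Partial<Record<Category, number>>,
  thresholds: Partial<Record<Category, number>> = contract,
): Category[] {
  return (Object.keys(thresholds) as Category[]).filter((cat) => {
    const rate = rates[cat] ?? 0 // a category with no results counts as failing
    return rate < (thresholds[cat] ?? 0)
  })
}
```

An empty result means ship. Anything else names exactly which part of the contract is unmet.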
You can iterate forever. Every time something breaks in production, you add a case. The eval grows with the system. After a year you have a regression suite that's worth more than the code.
What stops people
Three things, in order of frequency.
"We don't know what good looks like yet." Then build the eval first — that's the discovery. You don't have a project, you have a wish.
"The model will improve and the eval will be obsolete." It won't. Better models score higher on the same cases. Your eval becomes more useful, not less.
"This is a lot of work." It is. So is shipping a broken system and discovering it three months later when a customer complains.
Write the eval first.